Don’t Drown in the Data Lake: Remember to Data Profile
Do you need to gain a better understanding of more than just your own transactional system information? Are you trying to integrate your transactional information with social media, news feeds, product reviews, or maybe machine-based logs? Do you feel the pressure to deliver analytic solutions more rapidly? If so, then you are traveling down the path of developing a Big Data analytics solution.
Big Data platforms, like Hadoop, facilitate the construction of the data lake: a repository of high-volume, rapidly changing, and minimally integrated data, both structured and unstructured. Data lakes provide value by minimizing the need for up-front data integration, thereby exposing critical information to the business that may not have been accessible before. Data lakes differ from classic data warehouses not only in their ability to manage unstructured information, but also in how loosely they integrate data. Data warehouses require proper data integration through ETL (Extract, Transform, and Load) and data integrity through data modeling, whereas data lakes do not. By diminishing the data modeling and data integration efforts, data lakes promise a quicker time to value by more rapidly exposing data to the analytical user. Big Data analytics still requires data modeling and integration, however; data lakes merely delay this exercise.
To gain true value from a Big Data platform with a data lake, as with any analytics platform, we need to discover actionable intelligence. For example, in a recent solution I helped build for the pharmaceutical industry, we built a data lake on an EMC Isilon and Hadoop platform with clinical study data from the National Institutes of Health (NIH), adverse event data from the Food and Drug Administration (FDA), and a wealth of news articles from various websites. We leveraged Informatica Big Data Edition to acquire and ingest the data into Hadoop to build our data lake. We delivered actionable intelligence, such as displaying the primary suspected drug causing an adverse reaction and new research recommendations. The rise of Big Data platforms and data lakes implies that data will be acquired from more sources and in a greater variety of formats than ever before. Two challenges arose when delivering actionable intelligence from the data lake: (1) data integration, such as merging FDA, NIH, and news article data, and (2) data discovery of insightful patterns, trends, and correlations, such as the leading cause of adverse events for Parkinson’s Disease patients and future research recommendations for the field.
We need to integrate the data as a prerequisite for data discovery and visualization, and to integrate the data effectively, we need to understand it. Data diversity, as found in Big Data, elevates the importance of understanding, or in other words profiling, the data. Continuing our example, if we cannot assign every relevant adverse event, clinical study, and news article to Parkinson’s Disease, we cannot begin to discover insightful correlations, patterns, and trends, because we won’t be reviewing a full data set.
Profiling the data involves assessing it for completeness, conformity, and accuracy. In our use case, we needed to ensure that we could consistently bring together data about Parkinson’s Disease from the NIH, the FDA, and the various website sources. These sources refer to Parkinson’s Disease by different names, such as “Parkinson’s”, “Parkinsonism”, and “PD”. These names do not match and will not integrate naturally, so we needed to standardize the disease name across the data from the NIH, FDA, and website articles. Profiling the data lake enabled us to discover the data rules we needed to build to prepare the data for discovery and visualization. Profiling data on a Hadoop data lake by hand would require writing numerous complicated MapReduce programs and a great deal of time. Instead, we quickly profiled the data on Hadoop using Informatica Big Data Edition (BDE), the same tool we used to prepare and integrate the data for later discovery in our visualization tool. Informatica BDE data profiling helped us find the best columns to integrate and the actions we needed to take to deliver the most accurate analysis for our visualizations.
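To make the idea of profiling and standardizing a disease-name column concrete, here is a minimal Python sketch. It is not how Informatica BDE works and is not the code from the project; it only illustrates the completeness and conformity checks described above on a small in-memory list. The `CANONICAL` mapping, the function names, and the sample values are all assumptions made for illustration, and treating “Parkinsonism” as a synonym simply follows the example names given in this article.

```python
from collections import Counter

# Hypothetical lookup of known name variants (lowercased) to one
# standard disease name. The entries are illustrative assumptions.
CANONICAL = {
    "parkinson's": "Parkinson's Disease",
    "parkinson's disease": "Parkinson's Disease",
    "parkinsonism": "Parkinson's Disease",
    "pd": "Parkinson's Disease",
}


def normalize(raw):
    """Trim, lowercase, and straighten apostrophes; None if missing/blank."""
    if raw is None:
        return None
    cleaned = raw.strip().lower().replace("\u2019", "'")
    return cleaned or None


def profile_column(values):
    """Report completeness, conformity, and unmapped values for a column."""
    keys = [normalize(v) for v in values]
    present = [k for k in keys if k is not None]
    conforming = [k for k in present if k in CANONICAL]
    return {
        "total": len(values),
        # Share of rows that have any value at all.
        "completeness": len(present) / len(values) if values else 0.0,
        # Share of non-missing rows that match a known variant.
        "conformity": len(conforming) / len(present) if present else 0.0,
        # Values that need a new standardization rule.
        "unmapped": Counter(k for k in present if k not in CANONICAL),
    }


def standardize(raw):
    """Map a raw name to its standard form; leave unknown values unchanged."""
    return CANONICAL.get(normalize(raw), raw)


if __name__ == "__main__":
    # Made-up sample values standing in for a disease-name column.
    sample = ["Parkinson\u2019s", "PD", " parkinsonism ", None, "", "Alzheimer's"]
    print(profile_column(sample))
    print([standardize(v) for v in sample])
```

The `unmapped` counter is the useful part in practice: it surfaces the values that no rule covers yet, which is how profiling leads to the data rules needed before integration.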