I have all the data…how do I find it? A case for Enterprise Data Catalog

The era of Big Data has been building over the last 10 years. Many enterprises have deployed a Hadoop distribution and loaded as much data as possible into it, hoping to solve business problems and drive growth. “Data is the new currency” is what we all hear, so we hoard all the data we can get our hands on.

The rush to store as much data as possible has brought about several issues, and the term “Data Swamp” has emerged to describe the result. In the race to deploy new technology to store data, governance and organization have taken a back seat. An enterprise now has data everywhere, and when a business need calls for deeper insight, an analyst or data scientist has to rely on employees with institutional knowledge who know where to find things, or on manual searching, to find the information they need. If you are new to an organization, the problem is compounded, since you don’t even know where to start.

People have realized that valuable data sprawled across uncatalogued Hadoop clusters, data warehouses, data marts, and other repositories is a very difficult currency to cash in. A trend that has emerged in response is the data catalog. People either build one on their own or use a software tool to capture the information comprehensively. I have seen some interesting catalog builds on graph databases that capture not only each data asset but also its relationships to other data assets, so you can quickly and easily locate what you need. It’s very clever, but not easily accessible without specialized skills.
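To make the graph idea concrete, here is a minimal sketch of a home-grown catalog in plain Python. It is an illustration only: the class name `CatalogGraph`, the asset names, and the `max_hops` parameter are all hypothetical, and a real build would sit on an actual graph database rather than in-memory dictionaries.

```python
from collections import deque


class CatalogGraph:
    """Toy data catalog: assets are nodes, relationships are undirected edges."""

    def __init__(self):
        self.assets = {}  # asset name -> descriptive metadata
        self.edges = {}   # asset name -> set of directly related asset names

    def add_asset(self, name, location, description=""):
        self.assets[name] = {"location": location, "description": description}
        self.edges.setdefault(name, set())

    def relate(self, a, b):
        # Both assets must be registered before they can be linked.
        for name in (a, b):
            if name not in self.assets:
                raise KeyError(f"unknown asset: {name}")
        self.edges[a].add(b)
        self.edges[b].add(a)

    def related(self, start, max_hops=2):
        """Return {asset: hop count} for assets within max_hops of start."""
        if start not in self.assets:
            raise KeyError(f"unknown asset: {start}")
        hops = {start: 0}
        queue = deque([start])
        while queue:  # breadth-first search outward from the starting asset
            current = queue.popleft()
            if hops[current] == max_hops:
                continue
            for neighbor in self.edges[current]:
                if neighbor not in hops:
                    hops[neighbor] = hops[current] + 1
                    queue.append(neighbor)
        del hops[start]
        return hops


# Hypothetical assets spread across different repositories.
catalog = CatalogGraph()
catalog.add_asset("claims_raw", "hadoop", "Raw claims feed")
catalog.add_asset("claims_mart", "data mart", "Curated claims")
catalog.add_asset("member_dim", "data warehouse", "Member dimension")
catalog.add_asset("eligibility_feed", "hadoop", "Eligibility feed")
catalog.relate("claims_raw", "claims_mart")
catalog.relate("claims_mart", "member_dim")
catalog.relate("member_dim", "eligibility_feed")

nearby = catalog.related("claims_raw", max_hops=2)
```

Even this toy version shows the catch: someone has to write and maintain the traversal logic, which is exactly the specialized skill most analysts do not have.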

We have been using Informatica’s Enterprise Data Catalog (EDC) over the last year, and the product has really evolved. It delivers these capabilities in a clear user interface, eliminating the need to build a custom-coded solution. The technical features are nice, and it enables an analyst or a data scientist to get the information they need independently, without having to rely on many other people.

A few of the capabilities it provides the analyst or data scientist include:

Reuse of data assets – When many analysts are creating analytical data sets, data is often duplicated unnecessarily, usually because no one knows the information is already available. EDC lets someone navigate data sets through lineage or metadata to find assets that are already built. Not having to rebuild data assets is a huge time saver.
Fit for Use – EDC scans and profiles data resources and provides that information to the user. The profiling information lets the analyst judge whether the data is fit for purpose without first having to query the data tables, which is a time saver.
Data Domains – EDC has built-in Artificial Intelligence that can classify data columns into data domains such as BirthDate, FirstName, SSN, Address, and many others. The Informatica product comes with many data domains out of the box, and EDC also allows you to create custom data domains based on your business. The real value comes when you can visually navigate related data assets that share the same data domains. This allows the analyst to find all data with related data domains and uncover associations that would not otherwise have been obvious.
Business Glossary – EDC integrates with Axon and lets you link business glossary terms to data fields. This adds an extra layer of context and field definitions to the catalog, making it easier for the analyst to gain a deeper understanding of the data.
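For readers who want a feel for what profiling and data domain tagging involve, the rough sketch below does both with simple rules in Python. It is not how EDC works internally and does not use any Informatica API: the regex patterns, the 0.8 match threshold, and the table and column names are all assumptions made for illustration, and a date pattern alone cannot tell a birth date from any other date.

```python
import re
from collections import defaultdict

# Hypothetical rule-based domains; a real catalog uses far richer detection.
DOMAIN_PATTERNS = {
    "SSN": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "BirthDate": re.compile(r"^\d{4}-\d{2}-\d{2}$"),  # naive: any ISO-style date
}


def profile_column(values):
    """Basic profile: row count, share of missing values, distinct count."""
    total = len(values)
    present = [v for v in values if v not in (None, "")]
    return {
        "rows": total,
        "null_rate": (total - len(present)) / total if total else 0.0,
        "distinct": len(set(present)),
    }


def classify_column(values, threshold=0.8):
    """Return the first domain matched by at least `threshold` of the values."""
    present = [str(v) for v in values if v not in (None, "")]
    if not present:
        return None
    for domain, pattern in DOMAIN_PATTERNS.items():
        matches = sum(1 for v in present if pattern.match(v))
        if matches / len(present) >= threshold:
            return domain
    return None


def columns_by_domain(tables):
    """Index every classified column by domain so related assets line up."""
    index = defaultdict(list)
    for table, columns in tables.items():
        for column, values in columns.items():
            domain = classify_column(values)
            if domain:
                index[domain].append(f"{table}.{column}")
    return dict(index)


# Made-up sample data standing in for two scanned tables.
tables = {
    "members": {
        "ssn": ["123-45-6789", "987-65-4321", None],
        "dob": ["1980-01-02", "1975-11-30", "1990-06-15"],
    },
    "claims": {
        "patient_ssn": ["123-45-6789", "123-45-6789"],
        "amount": ["10.50", "200.00"],
    },
}

ssn_profile = profile_column(tables["members"]["ssn"])
domain_index = columns_by_domain(tables)
```

Here `domain_index` groups `members.ssn` and `claims.patient_ssn` under the same domain even though the column names differ, which is the kind of association the Data Domains capability surfaces visually.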

EDC is a satisfactory answer to a growing problem. CTI has developed a demonstration to showcase EDC at the HIMSS18 Conference in Las Vegas, March 5-9, 2018, at the Sands Expo Center. Please join us or contact us for an appointment to see an example of how a data catalog can create clarity and put that data currency to work on a real-world problem.