Choose the Right ETL Technology for Your Business
When it comes to ETL Technologies, there are a variety of tools you can use for your business. Choosing the right one however can be tricky. There are a lot of third party vendors who offer ETL solutions, the two most frequently used are PowerCenter and SSIS (SQL Server Integration Services). I am very well versed with Informatica’s PowerCenter. Recently though, I used Microsoft SSIS for a project I was on. The following compressions are based on what I experienced while learning SSIS and implementing it for my client. If you these two tools have made it on your shortlist then hopefully my comparative discussion will help you narrow it down.
There are basic concepts to ETL development that need to be taken under consideration when finding the right tool for what you are trying to achieve. PowerCenter and SSIS are both similar yet vastly different in their sole capabilities. Here are some of the concepts broken down by how each tool handles them.
In data warehousing, a staging area is a temporary location in which data from different source systems is stored. These temporary staging tables are what your ETL (extract, transform and load) process will pull all the data from. Staging tables are the first essential step to the transformation process since this is where all the business rules come into play and the massaging, cleansing takes please.
SSIS – The Connection Manager is used to generate a connection to the physical location of the data. Whether it is staging tables or a flat file, the Connection Manager will get you to your source. If you have data coming from multiple locations, with SSIS you will have to create multiple connections. Depending on the version of SSIS, you can also create a configuration file that can be updated dynamically during run-time.
PowerCenter- with Informatica the connection process can be handled in a couple different ways. If your data is coming from different locations but the flow of the data from source to target is the same across all sources then you can create one mapping in PowerCenter that encompasses an overall data load into your targets. Once you have create that mapping, you go into Workflow Manager and create workflow tasks that run the standard mapping and define your difference connections within those tasks.
One of the biggest requirements of a data integration project is the capability to develop reusable solutions on one platform and be able to test and deploy it in separate environments. This can only be achieved seamlessly if your ETL tool allows dynamic updates the properties using a parameter file or configuration. SSIS and Informatica both possess this capability but the execution of them is different.
SSIS – Connections can be set by using the Configuration wizard. The wizard saves the configured data into an XML file. SSIS however, only allows you to define a single connection that applies to all tasks in a package.
Informatica – Sessions can have multiple sources and targets. The connections are mostly relational and can be changed by altering the parameters set within a parameter file. Unlike SSIS, you can have multiple connections in one workflow.
Is the reusability of a process, it is the manner in which the structure that makes up the life cycle of the ETL.
SSIS – Solutions are built using the “top-down” approach. Meaning, a general sequence of tasks is defined before determining their purpose of how they are carried out. In order to achieve reusability in SSIS, you would have to create libraries of packages and combined them into one master package. This can sometimes however cause issues if you have multiple connections.
Informatica- everything built in Informatica is built using a “bottom-up” framework. Meaning, the primary components in the process are mappings, mapplets, transformations, sources and targets, these mappings can be utilized by various workflows and worklets in a solution.
A panicle part of making mappings that provide real time or update frequently changing data is to be able to track updates made to your target data. This can be achieved by the use of slowly changing dimensions wizards in both SSIS and Informatica. Modifications to the data can be recorded using this wizard.
SSIS – Slowly changing dimensions in SSIS can be tracked using the wizard as your target. This keeps track of updates and also recognizes if fixed attributes have changed or been modified. You can set the wizard to fail the load if fix attributes have changed.
Informatica – Similar to the SSIS wizard, this wizard allows you to track changes and monitor all updates made to the data.
Package/Mapping run-time diagnostics and Log
When a workflow or package runs, one of the most important pieces of information an ETL developer needs is the error log. If your ETL run fails, the error log allows you to trace back to the cause of the failure. One of the hardest things I went through during my transition into SSIS was reading the error log.
SSIS – Once your package has run, you are able to go into the error log to see the steps it went through. Initially while I was still learning the tool, a lot of my packages failed. I found it very hard to debug the problem using the error message in SSIS because it was mostly the same error for all the failures. I would constantly go back and make changes to transformations without knowing if it was right or wrong. The messages were mostly very generic. I did however learn the hard way that SSIS is very sensitive with its data types. There may be an easier way to break down these error messages, for me though it was quite frustrating.
Informatica – After you have executed your mapping, you can go into the error log in Informatica Monitor and get specific details on each step. I like the error log in Informatica a bit better since it actually tells you what is causing the failure. The error logs in Informatica, in my experience guide you to the problem a lot better than the SSIS logs. Informatica PowerCenter also has a feature called Debugging, with this feature you can set start and end points in your mapping to track where the failure will occur before even running the mapping.
There were some other things that I like about Informatica that I didn’t find in SSIS such as: Versioning, we can make a mapping or workflow open the team and other developers can check in/check out the mapping and make their edits. Traceability, you can select a port and trace it back through a mapping with ease in PowerCenter. Metadata repository, you can search and access different things in a centralized place. Reusability, probably my favorite convenience in Informatica. You can alter sources and targets are easily reusable for other mappings, once you bring in a source and target, you can utilize it for other mappings. Same goes for some expressions.
In conclusion, both tools have their pros and cons. Both tools also can achieve most ETL needs. As a business you need to decide what works best for you. Besides the points I discussed above, there are a multitude of other factors you need to analyze and then chose the best fit for you.