Wednesday, May 17, 2006

Open Source Catalogue: ETL
Posted 5/9/2006 by Alex Fletcher (Technology Analyst)


Extraction, Transformation and Loading (ETL) software does exactly what its name implies: it extracts data from its source location(s), transforms it, and loads it into its target location(s). What makes good ETL software is the ease with which it lets you set parameters for that process, monitor its status, and gracefully handle any runtime errors that occur. The number and type of supported data sources is another important factor when evaluating software in this arena, e.g. does it support Oracle databases as well as flat XML files? Until recently the availability of open source tools in this arena was rather sparse, but lately the space has received an infusion of interest, not only in using the existing tools but in developing new tools and functionality.
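To make that extract/transform/load loop concrete, here is a minimal sketch in plain Java and JDBC. The connection URLs, table names, and columns are all hypothetical, and a real ETL tool would layer on the parameterization, monitoring and error handling described above:

    import java.sql.*;

    // Minimal sketch of the extract/transform/load cycle over JDBC.
    // All URLs, tables, and columns are hypothetical; the appropriate
    // JDBC drivers are assumed to be on the classpath.
    public class SimpleEtl {
        public static void main(String[] args) throws SQLException {
            try (Connection src = DriverManager.getConnection("jdbc:oracle:thin:@srchost:1521:ORCL", "user", "pass");
                 Connection tgt = DriverManager.getConnection("jdbc:postgresql://tgthost/dw", "user", "pass");
                 Statement extract = src.createStatement();
                 ResultSet rs = extract.executeQuery("SELECT id, name FROM customers");          // extract
                 PreparedStatement load = tgt.prepareStatement(
                         "INSERT INTO dim_customer (id, name) VALUES (?, ?)")) {
                while (rs.next()) {
                    String name = rs.getString("name");
                    load.setInt(1, rs.getInt("id"));
                    load.setString(2, name == null ? null : name.trim().toUpperCase());          // transform
                    load.executeUpdate();                                                        // load
                }
            }
        }
    }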

1. Kettle - At the beginning of December 2005, Kettle 2.2 was released publicly under the LGPL license, and it has since proven itself a remarkably complete open source solution for ETL needs. Kettle is 100% metadata based, requiring no code generation in order to run. Metadata-driven ETL tools are worth their weight in gold because fully managing and controlling the tool never requires code changes. Since Kettle was originally built to create and populate data warehouses, both junk and slowly changing dimensions are supported. An extensible plug-in mechanism makes it possible to define any number of complex data extractions and transformations. Barely four months after it was released into the wild, it was announced that Kettle had been merged into the Pentaho Business Intelligence (BI) platform. With Kettle being the leading open source ETL offering, the move made sense for Pentaho, which has its sights set on building a comprehensive BI platform out of open source components. Kettle's development community should benefit greatly from the added visibility of being associated with the oft-profiled Pentaho, and in the near future the project should continue to expand: better tool support, performance upgrades, more documentation, and so on.
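To illustrate what "metadata driven" buys you, here is a toy Java sketch of the idea: the pipeline is just a list of step names interpreted at runtime against a registry, so changing the transformation means editing metadata rather than recompiling code. The step names and registry below are invented for illustration and bear no relation to Kettle's actual repository format:

    import java.util.*;
    import java.util.function.UnaryOperator;

    // Toy illustration of a metadata-driven engine: the engine knows a
    // fixed set of step types, while the pipeline itself is pure data.
    public class MetadataDrivenPipeline {
        public static void main(String[] args) {
            // Step registry: the capabilities the engine ships with.
            Map<String, UnaryOperator<String>> registry = new HashMap<>();
            registry.put("trim", String::trim);
            registry.put("uppercase", String::toUpperCase);

            // The pipeline definition; in a real tool this would be
            // loaded from an XML file or a metadata repository.
            List<String> steps = Arrays.asList("trim", "uppercase");

            String value = "  hello, etl  ";
            for (String step : steps) {
                value = registry.get(step).apply(value);  // interpreted at runtime, no codegen
            }
            System.out.println(value);  // prints: HELLO, ETL
        }
    }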

2. CloverETL - a Java-based data transformation framework for structured data, capable of running as a standalone application or of being embedded in another application. Among its features: it handles any database for which a JDBC driver is available, uses XML-based transformation graphs and metadata descriptions of records, supports NULL values, and can run on multiple CPUs using a strategy called pipeline parallelism. Distributed under the LGPL license, its latest version (1.8.2) was just released last Wednesday (05/03/2006). Out of the box CloverETL supports four data types: String, Numeric, Date, and Bytes. Architecturally, it is broken into logical units called transformation units, each of which encapsulates a piece of transformation functionality and intelligence and can be used as a standalone component in other applications or services. When the framework is running, each component executes as a separate thread, creating a more fail-safe environment that is less affected when a single operation fails (see the sketch below).
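The pipeline-parallelism point is worth a quick sketch. Below is a minimal Java example of the general pattern, not CloverETL's actual API: each stage runs in its own thread and hands records downstream through a queue, so the stages overlap in time and a failure is contained to a single thread:

    import java.util.concurrent.*;

    // General pipeline-parallelism pattern: reader, transformer, and
    // writer each run in their own thread, connected by queues.
    public class PipelineParallelism {
        private static final String EOF = "<EOF>";  // sentinel marking end of data

        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<String> readToTransform = new LinkedBlockingQueue<>();
            BlockingQueue<String> transformToWrite = new LinkedBlockingQueue<>();

            Thread reader = new Thread(() -> {
                for (String rec : new String[] {"alpha", "beta", "gamma"}) {
                    readToTransform.add(rec);                         // "read" stage
                }
                readToTransform.add(EOF);
            });

            Thread transformer = new Thread(() -> {
                try {
                    String rec;
                    while (!(rec = readToTransform.take()).equals(EOF)) {
                        transformToWrite.add(rec.toUpperCase());      // "transform" stage
                    }
                    transformToWrite.add(EOF);
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });

            Thread writer = new Thread(() -> {
                try {
                    String rec;
                    while (!(rec = transformToWrite.take()).equals(EOF)) {
                        System.out.println(rec);                      // "write" stage
                    }
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });

            reader.start(); transformer.start(); writer.start();
            reader.join(); transformer.join(); writer.join();
        }
    }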

3. Octopus - an Enhydra ETL tool for JDBC data transformations. Since Octopus only supports data sources that come with a JDBC driver, it includes special drivers that enable connectivity to CSV, XML, MS-SQL and property files. Octopus uses XML loadjob files to define the parameters of a given transformation. Its main limitation is that same JDBC requirement: a data source simply cannot be accessed unless a JDBC driver exists for it. On the other hand, it remains a powerful tool capable of (among other things) normalizing data and creating artificial keys, tables, and primary keys. Plus, all ETL jobs run through Octopus are database vendor independent.
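The JDBC-only design cuts both ways, as a short sketch shows: once a driver exists, every source looks identical and only the connection URL changes. The jdbc:csv URL below is a hypothetical stand-in for the kind of special driver Octopus bundles:

    import java.sql.*;

    // Why JDBC-only is both a limit and a strength: the same job works
    // against any source a driver exists for. URLs and tables here are
    // hypothetical stand-ins.
    public class JdbcOnlySource {
        static void copyRows(String sourceUrl, String targetUrl) throws SQLException {
            try (Connection src = DriverManager.getConnection(sourceUrl);
                 Connection tgt = DriverManager.getConnection(targetUrl);
                 Statement stmt = src.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT id, name FROM contacts");
                 PreparedStatement ins = tgt.prepareStatement(
                         "INSERT INTO contacts (id, name) VALUES (?, ?)")) {
                while (rs.next()) {
                    ins.setInt(1, rs.getInt(1));
                    ins.setString(2, rs.getString(2));
                    ins.executeUpdate();
                }
            }
        }

        public static void main(String[] args) throws SQLException {
            // A CSV directory exposed through a (hypothetical) JDBC driver,
            // loaded into an ordinary relational target; swapping the first
            // URL for an Oracle or MS-SQL one changes nothing else.
            copyRows("jdbc:csv:/data/export", "jdbc:postgresql://tgthost/dw");
        }
    }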

4. KETL - an ETL tool by Kinetic Networks that includes job scheduling and alerting capabilities. KETL is a Java-based integration platform and, like CloverETL, a multi-threaded server that manages various job executors. Jobs are defined using an XML definition language. The heart of KETL is its Parallel Java Kernel, which ships with at least four standard executors (XML, SQL, operating system, and Log Sessionizer) while allowing any number of additional executors to be plugged in. Other components within the kernel handle duties such as metadata access, scheduling, routing, error handling, profiling and resource pooling. KETL is also capable of operating in a clustered environment, where a set of KETL servers pass jobs to a configurable number of available executors. Output can be directed to an alert mechanism as well as a management console.
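The executor-routing idea can be sketched in a few lines of Java. This is only the general dispatch pattern, with an invented Job type and registry; KETL's real kernel adds the scheduling, metadata, clustering and alerting duties described above:

    import java.util.*;
    import java.util.concurrent.*;

    // General sketch of kernel-style job routing: a registry of named
    // executors, and a thread pool that runs each submitted job on the
    // executor matching its type. All names here are invented.
    public class MiniKernel {
        record Job(String type, String payload) {}   // e.g. parsed from an XML job definition

        interface JobExecutor { void execute(Job job); }

        private final Map<String, JobExecutor> executors = new HashMap<>();
        private final ExecutorService pool = Executors.newFixedThreadPool(4);

        void register(String type, JobExecutor e) { executors.put(type, e); }

        void submit(Job job) {
            JobExecutor e = executors.get(job.type());
            if (e == null) throw new IllegalArgumentException("No executor for " + job.type());
            pool.submit(() -> e.execute(job));       // route to the matching executor
        }

        public static void main(String[] args) {
            MiniKernel kernel = new MiniKernel();
            kernel.register("SQL", job -> System.out.println("running SQL: " + job.payload()));
            kernel.register("OS",  job -> System.out.println("running command: " + job.payload()));
            kernel.submit(new Job("SQL", "SELECT COUNT(*) FROM fact_sales"));
            kernel.submit(new Job("OS",  "gzip /tmp/extract.csv"));
            kernel.pool.shutdown();
        }
    }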
