Metadata: the elephant in the data integration room
Cloud might be making life easier and cheaper for application users, but for developers who have to cope with the intricate and sometimes messy world of data integration, it is introducing new complexities. With Cloud applications, which can be distributed across diverse client and server environments and end-user scenarios, it is impossible to predict what types of data sets people will want to upload and analyse. These data integration ‘known unknowns’ call for new approaches, but fortunately open source developer tools are available to ease the burden.

A key step in any data integration effort is defining metadata, which most developers understand as “data about data”. Serious data geeks will know about metadata’s four levels of abstraction: data, metadata, meta-models and the Meta-Object Facility (MOF). These levels of abstraction make it easier for people working with data or metadata to describe a piece of data to others, or to a piece of software.

The ETL (Extract, Transform and Load) process has its own levels of abstraction, the first being hand-written ETL software. Developers still do a lot of ETL work manually, reasoning that it is a one-off, inexpensive process. In reality it requires regular maintenance and becomes expensive because of the ever-changing IT landscapes in which applications live. This reality gave rise to a second abstraction level: ETL tools that automatically generate code based on pre-defined parameters, or ‘ETL metadata’. This level involves much less coding, but still requires compilation, packaging and deployment, hence the third level of ETL abstraction. The third level requires no code, compilation, packaging or deployment, and consists of two separate components: ETL metadata, which describes the workload, and an engine to execute it. ETL tools like the Pentaho Kettle open source project I founded have transformed the way companies perform data integration.
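The split at the third level — metadata describing the workload, and a generic engine executing it — can be sketched in plain Java. This is a toy illustration of the principle, not Kettle's actual transformation format or API:

```java
import java.util.*;
import java.util.function.Function;

// Toy illustration of level-three ETL abstraction: the workload is
// described purely as metadata (an ordered list of step names), and a
// generic engine interprets that metadata at run time. No pipeline
// code is compiled, packaged or deployed.
public class MetadataDrivenEtl {

    // The engine's library of known transformation steps. A real tool
    // would load the workload description from an XML/JSON definition.
    static final Map<String, Function<String, String>> STEP_LIBRARY = Map.of(
            "trim",      String::trim,
            "uppercase", String::toUpperCase,
            "reverse",   s -> new StringBuilder(s).reverse().toString());

    // The engine: walks the metadata and applies each step to every row.
    static List<String> run(List<String> stepNames, List<String> rows) {
        List<String> out = new ArrayList<>();
        for (String row : rows) {
            String value = row;
            for (String step : stepNames) {
                Function<String, String> f = STEP_LIBRARY.get(step);
                if (f == null) throw new IllegalArgumentException("Unknown step: " + step);
                value = f.apply(value);
            }
            out.add(value);
        }
        return out;
    }

    public static void main(String[] args) {
        // Changing the workload means editing metadata, not recompiling code.
        List<String> meta = List.of("trim", "uppercase");
        System.out.println(run(meta, List.of("  alice ", "bob"))); // [ALICE, BOB]
    }
}
```

Note that changing the workload — reordering steps, adding a `reverse` — touches only the metadata list, which is exactly what makes this level maintainable.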
Low-level coding can now be replaced by graphical user interfaces in which ETL metadata is entered. Low-level programming APIs have been replaced by visual tools, including debuggers, which can be managed centrally, making it easy for developers to maintain and monitor their various workloads. However, this is no time to rest on our laurels: we haven’t yet eliminated the need to program data integration workloads, just made it easier.

With Cloud, data sources and targets are becoming fully automated, in some cases hosted across hundreds of servers on virtual machines in different locations, leaving those who use classical ETL tools fighting an uphill battle. And then there are those ‘known unknowns’. For example, when a user uploads a spreadsheet to a SaaS application, IT might know the name of the file and the database table in which to store the information in advance, but not the actual format of the file or the calculations and manipulations that need to take place.

These types of workloads call for new fourth and fifth levels of ETL abstraction. The fourth is intended for the use case just described: we create a template with an incomplete set of ETL metadata describing the workload and inject the missing parts just before we execute the work. In this way the same template can be used not just for one spreadsheet but for hundreds or thousands of similar workloads, avoiding the need for manual work. Kettle started to support level-four abstraction in the form of ETL metadata injection a few years back, and it is becoming more and more popular. Developers no longer have to undertake the extremely tedious job of cobbling together hundreds of transformations to migrate as many database tables; they just create one template and one transformation to inject the details. The fifth level of abstraction returns to programming, but only for the ETL metadata that describes the work that needs to be done.
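The spreadsheet scenario above can be sketched as a template with deliberate gaps that are filled just before execution. Again, this is a hypothetical illustration of the injection idea, not Kettle's actual metadata-injection API:

```java
import java.util.*;

// Toy sketch of level-four abstraction: an ETL template is defined once
// with deliberate gaps (null placeholders), and the missing metadata is
// injected just before execution, so one template serves hundreds of
// similar workloads.
public class MetadataInjection {

    // A template: settings known in advance plus placeholders to inject.
    static Map<String, String> template() {
        Map<String, String> meta = new HashMap<>();
        meta.put("targetTable", "uploads"); // known in advance
        meta.put("delimiter",   null);      // unknown until the file arrives
        meta.put("sourceFile",  null);      // unknown until the file arrives
        return meta;
    }

    // Injection: fill the gaps with values discovered at run time,
    // failing fast if any piece of metadata is still missing.
    static Map<String, String> inject(Map<String, String> meta,
                                      Map<String, String> runtime) {
        Map<String, String> resolved = new HashMap<>(meta);
        runtime.forEach(resolved::put);
        for (Map.Entry<String, String> e : resolved.entrySet())
            if (e.getValue() == null)
                throw new IllegalStateException("Missing metadata: " + e.getKey());
        return resolved;
    }

    public static void main(String[] args) {
        // The same template handles a CSV today and a TSV tomorrow.
        Map<String, String> csvJob = inject(template(),
                Map.of("sourceFile", "sales.csv", "delimiter", ","));
        System.out.println(csvJob.get("targetTable") + " <- " + csvJob.get("sourceFile"));
    }
}
```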
This offers enormous potential, but at the cost of re-introducing some programming complexity. The golden rule of metadata is that it needs to provide a return on investment: reduced maintenance, transparency and so on. At levels four and five we lower the metadata investment by getting the metadata from readily available sources such as database catalogs, web services, or XML and JSON data sources. However, we also increase the investment because we add complexity or re-introduce the need for coding. The Kettle project supports level-five ETL abstraction in the form of a Java API for ETL-metadata programming; you can read more about it on our SDK landing page.

In conclusion, if you’re facing tough integration challenges and you haven’t done so already, take a look at Kettle and make your work life easier by abstracting it out. Only you can determine what will give you the best ROI, but hopefully this overview will help you weigh up your options before you embark on your data integration adventure.
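As a closing illustration of level five — programming the metadata itself — here is a hypothetical sketch (not the actual Kettle SDK) of generating one copy-table description per table reported by a database catalog:

```java
import java.util.*;

// Toy sketch of level-five abstraction: instead of drawing hundreds of
// transformations by hand, a program generates the ETL metadata itself —
// here, one minimal copy-table description per table found in a catalog.
public class MetadataProgramming {

    // Generate a minimal metadata description for migrating one table.
    static Map<String, String> copyTableMeta(String table) {
        return Map.of(
                "name",   "copy_" + table,
                "input",  "SELECT * FROM " + table,
                "output", "target." + table);
    }

    // Generate metadata for every table the catalog reports.
    static List<Map<String, String>> generateAll(List<String> catalogTables) {
        List<Map<String, String>> jobs = new ArrayList<>();
        for (String table : catalogTables)
            jobs.add(copyTableMeta(table));
        return jobs;
    }

    public static void main(String[] args) {
        // In practice the table list would come from a readily available
        // source such as JDBC's DatabaseMetaData.getTables().
        List<Map<String, String>> jobs = generateAll(List.of("customers", "orders"));
        System.out.println(jobs.size() + " transformations generated");
    }
}
```

The investment here is the small program above; the return is that a hundred-table migration costs no more effort than a two-table one.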