
Extract, Load, Transform?

Even as technology changes, many things stay the same, not least in the world of enterprise-scale data processing, where massive volumes of data often need to be reformatted, restructured, and moved from one place to another (for a broader introduction, see our Key Criteria report on data pipelines).

The Extract-Transform-Load (ETL) model emerged with traditional, data-centric architectures in which large-scale, structured databases were king. Cloud computing is changing the game, for several reasons:

  • ‘Elastic’ infrastructure (which scales according to need) reduces the constraints on what data can be stored where and when
  • Serverless compute models work on an event-driven basis, which is more conducive to (potentially infrequent or nonlinear) streaming data sources
  • Cloud-based data architectures offer greater flexibility, for example, by not requiring data to be fully structured in advance of storing it

As a result of such drivers, models are emerging that are more akin to Extract-Load-Transform (ELT) than ETL. Data can be loaded into a cloud-based data warehouse directly from the source, as database exports, content, or data streams, and then treated ‘in situ.’ While the difference in naming may not appear huge, its impact is profound.
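
To make the difference concrete, here is a minimal sketch of the two flows in Python. The helper names (source, warehouse, transform, and the table names) are hypothetical placeholders for illustration, not any particular product's API.

# ETL: reshape the data to the target schema before it is stored.
def etl_pipeline(source, warehouse):
    raw_rows = source.extract()           # pull from the operational system
    clean_rows = transform(raw_rows)      # schema must be settled up front
    warehouse.load("sales", clean_rows)   # only transformed data is kept

# ELT: land the data as-is, then transform it inside the warehouse.
def elt_pipeline(source, warehouse):
    raw_rows = source.extract()
    warehouse.load("raw_sales", raw_rows)
    # Transformation runs in situ, typically as SQL over the raw table,
    # and only when the reshaped data is actually needed:
    warehouse.execute("""
        CREATE TABLE sales AS
        SELECT order_id,
               CAST(amount AS DECIMAL(12, 2)) AS amount
        FROM raw_sales
        WHERE amount IS NOT NULL
    """)

def transform(rows):
    # Placeholder: a real ETL job would enforce the full target schema here.
    return [row for row in rows if row.get("amount") is not None]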

First, and perhaps most obviously, it streamlines the process overall: data can be loaded in raw form even as you decide how to pre-process and manipulate it. In addition, data need only be transformed when it is actually required: rather than having to process all data into a new format before any of it can be analyzed or used, you can apply transformations to just the subset you need. Note: this is different from schema-on-read models, which bring transformation and query closer together but do not reduce the gap between transformation and load.
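
As an illustration, the fragment below transforms only the slice of raw data that one report needs, at query time and inside the warehouse. It assumes a DB-API-style connection (for example psycopg2) and a raw_events table, both hypothetical.

def report_rows(conn, start, end):
    # Transform in situ, and only the window the report actually uses;
    # rows outside the range are never reprocessed.
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT user_id,
                   CAST(amount AS NUMERIC(12, 2)) AS amount,
                   DATE_TRUNC('day', event_time)  AS day
            FROM raw_events
            WHERE event_time >= %s AND event_time < %s
            """,
            (start, end),
        )
        return cur.fetchall()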

ELT also reduces risk. Historically, a transformation could only be checked once the data was loaded; by loading the data first, transformations can be tested in situ, likely against a sample subset. The overall result is potentially more effective data pipelines, particularly in situations where data structures are harder to lock down. You should also see higher levels of cost efficiency: rather than building multiple small bridges between data islands, you are creating a single data ‘mainland’ as a basis for data transformation at scale.
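
One way to put this into practice is to run the same transformation SQL against a small sample of the already-loaded table and sanity-check the output before committing to a full run. This is a sketch under the same hypothetical-connection assumption as above; note that TABLESAMPLE syntax varies between warehouses.

TRANSFORM_SQL = """
    SELECT order_id,
           CAST(amount AS DECIMAL(12, 2)) AS amount
    FROM raw_sales {sample}
"""

def test_transform_on_sample(conn):
    # Dry-run the transformation on roughly 1% of the loaded rows.
    with conn.cursor() as cur:
        cur.execute(TRANSFORM_SQL.format(sample="TABLESAMPLE BERNOULLI (1)"))
        for order_id, amount in cur.fetchall():
            assert amount is None or amount >= 0, f"bad amount on {order_id}"

def run_full_transform(conn):
    # Only after the sample passes do we transform the whole table.
    with conn.cursor() as cur:
        cur.execute("CREATE TABLE sales AS " + TRANSFORM_SQL.format(sample=""))
    conn.commit()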

However, these benefits come with a number of ramifications and considerations:

  • Potentially, the sky is the limit in terms of which data sources could be loaded; you will still need to decide what you actually need to store.
  • In the case of serverless data use, you can define and configure events to trigger how data is loaded and transformed (see the sketch after this list).
  • You will need to define (in advance) a cloud-based data architecture that is suited to the multiplicity of data types that you are looking to manage.
  • Compliance and security considerations remain as important as ever, with policies applying to the data pool as a whole, rather than to isolated data silos.
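
On the serverless point above, a minimal sketch of event-driven loading: an AWS Lambda handler subscribed to S3 ‘object created’ events lands each new object in a raw zone and defers all transformation. The bucket, table, and load_raw helper are hypothetical.

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # S3 put events arrive as a list of records identifying the new object.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        # Land the object as-is; transformation is deferred until needed.
        load_raw(table="raw_events", payload=body)

def load_raw(table, payload):
    # Placeholder for a warehouse-specific bulk load (for example a COPY
    # statement or a vendor SDK call); omitted to keep the sketch generic.
    ...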

Overall, ELT changes the equation. Rather than needing to apply fixed data manipulation rules according to specific requirements (as was the case with ETL), you can be much more granular about defining and applying transformations according to your needs. At the same time, you will need to consider cloud-based data architecture strategically, taking into account current and future application and analytics needs.

This brings us to a final point: as we all know, BI and analytics are not standing still. With machine learning increasing its impact, and new data architectures and use cases moving closer to the infrastructure edge, how we manipulate and use data will also have to evolve. Key to this evolution will be how we manage data both flexibly and at massive scale, minimizing any bottlenecks to data delivery, however the data is to be used.



from Gigaom https://gigaom.com/2020/02/28/extract-load-transform/
