Extract-Transform-Load (ETL) has been used to support data integration, migration, and warehousing for decades. Regardless of whether that data is being moved in batches or in real-time, the basic concept remains the same:
Step 1: Extract
First, you import data of different types (e.g., structured data, unstructured data, CSV files) from your data sources into a single repository.
Step 2: Transform
Next, you perform some process on that data to convert it into a usable format for the destination system.
Step 3: Load
Finally, you send that processed data to the destination system in either a full or incremental data load.
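The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the sample CSV, the `orders` table, and the function names are all assumptions made for the example, and an in-memory SQLite database stands in for the destination system.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real pipeline would read files, APIs, or databases.
RAW_CSV = """order_id,location,amount
1,Downtown,19.99
2,Airport,5.50
3,Downtown,12.00
"""

def extract(raw_text):
    """Step 1: read rows from the source into a list of dicts."""
    return list(csv.DictReader(io.StringIO(raw_text)))

def transform(rows):
    """Step 2: convert string fields into the types the destination expects."""
    return [
        (int(r["order_id"]), r["location"].strip().lower(), float(r["amount"]))
        for r in rows
    ]

def load(conn, records):
    """Step 3: write the processed records to the destination (a full load)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, location TEXT, amount REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", records)
    conn.commit()

def run_etl(raw_text, conn):
    """Run all three steps and return the number of rows in the destination."""
    load(conn, transform(extract(raw_text)))
    return conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

Calling `run_etl(RAW_CSV, sqlite3.connect(":memory:"))` runs the whole pipeline once. Because the load uses `INSERT OR REPLACE`, running it again with the same data does not create duplicate rows.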
Batch ETL used to be the mainstay of ETL processes, but in recent years more and more businesses are converting to streaming ETL frameworks. Today, let’s clarify what batch ETL and streaming ETL are, the benefits and uses of each, and where ETL is headed in the future.
What is Batch ETL?
Batch ETL is also referred to as traditional ETL. It is the original technique used for extracting data from a source system. The data is gathered in batches at set intervals (hourly, daily, weekly, etc.), transformed appropriately, and then loaded into the destination system. A batch ETL process could also be scheduled based on a triggering event.
When is Batch ETL most useful?
Batch ETL is most useful when getting data updated in real-time is not essential. For example, a chain of restaurants may use batch ETL to run a daily report on revenues at each location. Or a company may trigger batch ETL processes to incrementally load data into a data lake or warehouse.
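As a rough sketch of both examples, the code below totals one day of revenue per location and then appends only new rows to a warehouse. The `SALES` records, the field names, and the use of an increasing `id` as a high-water mark are assumptions for illustration; real systems often track a timestamp instead.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sales records gathered since the last batch run.
SALES = [
    {"id": 1, "day": date(2024, 1, 1), "location": "Downtown", "amount": 100.0},
    {"id": 2, "day": date(2024, 1, 1), "location": "Airport", "amount": 50.5},
    {"id": 3, "day": date(2024, 1, 1), "location": "Downtown", "amount": 25.0},
    {"id": 4, "day": date(2024, 1, 2), "location": "Airport", "amount": 40.0},
]

def daily_revenue_report(sales, report_day):
    """Batch job: total revenue per location for one calendar day."""
    totals = defaultdict(float)
    for sale in sales:
        if sale["day"] == report_day:
            totals[sale["location"]] += sale["amount"]
    return dict(totals)

def incremental_load(warehouse, sales, last_loaded_id):
    """Append only rows newer than the previous run's high-water mark.

    Returns the new high-water mark to store for the next run.
    """
    new_rows = [s for s in sales if s["id"] > last_loaded_id]
    warehouse.extend(new_rows)  # a list stands in for the warehouse table
    return max((s["id"] for s in new_rows), default=last_loaded_id)
```

A scheduler (cron, for instance) would call these functions once per interval. Nothing happens between runs, which is why results lag behind the source by up to one interval.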
What is Streaming ETL?
Streaming ETL is often referred to as real-time ETL or stream processing because it moves data immediately instead of waiting for triggers or timed intervals. With streaming ETL, data is constantly flowing: as soon as new information is added to the data stream, it enters the ETL process and is transformed to the specifications of the data project.
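The difference from batch processing can be sketched with a Python generator: each message is transformed and loaded the moment it is read, with no accumulation step. The `message_source` generator is a hypothetical stand-in for a message queue or socket, and a plain list stands in for the destination system.

```python
import json

def transform(raw_line):
    """Parse one raw JSON message and normalize its fields."""
    msg = json.loads(raw_line)
    return {"user": msg["user"].strip().lower(), "amount": float(msg["amount"])}

def stream_etl(source, sink):
    """Pull messages one at a time; each is transformed and loaded immediately."""
    for raw_line in source:               # `source` may be an unbounded iterator
        sink.append(transform(raw_line))  # load right away, no batching

def message_source():
    """Hypothetical stand-in for a message queue; yields raw JSON strings."""
    yield '{"user": " Alice ", "amount": "10.5"}'
    yield '{"user": "BOB", "amount": 3}'
```

With a real queue the loop would never finish on its own. It would simply block until the next message arrives, so the destination is never more than one message behind the source.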
When is Streaming ETL most useful?
Streaming ETL is most useful when immediate insights are important. For example, a bank would use streaming ETL for fraud detection on purchases or real-time payment processing. An airline would apply streaming ETL to reflect accurate information on how many tickets are currently available for a flight. Or a meteorologist may rely on the real-time insights of streaming ETL for up-to-the-minute weather data.
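To make the fraud detection example concrete, here is a toy sketch of a per-event check that keeps a running average for each card and flags purchases far above it. The `FraudMonitor` class, the `factor` threshold, and the `min_history` setting are illustrative assumptions, not how any real bank scores transactions.

```python
from collections import defaultdict

class FraudMonitor:
    """Flags a purchase that exceeds `factor` times the card's running average."""

    def __init__(self, factor=5.0, min_history=3):
        self.factor = factor            # assumed threshold multiplier
        self.min_history = min_history  # purchases needed before flagging starts
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    def process(self, card_id, amount):
        """Evaluate one purchase as it arrives, then fold it into the history."""
        count = self.counts[card_id]
        suspicious = False
        if count >= self.min_history:
            average = self.totals[card_id] / count
            suspicious = amount > self.factor * average
        # Update state after the check so a purchase is not compared with itself.
        self.totals[card_id] += amount
        self.counts[card_id] += 1
        return suspicious
```

The decision for each purchase is available as soon as `process` returns. A nightly batch job would only surface the same pattern hours later.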
Benefits & Shortfalls of Streaming and Batch ETL
Streaming ETL is becoming more popular than batch ETL, with more than 60% of companies using it. However, each approach has its own benefits, shortfalls, and ideal applications, so it’s important for every organization to examine its specific goals when deciding which one to use.
| |Batch ETL|Streaming ETL|
|---|---|---|
|Benefits|Simple to implement and monitor; compatible with legacy systems; can batch huge volumes of data over time|Continuous, low-latency ETL; data processing is immediate; compatible with new technologies; real-time data insights|
|Shortfalls|Failure in one data set could cause failure in entire batch process; new data types will not be recognized automatically and will cause inaccuracies|Live data means outages and performance issues are more urgent than with batch; instantaneous processing makes it harder to guarantee data quality and accuracy; requires highly capable platforms to properly perform and reduce latency|
The sheer quantity of data, plus the continued adoption of big data and AI processes, is discouraging many organizations from defaulting to batch ETL processes. Businesses are moving away from single-server databases to more complicated and distributed architectures. They’re acquiring data from more varied sources across the web, social media, mobile, IoT, and other devices. Plus, the desire to capitalize on real-time analytics is making streaming ETL all the more appealing across the board.
While batch is still useful in specific scenarios and with compatible legacy systems, the majority of ETL processes are evolving to the streaming model. Businesses can, however, mix and match real-time and traditional ETL methods to meet their use cases; the two approaches don’t have to be mutually exclusive. iPaaS solutions like StarfishETL enable these capabilities so businesses can create dynamic and scalable data relationships.