The Rise of Streaming Platforms as Traditional ETL Becomes Obsolete


The Global Big Data market was estimated at $23.5 billion in 2015 and is expected to reach $118.52 billion by 2022, growing at a CAGR of 26% from 2015 to 2022. (1) The last decade has seen a sharp increase in the use of data, which in turn has created billion-dollar valuations for companies, driven industry growth, created new revenue streams, and given birth to a “New Economy” in which companies constantly acquire capabilities to harness massive volumes of data for their business growth. Given this importance of data, the challenge ahead is how companies can manage data that is not only voluminous but also complex, since it is collected from disparate sources such as ERP, CRM, and social channels. Each of a company’s departments has different needs and would query the data differently. Under such circumstances, companies must standardize the data and derive an overall meaning from it.

ETL, a concept popularized since the 1970s, found extensive application in data warehousing over the last couple of decades. There was a constant need for a system that could gather data, standardize it, store it, and make it available for end users to query. ETL (Extract, Transform, and Load) did this job, but in recent times requirements around data usage have changed dramatically, calling for a more advanced solution to address increasing complexity. Several recent data trends made it necessary to replace the traditional ETL architecture: single-server databases have given way to a myriad of distributed data platforms operating at company-wide scale, and more types of data sources are present, such as logs and sensors, beyond just transactional data. Traditional ETL had other drawbacks as well: it required a global schema, its data cleansing and curation were manual and error-prone, it carried a high operational cost, and it could process data only in batch fashion. To overcome these shortcomings, Enterprise Application Integration (EAI), a different class of data integration technology for connecting applications in real time, was invented. EAI employed Enterprise Service Buses and message queues, and so was not scalable. The difference between the two is that ETL was scalable but batch-only, while EAI was real-time but not scalable; hence both ETL and EAI proved inadequate for the latest requirements of the tech industry. Data integration and ETL in the modern world needed a complete revamp, and that is where streaming platforms such as Apache Spark and Amazon Kinesis came into the picture, as they can perform real-time analytics and are also scalable.
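The batch-versus-streaming distinction above can be sketched in a few lines of plain Python. This is an illustrative toy, not the API of any real platform: `batch_etl` and `streaming_etl` are hypothetical names, and a production pipeline would run on something like Apache Spark or Amazon Kinesis rather than in-memory lists.

```python
def batch_etl(rows):
    """Traditional ETL: extract a full snapshot, transform it, load it in one pass."""
    extracted = list(rows)                                # extract: whole dataset at once
    transformed = [r.strip().upper() for r in extracted]  # transform: normalize records
    warehouse = []
    warehouse.extend(transformed)                         # load: bulk write to the target
    return warehouse

def streaming_etl(event_stream):
    """Streaming: each event is transformed and emitted as it arrives."""
    for event in event_stream:
        yield event.strip().upper()                       # continuous, record-at-a-time transform

rows = [" order-1 ", " order-2 "]
print(batch_etl(rows))                   # whole batch processed together
print(list(streaming_etl(iter(rows))))   # same result, produced one event at a time
```

The batch version must wait for the entire extract to finish before anything is loaded; the streaming version produces each output as soon as its input event arrives, which is the property that makes real-time analytics possible.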

Fig. 1 Graph denoting the relationship between ETL, EAI, and Streaming Platforms

The modern event-driven world has a new set of requirements for data integration. A solution should be able to process not just high-volume but also high-diversity data; it needs to be real-time, which involves a transition to event-centric thinking; it must enable a forward-compatible data architecture; and it must allow more applications to be added that process the same data, but differently. In summary, the needs of modern data integration solutions are scalability, diversity, low latency, and forward compatibility. Streaming platforms are increasingly being used because they solve the challenges of traditional ETL and EAI. The streaming platform serves as the central nervous system for a company’s data: it is the source-of-truth pipeline feeding all data-processing destinations, such as Hadoop, data warehouses, NoSQL systems, and more.
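The "central nervous system" idea can be made concrete with a minimal sketch: one append-only event log, and many independent consumers that read the same events but process them differently, each tracking its own position. The `EventLog` class and consumer names here are hypothetical illustrations, assuming an in-memory list where a real deployment would use a durable streaming platform.

```python
class EventLog:
    """Append-only event log; every consumer keeps its own read offset."""
    def __init__(self):
        self.events = []
        self.offsets = {}  # consumer_id -> next index to read

    def publish(self, event):
        self.events.append(event)

    def consume(self, consumer_id):
        """Return events this consumer has not yet seen, and advance its offset."""
        offset = self.offsets.get(consumer_id, 0)
        new_events = self.events[offset:]
        self.offsets[consumer_id] = len(self.events)
        return new_events

log = EventLog()
log.publish({"user": "a", "amount": 10})
log.publish({"user": "b", "amount": 25})

# Two destinations process the same event stream differently:
analytics_total = sum(e["amount"] for e in log.consume("warehouse"))   # aggregate for the DWH
user_index = {e["user"]: e for e in log.consume("nosql-index")}        # key-value view for NoSQL
print(analytics_total)     # 35
print(sorted(user_index))  # ['a', 'b']
```

Because each consumer owns its offset, adding a new downstream application is just a new `consumer_id` reading from the start of the log; nothing upstream changes, which is the forward-compatibility property the paragraph describes.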

If you are interested in learning more about streaming platforms, please check out the official ENFUSE.IO blog, where Enfuse engineers provide in-depth explanations in their articles. ENFUSE.IO is a professional services company that helps clients deploy production-scale data processing pipelines using pair programming and test-driven development.