Dec 13, 2022

Many asset managers have developed skepticism of the myriad modern data platform tools available today and the tangible benefits they promise to deliver over existing traditional data stacks. This extends to data pipelines, where there’s a view in some quarters that the emerging suite of next generation pipeline tools is functionally no different from traditional tooling and essentially just repackaged for marketing purposes.

This skepticism is no doubt driven in part by past experiences with other emerging technologies that failed to live up to their hype. Are these next gen pipeline tools truly “emergent,” or just incremental improvements on existing capabilities?

Traditional Approaches to Moving Data

The primary goal of provisioning access to quality data in a timely and efficient manner still stands today. The fundamental purpose of data pipelines, moving data from a source to a destination, also remains unchanged; pipelines can be thought of as the “glue” that binds a data platform together. However, this basic data transportation process has expanded over time into a broader set of data processing activities that incorporate both operational and business logic to perform advanced data sourcing, filtering, cleansing, aggregation, enrichment, transformation, and loading.

Over the past two decades, asset managers have built their data platforms using enterprise data management (EDM) or extract, transform, load (ETL) tools for ingestion and transformation. This EDM/ETL approach was well suited to the use cases of the time, when asset managers needed to extract structured, batch-processed investment data from operational transaction systems for cleansing, merging, transformation, and loading into data warehouse tables for analysis.

To date, these pipelines have served asset managers well, and they remain a good option for structured batch datasets. In this context, questioning the value of next gen pipeline tools is understandable. In practice, however, many firms today wrestle with the same issues that prompted their initial investments in these EDM/ETL pipeline tools. So, what’s changed?

Drivers for Next Generation Pipeline Solutions

The basic need to ingest and transform structured investment datasets won’t be disappearing anytime soon. That said, business requirements continue to evolve, and new use cases have surfaced that increase the demand to process more, and more varied, datasets. Once again, this creates significant challenges and pain points for many firms’ data teams.

The adoption of cloud services has propelled this evolution. Increasing costs and specialist resource constraints associated with maintaining on-premises computing and storage have compelled organizations to replace their outdated legacy IT infrastructure by migrating to the cloud and using SaaS applications. The emergence of these multi-environment configurations (e.g., on-premises, cloud, hybrid, and multi-cloud) has spawned use cases requiring that asset managers source data from a proliferation of disparate locations.

Data sources and types have also extended past traditional transactional-based investment data, and the demand to ingest and analyze unstructured data has progressed beyond the realm of marketing. Today, a growing appetite for alternative data analytics comes from the investment side. These alternative datasets (e.g., satellite weather data, geolocation foot traffic, customer credit card transactions, and social and sentiment data) require pipelines that can readily handle large-volume, unstructured, real-time, streamed, and limited-lifespan data.

Benefits of Next Generation Data Pipelines

So what distinguishes the emerging crop of next gen pipeline tools?

The biggest difference is that these next gen tools change the way pipelines are built. Traditional pipelines were built with the same approaches and technology advancements that drove the evolution of software application development practices over the past 20 years, largely on premises. Modern data pipelines, by contrast, are built on scalable cloud-based architecture. They leverage the elasticity and agility of the cloud to scale readily, work around traditional on-premises infrastructure bottlenecks, and address the latency issues associated with processing real-time streams and large unstructured data volumes.

Another key trend in modern data pipeline design is a change in where and when the transformation stage takes place. More agile pipeline processes are supported through the adoption of extract, load, transform (ELT), which defers execution of transformation logic until after the data has been loaded into the data warehouse or data lake, taking advantage of the scalability and performance of these modern cloud-based platforms.
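
To make the ELT pattern concrete, the sketch below shows the idea in outline: raw data is landed in the warehouse first, and the business logic then runs as SQL inside the warehouse itself. It is a minimal, generic illustration rather than a description of any particular vendor’s tool; the connection string, source file, and table and column names are hypothetical.

```python
# Minimal ELT sketch: extract and load raw data first, transform in the warehouse.
# The DSN, file name, and table/column names below are illustrative assumptions.
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:password@warehouse-host/analytics")  # placeholder DSN

# Extract + Load: land the raw data as-is, with no cleansing or business logic yet.
raw = pd.read_csv("positions_20221213.csv")  # hypothetical source extract
raw.to_sql("raw_positions", engine, if_exists="replace", index=False)

# Transform: run the business logic inside the warehouse, where the cloud
# platform's own compute handles the heavy lifting.
with engine.begin() as conn:
    conn.execute(text("""
        CREATE TABLE IF NOT EXISTS curated_positions AS
        SELECT portfolio_id,
               security_id,
               SUM(market_value) AS total_market_value
        FROM raw_positions
        GROUP BY portfolio_id, security_id
    """))
```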

Moreover, next gen pipeline tools support self-service management, allowing teams to easily create and maintain data pipelines without the assistance of skilled IT professionals. These tools can also leverage a semantic layer, mitigating the need to move large volumes of data between sources and locations. And they support the use of simple declarative SQL statements to implement parts of the pipeline, democratizing data access and avoiding the workload backlog queues that teams typically experience with traditional pipeline development. This self-service management goes further by employing data observability tools that simplify the monitoring and management of pipeline problems.
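
The observability angle can be illustrated with a simple example. The sketch below hand-codes the kind of freshness and volume checks an observability layer runs automatically; the warehouse connection and the curated_positions table with its loaded_at timestamp column are assumed for illustration, and a real tool would let a team declare such checks rather than script them.

```python
# Hand-coded versions of two common observability checks (volume and freshness).
# The DSN, table, and loaded_at column are hypothetical assumptions.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:password@warehouse-host/analytics")  # placeholder DSN

with engine.connect() as conn:
    # Volume check: did the latest pipeline run actually deliver any rows?
    row_count = conn.execute(text("SELECT COUNT(*) FROM curated_positions")).scalar()

    # Freshness check: has the table been refreshed within the last 24 hours?
    stale = conn.execute(text(
        "SELECT COALESCE(MAX(loaded_at) < NOW() - INTERVAL '24 hours', TRUE) "
        "FROM curated_positions"
    )).scalar()

if row_count == 0:
    print("ALERT: curated_positions is empty after the latest run")
if stale:
    print("ALERT: curated_positions has not been refreshed in the past 24 hours")
```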

Where to Next?

The pipeline’s basic purpose of data transportation and its primary goal of provisioning access to quality data in a timely, efficient manner have not changed. However, data pipeline tools have evolved significantly in line with changing business requirements and the adoption of cloud services. The emerging next gen pipeline tools have been purposefully designed to address the pain points of traditional data pipeline development: latency, bespoke and complicated single-use builds, lack of automation and standardization, and reliance on specialized skills and knowledge.

So, if your firm’s use cases have evolved beyond the need to ingest, transform, and load traditional structured, batch-processed investment data and into the realm of large-volume, unstructured, real-time, and streamed data, these next gen pipeline tools will likely prove invaluable.

Looking at the bigger picture, modern data platforms have also evolved over the past decade. The latest incarnation is the emergence of the data fabric, a framework that aims to support automated, flexible, and reusable pipelines while leveraging ML/AI capabilities.

What Is a Data Fabric?

A data fabric is a combination of architecture and technology that reduces the complexity of managing different kinds of data across multiple sources and systems. It provides a unified data environment with tools and services for accessing, integrating, cleaning, governing, and sharing data.

Want to know more about what’s coming down the line with modern data platforms? Check out Cutter’s Data Fabric and Data Mesh: An Introduction whitepaper.