What is a data pipeline?
A data pipeline is a system that transfers data in chunks (messages, records) between systems in a serial, systematic manner. These flows are well defined and audited, and may contain sensitive information that needs to be secured. Such pipelines can be application queues, transfers to an archive store or a data lake, or standard communication between different organisational systems. But how is a data pipeline different from data ingestion?
Data pipeline vs. data ingestion pipeline: how are they used together?
Data ingestion typically means absorbing and analysing data to make it simpler or more accessible. Ingestion is a sophisticated process in which data from source systems is processed and refined so that it can be utilized. For example, suppose there is an enterprise database and we need a copy of its data in a data lake or cold storage. How would we do that?
The typical answer is CDC (Change Data Capture), or a full snapshot for static use cases with few updates. We will discuss CDC further, as it is a very commonly used approach.
Debezium is one of the products specializing in streaming changes out of databases. It describes itself as:
“Debezium is an open source distributed platform for change data capture. Start it up, point it at your databases, and your apps can start responding to all of the inserts, updates, and deletes that other apps commit to your databases. Debezium is durable and fast, so your apps can respond quickly and never miss an event, even when things go wrong.”
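In practice, Debezium usually runs as a source connector inside Kafka Connect, registered over Connect's REST API. Below is a minimal sketch of what such a registration could look like for a MySQL source. The host names, credentials and topic prefix are placeholder assumptions, and exact config option names can vary between Debezium versions — check the Debezium documentation for your release.

```python
# A sketch of registering a Debezium MySQL source connector with Kafka
# Connect. Host names, ports and credentials are placeholder assumptions.
import json
import urllib.request


def debezium_mysql_connector(name: str, db_host: str, db_name: str) -> dict:
    """Build a Kafka Connect connector definition for Debezium's MySQL source."""
    return {
        "name": name,
        "config": {
            "connector.class": "io.debezium.connector.mysql.MySqlConnector",
            "database.hostname": db_host,
            "database.port": "3306",
            "database.user": "debezium",      # placeholder credentials
            "database.password": "dbz-secret",
            "database.server.id": "184054",
            "topic.prefix": "inventory",      # topics become inventory.<db>.<table>
            "database.include.list": db_name,
        },
    }


def register(connect_url: str, connector: dict) -> None:
    """POST the definition to the Kafka Connect REST API (typically port 8083)."""
    req = urllib.request.Request(
        f"{connect_url}/connectors",
        data=json.dumps(connector).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # raises on a non-2xx response


connector = debezium_mysql_connector("inventory-cdc", "mysql.internal", "inventory")
# register("http://connect.internal:8083", connector)  # run against a live Connect cluster
```

Once registered, every insert, update and delete committed to the included tables appears as an event on the corresponding Kafka topic.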
Oracle GoldenGate
OGG is a proprietary tool from Oracle capable of generating CDC trail files from databases using a subcomponent called the OGG Data Pump. It is the preferable option for organisations that already have an Oracle enterprise license or a similar arrangement (Oracle offers attractive bundling options).
For API-based, trail-file-based and snapshot-based systems
For integrating sophisticated or non-standard systems, one can use Apache NiFi, a brilliant open-source ingestion tool. It can ingest data from almost any source, and it is a distributed, production-ready tool generally used in integration and automation use cases.
Ingestion and data pipelines put together
An ingestion tool like Apache NiFi, put together with a streaming platform like Apache Kafka, brings to the table extensive capability to integrate almost all real-world systems seamlessly.
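To make the division of labour concrete, here is a toy, in-memory analogy of that flow: an "ingest" step parses an XML snapshot record (NiFi's role), a queue stands in for a Kafka topic, and a consumer drains it. This only illustrates the data flow, not how either tool is actually wired up.

```python
# Toy analogy of a NiFi -> Kafka -> consumer flow, using only the stdlib.
import json
import queue
import xml.etree.ElementTree as ET

topic = queue.Queue()  # stand-in for a Kafka topic


def ingest(xml_record: str) -> None:
    """Parse an XML snapshot record and publish it as JSON bytes (NiFi's role)."""
    root = ET.fromstring(xml_record)
    event = {child.tag: child.text for child in root}
    topic.put(json.dumps(event).encode("utf-8"))


def consume() -> dict:
    """Read one message off the topic (a Kafka consumer's role)."""
    return json.loads(topic.get(timeout=1))


ingest("<order><id>42</id><status>shipped</status></order>")
print(consume())  # {'id': '42', 'status': 'shipped'}
```

In a real deployment, NiFi's Kafka processors publish to a broker and downstream consumers scale out independently; the key idea is the same decoupling of ingestion from consumption via a durable topic.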
Some data ingestion pipeline use cases
- CDC data from enterprise databases to a data lake such as HDFS or S3. Data can be extracted from databases using Debezium or Oracle GoldenGate and produced to Kafka (using the OGG Kafka Handler), then consumed by the Confluent HDFS or S3 sink connector and ultimately landed in HDFS or S3. These data sets can be analysed further via ETL.
- For snapshot-based systems whose data is generally in XML, JSON or delimited formats, one can use Apache NiFi to push records to Apache Kafka or Kinesis and on to consumers; the data can then be processed and interpreted to generate interactive dashboards and visualizations.
- For efficient processing and enhanced performance, Apache Avro can be used as the intermediate format; Hadoop supports the Avro format extensively. Read more about Apache Avro here.
- Clickstream and IoT solutions work like a pro with Apache Kafka at their core; Kafka's low latency and distributed architecture complement them well. Here is one example of clickstream data analytics with Kafka, ksqlDB, Elasticsearch and Grafana for visualizations and clickstream dashboard generation.
- Kafka is used at Uber at very high scale. The use cases include, but are not limited to, real-time aggregation of geospatial time series, computing key metrics, forecasting marketplace dynamics, and extracting patterns from various event streams. Read more about it here.
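For the CDC-to-data-lake case in the first bullet, a consumer ultimately has to apply each change event to a downstream copy. The sketch below shows that step for a Debezium-style envelope, whose `op`, `before` and `after` fields follow Debezium's event format; the table and field names here are made up for illustration.

```python
# Applying Debezium-style CDC events to an in-memory replica of a table.
import json


def apply_change(event_json: str, table: dict) -> None:
    """Apply one insert/update/delete change event to a dict keyed by row id."""
    event = json.loads(event_json)
    op = event["op"]  # "c" = create, "u" = update, "d" = delete
    if op in ("c", "u"):
        row = event["after"]
        table[row["id"]] = row
    elif op == "d":
        table.pop(event["before"]["id"], None)


replica: dict = {}
apply_change('{"op": "c", "before": null, "after": {"id": 1, "name": "widget"}}', replica)
apply_change('{"op": "u", "before": {"id": 1, "name": "widget"}, "after": {"id": 1, "name": "gadget"}}', replica)
# replica now holds the latest state of row 1
```

A real sink connector does essentially this against HDFS or S3 partitions instead of a dict, with batching and exactly-once bookkeeping layered on top.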