Machine learning with Kafka – The modern Stream Processing platform for all data preprocessing needs
Some bits about Kafka
Kafka is a popular messaging platform based on the publish-subscribe (pub-sub) mechanism. It is a highly available, fault-tolerant, and distributed system. Many organisations use Kafka across different use cases, but the best part of Kafka is its service-oriented architecture: it makes Kafka language agnostic and gives it wide usability.
One Kafka use case involves integration with Python. Existing Python-based applications can communicate with each other using Kafka as a message queue, or Kafka can serve as an Enterprise Service Bus (ESB) through which all data transfers happen.
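As a minimal sketch of that Python-to-Python pattern, one service can produce JSON events to a topic while another consumes them. The `kafka-python` client, the broker address, the `orders` topic, and the event fields here are all illustrative assumptions, not details from the text:

```python
# Sketch: two Python services talking over a Kafka topic.
# Assumes `pip install kafka-python` and a broker at localhost:9092.
import json


def encode_event(event: dict) -> bytes:
    """Serialize an event to the JSON bytes placed on the wire."""
    return json.dumps(event).encode("utf-8")


def decode_event(payload: bytes) -> dict:
    """Deserialize an event received from a Kafka topic."""
    return json.loads(payload.decode("utf-8"))


def run_producer():
    # Requires a running broker; imported lazily so the helpers
    # above stay usable without the client installed.
    from kafka import KafkaProducer
    producer = KafkaProducer(bootstrap_servers="localhost:9092",
                             value_serializer=encode_event)
    producer.send("orders", {"order_id": 42, "amount": 99.5})
    producer.flush()


def run_consumer():
    from kafka import KafkaConsumer
    consumer = KafkaConsumer("orders",
                             bootstrap_servers="localhost:9092",
                             value_deserializer=decode_event)
    for message in consumer:
        print(message.value)  # the decoded dict sent by the producer
```

Because both sides agree only on the byte format of the messages, either service could be swapped for a producer or consumer written in any other language — this is the language-agnostic property mentioned above.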
This is a standard example of an enterprise streaming data pipeline architecture.
These patterns extend to Big Data needs as well: Kafka is used in Big Data pipeline architectures by tuning the implementation (for example, batching) to accommodate high-throughput workloads. In most of these use cases, Kafka serves as the integration layer. Apache Avro is used as a standard format in Big Data pipelines because Avro is a native data type in Hadoop; it is a serialized binary format (taking less bandwidth over the wire) and is highly efficient for storage, transfer, schema handling, processing, and operations (file management in HDFS).
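The bandwidth claim is easy to see in miniature. Real Avro needs an external library (e.g. `fastavro`), so this stdlib-only sketch instead packs a record with `struct` the way a schema-based binary format does — field names travel in the schema, not in every message — and compares it with the same record as JSON. The record fields are made-up illustrations:

```python
# Illustration: schema-based binary encoding vs. self-describing JSON.
import json
import struct

record = {"sensor_id": 1024, "temperature": 21.5, "humidity": 0.43}

# JSON repeats the field names in every message.
json_bytes = json.dumps(record).encode("utf-8")

# With a fixed schema (int32, float64, float64) only the values are sent,
# mirroring Avro's schema-plus-binary-payload model.
binary_bytes = struct.pack("<idd", record["sensor_id"],
                           record["temperature"], record["humidity"])

print(len(json_bytes), len(binary_bytes))  # the binary payload is far smaller
```

Avro adds schema evolution and a container file format on top of this idea, which is why it fits Hadoop-centric pipelines so well.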
Kafka with TensorFlow
TensorFlow is an open-source library for machine learning and artificial intelligence. It is used for training and inference of deep neural networks. It was developed by the Google Brain team.
Most ML/AI architectures are not well equipped to handle continuous data streams, which is exactly what modern workloads demand: many predictive-analytics loads run on IoT-based data sets, which are huge continuous streams of information generated by always-on electronic components. Generating insights, spotting trends, and monitoring are at the heart of industrial modernisation.
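A hedged sketch of the kind of continuous-stream workload this describes is a tumbling-window aggregation over a sensor stream — the basic building block of monitoring and trend detection. The function name, window size, and sample values are illustrative; in production the readings would arrive from a Kafka topic rather than an in-memory list:

```python
# Sketch: tumbling-window averages over a continuous stream of readings.
from statistics import mean


def tumbling_window_averages(readings, window_size=3):
    """Yield the average of each consecutive, non-overlapping window."""
    window = []
    for value in readings:
        window.append(value)
        if len(window) == window_size:
            yield mean(window)
            window = []  # start the next window


stream = [20.0, 21.0, 22.0, 30.0, 31.0, 32.0]
print(list(tumbling_window_averages(stream)))  # -> [21.0, 31.0]
```

Because the generator consumes one reading at a time, the same logic works unchanged whether `readings` is a list or an endless iterator fed by a Kafka consumer.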
A Kafka-based data processing platform can provide scalable, fault-tolerant, mission-critical machine learning infrastructure covering ingestion, preliminary processing, data modelling, training, and deployment of analytical models. A well-designed Kafka platform can also provide rich monitoring of machine learning and artificial intelligence workloads.
This modern streaming architecture enables competencies such as scalable data preprocessing for training and predictions, integration of divergent deep learning frameworks, highly available data replication between data centres (disaster recovery, DR), intelligent real-time microservices deployed on Kubernetes (preferably in the cloud), and in-house deployment of analytical models for offline predictions.
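The core of such a real-time scoring microservice is a consume-preprocess-predict loop. In this sketch an in-memory queue stands in for a Kafka topic and a stub scoring function stands in for a loaded TensorFlow model; the field names, scaling factors, and threshold are all assumptions for illustration:

```python
# Sketch: the consume -> preprocess -> predict loop of a scoring service.
import queue


def preprocess(raw: dict) -> list:
    # Scale raw features into the [0, 1] range the model expects.
    return [raw["temperature"] / 100.0, raw["vibration"] / 10.0]


def predict(features: list) -> str:
    # Stub model: a real service would call a loaded TensorFlow SavedModel.
    score = 0.7 * features[0] + 0.3 * features[1]
    return "alert" if score > 0.5 else "ok"


events = queue.Queue()  # stand-in for a Kafka topic
events.put({"temperature": 90.0, "vibration": 8.0})
events.put({"temperature": 20.0, "vibration": 1.0})

while not events.empty():
    raw = events.get()
    print(predict(preprocess(raw)))  # prints "alert" then "ok"
```

Keeping preprocessing identical in the training pipeline and in this serving loop is exactly the kind of consistency a single Kafka-centred data platform makes easier to enforce.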
Kafka is flexible enough to excel in both cloud and on-premise deployments. This makes it a good fit for all leading Machine Learning as a Service (MLaaS) offerings, including but not limited to Amazon ML and SageMaker, Google ML Engine, Microsoft Azure AI Platform, and IBM Watson Machine Learning.
Kafka is best suited as a distributed machine learning data pipeline combined with distributed persistent storage for the ML-as-microservices use case. As organisations shift from monolithic to distributed microservice-style architectures, a distributed platform compatible with machine learning is much needed.
Some references to start understanding machine learning with Kafka and TensorFlow