21: Real-Time Data Streaming with Apache Kafka 📡

Hey data folks! 👋

Apache Kafka offers a robust ecosystem for handling high-speed data streams and is widely used in various industries for applications such as event-driven architectures, real-time analytics, and data integration. In this blog, we will explore Kafka’s core concepts, a practical use case using Confluent Cloud, and how to implement producers and consumers programmatically to process and store data efficiently.

Definitions and Setup

  • Kafka allows consumers to collect and store data arriving at very high speed, and to process this data further using Kafka Streams.

  • Apache Kafka revolves around the following core concepts:

    • Producer: Any application that produces data, for instance, X.com.

    • Consumer: Any application that reads data from Kafka, for instance a Spark streaming application.

    • Broker: A software application installed on nodes that collectively form a cluster.

    • Topic: A unique store that contains a particular kind of data.

    • Partitions: When data grows larger, the topic gets divided into multiple partitions.

    • Partition Offset: Serves as a sequence ID for messages arriving in a particular partition.

    • Consumer Groups: A group of consumers that share the load of fast-paced data produced by multiple producers.

  • Confluent.io provides a managed Kafka environment in the cloud and serves as a great way of practicing Kafka.

  • We create a ‘dev’ environment in Confluent.io, in which ‘my_kafka_cluster’ serves as our cluster.

Kafka Use Case

  • We create a new topic ‘retail_data’ which, as the name suggests, will store the retail data. Note that we specify 6 partitions for the topic so that it can scale as the data grows; the same topic can also be created programmatically, as in the sketch below.
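
A minimal sketch of that programmatic topic creation with the confluent-kafka AdminClient is shown below; the bootstrap server and API key/secret are placeholders for the values shown in the Confluent Cloud cluster settings.

    from confluent_kafka.admin import AdminClient, NewTopic

    # Placeholder Confluent Cloud connection details -- replace with your cluster's values
    admin = AdminClient({
        "bootstrap.servers": "<BOOTSTRAP_SERVER>",
        "security.protocol": "SASL_SSL",
        "sasl.mechanisms": "PLAIN",
        "sasl.username": "<API_KEY>",
        "sasl.password": "<API_SECRET>",
    })

    # Create the retail_data topic with 6 partitions, matching the UI setup
    futures = admin.create_topics([NewTopic("retail_data", num_partitions=6)])
    for topic, future in futures.items():
        future.result()  # raises if topic creation failed
        print(f"Created topic {topic}")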

  • Through the UI, we produce a new message by providing a key and the corresponding JSON value. This first message gets produced in partition 0 and offset 0.

  • After producing four such messages, we observe that they get stored in 4 different partitions, with the partition for each message decided by hashing its key.

  • If we produce a new message with the same key as an earlier one, it gets produced into the same partition, because hashing the same key always yields the same partition.

  • Note that the partition remains the same and we get a new offset of 1. Grouping messages with the same key into a single partition reduces the data shuffling required during aggregation.
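
As a rough illustration of this key-to-partition mapping (Kafka’s default partitioner actually hashes the key with murmur2; plain Python hash() is used here only to make the idea concrete):

    # Illustration only: Kafka's default partitioner uses murmur2(key) % num_partitions,
    # not Python's hash(); the point is simply that the same key always maps
    # to the same partition.
    num_partitions = 6

    def partition_for(key: str) -> int:
        return hash(key) % num_partitions

    print(partition_for("store_101"))  # same key -> same partition, every time
    print(partition_for("store_101"))
    print(partition_for("store_202"))  # a different key may land in a different partition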

Programmatic Approach

  • In order to implement a use case with Kafka programmatically, we install the ‘confluent-kafka’ package into our environment’s libraries.
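
The install itself is a single command:

    pip install confluent-kafka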

  • Consider a retail-chain use case where a store sends a transaction to a Kafka topic whenever a purchase is made.

  • We create a new topic named ‘retail-data-new’ for our use case.

  • For the producer configuration, we provide the addresses of a few brokers in ‘bootstrap.servers’; the client connects to one of these servers to bootstrap its connection to the cluster.
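
A minimal producer configuration for a Confluent Cloud cluster might look like the sketch below; every value here is a placeholder to be replaced with your cluster’s own settings.

    from confluent_kafka import Producer

    # Placeholder values taken from the Confluent Cloud cluster settings
    producer_config = {
        "bootstrap.servers": "<BROKER_1>:9092,<BROKER_2>:9092",  # client picks one to bootstrap from
        "security.protocol": "SASL_SSL",
        "sasl.mechanisms": "PLAIN",
        "sasl.username": "<API_KEY>",
        "sasl.password": "<API_SECRET>",
    }

    producer = Producer(producer_config)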

  • producer.produce(topic, key, value, callback) is an asynchronous method used to produce data to a given topic. Being asynchronous, it does not wait for the message to be delivered, so the delivery callback is not invoked right away.

  • For this reason, we use the poll() method, with which the producer processes any pending asynchronous events, including delivery callbacks.

  • Subsequently, the producer.flush() method is called to block until all outstanding messages have been delivered.
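
Putting these pieces together, a produce–poll–flush sketch (reusing the producer configured above and a hypothetical transaction record) could look like this:

    import json

    def delivery_report(err, msg):
        # Invoked from poll()/flush() once the broker acknowledges (or rejects) the message
        if err is not None:
            print(f"Delivery failed: {err}")
        else:
            print(f"Delivered to {msg.topic()} [partition {msg.partition()}] at offset {msg.offset()}")

    # Hypothetical transaction record
    record = {"store_id": "store_101", "item": "coffee", "amount": 4.99}

    producer.produce(
        topic="retail-data-new",
        key="store_101",
        value=json.dumps(record),
        callback=delivery_report,
    )

    producer.poll(0)   # serve delivery callbacks for messages acknowledged so far
    producer.flush()   # block until all outstanding messages are delivered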

  • After execution, the console states that the message is produced successfully.

  • We can also verify the newly produced message inside Kafka UI.

  • The producer can also read data from a JSON file by loading each key-value pair as follows.
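
A sketch of that file-driven loop is shown below; the file name and its layout (a JSON object mapping each key to a transaction record) are assumptions made for illustration.

    import json

    # Hypothetical file: a JSON object mapping each key to its transaction record
    with open("retail_transactions.json") as f:
        records = json.load(f)

    for key, value in records.items():
        producer.produce(
            topic="retail-data-new",
            key=key,
            value=json.dumps(value),
            callback=delivery_report,
        )
        producer.poll(0)

    producer.flush()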

  • We can verify within the Kafka UI that all messages from the file do get produced to the topic.

Kafka Consumer

  • Now, we need to read data from the Kafka topic and persist it into a storage layer.

  • In order to achieve this, we create a cluster in Databricks, wherein we install the confluent-kafka package.

  • For connecting to the Kafka cluster, we require the bootstrap servers, the API key, the API secret and the topic name.
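
In the Databricks notebook, these connection details can be kept as plain variables; all of the values below are placeholders.

    # Placeholder connection details for the Confluent Cloud cluster
    bootstrap_servers = "<BOOTSTRAP_SERVER>:9092"
    api_key = "<API_KEY>"
    api_secret = "<API_SECRET>"
    topic = "retail-data-new"

    # SASL/PLAIN credentials in the form Spark's Kafka source expects
    jaas_config = (
        "org.apache.kafka.common.security.plain.PlainLoginModule required "
        f'username="{api_key}" password="{api_secret}";'
    )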

  • Thereafter, we can read from Kafka using Spark.
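
Under those assumptions, a batch read from the topic can be sketched as:

    df = (
        spark.read.format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("kafka.security.protocol", "SASL_SSL")
        .option("kafka.sasl.mechanism", "PLAIN")
        .option("kafka.sasl.jaas.config", jaas_config)
        .option("subscribe", topic)
        .option("startingOffsets", "earliest")
        .load()
    )
    display(df)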

  • After reading the data in batch mode as above, the dataframe looks as follows.
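
Whatever the payload, Spark’s Kafka source exposes a fixed set of columns, which is what the dataframe shows:

    df.printSchema()
    # key (binary), value (binary), topic (string), partition (int),
    # offset (long), timestamp (timestamp), timestampType (int)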

  • The keys and values in the dataframe above are in binary format. In order to cast the binary data to strings, we use the following syntax.
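
One common form of that cast is selectExpr (the column aliases below are just for readability):

    decoded_df = df.selectExpr(
        "CAST(key AS STRING) AS key",
        "CAST(value AS STRING) AS value",
        "topic",
        "partition",
        "offset",
    )
    display(decoded_df)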

  • So far, we have consumed the data in batch mode. In order to persist the streaming data from Kafka source, we can use the readStream operation as follows.
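
Mirroring the batch options above, a streaming read might look like the following sketch, with maxOffsetsPerTrigger capping how much each micro-batch pulls:

    stream_df = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("kafka.security.protocol", "SASL_SSL")
        .option("kafka.sasl.mechanism", "PLAIN")
        .option("kafka.sasl.jaas.config", jaas_config)
        .option("subscribe", topic)
        .option("startingOffsets", "earliest")
        .option("maxOffsetsPerTrigger", 100)  # cap the offsets pulled per micro-batch
        .load()
        .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
    )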

  • The maxOffsetsPerTrigger option caps the number of offsets consumed per trigger, which keeps the streaming micro-batches from becoming unevenly large.

  • We can write the data into a Delta table using the writeStream operation. Note that, initially, a select on the table does not show any data.
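
A sketch of that writeStream call is shown below; the checkpoint location and target table name are placeholders.

    query = (
        stream_df.writeStream
        .format("delta")
        .outputMode("append")
        .option("checkpointLocation", "/tmp/checkpoints/retail_data_new")  # placeholder path
        .toTable("retail_transactions")  # placeholder Delta table name
    )

    # Once the first micro-batches commit, the table can be queried:
    # spark.sql("SELECT * FROM retail_transactions").display()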

  • After a few minutes, the data gets persisted and shows up when the table is queried.

Conclusion

Whether using Confluent Cloud for simplified management or implementing programmatic solutions, Kafka empowers businesses to build efficient data pipelines with minimal latency. By leveraging Kafka’s capabilities, organizations can ensure seamless data flow, optimize analytics, and enhance decision-making processes.

Thanks for reading! Stay tuned for more. 🔔
