Log processing has become a critical component of the data pipeline for consumer internet companies, which generate large volumes of log data.
This data typically includes:
- User activity events such as logins, page views, clicks, “likes”, shares, comments, and search queries.
- System metrics logs such as CPU utilization, memory utilization, network usage, disk usage, and so on.
- Ad targeting and reporting data.
- Security logs generated by applications, such as records of unauthorized activity and abusive behavior.
- Newsfeed events from social media, such as new posts, status updates, and friends' activity.
To meet these demands, Apache Kafka was introduced: a distributed messaging system designed to collect and deliver high volumes of log data with low latency.
What is Apache Kafka?
Apache Kafka is a distributed, open-source event streaming platform written in Scala and Java. It is used by many companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications. Kafka aims to provide a unified, high-throughput, low-latency platform for handling real-time data. Kafka can be connected to external systems for data import/export using Kafka Connect, and it provides Kafka Streams, a Java stream processing library.
Apache Kafka organizes messages as a partitioned write-ahead commit log on persistent storage and provides a pull-based messaging abstraction that allows both real-time subscribers (such as online services) and offline subscribers (such as Hadoop and data warehouses) to read these messages at their own pace.
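The partitioned-log model and pull-based consumption can be illustrated with a toy sketch. This is plain Python, not Kafka's actual implementation: the class names, hash-based partitioning, and offsets here are simplifications for illustration only.

```python
# Toy model of a partitioned, append-only commit log with pull-based
# consumers that track their own read offsets (NOT Kafka's real code).

class PartitionedLog:
    def __init__(self, num_partitions):
        # Each partition is an append-only list of messages (the "log").
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, message):
        # Messages with the same key always land in the same partition.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(message)
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def read(self, partition, offset, max_messages=10):
        # Pull-based read: the consumer chooses where and how much to read.
        return self.partitions[partition][offset:offset + max_messages]

log = PartitionedLog(num_partitions=2)
for i in range(5):
    log.append(key="user-42", message=f"event-{i}")

# Two independent subscribers: a "real-time" consumer that has caught
# up, and an "offline" consumer (e.g. a batch job) still at offset 0.
p = hash("user-42") % 2
print(log.read(p, 5))                   # caught-up consumer sees nothing new
print(log.read(p, 0, max_messages=3))   # offline consumer replays from start
```

Because offsets belong to consumers rather than to the log, a slow batch subscriber never delays a fast real-time one, which is the key property the paragraph above describes.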
Apache Kafka Advantages
The following are some of the advantages of Apache Kafka.
- Apache Kafka can handle high-velocity, high-volume data on modest hardware, supporting throughput of thousands of messages per second.
- Apache Kafka handles these messages with very low latency, in the range of milliseconds.
- Data and messages are stored on disk and replicated across the cluster, which makes Kafka durable and prevents message loss.
- Apache Kafka tolerates node/machine failures within a cluster.
- Kafka can scale online: additional nodes can be added without downtime.
- Apache Kafka's distributed architecture makes it scalable through capabilities such as replication and partitioning.
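The durability and fault-tolerance points above rest on replication. The following is a toy sketch of the idea, not Kafka's real replication protocol (which involves in-sync replica sets and controller-driven leader election): every write goes to a leader replica and is copied to followers, so a broker failure loses no messages.

```python
# Toy sketch of partition replication and leader failover
# (a simplification, NOT Kafka's actual replication protocol).

class ReplicatedPartition:
    def __init__(self, replicas=3):
        # Replica 0 starts as leader; the rest are followers.
        self.logs = [[] for _ in range(replicas)]
        self.leader = 0
        self.alive = [True] * replicas

    def append(self, message):
        # Write to the leader and replicate to every live follower.
        for i, log in enumerate(self.logs):
            if self.alive[i]:
                log.append(message)

    def fail_leader(self):
        # Simulate a broker crash, then promote a live replica to leader.
        self.alive[self.leader] = False
        self.leader = next(i for i, ok in enumerate(self.alive) if ok)

    def read_all(self):
        return list(self.logs[self.leader])

part = ReplicatedPartition(replicas=3)
part.append("order-created")
part.append("order-paid")
part.fail_leader()            # the broker holding the leader replica dies
part.append("order-shipped")  # writes keep flowing to the new leader
print(part.read_all())        # all three messages survive the failure
```

Because every message already exists on the followers when the leader dies, the newly elected leader can serve the full log, which is why node failures do not cause message loss.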
Apache Kafka History
Let us see a year-by-year evolution of Apache Kafka.
2010: Apache Kafka was developed at LinkedIn and released as an open-source project on GitHub.
2011: It entered the Apache Incubator at the Apache Software Foundation.
2012: Apache Kafka graduated from the incubator and became a top-level Apache project.
Apache Kafka Core Capabilities
Apache Kafka provides the following core capabilities.
1. High Throughput
Using a cluster of machines, Apache Kafka delivers messages at network-limited throughput, with latencies as low as 2 ms.
2. Scalable
Apache Kafka production clusters can scale up to a thousand brokers, processing trillions of messages per day, petabytes of data, and hundreds of thousands of partitions. Storage and processing can be elastically expanded and contracted.
3. Permanent Storage
Kafka stores streams of data safely in a distributed, durable, fault-tolerant cluster.
4. High Availability
Kafka provides high availability by stretching clusters efficiently over availability zones or connecting separate clusters across geographic regions.
5. Built-In Stream Processing
Kafka can process streams of events with joins, aggregations, filters, transformations, and more, using event-time and exactly-once processing.
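One of the aggregations mentioned above, a windowed count keyed by event type, can be sketched in plain Python. This illustrates the concept only; it is not the Kafka Streams API, and the window size and event names are made up for the example.

```python
# Toy sketch of a tumbling event-time windowed aggregation, the kind
# of operation Kafka Streams performs (this is NOT the Streams API).
from collections import defaultdict

def count_by_key_in_windows(events, window_ms):
    """Group (timestamp_ms, key) events into tumbling event-time
    windows and count occurrences of each key per window."""
    counts = defaultdict(int)
    for ts, key in events:
        # Each event falls into the window that contains its timestamp.
        window_start = (ts // window_ms) * window_ms
        counts[(window_start, key)] += 1
    return dict(counts)

events = [
    (1000, "page_view"), (1500, "click"),
    (1900, "page_view"), (2500, "page_view"),
]
# 1-second tumbling windows: [1000, 2000) and [2000, 3000)
print(count_by_key_in_windows(events, window_ms=1000))
```

Note that the grouping uses the timestamp carried by each event (event time), not the moment the event is processed, which is what lets a stream processor produce correct counts even when events arrive late or out of order.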
6. Connect to Many Data Sources
Kafka provides the Kafka Connect interface, which integrates with hundreds of event sources and event sinks, including Postgres, JMS, Elasticsearch, AWS S3, and many more.
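As an illustration, a connector is typically set up by submitting a small JSON configuration to the Connect REST API. The example below uses the `FileStreamSource` connector that ships with Kafka to stream lines from a file into a topic; the file path and topic name are placeholders, not values from this article.

```json
{
  "name": "local-file-source",
  "config": {
    "connector.class": "FileStreamSource",
    "tasks.max": "1",
    "file": "/tmp/input.txt",
    "topic": "file-lines"
  }
}
```

Sink connectors are configured the same way, with a `topics` setting naming which topics to drain into the external system.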
7. Rich Set of Libraries
Kafka provides a rich set of client libraries for reading, writing, and processing streams of events in multiple programming languages.
When to use Apache Kafka?
Apache Kafka is a good fit in the following situations.
- If we need a highly distributed messaging system.
- If we need a messaging system that can scale out massively.
- If we need high throughput on publishing and subscribing.
- If we need fault-tolerant operation.
- If we need durability in message delivery.