Apache Kafka is designed for resilience, but in a production environment, performance degradation and failure are inevitable without robust monitoring and a solid error-handling strategy. This guide moves past the basics, focusing on the practical tools and metrics required to keep your Kafka cluster healthy and your data streams flowing reliably.
## Essential Kafka Monitoring Metrics
Monitoring Kafka effectively requires tracking metrics across three major components: the Broker, the Topic, and the Client (Producers and Consumers). These metrics are typically exposed via JMX and collected using tools like Prometheus/Grafana.
| Component | Metric | Practical Meaning |
|---|---|---|
| Consumer | `records-lag` (or `records-lag-max`) | The number of messages the consumer group is behind the producer. Critical for SLAs. |
| Consumer | `fetch-rate` | How quickly the consumer is pulling data from the broker (should track `incoming-byte-rate`). |
| Producer | `request-rate` | The rate at which the producer sends requests to the broker. Correlates strongly with application throughput. |
| Broker | `request-handler-avg-idle-percent` | Percentage of time the broker's request handler threads are idle. A low value means broker overload. |
| Broker | `UnderReplicatedPartitions` | Number of partitions below their full replication factor. A critical sign of failure. |
| Topic | `MessagesInPerSec` | Total message throughput of the topic. Key for capacity planning and detecting traffic spikes. |
Practical Tip: The single most important metric is Consumer Lag. High lag means consumers cannot process data as fast as producers write it; you can spot-check it from the command line as shown below.
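For a quick spot check without a full metrics stack, the stock `kafka-consumer-groups.sh` script reports per-partition lag (the group name here is illustrative):

```bash
# The LAG column is LOG-END-OFFSET minus CURRENT-OFFSET for each partition
./bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group my-consumer-group
```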
## Management via the Kafka Admin Client
The Admin Client is the programmatic tool for managing and inspecting cluster state, replacing many of the legacy CLI scripts in production applications. It provides real-time cluster metadata without requiring direct access to the broker filesystem.
Listing Topics Programmatically
Instead of using `kafka-topics.sh --list`, a Java (or equivalent language) application uses the Admin Client:
```java
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;

Properties props = new Properties();
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

try (Admin admin = Admin.create(props)) {
    System.out.println("Topics: " + admin.listTopics().names().get()); // blocks for the result
}
```
### Increasing Partition Count
To horizontally scale a busy topic, you can increase its partition count using the Admin Client (the equivalent of `kafka-topics.sh --alter`):
```java
import java.util.Map;
import org.apache.kafka.clients.admin.NewPartitions;

// Define the new total partition count (must exceed the current count)
int newPartitionCount = 8;
Map<String, NewPartitions> assignments =
        Map.of("busy-topic", NewPartitions.increaseTo(newPartitionCount));

// Execute the partition increase, reusing the Admin instance from above
admin.createPartitions(assignments).all().get();
```
Note: The partition count can only be increased, never decreased. Adding partitions also changes the key-to-partition mapping, so per-key ordering is only guaranteed for records produced after the change. Plan partition counts carefully.
## Error Handling and Troubleshooting
Effective error handling is paramount. Errors can manifest at the client (application) level or at the broker/cluster level.
### Client-Side Error Handling (Producer & Consumer)
Producer Failures (a handling sketch follows this list):

- `NotEnoughReplicasException`: Occurs when the acknowledgment setting (`acks=all`) cannot be satisfied because too few in-sync replicas are available, typically after a broker failure.
  - Solution: Check broker health (the `UnderReplicatedPartitions` metric). Wait for the brokers to recover or investigate network issues.
- `RecordTooLargeException`: The message size exceeds the configured limit (`max.request.size` on the producer, `message.max.bytes` on the broker).
  - Solution: Increase the broker and producer configuration limits, or implement client-side message splitting.
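One practical way to surface both failure modes is a send callback that distinguishes retriable from fatal exceptions. A minimal sketch; the topic name, key, and value are illustrative:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.RetriableException;
import org.apache.kafka.common.serialization.StringSerializer;

Properties producerProps = new Properties();
producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
producerProps.put(ProducerConfig.ACKS_CONFIG, "all"); // require acknowledgment from all in-sync replicas

try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
    producer.send(new ProducerRecord<>("busy-topic", "order-42", "payload"), (metadata, exception) -> {
        if (exception == null) {
            return; // delivered successfully
        }
        if (exception instanceof RetriableException) {
            // NotEnoughReplicasException lands here; the client retries automatically
            // until delivery.timeout.ms expires, so log and alert on persistent failures
            System.err.println("Transient send failure: " + exception);
        } else {
            // RecordTooLargeException lands here; retrying will never help
            System.err.println("Fatal send failure: " + exception);
        }
    });
}
```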
Consumer Failures (an Admin Client check follows this list):

- Lag spike: A sudden jump in `records-lag-max`.
  - Solution: Check the consumer processing logic. Is there a blocking call (e.g., waiting on a slow database)? Scale out the consumer group by adding more instances (up to the topic's partition count).
- `UnknownTopicOrPartitionException`: The topic or partition doesn't exist, likely due to a typo or a recent deletion.
  - Solution: Verify topic names and check the cluster state using the Admin Client.
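The Admin Client makes that last check straightforward. A sketch assuming a 3.1+ client (where `allTopicNames()` is available); the topic name "orders" is illustrative:

```java
import java.util.List;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.errors.UnknownTopicOrPartitionException;

// props as configured in the topic-listing example above
try (Admin admin = Admin.create(props)) {
    try {
        TopicDescription desc = admin.describeTopics(List.of("orders"))
                .allTopicNames().get().get("orders");
        System.out.println("orders has " + desc.partitions().size() + " partitions");
    } catch (ExecutionException e) {
        if (e.getCause() instanceof UnknownTopicOrPartitionException) {
            System.err.println("Topic 'orders' not found: check for typos or recent deletion");
        } else {
            throw e;
        }
    }
}
```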
### Broker and Cluster Troubleshooting
1. Broker Crash Loop:
   - Problem: A broker fails to start repeatedly.
   - Troubleshooting: Check the broker logs (typically `server.log`) immediately. Look for `OutOfMemoryError` (OOM) or configuration errors (e.g., an incorrect `listeners` setting or ZooKeeper connection string).
   - Solution (OOM): Increase the JVM heap size allocated to the broker (via `KAFKA_HEAP_OPTS`), as sketched below.
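The heap is set through an environment variable read by the startup script; the 6 GB figure here is purely illustrative and should match your host's available memory:

```bash
# Give the broker a fixed 6 GB heap (equal -Xms/-Xmx avoids resize pauses)
export KAFKA_HEAP_OPTS="-Xms6g -Xmx6g"
./bin/kafka-server-start.sh config/server.properties
```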
2. Network Latency:
   - Problem: High P99 (99th percentile) request latency.
   - Troubleshooting: Use the `kafka-configs.sh` script to check broker-level network settings and quotas. Run a simple network ping/traceroute between the clients and the broker.
   - Example (describe a topic's configuration overrides):

```bash
./bin/kafka-configs.sh --bootstrap-server localhost:9092 --describe \
  --entity-type topics --entity-name my-topic
```
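Since quotas are attached to client and user entities rather than topics, checking for throttling needs a separate describe. A sketch; adjust the entity flags to your setup:

```bash
# Show default client-level quotas (producer/consumer byte-rate throttles)
./bin/kafka-configs.sh --bootstrap-server localhost:9092 --describe \
  --entity-type clients --entity-default
```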
3. Disk Issues:
   - Problem: The broker runs out of disk space, or disk I/O is saturated.
   - Troubleshooting: Monitor disk usage and disk write throughput on the broker host.
   - Solution: Tighten the retention policy (broker-wide via `log.retention.hours` / `log.retention.bytes`, or per topic via `retention.ms` / `retention.bytes` with the `kafka-configs.sh` tool) to reduce disk consumption, as shown below.
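A per-topic override is usually the least disruptive lever. A sketch; the topic name and the ~50 GB cap are illustrative, and note that `retention.bytes` applies per partition:

```bash
# Cap each partition of my-topic at roughly 50 GB
./bin/kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name my-topic \
  --add-config retention.bytes=53687091200
```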
## Kafka REST Proxy
While the Admin Client is essential for programmatic Java/Scala management, the Confluent Kafka REST Proxy provides a simple, HTTP-based interface for interacting with the cluster. This is invaluable for non-JVM applications (like Node.js or Python microservices) that need to perform simple management tasks or produce/consume messages without dealing with complex Kafka protocol libraries.
Example: Producing a message via REST Proxy (using curl)
```bash
# Assuming the REST Proxy is running on port 8082
curl -X POST -H "Content-Type: application/vnd.kafka.json.v2+json" \
  --data '{"records": [{"value": "Hello from REST!"}]}' \
  "http://localhost:8082/topics/rest-topic"
```
This simple HTTP POST replaces the need for a full Producer client setup, making diagnostics and quick integration tests extremely straightforward.
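Consuming over HTTP follows the same v2 API pattern: create a consumer instance, subscribe it, then poll for records. A sketch with illustrative group and instance names:

```bash
# 1. Create a consumer instance in group "rest-group"
curl -X POST -H "Content-Type: application/vnd.kafka.v2+json" \
  --data '{"name": "rest-instance", "format": "json", "auto.offset.reset": "earliest"}' \
  "http://localhost:8082/consumers/rest-group"

# 2. Subscribe the instance to the topic
curl -X POST -H "Content-Type: application/vnd.kafka.v2+json" \
  --data '{"topics": ["rest-topic"]}' \
  "http://localhost:8082/consumers/rest-group/instances/rest-instance/subscription"

# 3. Poll for records
curl -X GET -H "Accept: application/vnd.kafka.json.v2+json" \
  "http://localhost:8082/consumers/rest-group/instances/rest-instance/records"
```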
Mastering these monitoring techniques, Admin Client operations, and error-handling strategies will enable you to confidently run and scale Apache Kafka for your most demanding applications.
Try it at home!
