Kafka monitoring

Some notes I jotted down while watching this presentation by Gwen Shapira from Confluent.

At a minimum: are the brokers healthy?
A good “check engine light” is the number of under-replicated partitions. This symptom can point to many different problems in the system. For example:
- Broker issues: a broker is down or has a problem (e.g. network issue, resource problem, misconfiguration)
- Systemwide issues: brokers are out of balance (e.g. one broker is the leader for too many partitions and is doing too much work)
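To make the "check engine light" idea concrete, here is a minimal sketch in plain Python. It assumes you already have a snapshot of partition metadata (leader, replicas, in-sync replicas) from somewhere, for example from an admin client or the output of `kafka-topics --describe`. The `partitions` list and the function names are made up for illustration and are not part of any Kafka API.

```python
from collections import Counter

# Hypothetical snapshot of partition metadata (illustrative values only).
partitions = [
    {"topic": "orders", "partition": 0, "leader": 1, "replicas": [1, 2, 3], "isr": [1, 2, 3]},
    {"topic": "orders", "partition": 1, "leader": 1, "replicas": [1, 2, 3], "isr": [1, 2]},
    {"topic": "clicks", "partition": 0, "leader": 1, "replicas": [1, 3, 2], "isr": [1, 3, 2]},
    {"topic": "clicks", "partition": 1, "leader": 2, "replicas": [2, 3, 1], "isr": [2]},
]

def under_replicated(parts):
    """Return partitions whose in-sync replica set is smaller than the replica set."""
    return [p for p in parts if len(p["isr"]) < len(p["replicas"])]

def leaders_per_broker(parts):
    """Count how many partitions each broker currently leads (a rough balance check)."""
    return Counter(p["leader"] for p in parts)

def missing_from_isr(parts):
    """Count, per broker, the partitions it should replicate but is not in sync for."""
    missing = Counter()
    for p in parts:
        for broker in set(p["replicas"]) - set(p["isr"]):
            missing[broker] += 1
    return missing
```

If one broker dominates `missing_from_isr`, that hints at a broker-level problem. If `leaders_per_broker` is lopsided, that hints at the out-of-balance case.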
Capacity monitoring is important. Adding brokers means moving partitions, which means network traffic, CPU usage and disk I/O. This can be transparent if done with enough headroom (usage below 70%).
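A rough sketch of a headroom check, assuming the 70% figure refers to resource utilization and that per-broker utilization is already collected as fractions of capacity. The `utilization` data, the resource names and `over_threshold` are hypothetical.

```python
HEADROOM_THRESHOLD = 0.70  # the 70% figure from the notes, read here as utilization

# Hypothetical per-broker utilization, as fractions of capacity.
utilization = {
    1: {"network": 0.55, "cpu": 0.40, "disk_io": 0.62},
    2: {"network": 0.81, "cpu": 0.45, "disk_io": 0.50},
    3: {"network": 0.30, "cpu": 0.72, "disk_io": 0.35},
}

def over_threshold(util, threshold=HEADROOM_THRESHOLD):
    """Return {broker: [resources at or above the threshold]} for brokers lacking headroom."""
    flagged = {}
    for broker, resources in util.items():
        hot = sorted(name for name, value in resources.items() if value >= threshold)
        if hot:
            flagged[broker] = hot
    return flagged
```

An empty result would suggest there is still room to absorb the extra load from moving partitions.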

This basic technique leads to metrics such as:

- Under-consumption (the consumer is missing messages)
- Over-consumption (the consumer is reading messages twice), which can lead to latency over time
- Consumers falling behind. Consumers that are far behind force Kafka to read from disk instead of memory, which can slow down the whole system
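A small sketch of how these metrics could be computed. It assumes two things that are not spelled out in these notes: that lag is measured as the log end offset minus the committed offset per partition, and that under- and over-consumption are detected by comparing produced and consumed message counts for the same time window. All names and numbers below are hypothetical.

```python
# Hypothetical offsets per partition of one topic.
log_end_offsets = {0: 1500, 1: 980, 2: 2040}    # latest offset written by producers
committed_offsets = {0: 1500, 1: 700, 2: 2010}  # last offset committed by the consumer group

def consumer_lag(end_offsets, committed):
    """Lag per partition: messages written but not yet consumed."""
    return {p: end_offsets[p] - committed.get(p, 0) for p in end_offsets}

def classify_consumption(produced, consumed):
    """Compare message counts for one time window (assumed detection method)."""
    if consumed < produced:
        return "under-consumption"
    if consumed > produced:
        return "over-consumption"
    return "ok"
```

A single window showing under-consumption may just mean the consumer is behind, so a check like this is probably more useful when it is tracked over several windows.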

- Kafka receives a request
- The request is sent to the request queue
- The request is picked up by an I/O thread, which writes or reads the data
- Wait for remote response: wait for other brokers to respond (e.g. acks)
- The response is created
- The response is sent to the response queue
- A network thread picks up the response and hands it to the OS for network dispatch
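The steps above map fairly naturally onto the per-request timing metrics that brokers expose (RequestQueueTimeMs, LocalTimeMs, RemoteTimeMs, ResponseQueueTimeMs, ResponseSendTimeMs). Here is a sketch of finding which stage dominates a slow request. The sample timings and `slowest_stage` are made up for illustration. How the values are collected (for example via JMX) is left out.

```python
# Hypothetical timings (ms) for one request, keyed by broker request metric names.
timings_ms = {
    "RequestQueueTimeMs": 2.0,    # waiting in the request queue
    "LocalTimeMs": 5.0,           # I/O thread reading or writing data
    "RemoteTimeMs": 40.0,         # waiting for other brokers (e.g. acks)
    "ResponseQueueTimeMs": 1.0,   # waiting in the response queue
    "ResponseSendTimeMs": 2.0,    # network thread sending the response
}

def slowest_stage(timings):
    """Return (stage, share of total time) for the stage that dominates the request."""
    total = sum(timings.values())
    if total == 0:
        return None  # nothing measured, so no stage can dominate
    stage = max(timings, key=timings.get)
    return stage, timings[stage] / total
```

With the sample numbers, most of the time is spent waiting on other brokers, which would point at replication rather than at local disk or the network threads.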