Kafka Monitoring
Some notes I jotted down while watching this presentation by Gwen Shapira from Confluent.
Kafka Metrics
Cluster Health
- At a minimum: are the brokers healthy?
- A good “check engine light” is the number of under-replicated partitions. A non-zero count can point to many different problems in the system. For example:
  - Broker issues: a broker is down or has a problem (e.g. network issue, resource problem, misconfiguration)
  - Systemwide issues: brokers are out of balance (e.g. one broker is the leader for too many partitions and is doing too much work)
- Capacity monitoring is important. Adding brokers means moving partitions, which means network traffic, CPU usage and disk I/O. The move can be transparent if it is done while there is still enough headroom (utilization below roughly 70%)
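To make the “check engine light” idea concrete, here is a minimal sketch of the three checks above: under-replicated partitions, leader imbalance and capacity headroom. It assumes you already have a snapshot of partition state from somewhere; the `PartitionState` class, its fields and the function names are hypothetical and not part of any Kafka client API. The 0.70 default just mirrors the 70% figure from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionState:
    """Hypothetical snapshot of one partition's replication state."""
    topic: str
    partition: int
    leader: int       # broker id of the current leader
    replicas: tuple   # broker ids assigned to hold a copy
    in_sync: tuple    # broker ids currently in sync with the leader


def under_replicated(partitions):
    """Return partitions with fewer in-sync replicas than assigned replicas."""
    return [p for p in partitions if len(p.in_sync) < len(p.replicas)]


def leader_counts(partitions):
    """Count how many partitions each broker leads, to spot imbalance."""
    counts = {}
    for p in partitions:
        counts[p.leader] = counts.get(p.leader, 0) + 1
    return counts


def has_headroom(utilization, threshold=0.70):
    """True if a resource (disk, network, CPU) is below the assumed threshold."""
    return utilization < threshold


snapshot = [
    PartitionState("orders", 0, leader=1, replicas=(1, 2, 3), in_sync=(1, 2, 3)),
    PartitionState("orders", 1, leader=1, replicas=(1, 2, 3), in_sync=(1, 3)),
    PartitionState("clicks", 0, leader=1, replicas=(1, 3), in_sync=(1, 3)),
    PartitionState("clicks", 1, leader=2, replicas=(2, 3), in_sync=(2,)),
]

for p in under_replicated(snapshot):
    missing = sorted(set(p.replicas) - set(p.in_sync))
    print(f"{p.topic}-{p.partition} is under-replicated, missing brokers {missing}")

print("leaders per broker:", leader_counts(snapshot))
```

In this made-up snapshot two partitions are under-replicated, and broker 1 leads three of the four partitions, which is the kind of imbalance the second bullet describes.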
The Basic Monitoring Technique
- Count messages produced
- Count messages consumed
- Check timestamps
This basic technique leads to metrics such as:
- under-consumption (the consumer is missing messages)
- over-consumption (the consumer reads messages more than once), which can add latency over time
- consumers falling behind: consumers that lag far behind force Kafka to read from disk instead of from memory, which can slow down the whole system
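The count-and-compare technique above can be sketched roughly as follows. This is an illustration under assumptions, not how any particular monitoring tool does it: messages are bucketed by their produce timestamp into fixed windows, and the per-window produced and consumed counts are compared. `WINDOW_MS` and all function names are hypothetical.

```python
from collections import Counter

WINDOW_MS = 60_000  # assumed bucket size: one minute


def window_of(timestamp_ms):
    """Start of the time window that a message timestamp falls into."""
    return timestamp_ms - timestamp_ms % WINDOW_MS


def count_by_window(timestamps_ms):
    """Count messages per window, keyed by window start."""
    return Counter(window_of(ts) for ts in timestamps_ms)


def compare(produced, consumed):
    """Label each window by comparing produced and consumed counts."""
    report = {}
    for window in sorted(set(produced) | set(consumed)):
        p, c = produced[window], consumed[window]  # a Counter yields 0 if absent
        if c < p:
            report[window] = "under-consumption"
        elif c > p:
            report[window] = "over-consumption"
        else:
            report[window] = "ok"
    return report


def consumer_lag(log_end_offset, committed_offset):
    """How many messages the consumer is behind the end of the log."""
    return max(log_end_offset - committed_offset, 0)


# Produce timestamps seen by the producer, and the produce timestamps of the
# messages the consumer actually processed (made-up numbers).
produced = count_by_window([0, 10_000, 59_999, 60_000, 61_000, 120_000])
consumed = count_by_window([0, 10_000, 60_000, 60_000, 61_000, 120_000])

for window, status in compare(produced, consumed).items():
    print(window, status)
print("lag:", consumer_lag(log_end_offset=1500, committed_offset=1200))
```

A window is only worth judging once it is old enough that the consumer should have caught up; a recent window that shows under-consumption may just mean the consumer is still reading.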
Kafka Request Lifecycle
- Kafka receives the request
- The request is placed on the request queue
- An I/O thread picks up the request and does the actual reading or writing of data
- Wait for the remote response, i.e. for other brokers to respond (e.g. with acks)
- The response is created
- The response is placed on the response queue
- A network thread picks up the response and hands it to the OS for network dispatch
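The lifecycle is useful for monitoring because each stage can be timed separately, so a slow request can be traced to the stage where it spent its time. The sketch below assumes you have per-stage timings in milliseconds for one request type. The stage names follow the per-request timing metrics brokers expose over JMX as I understand them; check the names against your broker version. The helper functions and the sample numbers are made up.

```python
# Stage names in lifecycle order (assumed to match the broker's request metrics).
STAGES = (
    "RequestQueueTimeMs",   # waiting in the request queue for an I/O thread
    "LocalTimeMs",          # I/O thread reading or writing data locally
    "RemoteTimeMs",         # waiting for other brokers to respond (e.g. acks)
    "ResponseQueueTimeMs",  # waiting in the response queue for a network thread
    "ResponseSendTimeMs",   # network thread handing the response to the OS
)


def total_time_ms(timings):
    """Sum the time spent across all lifecycle stages."""
    return sum(timings.get(stage, 0.0) for stage in STAGES)


def slowest_stage(timings):
    """Return the stage where the request spent the most time."""
    return max(STAGES, key=lambda stage: timings.get(stage, 0.0))


def breakdown(timings):
    """Share of total time per stage, as fractions between 0 and 1."""
    total = total_time_ms(timings)
    if total == 0:
        return {stage: 0.0 for stage in STAGES}
    return {stage: timings.get(stage, 0.0) / total for stage in STAGES}


# Made-up timings for a produce request.
produce_timings = {
    "RequestQueueTimeMs": 0.5,
    "LocalTimeMs": 1.25,
    "RemoteTimeMs": 18.0,
    "ResponseQueueTimeMs": 0.25,
    "ResponseSendTimeMs": 0.5,
}

print("total ms:", total_time_ms(produce_timings))
print("slowest stage:", slowest_stage(produce_timings))
```

In this example most of the time goes to waiting on other brokers, which would point at replication rather than at the local disk or the network threads.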