rethinking data streams
This week I heard two separate conference talks mention the same concept. It got me thinking, and I wanted to capture it.
First, at the AWS Chicago Summit, John Bennett from Netflix mentioned almost as an afterthought that updates to a database over time are really just an event stream. I can't recall the context, but that thought really hit me over the head as something I already knew but just didn't realize. (The deepest insights are usually of this type, like a curtain being drawn back to reveal something that was there the whole time.)
Then, today I was watching a talk given by Jay Kreps, one of the founders of Confluent and co-creators of Apache Kafka, and he mentioned a similar idea. He went much deeper into the concept of a database table as an event stream. The gist I took away was that a table is a snapshot of data that changes over time as CRUD operations are performed. So, conceptually, the database table as viewed by a user can be understood to be the latest in a sequence of tables.
Taking it further, the sequence of tables can be snapshots of the entire table over time, or, much more simply, it can be just the sequence of "events" that changed the state of the table. With knowledge of this sequence, the entire table could be "replayed" and thus recreated. This could be used as a clever mode of database backup.
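To make the replay idea concrete, here's a minimal sketch. The event names and shapes are illustrative, not any specific change-data-capture format:

```python
# Table/stream duality: a table's current state can be reconstructed
# by replaying the ordered stream of change events from the beginning.
events = [
    {"op": "insert", "key": 1, "value": {"name": "Ada"}},
    {"op": "insert", "key": 2, "value": {"name": "Grace"}},
    {"op": "update", "key": 1, "value": {"name": "Ada Lovelace"}},
    {"op": "delete", "key": 2},
]

def replay(events):
    """Fold the event stream into the latest table snapshot."""
    table = {}
    for e in events:
        if e["op"] in ("insert", "update"):
            table[e["key"]] = e["value"]
        elif e["op"] == "delete":
            table.pop(e["key"], None)
    return table

print(replay(events))  # {1: {'name': 'Ada Lovelace'}}
```

The table here is just a fold over the log; truncating the event list at any point gives you the table as it existed at that moment, which is exactly what makes the log useful as a backup.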
Apparently this is known as the "table/stream duality". And this type of thinking is at the heart of log-based "streaming platforms" like Kafka and Kinesis.
As we move away from the world of "big data" and its batch and micro-batch processing, and into the world of real-time data streaming with real-time processing, it makes sense to think this way.
So... where else could this event stream logic be applied?
- git
- application/server logs
- network activity log
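Each of these fits the same pattern: an append-only log of events from which the current state can be derived. A minimal sketch for the server-log case, using a hypothetical log format, could look like this:

```python
# Treat server log lines as an event stream: replaying connect/disconnect
# events reconstructs the current set of active sessions.
# The log format here is invented purely for illustration.
log_lines = [
    "2024-01-01T00:00:01 CONNECT user=alice",
    "2024-01-01T00:00:02 CONNECT user=bob",
    "2024-01-01T00:00:05 DISCONNECT user=alice",
]

active = set()
for line in log_lines:
    _, event, field = line.split()
    user = field.split("=", 1)[1]
    if event == "CONNECT":
        active.add(user)
    elif event == "DISCONNECT":
        active.discard(user)

print(sorted(active))  # ['bob']
```

Git arguably works the same way already: the commit history is an append-only log, and checking out any commit "replays" the stream up to that point.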