First of all, thank you so much for the clear and detailed explanation. I started by reading through the Warpstream logic, which is amazing, and now this solution is even better.
I do have one technical question, though. As always, there are trade-offs. Based on my understanding and the information in the article, I believe the latency for produce requests will likely be lower than Warpstream. However, in traditional Kafka, with a replication factor (RF) of 3, you can choose to get the acknowledgment (ACK) only when 1, 2, or all replicas have received the data. This means the data is in memory, and you’d need to be extremely unlucky to lose it—specifically, all 3 brokers holding the record in buffer cache would have to fail. In this case, the data will be written asynchronously to disk.
Given this scenario, my question is: are we writing to the Write-Ahead Log (WAL) in a synchronous way? If so, does that mean it would be slower than traditional Kafka?
That said, I can clearly see that this solution will be faster than Warpstream, and even with this potential downside, I think AutoMQ is a great approach overall.
Disclosure: I work for AutoMQ.
Hi Sergio. I'm glad to see your thoughts, and I'm happy to discuss further. Yes, AutoMQ writes to the WAL synchronously. We rely on the durability of EBS itself to persist data, rather than on Kafka's ISR mechanism. Because Kafka writes to the page cache, many people worry that AutoMQ's write latency will be higher than Kafka's, but that's not the case: the two have the same level of write latency, and AutoMQ's is more stable. A Kafka write involves both ISR replication and the page cache. ISR replication adds latency overhead, and page cache misses can hurt Kafka's write performance. AutoMQ uses Direct I/O to write to the raw device, bypassing the file system, which greatly improves write performance. In addition, because durability is handed over to EBS, no partition data replicas need to be copied, which further improves AutoMQ's write performance. In practice, since there is no page cache involved, AutoMQ's write latency is more stable than Kafka's. In the benchmark we ran on AWS, AutoMQ achieves single-digit-millisecond P99 write latency: https://docs.automq.com/automq/benchmarks/benchmark-automq-vs-apache-kafka#fixed-scale
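For readers following the synchronous-versus-asynchronous point in this exchange, here is a minimal Python sketch of the difference. It is not AutoMQ's code and makes several simplifying assumptions: it appends to an ordinary temporary file and uses fsync, rather than Direct I/O on a raw EBS device, and the names append_sync, append_async, write_all, and demo are hypothetical. The only point it shows is when the caller gets control back: after the data has been flushed to the device, or as soon as it sits in the OS page cache.

```python
import os
import tempfile


def write_all(fd: int, data: bytes) -> None:
    """Write every byte of data, retrying on partial writes."""
    view = memoryview(data)
    while view:
        written = os.write(fd, view)
        view = view[written:]


def append_sync(fd: int, record: bytes) -> None:
    """Return only after the OS reports the record as flushed to the device."""
    write_all(fd, record)
    os.fsync(fd)  # blocks until the data has left the page cache


def append_async(fd: int, record: bytes) -> None:
    """Return as soon as the record is in the page cache (no fsync)."""
    write_all(fd, record)  # the kernel flushes to disk at some later time


def demo() -> int:
    """Append one record each way to a throwaway log and return its size."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "wal.log")
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        try:
            append_async(fd, b"record-1\n")  # acknowledged from memory
            append_sync(fd, b"record-2\n")   # acknowledged after the flush
        finally:
            os.close(fd)
        return os.path.getsize(path)


if __name__ == "__main__":
    print(demo())
```

In this toy version the synchronous path pays for the flush on every append, which is the cost the question above is asking about. The reply's argument is that Direct I/O on EBS keeps that cost small, and that Kafka's memory-only acknowledgment has its own cost in replication round trips.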
This is really well written. I enjoyed it more than I'm willing to admit.
Thank you again for such fantastic content.
Great article as always. I was wondering if you also have any insight into how large the performance improvement from this was, in numbers.
Thank you for reaching out. Here is the benchmark from AutoMQ: https://docs.automq.com/automq/benchmarks/benchmark-automq-vs-apache-kafka
wow, went crazy on efficiency and cost.