Last week’s Systems Distributed ’24 was personally the most valuable conference I’ve attended!
The highlight of the week was meeting people in person whom I’ve engaged with for months on Twitter! But the conference didn’t just provide a venue for people I sorta kinda knew to gather; it was also filled with very high-quality speakers and intellectually stimulating talks. I left with many lessons, new mental models, and too many technical curiosities to explore. Here are a few.
Systems Thinking and Engineering Culture
The largest recurring theme in Systems Distributed was the importance of systems thinking. While many great talks highlighted the depth of this concept, three stood out to me.
Systems Engineering is designing & using abstractions that help make a problem computationally tractable (subject to reasoning & analysis).
Amod Malviya, in his talk, distinguished between programming and engineering: a programmer is someone primarily hired to write and maintain code for specific tasks or features, while an engineer considers the broader interconnections and interactions within the entire system.
What caught my attention was Malviya's emphasis on the commoditization of programming, which is occurring from various angles. One factor is the industry demand for programmers. Many Fortune 2000 companies simply need professionals who can work on internal tooling and help automate processes. This demand has largely fueled the rise of coding bootcamps, which often succeed in helping individuals secure six-figure salaries in the USA but fall short of producing professionals who understand the system beneath those higher levels of abstraction: the now harder-to-find engineers.
Another factor is the emergence of Large Language Models (LLMs) capable of generating substantial amounts of code with minimal human input, further reducing the need for traditional programming skills.
As the value of programming decreases, the value of systems thinking to create scalable and reliable systems increases. These harder-to-find systems skills offer significant opportunities for those who possess them: individuals who can understand why a company like Wells Fargo experienced technical glitches that caused some customers’ deposit transactions to disappear from their accounts, and who can prevent such issues, become highly valuable in the industry.
Richard Feldman, in his talk, challenged the idea of blindly following best practices by examining the costs of parallel computing across multiple cores and across multiple machines. He demonstrated that in many scenarios, the overhead of setting up parallelism is so costly that the best single-threaded implementation outperforms the parallel ones. Of course, if data doesn't fit on a single node, there's no other option, but the message was clear: go deep, test for yourself, and determine what makes sense for your specific system requirements rather than automatically "always going parallel." This point again emphasized the difference between a programmer who follows best practices and stays at the level of abstraction they know, versus an engineer with a systems mindset who questions everything and plunges down layers of abstraction to reason about and analyze the reality of the system.
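Feldman's actual benchmarks aren't reproduced here, but the "measure before parallelizing" point is easy to see in a toy Python sketch: for modest inputs, the setup cost of splitting work across a thread pool often erases any gain over a plain loop (and, in CPython, the GIL makes CPU-bound threading even less helpful, which only strengthens the "test for yourself" message).

```python
import time
from concurrent.futures import ThreadPoolExecutor

def sum_single(xs):
    """Best single-threaded version: a plain accumulation loop."""
    total = 0
    for x in xs:
        total += x
    return total

def sum_parallel(xs, workers=4):
    """Parallel version: split into chunks, sum each chunk in a pool."""
    chunk = len(xs) // workers
    chunks = [xs[i * chunk:(i + 1) * chunk] for i in range(workers - 1)]
    chunks.append(xs[(workers - 1) * chunk:])  # last chunk takes the remainder
    with ThreadPoolExecutor(workers) as pool:
        return sum(pool.map(sum_single, chunks))

xs = list(range(100_000))
t0 = time.perf_counter()
single = sum_single(xs)
t1 = time.perf_counter()
parallel = sum_parallel(xs)
t2 = time.perf_counter()
assert single == parallel  # same answer either way
print(f"single: {t1 - t0:.4f}s, parallel (incl. pool setup): {t2 - t1:.4f}s")
```

Run it and compare the two timings on your own machine rather than trusting a rule of thumb; that is exactly Feldman's point.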
Lastly, Andrew Kelley playfully walked through his experience building a music player in Zig. He demonstrated how even a music player can be approached with the depth of systems thinking, showcasing how intellectually stimulating and rewarding this mindset can be.
The Rise of New Software Abstractions
Many speakers focused not just on the implementation of software, but also on creating new software abstractions to improve developer experience.
The app development ecosystem accretes complexity year over year as a result of decades of bad design patterns and attempts to solve problems on the wrong layers of the stack.
Dominik Tornow introduced the conceptual thinking behind a new durable executions framework, aiming to provide an abstraction layer between your application and the platform it runs on and interacts with, allowing your application to pretend that certain distributed systems problems don’t exist. With durable executions, not only is a large class of error handling for distributed systems errors (e.g., process crashes, network interruptions, and other temporary platform-level disruptions) reduced down to simple sequential logic, but observability is also enhanced by providing a detailed view of the execution flow over time, including state transitions, retries, and recovery.
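This is not Tornow's framework's actual API, but the core idea can be sketched in a few lines: journal each completed step's result so that, after a crash or retry, completed steps are replayed from the journal instead of re-executed, and the business logic stays plain sequential code.

```python
# Toy sketch of the durable-execution idea. All names here are illustrative.
class DurableRun:
    def __init__(self, journal=None):
        # In a real system the journal is persisted; here it's an in-memory dict.
        self.journal = journal if journal is not None else {}

    def step(self, name, fn):
        if name in self.journal:        # completed in a prior attempt:
            return self.journal[name]   # replay the recorded result, skip the side effect
        result = fn()                   # may raise (crash, network blip, ...)
        self.journal[name] = result     # record before moving on
        return result

def workflow(run, charge):
    """Reads as plain sequential logic despite retries underneath."""
    order = run.step("create_order", lambda: {"id": 1})
    payment = run.step("charge_card", charge)
    return order, payment

# Demo: the first attempt fails mid-workflow, the retry resumes where it left off.
journal = {}
attempts = []
def flaky_charge():
    attempts.append(1)
    if len(attempts) == 1:
        raise RuntimeError("network blip")
    return "paid"

try:
    workflow(DurableRun(journal), flaky_charge)   # fails at charge_card
except RuntimeError:
    pass

order, payment = workflow(DurableRun(journal), flaky_charge)  # resumes; succeeds
assert payment == "paid"
assert journal["create_order"] == {"id": 1}  # create_order ran exactly once
```

The key design choice is that recovery state lives in the journal, not in hand-written compensation logic scattered through the application.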
James Cowling and Sujay Jayakar were brave enough to challenge the status quo by questioning SQL's dominance as a query interface with their work on Convex. Cowling and Jayakar identify several weaknesses in SQL’s abstraction, including its limited type system, complexities in write operations (e.g., `SELECT FOR UPDATE`), and unpredictable performance in transactional scenarios due to query planner decisions. Their most intriguing solution to me addressed the problem with unexpected query planner decisions by making index usage explicit, thus ensuring predictable performance by allowing developers to clearly see which parts of their queries are utilizing indexes for efficient lookups. Effectively, this approach moves some of a traditional query planner’s responsibilities (deciding which indexes to use for a query) up an abstraction level.
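Convex's real API is TypeScript and differs from this, but a hypothetical Python-flavored sketch shows what "explicit index usage" buys you: the caller names the index, so there is no planner left to surprise you; if you name an index that doesn't exist, you find out at development time instead of with a slow query in production.

```python
# Hypothetical sketch, not Convex's actual API: queries must name an index.
class Table:
    def __init__(self, rows, indexes):
        self.rows = rows
        self.indexes = indexes  # index name -> indexed field

    def with_index(self, name, value):
        # KeyError here means "no such index": the query can't silently
        # fall back to a full scan the way an implicit planner might.
        field = self.indexes[name]
        return [r for r in self.rows if r[field] == value]

users = Table(
    rows=[{"name": "ada", "org": "a"}, {"name": "bob", "org": "b"}],
    indexes={"by_org": "org"},
)
assert users.with_index("by_org", "a") == [{"name": "ada", "org": "a"}]
```

(The lookup itself is a linear scan here for brevity; the point is the interface, where index choice is visible in the query, not the storage engine underneath.)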
I found it refreshing to see developer experience treated not as a marketing slogan but as concrete technological innovation.
Ensuring Safe and Correct Software
These presentations collectively emphasized a crucial point: while it's becoming easier to build distributed systems, proving their correctness (a.k.a. testing) remains a significant challenge.
Brian Lagoda and John Murray, from Antithesis, showcased their reimagined testing approach for distributed systems using a deterministic hypervisor. This deterministic property allows non-deterministic systems to run on top of the hypervisor for simulation testing purposes. If the simulations uncover a bug, the determinism enables replay of the exact execution path that led to that bug, effectively making heisenbugs easy to reproduce and fix.
While the engineering feat of building a deterministic hypervisor is impressive, the more interesting aspect of the product is their intelligent exploration. Instead of relying on complete randomness, their approach searches the state space by focusing on paths most likely to lead to bugs.
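Antithesis works at the whole-VM level and uses guided exploration rather than brute force, so the following toy Python sketch only illustrates the underlying principle: if all nondeterminism flows from a seeded PRNG, then any seed that triggers a bug replays the exact execution path, turning a heisenbug into a reliably reproducible one.

```python
import random

def run(seed: int):
    """A toy 'system' whose only source of nondeterminism is a seeded PRNG."""
    rng = random.Random(seed)
    balance, trace = 100, []
    for _ in range(20):
        amount = rng.randint(1, 60)
        if rng.random() < 0.5:
            balance += amount
            trace.append(("deposit", amount))
        else:
            balance -= amount            # bug: no overdraft check
            trace.append(("withdraw", amount))
    return balance, trace

# "Fuzz" the seed space until the invariant (balance never goes negative) breaks.
failing_seed = next(s for s in range(1000) if run(s)[0] < 0)

# Determinism makes the failure reproducible: the same seed yields the
# identical trace every time, so the bug can be replayed and debugged at will.
assert run(failing_seed) == run(failing_seed)
assert run(failing_seed)[0] < 0
```

Brute-forcing seeds stands in for Antithesis's smarter state-space search; the replay property is the part the deterministic hypervisor provides for real, unmodified systems.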
Besides the product itself, the talk explored how the company uses its own solution internally to enhance the product and boost team confidence and ambition. Such increased self-assurance even emboldened them to undertake the ambitious task of building a database! Although the empowering effect of robust testing infrastructure seems self-evident, many developers I know still question the value of software testing. The reasons for this skepticism remain puzzling, particularly in light of this talk, which clearly demonstrates testing as a significant long-term productivity booster.
Traditional randomized testing, both in the field of databases and outside of it, might be a somewhat inefficient way to hunt for bugs. By controlling the inputs of random number generation, we can make our software try less of the same thing and instead explore new, creative ways to misbehave.
Alex Petrov shared a similar testing approach, applying model checking techniques to distributed systems like Apache Cassandra, focusing on deeply hidden bugs (non-heisenbugs) that are difficult to find due to the practically infinite number of possible execution paths. Petrov challenges the effectiveness of random fuzz testing by proposing a more intentional and systematic searching strategy. The core concept involves a deterministic simulator, a generator, and a test oracle.
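Petrov's actual tooling for Cassandra is far richer, but the generator/oracle shape can be sketched in Python: a generator systematically enumerates operation histories (rather than sampling them at random), a deterministic replay executes each history, and an oracle compares the system under test against a trusted model.

```python
import itertools

class TinyStore:
    """System under test: a toy KV store standing in for a real database."""
    def __init__(self):
        self.data = {}
    def put(self, k, v):
        self.data[k] = v
    def get(self, k):
        return self.data.get(k)

# Generator: systematically enumerate every history over a tiny op alphabet,
# instead of fuzzing with random ops and hoping to stumble onto a bug.
OP_ALPHABET = [("put", "a", 0), ("put", "a", 1), ("get", "a")]

def all_histories(length):
    return itertools.product(OP_ALPHABET, repeat=length)

# Oracle: replay a history against both the system and a trusted model (a
# plain dict); any divergence is a bug, and the history itself is the repro.
def check(history):
    sut, model = TinyStore(), {}
    for op in history:
        if op[0] == "put":
            sut.put(op[1], op[2])
            model[op[1]] = op[2]
        else:
            assert sut.get(op[1]) == model.get(op[1]), history
    return True

# Exhaustively check all 3^4 = 81 histories of length 4.
assert all(check(list(h)) for h in all_histories(4))
```

Real systems need the op alphabet to include crashes, restarts, and message reorderings, which is where the deterministic simulator earns its keep.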
Kyle Kingsbury shared case studies through Jepsen that primarily underscored the ambiguity in database vendors' documentation, particularly regarding the isolation levels they offer. Kingsbury points out that many issues he finds are largely due to the lack of a standard definition for isolation levels like "repeatable read." This causes many database vendors to make claims in their documentation that don't match developers' expectations. In the past, Kingsbury has even made a plea to standards bodies, pointing out that ANSI SQL's isolation levels are ambiguous and incomplete and need to be fixed.
The increased focus on testing reflects a maturing field, where a growing number of people are dedicating substantial effort to tackling this problem space. This trend is underscored by the announcement of the DST Alliance at the conference.
Lessons from Building Distributed Databases
Unexpectedly, there were many case studies of building distributed databases.
Joran Dirk Greef from TigerBeetle spoke extensively about consensus and its role in durability. A couple of points from this talk caught my attention. First, he acknowledged that traditional storage fault models in academia have limitations, leading to many applications that don’t recover well from storage faults. Greef discussed more modern recovery strategies that can correctly handle storage faults such as data corruption or errors. Much of this knowledge is embedded in TigerBeetle's database.
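TigerBeetle's actual recovery machinery (e.g., protocol-aware recovery across replicas) is far more involved, but a minimal sketch of the basic building block, detecting corruption instead of trusting reads, looks like this. The `encode`/`decode` names are illustrative, not TigerBeetle's.

```python
import zlib

def encode(payload: bytes) -> bytes:
    # Prefix each record with a CRC32 so a corrupted read is detectable.
    return zlib.crc32(payload).to_bytes(4, "big") + payload

def decode(record: bytes):
    stored = int.from_bytes(record[:4], "big")
    payload = record[4:]
    if zlib.crc32(payload) != stored:
        # Corrupt record: signal the caller to repair from a replica
        # instead of silently returning bad data.
        return None
    return payload

good = encode(b"account balance: 42")
assert decode(good) == b"account balance: 42"

flipped = good[:-1] + bytes([good[-1] ^ 0x01])  # simulate a single-bit storage fault
assert decode(flipped) is None
```

The traditional fault model Greef critiques effectively assumes `decode` never needs to return `None`; modern designs assume it will, and plan the repair path accordingly.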
The second point was the concept of logical availability and how the CAP theorem is too restrictive a model when considering physical availability. Overall, this talk was packed with distributed systems insights, and it will take me some time to read through all the research papers and resources highlighted.
Sammy Steel from Figma discussed her team's approach to horizontally sharding their AWS RDS (Postgres) databases in-house. The most interesting takeaway for me was her team's decision to build rather than buy a solution. Steel explained that they had tried numerous off-the-shelf solutions but consistently encountered write throughput issues. She went on to assert that running databases at scale remains one of the largest unsolved problems in cloud computing. Steel emphasized that when operating at truly large scales, managed solutions almost never "just work."
Certain early company decisions are very hard to walk back. Your data model, your ORM, and your consistency model are extremely hard to change.
Gwen Shapira from Nile discussed their implementation of distributed Data Definition Language (DDL) operations. The highlight of her talk for me was the team's methodical approach to problem-solving. They began by mapping all potential failure scenarios into a comprehensive table. This table included columns such as state (e.g., normal state, started transaction, pre-commit, committing transaction), event (e.g., database crashes, originating database crashes, coordinator crashes, database becomes unreachable), and corresponding actions.
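A table like that translates naturally into code as an explicit (state, event) → action map. The column names below come from the talk, but these specific entries and action names are illustrative, not the team's actual table.

```python
# Hedged sketch: encode the failure-scenario table directly, so every
# handled combination is visible and every unhandled one fails loudly.
RECOVERY_TABLE = {
    # (state,           event)                     -> action
    ("normal",          "db_crashes"):             "reconnect_and_resume",
    ("started_txn",     "coordinator_crashes"):    "abort_and_retry",
    ("pre_commit",      "db_unreachable"):         "park_and_resolve_later",
    ("committing_txn",  "originating_db_crashes"): "roll_forward_on_restart",
}

def action_for(state, event):
    try:
        return RECOVERY_TABLE[(state, event)]
    except KeyError:
        # An unmapped (state, event) pair means the failure analysis is
        # incomplete: surface that, rather than improvising a recovery.
        raise ValueError(f"unhandled scenario: {state!r} during {event!r}")

assert action_for("pre_commit", "db_unreachable") == "park_and_resolve_later"
```

The payoff of the tabular form is reviewability: the whole failure-handling story fits on one screen, and gaps are mechanical to spot.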
Lastly, my favorite talk was the least technical one (in a good way) by Deepti Srivastava. She added a unique dimension to systems thinking by urging engineers to adopt a product-oriented mindset. Srivastava challenged common fallacies such as "if you build it, they will come," emphasizing the importance of keeping end users in mind. She argued that technical excellence is now table stakes in the industry and that users care most about ease of adoption, use, and maintainability. In particular, Srivastava shared that, during her time working on Spanner at Google, even the technological feat that was Spanner was not always sufficient to overcome the friction caused by data migrations for potential customers.
Notes from Water Cooler Chats
Raft is simpler to implement than Paxos but at scale has limitations in handling certain failure modes.
Data migration remains a significant pain point for many database vendors.
Andy Pavlo is the man.
SIMD, SIMD, SIMD.
The seL4 Microkernel is cool.
The idea of WASM is good, the reality is shit.
Most systems folks have adverse reactions when they hear the word AI or JavaScript (this one is funny because systems folks end up selling to JavaScript developers).
Madsim-rs is a neat deterministic simulator for distributed systems in Rust.
SIGMOD 2022 Tutorial: “Dissecting, Designing, and Optimizing LSM-based Data Stores” is useful to start learning how to build databases.
Side eye looks promising.
Blockchain startups still cooking.
Looking Ahead
I highly recommend this conference to all engineers. It was, by far, the best conference I have attended. It opened my eyes to numerous challenging problems in the space and will keep me very busy in the coming months as I explore the ideas discussed throughout the conference in depth.
For next year, I hope to see more OLAP (Online Analytical Processing) discussions and case studies, especially some state-of-the-art GPU-accelerated data processing talks from folks like Wes McKinney. Looking forward to Systems Distributed '25!