Designing Data-Intensive Applications (Book Review)

Designing Data-Intensive Applications by Martin Kleppmann

I find this book to be an excellent, comprehensive guide to architecting data systems in the modern stack, with a healthy dose of history to illustrate how we arrived here. Building data systems that fit the business model and evolve as the company scales is an art requiring that we not only know the options available to us, but have also meditated on historic implementations that confirm or refute our assumptions about how these things behave in the wild.

1. Reliable, Scalable, and Maintainable Applications

The author jumps right in by setting the stage with the principal challenge facing modern applications - data. How much data there is, its complexity, and the rate at which it is changing. Data systems are typically painted from a palette of databases, caches, search indexes, stream processing and batch processing. Our job as system architects (a role I believe every engineer plays, at least in part) asks that we paint the most suitable picture from these tools to meet the challenge our software is built for. Exploring the art of data system design is the aim of this book.

The first chapter centers us on the fundamental aims we are trying to achieve in the design and evolution of any data system - reliability, scalability, and maintainability. Designing a data system that delivers consistent quality to customers while scaling prompts some interesting questions. The author shares a few: “How do you provide consistently good performance to clients, even when parts of your system are degraded? How do you scale to handle an increase in load? What does a good API for the service look like?” In particular, the author gives an example of an API layer that directs read requests to a cache, cache misses and writes to a database, and search requests to an index (like Elasticsearch). That’s a lot to keep track of when developing that API. What should we consider when designing systems of such complexity?
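To make that routing concrete for myself, here is a minimal sketch of such an API layer. It is my own illustration, not code from the book: the class name `ApiLayer`, its methods, and the plain dicts standing in for the cache, the database, and the search index are all assumptions, and a real system would talk to separate services over the network.

```python
class ApiLayer:
    """Toy API layer. Plain dicts stand in for the cache, database, and search index."""

    def __init__(self):
        self.cache = {}         # hypothetical stand-in for an in-memory cache
        self.database = {}      # hypothetical stand-in for the primary database
        self.search_index = {}  # hypothetical stand-in for a text index: word -> set of keys

    def read(self, key):
        # Reads go to the cache first and fall back to the database on a miss.
        if key in self.cache:
            return self.cache[key]
        value = self.database.get(key)
        if value is not None:
            self.cache[key] = value  # populate the cache for the next read
        return value

    def write(self, key, text):
        # Writes go to the database; the cache entry is invalidated and the
        # search index is brought in line with the new text.
        old_text = self.database.get(key)
        if old_text is not None:
            for word in old_text.lower().split():
                self.search_index.get(word, set()).discard(key)
        self.database[key] = text
        self.cache.pop(key, None)
        for word in text.lower().split():
            self.search_index.setdefault(word, set()).add(key)

    def search(self, word):
        # Search requests go to the index, never to the database directly.
        return sorted(self.search_index.get(word.lower(), set()))


api = ApiLayer()
api.write("doc1", "hello world")
print(api.read("doc1"))     # hello world (a cache miss, served by the database)
print(api.search("world"))  # ['doc1']
```

Even in this toy version, a single write touches three places, which is exactly the bookkeeping burden the chapter is pointing at.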

Well, “the system should continue to work correctly even in the face of adversity”, which would make it reliable.

And, “as the system grows there should be reasonable ways of dealing with that growth”, which means it is scalable.

And finally, over time lots of folks are going to be contributing to and working within the system, and should be able to do so productively, which means the system should be maintainable.

And there we have it. Our north star. Or stars. A compass we can use to guide our conversations as we architect and implement our data system. The author then proceeds to shine some extra light on each.

My family is reliable, so I have a good somatic feeling of that word, but it’s nice to extend that somatic sensation to include the systems I design and steward. I’m not sure how other technologists do it, but I like to build with my systems as kin, as I was first taught when I washed upon these Pacific shores (from NYC) and Keith Hennessey so graciously caught this queer runt that had been all bandied about by life and asked me to reflect on James Broughton at an event at the San Francisco Art Institute. I pulled out my cello and tried my darndest to celebrate the triumph of Broughton's The Bliss of With, a gorgeous poem I sing on repeat to Carl Tashian. I hope you all read that poem and find someone in this world to sing it to, as that kind of education is invaluable.

It’s nice to stop, drop, and roll on how we might operationalize reliability, because, like all ideals, being precise will allow us to calibrate just how reliable a system needs to be to serve our purpose. That was a big moment for me personally when I realized I could rely on the fact that some people in my life were not reliable. It helped me feel the fabric of reliability as a mesh of expectations and results undulating and evolving over time, as opposed to some metallic mass of an idea.

“The things that can go wrong are called faults”, and I find myself wondering if perhaps we can describe the intersection of reality and our expectations with more fidelity. More linguistic and structural clarity. For the love of earth, trans technologists know a lot about this rich estuary lol. Perhaps we can invoke the geological view of fault, as my friend Angal Field suggested this morning when I mentioned fault tolerance. This treatment of a fault as a planar fracture or discontinuity in a volume of rock. That starts to get interesting. Fault as invitation to examine the consequence of material and stressors. What would be the materiality of a data system, if you were to imagine its corollary in the physical world? And how might its system stressors be represented in that corollary system? And what would fault analysis look like?

The author reminds us that faults are really just those system behaviors that deviate from spec, not all the things that could conceivably go wrong in a system. And these faults are distinct from failures, where the system no longer provides the service the customer expects. The principal errors we concern ourselves with are hardware errors (often addressed with multi-machine redundancy), software errors (like the Linux kernel's notorious mishandling of the leap second on June 30, 2012), or straight up human errors. For the latter, the author advises comprehensive test coverage, installing rollback protocols for quick recovery from error, comprehensive application performance monitoring (e.g. DataDog, New Relic, which monitor everything from cloud servers through databases, middleware, and 3rd party integrations), and decoupling the places where people make mistakes from the places where they can muck things up.

“There is no such thing as a generic, one-size-fits-all scalable architecture”, is the best sentence I’ve ever read with regard to enterprise software scalability. If you walk the halls of a startup, you can feel the tension caused by the surging imperative of scale. For my teams, I like to exercise our imagination on how elastic our system feels at any given time. If some random tweet blows up tomorrow, and we’ve got 100x more users on our site Monday morning, what will break first? Can our servers handle the swell?

The author’s discussion of maintainability is real yo. I remember at Landed when we were rewriting the core logic we’d MVP’d in Zapier (which our product visionary Norma Gibson leveraged for some heroic-level bootstrapping), mapping solutions would gum up our collective imagination because we had such a dim collective schema of the Zapier webhook topology. We had stepped outside our canonical engineering abstractions and were playing in the land of a product’s specific idea, at that time, of how workflow management should function. From my vantage, this is what made our system hard to maintain for a period of time. Not the use of Zapier itself, but our inability to model out changes in a test environment and easily roll out such workflow changes once we’d proved things to ourselves (and document and train our Ops teams!). So for me, maintainability is not just documentation, but building systems that stay close to the paths of canonical system design patterns so all engineers on deck can leverage the robust architecture of engineering abstractions.

2. Data Models and Query Languages

Data modeling is vital to any enterprise software because it determines so much of our day-to-day development experience as engineers. I prioritize the layer of abstraction we’re most familiar with as application developers, but it’s nice to be reminded that database architects brew on a whole other set of considerations, as do hardware engineers optimizing how bytes are represented by energy pulses in the system itself. Layering abstractions allows teams to own their slice of pie without needing to mentally model the entire pie each time they want to make an API call or run a batch job. Thank goodness for that!

For much of the web, relational databases are the name of the game. They were the first database approach that properly allowed product developers to focus on business data processing (both transaction and batch processing) without needing to dig into the weeds of how their data was represented in the database. Still, we know there are entire roles and teams dedicated to SQL query optimization, requiring depth of knowledge in query execution and database storage patterns, but for many developers this is territory seldom explored in their day-to-day lives.

3. Storage and Retrieval

The chapter kicks off with a cute example of a simple database written as a bash script that defines set and get functions for a key-value store, which is essentially a log where new values are appended. This opens up a discussion of two core tenets of any data storage conversation - what happens at write and what happens at retrieval. From there the author expands to indexes, which we can define to facilitate speedy retrieval (although if we had tons of indexes this would add time to our writes, as we would be updating them each time we added data).
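The book's version is a couple of lines of bash. Here is a rough Python sketch of the same idea, written by me rather than taken from the book. The function names `db_set` and `db_get`, the `key,value` line format, and the file name are illustrative assumptions, and the sketch assumes keys contain no commas and values contain no newlines.

```python
import os
import tempfile


def db_set(path, key, value):
    """Append a "key,value" record to the end of the log file."""
    with open(path, "a", encoding="utf-8") as log:
        log.write(f"{key},{value}\n")


def db_get(path, key):
    """Scan the whole log and return the most recent value for key, or None."""
    if not os.path.exists(path):
        return None
    prefix = f"{key},"
    latest = None
    with open(path, encoding="utf-8") as log:
        for line in log:
            if line.startswith(prefix):
                latest = line[len(prefix):].rstrip("\n")  # later records win
    return latest


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "database")
    db_set(path, "42", "San Francisco")
    db_set(path, "42", "Oakland")
    print(db_get(path, "42"))  # Oakland
```

Writes are cheap because they only append, while every read scans the entire file. That cost is what motivates indexes.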

There are several different index types, perhaps most notably hash indexes - for each key-value pair you append to the log, you keep an entry in an in-memory hash map that points from the key to the byte offset in the data file where its value lives. The example of Bitcask is given, which keeps all keys mapped in working memory, so you want to make sure you have enough RAM. If it’s possible to keep all your keys in memory and your use case finds you making lots of value updates, this solution might fit.
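A minimal sketch of a hash index, as I understand it: an in-memory dict maps each key to the byte offset of its latest record in an append-only file, so a read is one seek instead of a full scan. This is my own toy, not Bitcask's actual file format or API. The class name `HashIndexedLog` and the `key,value` record format are assumptions, and the sketch assumes it starts from a new file (a real store would have to rebuild the index after a restart).

```python
import os
import tempfile


class HashIndexedLog:
    """Append-only log plus an in-memory hash map of key -> byte offset."""

    def __init__(self, path):
        self.path = path
        self.index = {}           # key -> offset of the latest record for that key
        open(path, "ab").close()  # create the file if absent; the index is not rebuilt

    def set(self, key, value):
        record = f"{key},{value}\n".encode("utf-8")
        with open(self.path, "ab") as log:
            log.seek(0, os.SEEK_END)  # be explicit about where the record will land
            offset = log.tell()
            log.write(record)
        self.index[key] = offset      # point the key at its newest record

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as log:
            log.seek(offset)          # jump straight to the record, no scan
            line = log.readline().decode("utf-8").rstrip("\n")
        _, _, value = line.partition(",")
        return value


with tempfile.TemporaryDirectory() as tmp:
    db = HashIndexedLog(os.path.join(tmp, "segment.log"))
    db.set("a", "1")
    db.set("b", "2")
    db.set("a", "3")
    print(db.get("a"))  # 3
    print(db.index)     # {'a': 8, 'b': 4}
```

Every key lives in the dict, which is why the RAM caveat matters. Values stay on disk and only cost a seek.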

So what happens when our database starts to grow and we become worried we’re going to run out of space? One option is to close the current log file once it reaches a certain size, start a new one, and perform compaction on the old, which means you throw away duplicate keys in the log and keep only the most recent value for each key. You can then merge files that have undergone compaction and write them to a new file, keeping things tidy.
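Compaction and merging can be sketched in a few lines. This is my own simplification that works on in-memory lists of (key, value) pairs rather than files. The function names `compact` and `merge_segments` and the sample data are illustrative assumptions, not anything from the book.

```python
def compact(segment):
    """Keep only the most recent value for each key in one segment.

    `segment` is a list of (key, value) pairs in the order they were written.
    """
    latest = {}
    for key, value in segment:
        latest[key] = value  # a later write overwrites an earlier one
    return list(latest.items())


def merge_segments(segments):
    """Merge segments, given oldest first, into one compacted segment."""
    merged = {}
    for segment in segments:
        for key, value in segment:
            merged[key] = value  # values from newer segments win
    return list(merged.items())


old_segment = [("a", "1"), ("b", "2"), ("a", "3")]
new_segment = [("b", "4"), ("c", "5")]

print(compact(old_segment))
# [('a', '3'), ('b', '2')]
print(merge_segments([old_segment, new_segment]))
# [('a', '3'), ('b', '4'), ('c', '5')]
```

Because the originals are never modified, the merged output can be written to a fresh file while reads keep using the old segments, and the old files can be dropped once the new one is ready.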