What I Learned This Week #4
Data Storage & Encoding, Measuring Disk, Apache Parquet, and Rust.
👋 Hi, this is Gabriel with this week’s learnings. I write about software, startups and things that interest me enough to learn about. Thank you for your readership.
This week I’m sharing my top learnings about data storage & encoding, measuring disk, Apache Parquet, Rust, and more. Hope it is helpful!
Learned while reading
Read Designing Data Intensive Applications Chapter 3 and 4, Understanding Software Dynamics Chapter 5 and 6, Apache Parquet’s specification, Encore’s blog on queueing strategies, and the Learning Rust With Entirely Too Many Linked Lists Chapter 2 and 3.
Storage engines fall into two categories: Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP), each optimized for different access patterns and performance characteristics. OLTP databases, such as MySQL and PostgreSQL, are designed to handle a high volume of transactions, each randomly accessing a small amount of data. As a result, low latency is the primary design focus, and the primary bottleneck is often disk seek time (the time it takes the read/write head to move to the correct location on the disk). OLAP systems, such as the Hadoop Distributed File System (HDFS), are instead optimized for a small number of complex read operations that analyze large volumes of data. Here, high throughput is the primary design focus, and the primary bottleneck is often disk bandwidth. To address this bottleneck, many OLAP systems achieve high throughput through parallel data processing, distributing tasks across multiple nodes.
Data structures used in-memory are not suitable for sending over networks or storing on disk. This is because the in-memory representation of data depends on machine-specific information like memory addresses (pointers), CPU architecture, and memory layout, which would not be meaningful or valid on another machine. To transfer or store data, you must encode it into formats that are platform independent and self-contained like JSON, Thrift, Protocol Buffers, or Avro, each with trade-offs in efficiency, human readability, and maintainability.
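A tiny Rust sketch of the point, with hand-rolled JSON purely for illustration (real code would reach for serde, Protocol Buffers, etc.): the pointer printed below is only meaningful inside this one process, while the encoded string is self-contained and valid anywhere.

```rust
// A `String` is internally a pointer + length + capacity, which is
// meaningless on another machine or even in another process.
struct User {
    name: String,
    age: u32,
}

// Encode into a platform-independent, self-contained text form.
fn encode(u: &User) -> String {
    format!("{{\"name\":\"{}\",\"age\":{}}}", u.name, u.age)
}

fn main() {
    let u = User { name: "Ada".to_string(), age: 36 };
    // The heap address is machine-specific state:
    println!("in-memory pointer: {:p}", u.name.as_ptr());
    // The encoded bytes carry all the information the receiver needs:
    println!("encoded: {}", encode(&u));
}
```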
When you write data to a file system, the operating system doesn’t immediately write the data to disk. It lies to you! Instead, it buffers the data in memory to improve performance by reducing the number of expensive disk I/O operations. This buffering also means that there can be data loss if the machine crashes before the write buffer has been flushed to disk. To ensure that data is immediately written to the disk, you need to explicitly flush the file system buffers. On a POSIX system, you can call the fsync system call to flush the file system buffers for a specific file descriptor. Of course, explicitly flushing buffers will have a performance impact, as it forces the operating system to perform more expensive disk I/O operations more frequently rather than batching them.
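A minimal Rust sketch of this (the path and contents are just illustrative): `File::sync_all` maps to `fsync(2)` on POSIX systems, blocking until the buffered data has been handed to the storage device.

```rust
use std::fs::File;
use std::io::Write;
use std::path::Path;

// Write data, then force it out of the OS write buffer and onto disk.
fn write_durably(path: &Path, data: &[u8]) -> std::io::Result<()> {
    let mut file = File::create(path)?;
    file.write_all(data)?; // data may only have reached the kernel's page cache here
    // sync_all() issues fsync(2) on POSIX: flush buffered file contents
    // (and metadata) to the storage device before returning.
    file.sync_all()
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("fsync_demo.txt");
    write_durably(&path, b"important data")
}
```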
Apache Parquet is a columnar storage format that optimizes storage size and query performance through columnar layout, encoding, and compression. Data is stored by column rather than by row, so a query can selectively read only the columns it needs, reducing the amount of data processed in analytical workloads. Parquet also encodes data at the column level, enabling different encoding schemes for different data types, which yields better performance. The encoded column data can then be compressed using algorithms like Snappy or gzip, achieving higher compression ratios by leveraging the similarity of values within a column. This combination of columnar storage, encoding, and compression makes Parquet a standard for large-scale data storage and analytics.
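The intuition behind why column-level encoding pays off can be sketched in a few lines. This is a toy run-length encoder, not Parquet's actual wire format (the spec defines dictionary, RLE/bit-packing, and delta encodings): because a column stores similar values contiguously, runs of repeated values collapse dramatically.

```rust
// Toy run-length encoding over one column. In a row-oriented layout,
// these values would be interleaved with other fields and runs like
// this would rarely form.
fn run_length_encode(column: &[&str]) -> Vec<(String, usize)> {
    let mut runs: Vec<(String, usize)> = Vec::new();
    for v in column {
        if let Some(last) = runs.last_mut() {
            if last.0.as_str() == *v {
                last.1 += 1; // extend the current run
                continue;
            }
        }
        runs.push((v.to_string(), 1)); // start a new run
    }
    runs
}

fn main() {
    // A low-cardinality "country" column, stored contiguously:
    let column = ["US", "US", "US", "DE", "DE", "US"];
    // Six values collapse into three (value, run length) pairs.
    println!("{:?}", run_length_encode(&column));
}
```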
Queueing strategies, such as First In, First Out (FIFO), Last In, First Out (LIFO), Priority, and Active Queue Management (AQM), impact the latency in request processing. In practice, it's common to see FIFO queues. Nevertheless, LIFO queues (a.k.a. stacks) offer an interesting property: by prioritizing the most recent requests, LIFO queues avoid wasting resources on processing timed-out requests, which can occur in FIFO queues when the queue starts to fill up. As a result, LIFO queues provide excellent median latency (50th percentile) performance. However, this comes at the cost of poor tail-end (99th percentile) performance, as older requests may experience significant delays.
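The two disciplines differ only in which end of the backlog they serve, which a short Rust sketch makes concrete (the request names are illustrative):

```rust
use std::collections::VecDeque;

// FIFO serves the oldest request in the backlog...
fn next_fifo<'a>(backlog: &VecDeque<&'a str>) -> Option<&'a str> {
    backlog.front().copied()
}

// ...while LIFO serves the newest one.
fn next_lifo<'a>(backlog: &VecDeque<&'a str>) -> Option<&'a str> {
    backlog.back().copied()
}

fn main() {
    // Requests arrive in order; req-1 is the oldest.
    let backlog: VecDeque<&str> = ["req-1", "req-2", "req-3"].into_iter().collect();
    // FIFO is fair, but once the queue fills it spends capacity on
    // requests whose clients may already have timed out.
    println!("FIFO serves {:?}", next_fifo(&backlog));
    // LIFO gets excellent median latency by serving the freshest request,
    // at the cost of starving old requests in the tail.
    println!("LIFO serves {:?}", next_lifo(&backlog));
}
```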
In Rust, the compiler verifiably ensures that a mutable reference (write access) to each element in a data structure is handed out exactly once within a particular scope. This restriction prevents simultaneous mutable access to the same data, eliminating data races. The "exactly once" semantics effectively allows sharding access across disjoint parts while guaranteeing exclusive access at any point in time.
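A minimal sketch of this sharding in action: the standard library's `split_at_mut` divides a slice into two disjoint mutable halves, so two mutable borrows coexist safely because the compiler knows they can never alias.

```rust
// `split_at_mut` shards one slice into two non-overlapping mutable
// views; each element is reachable through exactly one of them.
fn bump_halves(data: &mut [i32; 4]) {
    let (left, right) = data.split_at_mut(2);
    left[0] += 10;   // mutates data[0]
    right[0] += 100; // mutates data[2]; the borrows never overlap
}

fn main() {
    let mut data = [1, 2, 3, 4];
    bump_halves(&mut data);
    println!("{:?}", data);
}
```

The same property is what makes it safe to hand each half to a different thread.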
Learned while listening
Apache DataFusion has the potential to fill the same role for databases that LLVM plays for programming languages. This could accelerate the current trend in databases of building systems tailored to particular workloads, which can yield performance improvements substantial enough to justify the development effort.
Enterprise requirements are so different from those of Product-Led Growth (PLG) companies, Small and Midsize Businesses (SMBs), and Mid-Market companies that you have to make a conscious choice to target enterprises.
The folks in the data systems space are starting to prioritize modularity, envisioning a future of reusable data components that will streamline the creation of new data systems. A large motivation for this is the growing trend of hardware-accelerated data processing: without clear API boundaries, it becomes very difficult to take advantage of the state of the art as hardware evolves.