Write Behind Log

April 2020 Summary

April 30, 2020May 6, 2020 Minghui LiuLeave a comment

In April, I got up around 9:41AM everyday, and concentrated for 104 hours and 21 minutes in total – that is on average 3 hours and 21 minutes each day and 59 minutes shorter than last month. My most efficient time slots were 11AM and 5PM.

Online course wise, I finished CMU 15/445 and lesson 1 – 5 of CMU 15-721.

I read the following papers and articles:

Mike Stonebraker, Daniel J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Sam Madden, Elizabeth O’Neil, Pat O’Neil, Alex Rasin, Nga Tran, Stan Zdonik. C-store: A Column-oriented DBMS. SIGMOD, 2005.

Cristian Diaconu, Craig Freedman, Erik Ismert, Per-Ake Larson, Pravin Mittal, Ryan Stonecipher, Nitin Verma, Mike Zwilling. Hekaton: SQL Server’s Memory-optimized OLTP Engine. SIGMOD, 2013.

Stavros Harizopoulos, Daniel J. Abadi, Samuel Madden, Michael Stonebraker. OLTP Through the Looking Glass, and What We Found There. SIGMOD, 2008.

Jeff Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI, 2004.

Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu. DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language. OSDI, 2008.

Atul Adya, Barbara Liskov, and Patrick O’Neil. Generalized Isolation Level Definitions. ICDE, 2000.

Kallman, Robert, Hideaki Kimura, Jonathan Natkins, Andrew Pavlo, Alexander Rasin, Stanley Zdonik, Evan PC Jones et al. H-store: a high-performance, distributed main memory transaction processing system. VLDB, 2008

Zaharia, Matei, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. Spark: Cluster computing with working sets. HotCloud, 2010

Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: Amazon’s Highly Available Key-Value Store. SOSP, 2007.

Eric Brewer. CAP Twelve Years Later: How the “Rules” Have Changed. IEEE Computer, 45, 2 (2012).

Y. Wu, et al., An Empirical Evaluation of In-Memory Multi-Version Concurrency Control, in VLDB, 2017

D.R.K. Ports, et al., Serializable Snapshot Isolation in PostgreSQL, in VLDB, 2012

Y. Huang, et al., Opportunities for Optimism in Contended Main-Memory Multicore Transactions, in VLDB, 2020

T. Neumann, et al., Fast Serializable Multi-Version Concurrency Control for Main-Memory Database Systems, in SIGMOD, 2015

H. Lim, et al., Cicada: Dependably Fast Multi-Core In-Memory Transactions, in SIGMOD, 2017

P.-A. Larson, et al., High-Performance Concurrency Control Mechanisms for Main-Memory Databases, in VLDB, 2011

J. Böttcher, et al., Scalable Garbage Collection for In-Memory MVCC Systems, in VLDB, 2019

J. Lee, et al., Hybrid Garbage Collection for Multi-Version Concurrency Control in SAP HANA, in SIGMOD, 2016

Z. Wang, et al., Building A Bw-Tree Takes More Than Just Buzz Words, in SIGMOD, 2018

S.K. Cha, et al., Cache-Conscious Concurrency Control of Main-Memory Indexes on Shared-Memory Multiprocessor Systems, in VLDB, 2001

G. Graefe, A Survey of B-Tree Locking Techniques, in TODS, 2010

J. Levandoski, et al., The Bw-Tree: A B-tree for New Hardware, in ICDE, 2013

J. Faleiro, et al., Latch-free Synchronization in Database Systems: Silver Bullet or Fool’s Gold?, in CIDR, 2017

T. Kissinger, et al., KISS-Tree: smart latch-free in-memory indexing on modern architectures, in DaMoN, 2012

V. Leis, et al., The Adaptive Radix Tree: ARTful Indexing for Main-Memory Databases, in ICDE, 2013

High-Performance Concurrency Control Mechanism for Main-Memory Databases

April 21, 2020April 21, 2020 Minghui LiuLeave a comment

This paper describes the concurrency control mechanisms of the Hekaton storage engine of Microsoft SQL Server, one is optimistic and one is pessimistic. The authors also implemented a simple single-version locking concurrency control method optimized for in-memory databases, and report performance comparisons of the three.

The authors found that single-version locking is very fragile, in that it works well for short transactions and low contention, but its performance degrades rapidly under high contention, or if there is even one long running transaction. On the contrary, multi-version concurrency control method perform well under high contention and when there are long running transactions. The optimistic method consistently performs better than the pessimistic method.

The authors first experimented with homogeneous workload of short update transactions, which represents an extreme update-heavy scenario, and varied the multi-programming level of the system.

The pessimistic method, under low contention and read committed isolation level, has 30% lower performance than the optimistic method (because of extra writes for tracking dependencies and locks, which cause increased memory traffic. It takes pessimistic method 20% more cycles to execute the same number of instructions, and the additional control logic translates into 10% more instructions per transaction. Under high contention, the optimistic method performed the best throughput, with pessimistic method close behind. Single-verison locking achieved good throughput numbers initially but did not scale beyond 8 threads.

If we fix the contention level (low, 24 threads) and transaction workload type but vary the isolation level, optimistic MVCC’s performance degrades the fastest, followed by pessimistic MVCC and the single-version locking’s performance reduces only by 1.8%.

Screen Shot 2020-04-21 at 11.13.18 AM.png

Then the authors experimented with heterogeneous workload, fixing the multi-programming level and varying the percentage of short and long read-only transactions in the mix.

Starting with short read transactions, under low contention, single-version locking out-performed both MV until 90% read-only workload, at which point it was surpassed by both MV schemes. Under high contention MVCC schemes start to show more advantage as percentage of read-only transactions increases. When 80% of the transactions are read-only, two MVCC schemes achieve 63% and 73% high throughput than 1V.

If we mix in long-running read transactions instead, the advantage of MVCC schemes shows even more. 1V achieves twice the update throughput of the MV schemes when there is 0 long read transactions. However mixing in even one long read transaction, 1V’s update throughput drops by 75%, in contrast to only 5% drop of MV schemes. When 50% of the active transactions are long readers, MV has 80x high update throughput than 1V. In terms of throughput, MV schemes consistently outperform 1V as the gap widens as the number of long readers increases.

TATP benchmark simulates a telecommunications application. 80% of the transactions are queries, 16% of them are updates, 2% are inserts, and 2% are deletes. 1V achieved 4,220,119 txn/s, pessmisitic MV achieved 3,129,816 txn/s and optimistic MV achieved 3,121,494 txn/s.

I am only summarizing the performance results here, for detailed description of the two multi-version concurrency control implementations, I suggest that readers go to the original paper.

Serializable Snapshot Isolation in PostgreSQL

April 20, 2020April 20, 2020 Minghui LiuLeave a comment

This paper documents how the authors implemented a serializable snapshot isolation in PostgreSQL, and out performed strict-2PL on some workloads, while incurring low cost (less than 7%) relative to snapshot isolation.

PostgreSQL, like many other database systems, use the weakest ANSI SQL isolation level: READ COMMITTED, by default in order to get good performance. The highest isolation level that PostgreSQL provides, and what the user gets when she asks for SERIALIZABLE, is actually SNAPSHOT ISOLATION. Although SNAPSHOT ISOLATION avoids all three anomalies (dirty reads, non-repeatable reads, and phantom reads) defined in the ANSI standard it stills allows some anomalies that are not commonly known, two of which are simple write skew, and batch processing anomalies. A descriptions of these two anomalies is in section 2.1 of the paper.

Although there are techniques that can be applied to avoid these anomalies under snapshot isolation, the analysis required to identify them is often costly, especially when they inherently concerns the interactions between multiple transactions. In comparison, SERIALIZABLE avoids all problems because transactions can be treated as if they are running serially in complete isolation.

Instead of implementing serializability using Strict 2 Phase Locking (S2PL), the authors implemented Serializable Snapshot Isolation (SSI). I will very briefly explain SSI here. Curious readers can go read the original paper. In short, SSI runs transactions using snapshot isolation, but performs additional checks to determine whether anomalies can possibly happen. SSI is similar to other concurrency control protocols that uses serialization graph testing. But instead of detecting cycles, SSI looks for “dangerous structures” (at least two adjacent rw-antidependency edges) in a serialization graph. Using this method, SSI may give some false positives, but it is more efficient, because it is less restrictive: it allow some rw-conflicts as long as they do not for a “dangerous structure”. Therefore SSI offers better performance compared to S2PL.

To identify rw-antidependencies, SSI have transactions acquire special “SIREAD” locks on the data they read. These “SIREAD” locks do not block conflicting writes, but a conflict between write lock and a “SIREAD” lock will be flagged as a rw-antidependency. Because conflicts can occur even after a reader commits, “SIREAD” locks need to persist until all concurrent transactions commit.

The authors also documented two optimizations, safe snapshot and deferrable transactions, that they implemented for read-only transactions in section 4 of the paper. Details of SSI implementation can be found in section 5.

SSI does have potential unbounded memory usage problem, because a transaction can hold a large number of locks, and cannot release them until all concurrent transactions commit. Other transaction states such as the rw-antidependencies used to check for dangerous structures may also be held for a long time. A single long-running transaction can prevent thousands of other transactions from being cleaned up. The solution was to bound the memory usage of SSI and gracefully degrade (with potentially higher false positive abort rate). Additionally, four techniques that can limit the memory usage were given in section 6.

The authors used three benchmarks to evaluate the performance of SSI: a modified TPC-C, the SIBENCH microbenchmark and the RUBiS web application benchmark. On the SIBENCH, SSI has an overhead of 10-20% for tracking read dependencies without the read-only optimizations. On the TPC-C benchmark with “credit check” transactions from TPC-C++ variant, SSI causes 5% slowdown relative to snapshot isolation with the read-only optimizations turned on. On the RUBiS benchmark, SSI archieved throughput of 422 req/s, compared to 435 req/s of snapshot isolation and 208 req/s of S2PL. Details of experimental setup and benchmarks are in section 8 of the paper.

In summary, SSI is a novel method of achieving serializability on top of snapshot isolation, without using S2PL. It achieved better performance than S2PL while paying a small cost relative to snapshot isolation.

OLTP Through the Looking Glass, and What We Found There

April 13, 2020April 13, 2020 Minghui LiuLeave a comment

This paper from 2008 provides an instruction-level breakdown of the performance impact of each major component of a OLTP database, running a subset of TPC-C.

Since the invention of transactional databases in the 1970s and 1980s, a few things have changes:

Hardware improvements. CPUs are faster so that each transaction now only takes a few microseconds to complete. Memory is getting cheaper so that most OLTP databases fit in memory.
Modern internet applications and data intensive applications no long require full standard database features, and can live with varying levels of consistency, reliability, concurrency, replication and query-ability.

Some alternative database designs are now worth considering, such log-less databases, single-threaded databases, and transaction-less databases.

The authors started with a system called Shore, and progressively striped down each of the following components of it, to identify the overhead of each one. The benchmark they run were NewOrder and Payment transactions from TPC-C.

Logging
Locking
Latching
Buffer management

The detailed instruction count breakdown was given by figure 5 and figure 6 of the paper.

Screen Shot 2020-04-14 at 10.34.33 AM.png

Buffer management and locking operations are the most significant contributors to performance overhead, followed by logging and latching operations.

The authors concluded that unless one strips down all of these components, it is hard to get a main-memory optimized database that performs much better than a traditional database where most data fits into RAM. But if one does strip down all of the components, i.e, to design a single-threaded, in-memory, transaction-less databases that uses copy-state-over-network recovery, the performance will be orders of magnitude better than the system one starts with.

Staring into the Abyss: An Evaluation of Concurrency Control with One Thousand Cores

April 12, 2020April 12, 2020 Minghui LiuLeave a comment

This paper from 2014 argued that DBMSs were not ready to take advantage of the on-chip parallelism offered by many-core machines. As the number of cores increases, the problem of concurrency control schemes becomes more challenging to deal with. The authors implemented the following 7 concurrency control schemes from two families:

Two-Phase Locking
- 2PL with Deadlock Detection (DL_DETECT)
- 2PL with Non-waiting Deadlock Prevention (NO_WAIT)
- 2PL with Waiting Deadlock Prevention (WAIT_DIE)
Timestamp Ordering
- Basic T/O (TIMESTAMP)
- Multi-version Concurrency Control (MVCC)
- Optimistic Concurrency Control (OCC)
- T/O with Partition-level Locking (H-STORE)

and evaluated their performance and scalability under various OLTP workloads, using YCSB and TPC-C benchmarks. For details of the results please see the figures and Experimental Analysis section of the original paper. In summary, all of the seven current concurrency control schemes suffered from some kind of bottleneck when scaled to many cores, especially under contention. The different bottlenecks were summarized nicely in Table 2 of the paper.

Screen Shot 2020-04-13 at 11.32.26 AM.png

The authors thought that some of these bottlenecks could be addressed by new hardware support.

All T/O schemes suffer from the timestamp allocation bottleneck, which can be addressed by using synchronized clocks across CPU cores (currently only supported by Intel CPUs) or built-in hardware counter (no CPU currently supports this).
Memory-caused bottlenecks can be alleviated by hardware accelerators on the CPU that copies memory in the background (which eliminates the need to load all data through the CPU’s pipeline), and by using optimized memory allocation schemes.

Although distributed DBMSs can achieve better performance than a single-node DBMS, they suffer from a different bottleneck: the need of an Atomic commit protocol to support distributed transactions. The authors believe that a single many-core node with a large amount of DRAM might outperform a distributed DBMS for all but the largest OLTP applications.

March 2020 Summary

March 31, 2020April 12, 2020 Minghui LiuLeave a comment

According to my time management app, this month I woke up around 9:40 everyday, and stayed focused for an average of 4 hours and 20 minutes daily. My most efficient time slots were 10AM – 1PM and 4PM – 6PM.

I watched lesson 1- 20 of CMU 15/445 and finished homework 1, 2 and project 1. I also watched 5 lessons of the TLA+ video course.

Book wise, I finished Concurrency Control and Recovery in Database Systems, by P.A. Bernstein, read chapter 10 and 13 of the Database System Concepts 7th Edition, and chapter 10 and 11 of Principles of Distributed Systems 3rd Edition.

I read the following papers word by word:

Comer, Douglas. Ubiquitous B-tree. ACM Computing Surveys (CSUR) 11.2 (1979): 121-137.
Joseph M. Hellerstein, Michael Stonebraker, James Hamilton. Architecture of a Database System. Foundations and Trends in Databases, 1, 2 (2007).
Morton M. Astrahan, Mike W. Blasgen, Donald D. Chamberlin, Kapali P. Eswaran, Jim Gray, Patricia P. Griffiths, W. Frank King III, Raymond A. Lorie, Paul R. McJones, James W. Mehl, Gianfranco R. Putzolu, Irving L. Traiger, Bradford W. Wade, Vera Watson. System R: Relational Approach to Database Management. ACM Transactions on Database Systems, 1(2), 1976, 97-137.
Patricia G. Selinger, Morton M. Astrahan, Donald D. Chamberlin, Raymond A. Lorie, Thomas G. Price. Access path selection in a relational database management system. SIGMOD, 1979.
Jim Gray, Raymond A. Lorie, Gianfranco R. Putzolu, Irving L. Traiger. Granularity of Locks and Degrees of Consistency in a Shared Data Base. , IBM, September, 1975.
Rakesh Agrawal, Michael J. Carey, Miron Livny. Concurrency Control Performance Modeling: Alternatives and Implications. ACM Transactions on Database Systems, 12(4), 1987, 609-654.
C. Mohan, Bruce G. Lindsay, Ron Obermarck. Transaction Management in the R* Distributed Database Management System. ACM Transactions on Database Systems, 11(4), 1986, 378-396.
Katsarakis, Antonios, et al. Hermes: a Fast, Fault-Tolerant and Linearizable Replication Protocol. Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems. 2020.

I read some parts of the following papers:

Michael Stonebraker and Lawrence A. Rowe. The design of POSTGRES. SIGMOD, 1986.
David J. DeWitt, Shahram Ghandeharizadeh, Donovan Schneider, Allan Bricker, Hui-I Hsiao, Rick Rasmussen. The Gamma Database Machine Project. IEEE Transactions on Knowledge and Data Engineering, 2(1), 1990, 44-62.
C. Mohan, Donald J. Haderle, Bruce G. Lindsay, Hamid Pirahesh, Peter M. Schwarz. ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging. ACM Transactions on Database Systems, 17(1), 1992, 94-162.

How to read a paper by Peter Carr

February 23, 2020February 25, 2020 Minghui LiuLeave a comment

Today I watched a video by Prof. Peter Carr of University of Minnesota on how to read research papers. Here are the notes I took.

Most of the papers are organized like this:

Title
Keywords
Abstract
Introduction
Experimental
Results and Discussion
Summary/Conclusion
References

For computer science papers, especially systems papers, the layout can be a little different. After the introduction section there is usually a section that explains the design of the system. The experimental and results section are usually merged into one.

But regardless of the format of a paper, the last thing you want to do is to read it from start to end. Do NOT read a paper linearly.

Prof. Carr divided his paper reading method into two phases.

Phase 1: survey the paper

This phase should takes 5 to 30 minutes.

read the title and keywords
read the abstract
read the conclusion

Phase 2: read the article

This phase might take between 1 hour to a few days

look at the tables and the figures, including the captions
read the introduction (for some background)
read the results discussion (this is the heart of the paper)
read the experimental/design (learn how the authors did the work)

At any point of the read process, if you feel that you have lost interest, feel free to stop reading. After reading the paper, it is a good idea to write down some notes (preferably in a notebook instead of in the margins) so you do not have to read the paper again.

Hello World!

February 21, 2020February 25, 2020 Minghui LiuLeave a comment

I have been subscribing to and reading a lot of blogs, most of them are acedemic blogs that post research ideas and paper reading notes. They got me interested in the field of distributed systems and databases, and eventually motivated me to apply to Ph.D programs.

Now that my application season is almost over, I have decided to start a blog to keep my own paper reading notes and to track my research progress in my future years as a Ph.D student.