High-Performance Concurrency Control Mechanism for Main-Memory Databases

This paper describes the concurrency control mechanisms of the Hekaton storage engine of Microsoft SQL Server, one is optimistic and one is pessimistic. The authors also implemented a simple single-version locking concurrency control method optimized for in-memory databases, and report performance comparisons of the three.

The authors found that single-version locking is very fragile, in that it works well for short transactions and low contention, but its performance degrades rapidly under high contention, or if there is even one long running transaction. On the contrary, multi-version concurrency control method perform well under high contention and when there are long running transactions. The optimistic method consistently performs better than the pessimistic method.

The authors first experimented with homogeneous workload of short update transactions, which represents an extreme update-heavy scenario, and varied the multi-programming level of the system.

The pessimistic method, under low contention and read committed isolation level, has 30% lower performance than the optimistic method (because of extra writes for tracking dependencies and locks, which cause increased memory traffic. It takes pessimistic method 20% more cycles to execute the same number of instructions, and the additional control logic translates into 10% more instructions per transaction. Under high contention, the optimistic method performed the best throughput, with pessimistic method close behind. Single-verison locking achieved good throughput numbers initially but did not scale beyond 8 threads.

If we fix the contention level (low, 24 threads) and transaction workload type but vary the isolation level, optimistic MVCC’s performance degrades the fastest, followed by pessimistic MVCC and the single-version locking’s performance reduces only by 1.8%.

Screen Shot 2020-04-21 at 11.13.18 AM.png

Then the authors experimented with heterogeneous workload, fixing the multi-programming level and varying the percentage of short and long read-only transactions in the mix.

Starting with short read transactions, under low contention, single-version locking out-performed both MV until 90% read-only workload, at which point it was surpassed by both MV schemes. Under high contention MVCC schemes start to show more advantage as percentage of read-only transactions increases. When 80% of the transactions are read-only, two MVCC schemes achieve 63% and 73% high throughput than 1V.

If we mix in long-running read transactions instead, the advantage of MVCC schemes shows even more. 1V achieves twice the update throughput of the MV schemes when there is 0 long read transactions. However mixing in even one long read transaction, 1V’s update throughput drops by 75%, in contrast to only 5% drop of MV schemes. When 50% of the active transactions are long readers, MV has 80x high update throughput than 1V. In terms of throughput, MV schemes consistently outperform 1V as the gap widens as the number of long readers increases.

TATP benchmark simulates a telecommunications application. 80% of the transactions are queries, 16% of them are updates, 2% are inserts, and 2% are deletes. 1V achieved 4,220,119 txn/s, pessmisitic MV achieved 3,129,816 txn/s and optimistic MV achieved 3,121,494 txn/s.

I am only summarizing the performance results here, for detailed description of the two multi-version concurrency control implementations, I suggest that readers go to the original paper.

Serializable Snapshot Isolation in PostgreSQL

This paper documents how the authors implemented a serializable snapshot isolation in PostgreSQL, and out performed strict-2PL on some workloads, while incurring low cost (less than 7%) relative to snapshot isolation.

PostgreSQL, like many other database systems, use the weakest ANSI SQL isolation level: READ COMMITTED, by default in order to get good performance. The highest isolation level that PostgreSQL provides, and what the user gets when she asks for SERIALIZABLE, is actually SNAPSHOT ISOLATION. Although SNAPSHOT ISOLATION avoids all three anomalies (dirty reads, non-repeatable reads, and phantom reads) defined in the ANSI standard it stills allows some anomalies that are not commonly known, two of which are simple write skew, and batch processing anomalies. A descriptions of these two anomalies is in section 2.1 of the paper.

Although there are techniques that can be applied to avoid these anomalies under snapshot isolation, the analysis required to identify them is often costly, especially when they inherently concerns the interactions between multiple transactions. In comparison, SERIALIZABLE avoids all problems because transactions can be treated as if they are running serially in complete isolation.

Instead of implementing serializability using Strict 2 Phase Locking (S2PL), the authors implemented Serializable Snapshot Isolation (SSI). I will very briefly explain SSI here. Curious readers can go read the original paper. In short, SSI runs transactions using snapshot isolation, but performs additional checks to determine whether anomalies can possibly happen. SSI is similar to other concurrency control protocols that uses serialization graph testing. But instead of detecting cycles, SSI looks for “dangerous structures” (at least two adjacent rw-antidependency edges) in a serialization graph. Using this method, SSI may give some false positives, but it is more efficient, because it is less restrictive: it allow some rw-conflicts as long as they do not for a “dangerous structure”. Therefore SSI offers better performance compared to S2PL.

To identify rw-antidependencies, SSI have transactions acquire special “SIREAD” locks on the data they read. These “SIREAD” locks do not block conflicting writes, but a conflict between write lock and a “SIREAD” lock will be flagged as a rw-antidependency. Because conflicts can occur even after a reader commits, “SIREAD” locks need to persist until all concurrent transactions commit.

The authors also documented two optimizations, safe snapshot and deferrable transactions, that they implemented for read-only transactions in section 4 of the paper. Details of SSI implementation can be found in section 5.

SSI does have potential unbounded memory usage problem, because a transaction can hold a large number of locks, and cannot release them until all concurrent transactions commit. Other transaction states such as the rw-antidependencies used to check for dangerous structures may also be held for a long time. A single long-running transaction can prevent thousands of other transactions from being cleaned up. The solution was to bound the memory usage of SSI and gracefully degrade (with potentially higher false positive abort rate). Additionally, four techniques that can limit the memory usage were given in section 6.

The authors used three benchmarks to evaluate the performance of SSI: a modified TPC-C, the SIBENCH microbenchmark and the RUBiS web application benchmark. On the SIBENCH, SSI has an overhead of 10-20% for tracking read dependencies without the read-only optimizations. On the TPC-C benchmark with “credit check” transactions from TPC-C++ variant, SSI causes 5% slowdown relative to snapshot isolation with the read-only optimizations turned on. On the RUBiS benchmark, SSI archieved throughput of 422 req/s, compared to 435 req/s of snapshot isolation and 208 req/s of S2PL. Details of experimental setup and benchmarks are in section 8 of the paper.

In summary, SSI is a novel method of achieving serializability on top of snapshot isolation, without using S2PL. It achieved better performance than S2PL while paying a small cost relative to snapshot isolation.

OLTP Through the Looking Glass, and What We Found There

This paper from 2008 provides an instruction-level breakdown of the performance impact of each major component of a OLTP database, running a subset of TPC-C.

Since the invention of transactional databases in the 1970s and 1980s, a few things have changes:

  1. Hardware improvements. CPUs are faster so that each transaction now only takes a few microseconds to complete. Memory is getting cheaper so that most OLTP databases fit in memory.
  2. Modern internet applications and data intensive applications no long require full standard database features, and can live with varying levels of consistency, reliability, concurrency, replication and query-ability.

Some alternative database designs are now worth considering, such log-less databases, single-threaded databases, and transaction-less databases.

The authors started with a system called Shore, and progressively striped down each of the following components of it, to identify the overhead of each one. The benchmark they run were NewOrder and Payment transactions from TPC-C.

  • Logging
  • Locking
  • Latching
  • Buffer management

The detailed instruction count breakdown was given by figure 5 and figure 6 of the paper.

Screen Shot 2020-04-14 at 10.34.33 AM.png

Buffer management and locking operations are the most significant contributors to performance overhead, followed by logging and latching operations.

The authors concluded that unless one strips down all of these components, it is hard to get a main-memory optimized database that performs much better than a traditional database where most data fits into RAM. But if one does strip down all of the components, i.e, to design a single-threaded, in-memory, transaction-less databases that uses copy-state-over-network recovery, the performance will be orders of magnitude better than the system one starts with.

Staring into the Abyss: An Evaluation of Concurrency Control with One Thousand Cores

This paper from 2014 argued that DBMSs were not ready to take advantage of the on-chip parallelism offered by many-core machines. As the number of cores increases, the problem of concurrency control schemes becomes more challenging to deal with. The authors implemented the following 7 concurrency control schemes from two families:

  • Two-Phase Locking
    • 2PL with Deadlock Detection (DL_DETECT)
    • 2PL with Non-waiting Deadlock Prevention (NO_WAIT)
    • 2PL with Waiting Deadlock Prevention (WAIT_DIE)
  • Timestamp Ordering
    • Basic T/O (TIMESTAMP)
    • Multi-version Concurrency Control (MVCC)
    • Optimistic Concurrency Control (OCC)
    • T/O with Partition-level Locking (H-STORE)

and evaluated their performance and scalability under various OLTP workloads, using YCSB and TPC-C benchmarks. For details of the results please see the figures and Experimental Analysis section of the original paper. In summary, all of the seven current concurrency control schemes suffered from some kind of bottleneck when scaled to many cores, especially under contention. The different bottlenecks were summarized nicely in Table 2 of the paper.

Screen Shot 2020-04-13 at 11.32.26 AM.png

The authors thought that some of these bottlenecks could be addressed by new hardware support.

  • All T/O schemes suffer from the timestamp allocation bottleneck, which can be addressed by using synchronized clocks across CPU cores (currently only supported by Intel CPUs) or built-in hardware counter (no CPU currently supports this).
  • Memory-caused bottlenecks can be alleviated by hardware accelerators on the CPU that copies memory in the background (which eliminates the need to load all data through the CPU’s pipeline), and by using optimized memory allocation schemes.

Although distributed DBMSs can achieve better performance than a single-node DBMS, they suffer from a different bottleneck: the need of an Atomic commit protocol to support distributed transactions. The authors believe that a single many-core node with a large amount of DRAM might outperform a distributed DBMS for all but the largest OLTP applications.