This post will discuss CWal: A Write Ahead log (WAL) implementation I wrote in C++ (tested on Ubuntu 22.04). Here is a direct link: https://github.com/sjay05/cwal.

In a database system, write ahead logs store operations in a sequential order within a log file which are flushed to the disk before any expensive database-wide commits are performed.

This preserves the durability of the data, as it is recoverable in the case of a power failure/system crash.

Log Entries:

Log entries consist of a byte_len representing the size of data in bytes, and a 4 byte CRC Checksum value (CRC_CHECKSUM).

struct LogEntry {
  uint64_t byte_length;
  std::string data;
};
----|-------------------|--------------------|-----------------------|-----
... | uint64_t byte_len | const string* data | uint32_t CRC_CHECKSUM | ... 
    | (8 bytes)         | (byte_len bytes)   | (4 bytes)             |
----|-------------------|--------------------|-----------------------|-----

fsync() system call:

In Linux systems, the traditional C-style write() or C++ std::ostream<CharT,Traits>::write() does not guarantee immediate write to disk due to Page Caches (or Disk Cache).

Page caches are implemented in order to optimize freqeuent system call operations that target the disk, and is held in the RAM. Thus, the disk updates are implemented with deferred evaluation, where they wait a few seconds for further read/write calls before the data is flushed to disk.

This is a volatile method of storage and will not work for Write Ahead Logs. However, using the fsync system call, we can force the OS to flush the page cache to the disk.

fsync(2) - Linux Manual Page

NAME
   fsync - sychronize a file's in-core state with that on disk

SYNOPSIS
   #include <unistd.h>

   int fsync(int fd);
   int fdatasync(int fd);

The fd (file descriptor) may be obtained by opening the file with open(). fsync() will return once all buffers have been flushed to permanent storage, which achieves durable storage for the Write Ahead Log.

This leads to several questions now. What is the exact expense of the fsync() syscall? Does every log append require a fsync or can they batched up into blocks and committed when ready?

Benchmarks

CWal uses two types of benchmarks. The first is benchmark_write(const int LOG_LENGTH, const int LOG_SIZE, bool RFLUSH, bool SYNC, const int SYNC_PERIOD), which appends a total of LOG_LENGTH logs with data of LOG_SIZE bytes each.

  • RFLUSH indicates if a routine flush will be performed after each append.
  • SYNC period indicates if a routine fsync, every SYNC_PERIOD operations will occur.

In particular, benchmark_write runs with $\mathcal{O}(\text{LOG_LENGTH} \cdot \text{LOG_SIZE})$ time. We set LOG_LENGTH * LOG_SIZE ~= 1e6.


Reg. Benchmark: 1000 entries | data_length = 1000 | Flush? No | Sync? No
==> 2931 ms | 2.93 ms/log

Reg. Benchmark: 1000 entries | data_length = 1000 | Flush? Yes | Sync? No
==> 3556 ms | 3.56 ms/log

Reg. Benchmark: 1000 entries | data_length = 1000 | Flush? No | Sync? Yes | SYNC_PERIOD = 1
==> 107601 ms | 107.60 ms/log

Reg. Benchmark: 1000 entries | data_length = 1000 | Flush? No | Sync? Yes | SYNC_PERIOD = 10
==> 63919 ms | 63.92 ms/log

It is clear that fsync() is very expensive, and causes a $35\%$ increase in time. However it is unreasonable that a database would wait for 1000 log entries to be flushed before it’s state is modified. So, we can try to batch the logs into segments of set size.

At the end of each batch, CWal would start overwriting over previous log, as this is more time efficient that truncating the file. The database would also then commit it’s changes to the disk, so the previous logs would not be required anymore.

The function batched_sync_benchmarks(const int LOG_LENGTH, const int LOG_SIZE, cont int BATCH_SIZE) has one new argument, BATCH_SIZE. We stick with the same specifications of LOG_LENGTH and LOG_SIZE. Two BATCH_SIZE’s we can experiment with are $\sqrt{\text{LOG_LENGTH}}$ and $\sqrt[3]{\text{LOG_LENGTH}}$.

Batched Sync Benchmark: 1000 entries | data_length = 1000 | BATCH_SIZE = 31
==> 98265 ms | 98.27 ms/log

Batched Sync Benchmark: 1000 entries | data_length = 1000 | BATCH_SIZE = 10
==> 101750 ms | 101.75 ms/log

We can see slight improvements with the times of the 3rd Regular Benchmark.

Hence, with continued tweaking a user can dictate how often the write ahead log performs fsync() operations, and batch the logs appropriately.

Extensions for CWal

  • Asynchronous write ahead logging
  • Copies of log files for redundancy
  • Experiment with memory mapped files for WalReader IO performance