On WALs (Write Ahead Logs) and fsync()
This post will discuss CWal: A Write Ahead log (WAL) implementation I wrote in C++ (tested on Ubuntu 22.04). Here is a direct link: https://github.com/sjay05/cwal.
In a database system, write ahead logs store operations in a sequential order within a log file which are flushed to the disk before any expensive database-wide commits are performed.
This preserves the durability of the data, as it is recoverable in the case of a power failure/system crash.
Log Entries:
Log entries consist of a byte_len
representing the size of data
in bytes, and a 4 byte CRC Checksum value (CRC_CHECKSUM
).
struct LogEntry {
uint64_t byte_length;
std::string data;
};
----|-------------------|--------------------|-----------------------|-----
... | uint64_t byte_len | const string* data | uint32_t CRC_CHECKSUM | ...
| (8 bytes) | (byte_len bytes) | (4 bytes) |
----|-------------------|--------------------|-----------------------|-----
fsync() system call:
In Linux systems, the traditional C-style write()
or C++ std::ostream<CharT,Traits>::write()
does not guarantee immediate write to disk due to Page Caches (or Disk Cache).
Page caches are implemented in order to optimize freqeuent system call operations that target the disk, and is held in the RAM. Thus, the disk updates are implemented with deferred evaluation, where they wait a few seconds for further read/write
calls before the data is flushed to disk.
This is a volatile method of storage and will not work for Write Ahead Logs. However, using the fsync
system call, we can force the OS to flush the page cache to the disk.
fsync(2) - Linux Manual Page
NAME
fsync - sychronize a file's in-core state with that on disk
SYNOPSIS
#include <unistd.h>
int fsync(int fd);
int fdatasync(int fd);
The fd
(file descriptor) may be obtained by opening the file with open()
. fsync()
will return once all buffers have been flushed to permanent storage, which achieves durable storage for the Write Ahead Log.
This leads to several questions now. What is the exact expense of the fsync()
syscall? Does every log append require a fsync
or can they batched up into blocks and committed when ready?
Benchmarks
CWal uses two types of benchmarks. The first is benchmark_write(const int LOG_LENGTH, const int LOG_SIZE, bool RFLUSH, bool SYNC, const int SYNC_PERIOD)
, which appends a total of LOG_LENGTH
logs with data of LOG_SIZE
bytes each.
RFLUSH
indicates if a routine flush will be performed after each append.SYNC
period indicates if a routine fsync, everySYNC_PERIOD
operations will occur.
In particular, benchmark_write
runs with $\mathcal{O}(\text{LOG_LENGTH} \cdot \text{LOG_SIZE})$ time. We set LOG_LENGTH * LOG_SIZE ~= 1e6
.
Reg. Benchmark: 1000 entries | data_length = 1000 | Flush? No | Sync? No
==> 2931 ms | 2.93 ms/log
Reg. Benchmark: 1000 entries | data_length = 1000 | Flush? Yes | Sync? No
==> 3556 ms | 3.56 ms/log
Reg. Benchmark: 1000 entries | data_length = 1000 | Flush? No | Sync? Yes | SYNC_PERIOD = 1
==> 107601 ms | 107.60 ms/log
Reg. Benchmark: 1000 entries | data_length = 1000 | Flush? No | Sync? Yes | SYNC_PERIOD = 10
==> 63919 ms | 63.92 ms/log
It is clear that fsync()
is very expensive, and causes a $35\%$ increase in time. However it is unreasonable that a database would wait for 1000
log entries to be flushed before it’s state is modified. So, we can try to batch the logs into segments of set size.
At the end of each batch, CWal would start overwriting over previous log, as this is more time efficient that truncating the file. The database would also then commit it’s changes to the disk, so the previous logs would not be required anymore.
The function batched_sync_benchmarks(const int LOG_LENGTH, const int LOG_SIZE, cont int BATCH_SIZE)
has one new argument, BATCH_SIZE
. We stick with the same specifications of LOG_LENGTH
and LOG_SIZE
. Two BATCH_SIZE
’s we can experiment with are $\sqrt{\text{LOG_LENGTH}}$ and $\sqrt[3]{\text{LOG_LENGTH}}$.
Batched Sync Benchmark: 1000 entries | data_length = 1000 | BATCH_SIZE = 31
==> 98265 ms | 98.27 ms/log
Batched Sync Benchmark: 1000 entries | data_length = 1000 | BATCH_SIZE = 10
==> 101750 ms | 101.75 ms/log
We can see slight improvements with the times of the 3rd Regular Benchmark.
Hence, with continued tweaking a user can dictate how often the write ahead log performs fsync()
operations, and batch the logs appropriately.
Extensions for CWal
- Asynchronous write ahead logging
- Copies of log files for redundancy
- Experiment with memory mapped files for WalReader IO performance