Scattered notes on MariaDB redo log internals and file checkpoints.

07 Jul 2025

Redo log structure

The database log (aka redo log) ensures data integrity after a crash or unexpected shutdown. It is a write-ahead log: changes are recorded in it before the modified pages are written to the data files, so that they can be replayed during recovery. The redo log is stored in ib_logfileX files (a single ib_logfile0 in current versions). Each file consists of a header followed by a ring buffer of log items. The header occupies the first START_OFFSET = FIRST_LSN = 12288 (0x3000) bytes: a 512-byte block, reserved space and two checkpoint blocks. The first 512 bytes serve format upgrades and downgrade prevention and contain:

+-------------------------------------------------------------------------------+
|                                 Redo Log Header (512 bytes)                   |
+-------------------------------------------------------------------------------+
| Offset   | Length | Field Name                          | Description         |
|----------|--------|-------------------------------------|---------------------|
| 0x0000   |   4    | Format Version Identifier           | Log format version  |
| 0x0008   |   8    | Start LSN                           | First LSN           |
| 0x0010   | max 32 | Creator String                      | eg "MariaDB 11.6.2" |
| 0x01FC   |   4    | CRC-32C Checksum                    | Header checksum     |
+-------------------------------------------------------------------------------+
| 0x0200   | 3584   | Stub Area                           | Reserved space      |
+-------------------------------------------------------------------------------+
|                                 Redo Log Checkpoint Markers                   |
+-------------------------------------------------------------------------------+
| 0x1000   |   8    | Checkpoint LSN #1                   | 1st checkpoint LSN  |
|----------|--------|-------------------------------------|---------------------|
| 0x2000   |   8    | Checkpoint LSN #2                   | 2nd checkpoint LSN  |
+-------------------------------------------------------------------------------+
|                                 Mini-Transactions Log                         |
+-------------------------------------------------------------------------------+
| 0x3000   | ...    | Log of Mini-Transactions chains     | Ring-Buffer of MTRs |
+-------------------------------------------------------------------------------+

Here the Checkpoint LSN entries hold the coordinates of the two most recent checkpoints in the redo log, together with the known end of the log. The checkpoint itself is normally a single MTR with the FILE_CHECKPOINT op type in the log; the header block that points to it holds the FILE_CHECKPOINT LSN, the redo log end LSN and a checksum:

Version 10.8 (0x5068_7973) and later.

+---------------------------------+
|          Checkpoint LSN         |
+---------------------------------+
| 0x0  | 8  | FILE_CHECKPOINT LSN |
+---------------------------------+
| 0x8  | 8  | Redo Log End LSN    |
+---------------------------------+
| 0x10 | 44 | Reserved            |
+---------------------------------+
| 0x3C | 4  | CRC-32C Checksum    |
+---------------------------------+

64 bytes total.

More details at https://jira.mariadb.org/browse/MDEV-14425.
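
The two layouts above can be read with a few lines of code. This is a minimal sketch, not the MariaDB implementation: it assumes all integers are big-endian (as mach_read_from_8() reads them), it does not verify the CRC-32C fields, and make_sample_file() is a hypothetical helper that fabricates a header with the values from the dumps in these notes and zeroed checksums.

```python
import struct

CHECKPOINT_1 = 0x1000   # offset of the first checkpoint block
CHECKPOINT_2 = 0x2000   # offset of the second checkpoint block
START_OFFSET = 0x3000   # 12288, first byte of the ring buffer


def parse_header(buf: bytes) -> dict:
    """Decode the 512-byte header block (big-endian is an assumption)."""
    (version,) = struct.unpack_from(">I", buf, 0x0000)
    (first_lsn,) = struct.unpack_from(">Q", buf, 0x0008)
    # The creator string is at most 32 bytes and NUL-padded.
    creator = buf[0x0010:0x0030].split(b"\0", 1)[0].decode("ascii")
    (crc,) = struct.unpack_from(">I", buf, 0x01FC)
    return {"version": version, "first_lsn": first_lsn,
            "creator": creator, "crc": crc}


def parse_checkpoint(buf: bytes, offset: int) -> dict:
    """Decode one 64-byte checkpoint block located at `offset`."""
    checkpoint_lsn, end_lsn = struct.unpack_from(">QQ", buf, offset)
    (checksum,) = struct.unpack_from(">I", buf, offset + 0x3C)
    return {"checkpoint_lsn": checkpoint_lsn, "end_lsn": end_lsn,
            "checksum": checksum}


def make_sample_file() -> bytes:
    """Hypothetical helper: a synthetic header, checksums left as zero."""
    buf = bytearray(START_OFFSET)
    struct.pack_into(">I", buf, 0x0000, 0x50687973)
    struct.pack_into(">Q", buf, 0x0008, 12288)
    buf[0x0010:0x0010 + 14] = b"MariaDB 11.6.2"
    struct.pack_into(">QQ", buf, CHECKPOINT_1, 60024, 60024)
    struct.pack_into(">QQ", buf, CHECKPOINT_2, 60123, 60123)
    return bytes(buf)


sample = make_sample_file()
print(parse_header(sample))
print(parse_checkpoint(sample, CHECKPOINT_1))
print(parse_checkpoint(sample, CHECKPOINT_2))
```

To read a real file, pass the first 12288 bytes of the log file instead of the synthetic buffer.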

LSN to Redo log physical position

The redo log is a ring buffer with a header in front.

For an LSN at or beyond the header size, the physical position in the file is:

physical_position = START_OFFSET + (lsn - START_OFFSET) % RING_CAPACITY

where START_OFFSET is the fixed header size (12288 bytes) and RING_CAPACITY = FILE_SIZE - START_OFFSET.
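
As a quick check of the formula, here is a small sketch that applies it to LSNs from the dumps in these notes, with a 10 MiB log file. The function name lsn_to_offset is mine, not something from the MariaDB source.

```python
START_OFFSET = 12288  # fixed header size, also the first LSN


def lsn_to_offset(lsn: int, file_size: int) -> int:
    """Map an LSN to its byte position inside the redo log file."""
    if lsn < START_OFFSET:
        raise ValueError("LSN lies inside the header area")
    ring_capacity = file_size - START_OFFSET
    return START_OFFSET + (lsn - START_OFFSET) % ring_capacity


FILE_SIZE = 10485760  # --innodb-log-file-size=10M

print(FILE_SIZE - START_OFFSET)            # 10473472, the ring capacity
print(lsn_to_offset(60123, FILE_SIZE))     # 60123, no wrap-around yet
print(lsn_to_offset(10553265, FILE_SIZE))  # 79793, wrapped once
print(lsn_to_offset(11344877, FILE_SIZE))  # 871405, wrapped once
```

Before the first wrap-around the LSN and the file offset coincide, which is why the clean-shutdown example can be read directly at position 60123.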

Log items or mini-transactions (mtr_t) chains

A mini-transaction (mtr_t) is a unit of work in the InnoDB storage engine. In these notes, a single MTR represents one data modification operation, such as a memset of X bytes.

Mini-Transaction Record (mtr_t)

+-------+
| mtr_t |
+-------+

An MTR chain is a sequence of mini-transactions that are grouped together to form a single logical operation. In MariaDB itself, a mini-transaction is the whole stream of records, but for simplicity these notes split the concept into two entities: the MTR (one record) and the MTR chain.

MTR Chain

+-------+-------+-------+-------+---+----------+
| mtr_t | mtr_t | mtr_t | ..... |gen| checksum |
+-------+-------+-------+-------+---+----------+

Redo log items are chains of mini-transactions (mtr_t objects) that describe changes made to tablespaces or carry special metadata. The first byte of an MTR record contains the record type, flags, and the length if it is less than 16. It is usually followed by a tablespace ID and page number and then the record payload. The last record in the chain is followed by a termination byte and a CRC-32C checksum.

+-------------------------------+
|   Mini-Transaction Structure  |
+-------------------------------+
| First Byte                    |
| - Bit  7: same page as prev   |
| - Bits 6..4: redo log type    |
| - Bits 3..0: payload len      |
+-------------------------------+
| Optional Length Bytes (0-3)   |
| - Only present if len > 15    |
+-------------------------------+
| Optional Tablespace ID/Page # |
| - if same_page=0              |
| - or FILE_ records            |
| - or FREE_PAGE / INIT_PAGE    |
| - or EXTENDED + Subtype       |
+-------------------------------+
| Subtype Code (Optional)       |
| - if EXTENDED and OPTION      |
+-------------------------------+
| Record Payload (Variable)     |
| - Depends on redo log type    |
+-------------------------------+
| ... other MTR in the chain .. |
+-------------------------------+
| Terminator Byte               |
| - Either 0x00 or 0x01         |
+-------------------------------+
| CRC32C checksum               |
+-------------------------------+

Examples of mini-transaction (mtr) record types:

FREE_PAGE (0): corresponds to MLOG_INIT_FREE_PAGE
INIT_PAGE (1): corresponds to MLOG_INIT_FILE_PAGE2
EXTENDED (2): extended record; followed by subtype code @see mrec_ext_t
WRITE (3): replaces MLOG_nBYTES, MLOG_WRITE_STRING, MLOG_ZIP_*
MEMSET (4): extends the 10.4 MLOG_MEMSET record
MEMMOVE (5): copy data within the page (avoids logging redundant data)
…
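
The first-byte layout from the structure table can be unpacked with plain bit operations. A minimal sketch, assuming the bit positions given there; decode_first_byte and RECORD_TYPES are illustrative names, and the table only covers the type codes listed above (anything else is reported by number).

```python
# Type codes listed in these notes; the list in the source is longer.
RECORD_TYPES = {
    0: "FREE_PAGE",
    1: "INIT_PAGE",
    2: "EXTENDED",
    3: "WRITE",
    4: "MEMSET",
    5: "MEMMOVE",
}


def decode_first_byte(first: int) -> dict:
    """Split the first byte of a record into flag, type code and length."""
    type_code = (first >> 4) & 0x07
    return {
        "same_page": bool(first & 0x80),   # bit 7
        "type_code": type_code,            # bits 6..4
        "type_name": RECORD_TYPES.get(type_code, "type %d" % type_code),
        "length": first & 0x0F,            # bits 3..0, bytes to follow
    }


for value in (0x12, 0x36, 0xB5):
    print(hex(value), decode_first_byte(value))
```

For 0x12 this gives type code 1 with 2 bytes to follow, for 0x36 type code 3 with 6 bytes to follow, and for 0xb5 the same-page flag plus type code 3 with 5 bytes to follow.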

The termination byte is 0x00 or 0x01, depending on the generation of the LSN wrap-around, that is, on how many times the ring buffer has wrapped. See log_sys.get_sequence_bit() for details:

!(((lsn - first_lsn) / capacity()) & 1)
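
The same expression transcribed into a small function, evaluated for the two FILE_CHECKPOINT positions that appear in the dumps in these notes (first_lsn = 12288, capacity = 10473472). This is a sketch of the arithmetic only; sequence_bit is my name for it.

```python
def sequence_bit(lsn: int, first_lsn: int, capacity: int) -> int:
    """Termination byte expected at `lsn`: 1 on even generations, 0 on odd."""
    generation = (lsn - first_lsn) // capacity
    return int(not (generation & 1))


FIRST_LSN = 12288
CAPACITY = 10473472

print(sequence_bit(60123, FIRST_LSN, CAPACITY))     # 1, generation 0
print(sequence_bit(10553265, FIRST_LSN, CAPACITY))  # 0, generation 1
```

Both results agree with the marker values printed by the parser for those chains (marker: 1 after the clean shutdown, marker: 0 after the kill).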

Redo log record examples

INIT could be logged as 0x12 0x34 0x56, meaning “type code 1 (INIT), 2 bytes to follow” and “tablespace ID 0x34”, “page number 0x56”. The first byte must be between 0x12 and 0x1a, and the total length of the record must match the lengths of the encoded tablespace ID and page number.

WRITE could be logged as 0x36 0x40 0x57 0x60 0x12 0x34 0x56, meaning “type code 3 (WRITE), 6 bytes to follow” and “tablespace ID 0x40”, “page number 0x57”, “byte offset 0x60”, data 0x34,0x56.

A subsequent WRITE to the same page could be logged 0xb5 0x7f 0x23 0x34 0x56 0x78, meaning “same page, type code 3 (WRITE), 5 bytes to follow”, “byte offset 0x7f”+0x60+2, bytes 0x23,0x34,0x56,0x78.

The end of the mini-transaction would be indicated by the end byte 0x00 or 0x01; @see log_sys.get_sequence_bit(). If log_sys.is_encrypted(), that is followed by 8 bytes of nonce (part of initialization vector). That will be followed by 4 bytes of CRC-32C of the entire mini-transaction, excluding the end byte.

Redo log file checkpoint record

FILE_CHECKPOINT is a special mini-transaction record that marks the end of a checkpoint in the redo log. It indicates that all changes up to the checkpoint LSN have been flushed to disk and that the database is consistent as of that point. The record is written at the end of the checkpoint process and contains the checkpoint LSN, followed by the termination marker and a checksum that protects the record. The end LSN of the redo log is not part of the record; it is stored in the checkpoint block of the file header.

It has the following structure:

0xFA, // FILE_CHECKPOINT type + payload length 10 (excludes this first byte and the termination marker)
0x00, 0x00, // tablespace id + page no (0x0000 for FILE_CHECKPOINT)
0xXX, 0xXX, 0xXX, 0xXX, 0xXX, 0xXX, 0xXX, 0xXX, // 8 bytes checkpoint LSN
0x0X, // termination marker (0x00 or 0x01)
0xXX, 0xXX, 0xXX, 0xXX, // 4 bytes CRC-32C checksum
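
To make the 16-byte layout concrete, here is a sketch that assembles such a record. Several things in it are assumptions rather than facts taken from the MariaDB source: the integers are written big-endian, the checksum is the standard CRC-32C computed over the 11 bytes before the termination marker (following the "excluding the end byte" description above), and file_checkpoint_record is an illustrative name. I have not compared its output with the checksums in the dumps.

```python
import struct


def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def file_checkpoint_record(checkpoint_lsn: int, marker: int) -> bytes:
    """Build a 16-byte FILE_CHECKPOINT record (layout as described above)."""
    if marker not in (0, 1):
        raise ValueError("termination marker must be 0x00 or 0x01")
    # First byte, tablespace id, page no, then the 8-byte checkpoint LSN.
    body = bytes([0xFA, 0x00, 0x00]) + struct.pack(">Q", checkpoint_lsn)
    # Assumption: the checksum covers `body` only, not the marker.
    return body + bytes([marker]) + struct.pack(">I", crc32c(body))


record = file_checkpoint_record(60123, 1)
print(len(record))   # 16
print(record.hex())
```

The length comes out as 3 + 8 + 1 + 4 = 16 bytes, which is the record size the parser reports for the FILE_CHECKPOINT chain (len=16).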

Redo log file operations

Among the mtr_t record types there are also file-related (FILE_*) operations: FILE_CREATE, FILE_DELETE, FILE_RENAME, FILE_MODIFY, and FILE_CHECKPOINT. The most important one for checkpointing is FILE_CHECKPOINT, described in the previous section. Its MTR size is defined as SIZE_OF_FILE_CHECKPOINT in mtr0types.h and is 16 bytes (3 bytes for type and page ID, 8 bytes for the LSN, 1 byte for the end byte, and 4 bytes for the checksum).

Example of MariaDB shutdown with `--innodb_fast_shutdown=0`

Header block: 12288
Size: 10485760, Capacity: 10473472
RedoHeader {
    version: 1349024115,
    first_lsn: 12288,
    creator: "MariaDB 11.6.2",
    crc: 224651864,
}
RedoCheckpointCoordinate {
    checkpoints: [
        RedoHeaderCheckpoint {
            checkpoint_lsn: 60024,
            end_lsn: 60024,
            checksum: 1691141185,
        },
        RedoHeaderCheckpoint {
            checkpoint_lsn: 60123,
            end_lsn: 60123,
            checksum: 3550697051,
        },
    ],
    checkpoint_lsn: Some(
        60123,
    ),
    end_lsn: 60123,
    encrypted: false,
    version: 1349024115,
    start_after_restore: false,
}
MTR Chain count=1, len=16, lsn=60123
  1: Mtr { space_id: 0, page_no: 0, op: FileCheckpoint }
Checkpoint LSN/1: RedoHeaderCheckpoint { checkpoint_lsn: 60024, end_lsn: 60024, checksum: 1691141185 }
Checkpoint LSN/2: RedoHeaderCheckpoint { checkpoint_lsn: 60123, end_lsn: 60123, checksum: 3550697051 }
File checkpoint chain: Some(MtrChain { lsn: 60123, len: 16, checksum: 3572919866, mtr: [Mtr { space_id: 0, page_no: 0, op: FileCheckpoint, file_checkpoint_lsn: Some(60123), marker: 1 }] })
File checkpoint LSN: 60123

You can see that checkpoint_lsn == end_lsn and that the FILE_CHECKPOINT record at position 60123 is the last record in the redo log: it is followed by a termination marker and no further valid MTRs.

Example of MariaDB shutdown with `pkill -9`

$ scripts/mariadb-install-db --datadir ./data --innodb-log-file-size=10M
$ bin/mariadbd --datadir ./data --innodb_fast_shutdown=0 --innodb-log-file-size=10M

$ mycli -S /tmp/mysql.sock
> CREATE TABLE a (id int not null auto_increment primary key, t TEXT);
> SET max_recursive_iterations = 1000000;
> INSERT INTO a (t)
  WITH RECURSIVE fill(n) AS (
    SELECT 1 UNION ALL SELECT n + 1 FROM fill WHERE n < 60500
  )
  SELECT RPAD(CONCAT(FLOOR(RAND()*1000000)), 64, 'x') FROM fill;
$ pkill -9 mariadbd
$ cargo run -- --log-group-path data

Header block: 12288
Size: 10485760, Capacity: 10473472
RedoHeader {
    version: 1349024115,
    first_lsn: 12288,
    creator: "MariaDB 11.6.2",
    crc: 224651864,
}
RedoCheckpointCoordinate {
    checkpoints: [
        RedoHeaderCheckpoint {
            checkpoint_lsn: 6880644,
            end_lsn: 9694174,
            checksum: 1144991502,
        },
        RedoHeaderCheckpoint {
            checkpoint_lsn: 9691474,
            end_lsn: 10553265,
            checksum: 2431378773,
        },
    ],
    checkpoint_lsn: Some(
        9691474,
    ),
    checkpoint_no: Some(
        0,
    ),
    end_lsn: 10553265,
    encrypted: false,
    version: 1349024115,
    start_after_restore: false,
}
MTR Chain count=4, len=27, lsn=9691474
  1: Mtr { space_id: 8, page_no: 76, op: Memset }
  2: Mtr { space_id: 8, page_no: 76, op: Write }
  3: Mtr { space_id: 8, page_no: 76, op: Memset }
  4: Mtr { space_id: 8, page_no: 76, op: Option }
...
MTR Chain count=13, len=89, lsn=11344877
  1: Mtr { space_id: 3, page_no: 4, op: Write }
  2: Mtr { space_id: 3, page_no: 4, op: Write }
  ...
  10: Mtr { space_id: 3, page_no: 2, op: Memset }
  11: Mtr { space_id: 3, page_no: 4, op: Option }
  12: Mtr { space_id: 3, page_no: 2, op: Option }
  13: Mtr { space_id: 3, page_no: 0, op: Option }
Checkpoint LSN/1: RedoHeaderCheckpoint { checkpoint_lsn: 6880644, end_lsn: 9694174, checksum: 1144991502 }
Checkpoint LSN/2: RedoHeaderCheckpoint { checkpoint_lsn: 9691474, end_lsn: 10553265, checksum: 2431378773 }
File checkpoint chain: Some(MtrChain { lsn: 10553265, len: 31, checksum: 2542014928, mtr: [Mtr { space_id: 8, page_no: 0, op: FileModify, file_checkpoint_lsn: None, marker: 0 }, Mtr { space_id: 0, page_no: 0, op: FileCheckpoint, file_checkpoint_lsn: Some(9691474), marker: 0 }, Mtr { space_id: 5411, page_no: 6, op: Option, file_checkpoint_lsn: None, marker: 0 }, Mtr { space_id: 252, page_no: 0, op: FreePage, file_checkpoint_lsn: None, marker: 0 }, Mtr { space_id: 252, page_no: 0, op: Extended, file_checkpoint_lsn: None, marker: 0 }, Mtr { space_id: 8, page_no: 252, op: Memset, file_checkpoint_lsn: None, marker: 0 }, Mtr { space_id: 8, page_no: 252, op: Write, file_checkpoint_lsn: None, marker: 0 }, Mtr { space_id: 8, page_no: 252, op: Memset, file_checkpoint_lsn: None, marker: 0 }, Mtr { space_id: 8, page_no: 252, op: Option, file_checkpoint_lsn: None, marker: 0 }] })
File checkpoint LSN: 9691474
WARNING: checkpoint LSN is not at the end of the log.

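Here checkpoint_lsn (9691474) differs from end_lsn (10553265). The MTR chain at 9691474 is an ordinary page modification, not a FILE_CHECKPOINT, and the FILE_CHECKPOINT record that refers to 9691474 sits at 10553265 with further records in the same chain, so the log was not closed cleanly.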

Source refs

In the source code, the redo log subsystem resides in:

storage/innobase/include/log0log.h - log system interface (log_t) (writer)
storage/innobase/include/log0recv.h - recovery system interface (recv_sys_t) (reader)
storage/innobase/include/mtr0types.h - mini-transaction types (mtr_log_t, mrec_type_t, …)
storage/innobase/include/mtr0log.h - mini-transaction log record encoding and decoding
storage/innobase/include/mtr0mtr.h - mini-transaction interface (mtr_t)

and the corresponding .cc implementation files, where

recv_sys - the recovery system for InnoDB redo logs.
log_sys - the redo log system for InnoDB.
mtr_t - a mini-transaction (mtr) for InnoDB redo logs.

log_t:
  /** latest completed checkpoint (protected by latch.wr_lock()) */
  Atomic_relaxed<lsn_t> last_checkpoint_lsn;
  /** next checkpoint LSN (protected by latch.wr_lock()) */
  lsn_t next_checkpoint_lsn;

recv_sys_t:

  /** number of bytes in log_sys.buf */
  size_t len;
  /** start offset of non-parsed log records in log_sys.buf */
  size_t offset;
  /** log sequence number of the first non-parsed record */
  lsn_t lsn;
  /** log sequence number of the last parsed mini-transaction */
  lsn_t scanned_lsn;
  /** log sequence number at the end of the FILE_CHECKPOINT record, or 0 */
  lsn_t file_checkpoint;

Start of the data recovery process during boot

The boot process consists of several steps:

Listing existing redo log files.
Reading the initial checkpoint LSN coordinates from the redo log files.
Initiating crash recovery process during which the redo log is scanned and the changes are applied to the database.
If the redo log is empty after the last checkpoint, it indicates that the database was shut down cleanly, and no recovery is needed.

As far as I can see, when the checkpoint LSN != end LSN, MariaDB first checks the end of the log for records, and only then scans the log from the last checkpoint position.

storage/innobase/srv/srv0start.cc/srv_start() is the initialization point for the InnoDB storage engine. It first calls recv_recovery_read_checkpoint(), which in turn calls recv_sys_t::find_checkpoint() in storage/innobase/log/log0recv.cc to find the existing redo log files and read the initial checkpoint LSN coordinates from their predetermined locations:

#0  recv_sys_t::find_checkpoint (this)
    at storage/innobase/log/log0recv.cc:1602
#1  recv_recovery_read_checkpoint ()
    at storage/innobase/log/log0recv.cc:4510
#2  srv_start (create_new_db=false)
    at storage/innobase/srv/srv0start.cc:1418
#3  innodb_init (p)
    at storage/innobase/handler/ha_innodb.cc:4222
#4  ha_initialize_handlerton (plugin)
    at sql/handler.cc:697
#5  plugin_do_initialize (plugin, state)
    at sql/sql_plugin.cc:1455
#6  plugin_initialize (tmp_root, plugin, argc <remaining_argc>, argv, options_only)
    at sql/sql_plugin.cc:1508
#7  plugin_init (argc <remaining_argc>, argv, flags)
    at sql/sql_plugin.cc:1753
#8  init_server_components ()
    at sql/mysqld.cc:5324
#9  mysqld_main (argc, argv)
    at sql/mysqld.cc:6015
#10 main (argc, argv)
    at sql/main.cc:34

That finally does:

const lsn_t checkpoint_lsn{mach_read_from_8(buf)};
...
if (checkpoint_lsn >= log_sys.next_checkpoint_lsn)
{
  log_sys.next_checkpoint_lsn= checkpoint_lsn;
  log_sys.next_checkpoint_no= field == log_t::CHECKPOINT_1;
  lsn= end_lsn;
}

The only case when it is allowed to skip this step is when force_recovery is >= 6 (SRV_FORCE_NO_LOG_REDO).
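
The selection logic in the excerpt boils down to "the block with the highest checkpoint LSN wins, and its end LSN becomes the scan position". A simplified sketch of just that comparison: it skips everything hidden behind the "..." in the excerpt (checksum validation and so on), and pick_checkpoint is a name I made up for illustration.

```python
def pick_checkpoint(blocks):
    """Choose the newest of the header checkpoint blocks.

    `blocks` is a list of (checkpoint_lsn, end_lsn) tuples in file order.
    Returns (index, checkpoint_lsn, end_lsn) of the winner.
    """
    chosen = None
    next_checkpoint_lsn = 0
    lsn = 0
    for index, (checkpoint_lsn, end_lsn) in enumerate(blocks):
        # Same comparison as in the excerpt: a later block wins a tie.
        if checkpoint_lsn >= next_checkpoint_lsn:
            next_checkpoint_lsn = checkpoint_lsn
            lsn = end_lsn
            chosen = index
    return chosen, next_checkpoint_lsn, lsn


# Values from the `pkill -9` dump above.
print(pick_checkpoint([(6880644, 9694174), (9691474, 10553265)]))
# (1, 9691474, 10553265)
```

The result matches the parser output for that run: checkpoint_lsn 9691474 and end_lsn 10553265 come from the second block.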

How does MariaDB determine that no crash recovery is needed during startup?

Then srv0start.cc/srv_start() calls log0recv.cc/recv_recovery_from_checkpoint_start(), which in turn calls log0recv.cc/recv_scan_log(false) to determine the state of the redo log. recv_scan_log(last_phase) scans the redo log and stores records in the parsing buffer, but first it tries to determine recv_sys_t::file_checkpoint, the log sequence number (LSN) of the last checkpoint (at the end of the last FILE_CHECKPOINT record).

If the checkpoint is found and the log is empty after it, the database was shut down cleanly and no recovery is needed. This is determined by checking that the FILE_CHECKPOINT entry found is the last entry in the log, with no records after it:

  if (c == log_sys.next_checkpoint_lsn)
  {
    /* There can be multiple FILE_CHECKPOINT for the same LSN. */
    if (file_checkpoint)
      continue;
    file_checkpoint= lsn;
    return GOT_EOF;
  }

recv_scan_log() may set recv_needed_recovery = true if something goes wrong.

Additionally, the invariants are checked at recv_sys.validate_checkpoint().

What does the FILE_CHECKPOINT record look like in the log?

In the log itself, FILE_CHECKPOINT is the 16-byte record shown in the "Redo log file checkpoint record" section. The header checkpoint block that points to it is the 64-byte structure shown in the "Redo log structure" section (version 10.8, 0x5068_7973, and later): the FILE_CHECKPOINT LSN at 0x0, the redo log end LSN at 0x8, 44 reserved bytes at 0x10 and a CRC-32C checksum at 0x3C.

FILE_CHECKPOINT correctness rules for correct shutdown

The FILE_CHECKPOINT record must be the last record in the redo log. This is checked by verifying that the checkpoint LSN entries point to the end of the redo log (end LSN == FILE_CHECKPOINT LSN).
The FILE_CHECKPOINT LSN must not be less than any tablespace page modification LSN.
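
Both rules can be written down as a small predicate. A sketch under my own assumptions: looks_like_clean_shutdown and the max_page_lsn parameter (the highest modification LSN found on any tablespace page) are hypothetical names, and the real checks in MariaDB are more involved than this.

```python
def looks_like_clean_shutdown(checkpoint_lsn, end_lsn, max_page_lsn=None):
    """Apply the two rules above to already-parsed values."""
    # Rule 1: the checkpoint must point at the end of the redo log.
    if checkpoint_lsn != end_lsn:
        return False
    # Rule 2: no page may carry a modification newer than the checkpoint.
    if max_page_lsn is not None and max_page_lsn > checkpoint_lsn:
        return False
    return True


print(looks_like_clean_shutdown(60123, 60123))       # True
print(looks_like_clean_shutdown(9691474, 10553265))  # False
```

The first call uses the values from the --innodb_fast_shutdown=0 run and the second those from the `pkill -9` run.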

Where is the FILE_CHECKPOINT record written?

TODO

Undo log scan

TODO

Wrap around and alignment

TODO

@ g Ivan Prisyazhnyy on software engineering
