PostgreSQL 9.0 High Performance

Filesystem crash recovery

Filesystem writes have two major components. At the bottom level, you are writing out blocks of data to the disk. In addition, there is some amount of filesystem metadata involved too. Examples of metadata include the directory tree, the list of blocks and attributes associated with each file, and the list of which blocks on disk are free.

Like many disk-oriented activities, filesystems have a very clear performance vs. reliability trade-off they need to make. The usual reliability concern is what happens when you're writing changes to a file and the power goes out in the middle.

Consider the case where you're writing out a new block to a file, one that makes the file bigger (rather than overwriting an existing block). You might do that in the following order:

  1. Write data block.
  2. Write file metadata referencing use of that block.
  3. Add data block to the list of used space metadata.

What happens if power goes out between steps 2 and 3 here? You now have a block that is used for something, but the filesystem believes it's still free. The next process that allocates a block for something is going to get that block, and now two files would refer to it. That's an example of a bad order of operations that no sensible filesystem design would use. Instead, a good filesystem design would:

  1. Add data block to the list of used space metadata.
  2. Write data block.
  3. Write file metadata referencing use of that block.

If there was a crash between 1 and 2 here, it's possible to identify the blocks that were marked as used, but not actually written to usefully yet. Simple filesystem designs do that by iterating over all the allocated disk blocks, reconciling the list of blocks that should be used or free against what's actually used. Examples of this include the fsck program used to validate simple UNIX filesystems and the chkdsk program used on FAT32 and NTFS volumes under Windows.
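
To make the difference between the two orderings concrete, here is a minimal sketch in Python. It is a toy model, not how any real filesystem or fsck is implemented: the "disk" is a dictionary, and every name in it (new_disk, extend_file, reconcile, the block number 7) is hypothetical. It simulates a power loss partway through each ordering and then runs a reconciliation pass of the kind described above.

```python
# Toy model of the two write orderings; all names are illustrative.
BAD_ORDER = ("data", "file_meta", "used_list")
GOOD_ORDER = ("used_list", "data", "file_meta")


def new_disk():
    # used:  blocks the free-space metadata marks as allocated
    # files: per-file list of block numbers (the file metadata)
    # data:  block number -> block contents
    return {"used": set(), "files": {"f": []}, "data": {}}


def extend_file(disk, block, order, crash_after=None):
    """Apply the three steps in `order`; stop after `crash_after` steps."""
    actions = {
        "data": lambda: disk["data"].update({block: b"payload"}),
        "file_meta": lambda: disk["files"]["f"].append(block),
        "used_list": lambda: disk["used"].add(block),
    }
    for done, name in enumerate(order, start=1):
        actions[name]()
        if done == crash_after:
            return  # simulated power loss


def reconcile(disk):
    """Compare the used list against what files reference, then repair it."""
    referenced = {b for blocks in disk["files"].values() for b in blocks}
    leaked = disk["used"] - referenced       # marked used, owned by no file
    unprotected = referenced - disk["used"]  # owned by a file, marked free
    disk["used"] = set(referenced)
    return leaked, unprotected


bad = new_disk()
extend_file(bad, 7, BAD_ORDER, crash_after=2)
print(reconcile(bad))   # nothing leaked; block 7 is in use but marked free

good = new_disk()
extend_file(good, 7, GOOD_ORDER, crash_after=1)
print(reconcile(good))  # block 7 is leaked; no file block is marked free
```

In this model the good ordering can only ever leak a block, which wastes space until the check runs but corrupts nothing. The bad ordering leaves a file's block available for reallocation, and repairing that after the fact only helps if no other file has claimed the block in the meantime.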

Journaling filesystems

The more modern approach is to use what's called a journal to improve this situation. A fully journaled write would look like this:

  1. Write transaction start metadata to the journal.
  2. Write used space metadata change to the journal.
  3. Write data block change to the journal.
  4. Write file metadata change to the journal.
  5. Add data block to the list of used space metadata.
  6. Write data block.
  7. Write file metadata referencing use of that block.
  8. Write transaction end metadata to the journal.

What this gets you is the ability to recover from any sort of crash the filesystem might encounter. If the final step was never reached for a given write transaction, the filesystem can either ignore (data block write) or undo (metadata write) any partially completed work that's part of that transaction. This lets you avoid long filesystem consistency checks after a crash, because you only need to replay any open journal entries to fix all the filesystem metadata. The time needed to do this is proportional to the size of the journal, whereas the runtime of the older filesystem checking routines is proportional to the size of the volume.
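
The recovery behavior just described can be sketched with another toy model in Python. This is an illustration under simplifying assumptions, not the on-disk format or recovery algorithm of any real journaling filesystem: the journal is a list of tuples, and the names journaled_write and recover are hypothetical. The eight steps follow the list above, and recovery looks only at the journal, never at the whole volume.

```python
def new_disk():
    # used: allocated blocks; files: file -> block list; data: block contents
    return {"used": set(), "files": {"f": []}, "data": {}}


def journaled_write(disk, journal, txid, block, crash_after=None):
    """Run the eight-step journaled write; stop after `crash_after` steps."""
    steps = [
        lambda: journal.append((txid, "start", None)),
        lambda: journal.append((txid, "used_list", block)),
        lambda: journal.append((txid, "data", block)),
        lambda: journal.append((txid, "file_meta", block)),
        lambda: disk["used"].add(block),
        lambda: disk["data"].update({block: b"payload"}),
        lambda: disk["files"]["f"].append(block),
        lambda: journal.append((txid, "end", None)),
    ]
    for done, step in enumerate(steps, start=1):
        step()
        if done == crash_after:
            return  # simulated power loss


def recover(disk, journal):
    """Undo metadata changes of any transaction that has no end record."""
    finished = {tx for tx, kind, _ in journal if kind == "end"}
    for tx, kind, block in journal:
        if tx in finished:
            continue
        if kind == "used_list":
            disk["used"].discard(block)
        elif kind == "file_meta" and block in disk["files"]["f"]:
            disk["files"]["f"].remove(block)
        # "data" records are ignored: stray block contents are harmless
        # once no metadata refers to the block.
    journal[:] = [rec for rec in journal if rec[0] in finished]


disk, journal = new_disk(), []
journaled_write(disk, journal, txid=1, block=7, crash_after=6)
recover(disk, journal)
print(disk["used"], disk["files"])  # metadata is back to its pre-write state
```

The loop in recover visits journal records only, which is why its cost scales with the journal rather than with the number of blocks on the volume.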

The first thing that should jump out at you here is that you're writing everything twice, plus the additional transaction metadata, so each update costs more than double the writes it would without journaling.

The second thing to note is more subtle. Journaling in the filesystem is nearly identical to how write-ahead logging works to protect database writes in PostgreSQL. So if you're using a fully journaled filesystem for the database, you're paying this overhead four times. Each change is written to the WAL, itself a journal, and those writes are journaled by the filesystem; the same change is later written to the database files on disk, and those writes are journaled too.
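
The arithmetic behind both observations can be written out as a short Python sketch. It is only a rough count under stated assumptions: every step in the lists above is treated as one write of equal cost, and a database change is assumed to reach the filesystem exactly twice (once through the WAL, once through the data files). Real systems batch and combine writes, so these are not measurements.

```python
# Step counts taken from the lists earlier in this section.
PLAIN_STEPS = 3      # ordered write without a journal
JOURNALED_STEPS = 8  # fully journaled write, including start/end records

# Assumption: each step costs one write of equal size.
fs_factor = JOURNALED_STEPS / PLAIN_STEPS
print(f"full journaling: {fs_factor:.2f}x the writes of a plain update")

# Assumption: one database change means one WAL write and one data file write.
DB_WRITES_PER_CHANGE = 2
# Full journaling stores each of those once in the journal and once in place.
FS_COPIES_PER_WRITE = 2

copies = DB_WRITES_PER_CHANGE * FS_COPIES_PER_WRITE
print(f"copies of each database change written: {copies}")
```

Under these assumptions the filesystem alone more than doubles the writes, and stacking it beneath the WAL gives the factor of four mentioned above.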

Since the overhead of full journaling is so high, few filesystems use it. Instead, the common situation is that only metadata writes are journaled, not the data block changes. This meshes well with PostgreSQL, where the database protects against data block issues but not filesystem metadata issues.