
Linux is a particularly good example to start with for discussing filesystem implementation trade-offs, because it provides all of the common options among its many available filesystems.
The oldest Linux filesystem still viable for use now, ext2, does not have any journaling available. Therefore, any system that uses it is vulnerable to long fsck recovery times after a crash, which makes it unsuitable for many purposes. You should not put a database volume on ext2. While that might work in theory, there are many known situations, such as a user error made during the quite complicated fsck process, that can break the write ordering guarantees expected by the database.
Because not journaling any writes is faster, ext2 volumes are sometimes used for PostgreSQL WAL volumes, which require very little in the way of data guarantees. Also, WAL volumes tend to be small filesystems without many files on them, and are thus fairly quick to run fsck on. However, this does introduce the possibility of the automatic fsck check failing on reboot after a crash. That failure requires user intervention, and will therefore keep the server from starting fully.
Rather than presuming that you need to start with ext2, a sensible approach is to start with standard ext3, switch to writeback ext3 if the WAL disk is not keeping up with its load, and only consider dropping to ext2 if that too continues to lag behind. Since WAL writes are sequential, while those to the database are often random, it's much harder than you might think for the WAL to become a bottleneck on your system, presuming you've first put it onto its own disk(s). Using ext2 without proving it necessary therefore falls into the category of premature optimization: there is a downside, and you should expose yourself to it only once the WAL is a measured bottleneck.
ext3 adds a journal kept in a regular file on top of the ext2 filesystem. If the journal is empty (which will be the case on a clean server shutdown), you can even open an ext3 filesystem in ext2 mode. It's backward compatible with ext2, and it's possible to convert a volume in either direction: ext2 to ext3 or ext3 to ext2.
There are three levels of journaling available in ext3, specified as options when mounting the filesystem:
- data=writeback: Data changes are not journaled at all. Metadata changes are journaled, but the order in which they are written relative to the data blocks is not guaranteed. After a crash, files can have extra junk at their end from partially completed writes, and you might have a mix of old and new file data.
- data=ordered: Metadata is journaled, but data changes are not. However, in all cases the metadata writes only occur after the associated data has already been written, thus the name "ordered". After a crash, it is possible to get a mix of old and new data in a file, as in writeback mode, but you'll never end up with a file of incorrect length. The main difference in behavior compared to fully journaled writes appears when you're changing blocks already on disk, where there is a data change but no associated metadata change. In cases where the file is new, or the block being written is at the end and expands the size of the file, the behavior of ordered mode is functionally equivalent to journal.
- data=journal: Both file data and filesystem metadata are written to the journal before the main filesystem is touched.
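For example, to try full data journaling on a dedicated database volume, you could specify the mode in /etc/fstab; the device name and mount point here are hypothetical placeholders for whatever your system actually uses:

/dev/sdb1 /var/lib/pgsql ext3 defaults,data=journal 0 2

After the next mount of that volume, all data writes will pass through the journal before reaching the main filesystem.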
Choosing between these three isn't as easy as it might appear at first. You might think that writeback is the fastest mode in all cases. That's true, but particularly on systems with a functional write cache, the difference between it and ordered is likely to be quite small. There is little reason to prefer writeback from a performance perspective when your underlying disk hardware is good, which makes the exposure to risk even less acceptable. Amusingly, the usual situation in which writeback mode is said to be safe for PostgreSQL use is one where the underlying writes involve a battery-backed write cache, but that's exactly the situation in which the penalty of ordered mode is least expensive.
In short, there are very few reasons to ever use writeback for your main database. The main weakness of writeback mode, that files can sometimes be extended with garbage bytes, is not an issue for the WAL files. Those are both extended to their full size before use and written with checksums that reject incorrect data. Accordingly, writeback is an appropriate mode to consider for a filesystem that only holds WAL data. When pressed for speed improvements on WAL writes, writeback is preferred to ext2 from an integrity point of view, because the minimal metadata journaling you get in this mode will prevent long fsck recovery times after a crash.
For the database disks, the choices are ordered or journal. On simple benchmarks such as bonnie++ (more on this in the XFS section later), journal will sometimes lose as much as half its throughput (relative to raw disk speed) because of the double writing that the journal introduces. However, that does not mean journal will be half as fast for your database writes! Like the WAL, the journal is written out to a contiguous portion of disk, making it mostly sequential writes, and the subsequent updates to the data blocks are written later. It turns out that this behavior makes the journal model particularly well suited to situations where concurrent reads are mixed with writes, which is exactly the case with many database applications. In short, journal should not be immediately discarded as an option just because it delivers poor results on synthetic workloads. If you can assemble a reasonable simulation of your application running, including realistic multi-user behavior, it's worth trying journal mode in addition to ordered. You shouldn't start there though, because in simpler workloads ordered will be faster, and the additional integrity provided by journal mode is not something a PostgreSQL database requires.
Note that switching to a different mode for your root filesystem isn't as easy as changing the mount options in the /etc/fstab
file. Instead you'll need to edit the bootloader (typically grub) configuration file and add the change as a kernel option like the following:
rootflags=data=journal
This is necessary because the root drive is mounted before the fstab file is even consulted.
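Here's what that might look like as a full kernel line in the grub configuration file; the kernel version and root device shown are placeholders from a typical RHEL system, and yours will differ:

kernel /vmlinuz-2.6.18-128.el5 ro root=/dev/sda1 rootflags=data=journal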
One of the limitations of ext3 that is increasingly relevant with today's hard drive sizes is that on common Intel and AMD processors, an ext3 filesystem can only be 16 TB in size, and individual files are limited to 2 TB.
ext4, the evolutionary replacement for ext3, was announced as production quality as of Linux kernel 2.6.28. A variety of fixes involving delayed allocation were applied between that and version 2.6.30, but more importantly for PostgreSQL, some bugs involving fsync handling were not fully corrected until kernels 2.6.31.8/2.6.32.1. Kernel 2.6.32 is the first version that includes an ext4 implementation that should be considered for a production PostgreSQL database. This is the version used by both RHEL 6 and Ubuntu 10.04, the first long-term release from each vendor that includes ext4 support.
The 16 TB filesystem limit of ext3 theoretically does not exist for ext4, but as this is being written the associated mkfs
utility is still stuck at that limit. The ext4 Howto at https://ext4.wiki.kernel.org/index.php/Ext4_Howto is the definitive source for updates about progress in removing that limitation.
From the perspective of PostgreSQL, the main improvement of ext4 over ext3 is its better handling of write barriers and fsync operations. See the upcoming section on write barriers for more information.
Unlike ext3, XFS was designed by SGI to journal efficiently from its beginning, rather than having journaling added onto an existing filesystem. As you might predict, the result is a bit faster than ext3, some of which comes simply from better efficiency in the journal implementation. However, part of this speed results from the fact that XFS only journals metadata, and it doesn't even have an option to try to order the data writes relative to the metadata ones. Accordingly, XFS is most like ext3's writeback mode. One critical difference is that in situations where garbage blocks may have been written to a file, the main concern with ext3 writeback, the journal playback in XFS will instead zero out those entries, so they are unlikely to be interpreted as real data by an application. This is sufficient to keep PostgreSQL from being confused if it tries to read them, as a zeroed block won't have the right header to look like either a WAL block or a database data block. Some consider this a major flaw of XFS: it takes what could have been a perfectly good copy of a block and erases it. But given the way PostgreSQL does its writes, such damage will be sorted out by the WAL. Just make sure you have the full_page_writes configuration parameter turned on.
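To confirm that setting is active, you can check it from any database session:

$ psql -c "SHOW full_page_writes"

It defaults to on; should it ever have been disabled, the corresponding postgresql.conf line to restore it is:

full_page_writes = on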
How much faster is XFS? A simple test comparing ext3 in two of its modes against XFS shows it doing quite a bit better. The following table shows some simple bonnie++ 1.96 results, which are not indicative of the performance difference you'll see on general database applications. The results show the range appearing in three runs against a single SATA drive:
Filesystem          Sequential write (MB/s)  Sequential read (MB/s)
ext3 data=ordered   39-58                     44-72
ext3 data=journal   25-30                     49-67
XFS                 68-72                     72-77
It's clear that raw XFS performance has less variation than these two ext3 modes, and that its write performance is much faster. The writeback mode of ext3 wasn't included because it's not safe, due to its tendency to add non-zero garbage blocks to your files after a crash.
On systems with larger disk arrays, the delayed allocation feature of XFS, and its associated I/O scheduling, aim to produce larger writes that are better aligned with RAID stripe sizes. Combine delayed allocation with a generally more efficient design, and XFS's performance advantage over ext3 can look really compelling on huge volumes. To further improve performance on larger systems, there are several options available for adjusting XFS memory use, such as the in-memory size of the metadata journal, as well as what its target pre-allocation size should be.
XFS has historically not been popular among Linux users or distributors. However, XFS easily handles files of over a million terabytes, making it the primary Linux solution currently available for files greater than 16 TB. This has resulted in XFS having a comeback of sorts in recent enterprise-oriented Linux distributions, where that file limit is one that administrators are increasingly likely to run into. Recently RedHat's RHEL 5.4 release added preliminary XFS support specifically to address that need, and their RHEL 6 release treats it as a fully supported filesystem on par with ext3 and ext4.
Since it defaults to using write barriers (described in more detail later), XFS is also paranoid about drives with volatile write caches losing writes. To prevent that, it is aggressive in sending drive cache flush commands to the underlying disks. This is what you want from a PostgreSQL database running on XFS using regular hard drives that have their write cache turned on. However, if you have a non-volatile write cache, such as a battery-backed write controller, this cache flushing is wasteful. In that case, it's recommended to use the nobarrier
option when mounting the XFS filesystem in order to disable its cache flushing.
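For instance, with a battery-backed controller in place, a hypothetical /etc/fstab entry disabling barriers might look like the following; the device and mount point are placeholders, and this is only safe when a non-volatile cache sits in the write path:

/dev/sdc1 /var/lib/pgsql xfs defaults,nobarrier 0 2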
There are a few other filesystem options for Linux that are not well explored for PostgreSQL; some observations on each follow:
- JFS: Performs similarly to XFS but with less CPU usage. However, it is considered less stable than the recommended choices here, and it's not as well supported by mainstream Linux distributors. It's hard to tell the ordering there: is it less stable merely because it's not a mainstream choice and gets less testing as a result, or are there fewer users because of the stability issues? Regardless, JFS was never very popular, and it seems to be losing ground now.
- ReiserFS: After starting as the first journaling filesystem integrated into the Linux kernel, for some time ReiserFS was the preferred filesystem of the major Linux distribution SuSE. After SuSE abandoned it for ext3 in late 2006, ReiserFS adoption has been shrinking steadily. At this moment the current ReiserFS v3 is considered stable, but its replacement, ReiserFS v4, has yet to even be merged into the mainline Linux kernel. The uncertain future of the filesystem limits interest in it considerably. This is unfortunate, given that the transparent compression feature of ReiserFS v4 would be particularly valuable for some database workloads, such as data warehouses where large sequential scans are common.
- Btrfs: This Oracle-sponsored filesystem is considered the future of Linux filesystems even by the primary author of ext4. At this point its development hasn't reached a stable release. When that does happen, this filesystem has some unique features that will make it a compelling option to consider for database use, such as its support for easy snapshots and checksums.
While each of these filesystems has some recognized use cases where it performs well compared to the mainstream choices, none of them is considered compelling for database use in general, or for PostgreSQL in particular, due to maturity issues as this is being written. Btrfs may yet change in that regard since, unlike the other two, it still has a healthy development community working on it.
Chapter 2, Database Hardware already mentioned that most hard drives have volatile write caches in them, and that the WAL writing scheme used by the database isn't compatible with those. The important requirement here is that when a file sync operation (fsync on UNIX) occurs, the filesystem must make sure all related data is written out to a non-volatile cache or to the disk itself. Filesystem journal metadata writes have a similar requirement: to preserve proper ordering, the journal updates must be written out before the associated metadata updates.
To support both of these situations, where something needs to be unconditionally written to disk and where an ordering between writes is required, the Linux kernel implements what it calls write barriers. A quote from the kernel documentation on barriers:
All requests queued before a barrier request must be finished (made it to the physical medium) before the barrier request is started, and all requests queued after the barrier request must be started only after the barrier request is finished (again, made it to the physical medium).
This sounds comforting, because this requirement that data be on "physical medium" matches what the database expects. If each database sync request turns into a write barrier, that's an acceptable way to implement what the database requires—presuming that write barriers work as advertised.
In order to support barriers, the underlying disk device must support flushing its cache, and preferably a write-through mode as well.
SCSI/SAS drives allow writes (and reads) to specify Force Unit Access (FUA), which directly accesses the disk media without using the cache—what's commonly called write-through. They also support a SYNCHRONIZE CACHE call that flushes the entire write cache out to disk.
Early IDE drives implemented a FLUSH CACHE call and were limited to 137 GB in size. The ATA-6 specification added support for larger drives at the same time it introduced the now-mandatory FLUSH CACHE EXT call. That's the command currently sent to a drive to do what filesystems (and the database) want for write cache flushing. Any SATA drive on the market now will handle this call just fine; some IDE and the occasional rare early SATA drives available many years ago did not. Today, if you tell a drive to flush its cache out, you can expect it will do so reliably.
SATA drives that support Native Command Queuing can also handle FUA. Note that support for NCQ in Linux was added as part of the switch to the libata driver in kernel 2.6.19, but some distributions (such as RedHat) have backported this change to the earlier kernels they ship. You can tell if you're using libata either by noting that your SATA drives are named starting with sda, or by running:
$ dmesg | grep libata
The exact system calls used will differ a bit, but the effective behavior is that any modern drive should support the cache flushing commands needed for the barriers to work. And Linux tests the drives out to confirm that this is the case before letting you enable barriers, so if they're on, they are expected to work.
ext3 theoretically supports barriers when used with simple volumes. In practice, and for database purposes, they are not functional enough to help. The problem is that fsync calls are not correctly translated into write barriers in a form that will always flush the drive's cache. That's just not built into ext3 in the right form, and if you are using Linux software RAID or the Logical Volume Manager (LVM) with ext3, barriers will not be available at all anyway.
What actually happens on ext3 when you execute fsync
is that all buffered data on that filesystem gets flushed out in order to satisfy that call. You read that right—all cached data goes out every time a fsync
occurs. This is another reason why putting the WAL, which is constantly receiving fsync
calls, onto its own volume is so valuable with ext3.
XFS does handle write barriers correctly, and they're turned on by default. When you execute an fsync call against a file on an XFS volume, it will just write out the data needed for that one file, rather than doing the excessive cache flushing that ext3 does. In fact, one of the reasons this filesystem has a bad reputation in some circles relates to how early it introduced this feature. It enabled barriers before the available drives and the underlying device drivers always did the right thing to flush data out, which meant the barriers it relied upon for data integrity were not themselves reliable. This resulted in reports of corruption that was actually caused by bugs elsewhere in the software or hardware chain, but XFS was blamed for it. ext4 also supports barriers, and they're turned on by default; its fsync calls are implemented properly as well. The end result is that you should be able to use ext4, leave drive caching on, and expect that database writes will be written out safely anyway. Amusingly, it was obvious that ext4 was finally doing the right thing in 2.6.32 because of the massive performance drop in PostgreSQL benchmarks. Suddenly, pgbench tests only committed a number of transactions per second corresponding to the rotation speed of the underlying disks, rather than showing an inflated result from unsafe behavior.
Regardless of what filesystem you choose, there are some general Linux tuning operations that apply.
The first parameter you should tune on any Linux installation is device read-ahead. When doing sequential reads that seem to be moving forward, this feature results in Linux asking for blocks from the disk ahead of the application requesting them.
This is the key to reaching full read performance from today's faster drives. The usual symptom of insufficient read-ahead is noticing that write speed to a disk is faster than its read speed. The impact is not subtle; proper read-ahead can easily result in a 100% or larger increase in sequential read performance. It is the most effective filesystem adjustment to make on Linux if you want to see benchmarks like the bonnie++ read speed (covered in Chapter 3, Database Hardware Benchmarking) jump upwards. This corresponds to a big increase in large sequential I/O operations in PostgreSQL too, including sequential scans of tables and bulk operations like COPY imports.
You can check your current read-ahead using the blockdev
command:
$ blockdev --getra /dev/sda
The default is 256 for regular drives, and it may be larger for software RAID devices. The units here are normally 512 bytes, making the default value equal to 128 KB of read-ahead. The properly tuned range on current hardware normally works out to be 4096 to 16384, which makes the following change a reasonable starting point:

$ blockdev --setra 4096 /dev/sda

If you run bonnie++ with a few read-ahead values, you should see the sequential read numbers increase as you tune upwards, eventually leveling off. Unfortunately, read-ahead needs to be set for each drive on your system. This is usually handled by putting a blockdev adjustment for each device in the rc.local boot script.
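A minimal sketch of what such an rc.local addition might look like, assuming two hypothetical drives named sda and sdb:

blockdev --setra 4096 /dev/sda
blockdev --setra 4096 /dev/sdb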
The Linux read-ahead implementation was developed with PostgreSQL as an initial target, and it's unlikely you will discover that increased read-ahead detunes smaller reads, as you might fear. The implementation is a bit smarter than that.
Each time you access a file in Linux, a file attribute called the file's last access time (atime) is updated. This turns into a steady stream of writes when you're reading data, which is an unwelcome overhead when working with a database. You can disable the behavior by adding noatime to the volume mount options in /etc/fstab, as in this example:
/dev/sda1 / ext3 noatime,errors=remount-ro 0 1
There are two additional levels of access time updates available in some Linux kernels: nodiratime
and relatime
, both of which turn off a subset of the atime updates. Both of these are redundant if you use the preferred noatime
, which disables them all.
Linux will try to use any extra RAM for caching the filesystem, and that's exactly what PostgreSQL would like it to do. When the system runs low on RAM, the kernel has a decision to make: rather than reducing the size of its buffer cache, the OS might instead swap inactive disk pages out. How often to prefer this behavior is controlled by a tuneable named swappiness. You can check the current value on your system (probably 60) by looking at /proc/sys/vm/swappiness, and the easiest way to make a permanent adjustment is to add a line to /etc/sysctl.conf like the following:
vm.swappiness=0
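To apply everything in sysctl.conf immediately, rather than waiting for the next reboot, and then verify the result, you can use these standard commands:

$ sysctl -p
$ cat /proc/sys/vm/swappiness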
A value of 0 prefers shrinking the filesystem cache rather than using swap, which is the recommended behavior for getting predictable database performance. You might, though, notice a small initial decrease in performance at high memory usage. The things that tend to be swapped out first are parts of the operating system that are never used, and therefore never missed, so evicting them to free buffer cache space is the right move in that specific case. It's when you run so low on memory that more important things start getting swapped out that the problems show up. As is often the case, optimizing for more predictable behavior (avoiding swap) can actually drop performance in some cases (when the items swapped out weren't needed anyway). Tuning for how to act when the system runs out of memory is not an easy process.
Related to this parameter is Linux's tendency to let processes allocate more RAM than the system has, in the hope that not all of it will actually be used. This Linux overcommit behavior should be disabled on a PostgreSQL server by making this change to the sysctl configuration:
vm.overcommit_memory=2
Both of these changes should be considered as part of setting up any Linux PostgreSQL server. A good practice here is to bundle them in with increasing the shared memory parameters to support larger values of shared_buffers
, which requires editing the same sysctl
file.
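Putting these together, a sketch of the relevant /etc/sysctl.conf additions might look like the following. The shared memory values here are hypothetical, sized for roughly 4 GB of shared memory on a system with 4 KB pages (shmmax is in bytes, shmall in pages), and must be adjusted to match your own shared_buffers setting:

vm.swappiness=0
vm.overcommit_memory=2
kernel.shmmax=4294967296
kernel.shmall=1048576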
On the write side of things, Linux handles writes to the disk using a daemon named pdflush. It will spawn some number of pdflush processes (between two and eight) to keep up with the amount of outstanding I/O. pdflush is not very aggressive about writes, under the theory that waiting longer to write things out optimizes total throughput. When you're writing a large data set, both write combining and being able to sort writes across more data will lower the average amount of seeking involved in writing.
The main drivers for when things in the write cache are aggressively written out to disk are two tuneable kernel parameters:
- /proc/sys/vm/dirty_background_ratio: Maximum percentage of active RAM that can be filled with dirty pages before pdflush begins to write them.
- /proc/sys/vm/dirty_ratio: Maximum percentage of total memory that can be filled with dirty pages before processes are forced to write dirty buffers themselves during their time slice, instead of being allowed to do more writes. Note that all processes are blocked for writes when this happens, not just the one that filled the write buffers. This can cause what is perceived as an unfair behavior where one "write-hog" process can block all I/O on the system.
The defaults here depend on your kernel version. In early 2.6 kernels, dirty_background_ratio=10 and dirty_ratio=40, meaning that a full 10% of RAM can be dirty before pdflush really considers it important to work on clearing the backlog. When combined with ext3, where any fsync write will force the entire write cache out to disk, this is a recipe for a latency disaster on systems with large amounts of RAM. You can monitor exactly how much memory is queued in this fashion by looking at /proc/meminfo and noting how large the value listed for "Dirty" gets.
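For instance, a quick way to watch just that value with a standard command:

$ grep Dirty /proc/meminfo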
Recognizing these defaults were particularly bad, in Linux kernel 2.6.22 both values were lowered considerably. You can tune an older kernel to use the new defaults like the following:
echo 10 > /proc/sys/vm/dirty_ratio
echo 5 > /proc/sys/vm/dirty_background_ratio
And that's a common recipe to add to the /etc/rc.d/rc.local
file on RHEL4/5 server installs in particular. Otherwise, the write stalls that happen when dirty_ratio
is exceeded can be brutal for system responsiveness. The effective lower limit here is to set dirty_ratio
to 2 and dirty_background_ratio
to 1, which is worth considering on systems with >8 GB of RAM. Note that changes here will detune average throughput for applications that expect large amounts of write caching. This trade-off, that maximum throughput only comes with an increase in worst-case latency, is very common.
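On kernels that expose these values as sysctl parameters, the same settings can instead be made persistent in /etc/sysctl.conf; a sketch using the newer default values:

vm.dirty_ratio=10
vm.dirty_background_ratio=5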
One of the more controversial tunable Linux performance features, in that there's no universally accepted guidelines available, is that of the I/O scheduler choice. The name elevator is used due to how they sort incoming requests. Consider a real elevator, currently at the first floor. Perhaps the first person in requests the ninth floor, then the next person the third. The elevator will not visit the floors in the order requested; it will visit the third floor first, then continue to the ninth. The scheduler elevator does read and write sorting to optimize in a similar fashion.
You can set the default scheduler elevator at kernel boot time. Here's an example from a RedHat Linux system, changing the default elevator to the deadline option:
kernel /vmlinuz-2.6.18-128.el5 ro root=/dev/sda1 elevator=deadline
The exact location of the file where you have a similar kernel boot line depends on your Linux distribution and which boot loader you use. There are too many possibilities to list them all here; instructions for your Linux distribution on how to install a custom kernel should point you the right way.
And as of Linux kernel 2.6.10, you can adjust the scheduler for individual devices without rebooting:
$ echo cfq > /sys/block/sda/queue/scheduler
$ cat /sys/block/sda/queue/scheduler
noop anticipatory deadline [cfq]
The four elevator choices are:
- elevator=cfq: Completely Fair Queuing tries to divide available I/O bandwidth equally among all requests. This is the default for most Linux installations.
- elevator=deadline: Deadline aims to provide something like real-time behavior, where requests are balanced so that no one "starves" due to waiting too long for a read or write.
- elevator=noop: The no operation scheduler doesn't do any complicated scheduling; it does handle basic block merging and sorting before passing the request along to the underlying device.
- elevator=as: Anticipatory scheduling intentionally delays I/O requests in hopes of bundling more of them together in a batch.
People seem drawn to this area, based on those descriptions, as one that will really impact the performance of their system. The reality is that the schedulers are being covered last because this is the least effective tuneable mentioned in this section. Adjusting the I/O scheduler has, in most cases, a minimal impact on PostgreSQL performance. If you want to improve read performance, adjusting read-ahead is vastly more important. And if you want to tweak write performance, adjusting the dirty cache writeback behavior is the primary thing to consider (after tuning the database to reduce how much dirty data it generates in the first place).
There are a few cases where the I/O scheduler can be effective to tune. On systems with a lot of device read and write cache, such as some RAID controllers and many SANs, any kernel scheduling just gets in the way. The operating system is sorting the data it knows about, but that's not considering what data is already sitting in the controller or SAN cache. The noop scheduler, which just pushes data quickly toward the hardware, can improve performance if your hardware has its own large cache to worry about.
On desktop systems with little I/O capacity, the anticipatory scheduler can be helpful for making the most of the underlying disk(s), by better sorting read and write requests into batches. It's unlikely to be suitable for a typical database server.
The other two options, CFQ and deadline, are impossible to suggest specific use cases for. The reason is that their exact behavior depends on both the Linux kernel you're using and the associated workload. There are kernel versions where CFQ has terrible performance and deadline is clearly better, because of bugs in that version's CFQ implementation. In other situations, deadline will add latency, exactly the opposite of what people expect, when the system has a high level of concurrency. And you're not going to be able to usefully compare them with any simple benchmark. The main differences between CFQ and deadline only show up when many concurrent read and write requests are fighting for disk time. Which is optimal is completely dependent on that mix.
Anyone who tells you that either CFQ or deadline is always the right choice doesn't know what they're talking about. It's worth trying both when you have a reasonable simulation of your application running, to see if there is a gross difference due to something like a buggy kernel. Try to measure transaction latency, not just average throughput, to maximize your odds of making the correct choice here. One way to measure query latency is to enable logging of query times in the database configuration, perhaps using log_min_duration_statement, and then analyze the resulting log files. But if the difference is difficult to measure, don't be surprised. Without a compelling reason to choose otherwise, you should prefer CFQ, as it's the kernel default and therefore much more widely tested.
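For example, this postgresql.conf setting logs every statement taking longer than 100 milliseconds, giving you latency data to compare scheduler runs against; the threshold value is just an illustration:

log_min_duration_statement = 100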