PostgreSQL 9.0 High Performance

Disk layout for PostgreSQL

Since PostgreSQL uses standard filesystems for all its files, there are several parts of the database you can relocate to somewhere else just by moving the associated files and adding a symbolic link to the new location.

Symbolic links

Symbolic links (also called symlinks) are just entries in a filesystem directory that point towards another location. UNIX systems originally preferred what are called hard links, where the filesystem entry literally points at another spot on disk. To make this easier to manage, the normal approach now is to use soft symbolic links, which are easy to see and work with. The most common thing to relocate using a symlink in PostgreSQL is the WAL transaction log. You can do this after the database cluster is created (but with the server down!) like the following:

$ cd $PGDATA
$ mv pg_xlog /disk
$ ln -s /disk/pg_xlog pg_xlog
$ ls -l pg_xlog
lrwxrwxrwx 1 postgres postgres 11 2010-04-27 17:35 pg_xlog -> /disk/pg_xlog

Starting in PostgreSQL 8.3, it's possible to use the --xlogdir parameter when running initdb to create the cluster. This doesn't work any differently; it just creates the soft symlink for you. The preceding technique is still quite common, and there's no reason to prefer one approach over the other beyond whichever seems easier to you.
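
As a sketch of that alternative, assuming /disk/pg_xlog already exists, is empty, and is owned by the postgres user:

$ initdb -D $PGDATA --xlogdir=/disk/pg_xlog

The resulting pg_xlog entry inside the data directory is the same sort of soft link shown above.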

Tablespaces

The main unit of storage for a PostgreSQL database is the tablespace. Tablespaces are described accurately by their name: they're a space to put tables (and indexes) in. The idea is that every logical disk you want to use for a distinct purpose gets assigned a tablespace name, and when you create a table you reference that tablespace to place it there:

$ mkdir /disk/pgdata
$ psql
postgres=# CREATE TABLESPACE disk LOCATION '/disk/pgdata';
postgres=# CREATE TABLE t(i int) TABLESPACE disk;

Tablespaces are also implemented inside the database using symbolic links, and your OS needs to support them (or an equivalent like the NTFS junction) for this to work. Databases and tables are by default created in a virtual tablespace named pg_default. You can change that by setting the default_tablespace parameter in the server configuration. It's also possible to relocate an entire database by setting the TABLESPACE parameter when running CREATE DATABASE.
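
As a sketch of those two options, reusing the disk tablespace created above (the table and database names here are only placeholders):

postgres=# SET default_tablespace = 'disk';
postgres=# CREATE TABLE t2(i int);
postgres=# CREATE DATABASE appdb TABLESPACE disk;

The first statement makes disk the default for the rest of the session, so the table lands there without naming it; a database created with TABLESPACE disk will also place its objects there unless told otherwise.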

Database directory tree

With this background in mind, here are the major directories inside the PostgreSQL database structure that initdb creates:

  • base: This is the location that holds pg_default, the default tablespace. The template databases and any additional ones created without an explicit tablespace assignment will all be placed here. The next chapter digs into the details of how the base/ directory is organized into databases and tables.
  • global: Holds the pg_global virtual tablespace. This is where system tables shared by all databases in a cluster, such as the database roles and other system catalog data, are stored.
  • pg_clog: The transaction commit log data is stored here. This data is actively read from, by VACUUM in particular, and files are removed once there's nothing interesting left in them. This directory is one of the reasons PostgreSQL doesn't work very well if you mount the database in a way that bypasses the filesystem cache. If reads and writes to the commit logs are not cached by the operating system, performance drops heavily in several common situations.
  • pg_stat_tmp: This directory contains the files used to store database statistics. These files should never get very big, and if they do you'll see the database statistics process start consuming more resources.
  • pg_tblspc: When you create a new tablespace, the symbolic link that manages it is created here; there's an example of looking at these after this list.
  • pg_xlog: The database WAL used for crash recovery is stored here.
  • pg_subtrans: Contains data related to subtransactions.
  • pg_multixact: Contains multi-transaction status data. In situations where your application heavily uses shared row locks, use of this directory could be heavy.
  • pg_twophase: Holds data related to two-phase commits.
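
For example, after creating the disk tablespace shown earlier, pg_tblspc will contain a symbolic link, named after the tablespace's OID, pointing at /disk/pgdata; you can see it with:

$ ls -l $PGDATA/pg_tblspc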

Generally, the first thing people relocate onto its own disk is pg_xlog. Then they add more tablespaces to split out heavily accessed tables, and after that they move the temporary files. The directories containing transaction details are much less likely candidates to be split onto their own disks, but PostgreSQL users have reported applications where doing so improved performance.

Temporary files

There are a few situations where PostgreSQL saves temporary files, that is, files that are not critical to database operation. Tables created using CREATE TEMPORARY TABLE and their respective indexes are one source. Probably more importantly, when a query involves a sort operation and the data exceeds work_mem, temporary files are created for that purpose. So in situations where your users will be doing lots of sorting of large tables, as in a data warehouse, there can be quite a bit of disk activity for this purpose.
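
One way to confirm whether a particular query spills to disk this way is EXPLAIN ANALYZE; the table and column names here are only placeholders:

postgres=# SET work_mem = '1MB';
postgres=# EXPLAIN ANALYZE SELECT * FROM big_table ORDER BY some_column;

When the sort doesn't fit in work_mem, the Sort node in the resulting plan reports an external sort method along with how much disk it used.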

The temp_tablespaces database parameter allows relocating all the temporary tables to a tablespace chosen from the list you provide. If you put more than one tablespace on that list, the one used is selected at random when a transaction first does something that needs temporary storage. However, when a second or later piece of temporary storage is needed by the same transaction, the database iterates over the list sequentially, spreading activity more evenly across all the provided tablespaces.
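
Here's a sketch of the postgresql.conf entry, assuming you've already run CREATE TABLESPACE for two tablespaces named temp1 and temp2 on the disks you want used for this purpose:

temp_tablespaces = 'temp1,temp2'

The same parameter can also be changed per session with SET, which is convenient for testing the idea before putting it into the configuration file.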

By default, temporary objects are created in the default tablespace, so they'll all appear in base/pgsql_tmp unless relocated. If you're not using temporary tables heavily, monitoring activity and disk space used in that directory lets you estimate how heavy the sorting activity is on your database. This technique works even on earlier PostgreSQL releases that don't have the log_temp_files parameter, and it remains useful where that parameter is available, because watching disk space gives you an idea of how much concurrent sorting is happening; that can be hard to compute just from the log file entries.
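
Checking that is as simple as watching the directory size periodically; this sketch assumes the default tablespace hasn't been relocated, and the directory only appears once the server has actually needed a temporary file:

$ du -sh $PGDATA/base/pgsql_tmp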

One interesting property of temporary files is that they're prime candidates for storage even on less reliable drives, such as you might have your operating system on. It's possible to safely, and sometimes quite usefully, put the temporary files onto a directory on your OS disks if they are underutilized. Just be careful because if those disks are lost, you'll need to recreate that empty tablespace on the OS drives of the replacement, or remove it from the temp_tablespaces list.

Disk arrays, RAID, and disk layout

Many people start their PostgreSQL tuning work by asking how to spread the various pieces of the database across a large disk array. Recommending how to do that is really application specific, and it's only after covering so much background in the last few chapters that the right context is available to even discuss it.

If you have a really large number of disks available, a dozen or more, and presuming that disk space needed for the database isn't a problem, a good configuration would look like the following:



Location         Disks  RAID Level  Purpose
/ (root)         2      1           Operating system
$PGDATA          6+     10          Database
$PGDATA/pg_xlog  2      1           WAL
Tablespace       1+     None        Temporary files

Here we're presuming that you want every bit of important data to be redundant, and therefore are using the mirroring facilities of RAID1 to do that. This gives a 3:1 ratio between database read/write throughput and that of the WAL, which is usually enough that activity to each will be balanced well. If you only have two pairs of drives for those (the database is on one pair and the WAL on another), it's likely the WAL will get considerably less activity than the database.

Note

The leftover disk here from the hypothetical set of 12 would be assigned as a hot spare, something it's always a good idea to allocate. Given that arrays tend to be an even number of disks, pairing some temporary space with a hot spare can work out better than having two hot spares. Another typical use for a single leftover disk is to create a place to store non-critical backups and other working files, such as a database dump that needs to be processed before being shipped elsewhere.

Here's where things get tricky: what if, instead, the same set of disks was laid out like the following:



Location  Disks  RAID Level  Purpose
/ (root)  12     10          OS, DB, WAL

Would that perform better or worse than the manually split setup for the same number of disks shown previously? The answer to that is simple: there's no way to predict it. Having your data striped across so many disks is going to improve the average performance of every operation you do on this array. This setup is almost certainly better for a workload where random seeks are the norm. However, there are several database components that are not accessed randomly. The pg_xlog is only written to sequentially. Temporary files are not strictly sequential, but they're not heavily random either, and writes to the operating system drive normally happen without a sync operation forcing them to disk, which means the OS is free to delay those writes for improved write combining and sorting, both of which reduce seeks. Putting everything onto one big array reduces the chances to take advantage of optimizations like that, because the cache flushing required to make the database happy is likely to force out plenty of other data too, particularly on ext3.

With that perspective, you can think about this a bit differently:

 

Function          Cache Flushes  Access Pattern
Operating system  Rare           Mix of sequential and random
Database          Regularly      Mix of sequential and random
WAL               Constant       Sequential
Temporary files   Never          More random as client count increases

The exact mix of sequential and random behavior for your database is completely application dependent, and the number of concurrent users involved affects both the database and temporary file patterns; again, what your application does determines whether temp files even matter. Any attempt to optimize disk layout that doesn't take into account the access pattern of your application, including its concurrency, is unlikely to predict performance correctly.

And even if you do have enough information that you believe you can predict an optimal layout, the odds are still against you. To quote noted computer scientist Donald Knuth, "…the universal experience of programmers who have been using [performance] measurement tools has been that their intuitive guesses fail". You would be unwise to presume that you will be the lucky person who guesses correctly.

If you do split something out, make sure to measure the improvement to confirm whether you were right. Ideally, you'd be able to simulate that in a test environment before doing so on a production system. In the real world, most development servers have far fewer disks than the production ones, making that tough to simulate. One good place to try it is during acceptance testing of new production hardware. You should be doing general system burn-in and testing of any candidate replacement for a production database server. That's a great time to experiment with (and measure!) changes in application performance, including different disk configurations.

Disk layout guidelines

There are a few guidelines that can help you prune down the possible configurations:

  • Avoid putting the WAL on the operating system drive, because the two have completely different access patterns and both will suffer when combined. This might work out fine initially, only for a major issue to appear when the OS is doing something like a system update or daily maintenance activity. Rebuilding the filesystem database used by the locate utility each night is one common source of heavy OS disk activity on Linux.
  • If you have evidence you don't do any large sorting, the temporary files can be kept at their default location, as part of the database disk.
  • On Linux systems running ext3, where fsync cache flushes require dumping the entire OS cache out to disk, split the WAL onto another disk as soon as you have a pair to spare for that purpose.

Beyond those rough rules, it's hard to settle arguments between the "put everything in one big array and let the OS sort it out" and "break everything out into individually assigned disks so the DBA can tune" crowds. As is the case with many types of database tuning, to really know for sure what works best you'll need to simulate both configurations, using a benchmark workload as similar as possible to your production workload.

As for which side I personally fall on, I prefer to split things up. The reason is not that I expect it to perform better, but that it's unquestionably easier to quantify what an application is doing that way. Being able to measure the exact balance of data going to the database, WAL, and temporary disks is valuable both for general optimization and for finding bottlenecks in application design. I consider the value of being able to monitor at that level to be the tie-breaker between these schools of thought on database layout.
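
On Linux, for example, a simple sketch of that kind of monitoring is to watch the per-device statistics while the application runs; which devices map to the database, WAL, and temporary tablespaces depends on your particular layout:

$ iostat -x 5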

Remember that the hardware your application is currently running on is not necessarily what it will run on forever. If I'm planning out a new server and can predict the WAL vs. DB activity ratio based on measurements of the old system, that's not just guessing anymore, and the odds are much better the new layout will be successful. But if the old system lumped everything onto one large pile of disks, the information gathered from it isn't nearly as useful for capacity planning.