
WAL and the WAL writer process
When we make changes to the data, the changes are not written to the data files immediately, as mentioned before (probably many times). Changes are made to the blocks in the buffer and records of these changes are written to the WAL buffer (as soon as changes to data are made). The changes are flushed to the WAL segments when the changes are committed.
In the pg_xlog
directory, the WAL segments are each 16 MB in size:
[postgres@MyCentOS pg_xlog]$ pwd /pgdata/9.3/pg_xlog [postgres@MyCentOS pg_xlog]$ ls -alrt total 16396 drwx------. 2 postgres postgres 4096 Oct 13 13:23 archive_status drwx------. 3 postgres postgres 4096 Oct 13 13:23 . drwx------. 15 postgres postgres 4096 Nov 15 20:17 .. -rw-------. 1 postgres postgres 16777216 Nov 15 20:17 000000010000000000000001
We can find out the segment PostgreSQL is writing to now using the pg_current_xlog_location
function:
[postgres@MyCentOS pg_xlog]$ psql psql (9.3.0) Type "help" for help. postgres=# SELECT pg_xlogfile_name(pg_current_xlog_location()); pg_xlogfile_name -------------------------- 000000010000000000000001 (1 row)
The name is not a random collection of digits and numbers. It's comprised of three parts of 8 characters each:
000000010000000000000001
The digits are classified as follows:
- The first 8 digits identifies the timeline
- The following 8 digits identifies the (logical)
xlog
file - The last ones represent the (physical)
xlog
file (Segment
)
Each segment contains blocks of 8K. Usually, PostgreSQL moves from one segment to the next when one segment is filled up, that is, all the 16 MB is filled. However, it's also possible to trigger the switch.
There are many functions related to WAL, which are useful in archival, recovery, and so on.
For example, pg_switch_xlog
moves to the next transaction log file, allowing the current file to be archived. If there has been no transaction log activity since the last transaction log switch, pg_switch_xlog
does nothing and returns the start location of the transaction log file currently in use. WAL is mostly written to and rarely read from. Cases when WAL is read from include:
- Recovery
- Server startup
- Replication
Let's take a look at a few uses of WAL.
Recovery
This is the primary purpose of the WAL concept. Here, we are referring to recovering transactions that have been committed, but have not found their way to the data files. All the changes made to the database will find their way into the WAL segments, irrespective of whether the changes have been reflected into the data files or not. In fact, it's mandatory that changes have been written to WAL files before they are written to the data files themselves. Loss of WAL files almost certainly means lost transactions.
Incremental backup and point-in-time recovery
We can take a snapshot of the PostgreSQL filesystem and then set up a WAL archival process. The snapshot taken need not be a consistent one. The WAL segments generated will keep getting archived and we can use the snapshot and the archived WAL segment to perform a point-in-time recovery. In this process (which will be covered in Chapter 10, Scaling, Replication, and Backup and Recovery), we restore the file snapshot, and then replay the WAL segments until a specific point in time or until a transaction.
Replication
WAL segments have been used for the two purposes mentioned earlier. From version 9.0, WAL segments have been used for replication also. The rationale is simple. All the changes happening in the server are being recorded in the WAL segments anyway. Why not use these and get a stand-by server ready for failover?
We will cover the steps to set these up in Chapter 10, Scaling, Replication, and Backup and Recovery.
WAL also reduces the number of disk writes necessary to guarantee that a transaction is committed, thus improving performance. This is achieved because WAL writes are sequential. If a number of small transactions are committed, they will appear sequentially in the log. If the database were to work without WAL, a transaction commit will immediately result in writing data out to all the data files that were affected by the transaction.
Now, let's also take a look at a few key parameters related to WAL. Change the directory to /pgdata/9.3
and execute the following command:
[postgres@MyCentOS 9.3]$ grep wal postgresql.conf wal_level = archive # minimal, archive, or hot_standby #wal_sync_method = fsync # the default is the first option #wal_buffers = -1 # min 32kB, -1 sets based on shared_buffers #wal_writer_delay = 200ms # 1-10000 milliseconds #max_wal_senders = 0 # max number of walsender processes #wal_keep_segments = 0 # in logfile segments, 16MB each; 0 disables #wal_sender_timeout = 60s # in milliseconds; 0 disables #wal_receiver_status_interval = 10s # send replies at least this often #wal_receiver_timeout = 60s # time that receiver waits for
The wal_level
is an important parameter as this setting has a direct impact on what we can do with WAL, whether we can use it only for recovery from a crash, or we can use it also to archive for a point-in-time recovery, or whether we can use it to create a stand-by server. This setting determines the amount of information written to WAL. This is an enum type of value, which means that it can have one of a possible list of values. The possible values are minimal
, archive
, and hot_standby
. The default value is minimal
, which writes only the information needed to recover from a crash or immediate shutdown. The archive
value adds logging required for WAL archiving, and hot_standby
further adds information required to run read-only queries on a standby server. This parameter can only be set at server start.
In minimal
level, WAL-logging of some bulk operations can be safely skipped, which can make the operations faster. Minimal WAL does not contain enough information to reconstruct the data from a base backup and the WAL logs, so either archive or the hot_standby
level should be used to enable WAL archiving (archive_mode
) and streaming replication.
In the hot_standby
level, the same information is logged as with archive
, plus the information needed to reconstruct the status of running transactions from the WAL. To enable read-only queries on a standby server, wal_level
must be set to hot_standby
on the primary, and a hot_standby
setting must be enabled in the standby.
When fsync
is on, the PostgreSQL server tries to ensure that data really gets flushed to the disk (remember the different levels of cache?). Setting fsync
to off might result in significant gains in performance, but carries a risk. This can result in an unrecoverable data corruption in the event of a power failure or a system crash. In cases where fsync
is turned off, wal_sync_method
is not relevant at all. When fsync
is on, we could choose one of many options, such as open_datasync
, fdatasync
, fsync
, and so on; each one has a different way to ensure that data gets flushed to disk. A description of these methods is available at http://www.postgresql.org/docs/current/static/runtime-config-wal.html#GUC-WAL-SYNC-METHOD.
For an in-depth discussion on the fsync
options and its impact on the performance and possibility of data loss, please refer to PostgreSQL 9.0 High Performance, Gregory Smith, Packt Publishing.
The wal_buffers
memory is the amount of shared memory set aside for the WAL buffer. Setting this to -1
will result in an automatic selection based on the shared_buffers
setting. In automatic setting, the value can range from 64 KB to 16 MB. When set manually, the minimum possible value is 32 KB.
Another decision is how frequently the WAL writer should flush WAL to disk. What we need to remember is that this will be far more frequent than checkpoint frequency. The WAL flush interval is indicated by wal_writer_delay
. The default value is 200 milliseconds.
There are a few more parameters relevant for replication and we will cover them in Chapter 10, Scaling, Replication, and Backup and Recovery.