PostgreSQL for Data Architects

Checkpoint

Checkpoint is a mandatory process. To understand this, let's discuss blocks. PostgreSQL always reads and writes data in blocks. Consider the emp table. It has just one record. The data in this record adds up to just a few bytes: the value 1 in the column id, and the value Newname in the column first_name. However, this table will consume 8K on disk, because PostgreSQL works with 8K blocks. A block is also referred to as a page. It is easy to verify that PostgreSQL uses blocks. First, ensure that our table has just one record:

SELECT * FROM emp;
 id | first_name 
----+------------
  1 | Newname
(1 row)

Then, we find the filename:

SELECT pg_relation_filepath('emp');
 pg_relation_filepath 
----------------------
 base/24741/24742
(1 row)

Now, we check the size of the file:

\! ls -l /pgdata/9.3/base/24741/24742
-rw-------. 1 postgres postgres 8192 Nov 15 11:33 /pgdata/9.3/base/24741/24742

8192 bytes = 8K. So, a table with just one record takes up 8K.
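The same 8K can be confirmed from within PostgreSQL itself. As a cross-check under the same assumptions (an emp table with a single record), both of the following are part of the standard catalog:

```sql
-- Compile-time block size, in bytes (8192 by default)
SHOW block_size;

-- On-disk size of the table's main fork: one block for one record
SELECT pg_relation_size('emp');
```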

Let's try inserting some data and see what happens:

INSERT INTO emp(id, first_name) SELECT generate_series(1,5000000), 'A longer name';
INSERT 0 5000000

After executing the preceding query a few times, let's check the size of the files from the shell prompt:

[root@MyCentOS 24741]# ls -lh 24742*
-rw-------. 1 postgres postgres 1.0G Nov 17 16:14 24742
-rw-------. 1 postgres postgres  42M Nov 17 16:14 24742.1
-rw-------. 1 postgres postgres 288K Nov 17 16:08 24742_fsm
-rw-------. 1 postgres postgres  16K Nov 17 16:06 24742_vm

So, we have directories for databases and files for tables. Within the files, data is managed in blocks. In short, the physical layout of a PostgreSQL cluster can be pictured as a directory per database under base/, containing one or more files per table or index, each file made up of 8K blocks.

[Figure: physical layout of a PostgreSQL cluster]

Once a user makes changes to the data (which has been made available in the buffer), that buffer becomes dirty. As mentioned earlier, the fact that a user has committed a change does not mean that the change has been written to the data file. It's the job of the checkpointer process to write the change to the data file. When a checkpoint happens, all dirty (modified) pages are written to the table and index files. The process also marks the pages as clean, and marks the write-ahead log as applied up to this point.

Checkpoints are points in the sequence of transactions at which it is guaranteed that the heap and index data files have been updated with all information written before that checkpoint. At checkpoint time, all dirty data pages are flushed to disk and a special checkpoint record is written to the log file. Until then, the change records have only been written to the write-ahead log files. In the event of a crash, the crash recovery procedure looks at the latest checkpoint record to determine the point in the log (known as the redo record) from which it should start the REDO operation. Any changes made to data files before this point are guaranteed to be already on disk. Hence, after a checkpoint, log segments preceding the one containing the redo record are no longer needed and can be recycled or removed (see http://www.postgresql.org/docs/current/static/wal-configuration.html).
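We can watch this mechanism from psql. The following sketch forces a checkpoint and then reads the current WAL write position (the function name shown is the 9.x one; in PostgreSQL 10 and later it is pg_current_wal_lsn()):

```sql
-- Force an immediate checkpoint (superuser only)
CHECKPOINT;

-- Current WAL write location; segments written entirely before the
-- checkpoint's redo record are now eligible for recycling
SELECT pg_current_xlog_location();
```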

The question now is: when does a checkpoint happen? To some extent, we decide this. There are a few parameters that decide when a checkpoint should happen: checkpoint_segments, checkpoint_timeout, and checkpoint_completion_target.

The first one is checkpoint_segments. Its default value is 3. Each WAL segment is 16 MB, so once 3 segments' worth of changes (48 MB) have been written, a checkpoint occurs. We will cover WAL in the next section.

The second parameter, checkpoint_timeout, is a timeout value, which can be set in seconds (the default unit), minutes, or hours; the default value is 5 minutes. A checkpoint will occur when either:

  • checkpoint_timeout period has elapsed
  • checkpoint_segments WAL segments have been filled
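The current values of these parameters can be inspected from the pg_settings view (note that in PostgreSQL 9.5 and later, checkpoint_segments was replaced by max_wal_size):

```sql
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('checkpoint_segments',
               'checkpoint_timeout',
               'checkpoint_completion_target');
```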

Let's consider a server with shared_buffers set to around 16 GB. This is a server that caters to significant load. If a significant proportion of this load consists of writes, then most of this 16 GB buffer can become dirty in a few minutes. A low setting for checkpoint_segments will result in the available segments getting filled quickly and frequent checkpoints. Similarly, a low setting for checkpoint_timeout will also result in frequent checkpoints. This results in excessive disk throughput. On the other hand, if we keep these values very high, checkpoints will be infrequent. In a write-heavy system, this can result in significant I/O spikes during checkpoints, which affect the performance of other queries. Another parameter, checkpoint_completion_target, can be tweaked to alleviate this to some extent.

This parameter tells PostgreSQL how quickly it must try to finish the checkpointing process in each iteration. With the default value of 0.5, PostgreSQL can be expected to complete each checkpoint in about half the time before the next checkpoint starts. When we increase this value to, say, 0.9, the writes resulting from the checkpoint get spread over a longer period, so the I/O spikes get flattened out.
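Putting these together, a postgresql.conf fragment for a write-heavy server might look as follows; the numbers are illustrative assumptions, not recommendations, and should be tuned against the observed checkpoint frequency:

```
# postgresql.conf -- illustrative values only
checkpoint_segments = 32              # ~512 MB of WAL between checkpoints
checkpoint_timeout = 15min            # up from the 5min default
checkpoint_completion_target = 0.9    # spread writes over 90% of the interval
```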

One issue with very infrequent checkpoints and a lot of dirty buffers is that the time to recover might go up. When the database restarts after a crash, it reads the last checkpoint information and then replays, from the WAL, all the transactions that happened after that checkpoint, up to the last commit. If the volume of dirty data was huge when the system went down, a large number of transactions have to be replayed before the database is open for business, which implies a longer downtime. From this angle too, it is better to have more frequent checkpoints rather than infrequent ones. The time to recover might be as high as the checkpoint_timeout value. An excellent post on checkpoint_completion_target is available at http://www.depesz.com/2010/11/03/checkpoint_completion_target/.
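Whether checkpoints are being triggered by the timeout or by running out of segments can be checked in the pg_stat_bgwriter view; a high checkpoints_req count relative to checkpoints_timed usually means checkpoint_segments is too low for the write load:

```sql
-- checkpoints_timed: triggered by checkpoint_timeout
-- checkpoints_req:   triggered by filling WAL segments (or manual CHECKPOINT)
SELECT checkpoints_timed, checkpoints_req, buffers_checkpoint
FROM pg_stat_bgwriter;
```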

The checkpointer is a process started by the postmaster as soon as the startup subprocess finishes, or as soon as recovery begins if we are doing database recovery. It remains alive until the postmaster commands it to terminate.