
Now it's time to see some real disks measured. For the first few examples here, the drive being tested is a 320 GB Seagate Momentus 7200.4 3 Gb/s SATA, model number ST9320423AS. This is one of the faster 2.5" laptop drives on the market, and its detailed access time specifications were given in the earlier IOPS section. The results shown are from an installation into a Lenovo Thinkpad T60 with an Intel Core 2 Duo T7200 running at 2.0 GHz.
We'll start with the hdtune program running on Windows, because its graphs are extremely nice and it measures almost everything you'd hope for. Its graphs demonstrate several aspects of general disk performance more clearly than the command line tools for UNIX systems covered later.
A great tool for the sort of basic disk benchmarking needed for databases on Windows systems is hdtune, available at http://www.hdtune.com/. The program is free for a trial period, with a slightly limited feature set, and is modestly priced to purchase. The free features are sufficient for database hardware validation.
Here is what the output from hdtune looks like running the basic transfer rate benchmark on this disk:

These are respectable results from a single drive of any sort, particularly a laptop one. The left axis and the upper line graph chart MB/s at each point on the disk. The right axis is labeled in milliseconds, and the scattered dots in the middle and bottom show the access time at points along the drive. Note how the transfer speed falls and the access time rises as you move further along the logical part of this disk, into the slower ZCAV zones.
An average of 68.3 MB/s is in the middle range of what you'll see on drives as this book is being written. High performance desktop SATA or SAS drives will do better; some desktop and most laptop drives will do worse. The real-world seek access time of 14.9 ms is a similarly average value, and you can see it matches the drive specification.
There's one thing really odd about this graph though. Note how the early part, from 0 to 192 GB, is both flat on top and somewhat jagged. Both of those are warning signs that something is wrong. If you look at the bonnie++ ZCAV results later, you'll see what the drive's performance really looks like. The performance should be dropping evenly from the very beginning, and the graph should be smoother. When you see a flat top where a ZCAV decline should be, it normally means the rate is being limited by some bottleneck other than the drive speed itself. This sort of issue is exactly why it's worthwhile to get familiar with what drives look like when performing well, so you can spot the ones that aren't.
hdtune includes a particularly easy-to-use short stroking test feature that shows you how performance would change if only using a portion of the disk. For our sample, the first 160 GB were obviously the fastest part, making that a reasonable short-stroked restriction:

You can see that restricting access to only part of the disk range via short stroking considerably decreases worst-case seek time, which makes it particularly appropriate for applications that are more concerned about worst-case latency than disk capacity. In some cases, you can discover that SATA disks with very high capacities end up being much closer in performance to more expensive SAS disks just by applying this technique. For example, the first 100 GB of a 1 TB SATA drive is extremely fast, due to how densely that information is packed on the drive. You might be surprised at how well it compares with, say, a 143 GB SAS drive, even one with a higher rotation speed.
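If you want to apply a short-stroke restriction in practice rather than just simulate it, one common approach is to partition only the early portion of the drive and leave the rest unallocated, so seeks never have to travel into the slow inner zones. The following is a minimal sketch using parted; the device name /dev/sdb and the 160 GB cutoff are hypothetical values chosen to match the test above, so adjust them for your hardware, and only do this on a drive with no data you care about:
#!/bin/sh
# Hypothetical example: dedicate only the first 160 GB of /dev/sdb
DRIVE="/dev/sdb"
# Create a new partition table, destroying anything already on the drive
parted -s $DRIVE mklabel gpt
# Allocate a single partition covering just the fast outer zones
parted -s $DRIVE mkpart primary 1MiB 160GB
# The remaining capacity is deliberately left unallocated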
hdtune includes a Random Access test that gives its results in terms of the standard IOPS figure, at various block sizes:

This is not far from the computed value derived previously for this drive: 69.5 IOPS.
The computed average speed here (the poorly labeled "agv. speed" column) is always a good number to note when doing seek-based tests. It's easiest to explain what that means with a database-specific example. Consider that PostgreSQL reads in blocks of 8 KB each. Based on what we have seen, we could expect about 64 IOPS out of this drive as its worst-case performance, doing nothing but random seeks at that block size. This turns into a total transfer rate of:
64 IO/s * 8 KB/IO * (1 MB / 1024 KB) = 0.5 MB/s
That's what "agv. speed" is showing: the computed transfer speed at various block sizes.
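If you want to sanity check that column yourself, the arithmetic is easy to reproduce. Here is a small sketch using awk; the 64 IOPS figure is just the assumed input from above, and holding it constant across block sizes is only a rough approximation, since real drives deliver somewhat fewer IOPS as blocks get larger:
#!/bin/sh
# Estimate random-access transfer rate as IOPS * block size
IOPS=64
for KB in 8 64 512
do
    awk -v iops=$IOPS -v kb=$KB \
        'BEGIN { printf "%d KB blocks: %.1f MB/s\n", kb, iops * kb / 1024 }'
done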
This is the real world of disk performance. While you might think a disk is capable of 60 MB/s or more of transfers, on a truly random workload you might get 0.5 MB/s or less. This is a laptop disk, but only seeing 1 to 2 MB/s on completely random tasks is typical for desktop and server class drives too.
With enough practice, on a UNIX system it's even possible to stare at the output from vmstat and iostat, see how busy the drives are and the actual read/write block counts, and make a rough estimate of the seek vs. sequential workload breakdown from those numbers. If the drives are always busy but only getting 0.5 MB/s, it has to be almost all seek. If they're busy half the time and getting 30 MB/s, that's probably a sequential read hitting a bottleneck on the processing side of things.
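If you want to practice that kind of estimation, the following sketch shows one way to watch a drive while a workload runs. It assumes the Linux sysstat version of iostat and the standard vmstat; the device name sda is just a placeholder:
# Run each of these in its own terminal while the workload is active.
# vmstat's bi/bo columns show blocks read in and written out per interval:
vmstat 5
# iostat -x adds per-device statistics (in MB/s with -m) plus utilization,
# which makes the seek vs. sequential estimate easier for a single drive:
iostat -mx 5 sda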
Serious database administrators have a strong historical bias toward using UNIX-like systems for their servers. The first examples here are using Windows instead because the graphs produced are easier to read, and therefore better for introducing the concepts of this section. But doing so reminded me why so many people prefer not to host their databases on Windows.
Getting useful benchmark results requires the system be quiescent: free of other programs running that would spoil the results of what is intended to be measured. When booting into Windows Vista to generate these results, I discovered the TrustedInstaller process was hogging a considerable amount of CPU and disk resources. It turned out Windows Update had decided it was time to install the next major Vista Service Pack; it was downloading pieces in the background, and pushing me toward the upgrade at every opportunity. It was two hours later before I had completed all the background activity it compelled me to take care of, and had an idle system capable of running these tests.
dd is a standard UNIX utility that's capable of reading and writing blocks of data very efficiently. To use it properly for disk testing of sequential read and write throughput, you'll need to have it work with a file that's at least twice the size of your total server RAM. That will be large enough that your system cannot possibly cache all of the read and write operations in memory, which would significantly inflate results. The preferred block size to give dd is 8 KB, to match how the database is going to do sequential read and write operations. At that size, a rough formula you can use to compute how many such blocks are needed to reach twice your RAM size is as follows:
blocks = 250,000 * (gigabytes of RAM)
Once you know that number, the following simple commands will time writing out a file large enough to not fit in the OS RAM cache, and then read the results back:
time sh -c "dd if=/dev/zero of=bigfile bs=8k count=blocks && sync"
time dd if=bigfile of=/dev/null bs=8k
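As a worked example, on a hypothetical server with 16 GB of RAM the rough formula above gives 250,000 * 16 = 4,000,000 blocks, so the test would look like the following; the bigfile name and the 16 GB figure are just placeholders for your own values:
time sh -c "dd if=/dev/zero of=bigfile bs=8k count=4000000 && sync"
time dd if=bigfile of=/dev/null bs=8k
That writes out and then reads back just over 30 GiB, comfortably more than twice the 16 GB of RAM.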
As this will run for a while without displaying anything interesting, you can watch vmstat or iostat during the test (in another terminal) to see what's happening. vmstat's bi and bo (block in and out) numbers will match the current read/write rate. You should also note the CPU percentage required to reach the peak rate. Displayed CPU percentages are often relative to the total CPU capacity of the system, so if you have four CPUs and you're seeing 25% overall CPU usage, that could be a single CPU running at 100%.
Once the times are reported, you can divide the file size in MB (that's 1024 * its size in GB) by the number of seconds of runtime to get write and read scores in MB/s.
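Here's a small sketch of that arithmetic as a script, using the hypothetical block count from the earlier 16 GB example and made-up runtimes of 350 and 290 seconds; substitute your own numbers:
#!/bin/sh
# Hypothetical inputs: the block count given to dd and the measured times
BLOCKS=4000000
WRITE_SECONDS=350
READ_SECONDS=290
# Convert 8 KB blocks into MB
FILE_MB=$((BLOCKS * 8 / 1024))
echo "Write: $((FILE_MB / WRITE_SECONDS)) MB/s"
echo "Read:  $((FILE_MB / READ_SECONDS)) MB/s"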
Recent dd versions, on Linux at least, will report a MB/s value at the end of their run. Note that the value reported will be a bit inflated, because dd will report it is finished before the actual blocks are all flushed to disk. This is why the previous recipe includes a sync at the end; this makes sure the time reported includes that overhead. The raw transfer rate reported by dd will usually be a bit higher than what you compute when taking this into account.
The dd numbers you'll see should closely match the bonnie++ block output/input numbers, as demonstrated in the next section. If you intend to eventually run bonnie++, there's little sense in performing this test too. Using dd is mainly helpful on UNIX systems where you don't want to install anything just to test the disks out, including systems that don't have the development tools installed that are needed to build additional software.
The standard program used to run a simple disk I/O benchmark on a UNIX-like system is bonnie++. In its default configuration, it will create a file (or set of files) twice as large as physical RAM in the server, to prevent the entire test from being cached in RAM, then read that whole file back in again. The main virtue of the program is that you can just run it from a directory on the volume you want to test and it usually does the right thing without further input.
Here's a full example of downloading bonnie++, building it, running it, and getting an HTML report from it:
$ wget http://www.coker.com.au/bonnie++/bonnie++-1.03e.tgz
$ tar xvfz bonnie++-1.03e.tgz
$ cd bonnie++-1.03e/
$ ./configure
$ make
$ ./bonnie++ -f -n 0 | tee `hostname`.bonnie
$ chmod +x ./bon_csv2html
$ cat `hostname`.bonnie | grep "," | ./bon_csv2html > `hostname`.htm
You'll want to do this on the disk you want to test, or run the bonnie++ program from where you compiled it while in a directory on the test disk. Note that the program won't run as root. You'll need to make the directory you're running it in owned by a regular user and then log in as that user before running it.
The preceding example runs the program with two command line options you will likely want to always use:
- -n 0: Skip file creation test
- -f: Fast mode, skip per-character I/O tests
Neither of those tests is very informative nowadays. What you really want are the sequential block read and write numbers, as well as the seeks result.
You'll have two files come out of this. If you use the suggested file naming convention shown in the example, a full set of results will be in a file named after your hostname with the extension .bonnie. Without the fast mode enabled it looks like the following (one run exactly as above will have less information), again from our sample laptop disk:
Version 1.03e      ------Sequential Output------ --Sequential Input- -Random-
                    -Char-   -Block-  -Rewrite-   -Char-   -Block-   -Seeks-
Machine       Size K/s  %CP  K/s %CP  K/s   %CP  K/s  %CP  K/s  %CP   /s %CP
meddle          4G 44754  96 58386 24 30439  12  52637 97  71059 19 232.8  0
The "Create" tests were deleted from the output shown above to save some space, and the above reformatted a bit to fit better into the width of the page here. The results will also be summarized in a list of comma-separated values that look like the following:
meddle 4G,44754,96,58386,24,30439,12,52637,97,71059,19,232.8,0,16,11349,36,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++
The values showing up with + characters are tests that didn't produce useful output. This comma delimited part can be sent through a program named bon_csv2html to produce an HTML formatted version of the results, which is much easier to read. In the preceding full example, that gets redirected to a file named after your host with the extension .htm.
If you did just run the program with its default parameters, you'll want to ignore the per-character and create results, and look at the block output/input ones instead.
The Random Seeks number reported by bonnie++ is not a simple read-only seek test. Here's its actual description from the manual:
The test runs SeekProcCount processes (default 3) in parallel, doing a total of 8000 random seek reads to locations in the file. In 10% of cases, the block read is changed and written back.
This actually makes it a mixed read/write test, which is really more useful for something like a database simulation anyway. Note that the duration of this test (8000 seeks) is short enough that powerful systems nowadays with large caches can easily give intermittently inflated results here. You might see 500 seeks/second on one run, followed by 2000/second on the next. It's important to confirm the seek figure using additional tools that run longer.
The preceding results were generated with bonnie++ 1.03e, the most recent version from the stable release series at this point. If you have a terabyte or larger drive, you'll need 1.03e at a minimum to get useful results; the earlier 1.0 releases don't handle drives that large correctly. Development is currently nearly complete on an updated bonnie++ 2.0, and there are already plenty of systems where the V1 bonnie++ results aren't very useful. Ideally you'd default to trying the snapshots of the experimental V2 releases (currently at 1.96) and only fall back to the 1.0 series if that doesn't work for you, but as described in the next section that may not always be practical. I normally end up needing to use both.
Here's a command line that works with V1.96 and provides a faster and more accurate test than the earlier versions:
bonnie++ -f -n 0 -c 4
In addition to the flags described for the previous version, this turns on the following new option:
- -c 4: Concurrency, use four processes at once
Tests here suggest very high values for concurrency fail to deliver any incremental improvement, but going from one up to two, three, or four can improve results in a multi-core system.
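If you want to check that behavior on your own hardware, a simple approach is to repeat the run at several concurrency levels and keep the labeled output for comparison. This sketch just wraps the command shown above in a loop; it assumes you're running from the directory where you built the 1.96 snapshot (which supports -c), and the output file naming simply extends the earlier hostname convention:
#!/bin/sh
# Try increasing concurrency levels and save each result separately
for c in 1 2 3 4
do
    ./bonnie++ -f -n 0 -c $c | tee `hostname`-c$c.bonnie
done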
Just like hdtune, bonnie++ has a utility, a separate binary named zcav, that will track transfer rate across the disk. It produces simple text output that you can save to a file:
./zcav -f/dev/sda > sda.zcav
You will probably need to run this program as root, as it accesses the whole drive. You can convert the program's output to a graph using the gnuplot software.
There are some examples of zcav results on the bonnie++ site; here's the output showing performance on the sample laptop used for all the examples so far:

And this is the gnuplot script that produced it:
unset autoscale x
set autoscale xmax
unset autoscale y
set autoscale ymax
set xlabel "Position GB"
set ylabel "MB/s"
set key right bottom
set title "Seagate Momentus 7200.4 Laptop"
set terminal png
set output "laptop-zcav.png"
plot "laptop.zcav" title "7200RPM Laptop ST9320423AS Max/Min=92/49"
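If you haven't used gnuplot before, the usual way to run a script like this is to save it into a file and pass that file name on the command line. The script file name below is just an arbitrary choice for this example; the data file and PNG names come from the script itself:
# Produce laptop-zcav.png from laptop.zcav using the script above,
# saved here under a hypothetical name
gnuplot laptop-zcav.gnuplot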
The zcav software didn't produce any of the max/min data or titles shown on that graph; those were all added manually after extracting the values from the raw text file, to make a nicer graph. This is an example of how the Linux tools can give the same quality of basic results as something like hdtune, but you have to put a bit more work into it.
There is one positive aspect to this extra work: the bottleneck on the early part of the disk seen in the Windows hdtune results isn't there. Performance peaks at 92 MB/s and falls slowly and steadily through all the transfer zones on the disk in the classic ZCAV pattern. Now that you see exactly what this sort of graph is supposed to look like, if you revisit the hdtune graph once again it should be obvious why I immediately suspected a problem when seeing it: the shape just wasn't right. This is exactly the same system, just booted into Ubuntu Linux instead of Windows Vista. That strongly suggests that whatever was limiting the Windows transfer rate to 78 MB/s, and therefore flattening the left side of the hdtune graph, was just some sort of Vista software problem.
This particular issue, noticing that something might be wrong with the disk configuration just because it didn't look right compared to similar hardware seen in the past, is exactly why it's worthwhile even for database administrators to become familiar with hardware benchmarking basics. A more serious problem of this type could easily throttle database performance, and without knowing it's the hardware to blame much time could be wasted attacking that problem from the database side.
While initially targeted as a MySQL tool, the sysbench program available from http://sysbench.sourceforge.net/ is also valuable for testing low-level aspects of performance that impact databases, ones that apply equally well to PostgreSQL. It even supports running read-only tests against PostgreSQL databases, a feature not demonstrated here. The use in this chapter is just for its low-level hardware tests.
Note
These specific sysbench tests should be considered secondary to the ones already shown. If you have confirmed that sequential read, write, and seek speeds are good with bonnie++, and know that commit rates are good from doing INSERT tests with PostgreSQL, you really don't need this data too. It's mainly useful as an alternate way to do more thorough and specific seek and commit rate tests, and is not recommended as your primary testing tool.
Here's an example of downloading and compiling sysbench without support for any database, the only way it will be used here:
$ wget http://sourceforge.net/projects/sysbench/files/sysbench/0.4.10/sysbench-0.4.10.tar.gz/download
$ tar xvfz sysbench-0.4.10.tar.gz
$ cd sysbench-0.4.10/
$ ./configure --without-mysql
$ make
$ sysbench/sysbench --test=cpu run
This shows an example of running the simplest test available, the one for CPU speed. The results that test gives are not very useful for comparing modern processors, but it does quickly confirm that the program is working.
Note
The choice of sysbench-0.4.10 instead of the latest version available right now (0.4.12) is deliberate. Changes made since 0.4.10 have introduced compilation issues on several platforms, and the software isn't very well maintained in that regard. You may discover that it takes more work on the configure and make steps to build sysbench for your system than shown here.
Once you've compiled the program, you don't even need to install it onto your system. It will run fine from a local build in your home directory.
Unlike the rest of the tests sysbench is used for in this chapter, the seek rate test requires a three-step process where test files are created, the test is run, and then those files are cleaned up. You also have options for how many threads to keep active, how large the file should be, and what read/write mode to use. The following script shows how to run a seek test with the most commonly changed portions as environment variables:
#!/bin/sh
PREFIX="$HOME/sysbench-0.4.10"
THREADS=1
GBSIZE=4
MODE=rndrd
$PREFIX/sysbench/sysbench --test=fileio --num-threads=$THREADS \
  --file-num=$GBSIZE --file-total-size=${GBSIZE}G --file-block-size=8K \
  --file-test-mode=$MODE --file-fsync-freq=0 --file-fsync-end=no prepare
$PREFIX/sysbench/sysbench --test=fileio --num-threads=$THREADS \
  --file-num=$GBSIZE --file-total-size=${GBSIZE}G --file-block-size=8K \
  --file-test-mode=$MODE --file-fsync-freq=0 --file-fsync-end=no \
  run --max-time=60
$PREFIX/sysbench/sysbench --test=fileio --num-threads=$THREADS \
  --file-num=$GBSIZE --file-total-size=${GBSIZE}G --file-block-size=8K \
  --file-test-mode=$MODE --file-fsync-freq=0 --file-fsync-end=no cleanup
Unlike the bonnie++ seek test, which by default aims at just twice your total RAM, you can easily make this test span a large portion of the disk instead. Any seek results should always include what portion of the disk the seeking took place over. To get useful results from larger disks, you might want to use hundreds of GB worth of data for this test, instead of just the 2 * RAM that bonnie++ uses for its seek testing.
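For instance, on a large drive you might edit the variables at the top of the script above to something like the following; the 256 GB figure and the thread count are arbitrary examples, and the prepare step will need that much free space and a fair amount of time to create the test files:
# Hypothetical settings for a larger drive: seeks spread over 256 GB of
# test files, with four concurrent threads issuing the random reads
THREADS=4
GBSIZE=256
MODE=rndrd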
It's possible to use sysbench to measure how fast commits can be flushed to disk, using the standard fsync call, just like the database defaults to. Note that in this case, the file size being used is 16384 bytes, even though PostgreSQL block writes are 8192 bytes. The version tested here didn't work correctly with the block size reduced that far, and since the actual number of bytes doesn't impact the commit rate until it becomes much larger, that's not worth worrying about. The following is a Linux-specific script that includes disabling and then re-enabling the drive's write cache; the basic sysbench call can be used on any platform that program runs on, however:
#!/bin/sh
DRIVE="/dev/sda"
PREFIX="$HOME/sysbench-0.4.10"
# Disable write cache
hdparm -W 0 $DRIVE
echo fsync with write cache disabled, look for "Requests/sec"
$PREFIX/sysbench/sysbench --test=fileio --file-fsync-freq=1 \
  --file-num=1 --file-total-size=16384 --file-test-mode=rndwr run
# Enable write cache (returning it to the usual default)
hdparm -W 1 $DRIVE
echo fsync with write cache enabled, look for "Requests/sec"
$PREFIX/sysbench/sysbench --test=fileio --file-fsync-freq=1 \
  --file-num=1 --file-total-size=16384 --file-test-mode=rndwr run
On a standard 7200 RPM drive, spinning 120 times per second and therefore limited to that as its maximum commit rate, the version with the write cache disabled would look like this:
104.81 Requests/sec executed
While the cached version will likely show thousands of commits per second.
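The rotation-rate ceiling mentioned above is easy to work out for whatever drives you have. This is just a trivial arithmetic sketch, not a measurement, and the integer division truncates slightly for 10000 RPM:
#!/bin/sh
# A drive that must wait for the platter on every commit can't exceed
# one commit per rotation
for rpm in 5400 7200 10000 15000
do
    echo "$rpm RPM drive: about $(($rpm / 60)) commits/sec maximum"
done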
There are many more complicated disk benchmark programs available:
- iozone: http://www.iozone.org/ allows testing all sorts of disk scenarios.
- fio: http://freshmeat.net/projects/fio/ lets you completely script exactly what benchmark scenario you want to run. Many samples at http://wiki.postgresql.org/wiki/HP_ProLiant_DL380_G5_Tuning_Guide.
- pgiosim: http://pgfoundry.org/projects/pgiosim/ simulates very specific types of PostgreSQL workloads.
There are a few issues that make all of these less useful than the tools covered here. The first is that they are complicated tools to set up and to interpret the results of. Correspondingly, when you do find a problem that involves a hardware vendor issue, there's no way the vendor will trust or attempt to replicate things discovered via these tests. dd, bonnie++, and hdtune are known on their respective platforms as simple, reliable, easy to interpret tools. If you can show your vendor a problem using one of those tools, there's little they can do to wiggle out of that. Even sysbench is a bit more complicated than you'd want to rely upon in a vendor dispute. Simplicity and transparency with your vendor is much more important for doing basic database performance testing than being able to test more complicated scenarios.
And really, if your goal is eventually to tune a database application, tuning exercises should primarily be done at that level anyway. Once you're sure the basic hardware works as expected, move right onto database-level tests and see if tuning changes have any impact. That's much more likely to find out the real issues that do and don't matter than these complicated synthetic disk benchmarks.