PostgreSQL 9.0 High Performance

Sample disk results

Here's a summary of what was measured for the laptop drive tested in detail previously, along with a desktop drive, both alone and in a RAID array, as a second useful data point:

Disks                        | Seq read (MB/s) | Seq write (MB/s) | bonnie++ seeks | sysbench seeks | Commits per sec
-----------------------------|-----------------|------------------|----------------|----------------|----------------
Seagate 320GB 7200.4 Laptop  | 71              | 58               | 232 @ 4GB      | 194 @ 4GB      | 105 or 1048
WD160GB 7200RPM              | 59              | 54               | 177 @ 16GB     | 56 @ 100GB     | 10212
3X WD160GB RAID0             | 125             | 119              | 371 @ 16GB     | 60 @ 100GB     | 10855

Note how all the seek-related information is reported here relative to the size of the area being used to seek over. This is a good habit to adopt. Also note that for the laptop drive, two commit rates are reported. The lower value is without the write cache enabled (just under the drive's rotation rate of 120 rotations/second), while the higher one has it turned on, and is therefore relying on an unsafe, volatile write cache.
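
The relationship between the rotation rate and the uncached commit figure is simple to work out. Here's a quick sketch of that arithmetic, using the numbers from the table above:

```python
# Why an uncached commit rate sits just under the drive's rotation rate:
# each commit has to wait for the platter to come back around before the
# write reaches disk, so one commit per rotation is the practical ceiling.

rpm = 7200
rotations_per_second = rpm / 60          # 120 rotations/second
print(f"Rotation ceiling: {rotations_per_second:.0f} commits/second")

measured_uncached = 105                  # laptop drive, write cache disabled
measured_cached = 1048                   # same drive, volatile write cache on
print(f"Measured: {measured_uncached} uncached, {measured_cached} with cache")
```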

The other two samples use an Areca ARC-1210 controller with a 256 MB battery-backed write cache, which is why the commit rate is so high yet still safe. The hard drive shown is a 7200 RPM 160 GB Western Digital SATA drive, model WD1600AAJS. The last configuration includes three of those drives in a Linux software RAID0 stripe. Ideally, this would provide 3X as much performance as a single drive. It might not look like that's quite the case from the bonnie++ read/write results, which show closer to a 2.1X speedup. But that's deceiving, and once again it results from ZCAV issues.

Using the bonnie++ zcav tool to plot speeds on both the single drive and RAID0 configurations, you get the following curves:

[Figure: bonnie++ zcav transfer rate curves for the single drive and the three-drive RAID0 configurations]

Derive figures from the minimum and maximum transfer speeds in the raw data, and the single-drive numbers are almost exactly tripled:

  • 3 x 37 MB/s = 111 MB/s theoretical minimum; actual is 110 MB/s
  • 3 x 77 MB/s = 231 MB/s theoretical maximum; actual is 230 MB/s
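
The same arithmetic is easy to script if you want to sanity check an array of your own this way. This is a minimal sketch using the example figures above; substitute your own zcav numbers:

```python
# Rough check of RAID0 sequential scaling against single-drive zcav figures.
# The numbers below are the example values from this chapter.

drives = 3
single_min = 37    # slowest (inner) zone of one drive, MB/s
single_max = 77    # fastest (outer) zone of one drive, MB/s
array_min = 110    # measured on the three-drive RAID0 array, MB/s
array_max = 230

for label, single, array in [("min", single_min, array_min),
                             ("max", single_max, array_max)]:
    expected = drives * single
    efficiency = 100.0 * array / expected
    print(f"{label}: expected {expected} MB/s, measured {array} MB/s "
          f"({efficiency:.0f}% of linear scaling)")
```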

That's perfect scaling, exactly what you'd hope to see when adding more disks to a RAID array. This scaling wasn't obvious when looking only at average performance, probably because the files created weren't on exactly the same portion of each disk, so the comparison wasn't quite fair. The reason ZCAV issues have been highlighted so many times in this chapter is that they pop up so often when you attempt fair comparison benchmarks of disks.

Disk performance expectations

So what are reasonable expectations for how your disks should perform? The last example demonstrates how things should work. Any good drive nowadays should manage sequential transfers of well over 50 MB/s on its fastest area, with 100 MB/s easy to find. The slowest part of the drive will be closer to half that speed. It's good practice to test individual drives before building more complicated arrays out of them. If a single drive is slow, you can be sure an array of them will be bad too.

The tricky part of estimating how fast your system should be is when you put multiple drives into an array.

When multiple disks are combined into a RAID1 array, the sequential read and write speeds will not increase. However, a good controller or software RAID implementation will use both drives at once when seeking, which can as much as double the measured seek rate.

When multiple drives are added to a RAID0 array, you should get something close to linear scaling of the speeds, as shown in the previous section. Two 50 MB/s drives in RAID0 should be at least close to 100 MB/s. It won't be perfect in most cases, but it should be considerably faster than a single drive.

Combinations like RAID10 should scale up sequential reads and writes based on the number of drive pairs in RAID0 form, while also getting some seek improvement from the RAID1 mirroring. This combination is one reason it's preferred for so many database disk layouts.

If you're using RAID5 instead, which isn't recommended for most databases, read speed will scale with the number of drives you use, while write speeds won't increase.
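
As a rough way to set expectations before you test, those rules of thumb can be collapsed into a small estimator. This is only a sketch of the approximations above, not a precise model, and the function name and structure are mine rather than anything standard; treat the outputs as ceilings that real arrays will usually come in somewhat below:

```python
# Rule-of-thumb sequential throughput ceilings for common RAID levels,
# following the approximations described in the text.

def expected_sequential(raid_level, drives, single_drive_mb_s):
    """Return (read_mb_s, write_mb_s) upper-bound estimates."""
    if raid_level == "raid0":
        # Both reads and writes scale close to linearly with drive count.
        return drives * single_drive_mb_s, drives * single_drive_mb_s
    if raid_level == "raid1":
        # Mirroring doesn't speed up sequential transfers; seeks improve instead.
        return single_drive_mb_s, single_drive_mb_s
    if raid_level == "raid10":
        # Scales with the number of mirrored pairs striped together.
        pairs = drives // 2
        return pairs * single_drive_mb_s, pairs * single_drive_mb_s
    if raid_level == "raid5":
        # Reads scale with drive count; writes don't improve.
        return drives * single_drive_mb_s, single_drive_mb_s
    raise ValueError(f"unknown RAID level: {raid_level}")

# Example: the three-drive RAID0 from this chapter, 77 MB/s in its fastest zone.
print(expected_sequential("raid0", 3, 77))    # (231, 231)
print(expected_sequential("raid10", 4, 77))   # (154, 154)
```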

Sources of slow disk and array performance

Most of the time, if you meet expectations for sequential read and write speeds, your disk subsystem is doing well. You can measure seek time, but there's little you can do to alter it besides adding more disks; it's more a function of the underlying physical drives than something you can change. Most problems you'll run into with slow disks will show up as slow read or write speeds.

Poor quality drivers for your controller can be a major source of slow performance. Usually you'll need to connect the same drives to another controller to figure out whether this is the case. For example, if you have an SATA drive that's really slow when connected to a RAID controller, but the same drive is fast when connected directly to the motherboard, bad drivers are a prime suspect.

One problem that can significantly slow down read performance in particular is not using sufficient read-ahead for your drives. This normally manifests itself as writes being faster than reads, because the drive ends up idle too often while waiting for the next read request to come in. This subject is discussed more in Chapter 4, Disk Setup.
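
On Linux you can at least confirm the current read-ahead setting before digging into the tuning covered in Chapter 4. A minimal sketch, assuming a Linux system that exposes the usual /sys/block/<device>/queue/read_ahead_kb file; "sda" here is just a placeholder device name:

```python
# Quick look at the Linux read-ahead setting for a block device.
# Assumes Linux sysfs; substitute your own device name for "sda".

from pathlib import Path

device = "sda"
ra_path = Path(f"/sys/block/{device}/queue/read_ahead_kb")

read_ahead_kb = int(ra_path.read_text().strip())
print(f"{device}: read-ahead is {read_ahead_kb} kB")
```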

Conversely, if writes are very slow relative to reads, check the write caching policy and size on your controller, if you have one. Some controllers prefer to allocate their cache for reading instead of writing, which is normally the wrong decision for a database system. Reads should be cached by your operating system and the database; it's rare for them to ask the controller for the same block more than once. It's quite likely, however, that your OS will overflow the controller's write cache by writing heavily to it.

RAID controller hardware itself can also be a bottleneck. This is most often the case when you have a large number of drives connected, with the threshold for what "large" means dependent on the speed of the controller. Normally to sort this out, you'll have to reduce the size of the array temporarily, and see if speed drops. If it's the same even with a smaller number of drives, you may be running into a controller bottleneck.

The connection between your disks and the rest of the system can easily become a bottleneck. This is most common with external storage arrays. While it might sound good that you have a "gigabit link" to a networked array over Ethernet, a fairly common NAS configuration, if you do the math that's at most 125 MB/s: barely enough to keep up with two drives, and possible to exceed with just one. No way will that be enough for a large storage array. Even Fibre Channel arrays can run into their speed limits and become the bottleneck for high sequential read speeds if you put enough disks into them. Make sure you do a sanity check on how fast your drive interconnect is relative to the speeds you're seeing.
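
It's worth doing that arithmetic for whatever interconnect you actually have. Here's a minimal sketch using the fast-zone transfer rate measured earlier in this chapter; the link speeds are nominal figures that ignore protocol overhead, so real ceilings will be a bit lower:

```python
# How many drives it takes to saturate a storage interconnect. Link speeds
# are nominal and ignore protocol overhead, so real ceilings are lower.

drive_fast_zone = 77    # MB/s, fastest zone of the example drive in this chapter

links_mb_s = {
    "Gigabit Ethernet": 1000 / 8,       # 125 MB/s
    "4Gb Fibre Channel": 4000 / 8,      # 500 MB/s nominal; usable rate is lower
}

for name, mb_s in links_mb_s.items():
    drives_to_saturate = mb_s / drive_fast_zone
    print(f"{name}: about {mb_s:.0f} MB/s, saturated by roughly "
          f"{drives_to_saturate:.1f} of these drives")
```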

It's also possible for a disk bottleneck to actually be on the CPU or memory side of things. Disk controllers can sometimes use quite a bit of the total system bus bandwidth. This isn't as much of a problem with modern PCI Express controller cards that use the higher transfer rates available, but you do need to make sure the card is placed in a slot, and configured, so that it's taking advantage of those. Monitoring overall system performance while running the test can help you notice when this sort of problem is happening; it will sometimes show up as an excessive amount of CPU time being used during the disk testing.

Poor or excessive mapping of your physical disks to how the operating system sees them can also slow down results by more than you might expect. For example, passing through Linux's LVM layer can cause a 10-20% speed loss compared to just using a simple partition. Other logical volume abstractions, whether in hardware or software, can drop your performance too.

One performance aspect that is overrated, by storage vendors in particular, is aligning file system blocks with the blocks on physical storage, or with the "stripes" of some RAID array configurations. While in theory a misalignment can turn a single physical block write into two when a block straddles a stripe boundary, you'll need quite a system before this becomes your biggest problem. For example, when doing random database writes, once the disk has done a seek to the write location, it makes little difference whether it writes one block or two after it arrives. By far the biggest overhead is the travel to that location, not the time spent writing once there. And since it's easy for your bottlenecks to actually show up at the block level in the OS or controller, where stripe splits aren't even noticed, this problem becomes even less likely to come up.

Alignment is worth investigating if you're trying to get good performance out of RAID5, which is particularly sensitive to this problem. But for most systems using the better performing RAID levels, tuning here is more trouble than it's worth. Don't be surprised if a storage vendor, particularly one defending an underperforming SAN, tries to blame performance issues on this area though. You'll likely have to humor them by doing the alignment just to rule that out, but don't expect it to change your database performance very much.
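
If you do end up having to check alignment, the check itself is just modular arithmetic on the partition's starting offset. Here's a minimal sketch; the starting sector and stripe size are hypothetical example values you'd replace with the real ones from your partition table and array setup:

```python
# Check whether a partition's starting offset lines up with a RAID stripe
# boundary. The starting sector and stripe size are hypothetical examples.

sector_size = 512                 # bytes per sector
partition_start_sector = 2048     # e.g. as reported by fdisk or parted
stripe_size = 64 * 1024           # a 64 kB stripe, as an example

start_offset = partition_start_sector * sector_size
misalignment = start_offset % stripe_size

if misalignment == 0:
    print("Partition start is aligned to the stripe size.")
else:
    print(f"Partition start is {misalignment} bytes past a stripe boundary, "
          "so some writes will straddle two stripes.")
```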

The length of this list should give you an idea why doing your own testing is so important. It should strike you that there are a whole lot of points where a disk configuration can go wrong in a way that slows performance down. Even the most competent vendor or system administrator can easily make a mistake in any one of these spots that cripples your system's disk speed, and correspondingly how quickly the database running on it will get work done. Would you believe that even excessive vibration is enough to considerably slow down a drive nowadays? It's true!
