Burn-in and benchmark karm #78

Closed
zerebubuth opened this issue May 16, 2016 · 32 comments
@zerebubuth (Collaborator)

This is the new NVMe server, so there are bound to be some tweaks and work-arounds needed to get the best performance.

@pnorman (Collaborator) commented May 16, 2016

Intel's suggested workload for NVMe testing is:

  • install fio (apt-get install fio)
  • create a file called 4krr.ini with:
[global]
name=4k random read 4ios in 32 queues
filename=/dev/nvme0n1
ioengine=libaio
direct=1
bs=4k
rw=randread
iodepth=32
numjobs=8
buffered=0
size=100%
runtime=120
time_based
randrepeat=0
norandommap
refill_buffers
[job1]

Adjust so that numjobs = the number of logical cores and iodepth = 128 / numjobs.

Then run fio 4krr.ini. This operates directly on the raw NVMe device, so don't have anything on it.
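
For example, on a machine with 40 logical cores (a value picked purely for illustration), the adjusted lines in 4krr.ini would be:

numjobs=40
iodepth=3

(128 / 40, rounded down), and the run is then just:

fio 4krr.ini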

clat percentiles and IOPS are the two most relevant metrics, but keep in mind that the reporting is per job.

You can also change filename to point at a file on the filesystem instead, which lets you test with the filesystem and its settings in place.

If testing relative performance, you need to make sure the drive is consistently conditioned to get comparable results, which involves doing a secure erase, workload-independent preconditioning, and finally testing.

Ref: https://communities.intel.com/servlet/JiveServlet/download/38-137191/Lab_NVMe_PCI_SSD.pdf
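
A rough sketch of that conditioning sequence, assuming nvme-cli and fio are available and the target is /dev/nvme0n1 (all of this is destructive, and the exact commands are illustrative rather than taken from the Intel document):

# secure erase (destroys all data on the device)
nvme format /dev/nvme0n1 --ses=1

# workload-independent preconditioning: fill the drive twice with 128k sequential writes
fio --name=precondition --filename=/dev/nvme0n1 --rw=write --bs=128k \
    --iodepth=32 --numjobs=1 --direct=1 --loops=2

# then run the actual test, e.g. fio 4krr.ini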

@zerebubuth (Collaborator, author)

That looks like a great way to test and compare the raw block devices, and it would be interesting to see what (hopefully huge) numbers we can pull off raw disk. There are a few layers of software between the raw disk and Postgres, though, including software RAID and the filesystem, which will probably also impact performance.

The last benchmarking done on katla with pgbench (scale factor 70,000, for about a 1 TB database) gave something like a peak of 5K IOPS. Concurrency was swept from 1 through 128, and I think the peak was usually around 16-32 clients. We were mainly looking to benchmark various filesystems and SSD WAL / index layouts, though, so we were less concerned with absolute performance than with relative.

Still, I'm hoping that karm can smash through 5K IOPS!
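
(For reference, a sketch of that kind of sweep; the scale factor, database name and run length here are illustrative rather than the exact katla invocation:)

# initialise a roughly 1 TB pgbench database
pgbench -i -s 70000 pgbench_test

# sweep the client count
for c in 1 2 4 8 16 32 64 128; do
    pgbench -c $c -j $c -T 600 pgbench_test
done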

@zerebubuth (Collaborator, author)

I've started running fio with the suggested config. I'll be testing RAID6 and RAID10 raw, plus those two again with ext4, and finally btrfs. This is just to see what performance bottlenecks there are. We've potentially already found one, as software RAID6 parity calculation under Linux is done in a single kernel thread which becomes a bottleneck at the data rates that the NVMe drives are capable of.
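
(For reference, a sketch of how such arrays might be put together with mdadm; the device names and drive count are illustrative, not necessarily the layout used here:)

# software RAID6 across the NVMe drives
mdadm --create /dev/md0 --level=6 --raid-devices=6 /dev/nvme[0-5]n1

# or RAID10
mdadm --create /dev/md1 --level=10 --raid-devices=6 /dev/nvme[0-5]n1

# ext4 on top, mounted as in the results below
mkfs.ext4 /dev/md0
mount -o noatime,discard /dev/md0 /mnt/db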

@zerebubuth (Collaborator, author)

Partial results summary; see karm_benchmark.tar.gz for full details. Note that these were not run against the raw disk, but against files pre-allocated on an ext4 filesystem mounted with noatime,discard.

For the separate 4KB random read (4krr) and 128KB sequential write (128ksw) workloads, the results for both RAID10 and RAID6 are extremely impressive. Summed over 40 threads, read IOPS was in the millions; writes were 80k IOPS for RAID10 and 3.8k for RAID6. RAID6 write was limited by the single-threaded parity calculation rather than disk speed.

For a mixed workload with 10% writes, which is similar to the ratio on the current database master, I found a surprising result. As expected, both results were reduced from the pure read / pure write cases, but RAID10 goes down to 13k read and 1k write IOPS, whereas RAID6 is at 88k read and 10k write. I find this very surprising and will be re-running the RAID10 mixed case to make sure there wasn't something odd going on with that case.

In any case, the 88k read IOPS and 10k write is far more than the 5k IOPS that katla got during pgbench, and the 3k read / 0.4k write that katla is peaking at now.

So, although RAID6 parity calculation is a bottleneck, it looks like it's one that's unlikely to seriously constrain us in the near future.
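
(For reference, a minimal fio job sketch for that 90/10 mix, in the same style as 4krr.ini; the filename points at a pre-allocated file on the mounted filesystem and is illustrative:)

[global]
name=4k random 90/10 read-write mix
filename=/mnt/db/fio-testfile
ioengine=libaio
direct=1
bs=4k
rw=randrw
rwmixread=90
iodepth=3
numjobs=40
runtime=120
time_based
randrepeat=0
norandommap
refill_buffers
[job1]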

@zerebubuth (Collaborator, author)

Re-running the RAID10 mixed case resulted in about 15% improved performance, but nowhere near enough to make it competitive with RAID6.

I also tried btrfs, which got 138k read + 15k write IOPS in the mixed case, or about 57% more than RAID6 with ext4. The filesystem was created with RAID6 options for both data and metadata, and mounted without any special options.

Unless anyone's got further suggestions to try, I'm inclined to say btrfs is interesting but perhaps still a bit risky for production use, and go with RAID6 + ext4 and move on to pgbench.
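
(For reference, a btrfs filesystem like that could be created roughly as follows; device names are illustrative:)

# btrfs with RAID6 profiles for both data and metadata, directly on the NVMe devices
mkfs.btrfs -d raid6 -m raid6 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 /dev/nvme5n1
mount /dev/nvme0n1 /mnt/db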

@pnorman (Collaborator) commented Jun 15, 2016

PostgreSQL 9.0 High Performance reports that XFS gives faster performance and less variation than ext3, and suggests turning on the full_page_writes GUC. I'm not sure if it's worth checking or not.

What do we want to check with pgbench? My list would include

Low-level and kernel

  • device readahead
  • /proc/sys/vm/dirty_ratio + /proc/sys/vm/dirty_background_ratio
  • swappiness, if swap is turned on

Setup

  • Moving pg_xlog to its own volume

PostgreSQL GUCs

  • wal_sync_method
  • shared_buffers
  • checkpoint_completion_target
  • max_wal_size
  • wal_writer_delay

This isn't a list of all the GUCs which have performance implications, just the ones where benchmarking might help (see the sketch below for where they live).
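
A rough sketch, with purely illustrative values (not recommendations) and /dev/md0 standing in for the array:

# kernel / block layer
blockdev --setra 256 /dev/md0              # device readahead
sysctl -w vm.dirty_ratio=10
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.swappiness=1                  # only matters if swap is enabled

# postgresql.conf
wal_sync_method = fdatasync
shared_buffers = 32GB
checkpoint_completion_target = 0.9
max_wal_size = 16GB
wal_writer_delay = 200ms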

@tomhughes (Member)

I'm not sure the disks we have on karm really lend themselves to creating multiple arrays, so there may not be much point testing moving pg_xlog unless we're going to put it on the system disk.

@pnorman (Collaborator) commented Jun 15, 2016

Moving pg_xlog generally means making the main array two disks smaller and using those disks as a RAID1, reducing total capacity. It's a very common thing to do, but I have no clue if it's useful with NVMe or SSDs.

@tomhughes (Member)

The problem is that if we took two of the NVMes to make a second array, there wouldn't be enough space left on the main array.

@zerebubuth (Collaborator, author)

I've tested a few values for read-ahead and sync methods - full details are in karm_benchmark.tar.gz. The summary is: read-ahead seems to make no measurable difference, and fdatasync is the default and probably best choice. I only did a full pg_bench test for fdatasync and open_datasync, as these were the only ones which looked interesting from a run of pg_test_fsync. fdatasync looked the best on 2x 8kB writes, but open_datasync looked better for one. The full results don't show much difference, except that open_datasync performance peaks at a smaller number of clients.
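
(For reference, a sketch of the kind of commands involved, not necessarily the exact invocations used, with /dev/md0 standing in for the array:)

# sweep the device read-ahead, re-running the benchmark after each change
blockdev --setra 128 /dev/md0
blockdev --getra /dev/md0

# quick syscall-level comparison of sync methods
pg_test_fsync -f /srv/postgresql/fsync-test

# then set wal_sync_method = fdatasync (or open_datasync) in postgresql.conf
# and run the full pg_bench sweep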

I'm moving on to testing shared_buffers, but might perhaps make it back to the VM/dirty stuff. What would be worth testing there, and what's the best way to test it?

@tomhughes (Member) commented Jul 13, 2016

I just built the latest smartmontools (which has nvme support) from source and I noticed that karm (and dulcy) are reporting the nvmes are formatted with 512 byte LBAs:

Number of Namespaces:               1
Namespace 1 Size/Capacity:          2,000,398,934,016 [2.00 TB]
Namespace 1 Formatted LBA Size:     512

and the supported LBA sizes are listed as:

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -     512       8         2
 2 -     512      16         2
 3 -    4096       0         0
 4 -    4096       8         0
 5 -    4096      64         0
 6 -    4096     128         0

I suspect that 4096 byte LBAs will be more efficient (likely that is what the Rel_Perf column refers to).

@tomhughes (Member)

According to https://www.reddit.com/r/archlinux/comments/3k3ohz/nmve_ssd_fstab_options/cuvr5uk a "relative performance" of 2 is "good" and 0 is "best".

http://www.intel.com/content/www/us/en/support/solid-state-drives/data-center-ssds/000016238.html is Intel's documentation on changing it.
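
(For what it's worth, with nvme-cli the reformat would look roughly like this, picking format index 3 (4096-byte data, no metadata) from the table above; it destroys all data on the namespace, and isdct may be the better-supported route:)

# switch namespace 1 to LBA format 3 (4096-byte data blocks)
nvme format /dev/nvme0n1 --lbaf=3

# check which format is now in use
nvme id-ns /dev/nvme0n1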

@Firefishy (Member)

512 bytes will be for compatibility. I think by now ALL tools in Ubuntu 16.04 (hell 14.04 too) will likely be 4096 byte clean. The Intel isdct is already installed, so likely best to use that.

@Firefishy (Member)

PostgreSQL uses 8192 bytes as its default block size, so 4096-byte LBAs align nicely: each page maps to 2 operations rather than 16.

@zerebubuth (Collaborator, author)

Yup, it seemed to go without a hitch until this kernel panic. I'm hoping that's because the kernel wasn't expecting an LBA size change online... I've rebooted and it seems okay so far. Rebuilding the array and will post some fio / pg_bench stats once it has finished.

@zerebubuth (Collaborator, author)

For a mixed workload with 10% writes, ... RAID6 is at 88k read [IOPS] and 10k write.

With 4K LBAs and a 16KB block size, I'm seeing 160k read IOPS and 18k write, which shouldn't be much of a surprise. Read latency averages 262us and write 6,470us, compared with 70us read and 14,900us write for the 512-byte LBAs. Perhaps that's due to reads for a 16kB block being split over more drives?

It'll be interesting to see if that read latency penalty carries over into the pg_bench tests.

Full FIO results

@zerebubuth (Collaborator, author)

Results from pg_bench are unexpected. To be honest, I was expecting no change, but peak TPS has dropped by 17% and latency is up by 14% at the previous TPS peak (clients=128).

If the latency was write-dominated, as I was expecting it to be, then I'd have expected latency to have improved, based on the FIO results. Something else must be going on, and I don't understand it... Anyone have any ideas?

Full results available in karm_benchmark.tar.gz.

@pnorman (Collaborator) commented Jul 18, 2016

Is it lba4k and newconf that are equivalent except for the LBA size?

@zerebubuth (Collaborator, author)

lba4k and newconf-fsync+sync_method=fdatasync are equivalent apart from the LBA size, and probably newconf as well, given that fdatasync is probably the default. Note that the -fsync in the name of the test case is wrong; it was a mistake when I was naming the output files, and fsync is actually on for all the tests so far.

@zerebubuth (Collaborator, author)

I re-ran the lba4k test, and now its results are identical to newconf-fsync+sync_method=fdatasync to within the margin of error. This is annoying, because I thought that running each test for 40 mins and taking the median result of 3 runs (i.e: 18h in total) would be enough to control for this, but evidently not.

Now bumping up checkpoint_completion_target, although I'm not sure the change from 0.8 to 0.9 is going to make much difference.

@zerebubuth (Collaborator, author)

Changing checkpoint_completion_target from 0.8 to 0.9 made no significant difference. Given that nothing I've changed has made a significant difference, I wonder whether the config parameters matter much on NVMe, i.e. the disk isn't the bottleneck any more. Another possibility is that I'm not testing what I think I'm testing...

At this point, it seems worthwhile to move on to pg_restore-ing a backup into 9.5 to test how long that takes. Slightly related to #11 - we should see if that's still a good plan for how to migrate to 9.5.

@Firefishy (Member)

Deploy the karm! ;-) I'd stick with the 4k. We are well within our comfort zone in terms of performance. Timing for DB import would be good, even if we do it again.

@pnorman (Collaborator) commented Jul 21, 2016

I wonder if the config parameters make much of a difference on NVMe - i.e: the disk isn't the bottleneck any more.

This would reflect what I found with a Hetzner NVMe machine, osm2pgsql, and Mapnik rendering. At no point could I saturate the I/O.

@zerebubuth (Collaborator, author)

With -j 20:

real    1212m46.449s
user    188m27.744s
sys     13m51.380s

With -j 40:

real    1457m7.658s
user    188m42.188s
sys     14m22.460s

So it looks like it's around 20h to restore the database dump into 9.5. It's possible that could go quicker if we temporarily turned off fsync and increased the max WAL size. However, it's also possible that would have little effect, as the peak I/O saturation was under 30% for the majority of the restore. I think it's possible that there isn't enough parallelism in the custom dump format to saturate either the processors, or the I/O system.
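
(For reference, a sketch of the kind of invocation being timed above; the dump path and database name are illustrative:)

time pg_restore -j 20 -d openstreetmap /backup/openstreetmap.dmp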

[attached: graph of I/O utilisation during the restore]

I'm closing this now, as I think we've done enough and it's time to move on to #94.

@pnorman (Collaborator) commented Jul 27, 2016

An individual table can't be restored in parallel, so it's probably being limited by nodes or another large table, just as osm2pgsql is limited by ways.nodes.

@zerebubuth (Collaborator, author)

Yup, sorry, that's what I meant. It's the same for planet-dump-ng, and the pattern of starting with lots of parallelism, then gradually finishing all the tables except for a small number of huge ones, is pretty familiar. Here's the load graph for ironbelly running planet-dump-ng and preparing the tables on 26 Jul:

[attached: ironbelly load graph, 26 Jul]

Do you think it would be worth considering the "directory" dump format? That would allow parallel dumps as an improvement over the custom format, and might allow individual tables to be compressed using a seekable compression for parallel access. Although I'm not sure about that last bit...

@pnorman (Collaborator) commented Jul 27, 2016

It would improve dump parallelism, but restoring from either custom or directory can do tables in parallel and neither can do a single table in parallel.

If I'm doing a dump I normally use the custom format because it's slightly easier to send around, and compression level 9 (-Z 9) if space or bandwidth matters.

If we did move to directory we'd need to make sure not to put too much load on the DB at the start of the backup.

For the one-time backup and restore directory might be better.
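
A sketch of the directory-format round trip being discussed (database name, job counts and paths are illustrative):

# parallel dump; -j requires the directory format
pg_dump -Fd -j 8 -Z 9 -f /backup/openstreetmap.dir openstreetmap

# parallel restore works from either custom or directory format,
# but no single table is restored in parallel in either case
pg_restore -j 20 -d openstreetmap /backup/openstreetmap.dir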

@jburgess777 (Member)

Did we try tuning the postgres effective_io_concurrency? The email below suggests it can make a significant difference for SSDs (or RAID systems):

https://www.postgresql.org/message-id/CAHyXU0yiVvfQAnR9cyH%3DHWh1WbLRsioe%3DmzRJTHwtr%3D2azsTdQ%40mail.gmail.com

The chef files suggest we don't set this on any of the current machines.

@pnorman (Collaborator) commented Sep 18, 2016

https://www.postgresql.org/message-id/20141209204337.GC24488%40momjian.us contains more discussion. The gains in that situation come from readahead and benefit bitmap scans hugely. The message implies sequential scans won't change, since readahead already happens at a different layer.

I can't be sure of the access patterns, but I wouldn't expect a significant change in performance from the effective_io_concurrency GUC. It should still be adjusted, just because it's an easy one to estimate without benchmarks.

@tomhughes (Member)

@pnorman So what are you suggesting we set it to?

@pnorman (Collaborator) commented Sep 18, 2016

The docs state

A good starting point for this setting is the number of separate drives comprising a RAID 0 stripe or RAID 1 mirror being used for the database. (For RAID 5 the parity drive should not be counted.) However, if the database is often busy with multiple queries issued in concurrent sessions, lower values may be sufficient to keep the disk array busy. A value higher than needed to keep the disks busy will only result in extra CPU overhead.

For more exotic systems, such as memory-based storage or a RAID array that is limited by bus bandwidth, the correct value might be the number of I/O paths available. Some experimentation may be needed to find the best value.

The minimum value I see suggested for SSD systems is 32, the maximum is 256. A conservative setting based on data from SATA SSDs would be 64-128. NVMe tends to be more parallel, and some of the guidance would imply that up to 512 is worth it with NVMe, but I'm reluctant to recommend going that high.

@tomhughes (Member)

I've set it to 256 in openstreetmap/chef@c276a3a and we'll see what happens...

Firefishy removed their assignment Sep 22, 2024