Burn-in and benchmark karm #78
Intel's suggested workload for NVMe testing is:
Adjust to have numjobs = number of logical cores and iodepth = 128 / numjobs; then clat percentiles and IOPS are the two most relevant metrics, but keep in mind the reporting is per job. You can also adjust the filename to point at a file on the filesystem instead, which lets you test after applying those settings. If testing relative performance, you need to make sure the drive is consistently conditioned to get comparable results, which involves doing a secure erase, workload-independent preconditioning, and finally testing. Ref: https://communities.intel.com/servlet/JiveServlet/download/38-137191/Lab_NVMe_PCI_SSD.pdf
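The job file itself isn't shown here, so purely as a sketch of the kind of workload being described (the device name, core count, and runtime are assumptions, not Intel's exact parameters):

```sh
# Hypothetical 4KiB random-read burn-in against the raw block device.
# Assumes 16 logical cores, so iodepth = 128 / 16 = 8 per job;
# completion latency (clat) percentiles are reported per job by default.
fio --name=nvme-randread \
    --filename=/dev/nvme0n1 \
    --rw=randread --bs=4k \
    --ioengine=libaio --direct=1 \
    --numjobs=16 --iodepth=8 \
    --runtime=300 --time_based
```

Pointing --filename at a file on the mounted filesystem instead exercises the full stack rather than the raw device.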
That looks like a great way to test and compare the raw block devices, and it would be interesting to see what (hopefully huge) numbers we can pull off the raw disk. There are a few layers of software between the raw disk and Postgres, though, including software RAID and the filesystem, which will probably also impact performance. The last benchmarking done on katla with

Still, I'm hoping that karm can smash through 5K IOPS!
I've started running
Partial results summary; see karm_benchmark.tar.gz for full details. Note that these were not done on the raw disk, but against files pre-allocated on an ext4 filesystem mounted

For separate 4KB random read (

For a

In any case, the 88k read IOPS and 10k write are far more than the 5k IOPS that katla got during

So, although RAID6 parity calculation is a bottleneck, it looks like it's one that's unlikely to seriously constrain us in the near future.
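As a rough illustration of the kind of mixed random read/write job described above (the mount point, file size, and 75/25 read/write mix are assumptions, not the parameters actually used):

```sh
# Hypothetical mixed workload against a pre-allocated file on the ext4 mount.
fio --name=fs-randrw \
    --filename=/mnt/array/fio-testfile --size=100G \
    --rw=randrw --rwmixread=75 --bs=4k \
    --ioengine=libaio --direct=1 \
    --numjobs=16 --iodepth=8 \
    --runtime=300 --time_based
```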
Re-running the RAID10 mixed case resulted in about 15% improved performance, but nowhere near enough to make it competitive with RAID6. I also tried

Unless anyone's got further suggestions to try, I'm inclined to say
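For context, a minimal sketch of how the two layouts being compared might be created with mdadm; the device names and device count are assumptions rather than karm's actual configuration:

```sh
# Hypothetical list of NVMe namespaces; karm's real devices may differ.
DEVICES="/dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 /dev/nvme5n1"

# RAID6: double parity, usable capacity of (n - 2) devices.
mdadm --create /dev/md0 --level=6 --raid-devices=6 $DEVICES

# RAID10 over the same devices: usable capacity of n / 2, no parity maths.
# mdadm --create /dev/md0 --level=10 --raid-devices=6 $DEVICES
```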
PostgreSQL 9.0 High Performance reports that XFS gives faster performance and less variation than ext3, and recommends turning on the

What do we want to check with pgbench? My list would include (a representative pgbench run is sketched below):

Low-level and kernel

Setup

PostgreSQL GUCs

This isn't a list of all the GUCs which have performance implications, just the ones where benchmarking might help.
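Purely as an illustration of the kind of run referred to above (scale factor, client and thread counts, and database name are all assumptions):

```sh
# Initialise a pgbench database at a scale large enough to exceed RAM.
pgbench -i -s 10000 pgbench_test

# 10-minute read/write run: 32 clients over 8 threads,
# with per-statement latency reporting (-r).
pgbench -c 32 -j 8 -T 600 -r pgbench_test
```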
I'm not sure the disks we have on karm really lend themselves to creating multiple arrays, so there may not be much point testing moving pg_xlog unless we're going to put it on the system disk.
Moving it generally means making the main array two disks smaller and using those disks in a RAID1, reducing total capacity. It's a very common thing to do, but I have no clue if it's useful with NVMe or SSDs.
The problem is that if we took two of the NVMes to make a second array, it wouldn't leave enough space on the main array.
I've tested a few values for read-ahead and sync methods; full details are in karm_benchmark.tar.gz. The summary is: read-ahead seems to make no measurable difference, and I'm moving on to testing
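For anyone retracing these tests, a sketch of the knobs involved, assuming the array appears as /dev/md0 (the device name and read-ahead value are assumptions):

```sh
# Inspect and change block-device read-ahead (units of 512-byte sectors).
blockdev --getra /dev/md0
blockdev --setra 4096 /dev/md0

# Compare WAL sync methods (fdatasync, open_datasync, ...) on the target
# filesystem; pg_test_fsync ships with PostgreSQL for exactly this purpose.
cd /path/on/the/array && pg_test_fsync
```

The winning method then goes into postgresql.conf as wal_sync_method.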
I just built the latest smartmontools (which has NVMe support) from source, and I noticed that karm (and dulcy) are reporting the NVMes are formatted with 512-byte LBAs:
and the supported LBA sizes are listed as:
I suspect that 4096-byte LBAs will be more efficient (likely that is what the Rel_Perf column refers to).
According to https://www.reddit.com/r/archlinux/comments/3k3ohz/nmve_ssd_fstab_options/cuvr5uk a "relative performance" of 2 is "good" and 0 is "best". http://www.intel.com/content/www/us/en/support/solid-state-drives/data-center-ssds/000016238.html is Intel's documentation on changing it.
512 bytes will be for compatibility. I think by now ALL tools in Ubuntu 16.04 (hell, 14.04 too) will likely be 4096-byte clean. The Intel isdct is already installed, so likely best to use that.
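The change here was done with Intel's isdct; as a hedged alternative, roughly equivalent steps with nvme-cli would look like the following, assuming the 4096-byte format sits at LBA format index 1 (check the id-ns output first; reformatting destroys all data on the namespace):

```sh
# Show the namespace's supported LBA formats and which one is in use;
# the "Relative Performance" field is the Rel_Perf column mentioned above.
nvme id-ns -H /dev/nvme0n1

# Reformat to 4096-byte LBAs (index 1 is an assumption) without secure erase.
nvme format /dev/nvme0n1 --lbaf=1 --ses=0
```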
PostgreSQL uses 8192 bytes as its default block size, so a 4096-byte LBA aligns nicely: each block is 2 device operations rather than 16.
Yup, it seemed to go without a hitch until this kernel panic. I'm hoping that's because the kernel wasn't expecting an LBA size change online... I've rebooted and it seems okay so far. Rebuilding the array and will post some
With 4K LBAs and a 16KB block size, I'm seeing 160k read IOPS and 18k write, which shouldn't be much of a surprise. The read latency averages 262us and writes 6,470us, compared to reads at 70us and writes at 14,900us for the 512-byte LBAs. Perhaps that's due to being able to split reads for a 16kB block over more drives? It'll be interesting to see if that read latency penalty carries over into the
Results from

If the latency was write-dominated, as I was expecting it to be, then I'd have expected latency to improve, based on the fio results. Something else must be going on, and I don't understand it... Anyone have any ideas?

Full results available in karm_benchmark.tar.gz.
Is it lba4k and newconf that are equivalent except for the LBA size?
I re-ran the

Now bumping up
Changing checkpoint_completion_target from 0.8 to 0.9 made no significant difference. Given that nothing I've changed has made a significant difference, I wonder if the config parameters make much of a difference on NVMe, i.e. the disk isn't the bottleneck any more. Another possibility is that I'm not testing what I think I'm testing... At this point, it seems worthwhile to move on to
Deploy the karm! ;-) I'd stick with the 4k. We are well within our comfort zone in terms of performance. Timing for the DB import would be good, even if we do it again.
This would reflect what I found with a Hetzner NVMe machine, osm2pgsql, and Mapnik rendering. At no point could I saturate the I/O.
With
With
So it looks like it's around 20h to restore the database dump into 9.5. It's possible that could go quicker if we temporarily turned off fsync and increased the max WAL size. However, it's also possible that would have little effect, as the peak I/O saturation was under 30% for the majority of the restore. I think it's possible that there isn't enough parallelism in the custom dump format to saturate either the processors or the I/O system. I'm closing this now, as I think we've done enough and it's time to move on to #94.
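For the record, a sketch of the sort of parallel restore and temporary settings discussed here; the job count, database name, dump filename, and max_wal_size value are assumptions:

```sh
# Parallel restore from a custom-format dump; -j parallelises across tables,
# but a single large table is still handled by one worker.
pg_restore -j 8 -d osm planet.dump

# Optional, for the duration of the restore only (revert afterwards),
# in postgresql.conf:
#   fsync = off
#   max_wal_size = 16GB
```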
An individual table can't be restored in parallel, so it's probably being limited by nodes or another large table, like osm2pgsql is limited by ways.nodes.
Yup, sorry, that's what I meant. It's the same for planet-dump-ng, and the pattern of starting with lots of parallelism and gradually finishing all the tables except for a small number of huge ones is pretty familiar; here's the load graph for ironbelly running planet-dump-ng and preparing the tables on 26 Jul:

Do you think it would be worth considering the "directory" dump format? That would allow parallel dumps as an improvement over the custom format, and might allow individual tables to be compressed using a seekable compression for parallel access. Although I'm not sure about that last bit...
It would improve dump parallelism, but restoring from either custom or directory format can do tables in parallel, and neither can do a single table in parallel. If I'm doing a dump I normally use the custom format because it's slightly easier to send around, with -Z 9 if space or bandwidth matters. If we did move to directory format we'd need to make sure not to put too much load on the DB at the start of the backup. For the one-time backup and restore, directory might be better.
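To make the trade-off concrete, a sketch of the two dump invocations being compared (database name, job count, and output paths are assumptions):

```sh
# Custom format: a single file that is easy to ship; -Z 9 is the maximum
# compression level referred to above.
pg_dump -Fc -Z 9 -f planet.dump osm

# Directory format: one file per table, dumped in parallel with -j.
pg_dump -Fd -j 8 -f planet_dir osm
```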
Did we try tuning the Postgres effective_io_concurrency? The email below suggests it can make a significant difference for SSDs (or RAID systems):

The chef files suggest we don't set this on any of the current machines.
https://www.postgresql.org/message-id/20141209204337.GC24488%40momjian.us contains more discussion. The gains in that situation are from readahead and benefit bitmap scans hugely. The message implies sequential scans won't change since there's already readahead there from a different layer. I can't be sure of the access patterns, but I wouldn't expect a significant change in performance from the
@pnorman So what are you suggesting we set it to?
The docs state
The minimum value I see suggested for SSD systems is 32, and the maximum is 256. A conservative setting based on data from SATA SSDs would be 64-128. NVMe tends to be more parallel, and some of the guidance would imply that up to 512 is worth it with NVMe, but I'm reluctant to recommend going that high.
I've set it to 256 in openstreetmap/chef@c276a3a and we'll see what happens...
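The chef commit applies this through configuration management; a hand-run equivalent, for reference, would be something like the following (effective_io_concurrency takes effect without a restart):

```sh
# Set effective_io_concurrency cluster-wide, reload, and confirm.
psql -c "ALTER SYSTEM SET effective_io_concurrency = 256;"
psql -c "SELECT pg_reload_conf();"
psql -c "SHOW effective_io_concurrency;"
```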
This is the new NVMe server, so there are bound to be some tweaks and work-arounds needed to get the best performance.