Make database operations 30x faster by using SQLite #28

Draft · wants to merge 1 commit into master
Conversation

@kiler129 (Contributor) commented Sep 7, 2023

Preface

Rebalancing is a lengthy process, so ensuring it can be resumed is rather important. Currently, the code implements
a simple database-like text file. When the script is interrupted and subsequently restarted, the rebalancing process
resumes more or less from where it was interrupted.
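For context, a lookup in that text file works roughly like this (an illustrative sketch, not the exact code; the helper name is an assumption):

```bash
# Illustrative sketch: every lookup spawns grep/head/cut/awk and
# re-reads the whole database file, so cost grows with entry count.
get_rebalance_count() {
    local file_path="$1"
    local line_nr
    line_nr=$(grep -xF -n "$file_path" rebalance_db.txt | head -n 1 | cut -d: -f1)
    if [ -z "$line_nr" ]; then
        echo "0"
    else
        # assumes the pass count sits on the line right after the path
        awk "NR == $((line_nr + 1))" rebalance_db.txt
    fi
}
```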

The problem

The current solution is great conceptually. However, the implementation suffers from very poor performance due to multiple
factors:

  • spawning multiple processes per file carries a non-trivial cost
  • re-reading a good chunk of the database file generates a lot of I/O, even if served from memory
  • the data structure, due to its simplicity and lack of indexing, does not scale well with a large number of files

The result is very poor performance both during rebalancing and when the script needs to be resumed. In the case of
my dataset with ~1M files, resuming from the ~500k mark took close to 12 hours(!). The performance problems are magnified
when many small files are present.

Solutions evaluation

The first optimization I attempted was limiting re-reads for counting. However, this still proved very limiting. The
next idea I tested was using bash associative arrays, implementing an in-memory cache persisted occasionally to
disk. Unfortunately, with 500k+ files the memory usage started to creep into the 500-800 MB range, with more and more
apparently being leaked. This shows clearly that bash arrays were never designed for such a dataset.

I scrapped the idea and went back to the drawing board. After ad-hoc testing, I implemented a prototype based on an SQLite
database, likewise stored in a dedicated file. The time for read/save operations was cut by orders of magnitude,
despite still launching at least 2 processes for every file. As a next step I implemented a delayed database flush with a
60-second timeout, further cutting the time spent on db operations.
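The delayed flush follows roughly this pattern (a minimal sketch, not the exact PR code; the table name, schema, and function names are assumptions, path is assumed to be the primary key, and the upsert syntax needs sqlite3 >= 3.24):

```bash
FLUSH_INTERVAL=60
sql_buffer=""
last_flush=$(date +%s)

flush_db() {
    if [ -n "$sql_buffer" ]; then
        # one sqlite3 process and one transaction for the whole batch
        printf 'BEGIN;\n%s\nCOMMIT;\n' "$sql_buffer" | sqlite3 rebalance_db.sqlite3
        sql_buffer=""
        last_flush=$(date +%s)
    fi
}

queue_update() {
    local file_path="$1"
    # double single quotes so the path is safe inside an SQL string literal
    local escaped="${file_path//\'/\'\'}"
    sql_buffer+="INSERT INTO rebalance (path, count) VALUES ('${escaped}', 1) ON CONFLICT(path) DO UPDATE SET count = count + 1;"$'\n'
    # flush at most once per interval; a crash loses <= 60s of progress
    if (( $(date +%s) - last_flush >= FLUSH_INTERVAL )); then
        flush_db
    fi
}
```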

Potential impact for users

The change has some impact on systems utilizing this project.

  • The SQLite database is a binary blob that is ~30-40% larger than the text file
    • while the database is no longer readable as a text file, the sqlite3 tool offers an easy interface (see the example after this list)
    • README.md contains a simulated transaction, so that unfamiliar users can easily interact with the db
  • The usage of SQLite introduces a new dependency on sqlite3
    • sqlite3 is already present on systems that are the most likely targets of this script
    • I verified the existence of a compatible sqlite3 binary on TrueNAS SCALE, TrueNAS CORE, macOS 13, as well as Ubuntu
    • Debian doesn't ship with sqlite3 by default, but the package is in the default repo
  • With the delayed flush of the database to disk, there is a potential for rebalance state to be lost
    • at most 60 seconds of progress can be lost when interrupted
    • re-balancing a file more times than requested doesn't carry a risk, thus I see this as a non-issue
  • Previously built databases are no longer usable
    • this PR includes convert-legacy-db.sh that handles the migration
    • when started in the presence of the old database, the script will abort to ensure balancing doesn't unintentionally
      start from scratch
    • the documentation has been updated to cover this
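For instance, assuming a hypothetical table with path and count columns, checking progress from the shell is a one-liner:

```bash
# table/column names are illustrative; inspect the real schema first:
sqlite3 rebalance_db.sqlite3 '.schema'
sqlite3 rebalance_db.sqlite3 'SELECT "count", COUNT(*) AS files FROM rebalance GROUP BY "count";'
```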

Testing methodology

  • I used my preexisting database file, in which most of the files will be hit
    • the database starting point is equal for all scripts
    • the new version gets cp rebalance_db.txt_copy rebalance_db.txt ; ./convert-legacy-db.sh (this takes ~35 seconds)
    • the old version gets cp rebalance_db.txt_copy rebalance_db.txt
    • in other words, both versions operate on the same logical set of data
  • to benchmark pure database operations, the actual cp/mv/rm calls were commented out
  • the pool uses HDDs with SSDs for metadata and contains ~500k files
  • command run: time ./zfs-inplace-rebalancing.sh --checksum false --passes 2 /mnt/testdata > rebalance.log

Test results

  • current .txt-based database: 29h 14min
  • SQLite database with per-file flush: 1h 32min
  • SQLite database with delayed flush every ~60 seconds: 56min

In other words, it's almost a 30x speed-up :)
There is some more low-hanging fruit for optimization, but I'm tackling one thing at a time.

WDYT?

...about this PR and the idea itself? Marking as draft for now, as I haven't had a chance to test on FreeBSD
yet (and there are minor bugs with special characters ;)).

@markusressel (Owner)

Hey @kiler129 ! 🤓
Thx for the really in-depth analysis and work you put into this 😲 ❤️

I have been thinking about this for a bit and it really looks like a great improvement to the script. I have two thoughts we may want to discuss:

1. Should we keep both the old/current as well as the new script?

Since this script is a pretty big change to "how things work", as well as to what a potential user has to know to be able to work with it (reading a txt file vs an sqlite db, etc.), I am wondering if we should keep the old script available for people with a small use case, and add this version for bigger use cases. I think maintaining both versions shouldn't be that big of a deal (at least if you can keep an eye on it as well 😅).

2. Should we consider "moving" to a "real" programming language to implement this script?

Looking at how this version works, I am wondering if a shift to a "real" programming language that works on most systems, like f.ex. golang (for performance) or python (for convenience and accessibility) would be the "better" path for future optimization? The single reason I initially used Bash to implement this script is because it was a one-time operation for me, which I will probably never do again. Bash is hard though, especially optimizing, debugging and testing it. Utilizing a programming language, accessing a database, as well as providing the CLI tools to work with it would be much easier.

Note: Following this route does not necessitate "trashing" this PR, it might simply be a third version, or an evolution of this PR.

@kiler129 (Contributor, Author) commented Sep 10, 2023

Hey @kiler129 ! 🤓 Thx for the really in-depth analysis and work you put into this 😲 ❤️

Don't we all start with "it will be just a small fix"? ...and that's just the beginning** 😄

[image: "automation"]

1. Should we keep both the old/current as well as the new script?

I was battling with this as well: should we make this a parameter... or maybe support both based on the existence of the current "resume" file? What about the default? (should we feature-check or....)

This is exactly why I put this PR up as a draft: to see what the best compromise is. I think, if the intention is to keep the current database, a smart way would be to:

  1. Default to SQLite so new users can benefit from the speed-up
  • Feature-check for sqlite3 and behave the git way if missing, i.e. tell the user that sqlite3 isn't available and suggest that it should be installed. In addition, provide an opt-in to force the text file with a warning about the speed drawback (esp. problematic with large pools), but I wouldn't put that in the error message, as ideally you don't want people using txt if they can use sqlite3 ;) A rough sketch of this check follows the list.
    • Maybe here we can also add a small scream-test: deprecation notice + exit 2, but actually use the txt database. This will let admins know that the option will be removed in the future, but still works for now.
  • The user should be able to opt in to the textual database, even if sqlite3 is available (but maybe it's broken/substandard/etc)
  • I believe this can be solved with an option --db=[sqlite,txt]. When the option is present, it will force the script to fail if e.g. sqlite3 isn't available but --db=sqlite was passed.
  2. Handle old databases gracefully
  • If an old database is detected, put up a clear/big warning "hey, it's suboptimal but I will continue with the old method". In essence this will behave like passing --db=txt, but it will also suggest the db conversion, or just running with --db=sqlite / renaming the old db to start from scratch
  • If an old db is present and --db=sqlite is specified, continue using sqlite3 but also add a warning "hey, your old DB will be outdated, rename it to avoid confusion"
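For illustration, the feature-check could look something like this (a rough sketch; the --db option name and the messages are proposals, not final):

```bash
DB_BACKEND=""   # would be set by parsing --db=[sqlite,txt]

if [ -z "$DB_BACKEND" ]; then
    # no explicit choice: default to sqlite, behave the git way if missing
    if command -v sqlite3 > /dev/null 2>&1; then
        DB_BACKEND="sqlite"
    else
        echo "sqlite3 not found; install it, or pass --db=txt to force the (slow) text database" >&2
        exit 1
    fi
elif [ "$DB_BACKEND" = "sqlite" ] && ! command -v sqlite3 > /dev/null 2>&1; then
    # an explicit --db=sqlite must hard-fail when the binary is absent
    echo "--db=sqlite was requested but sqlite3 is not available" >&2
    exit 1
fi
```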

2. Should we consider "moving" to a "real" programming language to implement this script?

The good: ease of maintenance
My software engineer soul screams "yes!"... and I even started prototyping it from scratch. First I started with PHP, due to its very vast SPL (as it should be a single file, not a dependency-hell) and its good select() and process-control support. Then I realized that while our servers do have PHP, storage boxes usually don't, and started considering other possibilities.

The bad: portability
....and then the realization was sad. Bash is horrible, but portable-ish. Perl would arguably be an even worse choice, given its write-once read-never nature. While Python is available on most systems, by design its standard library is very slim for this purpose, and pip isn't an option here. An additional complexity is the shipped versions, which in the case of Python actually matter quite a bit, as I learned in my last project. In addition, Python is painfully slow at many non-obvious things (e.g. some mass-string operations that take 1-2s in slow PHP take over an HOUR in Python 3, and select() is performance-fragile).

However, I started digging deeper in the hope that we can MAYBE at least justify this by not exec-ing cp or mv or other CLI tools, but actually using native functionality. However, it seems like nobody wants to deal with the complexity of VFS et al. in a general programming language: see the description of shutil.copy and the sad red box at the top of the same page.

The ugly: the real world has no alternatives
Sadly, it seems like there are no other universal candidates. I would love to use anything else: Go, Rust, Swift... you name it. But it seems like Bash is the best compromise here, especially when working with appliances like TrueNAS that prevent users from installing any software. This is also why I maintain this abomination of shell scripts for monitoring, even though doing it in a normal programming language would've been orders of magnitude easier.


Off-topic:
** An issue is awaiting; I need to do much more testing, but tl;dr: the code works great if you can ensure nothing is writing to the filesystem. When you cannot, things get messy. Even with the md5 check, it's not safe to run while processes can potentially write to/have open files. At this point there are 3 software engineers/Linux admins poking at how to do it safely ;) If our ideas work in practice on a test 20TB pool with many apps trying to grab files, and we don't destroy any files, the costly checksum can probably be avoided completely + the code could guarantee data safety.

@markusressel (Owner) commented Sep 10, 2023

I was battling with this as well: should we make this a parameter... or maybe support both based on the existence of the current "resume" file? What about the default? (should we feature-check or....)

I really don't want to over-engineer this. The chance of somebody using the initial script and then wanting to migrate to the SQLite script is minuscule. The fact that a migration is already possible with your PR is more than good enough. I would simply propose having two separate script files in this repo, one with a text file as the DB, and one with SQLite. That way a user can decide for themselves which version is right for them, and both versions are completely independent.

Also, if you intend to further refine this script it might actually be better to create your own repository, f.ex. as a fork. I really don't want to make this script any more complex than it already is. If it needs to support more use cases, I would rather switch to a language like golang and provide cross-platform executables. For the change of using sqlite instead of a text file, I would be fine with having a separate script in this repo. But if you intend to support running this script on changing data (which the README currently explicitly states is not supported, not even data that is actively read), I would not want to support that in a simple bash script, because I want to actually feel comfortable maintaining it and ensuring it works as intended 😅 I am mainly an Android programmer, not a sysadmin, so my Linux, ZFS, and Bash expertise is quite limited.

However, I started digging deeper in the hope that we can MAYBE at least justify this by not exec-ing cp or mv or other CLI tools, but actually using native functionality. However, it seems like nobody wants to deal with the complexity of VFS et al. in a general programming language: see the description of shutil.copy and the sad red box at the top of the same page.

Even using a golang program for most of it, and executing a bash command from within for things that are not supported natively, would be preferable to me. I also have to say that, as I mentioned initially, I don't really have a use case for this right now, and probably not for the foreseeable future, meaning it's not likely I will invest a lot of time into this.

@feld commented Oct 28, 2023

This breaks with filenames that have apostrophes, but I much prefer this approach to the text file

@Toys0125 commented Feb 5, 2024

This breaks with filenames that have apostrophes, but I much prefer this approach to the text file

So one way to fix it is by doing '${1//\'/\'\'}', which replaces every single apostrophe with two; SQLite parses a doubled single quote inside a string literal as a literal apostrophe.
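In context it would look like this (a sketch; the database file and table/column names are assumptions):

```bash
file_path="it's a test.txt"
# double every single quote: "it's" -> "it''s" (the SQL string-literal escape)
escaped="${file_path//\'/\'\'}"
sqlite3 rebalance_db.sqlite3 "SELECT count FROM rebalance WHERE path = '${escaped}';"
```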

Also, another bug I noticed: since the rebalance function is called in its own scope (the last segment of a pipeline runs in a subshell), the rebalance_cache_db string will be empty when the final flush is called outside the function loop, if you finish before the interval. One way to avoid it is to use lastpipe, according to the FAQ: https://mywiki.wooledge.org/BashFAQ/024
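A minimal demonstration of the difference (assuming bash >= 4.2 and a non-interactive shell, where job control is off):

```bash
#!/usr/bin/env bash
shopt -s lastpipe   # run the last pipeline segment in the current shell

cache=""
printf '%s\n' a b c | while read -r line; do
    cache+="$line"
done
# without lastpipe the loop runs in a subshell and $cache would be empty here
echo "cache=$cache"   # prints: cache=abc
```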

@Toys0125 commented Feb 5, 2024

I also highly recommend using the Go language; I already had to reach for it because the problem I was having was long wait times for finding all the file counts on my system. I implemented this using locar to get my file count faster, since I have the ZFS system on a mirror array.

@Miladiir (Contributor)

Is it really a good idea to use SQLite for this? Isn't SQLite well known to currently behave in a weird way on ZFS?
