Improve migration performance #3
I think a lot of this can be parallelised. The creation of users can run in parallel, as well as rooms and non-threaded, non-status events (messages; other events might cause problems). I haven't used a multi-process library for quite some time, but even a bit more (still limited) async within a single process might change a lot, as I suspect that the HTTP requests to Synapse are the bottleneck for the most part.
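To make the "limited async in a single process" idea concrete, here is a minimal sketch of a concurrency-limited mapper; the function name and the limit are illustrative, not part of the migration tool's actual code:

```typescript
// Hypothetical sketch: run async jobs (e.g. HTTP requests to Synapse)
// with at most `limit` in flight at once, in a single process.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  // Spawn `limit` runners that each pull the next index until the list is drained.
  const runners = Array.from({ length: Math.min(limit, items.length) }, async () => {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i]);
    }
  });
  await Promise.all(runners);
  return results;
}
```

If the bottleneck really is HTTP latency rather than CPU, a limiter like this can raise throughput without any multi-process machinery.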
Do you have any test data (maybe a dump of a server that has been used a lot for testing) I could use to measure performance? Or maybe some tips on how to write a script that generates test data?
I am not a programmer, so take my opinion with a grain of salt :) Maybe I am overlooking something, but shouldn't it be possible to just split the user data into roughly equal parts (~1/n) when exporting the DB and then run the script with the magic of GNU parallel (https://www.gnu.org/software/parallel/)?
As Rocket.Chat wants to force me to do things I don't want when setting up a test instance, I don't have one and rely on data from a real instance. I'm currently working on some test data, though it wouldn't be enough to test performance.
Unfortunately, it's not that easy because of some cross-references: all users and rooms must exist before the messages are created, so each process would have to wait for each step to finish before proceeding (something like a semaphore). Most importantly, we don't yet know where the bottleneck is. Handling entities of the same type more concurrently would probably reduce the overall runtime, but if the HTTP requests to Synapse are the slow part, such a change would be useless.
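The phased approach described above can be sketched as follows; the migrator callbacks are placeholder names, not the tool's real API:

```typescript
// Sketch: each entity type is migrated concurrently within its phase, but a
// phase only starts once the previous one has fully completed, acting as a
// barrier between steps (all users exist before rooms, rooms before messages).
type Migrator<T> = (item: T) => Promise<void>;

async function migrateInPhases<U, R, M>(
  users: U[], rooms: R[], messages: M[],
  migrateUser: Migrator<U>, migrateRoom: Migrator<R>, migrateMessage: Migrator<M>
): Promise<void> {
  await Promise.all(users.map(migrateUser));       // phase 1: users
  await Promise.all(rooms.map(migrateRoom));       // phase 2: rooms reference users
  await Promise.all(messages.map(migrateMessage)); // phase 3: messages reference both
}
```

This keeps the cross-reference ordering intact while still allowing parallelism inside each phase.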
We started working on a script that generates test data. We don't yet generate a proper random timestamp, but that would be an important next step. We tried running the script to understand what the bottleneck might be; for now we have more questions than answers, so we will keep investigating, but we wanted to share this work-in-progress script.
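For the missing random-timestamp piece mentioned above, a minimal sketch could look like this (function name and range are illustrative assumptions):

```typescript
// Hypothetical sketch: pick a random but plausible timestamp (milliseconds
// since epoch) inside a given range, so generated test messages can later
// be sorted chronologically like real ones.
function randomTimestamp(startMs: number, endMs: number): number {
  return Math.floor(startMs + Math.random() * (endMs - startMs));
}
```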
Hi everyone,
I had never worked with npm before this project, so I don't know whether it can have an impact on performance.
Sounds great! Could we help you test this in any way?
Great stuff I'm reading here! @chagai95 That script really looks helpful for testing performance. For edge cases we can use the manually added test data (I mentioned commit f99771b adding it somewhere), so you don't need to worry about that in the script. Have you considered using Faker to generate random data resembling the real thing, instead of random strings? It provides a lot of providers to generate names, texts, dates and much more, and it can even wrap the result in JSON conveniently. @grvn-ht, you said:
I think that's a sane approach; the number of messages should be orders of magnitude larger than the rest (my assumption for instances so big that they need higher performance). I could help with some of that. The question I haven't spent much time on yet is whether there is a convenient and performant way to parallelise within the app that would make it unnecessary to adapt the DB. Nonetheless, the storage adapter/TypeORM would allow us to use another DB for parallelised access. Out of curiosity, I'll ask you all naively: why do you need a significantly lower execution time for such a one-off migration? Can you name any numbers yet? Are you intending to migrate different chats regularly? What's the story?
Why do we need more performance? We have to migrate at least two large RC instances with 10k+ users each. We simply need to avoid a downtime of weeks; a maximum of maybe 3 days (a weekend in summer) is acceptable for running the migration script, unless we find some mechanism like rsync's that allows re-syncing only what has changed. If that were possible, we could stop registering new users in RC, start syncing, and after a week (or however long it takes) fetch only the newest messages again to have them all in Matrix.
You're right, I only need to do it twice, once for a small RC server: 500 users, 3000 rooms, 120,000 messages. Like @rasos, I would like to be able to run the migration script over a weekend to avoid synchronisation issues between RC and Matrix. But it's true that I could also do the main migration and then migrate another time with fresh data on the same db.sqlite. I still found it interesting to try to improve performance ^^
4 messages per second is pretty slow indeed, I understand the need. My current approach is to let the script run multiple times (on the same DB, with different inputs). This works mostly fine as far as I can tell, but it doesn't detect any changes to already processed (and thus mapped) entities. So I'm not entirely happy with this solution, which is more a crash-resistant design than a performance fix.
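The re-runnable, crash-resistant design described above boils down to a skip-if-already-mapped check. A minimal sketch, with illustrative names (the real tool stores its mapping in the DB, not an in-memory map):

```typescript
// Sketch: consult a mapping of already-migrated entity IDs before doing any
// work, so a re-run on the same DB skips what was already processed. As noted
// above, this deliberately does NOT detect changes to already-mapped entities.
async function migrateOnce(
  mapping: Map<string, string>,
  sourceId: string,
  migrate: (id: string) => Promise<string>
): Promise<string> {
  const existing = mapping.get(sourceId);
  if (existing !== undefined) return existing; // already migrated: skip
  const targetId = await migrate(sourceId);
  mapping.set(sourceId, targetId);
  return targetId;
}
```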
I experimented a bit with handling multiple rooms and messages concurrently, but I can't check the results for correctness for lack of tests. I suspect that it misses most threaded messages.
Now I've implemented concurrency with the aforementioned PR (or commits b48400a and b48400a), after testing it with end-to-end tests and fixing some bugs. I would really appreciate your feedback if you could test it. I didn't see any problems testing with our database, but maybe I missed something ;-)
Oh my, of course there had to be a problem with my simple approach. Thank you for reporting this! I interpret this as the queue getting too long. My first thought would be to read the file and enqueue the handling jobs depending on the queue's length. Or maybe another approach which uses more CPU cores 🤷
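The "enqueue depending on the queue" idea is essentially backpressure: stop pulling new lines from the input while too many jobs are in flight. A rough sketch, with illustrative names:

```typescript
// Sketch: instead of enqueueing the whole input file at once, pause the
// producer whenever the in-flight job count reaches `maxInFlight`, resuming
// as soon as one job settles. `source` is any async iterator of work items.
async function enqueueWithBackpressure<T>(
  source: AsyncIterable<T>,
  handle: (item: T) => Promise<void>,
  maxInFlight: number
): Promise<void> {
  const inFlight = new Set<Promise<void>>();
  for await (const item of source) {
    const p: Promise<void> = handle(item).finally(() => inFlight.delete(p));
    inFlight.add(p);
    if (inFlight.size >= maxInFlight) {
      await Promise.race(inFlight); // block until at least one job finishes
    }
  }
  await Promise.all(inFlight); // drain the tail
}
```

This keeps memory bounded by `maxInFlight` pending jobs rather than by the size of the input file.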
As raised in #9 (comment), concurrency in message handling messes with the ordering, as explained in the WARNING of https://spec.matrix.org/v1.11/application-service-api/#timestamp-massaging.
Problems with the message order are also raised in #22. |
For now I reverted the concurrency in 698062c to allow a functioning migration.
I would also create a queue for each room, handling its messages sequentially while handling the rooms in parallel, as you mention. Maybe with a different library like Promise Pool.
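The per-room queue idea can be sketched without any extra dependency by keeping one promise chain per room; the class and names here are illustrative, not the tool's actual code:

```typescript
// Sketch: messages for the same room are chained sequentially to preserve
// their order (per the timestamp-massaging WARNING), while different rooms
// make progress in parallel.
class PerRoomQueues {
  private tails = new Map<string, Promise<void>>();

  enqueue(roomId: string, job: () => Promise<void>): Promise<void> {
    const tail = this.tails.get(roomId) ?? Promise.resolve();
    const next = tail.then(job); // runs after this room's previous message
    // Keep the chain alive even if a job rejects, so later messages still run.
    this.tails.set(roomId, next.catch(() => {}));
    return next;
  }
}
```

A dedicated pool library like Promise Pool adds a cap on how many rooms run at once, which this bare sketch does not.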
Indeed, I missed that!
I can try to tackle that tomorrow, but I'm no TypeScript expert!
I implemented the sequential-messages-per-room concurrency in #31.
Currently, the migration into the Matrix server runs as a single process because the data needs to be imported in chronological order. We should investigate how performance can be increased with multi-threading or multi-processing.