Update recalc logic to be resumable #1005
Open
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Here are some notes that should help understand these changes.
Reindexing issues
It is incorrect to use these caches: cache_qcount, cache_acount and cache_ccount to count the amount of posts that need to be reindexed. This is because those caches count the amount of individual posts that are in NORMAL (not HIDDEN and not QUEUED) status. The thing is that a question could be hidden and its answers not. This would lead to the question not being counted in the cache_qcount but their answers being counted in the cache_acount.
I think whether this is an expected behavior or not, falls more on the requirements definition side rather than the bug-fixing side. The real issue is that when it comes to indexing status, a hidden question is not indexed and their children are not indexed either. So using the sum of all cache_[qac]count does not really match the amount of indexed posts.
In order to get the amount of indexed posts, we should use a similar filter to the one it is being used by
qa_db_posts_get_for_reindexing()
. My revised version took 1-2 seconds to run in a server with almost 3 million posts so I'd say it is fair in the context of the whole reindexing process.General counting issues
There were a few processes, in addition to the one mentioned above, that were not properly counting the amount of work to be done. I added some additional code to take care of this that is backwards compatible. EG: you can check the changes made to
qa_db_count_posts()
and how it is being called from theStepRefillEvents
class.Regarding changes in
qa_db_users_get_for_recalc_points()
The way of getting batches of elements is based on numeric IDs. This means, if the last processed ID in a batch was 105, the next batch would start searching in 106 (regardless of whether that ID exists). So fetching the next processable ID means just adding an integer.
The problem with user IDs is that they can be strings. So incrementing in one unit just won't work. This was solved by fetching one more record in the batch and starting from there. However, this could have been solved differently: instead of tracking the next processable ID and include it in the next batch, just track the last processed ID and exclude it from the next batch. In pseudo code, this means instead of doing
WHERE id >= next_processable_id
just doWHERE id > last_processed_id
.This is not relevant with numeric IDs. However, it is for string IDs. This is because it would avoid fetching an extra record in every batch. It is not a big deal, but the
qa_db_users_get_for_recalc_points()
is constantly fetching an extra record, and even forces the call to the function to have an increased$count
, which results in subqueries to fetch additional records that, when added with UNIONs, might result in even more records that will be fetched unnecessarily. So the code of this function (and the call to it) was rewritten in this new way which should save some unnecessary record fetches.Changes to how blobs are managed
Apparently, there was a bug when incrementing blob ids, which is a step performed when moving blogs to disk or DB. These IDs are BIGINTs in MySQL, which are unsigned 64-bit values. PHP integers are signed. So, even in 64-bit systems, it is not possible to reference half of the possible values and, incrementing the blob id in 1, could trigger an overflow.
So this one was another candidate to be restructured to use the
last_processed_id
logic as there is no need to increment anything in PHP. And it only requires a minor change toqa_db_get_next_blob_in_db()
.Language changes
Yes, there were a lot. I standardized many of the strings to a format that includes the name of the process. I kept the deprecated ones, anyway.
No-JS versions
I've learned there is a qa-include/pages/admin/admin-recalc.php file. Apparently, it is a sort of controller of
admin/recalc?dorecountposts=1
. It is intended to be a no-js version for recounting posts. The recalc process is not completely integrated to working in this way.Considering even crawlers in 2024 are using JS by default, I think there is no point in keeping this page alive and even having to maintain the code for browsers with JS disabled. Note there are many other features that rely on JS to be enabled (e.g. voting).
I'm not removing no-JS support from Q2A, but rather just removing it from the recalc process.
Cache trimming/clearing
When fetching the last page of the elements, there was a bug which allowed for more elements to be fetched. Not a big deal, but I fixed it.
I also changed the language from "deleted" to "processed" as not all elements will be deleted (specially for the process that deletes only expired items).
Why this refactor and not the one in 1.9
First of all, this is not (just) a refactor, but rather adding features and fixing bugs. This is mainly adding the ability to resume processes that might take one or two days to finish in big sites and they always fail, for whatever reason, without a way to resume them.
Secondly, the refactor in 1.9, even though it is quite OOP, still takes care of the state as a string, which ends up being the heart of the whole recalc solution. I've removed the state as such, and kept track of it server-side and with the appropriate data types. Also, in 1.9, even though there are OOP classes that handle most of the logic of the steps, there are also huge case statements (
question2answer/qa-src/Recalc/State.php
Line 149 in fa86105
question2answer/qa-src/Recalc/State.php
Line 139 in fa86105
Lastly, the process with steps architecture I suggest, I think it reflects real life better: there exists processes, with easy-to-configure steps, and each step has a setup phase to prepare and provide information of what's going to happen, and then the execution phase, in which it executes as many times as needed until it finishes.
In conclusion, I think this approach is better than the original recalc approach and also the "mechanical" refactor that currently exists in 1.9
Random comments