Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[IMP] snippets: move all work from parent to mp workers #137

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Commits on Nov 14, 2024

  1. [IMP] snippets: move all work from parent to mp workers

    In `convert_html_columns()`, we select 100MiB worth of DB tuples and pass them
    to a ProcessPoolExecutor together with a converter callable. So far, the
    converter returns all tuples, changed or unchanged together with the
    information if it has changed something. All this is returned through IPC to
    the parent process. In the parent process, the caller only acts on the changed
    tuples, though, the rest is ignored. In any scenario I've seen, only a small
    proportion of the input tuples is actually changed, meaning that a large
    proportion is returned through IPC unnecessarily.
    
    What makes it worse is that processing of the converted results in the parent
    process is often slower than the conversion, leading to two effects:
    1) The results of all workers sit in the parent process's memory, possibly
       leading to MemoryError (upg-2021031)
    2) The parallel processing is being serialized on the feedback, defeating a
       large part of the intended performance gains
    
    To improve this, this commit
    - moves all work into the workers, meaning not just the conversion filter, but
      also the DB query as well as the DB updates.
      - by doing so reduces the amount of data passed by IPC to just the query
        texts
      - by doing so distributes the data held in memory to all worker processes
    - reduces the chunk size by one order of magnitude, which means
      - a lot less memory used at a time
      - a lot better distribution of "to-be-changed" rows when these rows are
        clustered in the table
    
    All in all, in my test case, this
    - reduces maximum process size in memory to 300MiB for all processes compared
      to formerly >2GiB (and MemoryError) in the parent process
    - reduces runtime from 17 minutes to less than 2 minutes
    cawo-odoo committed Nov 14, 2024
    Configuration menu
    Copy the full SHA
    fe9b2c8 View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    3a916ac View commit details
    Browse the repository at this point in the history