
Resuming failed imports/updates #1751

Open
joto opened this issue Aug 28, 2022 · 1 comment
Comments

@joto
Collaborator

joto commented Aug 28, 2022

Imports with osm2pgsql can take quite some time, and if something breaks you have to start from scratch. Some kind of "resume" feature is requested often, and we should look into what's needed for that, but it will not be easy to support.

There are several reasons why an import can fail:

  • Bug in osm2pgsql. Not much we can do about that. If there is a bug, we don't know the state of the program, and any kind of resume would risk data corruption.
  • PostgreSQL database problem or the connection is lost. Currently we cannot detect this. This is certainly something where we could do better and try reconnecting, but there remain the questions of how long we wait for the database to reappear before we give up, and how we find out how much data already made it into the database.
  • We are running out of memory. This is probably the case users see most often. There is not much we can do here, because we cannot detect the out-of-memory condition in the first place. And if we restart the process, we'll probably need that memory again and the same failure will happen.
  • Crashes unrelated to osm2pgsql (power failure, kernel crash, etc.). Unlikely, but these things can happen. It will not be easy to find out where we are in the process.

A resuming function will need these things:

  • Define checkpoints where we can resume. Something like "every 10,000 objects imported" or "objects imported, building of index x pending", etc.
  • Checkpoints have to be committed to disk (probably in the database).
  • In case of crash, the next run of osm2pgsql has to detect the failed import and allow resuming.
  • We need a sensible user interface. Is the resume automatic, or does the user have to request it? How do we make sure we are using the same input data and the same command line options, so we are not creating a mess with half the import using different settings than the other half? For this we'll probably need to store information about the import data and the command line options/config used in the database. This is something we'd like to have for other use cases, too (updates could re-use the options from their corresponding import).
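The requirements above can be sketched in a few lines. This is a minimal illustration, not osm2pgsql code: the checkpoint record, its fields, and the fingerprint-matching rule are all hypothetical, and a dict stands in for the database table where a real implementation would commit this state.

```python
import hashlib
import json

def make_checkpoint(phase, objects_done, input_file, options):
    """Build a checkpoint row: progress plus a fingerprint of the input
    data and command line options, so a later run can verify that it is
    resuming with the same settings. All fields are illustrative."""
    fingerprint = hashlib.sha256(
        json.dumps({"input": input_file, "options": options},
                   sort_keys=True).encode()
    ).hexdigest()
    return {"phase": phase, "objects_done": objects_done,
            "fingerprint": fingerprint}

def can_resume(stored, input_file, options):
    """A new run may resume only if the stored fingerprint matches the
    current input file and options; otherwise we would mix settings."""
    current = make_checkpoint(stored["phase"], stored["objects_done"],
                              input_file, options)
    return stored["fingerprint"] == current["fingerprint"]

db = {}  # stands in for a checkpoint table committed to disk
db["state"] = make_checkpoint("importing", 10_000,
                              "planet.osm.pbf", ["--slim", "-C", "2048"])

print(can_resume(db["state"], "planet.osm.pbf", ["--slim", "-C", "2048"]))  # True
print(can_resume(db["state"], "planet.osm.pbf", ["--slim"]))                # False
```

Storing the fingerprint alongside the progress also covers the "updates could re-use the options from their corresponding import" point: the same stored record answers both questions.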

And we'll need to have solutions for all the details:

  • For failed database connections we need (configurable?) timeouts and retries.
  • How do we handle repeated problems (crashing every time because we don't have enough memory, or the database is unavailable) without running in an endless loop?
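The two points above pull in opposite directions: retry on transient failures, but with a bounded budget so a permanent problem does not loop forever. A sketch of that shape, with illustrative parameters that are not actual osm2pgsql options:

```python
import time

def with_retries(operation, max_attempts=5, base_delay=0.01):
    """Retry a failing operation with exponential backoff, but give up
    after max_attempts so repeated failures (database permanently down,
    out of memory on every run) do not become an endless loop."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # retry budget exhausted: fail instead of looping
            time.sleep(delay)
            delay *= 2  # back off before the next attempt

# Simulated connection that recovers on the third attempt.
attempts = {"n": 0}
def flaky_connect():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("connection lost")
    return "connected"

print(with_retries(flaky_connect))  # connected
```

Both `max_attempts` and the backoff base would presumably need to be configurable, as the first bullet suggests.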
@pnorman
Collaborator

pnorman commented Aug 28, 2022

For processing, it seems we would need to move to importing in-transaction, then committing every N objects, including updating the state in some in-db table. This would rule out parallel connections.

The low-hanging fruit here is crashes in the table-optimizing stage. If we make it so the earlier phases import into table_temp-named tables, then we can re-run the SQL commands safely.
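The commit-every-N-objects idea hinges on updating the data and the progress state in the same transaction, so the two can never disagree after a crash. A sketch of that invariant, using a stand-in class rather than a real PostgreSQL connection:

```python
class FakeDB:
    """Stand-in for a PostgreSQL connection: pending changes become
    visible only on commit, and data and checkpoint commit atomically."""
    def __init__(self):
        self.rows = []           # committed objects
        self.checkpoint = 0      # committed progress (in-db state table)
        self._pending = []
        self._pending_checkpoint = 0

    def insert(self, obj):
        self._pending.append(obj)

    def set_checkpoint(self, n):
        self._pending_checkpoint = n

    def commit(self):
        self.rows.extend(self._pending)
        self.checkpoint = self._pending_checkpoint
        self._pending = []

def import_objects(db, objects, batch_size):
    """Import objects, committing data + checkpoint together every
    batch_size objects; skip anything a previous interrupted run
    already committed."""
    done = db.checkpoint
    for i, obj in enumerate(objects):
        if i < done:
            continue  # already imported before the crash
        db.insert(obj)
        if (i + 1) % batch_size == 0:
            db.set_checkpoint(i + 1)
            db.commit()
    db.set_checkpoint(len(objects))
    db.commit()

db = FakeDB()
import_objects(db, list(range(10)), batch_size=4)
print(db.checkpoint)  # 10
```

Because each commit carries both the batch and its checkpoint, a crash between commits loses at most one batch of work, and a restart resumes from `db.checkpoint` — which is exactly why this scheme rules out parallel connections writing the same tables.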
