Replies: 5 comments 1 reply
-
Just for the record: according to DBeaver, the resulting physical size of the 'planet-osm-admin' table is only 1.9 GB, so way less than the > 40 GB jump in memory usage. There are 2,294,085 records in the table, with admin_level values 2-13.
-
You can't compare the amount of memory used by some Lua structure with how much it will take in the database; memory usage in Lua will be much larger. That's why there is the warning in the manual. So I do not rule out that the problem you are describing is due to the way we do two-stage processing. To figure out where the problem is, I suggest running the exact same config, but with the one line removed where you actually store anything in the global variable. The difference you see then should tell you something. There are also ways to ask Lua how much memory it has allocated, but that goes beyond the scope here.
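For that last point: a quick sketch of how one could watch Lua's own allocation from inside a flex style, using the standard `collectgarbage("count")` call. The `log_memory` helper and the 100,000-object interval are made up for illustration; nothing here comes from the actual style file.

```lua
-- Standard Lua: collectgarbage('count') returns the memory currently
-- allocated by the Lua state, in kilobytes. Calling it periodically
-- from a processing callback shows how fast the global table grows.
local processed = 0

local function log_memory()
    processed = processed + 1
    if processed % 100000 == 0 then
        print(string.format('objects: %d, Lua heap: %.1f MB',
                            processed, collectgarbage('count') / 1024))
    end
end
```

Calling something like this once per processed relation would show whether the heap grows roughly linearly with the data stored for stage 2.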
-
Yes, I realize that the type of data structure used, and the way the data is stored, can make a huge difference. I recently had to handle a slightly similar issue, where I needed to store unique IDs and the vertex count of polygons for a multi-threaded Python application. My first, naive, approach was to store this information as a nested Python lists-in-lists structure, with a separate sub-list for each polygon record. With a few hundred million records, memory soared to over 130 GB... After reading up on Python objects and memory consumption, I finally settled on re-implementing this as one big 2D NumPy array, which probably reduced memory consumption by a factor of about 20x.
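The same effect exists in Lua, which is relevant to the osm2pgsql case: a table-per-record layout carries substantial per-table overhead compared to flat parallel arrays. A minimal standalone sketch (record count and field names are made up):

```lua
-- Helper: force a full GC cycle, then report the Lua heap in MB.
local function heap_mb()
    collectgarbage('collect')
    return collectgarbage('count') / 1024
end

local base = heap_mb()

-- Layout A: one small Lua table per record (high per-table overhead).
local per_record = {}
for id = 1, 1000000 do
    per_record[id] = { id = id, vertex_count = id % 100 }
end
print(string.format('tables-in-table: %.1f MB', heap_mb() - base))

per_record = nil
base = heap_mb()

-- Layout B: two flat parallel arrays holding the same data.
local ids, counts = {}, {}
for id = 1, 1000000 do
    ids[id] = id
    counts[id] = id % 100
end
print(string.format('parallel arrays: %.1f MB', heap_mb() - base))
```

On a typical Lua build, layout A needs several times more memory than layout B for identical data, analogous to the Python lists vs. NumPy difference described above.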
Yes, thanks for the suggestion. I will attempt that. It will take some time before I can report the results though, as I have another process running that I would like to finish first. By the way, the code involved is this:
-
You will probably need to use an intermediate database to handle data of this size. Probably
-
Although slightly speculative, I have the feeling that a large part of this excessive memory usage was down to the flex style issue described here: gravitystorm/openstreetmap-carto@b82ba63#r672314283 where, in the OpenStreetMap route relation processing code of the flex style, the route member objects were added not only to the route table but also to the line and roads tables. Paul fixed this in the commit linked above, and I have the feeling this also solved the excessive memory usage.
-
Based on @pnorman's good work on the flex version of the openstreetmap-carto style (gravitystorm/openstreetmap-carto#4431), I have been experimenting with planet-size data to see the results and what potential issues might come up with the new options.
I have done two tests:
For the first test, using the Facebook Daylight PBF, I used an early version of the openstreetmap-carto flex file, in which Paul had not yet added the 'planet-osm-admin', 'planet-osm-transport-line' and 'planet-osm-transport-polygon' tables, but had only enhanced it with a non-spatial 'planet-osm-route' table that can be used to display routes via database joins.
My VM was configured with 100 GB RAM and 50 GB of swap in Ubuntu. Peak memory usage during this first test was about 95 GB RAM + 9 GB swap (with "-C 75000" set on the command line), so this first run completed successfully, despite the large PBF.
I then ran the second test with the smaller 58 GB official planet PBF file. This time, I used the latest state of Paul's work on the flex style, which adds the new 'planet-osm-admin', 'planet-osm-transport-line' and 'planet-osm-transport-polygon' spatial tables alongside the non-spatial 'planet-osm-route' table.
With the same VM configuration, the osm2pgsql process was killed once all RAM and swap had been consumed. I then attempted it with smaller cache settings, but even with "-C 10000" the process was killed. Switching to slim mode with a flat nodes file allowed the processing to succeed.
Now my question:
As far as I now understand Paul's code in the flex style, only the 'planet-osm-admin' table actually requires stage 2 processing; all the other tables are created purely in stage 1.
As documented in the osm2pgsql manual, stage 2 processing can add a considerable amount of extra memory usage, because all data from stage 1 that is needed in stage 2 must be kept in main memory ("All data stored in stage 1 for use in stage 2 in your Lua script will use main memory.").
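For reference, the pattern in question looks roughly like this. This is a sketch based on the documented two-stage mechanism; the variable name `w2r` and the exact bookkeeping are illustrative, not necessarily what Paul's file does. Everything placed in the global table during stage 1 stays in main memory until stage 2:

```lua
-- Global table: way id -> lowest admin_level seen among parent relations.
-- This is the data that "will use main memory" per the manual.
local w2r = {}

-- Tell osm2pgsql which member ways must be re-processed in stage 2.
function osm2pgsql.select_relation_members(relation)
    if relation.tags.boundary == 'administrative' then
        return { ways = osm2pgsql.way_member_ids(relation) }
    end
end

function osm2pgsql.process_relation(relation)
    if relation.tags.boundary == 'administrative' then
        local level = tonumber(relation.tags.admin_level) or 99
        for _, member in ipairs(relation.members) do
            if member.type == 'w' then
                -- One entry per member way accumulates here for the
                -- whole planet; this is where the memory goes.
                local old = w2r[member.ref]
                if not old or level < old then
                    w2r[member.ref] = level
                end
            end
        end
    end
end
```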
Clearly, administrative boundaries are something of a "worst case" here, as admin boundary relations are among the most complex and largest relations in the whole of OpenStreetMap. So a big jump in memory usage is probably no surprise.
Yet, seeing the memory usage of osm2pgsql jump by more than 40 GB before being killed still seems a bit too much, even for the admin boundaries?
If I understand the Lua code of the style correctly, only the member ways of the boundary relations need to be stored, and since the ways are de-duplicated in the process (this was the purpose of the new 'planet-osm-admin' spatial table), the additional memory usage should be relatively modest?
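That de-duplication would then happen in the way callback during stage 2, roughly like this. Again a sketch following the manual's two-stage example, with illustrative table and column names; it assumes a global `w2r` table mapping way id to the minimum admin_level, filled during relation processing:

```lua
-- Illustrative table definition; the real style's columns will differ.
local admin_table = osm2pgsql.define_way_table('planet_osm_admin', {
    { column = 'admin_level', type = 'int' },
    { column = 'geom', type = 'linestring' },
})

-- way id -> min admin_level, filled while processing boundary relations.
local w2r = {}

function osm2pgsql.process_way(way)
    local level = w2r[way.id]
    if level then
        -- Each shared boundary way is inserted exactly once, tagged
        -- with the most important (lowest) admin_level of its parent
        -- relations, instead of once per parent relation.
        admin_table:insert({
            admin_level = level,
            geom = way:as_linestring()
        })
    end
end
```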
What am I missing? Or is this level of memory usage simply expected and normal when two-stage processing is applied to OpenStreetMap administrative boundaries?