Making indexes more flexible #1780

joto · 2022-09-28T15:17:30Z

All of the following concerns only the flex output. We are not planning to change any of the other outputs.

One of the most asked for features is to have more functionality around indexes:

Current situation

Osm2pgsql will always create indexes on the id columns of all tables if the database is created for update (with --slim and without --drop). These indexes are needed for osm2pgsql itself, so it can do updates properly. This will not change.

In addition osm2pgsql always creates a GIST index on the geometry column of each table. If a table doesn't have a geometry column, no index is created. If it has several, only the first will get the index.

Indexes are created in parallel (up to the number set with --number-processes), unless --disable-parallel-indexing is set.

Features we want

Allow tables without any indexes
Allow indexes on any column (of any column type)
Allow indexes on several columns at once
Allow indexes on expressions
Allow any type of index (btree, gist, ...)
Allow setting of WHERE conditions
Allow setting of TABLESPACE
Allow setting of other options like UNIQUE etc.

See also the PostgreSQL docs for indexes.

Also any solution must be backwards compatible, so running osm2pgsql with an unchanged config file and command line options must still do the same.

Proposal

Add indexes field to the options given to the osm2pgsql.define_table() command.

The indexes fields can have one of the following values:

nil (default) means to create an index on the first geometry field. This takes care of the backwards compatibilty.
false means that we do not want any indexes on this table.
a list of index definitions describing what indexes we want. The list may be empty which is the same as setting indexes to false.

Examples

Here is an examples config showing the old way of doing things:

local data_table = osm2pgsql.define_table{
    name = "data",
    ids = { type = 'any', id_column = 'osm_id', type_column = 'osm_type' },
    index = nil,
    columns = {
        { column = 'name', type = 'text' },
        { column = 'tags',  type = 'jsonb' },
        { column = 'geom',  type = 'geometry' }
    }
}

You can set this instead if you don't want any index:

    indexes = false,

Or this does the same:

    indexes = {},

Or you can indicate exactly what you want:

    indexes = {{ column = 'geom', method = 'gist', fillfactor = 'auto' },
               { column = 'name', method = 'btree', tablespace = 'name_index_ts' },
               { column = {'street', 'housenumber'} },
               { expression = 'upper(name)', include = {'id'}, where = 'ever' }}

Index options

The following index options are available. Some values for these options can be checked by osm2pgsql to allow nice error messages. Others can not be checked by osm2pgsql and we can only report the error message from the database.

Generally not all options (such as fillfactor) are supported for all index methods but we can not reliably check this because it might be different in different database versions.

We can create the indexes directly after creating their tables to check whether the CREATE INDEX command is successful and remove it again afterwards. Otherwise users might only get the error message after many hours of import.

`column`

The column name or an array of column names to build the index on. Must be at least one column. Names are checked against available column names for this index. Can not be used together with expression.

`method`

The indexing method, btree (default), gist, etc. Checked against the list of methods supported by the databases, as returned by the query SELECT * FROM pg_am WHERE amtype='i';.

`tablespace`

The tablespace to use for this index. Default (nil) is the tablespace set by the index_tablespace option on the table or the default tablespace if none is set. Checked against the list of available tablespaces (SELECT spcname FROM pg_tablespace;).

We can keep the index_tablespace option on the table or possibly mark it as deprecated later on.

`expression`

An expression interpreted as-is by the database. No check is done. Can not be used together with column.

`include`

Extra columns to include, default is none. Checked against the list of columns of this table. Only available for PostgreSQL 11+.

`fillfactor`

Set the fill factor. Allowed values are integers 1 to 100 or auto (default) which means set to 100 for non-updateable databases (created without --slim or with --drop) and leave unset otherwise. (This is the current behaviour for backwards compatibility.)

`where`

Add a WHERE clause to the index creation. The content is passed through to the database without check.

`unique`

Create a UNIQUE index. Must be set to true or false (default).

Showing progress and help with crash recovery

See also #207 for the question of how processing progress can be shown in a better way.

With modern PostgreSQL versions it is possible to monitor index creation progress through the pg_stat_progress_create_index system table. If we create a table osm2pgsql_indexes and add all indexes to be created to that table, we can always get the current status with a simple SQL query.

We can add the SQL CREATE INDEX command that osm2pgsql issues to that table. This would allow advanced users to re-run index creation manually if the import failed while creating indexes. In the long run this could be part of a more automatic approach to resume failed imports.

The text was updated successfully, but these errors were encountered:

rustprooflabs · 2022-09-29T00:36:20Z

This sounds great. I have a number of indexes manually created post-import, this will consolidate that. Very happy to see support for both partial indexes and expressions.

lonvia · 2022-09-29T07:21:22Z

I wouldn't want to see fillfactor exposed to the user. In fact, I'd rather see us silently get rid of the code that sets a special fillfactor. I suspect that the gain is minimal. The downside is that it makes the code more complex and it may lead to nasty surprises when users import non-updateable and then further post-process the tables.

@pnorman You introduced the fillfactor code. Any opinion?

mboeringa · 2022-09-29T16:54:25Z

Nice ideas, but personally, I see little value of

indexes = false,

over:

indexes = {},

I would just support the latter. There is no reason to have two options, and setting an empty list seems straightforward and clear enough from a syntactical point of view, while setting false is mildly ambiguous, it doesn't necessarily indicate "no indexes will be created".

pnorman · 2022-09-29T18:55:14Z

I'd rather keep it. Saving 10% disk space on indexes is significant.

In the uncommon situation where a user is doing post-processing that rewrites portions of the table, they need to rebuilding the indexes to get optimal performance regardless of fillfactor.

I don't think fillfactor should be exposed to the user, it should come from the --slim and --drop settings on the command-line.

The table definitions have a new (optional) field called "indexes" now which takes a list of index definitions. If the field is not there, we fall back to what we did before, a GIST index on the only/first geometry column of table is created. To disable indexes, set to an empty array. See the flex-config/indexes.lua Lua config for some usage examples. See osm2pgsql-dev#1780

joto · 2022-12-05T09:30:36Z

The PR #1847 implements most of this now. I took the suggestion from @mboeringa and didn't implement the indexes = false.

I didn't expose the fillfactor to the user. If you are not setting any index, you get the old behaviour with the fillfactor depending on the import mode. But for any other indexes this is not set. I need to do some experimenting and benchmarking with these settings. But this PR gives us a lot of functionality beyond what we have and doesn't take away anything, so that's good for now. (You can always disable index build and run an SQL command after osm2pgsql has run to do anything really special.)

This should resolve #27, #1357, and #1691 and also unblocks #1311.

pnorman · 2022-12-05T09:33:03Z

But for any other indexes this is not set. I need to do some experimenting and benchmarking with these settings.

We should set it the same way as we do for the default indexes.

The table definitions have a new (optional) field called "indexes" now which takes a list of index definitions. If the field is not there, we fall back to what we did before, a GIST index on the only/first geometry column of table is created. To disable indexes, set to an empty array. See the flex-config/indexes.lua Lua config for some usage examples. See osm2pgsql-dev#1780

joto · 2022-12-05T13:15:20Z

But for any other indexes this is not set. I need to do some experimenting and benchmarking with these settings.

We should set it the same way as we do for the default indexes.

I am not sure that's a good idea before we have any data on how this behaves in reality. For one the behaviour might be different for different index types. The PostgreSQL docs say "The other index methods use fillfactor in different but roughly analogous ways; the default fillfactor varies between methods." And when users use different, special indexes, chances are greater that they are doing something special where the automagic setting of the fillfactor might get in the way. I believe doing the behaviour I am suggesting here gives us the most options if and when we want to adjust the behaviour later on.

The table definitions have a new (optional) field called "indexes" now which takes a list of index definitions. If the field is not there, we fall back to what we did before, a GIST index on the only/first geometry column of table is created. To disable indexes, set to an empty array. See the flex-config/indexes.lua Lua config for some usage examples. See osm2pgsql-dev#1780

joto · 2023-01-25T13:53:49Z

This has been implemented.

joto pinned this issue Sep 28, 2022

joto added the flex Issues about the flex output label Sep 28, 2022

joto mentioned this issue Dec 5, 2022

Flex: Add support for configuring table indexes in the Lua config #1847

Merged

joto mentioned this issue Dec 19, 2022

Manual index configuration for id indexes? #1854

Closed

joto unpinned this issue Jan 4, 2023

joto closed this as completed Jan 25, 2023

joto mentioned this issue Jun 4, 2024

Switching OSM Carto to use the osm2pgsql flex output gravitystorm/openstreetmap-carto#4977

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Making indexes more flexible #1780

Making indexes more flexible #1780

joto commented Sep 28, 2022

rustprooflabs commented Sep 29, 2022

lonvia commented Sep 29, 2022

mboeringa commented Sep 29, 2022

pnorman commented Sep 29, 2022

joto commented Dec 5, 2022

pnorman commented Dec 5, 2022

joto commented Dec 5, 2022

joto commented Jan 25, 2023

Making indexes more flexible #1780

Making indexes more flexible #1780

Comments

joto commented Sep 28, 2022

Current situation

Features we want

Proposal

Examples

Index options

column

method

tablespace

expression

include

fillfactor

where

unique

Showing progress and help with crash recovery

rustprooflabs commented Sep 29, 2022

lonvia commented Sep 29, 2022

mboeringa commented Sep 29, 2022

pnorman commented Sep 29, 2022

joto commented Dec 5, 2022

pnorman commented Dec 5, 2022

joto commented Dec 5, 2022

joto commented Jan 25, 2023

`column`

`method`

`tablespace`

`expression`

`include`

`fillfactor`

`where`

`unique`