
Always use index for insert conflict resolution #7529

Open
wants to merge 1 commit into base: main

Conversation

antekresic (Contributor)

During insert conflict resolution, we find the index on the compressed chunk based on the unique constraint on the chunk. However, we can (and should) always use the index, since we have all the index column values in the slot that's being inserted: create scan keys until we find a column that doesn't exist in the uncompressed chunk.

@antekresic antekresic force-pushed the antekresic/idx_find_fix branch from 39d7f54 to 4decd26 Compare December 11, 2024 14:43
@antekresic antekresic marked this pull request as ready for review December 11, 2024 14:43
@antekresic antekresic self-assigned this Dec 11, 2024
@antekresic antekresic enabled auto-merge (rebase) December 11, 2024 14:44
Member

@akuzm akuzm left a comment


Maybe it makes sense to merge it to the 2.17.x directly. Approving since this is an urgent fix.

I think the proper fix would need more changes, e.g. the key_columns is not needed anymore. It also looks suboptimal that we're going through every index for every input tuple again.

Contributor

@erimatnor erimatnor left a comment


Approving since this is an important fix.

Still, I think this function needs refactoring. As @akuzm points out, key_columns is no longer needed, and this looks more complicated than it needs to be, without delving too deep into it now.

Could we at least get rid of the key_columns arg in this fix since it is a dead argument?

Also, I wonder whether the other, similar functions that build scan keys have this problem too, since they also use key_columns? (I didn't actually check how it is used, so I cannot tell whether it is an issue or not.)

@antekresic antekresic force-pushed the antekresic/idx_find_fix branch from 4decd26 to 75d728e Compare December 11, 2024 15:28
@erimatnor (Contributor)

I think the proper fix would need more changes, e.g. the key_columns is not needed anymore. It also looks suboptimal that we're going through every index for every input tuple again.

100% agree. Finding the index should be separated from building the scan key. Then we should open the index in the chunk insert state or similar and keep it open until the insert is finished. Only the scankey needs to be updated for the next tuple.

@antekresic (Contributor, Author)

antekresic commented Dec 11, 2024

Actually, key columns are necessary; we just need to skip over columns which are not in key columns. The previous implementation could potentially give wrong results.

As for the index caching, I don't disagree:
https://github.com/timescale/timescaledb/blob/main/tsl/src/compression/compression_dml.c#L416

@erimatnor (Contributor)

Actually, key columns are necessary, we just need to skip over columns which are not in key columns. Previous implementation could potentially give wrong results.

Ok, I guess I need to take another look. How could it give wrong results? Did it decompress the wrong segments, or something else?

@erimatnor erimatnor self-requested a review December 11, 2024 15:40
 if (!bms_is_member(column_attno, key_columns))
 {
-    break;
+    continue;
Contributor

Will this really work? Don't you need at least the first index column as a scan key?

Contributor (Author)

There is a check later for the number of scan keys a particular index can use; we only use the index if we can generate at least one scan key.

@erimatnor (Contributor) Dec 11, 2024

But we can't use an index if we only match, e.g., the last index key right?

With this check you might generate a suffix.

Let's say an index includes (a, b, c): then you can use it if you match <a>, <a,b>, <a,b,c>, or <a,c>, right?

But you cannot use the index if you match <b> or <c> or <b,c>... or am I missing something?

Here it looks like you can generate a match for the latter examples too since you can skip any column, including the first one.

Contributor

Ideally, we should also look for the longest prefix match. Here it looks like we are happy with the first matching index even if there is a better one that matches more keys.

Maybe the simple approach works for our case because we know what indexes exist, but I am not sure the code is really doing the right thing here.

Member

Even though it's not optimal, there are circumstances where picking an index without the leading column is actually done in Postgres:

feike=# \d a
                 Table "public.a"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 odd    | boolean |           |          |
 j      | integer |           |          |
 t      | text    |           |          |
Indexes:
    "a_odd_j_idx" btree (odd, j)

feike=# explain select * from a where j=4;
                                 QUERY PLAN
----------------------------------------------------------------------------
 Index Scan using a_odd_j_idx on a  (cost=0.42..18484.43 rows=1 width=1909)
   Index Cond: (j = 4)
(2 rows)

It's more that a statistics-based algorithm should pick the index, though?

Contributor

It's more a statistics based algorithm that should pick the index though?

Also, this index matching code isn't using stats at all.

Contributor

FWIW, if you are confident that the current code is OK despite the concerns/questions I raised, then feel free to go ahead.

Member

But we can't use an index if we only match, e.g., the last index key right?

It's going to return the correct results. It'll be a kind of "index seq scan", where you scan an index but don't actually have any search keys, only plain scan keys ("index cond" vs "index filter" in explain). We actually had a bug like this in our catalog lookups where we used a wrong index.

@erimatnor (Contributor) Dec 11, 2024

But we can't use an index if we only match, e.g., the last index key right?

It's going to return the correct results. It'll be a kind of "index seq scan", where you scan an index but don't actually have any search keys, only plain scan keys ("index cond" vs "index filter" in explain). We actually had a bug like this in our catalog lookups where we used a wrong index.

Sure, but this isn't only about correctness, but also about performance. Mostly wondering if falling back to a seq scan isn't faster in that case... 🤷‍♂️

Contributor (Author)

I highly doubt using a seq scan is ever faster, unless there is a very small number of batches, simply because we are looking for a specific tuple.

I think it's an OK assumption that we want to do an index scan here whenever we can.

@antekresic antekresic force-pushed the antekresic/idx_find_fix branch from 75d728e to d7278e6 Compare December 11, 2024 16:16
@erimatnor (Contributor)

What's the current status of this fix? Is it ready to be merged?

@antekresic antekresic force-pushed the antekresic/idx_find_fix branch from d7278e6 to 05752e0 Compare December 12, 2024 08:45
@antekresic antekresic force-pushed the antekresic/idx_find_fix branch from 05752e0 to 0e3600f Compare December 12, 2024 09:18