Add SqlCatalog _commit_table support #265

sungwy · 2024-01-11T22:03:57Z

This PR takes much code-spiration from @HonahX 's recent PR for a similar feature enhancement to the Glue Catalog.

The only difference here is that we use the metadata_location as the optimistic locking parameter to ensure that the version of the table we used to validate the requirements on is still the same when the SQL update is being made.

EDIT:
for methods that require rowcount on UPDATE and DELETE for optimistic locking, we now fallback to "SELECT ... FOR UPDATE" sqlalchemy expression with_for_update to acquire a row level lock if the sqlalchemy dialect does not support rowcount for UPDATE / DELETE

https://docs.sqlalchemy.org/en/14/core/internals.html - supports_sane_rowcount

HonahX

Thanks @syun64 for the great work! Overall LGTM. Just have two minor comments

pyiceberg/catalog/sql.py

…unt is False

HonahX · 2024-01-13T08:44:09Z

pyiceberg/catalog/sql.py

+                try:
+                    tbl = (
+                        session.query(IcebergTables)
+                        .with_for_update(of=IcebergTables, nowait=True)


Quick question: Why do we use nowait here? It looks like we're not catching errors from concurrent SELECT FOR UPDATE NOWAIT attempts.

I think skip_locked=True may be better here, as it lets the existing NoResultFound catch handle the case when the row is locked by another session. What do you think?

Great question @HonahX . The reason I wanted to avoid using skip_locked is because I think that would obfuscate the exception message we receive when NoResultFound is raised, which currently means that the table that we wanted to find doesn't exist in the DB.

I felt that using nowait would be a good practice to avoid processes hanging and waiting for locks, which would behave a lot less like optimistic locking. With nowait, if we caught the right exception, we could raise a CommitFailedException with a message describing that a conflicting concurrent commit was made to the record, which is an expected type of exception with optimistic locking.

Would it work to catch sqlalchemy.exc.OperationalError that is raised when the lock isn't available?

Thanks for the explanation! Reflecting further, I think that using neither NOWAIT nor SKIP_LOCKED might be better for maintaining consistency. Currently, when the engine supports accurate rowcounts, we employ UPDATE TABLE or DELETE TABLE, which inherently wait for locks. If the engine lacks rowcount support, sticking with SELECT FOR UPDATE ensures consistent behavior with UPDATE TABLE operations, as it waits for locks. This consistency eliminates the need to handle additional exceptions, as any concurrent transaction will naturally lead to a NoResultFound scenario once the table is dropped, renamed, or updated.

Shifting focus to potential alternatives, If we later decide that avoiding lock waits is preferable in fallback situations, we could consider the following modifications:

For drop_table and rename_table, I think you're right that we can catch sqlalchemy.exc.OperationalError and re-raise it as a new exception indicating that another process is set to delete the table. Using CommitFailedException doesn't seem appropriate here, as it's primarily meant for failed commits on iceberg tables.

For _commit_table, skip_locked might still be useful. At the point when the SQL command is executed, we're assured of the table's existence. Therefore, encountering NoResultFound would directly imply a concurrent commit attempt.

Do you think the concern about maintaining consistency between cases that do and do not support rowcount is valid? If not, what are your thoughts on adopting NOWAIT for drop_table and rename_table, and SKIP_LOCKED for _commit_table?

Hi @HonahX . Thank you for explaining so patiently.

If the engine lacks rowcount support, sticking with SELECT FOR UPDATE ensures consistent behavior with UPDATE TABLE operations, as it waits for locks. This consistency eliminates the need to handle additional exceptions, as any concurrent transaction will naturally lead to a NoResultFound scenario once the table is dropped, renamed, or updated.

I agree with your analysis here, and I think that dropping nowait and skip_locked will be best to mimic the other behavior with UPDATE TABLE operation as closely as possible.

tests/catalog/test_sql.py

HonahX

LGTM! Thanks for the great work @syun64 ! For this important PR, let's give some time for others to provide their input, especially looking forward to thoughts from @Fokko and @cccs-eric, who are the original authors of the SqlCatalog.

pyiceberg/catalog/sql.py

Fokko · 2024-01-17T12:36:16Z

Thanks @syun64 for working on this 👍

sql commit

5d3c5ec

sungwy marked this pull request as draft January 11, 2024 22:04

SqlCatalog _commit_table

079a3dc

sungwy marked this pull request as ready for review January 12, 2024 02:33

HonahX reviewed Jan 12, 2024

View reviewed changes

pyiceberg/catalog/sql.py Outdated Show resolved Hide resolved

pyiceberg/catalog/sql.py Outdated Show resolved Hide resolved

sungwy added 2 commits January 12, 2024 04:45

better variable names

e09ba7c

fallback to FOR UPDATE commit when engine.dialect.supports_sane_rowco…

6bf3a31

…unt is False

HonahX reviewed Jan 13, 2024

View reviewed changes

sungwy added 2 commits January 13, 2024 23:09

remove stray print

d79884f

wait

6516b57

HonahX approved these changes Jan 15, 2024

View reviewed changes

Fokko reviewed Jan 16, 2024

View reviewed changes

pyiceberg/catalog/sql.py Outdated Show resolved Hide resolved

Fokko reviewed Jan 16, 2024

View reviewed changes

pyiceberg/catalog/sql.py Outdated Show resolved Hide resolved

better logging

90516cf

Fokko approved these changes Jan 17, 2024

View reviewed changes

Fokko merged commit 7deb739 into apache:main Jan 17, 2024
6 checks passed

Fokko mentioned this pull request Jan 17, 2024

SqlCatalog _commit_table() support #262

Closed

Fokko added this to the PyIceberg 0.6.0 release milestone Jan 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add SqlCatalog _commit_table support #265

Add SqlCatalog _commit_table support #265

sungwy commented Jan 11, 2024 •

edited

Loading

HonahX left a comment

HonahX Jan 13, 2024

sungwy Jan 14, 2024

HonahX Jan 14, 2024 •

edited

Loading

sungwy Jan 14, 2024

HonahX left a comment

Fokko commented Jan 17, 2024

Add SqlCatalog _commit_table support #265

Add SqlCatalog _commit_table support #265

Conversation

sungwy commented Jan 11, 2024 • edited Loading

HonahX left a comment

Choose a reason for hiding this comment

HonahX Jan 13, 2024

Choose a reason for hiding this comment

sungwy Jan 14, 2024

Choose a reason for hiding this comment

HonahX Jan 14, 2024 • edited Loading

Choose a reason for hiding this comment

sungwy Jan 14, 2024

Choose a reason for hiding this comment

HonahX left a comment

Choose a reason for hiding this comment

Fokko commented Jan 17, 2024

sungwy commented Jan 11, 2024 •

edited

Loading

HonahX Jan 14, 2024 •

edited

Loading