HIVE-28578: Fix concurrency issue in ObjectStore#updateTableColumnStatistics #5567

Open · wants to merge 2 commits into base: master

Conversation

@InvisibleProgrammer (Contributor)

What changes were proposed in this pull request?

There is a concurrency issue: persisting table parameters can fail with a duplicate key error when replication stores table column statistics and the table has statistics for multiple engines. With this change, if a parallel process has already saved those parameters, we retry the operation so that DataNucleus performs an update instead of an insert.
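The retry described above can be sketched as a small self-contained loop. This is a hypothetical illustration, not the patch itself: `RetrySketch` and `runWithRetries` are made-up names, and a plain `RuntimeException` stands in for DataNucleus's `NucleusDataStoreException`.

```java
import java.util.function.Supplier;

// Hypothetical sketch of the retry pattern: if the first attempt hits a
// duplicate-key error (because a parallel HMS instance already inserted the
// row), the operation is re-run so the persistence layer sees the existing
// row and issues an UPDATE instead of an INSERT.
public class RetrySketch {
    public static <T> T runWithRetries(Supplier<T> op, int maxRetries) {
        RuntimeException last = null;
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            try {
                return op.get();            // attempt the insert/update
            } catch (RuntimeException e) {  // stand-in for NucleusDataStoreException
                last = e;                   // remember the failure and retry
            }
        }
        throw last;                         // retries exhausted: propagate
    }
}
```

On the second attempt the ORM re-reads the table state, finds the row written by the parallel process, and switches from insert to update.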

Why are the changes needed?

To avoid duplicate key errors.

Does this PR introduce any user-facing change?

No

Is the change a dependency upgrade?

No

How was this patch tested?

Repro and test process is described in the ticket: https://issues.apache.org/jira/browse/HIVE-28578?focusedCommentId=17901761&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17901761
Tested manually on a cluster.

@zratkai (Contributor) left a comment:

Please check my comments!

catch (NucleusDataStoreException e) {
retries--;

if (retries == 0) {
@zratkai (Contributor):
How about logging the error here? Something like "Maximum number of retries (Retry = 3) reached".

@InvisibleProgrammer (Author):
I'm not sure it is necessary. Logging already happens during the retries, and I don't see the value of a log entry right before throwing an exception: the exception itself will be logged together with its stack trace.
It would be something like that in the log:

[INFO] I will throw an exception
[ERROR] There is an exception...

int retries = 3;
boolean success = false;
while (!success && retries > 0) {
// TODO## ideally the col stats stats should be in colstats, not in the table!
@zratkai (Contributor):
I think "stats" was repeated by mistake in "col stats stats", so one occurrence can be removed.

@InvisibleProgrammer (Author):
I'm not sure the author made a mistake. Column statistics are a kind of statistics, so saying that the statistics about column statistics shouldn't be in the table makes sense to me.

@deniskuzZ (Member):

I am not sure I get this. There is already a table-level lock for atomic stats update and we are adding an extra re-try loop inside?

@InvisibleProgrammer (Author):

> I am not sure I get this. There is already a table-level lock for atomic stats update and we are adding an extra re-try loop inside?

The existing table-level lock is an in-memory lock. That means that if you have multiple HMS instances and two different processes receive requests at the same time, the lock simply has no effect, because it exists only within a single JVM.
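The point about JVM-local locking can be made concrete with a small sketch. This is illustrative only (the class and field names are invented, not Hive's actual lock code): each HMS process builds its own lock map, so two processes locking the same table name hold two unrelated lock objects.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: one instance of this class stands in for one HMS
// process. The lock map lives in process memory, so it cannot coordinate
// anything across separate JVMs.
public class InMemoryLockSketch {
    private final Map<String, ReentrantLock> tableLocks = new ConcurrentHashMap<>();

    // Returns this process's lock for the given table, creating it on demand.
    public ReentrantLock lockFor(String tableName) {
        return tableLocks.computeIfAbsent(tableName, k -> new ReentrantLock());
    }
}
```

Within one process the same table name always yields the same lock object; a second process gets a completely independent one, which is why the in-memory lock cannot prevent the cross-instance duplicate-key race.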


@deniskuzZ (Member) commented Dec 4, 2024:

> I am not sure I get this. There is already a table-level lock for atomic stats update and we are adding an extra re-try loop inside?

> The existing table-level lock is an in-memory lock. That means in case you have multiple HMS instances and two different processes are getting requests in the same time, it just doesn't have any effect - as on JVM level, there is only one single instance.

ok, in-memory lock serves diff purpose - thread safety, see HIVE-25904.

Description says "updateTableColumnStatistics can throw SQLIntegrityConstraintViolationException during replication if HA is on and two different HMS instance gets the same call but with different engine."
Isn't the engine part of the PK? If not, maybe it should be.

Some of RawStore interface methods are not retriable by design, so I would expect others to be retriable without the need for extra hacks, simply use RetryingMetaStoreClient.

public interface RawStore extends Configurable {
  /***
   * Annotation to skip retries
   */
  @Target(value = ElementType.METHOD)
  @Retention(value = RetentionPolicy.RUNTIME)
  @interface CanNotRetry {
  }
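The `CanNotRetry` marker above is the kind of annotation a retry wrapper can consult reflectively before deciding whether to re-invoke a method. A small self-contained sketch of that idea (class and method names here are illustrative, not Hive's actual `RetryingMetaStoreClient` code):

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

// Hypothetical sketch: a retry policy that invokes methods marked
// @CanNotRetry exactly once and allows everything else to be retried.
public class RetryPolicySketch {
    @Target(ElementType.METHOD)
    @Retention(RetentionPolicy.RUNTIME)
    public @interface CanNotRetry {}

    // Stand-in for a store implementation with a mix of methods.
    public static class SampleStore {
        @CanNotRetry
        public void commitTransaction() {}
        public void updateTableColumnStatistics() {}
    }

    // True when the named no-arg method carries no @CanNotRetry annotation.
    public static boolean isRetriable(Class<?> cls, String methodName) {
        try {
            Method m = cls.getMethod(methodName);
            return !m.isAnnotationPresent(CanNotRetry.class);
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```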

@InvisibleProgrammer (Author) commented Dec 4, 2024:

> I am not sure I get this. There is already a table-level lock for atomic stats update and we are adding an extra re-try loop inside?
>
> The existing table-level lock is an in-memory lock. That means in case you have multiple HMS instances and two different processes are getting requests in the same time, it just doesn't have any effect - as on JVM level, there is only one single instance.
>
> ok, in-memory lock serves diff purpose - thread safety, see HIVE-25904.
>
> Descriptions says "updateTableColumnStatistics can throw SQLIntegrityConstraintViolationException during replication if HA is on and two different HMS instance gets the same call but with different engine. " isn't the engine part of PK? If not, maybe it should be..
>
> Some of RawStore interface methods are not retriable by design, so I would expect others to be retriable without the need for extra hacks, simply use RetryingMetaStoreClient.
>
> public interface RawStore extends Configurable {
>   /***
>    * Annotation to skip retries
>    */
>   @Target(value = ElementType.METHOD)
>   @Retention(value = RetentionPolicy.RUNTIME)
>   @interface CanNotRetry {
>   }

> Descriptions says "updateTableColumnStatistics can throw SQLIntegrityConstraintViolationException during replication if HA is on and two different HMS instance gets the same call but with different engine. " isn't the engine part of PK? If not, maybe it should be..

It is trickier than that. Most of the statistics are stored in TAB_COL_STATS, which does have an engine field. But we have at least one statistic that is stored in TABLE_PARAMS. I really don't know the reason, and even the code says it shouldn't be that way. See this comment from 2018:

      // TODO: (HIVE-20109) ideally the col stats stats should be in colstats, not in the table!

I assume that TODO never happened.
And TABLE_PARAMS is just a key-value list. For example, at the customer that ran into the problem, the duplicate entry is COLUMN_STATS_ACCURATE. I don't know why this property is in TABLE_PARAMS rather than TAB_COL_STATS, and I have no information about how many other fields could be in the same situation.

> Some of RawStore interface methods are not retriable by design, so I would expect others to be retriable without the need for extra hacks, simply use RetryingMetaStoreClient.

Let me check what that does exactly and whether it can be applied at the customer.
TBH, I'm not sure "retry" is the right word for the current change. If we could cover the situation with plain SQL, it would be an upsert (merge) statement: insert if the row doesn't exist, update if it does. But DataNucleus is an ORM tool that cannot do such a thing directly. If we force it to query the table state again, on the second attempt it will recognise that records already exist, and it will do an update instead of an insert.
Do you know a better way to do an insert-or-update statement using DataNucleus?
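The insert-or-update semantics discussed above can be illustrated with a tiny self-contained sketch. This is purely illustrative (the class name is invented, and a `HashMap` stands in for the TABLE_PARAMS key/value table): check whether the key already exists, then take the update or the insert path.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical upsert over a Map standing in for TABLE_PARAMS. With plain
// SQL this would be a single MERGE statement; with the ORM, the retry makes
// the second attempt re-read the row and turn the failed INSERT into an UPDATE.
public class TableParamsUpsertSketch {
    private final Map<String, String> tableParams = new HashMap<>();

    // Stores the value and reports which path was taken: "insert" when the
    // key was absent, "update" when an existing entry was overwritten.
    public String upsert(String key, String value) {
        String previous = tableParams.put(key, value);
        return previous == null ? "insert" : "update";
    }

    public String get(String key) {
        return tableParams.get(key);
    }
}
```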

@InvisibleProgrammer (Author):

Hi @deniskuzZ ,
Could you please explain why you consider it a wrong idea to retry so that an update is attempted when an insert didn't succeed due to a concurrency issue?

TBH, I'm not sure that using the retrying handler for all the metastore client calls is an approach worth adopting just because of this one rare edge case.

And I also consider it a shortcoming of the ORM tool that it cannot properly handle such a trivial case: with plain SQL, it could be handled by a simple merge statement.
