HIVE-28578: Fix concurrency issue in ObjectStore#updateTableColumnStatistics #5567

Open · wants to merge 2 commits into base: master

Conversation

@InvisibleProgrammer (Contributor)

What changes were proposed in this pull request?

There is a concurrency issue: persisting table parameters can fail with a duplicate key error when replication stores table column statistics and the table has statistics for multiple engines. With this change, if a parallel process has already saved those parameters, we retry the operation so that DataNucleus performs an update instead of an insert.
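The retry described above can be sketched as a small self-contained loop. This is a hypothetical illustration, not the patch itself: `RetrySketch` and `runWithRetries` are made-up names, and a plain `RuntimeException` stands in for DataNucleus's `NucleusDataStoreException`.

```java
import java.util.function.Supplier;

// Hypothetical sketch of the retry pattern: if the first attempt hits a
// duplicate-key error (because a parallel HMS instance already inserted the
// row), the operation is re-run so the persistence layer sees the existing
// row and issues an UPDATE instead of an INSERT.
public class RetrySketch {
    public static <T> T runWithRetries(Supplier<T> op, int maxRetries) {
        RuntimeException last = null;
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            try {
                return op.get();            // attempt the insert/update
            } catch (RuntimeException e) {  // stand-in for NucleusDataStoreException
                last = e;                   // remember the failure and retry
            }
        }
        throw last;                         // retries exhausted: propagate
    }
}
```

On the second attempt the ORM re-reads the table state, finds the row written by the parallel process, and switches from insert to update.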

Why are the changes needed?

To avoid duplicate key errors.

Does this PR introduce any user-facing change?

No

Is the change a dependency upgrade?

No

How was this patch tested?

Repro and test process is described in the ticket: https://issues.apache.org/jira/browse/HIVE-28578?focusedCommentId=17901761&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17901761
Tested manually on a cluster.

@zratkai (Contributor) left a comment:

Please check my comments!

catch (NucleusDataStoreException e) {
retries--;

if (retries == 0) {
@zratkai (Contributor):
How about logging the error here? Something like "Maximum number of retries (Retry = 3) reached".

@InvisibleProgrammer (Author):
I'm not sure it is necessary. Logging already happens during the retries, and I don't see the value of a log entry right before throwing an exception: the exception itself will be logged together with its stack trace.
It would be something like that in the log:

[INFO] I will throw an exception
[ERROR] There is an exception...

int retries = 3;
boolean success = false;
while (!success && retries > 0) {
// TODO## ideally the col stats stats should be in colstats, not in the table!
@zratkai (Contributor):
I think "stats" was repeated by mistake in "col stats stats", so one occurrence can be removed.

@InvisibleProgrammer (Author):
I'm not sure the author made a mistake. Column statistics are a kind of statistics, so saying that the statistics about column statistics shouldn't be in the table makes sense to me.

@deniskuzZ (Member):

I am not sure I get this. There is already a table-level lock for atomic stats update and we are adding an extra re-try loop inside?

@InvisibleProgrammer (Author):

> I am not sure I get this. There is already a table-level lock for atomic stats update and we are adding an extra re-try loop inside?

The existing table-level lock is an in-memory lock. That means that if you have multiple HMS instances and two different processes receive requests at the same time, the lock simply has no effect, because it exists only within a single JVM.
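The point about JVM-local locking can be made concrete with a small sketch. This is illustrative only (the class and field names are invented, not Hive's actual lock code): each HMS process builds its own lock map, so two processes locking the same table name hold two unrelated lock objects.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: one instance of this class stands in for one HMS
// process. The lock map lives in process memory, so it cannot coordinate
// anything across separate JVMs.
public class InMemoryLockSketch {
    private final Map<String, ReentrantLock> tableLocks = new ConcurrentHashMap<>();

    // Returns this process's lock for the given table, creating it on demand.
    public ReentrantLock lockFor(String tableName) {
        return tableLocks.computeIfAbsent(tableName, k -> new ReentrantLock());
    }
}
```

Within one process the same table name always yields the same lock object; a second process gets a completely independent one, which is why the in-memory lock cannot prevent the cross-instance duplicate-key race.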


@deniskuzZ (Member) commented Dec 4, 2024:

> I am not sure I get this. There is already a table-level lock for atomic stats update and we are adding an extra re-try loop inside?

> The existing table-level lock is an in-memory lock. That means in case you have multiple HMS instances and two different processes are getting requests in the same time, it just doesn't have any effect - as on JVM level, there is only one single instance.

ok, in-memory lock serves diff purpose - thread safety, see HIVE-25904.

Description says "updateTableColumnStatistics can throw SQLIntegrityConstraintViolationException during replication if HA is on and two different HMS instance gets the same call but with different engine."
Isn't the engine part of the PK? If not, maybe it should be.

Some of RawStore interface methods are not retriable by design, so I would expect others to be retriable without the need for extra hacks, simply use RetryingMetaStoreClient.

public interface RawStore extends Configurable {
  /***
   * Annotation to skip retries
   */
  @Target(value = ElementType.METHOD)
  @Retention(value = RetentionPolicy.RUNTIME)
  @interface CanNotRetry {
  }
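The `CanNotRetry` marker above is the kind of annotation a retry wrapper can consult reflectively before deciding whether to re-invoke a method. A small self-contained sketch of that idea (class and method names here are illustrative, not Hive's actual `RetryingMetaStoreClient` code):

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

// Hypothetical sketch: a retry policy that invokes methods marked
// @CanNotRetry exactly once and allows everything else to be retried.
public class RetryPolicySketch {
    @Target(ElementType.METHOD)
    @Retention(RetentionPolicy.RUNTIME)
    public @interface CanNotRetry {}

    // Stand-in for a store implementation with a mix of methods.
    public static class SampleStore {
        @CanNotRetry
        public void commitTransaction() {}
        public void updateTableColumnStatistics() {}
    }

    // True when the named no-arg method carries no @CanNotRetry annotation.
    public static boolean isRetriable(Class<?> cls, String methodName) {
        try {
            Method m = cls.getMethod(methodName);
            return !m.isAnnotationPresent(CanNotRetry.class);
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```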

@InvisibleProgrammer (Author) commented Dec 4, 2024:

> I am not sure I get this. There is already a table-level lock for atomic stats update and we are adding an extra re-try loop inside?
>
> The existing table-level lock is an in-memory lock. That means in case you have multiple HMS instances and two different processes are getting requests in the same time, it just doesn't have any effect - as on JVM level, there is only one single instance.
>
> ok, in-memory lock serves diff purpose - thread safety, see HIVE-25904.
>
> Descriptions says "updateTableColumnStatistics can throw SQLIntegrityConstraintViolationException during replication if HA is on and two different HMS instance gets the same call but with different engine. " isn't the engine part of PK? If not, maybe it should be..
>
> Some of RawStore interface methods are not retriable by design, so I would expect others to be retriable without the need for extra hacks, simply use RetryingMetaStoreClient.
>
> public interface RawStore extends Configurable {
>   /***
>    * Annotation to skip retries
>    */
>   @Target(value = ElementType.METHOD)
>   @Retention(value = RetentionPolicy.RUNTIME)
>   @interface CanNotRetry {
>   }

> Descriptions says "updateTableColumnStatistics can throw SQLIntegrityConstraintViolationException during replication if HA is on and two different HMS instance gets the same call but with different engine. " isn't the engine part of PK? If not, maybe it should be..

It is trickier than that. Most of the statistics are stored in TAB_COL_STATS, which does have an engine field. But we have at least one statistic that is stored in TABLE_PARAMS. I really don't know the reason, and even the code says it shouldn't be that way. See this comment from 2018:

      // TODO: (HIVE-20109) ideally the col stats stats should be in colstats, not in the table!

I assume that TODO never happened.
And TABLE_PARAMS is just a key-value list. For example, at the customer that ran into the problem, the duplicate entry is COLUMN_STATS_ACCURATE. I don't know why this property is in TABLE_PARAMS rather than TAB_COL_STATS, and I have no information about how many other fields could be in the same situation.

> Some of RawStore interface methods are not retriable by design, so I would expect others to be retriable without the need for extra hacks, simply use RetryingMetaStoreClient.

Let me check what that does exactly and whether it can be applied at the customer.
TBH, I'm not sure "retry" is the right word for the current change. If we could cover the situation with plain SQL, it would be an upsert (merge) statement: insert if the row doesn't exist, update if it does. But DataNucleus is an ORM tool that cannot do such a thing directly. If we force it to query the table state again, on the second attempt it will recognise that records already exist, and it will do an update instead of an insert.
Do you know a better way to do an insert-or-update statement using DataNucleus?
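The insert-or-update semantics discussed above can be illustrated with a tiny self-contained sketch. This is purely illustrative (the class name is invented, and a `HashMap` stands in for the TABLE_PARAMS key/value table): check whether the key already exists, then take the update or the insert path.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical upsert over a Map standing in for TABLE_PARAMS. With plain
// SQL this would be a single MERGE statement; with the ORM, the retry makes
// the second attempt re-read the row and turn the failed INSERT into an UPDATE.
public class TableParamsUpsertSketch {
    private final Map<String, String> tableParams = new HashMap<>();

    // Stores the value and reports which path was taken: "insert" when the
    // key was absent, "update" when an existing entry was overwritten.
    public String upsert(String key, String value) {
        String previous = tableParams.put(key, value);
        return previous == null ? "insert" : "update";
    }

    public String get(String key) {
        return tableParams.get(key);
    }
}
```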

@InvisibleProgrammer (Author):

Hi @deniskuzZ ,
Could you please explain why you consider it a wrong idea to retry so that an update is attempted when an insert didn't succeed due to a concurrency issue?

TBH, I'm not sure that using the retrying handler for all the metastore client calls is an approach worth adopting just because of this one rare edge case.

And I also consider it a shortcoming of the ORM tool that it cannot properly handle such a trivial case: with plain SQL, it could be handled by a simple merge statement.
