HIVE-28408: Support ARRAY field access in CBO #5577

ramesh0201 · 2024-12-12T17:17:44Z

What changes were proposed in this pull request?

We need to support ARRAY field access in the CBO. The execution side already supports it; only the CBO fails with an error. This change therefore bypasses the CBO with a no-op in this case and leaves the rest of the work to the Hive compiler and execution engine.
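
For readers unfamiliar with the feature, here is a rough value-level illustration of ARRAY field access. It is only a sketch, assuming the usual Hive behaviour that accessing a field on an array of structs yields an array of that field's values. The class `ArrayFieldAccessSketch` and the helper `fieldOverArray` are hypothetical and are not part of this patch or of Hive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class ArrayFieldAccessSketch {
  // Hypothetical helper: models `arr.field` where arr is ARRAY<STRUCT<...>>.
  // Each struct is represented as a Map from field name to value.
  static List<Object> fieldOverArray(List<Map<String, Object>> structs, String field) {
    List<Object> out = new ArrayList<>();
    for (Map<String, Object> struct : structs) {
      // One output element per array element: the value of the requested field.
      out.add(struct.get(field));
    }
    return out;
  }

  public static void main(String[] args) {
    List<Map<String, Object>> rows = Arrays.asList(
        Collections.singletonMap("f19", (Object) "a"),
        Collections.singletonMap("f19", (Object) "b"));
    System.out.println(fieldOverArray(rows, "f19")); // prints [a, b]
  }
}
```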

Why are the changes needed?

To enable CBO for all queries.

Does this PR introduce any user-facing change?

No

Is the change a dependency upgrade?

No

How was this patch tested?

mvn test -Dtest=TestMiniLlapLocalCliDriver -Dqfile=nested_column_pruning.q

common/src/java/org/apache/hadoop/hive/conf/HiveConf.java

ql/src/java/org/apache/hadoop/hive/ql/parse/type/RexNodeExprFactory.java

okumin · 2024-12-22T10:32:47Z

ql/src/test/results/clientpositive/llap/vector_orc_nested_column_pruning.q.out

@@ -3008,7 +3008,7 @@ STAGE PLANS:
                enabled: true
                enabledConditionsMet: hive.vectorized.use.vectorized.input.format IS true
                inputFileFormats: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
-                notVectorizedReason: Key expression for GROUPBY operator: Vectorizing complex type LIST not supported
+                notVectorizedReason: exception: java.lang.ClassCastException: org.apache.hadoop.hive.serde2.typeinfo.ListTypeInfo cannot be cast to org.apache.hadoop.hive.serde2.typeinfo.StructTypeInfo stack trace: org.apache.hadoop.hive.ql.exec.vector.VectorizationContext.getStructFieldIndex(VectorizationContext.java:1106), org.apache.hadoop.hive.ql.exec.vector.VectorizationContext.getGenericUDFStructField(VectorizationContext.java:1094), org.apache.hadoop.hive.ql.exec.vector.VectorizationContext.getVectorExpression(VectorizationContext.java:1074), org.apache.hadoop.hive.ql.exec.vector.VectorizationContext.getVectorExpression(VectorizationContext.java:972), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer.vectorizeSelectOperator(Vectorizer.java:4806), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer.validateAndVectorizeOperator(Vectorizer.java:5443), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer.doProcessChild(Vectorizer.java:1011), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer.doProcessChildren(Vectorizer.java:898), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer.validateAndVectorizeOperatorTree(Vectorizer.java:868), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer.access$2500(Vectorizer.java:253), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer$VectorizationDispatcher.validateAndVectorizeMapOperators(Vectorizer.java:2118), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer$VectorizationDispatcher.validateAndVectorizeMapOperators(Vectorizer.java:2070), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer$VectorizationDispatcher.validateAndVectorizeMapWork(Vectorizer.java:2045), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer$VectorizationDispatcher.convertMapWork(Vectorizer.java:1199), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer$VectorizationDispatcher.dispatch(Vectorizer.java:1051), ...


Although the result is unchanged, we would probably like a better error here. A ClassCastException looks irregular. Also, expected output that contains this stack trace is unlikely to be robust.

Thank you @okumin. Please review my latest updated patch for this.

zabetak · 2025-01-06T14:31:25Z

ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorizationContext.java

@@ -1103,6 +1103,9 @@ private VectorExpression getGenericUDFStructField(ExprNodeFieldDesc exprNodeFiel
  private int getStructFieldIndex(ExprNodeFieldDesc exprNodeFieldDesc) throws HiveException {
    ExprNodeDesc structNodeDesc = exprNodeFieldDesc.getDesc();
    String fieldName = exprNodeFieldDesc.getFieldName();
+    if (exprNodeFieldDesc.getIsList()) {
+      throw new HiveException("Could not vectorize expression: Vectorizing complex type LIST is not supported");
+    }
    StructTypeInfo structTypeInfo = (StructTypeInfo) structNodeDesc.getTypeInfo();


The fact that we are hitting a ClassCastException without the extra if check above is a bit worrisome. It means that the new logic has some impact on the vectorization codepath. Why are we hitting this now?

I was able to reproduce the ClassCastException even without the patch, using the simpler query below:
EXPLAIN VECTORIZATION EXPRESSION
SELECT s5.f16.f18.f19 FROM nested_tbl_1;
I will add this to the list of queries in the same q file. What I found is that certain nested access queries are not handled well in vectorization, and we need to address that. However, this patch introduces no new failures. We can address this issue in a separate patch if you think that is reasonable.

zabetak · 2025-01-06T14:33:50Z

ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorizationContext.java

@@ -1103,6 +1103,9 @@ private VectorExpression getGenericUDFStructField(ExprNodeFieldDesc exprNodeFiel
  private int getStructFieldIndex(ExprNodeFieldDesc exprNodeFieldDesc) throws HiveException {
    ExprNodeDesc structNodeDesc = exprNodeFieldDesc.getDesc();
    String fieldName = exprNodeFieldDesc.getFieldName();
+    if (exprNodeFieldDesc.getIsList()) {
+      throw new HiveException("Could not vectorize expression: Vectorizing complex type LIST is not supported");


The "Vectorizing complex type.." message is created by getValidateDataTypeErrorMsg during validation type checks. Why aren't we hitting these checks before arriving at this method?

We reach this point because the check (a call to getValidateDataTypeErrorMsg) is never done for the select clause expressions; it is done only for the group by expression, which comes later in the operator tree for that particular vertex.

I have not addressed this by calling getValidateDataTypeErrorMsg(), because doing so would prevent vectorizing any struct field in the select clause, whereas right now vectorization fails only when there is a list without an index. I have also updated the exception message accordingly.
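
The guard discussed in this thread can be sketched in isolation as follows. This is a minimal illustration, not the patch itself: `TypeDesc`, `StructDesc`, `ListDesc`, `NotVectorizableException`, and `structFieldIndex` are hypothetical stand-ins for Hive's type descriptors, HiveException, and getStructFieldIndex. The sketch checks the parent type directly, whereas the patch checks `exprNodeFieldDesc.getIsList()`. The point is only that the LIST case is rejected with a descriptive checked exception before the cast to a struct type, while plain struct field access still resolves to an index.

```java
import java.util.Arrays;
import java.util.List;

public class StructFieldGuardSketch {
  /** Hypothetical stand-in for a type descriptor; not Hive's TypeInfo. */
  interface TypeDesc {}

  static final class StructDesc implements TypeDesc {
    final List<String> fieldNames;
    StructDesc(String... fieldNames) { this.fieldNames = Arrays.asList(fieldNames); }
  }

  static final class ListDesc implements TypeDesc {
    final TypeDesc element;
    ListDesc(TypeDesc element) { this.element = element; }
  }

  /** Checked exception playing the role of HiveException in this sketch. */
  static final class NotVectorizableException extends Exception {
    NotVectorizableException(String message) { super(message); }
  }

  static int structFieldIndex(TypeDesc parent, String fieldName)
      throws NotVectorizableException {
    // Reject the LIST case explicitly instead of letting the cast below fail.
    if (parent instanceof ListDesc) {
      throw new NotVectorizableException(
          "Could not vectorize expression: field access over LIST is not supported");
    }
    if (!(parent instanceof StructDesc)) {
      throw new NotVectorizableException("Field access over a non-struct type");
    }
    int index = ((StructDesc) parent).fieldNames.indexOf(fieldName);
    if (index < 0) {
      throw new NotVectorizableException("Field not found: " + fieldName);
    }
    return index;
  }

  public static void main(String[] args) throws Exception {
    TypeDesc struct = new StructDesc("f18", "f19");
    System.out.println(structFieldIndex(struct, "f19")); // prints 1
    try {
      structFieldIndex(new ListDesc(struct), "f19");
    } catch (NotVectorizableException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

Because the failure is a checked exception with a fixed message, a caller can report it as a "not vectorized" reason without embedding a stack trace in the expected test output.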

zabetak · 2025-01-06T14:40:54Z

ql/src/java/org/apache/hadoop/hive/ql/parse/type/RexNodeExprFactory.java

+   * Special operator that is used as syntactic sugar to change the type of collection
+   * expressions in order to perform field access over them.
+   */
+  public static final SqlOperator COMPONENT_ACCESS =


Most of the existing Hive specific operators/functions are under org.apache.hadoop.hive.ql.optimizer.calcite.reloperators package (e.g., HiveDateAddSqlOperator). Consider moving the new operator into a separate class under that package.

Having one class per function is not ideal but it may be a good idea to follow the existing paradigm at least till we decide to perform a holistic refactoring that puts them all together.

zabetak · 2025-01-06T14:42:06Z

ql/src/java/org/apache/hadoop/hive/ql/parse/type/RexNodeExprFactory.java

-      // supplied by serdes.
-      throw new CalciteSemanticException("Unexpected rexnode : "
-          + expr.getClass().getCanonicalName(), UnsupportedFeature.Schema_less_table);
+      // Safe exception. Shouldn't Ideally come here.


The comment does not provide much info; consider dropping it.

zabetak · 2025-01-06T14:42:35Z

ql/src/java/org/apache/hadoop/hive/ql/parse/type/RexNodeExprFactory.java

-      // This may happen for schema-less tables, where columns are dynamically
-      // supplied by serdes.
-      throw new CalciteSemanticException("Unexpected rexnode : "
-          + expr.getClass().getCanonicalName(), UnsupportedFeature.Schema_less_table);


Can we remove UnsupportedFeature.Schema_less_table from the enumeration?

zabetak · 2025-01-06T14:48:16Z

It would also be nice to include a qfile with an EXPLAIN CBO statement that shows that we can treat ARRAY field access in CBO. The book example that I added under HIVE-28408 is possibly a good fit.

ramesh0201 · 2025-01-09T07:16:02Z

@zabetak Can you please review the latest patch? I have addressed most of the comments, except the two on which I left a comment above.

sonarqubecloud · 2025-01-09T15:01:49Z

Quality Gate passed

Issues
18 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

asf-ci-hive added the tests pending label Dec 12, 2024

ramesh0201 force-pushed the HIVE-28408 branch from ef2d276 to e702419 Compare December 12, 2024 17:29

asf-ci-hive added tests failed tests pending tests unstable and removed tests pending tests failed labels Dec 12, 2024

ramesh0201 force-pushed the HIVE-28408 branch from e702419 to d146c5f Compare December 18, 2024 12:50

asf-ci-hive added tests pending tests unstable and removed tests unstable tests pending labels Dec 18, 2024

ramesh0201 force-pushed the HIVE-28408 branch from d146c5f to 2f4338c Compare December 18, 2024 18:34

asf-ci-hive added tests pending tests unstable and removed tests unstable tests pending labels Dec 18, 2024

okumin reviewed Dec 22, 2024

ramesh0201 force-pushed the HIVE-28408 branch from 2f4338c to 7133c6b Compare January 2, 2025 07:53

asf-ci-hive added tests pending tests passed and removed tests unstable tests pending labels Jan 2, 2025

zabetak reviewed Jan 6, 2025

HIVE-28408: Support ARRAY field access in CBO

7c7856c

ramesh0201 force-pushed the HIVE-28408 branch from 7133c6b to 7c7856c Compare January 9, 2025 06:58

asf-ci-hive added tests pending and removed tests passed labels Jan 9, 2025

asf-ci-hive added tests unstable tests pending and removed tests pending tests unstable labels Jan 9, 2025

asf-ci-hive added tests unstable and removed tests pending labels Jan 9, 2025
