Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HIVE-28408: Support ARRAY field access in CBO #5577

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

ramesh0201
Copy link
Contributor

What changes were proposed in this pull request?

We need to support array field access in the CBO. This is currently already supported in the execution side. Only CBO fails with an error. So we will be by passing the CBO by doing no-op in this case to leave it to hive compiler and execution engine to do rest of the work

Why are the changes needed?

In order to enable CBO for all the queries

Does this PR introduce any user-facing change?

No

Is the change a dependency upgrade?

No

How was this patch tested?

mvn test -Dtest=TestMiniLlapLocalCliDriver -Dqfile=nested_column_pruning.q

common/src/java/org/apache/hadoop/hive/conf/HiveConf.java Outdated Show resolved Hide resolved
@@ -3008,7 +3008,7 @@ STAGE PLANS:
enabled: true
enabledConditionsMet: hive.vectorized.use.vectorized.input.format IS true
inputFileFormats: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
notVectorizedReason: Key expression for GROUPBY operator: Vectorizing complex type LIST not supported
notVectorizedReason: exception: java.lang.ClassCastException: org.apache.hadoop.hive.serde2.typeinfo.ListTypeInfo cannot be cast to org.apache.hadoop.hive.serde2.typeinfo.StructTypeInfo stack trace: org.apache.hadoop.hive.ql.exec.vector.VectorizationContext.getStructFieldIndex(VectorizationContext.java:1106), org.apache.hadoop.hive.ql.exec.vector.VectorizationContext.getGenericUDFStructField(VectorizationContext.java:1094), org.apache.hadoop.hive.ql.exec.vector.VectorizationContext.getVectorExpression(VectorizationContext.java:1074), org.apache.hadoop.hive.ql.exec.vector.VectorizationContext.getVectorExpression(VectorizationContext.java:972), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer.vectorizeSelectOperator(Vectorizer.java:4806), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer.validateAndVectorizeOperator(Vectorizer.java:5443), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer.doProcessChild(Vectorizer.java:1011), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer.doProcessChildren(Vectorizer.java:898), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer.validateAndVectorizeOperatorTree(Vectorizer.java:868), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer.access$2500(Vectorizer.java:253), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer$VectorizationDispatcher.validateAndVectorizeMapOperators(Vectorizer.java:2118), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer$VectorizationDispatcher.validateAndVectorizeMapOperators(Vectorizer.java:2070), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer$VectorizationDispatcher.validateAndVectorizeMapWork(Vectorizer.java:2045), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer$VectorizationDispatcher.convertMapWork(Vectorizer.java:1199), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer$VectorizationDispatcher.dispatch(Vectorizer.java:1051), ...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although the result is unchanged, maybe we would like to have a better error here. ClassCastException sounds irregular. Also, this stack trace should not be robust

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @okumin. Please review my latest updated patch for this.

@@ -1103,6 +1103,9 @@ private VectorExpression getGenericUDFStructField(ExprNodeFieldDesc exprNodeFiel
private int getStructFieldIndex(ExprNodeFieldDesc exprNodeFieldDesc) throws HiveException {
ExprNodeDesc structNodeDesc = exprNodeFieldDesc.getDesc();
String fieldName = exprNodeFieldDesc.getFieldName();
if (exprNodeFieldDesc.getIsList()) {
throw new HiveException("Could not vectorize expression: Vectorizing complex type LIST is not supported");
}
StructTypeInfo structTypeInfo = (StructTypeInfo) structNodeDesc.getTypeInfo();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fact that we are hitting a ClassCastException without the extra if check above is a bit worrisome. It means that the new logic has some impact on the vectorization codepath. Why are we hitting this now?

Copy link
Contributor Author

@ramesh0201 ramesh0201 Jan 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was able to reproduce the ClassCastException even without the patch for a simpler query that I mentioned below:
EXPLAIN VECTORIZATION EXPRESSION
SELECT s5.f16.f18.f19 FROM nested_tbl_1;
I will be adding this to the list of queries in the same q file below. What I identified is that, we have certain nested access queries that were not handled well in the vectorization and we need to address that. But there is no new failures after our patch here. We can try to address this issue in a separate patch if you think it is reasonable idea.

@@ -1103,6 +1103,9 @@ private VectorExpression getGenericUDFStructField(ExprNodeFieldDesc exprNodeFiel
private int getStructFieldIndex(ExprNodeFieldDesc exprNodeFieldDesc) throws HiveException {
ExprNodeDesc structNodeDesc = exprNodeFieldDesc.getDesc();
String fieldName = exprNodeFieldDesc.getFieldName();
if (exprNodeFieldDesc.getIsList()) {
throw new HiveException("Could not vectorize expression: Vectorizing complex type LIST is not supported");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Vectorizing complex type.. message is created by getValidateDataTypeErrorMsg during some validation type checks. Why aren't we hitting these checks before arriving into this method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are hitting this message because we never do that check(call getValidateDataTypeErrorMsg) for the select clause expressions and do this check only for the group by expression, which is later in the operator tree for that particular vertex.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not addressed this by calling the getValidateDataTypeErrorMsg() because, doing so will prevent any struct field in the select clause but right now we only fail if there is a list without an index in vectorization. I have also addressed the exception message accordingly

* Special operator that is used as syntactic sugar to change the type of collection
* expressions in order to perform field access over them.
*/
public static final SqlOperator COMPONENT_ACCESS =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the existing Hive specific operators/functions are under org.apache.hadoop.hive.ql.optimizer.calcite.reloperators package (e.g., HiveDateAddSqlOperator). Consider moving the new operator into a separate class under that package.

Having one class per function is not ideal but it may be a good idea to follow the existing paradigm at least till we decide to perform a holistic refactoring that puts them all together.

// supplied by serdes.
throw new CalciteSemanticException("Unexpected rexnode : "
+ expr.getClass().getCanonicalName(), UnsupportedFeature.Schema_less_table);
// Safe exception. Shouldn't Ideally come here.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment does not provide much info; consider dropping it.

// This may happen for schema-less tables, where columns are dynamically
// supplied by serdes.
throw new CalciteSemanticException("Unexpected rexnode : "
+ expr.getClass().getCanonicalName(), UnsupportedFeature.Schema_less_table);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove the UnsupportedFeature.Schema_less_table from the enumeration.

@zabetak
Copy link
Member

zabetak commented Jan 6, 2025

It would be also nice to include a qfile with an EXPLAIN CBO statement that shows that we can treat ARRAY field access in CBO. Possibly the book example that I added under HIVE-28408 is a good fit.

@ramesh0201
Copy link
Contributor Author

@zabetak Can you please review the latest patch? I have addressed most of the comments, except the two that have I left a comment above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants