-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HIVE-28408: Support ARRAY field access in CBO #5577
base: master
Are you sure you want to change the base?
Conversation
ef2d276
to
e702419
Compare
e702419
to
d146c5f
Compare
d146c5f
to
2f4338c
Compare
ql/src/java/org/apache/hadoop/hive/ql/parse/type/RexNodeExprFactory.java
Outdated
Show resolved
Hide resolved
@@ -3008,7 +3008,7 @@ STAGE PLANS: | |||
enabled: true | |||
enabledConditionsMet: hive.vectorized.use.vectorized.input.format IS true | |||
inputFileFormats: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat | |||
notVectorizedReason: Key expression for GROUPBY operator: Vectorizing complex type LIST not supported | |||
notVectorizedReason: exception: java.lang.ClassCastException: org.apache.hadoop.hive.serde2.typeinfo.ListTypeInfo cannot be cast to org.apache.hadoop.hive.serde2.typeinfo.StructTypeInfo stack trace: org.apache.hadoop.hive.ql.exec.vector.VectorizationContext.getStructFieldIndex(VectorizationContext.java:1106), org.apache.hadoop.hive.ql.exec.vector.VectorizationContext.getGenericUDFStructField(VectorizationContext.java:1094), org.apache.hadoop.hive.ql.exec.vector.VectorizationContext.getVectorExpression(VectorizationContext.java:1074), org.apache.hadoop.hive.ql.exec.vector.VectorizationContext.getVectorExpression(VectorizationContext.java:972), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer.vectorizeSelectOperator(Vectorizer.java:4806), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer.validateAndVectorizeOperator(Vectorizer.java:5443), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer.doProcessChild(Vectorizer.java:1011), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer.doProcessChildren(Vectorizer.java:898), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer.validateAndVectorizeOperatorTree(Vectorizer.java:868), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer.access$2500(Vectorizer.java:253), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer$VectorizationDispatcher.validateAndVectorizeMapOperators(Vectorizer.java:2118), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer$VectorizationDispatcher.validateAndVectorizeMapOperators(Vectorizer.java:2070), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer$VectorizationDispatcher.validateAndVectorizeMapWork(Vectorizer.java:2045), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer$VectorizationDispatcher.convertMapWork(Vectorizer.java:1199), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer$VectorizationDispatcher.dispatch(Vectorizer.java:1051), ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although the result is unchanged, maybe we would like to have a better error here. ClassCastException sounds irregular. Also, this stack trace should not be robust
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @okumin. Please review my latest updated patch for this.
@@ -1103,6 +1103,9 @@ private VectorExpression getGenericUDFStructField(ExprNodeFieldDesc exprNodeFiel | |||
private int getStructFieldIndex(ExprNodeFieldDesc exprNodeFieldDesc) throws HiveException { | |||
ExprNodeDesc structNodeDesc = exprNodeFieldDesc.getDesc(); | |||
String fieldName = exprNodeFieldDesc.getFieldName(); | |||
if (exprNodeFieldDesc.getIsList()) { | |||
throw new HiveException("Could not vectorize expression: Vectorizing complex type LIST is not supported"); | |||
} | |||
StructTypeInfo structTypeInfo = (StructTypeInfo) structNodeDesc.getTypeInfo(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fact that we are hitting a ClassCastException
without the extra if check above is a bit worrisome. It means that the new logic has some impact on the vectorization codepath. Why are we hitting this now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was able to reproduce the ClassCastException even without the patch for a simpler query that I mentioned below:
EXPLAIN VECTORIZATION EXPRESSION
SELECT s5.f16.f18.f19 FROM nested_tbl_1;
I will be adding this to the list of queries in the same q file below. What I identified is that, we have certain nested access queries that were not handled well in the vectorization and we need to address that. But there is no new failures after our patch here. We can try to address this issue in a separate patch if you think it is reasonable idea.
@@ -1103,6 +1103,9 @@ private VectorExpression getGenericUDFStructField(ExprNodeFieldDesc exprNodeFiel | |||
private int getStructFieldIndex(ExprNodeFieldDesc exprNodeFieldDesc) throws HiveException { | |||
ExprNodeDesc structNodeDesc = exprNodeFieldDesc.getDesc(); | |||
String fieldName = exprNodeFieldDesc.getFieldName(); | |||
if (exprNodeFieldDesc.getIsList()) { | |||
throw new HiveException("Could not vectorize expression: Vectorizing complex type LIST is not supported"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Vectorizing complex type..
message is created by getValidateDataTypeErrorMsg
during some validation type checks. Why aren't we hitting these checks before arriving into this method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are hitting this message because we never do that check(call getValidateDataTypeErrorMsg) for the select clause expressions and do this check only for the group by expression, which is later in the operator tree for that particular vertex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have not addressed this by calling the getValidateDataTypeErrorMsg() because, doing so will prevent any struct field in the select clause but right now we only fail if there is a list without an index in vectorization. I have also addressed the exception message accordingly
* Special operator that is used as syntactic sugar to change the type of collection | ||
* expressions in order to perform field access over them. | ||
*/ | ||
public static final SqlOperator COMPONENT_ACCESS = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of the existing Hive specific operators/functions are under org.apache.hadoop.hive.ql.optimizer.calcite.reloperators
package (e.g., HiveDateAddSqlOperator
). Consider moving the new operator into a separate class under that package.
Having one class per function is not ideal but it may be a good idea to follow the existing paradigm at least till we decide to perform a holistic refactoring that puts them all together.
// supplied by serdes. | ||
throw new CalciteSemanticException("Unexpected rexnode : " | ||
+ expr.getClass().getCanonicalName(), UnsupportedFeature.Schema_less_table); | ||
// Safe exception. Shouldn't Ideally come here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment does not provide much info; consider dropping it.
// This may happen for schema-less tables, where columns are dynamically | ||
// supplied by serdes. | ||
throw new CalciteSemanticException("Unexpected rexnode : " | ||
+ expr.getClass().getCanonicalName(), UnsupportedFeature.Schema_less_table); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we remove the UnsupportedFeature.Schema_less_table
from the enumeration.
It would be also nice to include a qfile with an EXPLAIN CBO statement that shows that we can treat ARRAY field access in CBO. Possibly the book example that I added under HIVE-28408 is a good fit. |
@zabetak Can you please review the latest patch? I have addressed most of the comments, except the two that have I left a comment above. |
Quality Gate passedIssues Measures |
What changes were proposed in this pull request?
We need to support array field access in the CBO. This is currently already supported in the execution side. Only CBO fails with an error. So we will be by passing the CBO by doing no-op in this case to leave it to hive compiler and execution engine to do rest of the work
Why are the changes needed?
In order to enable CBO for all the queries
Does this PR introduce any user-facing change?
No
Is the change a dependency upgrade?
No
How was this patch tested?
mvn test -Dtest=TestMiniLlapLocalCliDriver -Dqfile=nested_column_pruning.q