Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HIVE-28642:column stats state unnecessarily created when hive.stats.c… #5560

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

yigress
Copy link
Contributor

@yigress yigress commented Nov 26, 2024

do not set column stats state when load new partition if hive.stats.column.autogather=false

What changes were proposed in this pull request?

simplify query string when load new partition by removing the column stats state setup when HIVE_STATS_COL_AUTOGATHER=false

Why are the changes needed?

avoid PARAM_VALUE too long error from metastore db.

Does this PR introduce any user-facing change?

no

Is the change a dependency upgrade?

no

How was this patch tested?

tested locally

@ngsg
Copy link
Contributor

ngsg commented Nov 26, 2024

The patch seems reasonable, but I have a little doubt about the behaviour of StatsSetupConst.setStatsStateForCreateTable.

If I understand correctly, StatsSetupConst.setStatsStateForCreateTable does not computes column statistics immediately, but it updates COLUMN_STATS_ACCURATE. Could you explain how does COLUMN_STATS_ACCURATE gather ColStats or trigger ColStats computation? If this is not related to ColStats computation, then I think it would be better to backport HIVE-20221 into branch-3.1.

Also I found that CreateTableLikeOperation and AbstractAddPartitionAnalyzer also call StatsSetupConst.setStatsStateForCreateTable similarly. Maybe we could apply the same fix on them too.

@yigress
Copy link
Contributor Author

yigress commented Nov 26, 2024

thanks @ngsg for the review. I changed CreateTableLikeOperation, AbstractAddPartitionAnalyzer and the loadPartition. for loadPartition the actual column stats is updated later. for CreateTableLikeOperation the call should not have column stats so setting state true is wrong. i can't find anywhere using the COLUMN_STATS_ACCURATE to trigger compute, it seems it is just used for parsing.

some user can not change their metastore db so backport HIVE-20221 not an option for them and instead they want to just turn column stats off.

@@ -74,8 +74,7 @@ public int execute() throws HiveException {

if (desc.getLocation() == null && !tbl.isPartitioned() &&
context.getConf().getBoolVar(HiveConf.ConfVars.HIVE_STATS_AUTOGATHER)) {
StatsSetupConst.setStatsStateForCreateTable(tbl.getTTable().getParameters(),
MetaStoreUtils.getColumnNames(tbl.getCols()), StatsSetupConst.TRUE);
StatsSetupConst.setStatsStateForCreateTable(tbl.getTTable().getParameters(),null, StatsSetupConst.TRUE);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT, we missed a space between tbl.getTTable().getParameters() and null.

MetaStoreUtils.getColumnNames(table.getCols()), StatsSetupConst.TRUE);
List<String> colNames = conf.getBoolVar(HiveConf.ConfVars.HIVE_STATS_COL_AUTOGATHER) ?
MetaStoreUtils.getColumnNames(table.getCols()) : null;
StatsSetupConst.setStatsStateForCreateTable(params,colNames, StatsSetupConst.TRUE);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT, the same space issue here.

@@ -2820,8 +2820,7 @@ private Partition loadPartitionInternal(Path loadPath, Table tbl, Map<String, St
if (oldPart == null) {
newTPart.getTPartition().setParameters(new HashMap<String,String>());
if (this.getConf().getBoolVar(HiveConf.ConfVars.HIVE_STATS_AUTOGATHER)) {
StatsSetupConst.setStatsStateForCreateTable(newTPart.getParameters(),
MetaStoreUtils.getColumnNames(tbl.getCols()), StatsSetupConst.TRUE);
StatsSetupConst.setStatsStateForCreateTable(newTPart.getParameters(),null, StatsSetupConst.TRUE);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT, the same space issue here, too.

@ngsg
Copy link
Contributor

ngsg commented Nov 27, 2024

Could you check the result of Hive CI? It seems that your last change disables HIVE_STATS_COL_AUTOGATHER feature.

@ngsg
Copy link
Contributor

ngsg commented Nov 29, 2024

LGTM with one nit and updating 3 qfile outputs

I have checked that 3 qfiles failed because Hive marks all columns as accurate in COLUMN_STATS_ACCURATE for newly create TableLikes and Partitions. I think these test qfile verify your patch on CreateTableLikeOperation and AbstractAddPartitionAnalyzer.

I wonder if there exists any incorrect column statistics computation issues which this patch can fix, but I couldn't find such problems so far. The following link contains a query that I used to investigate. It would be appreciated if anyone know and could share an example of incorrect column statistics computation.

import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.*;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: Could we remove star-import? Star-import should be used in limited situations.

@okumin
Copy link
Contributor

okumin commented Dec 9, 2024

I wonder why we don't have to apply a similar update on the other classes using HIVE_STATS_AUTOGATHER. CreateTableDesc or AlterTableConcatenateAnalyzer are examples

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants