Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

parquet-format and parquet-format-structures defines Util with inconsitent methods provided #420

Open
asfimport opened this issue Jun 23, 2023 · 2 comments

Comments

@asfimport
Copy link
Collaborator

I have been running into a bug due to parquet-format and parquet-format-structures both defining the org.apache.parquet.format.Util class but doing so inconsistently.

Examples of this are several methods which include a BlockCipher parameter that are defined from parquet-format-structures but not {}parquet-format{}. While invoking code that happens to use these, such as {}org.apache.parquet.hadoop.ParquetFileReader.readFooter{}, the code will fail if the parquet-format happens to be loaded first on the classpath.

Here is an example stack trace for a Scala Spark application.

Caused by: java.lang.NoSuchMethodError: 'org.apache.parquet.format.FileMetaData org.apache.parquet.format.Util.readFileMetaData(java.io.InputStream, org.apache.parquet.format.BlockCipher$Decryptor, byte[])'
at org.apache.parquet.format.converter.ParquetMetadataConverter$3.visit(ParquetMetadataConverter.java:1441) ~[parquet_hadoop.jar:1.13.1]
at org.apache.parquet.format.converter.ParquetMetadataConverter$3.visit(ParquetMetadataConverter.java:1438) ~[parquet_hadoop.jar:1.13.1]
at org.apache.parquet.format.converter.ParquetMetadataConverter$NoFilter.accept(ParquetMetadataConverter.java:1173) ~[parquet_hadoop.jar:1.13.1]
at org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:1438) ~[parquet_hadoop.jar:1.13.1]
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:591) ~[parquet_hadoop.jar:1.13.1]
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:536) ~[parquet_hadoop.jar:1.13.1]
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:530) ~[parquet_hadoop.jar:1.13.1]
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:478) ~[parquet_hadoop.jar:1.13.1]
... (my application code invoking the above)

Because of issues external to Parquet that I have yet to figure out (a complex Spark and dependency setup), my classpaths are not deterministically ordered and I am unable to pin the parquet-format-structures ahead hence why I'm chiming in about this.

Even if that weren't the case, this is a fairly prickly edge to run into as both modules define overlapping classes. Util is not the only class that appears to be defined by both, just what I have been focusing on due to this bug.
It appears these methods were introduced in at least 1.12: apache/parquet-java@65b95fb#diff-852341c99dcae06c8fa2b764bcf3d9e6860e40442d0ab1cf5b935df80a9cacb7

Reporter: Joey Pereira

Note: This issue was originally created as PARQUET-2317. Please see the migration documentation for further details.

@asfimport
Copy link
Collaborator Author

Gang Wu / @wgtmac:
The Util class defined in parquet-format has been deprecated.

https://github.com/apache/parquet-format/blob/master/src/main/java/org/apache/parquet/format/Util.java#L56

Should we delete it in the next release? @gszadovszky @ggershinsky

@asfimport
Copy link
Collaborator Author

Gabor Szadovszky / @gszadovszky:
@wgtmac, Let me summarize the history of this. parquet-format contains all the specification docs and the parquet.thrift itself which is a kind of source code and spec at the same time. This is good to have all of these separated from the implementations. Meanwhile, since the thrift file is there, it was natural to have Thrift code generation and the Util there as well. But it was not a good choice since we only had the java code there. In some new features we had to extend Util which is clearly related to parquet-mr. So, we decided to deprecate all of the java related stuff in parquet-format and moved them to parquet-format-structures under parquet-mr.
So, it would be good to not only have Util be removed but all the other java classes including the Thrift generated ones to be part of the jar.
The catch is we still need to have some mechanism that validates the thrift file so we won't add invalid changes. Also, the distribution should be changed because providing a jar file without java classes would not make sense. I think, we should release a tarball instead that contains all the specs and the thrift file as well. Of course, we would need to update the parquet-mr (and maybe other affected implementations) to download that tarball instead of the jar file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants