You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I have been running into a bug due to parquet-format and parquet-format-structures both defining the org.apache.parquet.format.Util class but doing so inconsistently.
Examples of this are several methods which include a BlockCipher parameter that are defined from parquet-format-structures but not {}parquet-format{}. While invoking code that happens to use these, such as {}org.apache.parquet.hadoop.ParquetFileReader.readFooter{}, the code will fail if the parquet-format happens to be loaded first on the classpath.
Here is an example stack trace for a Scala Spark application.
Because of issues external to Parquet that I have yet to figure out (a complex Spark and dependency setup), my classpaths are not deterministically ordered and I am unable to pin the parquet-format-structures ahead hence why I'm chiming in about this.
Gabor Szadovszky / @gszadovszky: @wgtmac, Let me summarize the history of this. parquet-format contains all the specification docs and the parquet.thrift itself which is a kind of source code and spec at the same time. This is good to have all of these separated from the implementations. Meanwhile, since the thrift file is there, it was natural to have Thrift code generation and the Util there as well. But it was not a good choice since we only had the java code there. In some new features we had to extend Util which is clearly related to parquet-mr. So, we decided to deprecate all of the java related stuff in parquet-format and moved them to parquet-format-structures under parquet-mr.
So, it would be good to not only have Util be removed but all the other java classes including the Thrift generated ones to be part of the jar.
The catch is we still need to have some mechanism that validates the thrift file so we won't add invalid changes. Also, the distribution should be changed because providing a jar file without java classes would not make sense. I think, we should release a tarball instead that contains all the specs and the thrift file as well. Of course, we would need to update the parquet-mr (and maybe other affected implementations) to download that tarball instead of the jar file.
I have been running into a bug due to
parquet-format
andparquet-format-structures
both defining theorg.apache.parquet.format.Util
class but doing so inconsistently.Examples of this are several methods which include a
BlockCipher
parameter that are defined fromparquet-format-structures
but not{}parquet-format{
}. While invoking code that happens to use these, such as{}org.apache.parquet.hadoop.ParquetFileReader.readFooter{
}, the code will fail if theparquet-format
happens to be loaded first on the classpath.Here is an example stack trace for a Scala Spark application.
Because of issues external to Parquet that I have yet to figure out (a complex Spark and dependency setup), my classpaths are not deterministically ordered and I am unable to pin the
parquet-format-structures
ahead hence why I'm chiming in about this.Even if that weren't the case, this is a fairly prickly edge to run into as both modules define overlapping classes.
Util
is not the only class that appears to be defined by both, just what I have been focusing on due to this bug.It appears these methods were introduced in at least 1.12: apache/parquet-java@65b95fb#diff-852341c99dcae06c8fa2b764bcf3d9e6860e40442d0ab1cf5b935df80a9cacb7
Reporter: Joey Pereira
Note: This issue was originally created as PARQUET-2317. Please see the migration documentation for further details.
The text was updated successfully, but these errors were encountered: