Assembly shaderules break serialization/deserialization with Dataset and Dataframe #279

oroundtree · 2022-08-23T15:00:21Z

I've been working on an issue for a while now where certain features of sparksql-scalapb haven't been working correctly, mostly related to encoders and the following error when creating a Dataframe or Dataset of serialized protobuf data:
Unable to find encoder for type Array[Byte]. An implicit Encoder[Array[Byte]] is needed to store Array[Byte] instances in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.

My scalatests for serialization and deserialization work when they are run in the same project that the protobuf messages are in, using the compiled code. However, they fail if I'm using the assembled jar unless I remove the following shaderule from build.sbt:
ShadeRule.rename("shapeless.**" -> "shadeshapeless.@1").inAll

I've also tested this and found the same results when running a class without scalatest dependencies.

I haven't yet seen any issues from removing the above shaderule, but I'm also not sure why it is there and what the implications of removing it are...
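
For context, a rule like this normally lives in the sbt-assembly settings of build.sbt. The fragment below is a minimal sketch, assuming the sbt-assembly plugin is already enabled in project/plugins.sbt. Only the shapeless rule is taken from this report. The comment about why the rule exists is my understanding, not something confirmed in this thread.

```scala
// build.sbt (fragment; assumes sbt-assembly is enabled in project/plugins.sbt)

// Rename every class under shapeless.* to shadeshapeless.* inside the fat jar.
// Assumption: the rule is there so the job's own shapeless does not clash with
// whatever copy is already on the Spark cluster's classpath.
assembly / assemblyShadeRules := Seq(
  ShadeRule.rename("shapeless.**" -> "shadeshapeless.@1").inAll
)
```

Removing the rule means the assembled jar carries shapeless under its original package name, so any conflict it was meant to avoid would only show up at runtime on a cluster, not in local tests.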

thesamet · 2022-08-23T15:59:32Z

Hi @oroundtree, thanks for reporting. It does sound very strange that the presence of the shading rule creates a problem. Can you provide a minimal example that reproduces this, including instructions? You can start by forking https://github.com/thesamet/sparksql-scalapb-test

oroundtree · 2022-08-23T16:15:21Z

Here you go: https://github.com/oroundtree/sparksql-scalapb-test-oroundtree

Master has the shaderule and the tests present, and the no-shaderule branch has the shaderule removed and the version number bumped up so I can test adding the two jars as unmanaged dependencies separately. Here are the steps I follow to reproduce it:

1. Pull master from https://github.com/oroundtree/sparksql-scalapb-test-oroundtree
2. Build the jar using sbt assembly (note that all the tests pass in the process)
3. Pull master from the project we'll use to test pulling the jar as a dependency (https://github.com/oroundtree/sparksql-scalapb-import-oroundtree)
4. Place sparksql-scalapb-test-oroundtree-assembly-1.0.0.jar into the /lib folder in the project
5. Run sbt test
6. You'll get an encoder not found error and the tests will not compile

After that, you can give the non-shaded jar a try using the same steps as above, except:

1. Use the no-shaderule branch of https://github.com/oroundtree/sparksql-scalapb-test-oroundtree
2. Assemble the jar and replace sparksql-scalapb-test-oroundtree-assembly-1.0.0.jar with sparksql-scalapb-test-oroundtree-assembly-1.0.1.jar
3. Run sbt test
4. The tests should run and pass

EDIT: Also worth noting that I get the same results in both cases if I pull the jar as a managed dependency in sbt or maven (i.e. from a private maven repository).

EDIT 2: If you are using IDEA, the IDE may complain that the imports from your unmanaged sbt dependency are not found. You can safely ignore the syntax highlighting.
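
The two ways of consuming the jar mentioned above can be sketched in the importing project's build.sbt. This is only an illustration: sbt picks up jars from lib/ by default, so the first setting just spells out that default, and the resolver URL, organization, and artifact coordinates in the commented-out lines are hypothetical placeholders, not the ones used in the example repositories.

```scala
// build.sbt of the importing project (sketch)

// Unmanaged dependency: sbt adds every jar found in this directory to the
// classpath. "lib" is already the default; it is set here only to be explicit.
unmanagedBase := baseDirectory.value / "lib"

// Managed alternative (hypothetical resolver and coordinates):
// resolvers += "private-repo" at "https://repo.example.com/maven"
// libraryDependencies +=
//   "com.example" % "sparksql-scalapb-test-oroundtree-assembly" % "1.0.0"
```

Either way the importing project compiles against the same assembled jar, which fits the observation that both routes give the same result.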

thesamet · 2022-08-23T17:05:22Z

Thanks, I quickly read through. For step 3, can you provide that "another project" as well and make the edits in your message above, just so the issue is self-contained?

oroundtree · 2022-08-23T18:56:59Z

I've updated the steps with the small example project and more exact steps on how to reproduce the error. Hope it helps!

thesamet · 2022-08-26T22:35:01Z

Thanks for providing the detailed example. I was able to follow the instructions and see the issue. The example leads us into something that's a little tricky to reason about: the assembled jar brings a shaded version of shapeless, and the parent project brings another, unshaded copy. I think it was unintended, but the shaded jar also brings scalatest. The practice I want to encourage is to perform the assembly and shading as the final packaging step, just before the jar is shipped to a Spark cluster.

Is it possible to reproduce this with the assembled jar submitted directly to Spark? (I haven't tried.)
Is there a reason why you need this specific setup to work (by that I mean having an assembled jar used as a dependency)?

oroundtree · 2022-08-30T13:54:59Z

I was able to confirm that including the serialization/deserialization in the demo and then submitting the proto jar directly to a local cluster using spark-submit works.
Basically, I've got a complex project with lots of proto definitions, including gRPC services. These are kept in a repo which is automatically assembled and pushed to an artifact repository when changes are made. This ensures that the projects importing and using these proto definitions are all working from the same definitions.

If I didn't do this, every project that uses the proto definitions would need to have its individual .proto files edited whenever a message definition changes.

thesamet · 2022-08-30T15:52:10Z

> If I didn't do this, every project that uses the proto definitions would need to have their individual .proto files edited when a change is made to a message definition

Trying to understand the above. The suggested practice is to have all the intermediate dependencies (which can contain protos) remain unshaded, and to perform the assembly/shading only for the final artifacts you deploy. You write that this would lead to editing protos that import other protos whenever those change. I'm not following this part. Can you explain in more detail? What edits would be necessary?

I would suggest looking at how you can adapt your build to support the suggested practice of shading at the last step. sbt-assembly also calls out that introducing fat jars as dependencies is not a great idea.

Having said that, I did look deeper and it looks like the first failure that happens in the encoder derivation involves invoking a macro in the shaded copy of shapeless. I've filed a bug with sbt-assembly along with a reproducible example.
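
The "shade at the last step" layout could look roughly like the multi-project build below. This is a sketch under stated assumptions, not the build of any project in this thread. The project names, organization, and Scala version are hypothetical, dependency declarations are left out, and it assumes sbt-protoc (with the ScalaPB compiler plugin) and sbt-assembly are configured in project/plugins.sbt.

```scala
// build.sbt (sketch; names, organization and version are hypothetical)
ThisBuild / organization := "com.example"
ThisBuild / scalaVersion := "2.12.15"

// Intermediate library: holds the .proto files and the generated code.
// It is published as a regular, unshaded jar (e.g. via `sbt publish`), so
// projects compiling against it see shapeless under its real package name.
lazy val protos = (project in file("protos"))
  .settings(
    name := "my-protos",
    // Generate Scala code from the .proto files with ScalaPB.
    Compile / PB.targets := Seq(
      scalapb.gen() -> (Compile / sourceManaged).value / "scalapb"
    )
  )

// Final artifact: the only place where assembly and shading happen.
lazy val sparkJob = (project in file("spark-job"))
  .dependsOn(protos)
  .settings(
    name := "my-spark-job",
    assembly / assemblyShadeRules := Seq(
      ShadeRule.rename("shapeless.**" -> "shadeshapeless.@1").inAll
    )
  )
```

With a layout like this, projects in other repositories would depend on the published my-protos artifact through libraryDependencies, and only the job that is handed to spark-submit would run sbt assembly. All consumers would still share one set of proto definitions, because they come from the same published library.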

thesamet · 2023-02-18T04:02:51Z

Closing due to inactivity.

thesamet mentioned this issue Aug 30, 2022

Unable to invoke macro after shading sbt/sbt-assembly#477

Open

thesamet closed this as completed Feb 18, 2023
