Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some issues migrating from Avro4s #442

Open
erikvanoosten opened this issue Apr 25, 2022 · 16 comments
Open

Some issues migrating from Avro4s #442

erikvanoosten opened this issue Apr 25, 2022 · 16 comments

Comments

@erikvanoosten
Copy link
Contributor

erikvanoosten commented Apr 25, 2022

Because Avro4s is too slow in writing binary (for our use case), we want to migrate to Vulcan. Compared to Avro4s the performance is very good! We see 10 to 12 times the throughput on a single thread.

Unfortunately, we do have some problems when migrating from Avro4s:

  1. optional union types get nested --> solved in Support Option of sum types #450
  2. missing the field mapper
  3. AvroName annotation is ignored --> split to vulcan-generic: Support @AvroName annotation for field names #443
  4. field defaults are ignored
  5. generic parameters are not included in a record name --> proposal in Include generic parameters in record name #444

Optional union types get nested

  def main(args: Array[String]): Unit = {
    import vulcan.Codec
    import vulcan.generic._

    @AvroNamespace("ns")
    sealed abstract class Location extends Product with Serializable
    @AvroNamespace("ns")
    case class PostalAddress(city: String) extends Location
    @AvroNamespace("ns")
    case class GeographicLocation(lat: Double, long: Double) extends Location
    @AvroNamespace("ns")
    case class Thing(location: Option[Location])
    implicit val locationCodec: Codec[Location] = Codec.derive
    implicit val ThingCodec: Codec[Thing] = Codec.derive
    println(ThingCodec.schema.left.get.message)
  }

prints:

org.apache.avro.AvroRuntimeException: Nested union: [
    "null",
    [
        {"type":"record","name":"GeographicLocation","namespace":"ns","fields":[{"name":"lat","type":"double"},{"name":"long","type":"double"}]},
        {"type":"record","name":"PostalAddress","namespace":"ns","fields":[{"name":"city","type":"string"}]}
    ]
]

(with newlines added for clarity)

I expected Vulcan to silently add the null type to the union, not create a new union with the 'Location' union nested in it.

Missing a field mapper

Unfortunately we need to use case classes that have weird field names. They contain weird characters like @ and :. With Avro4s we defined a field mapper that would map these to valid names.

As a work around we could use the @AvroName annotation on the case class fields. Unfortunately we are still not allowed to use the @ character in the field name even though it is no longer relevant for the schema. This is probably related to the next item:

AvroName annotation is ignored

  def main(args: Array[String]): Unit = {
    import vulcan.Codec
    import vulcan.generic._

    @AvroNamespace("ns")
    case class Item(@AvroName("id2") id1: String)
    println(Codec.derive[Item].schema.right.get.toString(true))
  }

prints:

{
  "type" : "record",
  "name" : "Item",
  "namespace" : "ns",
  "fields" : [ {
    "name" : "id1",       // <----- "id1" instead of expected "id2"
    "type" : "string"
  } ]
}

Field defaults are ignored

  def main(args: Array[String]): Unit = {
    import vulcan.Codec
    import vulcan.generic._

    @AvroNamespace("ns")
    case class Item(id1: String = "foo")
    println(Codec.derive[Item].schema.right.get.toString(true))
  }

expected:

{
  "type" : "record",
  "name" : "Item",
  "namespace" : "ns",
  "fields" : [ {
    "name" : "id1",
    "type" : "string",
    "default": "foo"       // <-- expected but not present
  } ]
}

Generic parameters are not included in a record name

  def main(args: Array[String]): Unit = {
    import vulcan.Codec
    import vulcan.generic._

    @AvroNamespace("ns")
    case class SomeData(foo: String)
    @AvroNamespace("ns")
    case class Event[A](data: A)
    implicit val someDataCodec: Codec[SomeData] = Codec.derive
    println(Codec.derive[Event[SomeData]].schema.right.get.toString(true))
  }

Prints: Event, Avro4s gives Event__SomeData.

The Avro4s approach has the advantage of giving each type instance of Event a different name. This is relevant if the consumer of the schema uses code generation.

Note that we can not solve this with an @AvroName annotation as then the name would be same for every type instance of Event. (If that is what @AvroName is supposed to do, currently it does not seem to change the name when set on a case class.)

All code in this issue tested with Vulcan 1.3.8, Scala 2.12.15.

@erikvanoosten erikvanoosten changed the title Two issues migrating when from Avro4s Some issues migrating when from Avro4s Apr 25, 2022
@erikvanoosten erikvanoosten changed the title Some issues migrating when from Avro4s Some issues migrating from Avro4s Apr 25, 2022
@erikvanoosten
Copy link
Contributor Author

If you like, with pointers into the code, I can attempt to fix some of these things.

@bplommer
Copy link
Member

That would be great for the AvroName feature (I've opened a ticket at #443). The relevant bit of code (for Scala 2) is at

name = param.label,
access = param.dereference,
doc = param.annotations.collectFirst {
case AvroDoc(doc) => doc
},
- you should be able to follow the pattern from AvroDoc for AvroName. It also needs the same to be done in the scala 3 version and a test to be added.

I have something in progress for flattening nested union codecs, I'll dig it out.

@bplommer
Copy link
Member

Re field mappers, the general philosophy in Vulcan is to encourage writing codecs manually and keep generic derivation as lightweight as possible, but I'm open to being persuaded 🙂

@erikvanoosten
Copy link
Contributor Author

I have added a 4th issue.

@erikvanoosten
Copy link
Contributor Author

erikvanoosten commented Apr 26, 2022

I have added a 5th issue. Also, I added a rationale of why we want to move to Vulcan (performance!).

Re. fieldmapper and keeping generic derivation lightweight, we have a workaround so I won't press the issue :)

Actually, we now have workarounds for all 5 issues. Some are not so pretty, in particular we would like to see issues 1 and 4 be resolved. I can look at no. 4 next week.

@bplommer
Copy link
Member

I'd like to keep the existing behaviour for defaults as it is, but we could add an @AvroDefault annotation to opt in to using the default field value from the case class. @AvroNullDefault does something similar in a more limited way.

@erikvanoosten
Copy link
Contributor Author

WDYT of adding variants of the derive method? (Or add a parameter to it?) I have not looked at the code much, I have no idea how hard that would be.

I have to follow a schema definitions that demands dozens of records. It is maintained by multiple people. If we need to put an annotation on every case class, there is a good chance one will be forgotten. In addition, I'd prefer to keep the Vulcan dependency out of the model code (but that is not a hard requirement).

@bplommer
Copy link
Member

bplommer commented May 5, 2022

WDYT of adding variants of the derive method? (Or add a parameter to it?) I have not looked at the code much, I have no idea how hard that would be.

I'd be down for that. Should be feasible. I'll try to look at your PR and write up some more thoughts in the next few days.

@bplommer
Copy link
Member

bplommer commented May 5, 2022

Compared to Avro4s the performance is very good! We see 10 to 12 times the throughput on a single thread. Even better, Avro4s seems to be doing some synchronization because we can't get higher throughput with multiple threads.

Interesting! The main issues with avro4s that motivated creating Vulcan were to do with slow compile time and undesired results from always-on autoderivation but I don't remember runtime performance being a problem. Have you opened an issue with avro4s?

@erikvanoosten
Copy link
Contributor Author

erikvanoosten commented May 5, 2022

Compared to Avro4s the performance is very good! We see 10 to 12 times the throughput on a single thread. Even better, Avro4s seems to be doing some synchronization because we can't get higher throughput with multiple threads.

Interesting! The main issues with avro4s that motivated creating Vulcan were to do with slow compile time and undesired results from always-on autoderivation but I don't remember runtime performance being a problem. Have you opened an issue with avro4s?

Actually, we have since found out that we were accidentally running our code in a mutex so no multi-threaded speedup was possible. I have to update this part of the story.

Update: removed this part from the description.

@erikvanoosten
Copy link
Contributor Author

Re. no 4, from https://github.com/softwaremill/magnolia#limitations (on the Scala 3 version):

Magnolia is not currently able to access default values for case class parameters.

Fortunately, the next version of Magnolia will support it :)

@erikvanoosten
Copy link
Contributor Author

Proposal to solve no. 5 in PR #447

erikvanoosten added a commit to erikvanoosten/vulcan that referenced this issue May 11, 2022
The type `Option[A]` for any sum type `A` should be supported. Right now this fails with `org.apache.avro.AvroRuntimeException: Nested union` (see fd4s#442 problem 1).

This is because `Codec[Option[_]]` is build from a `UnionCodec`. We solve the problem by creating a special `Codec` for `Option`s.
@erikvanoosten
Copy link
Contributor Author

We now run in production with a fork that contains #444 (without config) and #450. It works pretty well.

I did not get a chance to work on adding defaults yet. Although I found out that it is not that urgent. This is because we add null defaults to the schema before we use it (see sksamuel/avro4s#714 (comment)). This makes it compatible with the schema generated by Avro4s.

@yzia2000
Copy link

Hi, I would like to know more about performance difference between vulcan and avro4s. I currently use avro4s in a kafka streams application for reflection based schema generation. Is performance one of the primary considerations for the creation of this project or was it an afterthought in comparison to more of the developer experience based concerns?

@Zhen-hao
Copy link

hi @erikvanoosten, did you find a solution for no. 4?

@erikvanoosten
Copy link
Contributor Author

erikvanoosten commented Aug 31, 2023

hi @erikvanoosten, did you find a solution for no. 4?

@Zhen-hao Since I have a work around (see #442 (comment)), I no longer worked on a proper solution. However, now that Magnolia exposes default values, it could be baked directly into Vulcan.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

4 participants