JSON Schema Support #220

Closed
ottomata opened this issue Aug 31, 2015 · 52 comments

@ottomata

Avro is cool!

But lots of people use JSON. JSON Schema allows folks to use JSON with a strictly defined schema. Wouldn't it be nice if schema-registry (and Kafka REST Proxy) supported JSON Schema?

JSON Schema wouldn't be as good as Avro, as there is no schema evolution feature. However, it would be nice if one could produce JSON to Kafka REST Proxy and be sure that the data matched a registered schema. Thoughts? How hard would this be?

@ewencp
Contributor

ewencp commented Sep 1, 2015

JSON Schema is nice and in some ways more powerful than Avro -- the validations make it a more complex spec, but make the requirements for data much clearer.

I think the biggest gap for JSON Schema is that (as far as I know) it doesn't have the same compatibility specification and tools, so they'd probably have to be implemented from scratch. The rest of the registry should look largely the same between Avro and JSON Schema -- the schema format is JSON in both cases and otherwise the registry is mostly just doing storage/lookup. I would guess that creating the branches to handle JSON Schema should be mostly straightforward, and then the branches would reconverge when they hit the storage layer.

There would be some important design decisions, however, which might not be straightforward to resolve. For example, how do we differentiate between the types in the API (complex content types like in REST Proxy?). And do we need to support both Avro and JSON in one cluster or is the expectation that you go entirely with one or the other? That has implications on how data is stored, if there's a need for some additional namespacing, how you handle conflicts if you try to register different types to the same subject, etc.

In other words, there's probably a quick-and-dirty version that could be made to work pretty easily, but cleanly supporting all the features that the Avro version does would be a substantially larger time investment.

@ottomata
Author

ottomata commented Sep 1, 2015

do we need to support both Avro and JSON in one cluster or is the expectation that you go entirely with one or the other?

In my ideal world, yes. I would love to be able to just convince everyone that they should use Avro and not worry about it, but there is so much JSON (and JSON Schema) in our org already, that in order to use confluent systems, I think we are going to have to support both JSON Schema and Avro.

@ewencp
Contributor

ewencp commented Sep 1, 2015

Right, I totally get the need for JSON, it was more a question of whether you're adopting Avro at all such that a mixed mode is useful to you :)

@ottomata
Author

ottomata commented Sep 3, 2015

Hm, related question.

Instead of JSON Schema, would it be easier to make the REST Proxy produce the JSON encoding of the Avro data and still validate that against a schema?

That is, instead of validating and converting the POSTed Avro-JSON records into Avro-Binary and then producing them to Kafka, could REST Proxy validate the Avro-JSON records and then produce them as is to Kafka, i.e. as JSON text data?

This would be the best of both worlds, and allow users to choose to use validated JSON directly, while still enforcing that the JSON conform to a schema.
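
For illustration, here is a minimal sketch of the idea using the plain Avro Java API (not actual REST Proxy code); the example schema is made up, but the point is that the record is still checked against its Avro schema while the bytes written out are plain JSON text:

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.Encoder;
import org.apache.avro.io.EncoderFactory;

public class AvroJsonEncodeExample {
    public static void main(String[] args) throws Exception {
        // Parse a trivial record schema (hypothetical example schema).
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Page\",\"fields\":["
                + "{\"name\":\"title\",\"type\":\"string\"}]}");

        GenericRecord record = new GenericData.Record(schema);
        record.put("title", "Main_Page");

        // Use Avro's JSON encoder instead of the binary encoder: the output is
        // plain JSON text, but writing still fails if the record doesn't match the schema.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
        Encoder encoder = EncoderFactory.get().jsonEncoder(schema, out);
        writer.write(record, encoder);
        encoder.flush();

        byte[] jsonBytes = out.toByteArray(); // e.g. {"title":"Main_Page"} -- readable without Avro
    }
}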

@AdamDz

AdamDz commented Nov 11, 2015

+1 for JSON Schema support. The Confluent Platform is Avro-centric, which limits its usage. We are building a platform whose API is specified in RAML and JSON Schema, and we don't want to maintain data-type definitions in two formats.

@yhilem

yhilem commented May 7, 2016

+1 for JSON Schema support.

@hakamairi

Any plans for this feature?

@manuelfu

+1 for JSON Schema support.

@tim-jones-001

+1 for JSON schema support

7 similar comments
@eparisca

+1 for JSON schema support

@GreenAsh

+1 for JSON schema support

@kolprogman

+1 for JSON schema support

@lukoyanov

+1 for JSON schema support

@dgvj-work

+1 for JSON schema support

@lafolle

lafolle commented Sep 25, 2017

+1 for JSON schema support

@fernandomoraes

+1 for JSON schema support

@juguyard

juguyard commented Nov 6, 2017

+1 for JSON schema support!

@jchrist31an

+1 for JSON schema support

@Brian-Burns-Bose

+1 especially since OpenAPI / Swagger utilizes JSON schema. JSON schema also has much better support for value constraints.

@solsson

solsson commented Jan 18, 2018

In case this feature gets near a roadmap, https://github.com/joelittlejohn/jsonschema2pojo is pretty great. I used it for an ad-hoc topic-schema mapping outlined in Yolean/kubernetes-kafka#101 (comment).

@codeislyric

+1 for JSON Schema support.

@ottomata
Author

ottomata commented Jan 19, 2018

I've been thinking about this more, and for my own use case, as I said earlier, all I really want is to be able to use schema validated JSON strings in Kafka. Avro binary in Kafka adds a lot of complications, including a dependency on Confluent Schema Registry (or something) to consume data. However, I really want to be able to use Kafka Connect and other fancy tools, which are pretty difficult to use with JSON.

Perhaps augmenting Schema Registry + REST Proxy (maybe just REST Proxy?) to avoid converting incoming Avro-JSON to Avro-Binary before producing to Kafka would be much easier, and solve many of the use cases that JSON Schema support would. It could be easier to make Kafka Connect know how to convert from Avro-JSON to its internal data model than relying on JSONSchemas. I wonder how hard this would be...

@solsson

solsson commented Jan 21, 2018

I'd be happy to see this feature (implemented, yes, but also... if I take the liberty of thinking out loud here...) discussed in a wider context of tooling for schemas and topics, geared towards microservices. @ept's https://www.confluent.io/blog/put-several-event-types-kafka-topic/ is a great starting point.

Asynchronous communication via Kafka is, as Confluent convincingly argues, a compelling alternative to REST for contracts between services. However, while REST has service discovery, Istio Mesh and Swagger, we Kafka-hopefuls have... well... topic names and opaque Serdes.

What is it we need to manage about topics?

  • Creation, with auto.create.topics.enable or without.
    • Naming
    • Partitions
    • Overrides for things like min.insync.replicas
  • The serialization format (I assume that multi-entity topics too will use one format)
  • Compression
  • One or more schemas, different mapping strategies
  • Generations of these schemas
  • Canaries, rolling upgrades, ...

What are the options for how a producer - be it protobuf or JSON or Avro - can specify the schema of a record?

  • Topic metadata? No such thing + only compatible with a single schema.
  • In the key? Nope, breaks ordering.
  • In the value? Works for Avro, but how can it be adapted to other formats?
    • Can JSON keep the compelling property that it's parseable as-is?
  • Using specifications exported by the dependency, i.e. the publishing service?
    • Could include heuristics for multi-schema topics.

Many services use JSON natively, and not all of them can easily adopt an Avro library, let alone serdes that speak Schema Registry. For example, kafkacat is a great tool for testing and troubleshooting, but the interest in Avro there looks about as low as the interest in JSON here.

With REST + Swagger you might use build-time code generation for the contracts your dependency exports. I haven't seen any such approaches for Kafka. Would it require service discovery for topics?

Has anyone seen this kind of discussion anywhere? I've been on the lookout, but no luck so far.

@ottomata
Author

Had more time to think about this today. I just sent an email to the Kafka users mailing list to start a discussion, but since there are so many folks who have +1ed this, I'll post here too.

The more I think about this, I realize that I don't care that much about JSON support in Confluent products. What I really want is the ability to use Kafka Connect with JSON data. Kafka Connect does sort of support this, but only if your JSON messages conform to its very specific envelope schema format. (If Confluent products made this possible, I'd use them. :) )

What if Kafka Connect provided a JSONSchemaConverter (not Connect’s JsonConverter) that knew how to convert between a provided JSONSchema and Kafka Connect’s internal Schemas?
This should allow Connectors to be configured with JSONSchemas and read JSON messages directly from a Kafka topic. Once read and converted to a ConnectRecord, I believe the messages could be used with any Connector out there.
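
To make that concrete, a rough, hypothetical sketch of such a Converter follows; the "json.schema.location" config key and the hard-coded field mapping are assumptions for illustration, not a real implementation:

import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.connect.data.Field;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaAndValue;
import org.apache.kafka.connect.data.SchemaBuilder;
import org.apache.kafka.connect.data.Struct;
import org.apache.kafka.connect.errors.DataException;
import org.apache.kafka.connect.storage.Converter;

public class JsonSchemaConverter implements Converter {
    private final ObjectMapper mapper = new ObjectMapper();
    private Schema connectSchema;

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {
        // A real converter would load a JSONSchema (e.g. from a hypothetical
        // "json.schema.location" config) and walk it with SchemaBuilder;
        // hard-coded here for a schema with firstName (string) and age (integer).
        connectSchema = SchemaBuilder.struct()
                .field("firstName", Schema.OPTIONAL_STRING_SCHEMA)
                .field("age", Schema.OPTIONAL_INT64_SCHEMA)
                .build();
    }

    @Override
    public byte[] fromConnectData(String topic, Schema schema, Object value) {
        if (value == null) {
            return null;
        }
        try {
            // Emit plain JSON -- no Connect envelope around the payload.
            Struct struct = (Struct) value;
            Map<String, Object> fields = new LinkedHashMap<>();
            for (Field field : schema.fields()) {
                fields.put(field.name(), struct.get(field));
            }
            return mapper.writeValueAsBytes(fields);
        } catch (IOException e) {
            throw new DataException("Failed to serialize JSON", e);
        }
    }

    @Override
    public SchemaAndValue toConnectData(String topic, byte[] value) {
        try {
            JsonNode node = mapper.readTree(value);
            Struct struct = new Struct(connectSchema)
                    .put("firstName", node.path("firstName").asText())
                    .put("age", node.path("age").asLong());
            return new SchemaAndValue(connectSchema, struct);
        } catch (IOException e) {
            throw new DataException("Failed to parse JSON", e);
        }
    }
}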

So, my question here is: is this what most of the +1ers here want too? Or do you specifically want JSONSchema support in Confluent Schema Registry?

@Tapppi

Tapppi commented Feb 5, 2018

What if Kafka Connect provided a JSONSchemaConverter (not Connect’s JsonConverter) that knew how to convert between a provided JSONSchema and Kafka Connect’s internal Schemas?

This would solve a lot of problems, and be more sensible than the current setup for Connect. Using Connect to output arbitrary processed data is a pain at the moment.

Or do you specifically want JSONSchema support in Confluent Schema Registry?

This would solve even more problems, mainly because I see this as a superset of the above. A JSONSchemaConverter would be a different (more primitive) way of getting this functionality in only Kafka Connect. If it would work as a stepping stone to remote JSONSchema support from Schema Registry, that would be even more wonderful.

Specifically, I am working on implementing a subset of Kafka Streams functionality for Node.js in TypeScript, and Avro is a strange beast to get in there. Writing up a Node.js libserdes wrapper (schema registry & avro support) and integrating that is on the roadmap, but that still comes with the hurdle of most Node.js shops using primarily JSON.

I think supporting JSON schemas would go a long way towards helping adoption of Kafka for orgs like ours that deal mainly in Node.js and JSON. There's definitely interest and lots of possibilities, but the ecosystem is currently quite far out of reach due to stack differences.

@joewood

joewood commented Feb 9, 2018

@ottomata so your requirement is more about schema integration for ingesting JSON sources? One of the problems I have with the Connect Schema object is its simplicity. Schema parity with JSON Schema will be difficult.

@Tapppi - your use case is more similar to ours (and sounds very interesting; is it similar to https://github.com/nodefluent/kafka-streams?). The ability to generate TS types based on a schema is definitely something I'm looking at.

One approach we discussed with Confluent while they were last onsite was a best-effort conversion in the REST interface, maybe using the MIME type to render the appropriate meta-model format. For example, a schema GET with Accept: application/schema+json would make a best-effort conversion of the stored Avro schema to JSON Schema.

In addition, it would be much better to make Avro and JSON usage more consistent. The KafkaAvroConverter uses the schema registry to source the schema, but the JSONConverter sends the schema in every message - which is really impractical for anything other than trivially small message schemas. It would be great to have a way of propagating the schema without the overhead and message bloat. So, not necessarily full JSON schema, but enough to support the connect Schema object and validate a JSON payload. I guess this could be done by copying the bulk of KafkaAvroSerializer and providing a JSON equivalent, complete with schema cache.
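
As a hedged sketch of that last idea, a JSON serializer could reuse the Confluent-style wire format (magic byte plus 4-byte schema id) in front of a plain JSON payload instead of embedding the schema in every message; the config key below is made up, and registering the schema with the registry and caching the id is assumed to happen elsewhere:

import java.nio.ByteBuffer;
import java.util.Map;
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.common.errors.SerializationException;
import org.apache.kafka.common.serialization.Serializer;

// Sketch only: prefix each JSON message with a magic byte and a 4-byte schema id
// so consumers can look the schema up by id instead of carrying it in every message.
public class JsonWithSchemaIdSerializer implements Serializer<JsonNode> {
    private static final byte MAGIC_BYTE = 0x0;
    private final ObjectMapper mapper = new ObjectMapper();
    private int schemaId;

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {
        // Hypothetical config key; a real serializer would obtain the id from the registry.
        Object id = configs.get("json.schema.id");
        schemaId = (id instanceof Number) ? ((Number) id).intValue() : -1;
    }

    @Override
    public byte[] serialize(String topic, JsonNode data) {
        if (data == null) {
            return null;
        }
        try {
            byte[] payload = mapper.writeValueAsBytes(data);
            ByteBuffer buf = ByteBuffer.allocate(1 + Integer.BYTES + payload.length);
            buf.put(MAGIC_BYTE).putInt(schemaId).put(payload);
            return buf.array();
        } catch (JsonProcessingException e) {
            throw new SerializationException("Failed to serialize JSON value", e);
        }
    }

    @Override
    public void close() {
    }
}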

@gaui

gaui commented Mar 6, 2018

We rely heavily on JSON at our organization. Today we only use Avro for Kafka, and OpenAPI / Swagger (which complies with the JSON Schema spec, http://json-schema.org). https://swagger.io/docs/specification/data-models/keywords/

By adding JSON Schema support to Schema Registry, we would be able to use a single schema standard in all our services - to make sure all our service contracts are in sync.

@suresh-krishnamurthy

+1 for JSON Schema support.

@elric-k

elric-k commented Mar 19, 2018

+1 for JSON schema support

@rhoeting

+1 for JSON Schema Support

@pablo-ct

+1 for adding JSON Schema support to Schema Registry

@bozidarpasagic

+1 for JSON Schema Support

2 similar comments
@fhanin

fhanin commented May 23, 2018

+1 for JSON Schema Support

@ccamel

ccamel commented May 29, 2018

+1 for JSON Schema Support

@solsson

solsson commented Jun 1, 2018

For everyone who's +1: how do you envision messages identifying the schema? As some kind of custom prefix to the value, or a $schema key, or based on topic names, or something else? Any requirements on schema evolution?

The main benefit of JSON-encoded values, as I see it, is that they do not require a schema to be decoded. I'd be hesitant to introduce a runtime dependency (like schema registry) for consuming JSON topics.

This is not an argument against JSON-schema support, I'm just curious what properties of JSON encoding+schema people prioritize.

@ottomata
Author

ottomata commented Jun 1, 2018

What I mostly want is an easy way to map from a JSON message to a JSONSchema. That, plus Kafka Connect integration that maps from a JSONSchema to a Connect Schema, would allow easy integration of JSON messages into any system that has Kafka Connector sinks.

For my use case, we have a schema_uri field embedded in all of our messages. In another, older system, schemas can be looked up via this URI. Our schema_uris encompass a schema name and a version. E.g. mediawiki/revision/create/2. Leave off the version, and you get the latest.

I'd hope that the mechanism for discovering the schema/version for a given message would be pluggable.

As for schema evolution, I think we could only support additions of new optional fields. Any other schema changes would not be compatible.
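
For illustration only, a minimal sketch of such a pluggable lookup, assuming a hypothetical base URL and the embedded schema_uri field described above (this is not Wikimedia's implementation):

import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

// Hypothetical resolver: looks up the schema named by the event's embedded
// schema_uri field against an assumed base URL and caches the result.
public class SchemaUriResolver {
    private final ObjectMapper mapper = new ObjectMapper();
    private final Map<String, JsonNode> cache = new ConcurrentHashMap<>();
    private final String baseUri = "https://schema.example.org/"; // assumption

    public JsonNode schemaFor(JsonNode event) {
        // e.g. "mediawiki/revision/create/2"; without the version you'd get the latest.
        String schemaUri = event.path("schema_uri").asText();
        return cache.computeIfAbsent(schemaUri, uri -> {
            try {
                return mapper.readTree(new URL(baseUri + uri));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    }
}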

@joewood

joewood commented Jun 1, 2018

The convention of using the topic name with -key or -value suffix is the most common. My use cases would be:

  • Verifying the schema used for serializers match the runtime schema on the topic
  • Using additional metadata in the schema to easily and automatically convert between XML, JSON, etc. (e.g. whether a field property in JSON should be transformed into an XML attribute or an XML child element)
  • Automated validation
  • Schema data discovery (more effective data presentation using schema information for browsing topic data)

To minimize the impact of this requirement, a simple extensibility point could be used in the Schema Registry to support other MIME types through an external JAR. This may be the simplest way to solve this and potential future requests.

@ottomata
Author

ottomata commented Jun 26, 2018

FYI (especially for @Tapppi ), I've made an attempt at a JSONSchemaConverter for Kafka Connect here: https://github.com/ottomata/kafka-connect-jsonschema. It is very proof-of-concept at the moment, but works! There are still lots of pieces it needs to be even close to complete.

@philippbauer

+1 for adding JSON Schema support to Schema Registry

1 similar comment
@ericfranckx

+1 for adding JSON Schema support to Schema Registry

@sumannewton

+1 for JSON schema support

@florintene

+1 for adding JSON Schema support to Schema Registry

@migalho

migalho commented Feb 5, 2019

+1 for JSON schema support

@gaui

gaui commented Feb 6, 2019

Is this scheduled?

@rayokota
Member

rayokota commented Feb 6, 2019

@gaui, this is being worked on, delivery date TBD.

@ottomata
Author

ottomata commented Feb 7, 2019

FWIW, Wikimedia is going another route. We considered building JSONSchema support into the Confluent Schema Registry, and decided that it would be too difficult for us to do. Avro doesn't really do 'validation'; it just fails if serialization from JSON -> Avro fails. The modifications to Schema Registry seemed large enough that it would be too difficult to fork, modify, and upstream changes. In hindsight this was a good decision, since Confluent has moved away from a fully open source license, and Wikimedia is very picky about these things :)

Since JSON in general does not require schemas to read data, our use case is mostly around a service that will flexibly validate JSON events and then produce them to Kafka (or elsewhere). This use case is closer to what Kafka REST Proxy provides than to Schema Registry. We are implementing a library and service in NodeJS that will do just this.

https://github.com/wikimedia/eventgate

EventGate is still WIP, but we hope to deploy our first production use case of it in the next month or so.

EventGate is schema-registry agnostic; as long as your event's schema can be looked up from a URI (a local file:// URI is fine!), it can find the schema and validate your event. It is only opinionated in that it expects your event to contain the schema URL somewhere.

We plan to host schemas simply using a git repo + an http file server. Schema evolution and compatibility will be enforced by a CI process. This will allow us to decentralize schemas, and allow developers to use and develop schemas in the same way that they do with code, rather than having to POST schemas to a centralized store before deploying code that uses them.

(If you are crazy and want to learn more, all of our project plans are public here https://phabricator.wikimedia.org/T185233)

@OneCricketeer
Contributor

As far as having a Java SerDe goes, I was able to find one that supported Jackson and the latest drafts of JSON-schema. https://github.com/worldturner/medeia-validator

For the most part, I was able to wrap the existing json-serializer module of this repo with methods of that library.

Still a proof of concept, but in the first draft, schema definitions are only available on the classpath.

Base class

public abstract class AbstractKafkaJsonSchemaSerde {

    protected MedeiaJacksonApi api;

    public AbstractKafkaJsonSchemaSerde() {
        this.api = new MedeiaJacksonApi();
    }

    protected SchemaValidator getSchemaValidator(URL schemaURL) {
        // TODO: Hook in schema-registry here
        SchemaSource source = new UrlSchemaSource(schemaURL);
        return api.loadSchema(source);
    }

    public abstract SchemaValidator getSchemaValidator();
}

Serializer

    @Override
    public SchemaValidator getSchemaValidator() {
        // the configure method ensures the resource is non-null
        return getSchemaValidator(getClass().getResource(schemaResource));
    }

    @Override
    public byte[] serialize(String topic, T data) {
        if (data == null) {
            return null;
        }

        JsonGenerator validatedGenerator = api.decorateJsonGenerator(schemaValidator, unvalidatedGenerator);
        try {
            objectMapper.writeValue(validatedGenerator, data);
            byte[] bytes = baos.toByteArray();
            baos.reset(); // new calls to serialize would otherwise append onto the stream
            return bytes;
        } catch (IOException e) {
            throw new SerializationException(e);
        }
    }

Deserializer

    @Override
    public T deserialize(String ignoredTopic, byte[] bytes) {
        if (bytes == null || bytes.length == 0) {
            return null;
        }

        try {
            JsonParser unvalidatedParser = objectMapper.getFactory().createParser(bytes);
            JsonParser validatedParser = api.decorateJsonParser(schemaValidator, unvalidatedParser);
            return objectMapper.readValue(validatedParser, type);
        } catch (JsonParseException | JsonMappingException e) {
            throw new SerializationException("Unable to parse JSON into type \"" + type + "\".", e);
        } catch (IOException e) {
            throw new SerializationException(e);
        } catch (ValidationFailedException e) {
            throw new SerializationException("Failed to validate input data against schema resource "
                    + schemaResource, e);
        }
    }

Given this basic schema

{
  "$id": "https://example.com/person.schema.json",
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Person",
  "type": "object",
  "properties": {
    "firstName": {
      "type": "string",
      "description": "The person's first name."
    },
    "lastName": {
      "type": "string",
      "description": "The person's last name."
    },
    "age": {
      "description": "Age in years which must be equal to or greater than zero.",
      "type": "integer",
      "minimum": 0
    }
  }
}

And a factory method for a Jackson-serializable Person class

    public static Person createPerson(int age) {
        Person person = new Person();
        person.setFirstName("fname");
        person.setLastName("lname");
        person.setAge(age);
        return person;
    }
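
The Person class itself isn't shown; a minimal Jackson-serializable shape matching the schema and the setters above would be something like this assumed reconstruction:

// Assumed shape of the Person POJO used above, matching the schema fields.
public class Person {
    private String firstName;
    private String lastName;
    private int age;

    public String getFirstName() { return firstName; }
    public void setFirstName(String firstName) { this.firstName = firstName; }

    public String getLastName() { return lastName; }
    public void setLastName(String lastName) { this.lastName = lastName; }

    public int getAge() { return age; }
    public void setAge(int age) { this.age = age; }
}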

We run a serializer

 KafkaJsonSchemaSerializer<Person> s = new KafkaJsonSchemaSerializer<>();
 configureSerializer(s, "/schemas/person/person-min-age.json");
 Person p = createPerson(-1); // cause a schema invalidation with a negative age
 s.serialize("topic", p);

This causes an exception, as expected:

org.apache.kafka.common.errors.SerializationException: com.fasterxml.jackson.databind.JsonMappingException: [Validation Failure
------------------
Rule:     properties
Property: age
Message:  Property validation failed
Location: at 
Details:
    Rule:     minimum
    Message:  Value -1 is smaller than minimum 0
    Location: at 
    -----
] (through reference chain: model.Person["age"])

@gaui

gaui commented Jun 7, 2019

Is there some rough ETA on this one?

@rayokota
Member

This is currently targeted for Q1 2020

@rayokota
Member

Fixed by #1289
