Enforce topic -> schema mapping. #221

Closed
ottomata opened this issue Aug 31, 2015 · 7 comments

@ottomata

It'd be handy if the REST proxy could enforce that only a certain series of schemas may be produced to a particular topic. This would require registering a topic as having a particular schema series, perhaps in the schema registry.

Thoughts? Is this a bad idea?

@ewencp (Contributor) commented Sep 1, 2015

@ottomata I'm not sure how what you're suggesting differs from what we do today: the REST proxy uses the Avro serializers, which require schemas to be registered before sending the data. Or are the compatibility checks the schema registry enforces not strong enough restrictions for you?

Would #122 do the trick? i.e. is the real issue that you just don't want auto registration at all, and that you'll handle registration via some other process such as in your build/deployment steps?

@ottomata (Author) commented Sep 1, 2015

Hmm, I think #122 is important, but that's not what I mean here. I'm talking about limiting the 'subjects' (correct term?) that can be produced to any given topic.

Let's say:

Subject A has schema ids: 10
Subject B has schema ids: 20

We want topic-A to only accept production of schemas in Subject A, and topic-B to only accept production of schemas in Subject B.

# should be allowed
POST /topics/topic-A/partitions/1 { "value_schema_id": 10, "records": [{"value_a": 1}]}

# should be allowed
POST /topics/topic-B/partitions/1 { "value_schema_id": 20, "records": [{"value_b": 1}]}

# should NOT be allowed
POST /topics/topic-A/partitions/1 { "value_schema_id": 20, "records": [{"value_b": 1}]}

# should NOT be allowed
POST /topics/topic-B/partitions/1 { "value_schema_id": 10, "records": [{"value_a": 1}]}

This would let you be sure that the schema(-subjects) that you expect to be in a topic are the only subjects there when you consume. You can be sure that some consumer that only consumes from topic-A will only ever get messages in schema Subject A.
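The proposed check can be sketched as follows. This is a hypothetical illustration of the enforcement logic, not existing REST proxy code: the mapping name and helper function are made up, and in practice the topic-to-subject mapping would live in the schema registry rather than a dict.

```python
# Hypothetical sketch: a registry-maintained mapping from topic to the
# schema ids that belong to that topic's allowed subject.
TOPIC_SUBJECT_SCHEMA_IDS = {
    "topic-A": {10},  # Subject A
    "topic-B": {20},  # Subject B
}

def produce_allowed(topic: str, value_schema_id: int) -> bool:
    """Return True if the schema id belongs to the subject mapped to the topic."""
    allowed = TOPIC_SUBJECT_SCHEMA_IDS.get(topic)
    return allowed is not None and value_schema_id in allowed

# Mirrors the four example requests above:
assert produce_allowed("topic-A", 10)      # should be allowed
assert produce_allowed("topic-B", 20)      # should be allowed
assert not produce_allowed("topic-A", 20)  # should NOT be allowed
assert not produce_allowed("topic-B", 10)  # should NOT be allowed
```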

@ewencp (Contributor) commented Sep 1, 2015

I see. Confluent's serializers just bake this into the scheme used for generating subject names: when producing to a topic, the key and value schemas map to the subjects {topic}-key and {topic}-value. So with Avro data, this should already be the case. This is the reason we use different terminology -- subject vs. Kafka's topic -- in the schema registry (and besides, you could easily use it to manage schemas elsewhere as well, e.g. in HDFS if you also produce data there).
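For reference, the subject-naming convention described here can be sketched as below. This is a minimal illustration of the naming scheme, not Confluent's actual serializer code, and the example topic name is made up.

```python
def subject_for(topic: str, is_key: bool = False) -> str:
    """Default Confluent naming: the subject is derived from the topic name,
    so each topic's schemas are versioned under that topic's own subject."""
    return f"{topic}-{'key' if is_key else 'value'}"

assert subject_for("pageviews") == "pageviews-value"
assert subject_for("pageviews", is_key=True) == "pageviews-key"
```

Because the subject is a function of the topic, compatibility checks on the subject effectively restrict what can be produced to that topic.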

Given that you're using JSON, I'm guessing you're looking for the REST proxy to provide this restriction when the serializer does not? Or did you want this to happen via the schema registry somehow? Regardless, I think this would require #220 to start with. Then implementing the equivalent of the Avro serializer's functionality shouldn't be a problem since we'd just integrate it with the JSON serializers (which have been added in #193).

@ottomata (Author) commented Sep 1, 2015

> Given that you're using JSON, I'm guessing you're looking for the REST proxy to provide this restriction when the serializer does not?

Yes, but I would like this restriction to be in place for any schema-validated data: JSON Schema, Avro, or whatever future schema format. I just want to ensure that consumers only get schemas that they expect. For Avro, this would be anything within a versioned set of schemas (in a subject?). For JSON Schema, I'm not sure. Since JSON Schema doesn't have built-in schema evolution, I'm not sure how it would be versioned, but perhaps that is something consumers would just have to deal with. You could semantically evolve JSON Schemas, but there would be no built-in compatibility support or validation. Anyway...I digress, and am now talking about #220 related stuff. :)

> Confluent's serializers just bake this into the scheme used for generating subject names.

Hm, not sure I follow here. Confluent's serializers...in the REST proxy? That is, if a (value) schema is auto-registered via a produce message, the subject in the schema registry will be {topic}-value, where {topic} is the topic provided in the REST produce request? Is that right? If so, I'm not sure this would help, as I would like to disable auto schema registration as per #122.

@mageshn (Member) commented Nov 9, 2018

@ottomata I believe you could still disable auto registration and then just pass the schema id in your produce requests. Would that help?
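In other words, with auto-registration disabled the produce request would reference an already-registered schema id rather than embed a schema. A hypothetical request body, built here in Python (the field names follow the REST proxy's Avro produce format; the id and record values are made up):

```python
import json

# Hypothetical produce payload: reference a pre-registered schema by id
# instead of embedding the schema, so auto-registration is never triggered.
payload = {
    "value_schema_id": 10,  # id registered ahead of time, e.g. by a deploy step
    "records": [{"value": {"value_a": 1}}],
}
body = json.dumps(payload)
assert json.loads(body)["value_schema_id"] == 10
```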

@ottomata (Author)

Hi, I'm not following this so much anymore. Wikimedia has decided to not use Confluent products here, as they don't support JSONSchema and building JSONSchema support in to the right places was deemed more difficult than just writing a new service. We're currently working on a nodejs based HTTP POST JSONSchema validation -> Kafka produce service. I'll close this issue, thanks!
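The validate-then-produce flow described here can be sketched roughly as below. This is a hypothetical toy, not Wikimedia's actual service (which is a separate nodejs project); a real implementation would use a full JSONSchema validator rather than this minimal required/type check.

```python
# Toy sketch of the flow: validate an incoming event against a JSONSchema-like
# schema, and only produce it to Kafka if validation passes.
def validate(event: dict, schema: dict) -> bool:
    """Minimal check of 'required' fields and integer-typed properties."""
    for field in schema.get("required", []):
        if field not in event:
            return False
    for field, spec in schema.get("properties", {}).items():
        if field in event and spec.get("type") == "integer":
            if not isinstance(event[field], int):
                return False
    return True

schema = {"required": ["value_a"], "properties": {"value_a": {"type": "integer"}}}
assert validate({"value_a": 1}, schema)        # would be produced to Kafka
assert not validate({"what": "ever"}, schema)  # rejected before produce
```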

@mac2000 commented Dec 30, 2019

Landed here after searching for a way to enforce Avro messages in the REST proxy.

e.g.

I have a TestTopic1 with a schema for it:

curl -s "http://$rest/topics/TestTopic1" | jq # returns topic info
curl -s "http://$schema/subjects/TestTopic1-value/versions/latest/schema" | jq # returns schema

but I can still post JSON messages to this topic:

curl -s -X POST "http://$rest/topics/TestTopic1" -H "Content-Type: application/vnd.kafka.json.v2+json" -H "Accept: application/vnd.kafka.v2+json" -d '{"records": [{"value": {"what": "ever"}}]}' | jq

which is not desired: now there is a message in the topic that is not schema-compatible at all, which might break consumers that are not aware of it.

At the moment it seems the fastest possible way is to configure an nginx reverse proxy that checks the Content-Type header for Avro and rejects other requests.
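That Content-Type gate can be sketched as a plain function (hypothetical illustration; in practice this would be an nginx rule or proxy middleware, and the exact set of accepted types depends on which REST proxy API versions you use):

```python
# Sketch of the Content-Type gate: only let Avro-typed produce requests
# through to the REST proxy, rejecting the schemaless JSON embedded format.
AVRO_CONTENT_TYPES = {
    "application/vnd.kafka.avro.v1+json",
    "application/vnd.kafka.avro.v2+json",
}

def accept_request(content_type: str) -> bool:
    """Return True only for the Avro embedded-format content types."""
    return content_type in AVRO_CONTENT_TYPES

assert accept_request("application/vnd.kafka.avro.v2+json")
assert not accept_request("application/vnd.kafka.json.v2+json")  # the bypass shown above
```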
