Unwanted type literal added to data in textual #167

Open
junejosheeraz opened this issue Jun 21, 2019 · 11 comments

@junejosheeraz

I have looked at #106, but I am not using any unions. My schema is as follows:

{
	"type": "record",
	"name": "Event",
	"namespace": "com.xyz.event",
	"fields": [
		{
			"name": "eventId",
			"type": "string",
			"doc": "Event ID"
		},
		{
			"name": "eventSize",
			"type": "int",
			"doc": "Number of events within the event"
		},
		{
			"name": "emittedTime",
			"type": "long",
			"doc": "Time when event was published"
		},
		{
			"name": "context",
			"type": [
				"null",
				{
					"type": "map",
					"values": "string"
				}
			],
			"doc": "context information",
			"default": null
		},
		{
			"name" : "companyTable",
			"type" : [ "null", "string" ],
			"doc" : "event table prefix",
			"default" : null
		},
		{
			"name": "entityName",
			"type": [
				"null",
				"string"
			],
			"doc": "Name of entity which triggered this event",
			"default": null
		},
		{
			"name": "entityId",
			"type": [
				"null",
				"string"
			],
			"doc": "Identifier of entity which triggered this event",
			"default": null
		}
	]
}

The code to decode is as follows:

codec, _ := goavro.NewCodec(schema)
native, _, _ := codec.NativeFromBinary(dataBytes)
textualbyte, err := codec.TextualFromNative(nil, native)
if err != nil {
	panic(err)
}
fmt.Println("Textual :", string(textualbyte))

The code above produces the following output:

{
	"context": {
		"map": {}
	},
	"eventSize": 2,
	"emittedTime": 1560440525396,
	"companyTable": null,
	"entityName": {
		"string": "FBNK.EB.CONTRACT.BALANCES"
	},
	"eventId": "96a485bd-7d95-443d-b604-ceb7c095be30",
	"entityId": {
		"string": "186079759760587"
	}
}

How can I get rid of the "string" key under entityId and entityName? Why is the type present in the textual output, and is there any way to remove it? eventSize and emittedTime, by contrast, have the correct values.

@karrick
Contributor

karrick commented Jun 21, 2019

Thanks for posting your question. Your schema does in fact use unions: a field is a union whenever its type is given as an array of choices.

Whenever a union is decoded, from either the binary or the text encoding, the Avro decoder needs to know unambiguously which type to use to decode the value. For binary, the Avro designers chose to encode the zero-based type index. For text, they chose to encode the type information as a JSON object whose property name specifies the type of the value and whose property value is the encoded value.

		{
			"name": "context",
			"type": [
				"null",
				{
					"type": "map",
					"values": "string"
				}
			],
			"doc": "context information",
			"default": null
		},

The snippet above shows a field whose type is a union: it can be either a null value or a map of string values.

When unions are binary encoded in Avro, the first value decoded is a long integer giving the zero-based index of the actual type within the union. Longs are zigzag encoded, so if context here is null, the first and only byte used for this value in binary is 0x00. If context were a map with two key-value pairs, it would be encoded as 0x02 (index 1, the second type, namely the map of strings), then 0x04 to indicate the number of key-value pairs, in this case 2. Then each key-value pair would be encoded, followed by a 0x00 byte that ends the map.

Avro text encoding does not use the zero-based index of the actual type. Instead, it encodes a JSON object with a single property, whose name is the type name of the value and whose value is the value being encoded. The one exception is the null branch: when context is null, it is written as the bare JSON null value, which is how companyTable appears in your output. When context is a map of string values with 2 entries, as in the binary example above, it is encoded as {"map": {"key1":"value1","key2":"value2"}}.

It looks like in your case above, the encoder was given an empty map, which is distinctly different from a null value. A null value is encoded as null, and an empty map is encoded for the above schema as {"map":{}}.
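
The text form of a union value can be sketched with encoding/json alone. This is an illustration rather than goavro's implementation: wrapUnion is a hypothetical helper, and it follows the Avro specification's JSON encoding, in which the null branch is written as a bare null (matching "companyTable": null in the output earlier in this thread) and every other branch is wrapped in a single-property object.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// wrapUnion (hypothetical helper) returns the Avro JSON form of a union
// value: a bare null for the null branch, otherwise a single-property
// object keyed by the branch's type name.
func wrapUnion(typeName string, value interface{}) ([]byte, error) {
	if value == nil {
		return []byte("null"), nil
	}
	return json.Marshal(map[string]interface{}{typeName: value})
}

func main() {
	cases := []struct {
		typeName string
		value    interface{}
	}{
		{"null", nil},
		{"string", "186079759760587"},
		{"map", map[string]string{}},
		{"map", map[string]string{"key1": "value1", "key2": "value2"}},
	}
	for _, c := range cases {
		out, err := wrapUnion(c.typeName, c.value)
		if err != nil {
			panic(err)
		}
		fmt.Println(string(out))
	}
	// Prints:
	// null
	// {"string":"186079759760587"}
	// {"map":{}}
	// {"map":{"key1":"value1","key2":"value2"}}
}
```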

@karrick karrick closed this as completed Jun 21, 2019
@junejosheeraz
Author

If I would like to get rid of the types, is there any way I can provide a map to the encoder? Any example would be great. Thanks.

@karrick
Contributor

karrick commented Jun 21, 2019

I'm not quite sure what you mean.

I understand you want to remove types from the JSON output. That would require encoding data with a different schema that does not have a union type.

Is your desire to transcode a bunch of data from one schema to a different schema? Namely, you have binary data that was encoded with a schema that has union data types, and you want to encode that data using a schema that does not have union data types? Is your source of binary data some Avro files somewhere?

Or is the only schema involved the one you create for this particular project, so that you are not working with data from outside this effort and just need to change your schema to not use union data types?

@junejosheeraz
Author

OK, let me clarify. I am consuming messages from a Kafka stream, and my producer uses the same schema (with unions), which I want to keep as it is. I cannot remove the union types either.

The problem is that once I have the data, I need to invoke my implementation, to which I want to forward a plain JSON object instead of an Avro object. This is to reduce complexity and the dependency on Avro. The sample implementation below shows what this problem forces on me:

import (
	...
	"strings"
	"github.com/linkedin/goavro"
	"github.com/jmoiron/jsonq"
)

func main() {
	// Decode binary Avro to native, then to textual (Avro JSON)
	codec, _ := goavro.NewCodec(schema)
	native, _, _ := codec.NativeFromBinary(dataBytes)
	textualbyte, err := codec.TextualFromNative(nil, native)
	if err != nil {
		panic(err)
	}
	textual := string(textualbyte)
	fmt.Println("Textual :", textual)

	// Now query this data to get fields out for usage
	data := map[string]interface{}{}
	dec := json.NewDecoder(strings.NewReader(textual))
	dec.Decode(&data)
	jq := jsonq.NewQuery(data)

	// Get entityName
	entityName, _ := jq.String("entityName", "string")	// This is the problem, my implementation has to take the hit!
	fmt.Println("EntityName	:", entityName)

	// Get entityId
	entityId, _ := jq.String("entityId", "string")		// This is the problem, my implementation has to take the hit!
	fmt.Println("Entity ID	:", entityId)
}

I am looking for a way for my implementation to simply do the following to retrieve values from the JSON:

// Get entityName
entityName, _ := jq.String("entityName")
fmt.Println("EntityName	:", entityName)
	
// Get entityId
entityId, _ := jq.String("entityId")
fmt.Println("Entity ID	:", entityId)

NOTE: I have implemented the same thing in Java using the Confluent libraries. There I deal with an implementation of "org.apache.avro.generic.GenericRecord" and call .toString() to get the record into an org.json.JSONObject. This lets my implementation query the plain JSON string easily.

@karrick
Contributor

karrick commented Jun 24, 2019

I suppose what you are looking for is a new feature that converts data from Avro text encoding to JSON that does not encode the type names inside the JSON objects.
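
Until such a feature exists, one possible workaround is to post-process the textual output. The sketch below is not part of goavro: unwrapUnions and flatten are hypothetical helpers, and they assume the caller passes the names of the top-level fields that the schema declares as unions. That assumption matters, because without the schema a single-key object such as {"string": "x"} cannot be told apart from a real map that happens to have one entry.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// unwrapUnions (hypothetical helper) replaces each named field's
// single-property union wrapper, e.g. {"string": "x"}, with the wrapped
// value "x". Fields that are null, missing, or not wrapped are left alone.
func unwrapUnions(record map[string]interface{}, unionFields ...string) {
	for _, name := range unionFields {
		wrapper, ok := record[name].(map[string]interface{})
		if !ok || len(wrapper) != 1 {
			continue
		}
		for _, v := range wrapper {
			record[name] = v
		}
	}
}

// flatten (hypothetical helper) decodes Avro JSON text, unwraps the named
// union fields, and re-encodes the record as plain JSON.
func flatten(textual []byte, unionFields ...string) ([]byte, error) {
	dec := json.NewDecoder(bytes.NewReader(textual))
	dec.UseNumber() // keep long values as written instead of converting to float64
	var record map[string]interface{}
	if err := dec.Decode(&record); err != nil {
		return nil, err
	}
	unwrapUnions(record, unionFields...)
	return json.Marshal(record)
}

func main() {
	textual := []byte(`{"context":{"map":{}},"eventSize":2,"companyTable":null,` +
		`"entityName":{"string":"FBNK.EB.CONTRACT.BALANCES"},` +
		`"entityId":{"string":"186079759760587"}}`)

	plain, err := flatten(textual, "context", "companyTable", "entityName", "entityId")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(plain))
}
```

After this step, a query such as jq.String("entityName") would no longer need the extra "string" path element. Unions nested inside other records, arrays, or maps are not handled here and would need a schema-aware walk.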

@karrick karrick reopened this Jun 24, 2019
@junejosheeraz
Author

Exactly, thanks for making it easy to explain as well as accepting it as an enhancement. This would add huge value! I will keep an eye on this issue for further updates.

@karrick
Contributor

karrick commented Jun 25, 2019

Please do not hold your breath on this enhancement. It is a bit outside the scope of this library, and I have a few other things that I'm working on. I do agree it's a useful feature, and I'm happy to do the work when I get some time.

shotat added a commit to shotat/goavro that referenced this issue Feb 19, 2020
@shotat

shotat commented Feb 20, 2020

@junejosheeraz

Hi, I have the same problem, and I've implemented a change that does not embed type literals (#201).
If you are still affected by this issue, just check out the branch above and try it out.
Thanks.

@junejosheeraz
Author

Thanks @shotat, will try.

@ggiill

ggiill commented Sep 22, 2020

Hey - any update on #201 getting merged? (Thanks @shotat for putting in that PR.)

@mihaitodor
Contributor

For anyone interested in this feature, I believe #249 should address it once it's merged.
