Skip to content
This repository has been archived by the owner on Nov 8, 2024. It is now read-only.

Clarify source map type #25

Open
kylef opened this issue Jul 6, 2017 · 11 comments
Open

Clarify source map type #25

kylef opened this issue Jul 6, 2017 · 11 comments
Milestone

Comments

@kylef
Copy link
Member

kylef commented Jul 6, 2017

There are discrepancies in the API Blueprint parser and Swagger adapter because of source map positions are treated differently (https://github.com/apiaryio/fury.js/issues/63).

In Swagger adapter, the source map represents the position of a character in a JavaScript string. Whereas the API Blueprint parser (snowcrash/drafter) will craft source maps based on the byte offset in the underlying buffer.

We need to decide how source maps should be represented and then align the adaptors.

/cc @w-vi

@kylef kylef modified the milestone: 1.0 Jul 6, 2017
@w-vi
Copy link
Contributor

w-vi commented Jul 7, 2017

I'll come to this later but a quick note: Relying on JavaScript strings is little unfortunate as we rely on string implementation in one runtime and the rest then needs to treat it the same way. AFAIK JavaScript uses UTF-16.

@kylef
Copy link
Member Author

kylef commented Jul 7, 2017

Yes, and this could lead to different behaviours based on the language and string implementations. Some conversions through buffer -> string -> buffer may actually be lossy and for some implementations that handle unicode normalisation or have grapheme breaking rules.

@w-vi
Copy link
Contributor

w-vi commented Aug 25, 2017

I have thought about this a bit and what I think is best is to use offset in characters where character is not a byte but a logical unit whose size depends on the encoding of the document. In case of utf-8 it might be 3 characters (runes) but 13 bytes. What do you think @kylef ?

@kylef
Copy link
Member Author

kylef commented Aug 25, 2017

@w-vi If I understand you correctly, then I agree.

Due to language differences this can be confusing, especially when languages have different grapheme breaking rules. I could see it being very problematic for consumers to find the original source from a source map.

I think we would should alter our JS parsing libraries (Fury, Protagonist) to accept buffer instead of string to keep intact the original source document and how the graphemes are broken down.

For the Helium parsing service, the input document is embedded inside a JSON payload and already serialised. I fear we could loose the original encoding and source maps can become incorrect. api.apibleuprint.org deals with unserialised data and thus wouldn't have this problem.

Just to confirm we're on the same page. Here's an example:

Given I have five characters (e, \r\n, é, é and 👨‍👨‍👦‍👦), and I want to refer to the family (👨‍👨‍👦‍👦) and the characters have been encoded as utf-8. A character is equal to a unicode grapheme cluster. You would propose the source map be: 4 for location and 1 for length?

I want to point out that not all of those characters are the same, although they very well look identical (é vs é). Along with \r\n being a single grapheme cluster and not two (http://www.unicode.org/reports/tr29/#GB3).

Here is Swift showing the length of those areas and also the characters in base64.

let eAcute: Character = "\u{E9}"                         // é
let combinedEAcute: Character = "\u{65}\u{301}"          // e followed by
// eAcute is é, combinedEAcute is é

let characters: [Character] = ["e", "\r\n", eAcute, combinedEAcute, "👨‍👨‍👦‍👦"]
let string = String(characters)
let utf8Data = string.data(using: .utf8)

print(characters)
// ["e", "\r\n", "é", "é", "👨‍👨‍👦‍👦"]

print(characters.map { String($0).utf8.count })
// [1, 2, 2, 3, 25]

print(characters.count)
// 5
print(string.utf8.count)
// 33

print(string.utf16.count)
// 17

print(utf8Data)
// Optional(33 bytes)

print(utf8Data?.base64EncodedString())
// Optional("ZQ0Kw6llzIHwn5Go4oCN8J+RqOKAjfCfkabigI3wn5Gm")

Then lets take the base64 and decode in Python 3:

>>> import base64
>>> data = base64.decodebytes(b'ZQ0Kw6llzIHwn5Go4oCN8J+RqOKAjfCfkabigI3wn5Gm')
>>> string = data.decode('utf-8')
>>> data
b'e\r\n\xc3\xa9e\xcc\x81\xf0\x9f\x91\xa8\xe2\x80\x8d\xf0\x9f\x91\xa8\xe2\x80\x8d\xf0\x9f\x91\xa6\xe2\x80\x8d\xf0\x9f\x91\xa6'
>>> len(data)
33
>>> string
'e\r\néé👨\u200d👨\u200d👦\u200d👦'
>>> len(string)
13
>>> string[1:2]
'\r'
>>> string[2:3]
'\n'

I am not sure how Python got 13 as the length, this would seem a bug in grapheme breaking. It is not the length of characters nor utf8 or utf16. Python is also treating \r and \n as separate characters.

Then in Node 6 (perhaps there is another way of doing this, I am not that proficient in Node):

> const { StringDecoder } = require('string_decoder')
> const data = Buffer.from('ZQ0Kw6llzIHwn5Go4oCN8J+RqOKAjfCfkabigI3wn5Gm', 'base64')
undefined
> data
<Buffer 65 0d 0a c3 a9 65 cc 81 f0 9f 91 a8 e2 80 8d f0 9f 91 a8 e2 80 8d f0 9f 91 a6 e2 80 8d f0 9f 91 a6>
> data.length
33
> const decoder = new StringDecoder('utf8');
undefined
> console.log(decoder.write(data));
e
éé👨‍👨‍👦‍👦
undefined
> console.log(decoder.write(data).length);
17

It looks like strings are internally stored as utf16, which means that when it comes to serialising it back to utf-8 it may normalise the output (both é would become the same size if normalisation occurs). Since we are embedding the payload as a unicode string into a JSON payload for the parsing service request we have likely lossed the original form which can result in the source maps becoming incorrect from what the user requested. The APIs for Fury/Protagonist/Drafter.js accept strings and not buffer so that the deserialisation has already occurred and re-serialisation into utf-8 for the parser may also provide loss of information.

@w-vi
Copy link
Contributor

w-vi commented Aug 25, 2017

Yes, we are on the same page and mean the same thing. Question is if we should not impose utf-8 as the only acceptable encoding require the input to be base64 encoded in case of helium so we get the raw bytes instead of string and thus loosing the original document. Javascript uses utf-16 for strings not Node specific but Javascript in general. And the Python looks funny, I'll probably look into it in more detail.

@pksunkara
Copy link
Contributor

Since we are embedding the payload as a unicode string into a JSON payload for the parsing service request

Wait, we do that? I thought the string we are passing are utf-8 or something.

Other than this, can we all agree that there is no changed needed for the refract spec other than saying that the location and lenght of the source map element refers to the bytes and not characters.

@kylef
Copy link
Member Author

kylef commented Sep 4, 2017

Specification already states that the source maps contain character maps and not bytes:

This location SHOULD include the characters used to build the parent element
A source map is a series of character-blocks

Zero-based index of a character in the source document
Count of characters starting from the character index.

So no chance is needed in the specification. However neither API Blueprint nor Swagger parsers are following these rules.

@pksunkara
Copy link
Contributor

I thought we wanted to change it to bytes according to the above discussion.

@kylef
Copy link
Member Author

kylef commented Sep 22, 2017

After giving this a bit of though, I think we should use byte based source maps from the original source document. Using characters will be problematic for the following reasons:

  • Various programming languages / string implementations have different idea of what a character is. Swift seems to be the only one that follows the Unicode grapheme cluster boundary rules correctly (example: \r\n being a single character). Both Node/JS and Python are problematic and have a different idea of what a character should be (by design or not).
  • If the document has an invalid or non legal Unicode values, it will be problematic trying to refer to it via a source map. Or anything after it. Using a byte-based source map we have more control over providing source maps for an individual byte inside a unicode grapheme cluster.

Steps to Proceed

  • Update API Elements specification to use bytes (which is what the biggest implementation already does in Drafter)
  • Update Swagger adapter to use bytes
  • Test Apiary tools to ensure correct support for bytes, I think they already use bytes as that has been used in Snowcrash/Drafter for some time.

@kylef
Copy link
Member Author

kylef commented Sep 22, 2017

Additional note, I think we should provide convinces in Fury/Minim-API-Description to convert a source map to a specific line number as this a common pattern that various editors and tooling re-implements.

@tjanc
Copy link
Contributor

tjanc commented Sep 22, 2017

Ideally we'd like to index by character. But if we do that, we need to understand all supported encodings.
@kylef What are the encodings we officially support? What about the option of imposing utf-8, as mentioned by @w-vi ?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants