Clarify source map type #25
Comments
I'll come to this later, but a quick note: relying on JavaScript strings is a little unfortunate, as we rely on the string implementation of one runtime and the rest then need to treat it the same way. AFAIK JavaScript uses UTF-16. |
Yes, and this could lead to different behaviours based on the language and string implementations. Some conversions through buffer -> string -> buffer may actually be lossy for implementations that handle unicode normalisation or have grapheme breaking rules. |
I have thought about this a bit, and what I think is best is to use an offset in characters, where a character is not a byte but a logical unit whose size depends on the encoding of the document. In the case of UTF-8 it might be 3 characters (runes) but 13 bytes. What do you think @kylef? |
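A minimal sketch of that distinction in Python 3 (the sample string is arbitrary):

```python
# The same substring sits at a different offset depending on the unit used.
text = "héllo"                # é is a single character, but two bytes in UTF-8
data = text.encode("utf-8")

print(text.index("llo"))      # 2 -- offset in logical characters
print(data.index(b"llo"))     # 3 -- offset in UTF-8 bytes; é occupies two bytes
```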
@w-vi If I understand you correctly, then I agree. Due to language differences this can be confusing, especially when languages have different grapheme breaking rules. I could see it being very problematic for consumers to find the original source from a source map. I think we should alter our JS parsing libraries (Fury, Protagonist) to accept a buffer instead of a string to keep intact the original source document and how the graphemes are broken down. For the Helium parsing service, the input document is embedded inside a JSON payload and already serialised. I fear we could lose the original encoding, and source maps can become incorrect. api.apiblueprint.org deals with unserialised data and thus wouldn't have this problem.

Just to confirm we're on the same page, here's an example. Given I have five characters (`e`, `\r\n`, `é`, `é`, `👨‍👨‍👦‍👦`), I want to point out that not all of those characters are the same, although they may very well look identical (é vs é). Here is Swift showing the lengths of those characters and also the string in base64:

```swift
let eAcute: Character = "\u{E9}"                 // é (precomposed)
let combinedEAcute: Character = "\u{65}\u{301}"  // e followed by a combining acute accent
// eAcute is é, combinedEAcute is é

// The last element is the family emoji 👨‍👨‍👦‍👦: four code points joined by U+200D (zero width joiner)
let characters: [Character] = ["e", "\r\n", eAcute, combinedEAcute,
                               "\u{1F468}\u{200D}\u{1F468}\u{200D}\u{1F466}\u{200D}\u{1F466}"]
let string = String(characters)
let utf8Data = string.data(using: .utf8)

print(characters)
// ["e", "\r\n", "é", "é", "👨‍👨‍👦‍👦"]
print(characters.map { String($0).utf8.count })
// [1, 2, 2, 3, 25]
print(characters.count)
// 5
print(string.utf8.count)
// 33
print(string.utf16.count)
// 17
print(utf8Data)
// Optional(33 bytes)
print(utf8Data?.base64EncodedString())
// Optional("ZQ0Kw6llzIHwn5Go4oCN8J+RqOKAjfCfkabigI3wn5Gm")
```

Then let's take the base64 and decode it in Python 3:

```python
>>> import base64
>>> data = base64.decodebytes(b'ZQ0Kw6llzIHwn5Go4oCN8J+RqOKAjfCfkabigI3wn5Gm')
>>> string = data.decode('utf-8')
>>> data
b'e\r\n\xc3\xa9e\xcc\x81\xf0\x9f\x91\xa8\xe2\x80\x8d\xf0\x9f\x91\xa8\xe2\x80\x8d\xf0\x9f\x91\xa6\xe2\x80\x8d\xf0\x9f\x91\xa6'
>>> len(data)
33
>>> string
'e\r\néé👨\u200d👨\u200d👦\u200d👦'
>>> len(string)
13
>>> string[1:2]
'\r'
>>> string[2:3]
'\n'
```

I am not sure how Python got 13 as the length; this would seem to be a bug in grapheme breaking. It is not the length in characters, nor in UTF-8 or UTF-16 code units. Python is also treating `\r\n` as two separate characters, as the slices above show.

Then in Node 6 (perhaps there is another way of doing this, I am not that proficient in Node):

```js
> const { StringDecoder } = require('string_decoder')
> const data = Buffer.from('ZQ0Kw6llzIHwn5Go4oCN8J+RqOKAjfCfkabigI3wn5Gm', 'base64')
undefined
> data
<Buffer 65 0d 0a c3 a9 65 cc 81 f0 9f 91 a8 e2 80 8d f0 9f 91 a8 e2 80 8d f0 9f 91 a6 e2 80 8d f0 9f 91 a6>
> data.length
33
> const decoder = new StringDecoder('utf8');
undefined
> console.log(decoder.write(data));
e
éé👨‍👨‍👦‍👦
undefined
> console.log(decoder.write(data).length);
17
```

It looks like strings are internally stored as UTF-16, which means that when it comes to serialising back to UTF-8 it may normalise the output (both forms of é could come out identical). |
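The lossy round trip described above is easy to reproduce; a minimal sketch in Python 3, using explicit NFC normalisation to stand in for whatever a runtime might do internally:

```python
import unicodedata

# "e" followed by U+0301 (combining acute accent): three bytes in UTF-8.
original = b"e\xcc\x81"
decoded = original.decode("utf-8")

# If any layer normalises the string (NFC shown explicitly here), the round
# trip back to bytes no longer reproduces the original document.
normalised = unicodedata.normalize("NFC", decoded)
print(normalised.encode("utf-8"))               # b'\xc3\xa9' -- precomposed é, two bytes
print(normalised.encode("utf-8") == original)   # False: byte offsets past this point shift
```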
Yes, we are on the same page and mean the same thing. The question is whether we should impose UTF-8 as the only acceptable encoding and require the input to be base64 encoded in the case of Helium, so we get the raw bytes instead of a string and thus don't lose the original document. JavaScript uses UTF-16 for strings; that is not Node specific but JavaScript in general. And the Python looks funny, I'll probably look into it in more detail. |
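A minimal sketch of the base64 idea in Python 3 (the `source` field and payload shape are hypothetical, not Helium's actual API):

```python
import base64
import json

# Hypothetical Helium-style payload: the document travels as base64 inside
# JSON, so its exact bytes survive the string layer untouched.
document = "héllo\r\n".encode("utf-8")
payload = json.dumps({"source": base64.b64encode(document).decode("ascii")})

raw = base64.b64decode(json.loads(payload)["source"])
print(raw == document)   # True -- the original bytes, so byte offsets stay valid
```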
Wait, do we do that? I thought the strings we are passing are UTF-8 or something. Other than this, can we all agree that no change is needed in the Refract spec other than saying that the location and length of the source map element refer to bytes and not characters? |
The specification already states that the source maps contain character indexes and not bytes.
So no change is needed in the specification. However, neither the API Blueprint nor the Swagger parsers are following these rules. |
I thought we wanted to change it to bytes according to the above discussion. |
After giving this a bit of thought, I think we should use byte-based source maps from the original source document. Using characters will be problematic for the following reasons:
Steps to Proceed
|
An additional note: I think we should provide conveniences in Fury/Minim-API-Description to convert a source map to a specific line number, as this is a common pattern that various editors and tooling re-implement. |
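Such a convenience could be as small as the following sketch in Python 3 (the function name `line_and_column` is hypothetical):

```python
def line_and_column(source: bytes, offset: int):
    """Map a byte offset into a 1-based (line, column) pair in the original buffer."""
    prefix = source[:offset]
    line = prefix.count(b"\n") + 1
    column = offset - (prefix.rfind(b"\n") + 1) + 1
    return line, column

source = b"# My API\r\n## GET /users\r\n"
print(line_and_column(source, 12))   # (2, 3) -- byte 12 is on the second line
```

Note that both the line and the column here are byte-based, so they stay consistent with a byte-oriented source map over the same buffer.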
There are discrepancies between the API Blueprint parser and the Swagger adapter because source map positions are treated differently (https://github.com/apiaryio/fury.js/issues/63).
In the Swagger adapter, the source map represents the position of a character in a JavaScript string, whereas the API Blueprint parser (snowcrash/drafter) crafts source maps based on the byte offset in the underlying buffer.
We need to decide how source maps should be represented and then align the adapters.
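A minimal sketch of the discrepancy in Python 3, with character indexing standing in for the Swagger adapter's behaviour and byte indexing for snowcrash/drafter's (for this sample, code points and UTF-16 units coincide):

```python
document = '{"title": "Café API"}'   # arbitrary sample document
target = "API"

# Character-based position (what a JavaScript-string-based adapter reports):
print(document.index(target))                              # 16

# Byte-based position in the UTF-8 source (what a buffer-based parser reports):
print(document.encode("utf-8").index(target.encode()))     # 17 -- é takes two bytes
```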
/cc @w-vi