go-toml v2: unmarshal #488

pelletier · 2021-03-30T00:13:06Z

pelletier
Mar 30, 2021
Maintainer

This document outlines the design and behavior changes of go-toml v2's unmarshaler, as implemented in the v2-wip branch.

Behavior changes

Errors contain human readable context

When a parsing or unmarshaling error occurs, Unmarshal returns a toml.DecodeError, which exposes two methods:

Error() -- to satisfy the error interface. It returns the error message itself. For example: "invalid date-time timezone".

String() -- to satisfy the fmt.Stringer interface. It returns a contextualized, human readable version of the error. For example:

 6| [owner]
 7| name = "Tom Preston-Werner"
 8| organization = "GitHub"
 9| dob = 1979-05-27T07:32:00+1 # First class dates? Why not?
  |                          ~~ invalid date-time timezone
10| bio = "GitHub Cofounder & CEO\nLikes tater tots and beer."
11| [database]
12| server = "192.168.1.1"
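To illustrate the split between Error() and String(), here is a minimal stdlib-only sketch. The contextError type and its fields are hypothetical and are not go-toml's DecodeError implementation; the sketch only shows how one value can satisfy both error and fmt.Stringer, and how a contextualized rendering might be produced.

```go
package main

import (
	"fmt"
	"strings"
)

// contextError is a hypothetical type used for illustration only.
// It is NOT go-toml's DecodeError.
type contextError struct {
	msg    string // short message returned by Error()
	doc    string // full input document
	line   int    // 1-indexed line where the problem is
	column int    // 1-indexed column where the problem starts
	length int    // number of characters to underline
}

// Error returns only the short message.
func (e *contextError) Error() string { return e.msg }

// String renders the offending line with one line of context on each
// side, and underlines the faulty span.
func (e *contextError) String() string {
	var b strings.Builder
	for i, l := range strings.Split(e.doc, "\n") {
		n := i + 1
		if n < e.line-1 || n > e.line+1 {
			continue
		}
		fmt.Fprintf(&b, "%2d| %s\n", n, l)
		if n == e.line {
			fmt.Fprintf(&b, "  | %s%s %s\n",
				strings.Repeat(" ", e.column-1),
				strings.Repeat("~", e.length),
				e.msg)
		}
	}
	return b.String()
}

func main() {
	doc := "[owner]\ndob = 1979-05-27T07:32:00+1\nname = \"Tom\""
	var err error = &contextError{
		msg: "invalid date-time timezone", doc: doc,
		line: 2, column: 26, length: 2,
	}

	// Short form, suitable for logs.
	fmt.Println(err.Error())

	// Contextualized form, if the error offers one.
	if s, ok := err.(fmt.Stringer); ok {
		fmt.Print(s.String())
	}
}
```

Callers that only care about the error interface keep working unchanged, while tools that want to show the document excerpt can opt in through the type assertion.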

Automatic field name guessing

When unmarshaling to a struct, if a key in the TOML document does not exactly match the name of a struct field or any of the toml-tagged fields, v1 tries multiple variations of the key (code).
This adds complexity to the Unmarshal code, resulting in potentially more allocations. The same result can effectively be achieved by setting the appropriate toml tags on the fields that don't exactly match the expected key in the document.
Stdlib's encoding/json similarly does not try to guess field names. It checks for either an exact match or a case-insensitive match. It provides the json tag to pick a different field that wouldn't follow this rule. This unmarshaler follows encoding/json's method.
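The encoding/json matching rule described above can be observed with a short program. The config type and decode helper are made up for this illustration, and the sketch uses encoding/json rather than go-toml, so it shows the rule being followed, not v2's own code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// config is a hypothetical target struct.
type config struct {
	// Matched by "ServerName" exactly, or by "servername" case-insensitively.
	ServerName string
	// "max_conns" is not a case variation of "MaxConns", so a tag is required.
	MaxConns int `json:"max_conns"`
	// No tag: "time_out" in the input is not guessed to mean this field.
	Timeout int
}

// decode unmarshals a JSON document into a config.
func decode(input string) (config, error) {
	var c config
	err := json.Unmarshal([]byte(input), &c)
	return c, err
}

func main() {
	c, err := decode(`{"servername": "a", "max_conns": 10, "time_out": 5}`)
	fmt.Printf("%+v %v\n", c, err)
}
```

Here ServerName and MaxConns are populated, while Timeout keeps its zero value because no field matches the key time_out.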

Ignore pre-existing value in interface

When decoding into a non-nil interface{}, go-toml v1 uses the type of the element in the interface to decode the object. For example:

type inner struct {
  B interface{}
}
type doc struct {
  A interface{}
}

d := doc{
  A: inner{
    B: "Before",
  },
}

data := `
[A]
B = "After"
`

toml.Unmarshal([]byte(data), &d)
fmt.Printf("toml: %#v\n", d)

// toml: main.doc{A:main.inner{B:"After"}}

In this case, field A is of type interface{}, containing an inner struct. go-toml v1 sees that type and uses it when decoding the object.

Other libraries such as encoding/json and BurntSushi's toml behave differently. When decoding an object into an interface{}, they disregard whatever value the interface{} may contain and replace it with a map[string]interface{}. With the same data structure as above, here is what the result looks like:

json.Unmarshal([]byte(`{"A": {"B": "After"}}`), &d)
fmt.Printf("stdlib json: %#v\n", d)

// stdlib json: main.doc{A:map[string]interface {}{"B":"After"}}

I believe there are two motivations for that behavior. First, a value stored in an interface{} is not addressable, so it needs to be rebuilt anyway. Second, it prevents the output types from changing depending on the input (in this example, the program would have to deal with A containing either an inner struct or a map[string]interface{}, depending on whether the input has a nil element in the interface or not).

go-toml v2's unmarshaler changes the behavior to match encoding/json's.

Example code: https://play.golang.org/p/VxkwavVlqo1

Values out of array bounds ignored

When decoding into an array, go-toml v1 returns an error when the number of elements contained in the doc exceeds the length of the array. For example:

type doc struct {
  A [2]string
}
d := doc{}
err := toml.Unmarshal([]byte(`A = ["one", "two", "many"]`), &d)
fmt.Println(err)

// (1, 1): unmarshal: TOML array length (3) exceeds destination array length (2)

In the same situation, encoding/json silently discards the extra values:

err := json.Unmarshal([]byte(`{"A": ["one", "two", "many"]}`), &d)
fmt.Println("err:", err, "d:", d)
// err: <nil> d: {[one two]}

go-toml v2 follows encoding/json's behavior, on the grounds that sticking to the standard library's behavior reduces surprises when using go-toml. But I can easily be convinced this is a mistake.

Support for `toml.Unmarshaler` has been dropped

This method does not seem to be widely used. This sourcegraph query shows that the main public user of this is influxdata. Apparently this method is being used to parse custom units without having to put them in strings. For example: duration = 1s is accepted. A similar effect can be achieved using encoding.TextUnmarshaler and keeping the units in strings (note: their code actually accepts both the in-string and unquoted versions). This is of course not perfect, but we don't have more info on usage.

The flexible structure of TOML makes this method difficult to define: what bytes should be passed to the UnmarshalTOML method? This question needs to be answered because the canonical signature of this method does not allow indicating how many bytes were parsed. Departing from this signature would go significantly against the goal of staying close to the behavior of encoding/json.

To avoid unnecessary complexity, this feature has been left out. If there is an ask to bring this back, we can reconsider this decision.

Implementation notes

The new parser does one pass over the input bytes to construct an AST. It does not however attempt any parsing or validation of values, which are left as byte slices pointing into the input data. As a result, the unmarshaler chooses to decode each value based on what structure it is building (and entirely skips whole branches of the AST when they do not exist in the target structure).

This two-pass design is reasonably performant, and allows reusing the parsing structure in both the Unmarshaler and the Document interface. Initially I attempted a one-pass version, but it required setting up and synchronizing two stacks (the parser's and the unmarshaler's), which made the code difficult to follow and debug, especially when trying to skip over unused parts of the document.

As an optimization, the parser only parses one top-level expression at a time. Coupled with the fact that the AST is stored in a single array, this allows reusing storage from one expression to the next, keeping allocations to a minimum.
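A minimal sketch of that storage strategy, assuming nodes refer to each other by index inside one slice. The node and arena types and the toy parseKeyValue method are hypothetical and much simpler than a real TOML parser.

```go
package main

import (
	"bytes"
	"fmt"
)

// node is a hypothetical AST node. Nodes link to each other by index
// into the arena instead of by pointer.
type node struct {
	kind string
	data []byte // points into the input, not copied
	next int    // index of the next node, or -1 if none
}

// arena stores every node of the current expression in a single slice.
type arena struct {
	nodes []node
}

// reset drops all nodes but keeps the backing array for reuse.
func (a *arena) reset() { a.nodes = a.nodes[:0] }

// push appends a node and returns its index.
func (a *arena) push(n node) int {
	a.nodes = append(a.nodes, n)
	return len(a.nodes) - 1
}

// parseKeyValue is a toy parser for "key = value" expressions.
// It returns the index of the key node.
func (a *arena) parseKeyValue(line []byte) (root int, ok bool) {
	eq := bytes.IndexByte(line, '=')
	if eq < 0 {
		return -1, false
	}
	val := a.push(node{kind: "value", data: bytes.TrimSpace(line[eq+1:]), next: -1})
	key := a.push(node{kind: "key", data: bytes.TrimSpace(line[:eq]), next: val})
	return key, true
}

func main() {
	var a arena
	for _, line := range []string{`a = 1`, `b = "two"`} {
		a.reset() // the previous expression is no longer needed
		root, ok := a.parseKeyValue([]byte(line))
		if !ok {
			continue
		}
		k := a.nodes[root]
		v := a.nodes[k.next]
		fmt.Printf("%s -> %s (arena len=%d)\n", k.data, v.data, len(a.nodes))
	}
}
```

Because reset only truncates the slice, the second expression is stored in the memory allocated for the first one, so the number of allocations depends on the largest expression rather than on the size of the document.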

When decoding into a struct, reflection is used once per type to figure out fields and their acceptable keys. That information is cached in a global concurrent-safe map, which minimizes the overhead of reflecting on structs and should allow the cache to converge to all the types used by the program.

Verification of key uniqueness and types is implemented as a separate pass on the AST. This method decouples the TOML-specific semantic check from building the target structure. It can then be reused efficiently by the Document builder.

pelletier · 2021-03-30T01:11:45Z

pelletier
Mar 30, 2021
Maintainer Author

@moorereason @bep would love your sanity check on this, especially want to make sure that it works for Hugo.

1 reply

bep Mar 30, 2021

This looks sensible to me. Hugo always (currently at least) unmarshals TOML into a map[string]interface {}, which makes part of the discussion above not relevant (I think, such as slice length); but in general I like the idea that you try to mimic encoding/json's behaviour. Having these subtle differences between the different encoding libraries is generally a pain.
