Replies: 1 comment 1 reply
-
@moorereason @bep would love your sanity check on this, especially want to make sure that it works for Hugo. |
Beta Was this translation helpful? Give feedback.
1 reply
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
This document outlines the design and behavior changes go-toml v2's unmarshaler, as implemented in the v2-wip branch.
Behavior changes
Errors contain human readable context
When a parsing or unmarshaling error occurs,
Unmarshal
returns atoml.DecodeError
, which exposes two methods:Error()
-- to satisfy theerror
interface. It returns the error message itself. For example: "invalid date-time timezone".String()
-- to satisfy thefmt.Stringer
interface. It returns a contextualized, human readable version of the error. For example:Automatic field name guessing
When a unmarshaling to a struct, if a key in the TOML document does not exactly match the name of a struct field or any of the
toml
-tagged field, v1 tries multiple variations of the key (code).This adds complexity in the Unmarshal code, resulting in potentially more allocations. The same result can effectively be achieved with setting the appropriate
toml
tags to the fields that don't exactly match the expected key in the document.Stdlib's encoding/json similarily does not try to guess field names. It checks for either an exact match or a case-insensitive match. It provides the
json
tag to pick a different field that wouldn't follow this rule. This unmarshaler followsencoding/json
's method.Ignore pre-existing value in interface
When decoding into a non-nil
interface{}
, go-toml v1 uses the type of the element in the interface to decode the object. For example:In this case, field
A
is of typeinterface{}
, containing ainner
struct. go-toml v1 sees that type and uses it when decoding the object.Other libraries such as
encoding/json
and BurntSushi's toml have a different behavior. When decoding an object into aninterface{}
, they instead decide to disregard whatever value theinterface{}
may contain and replace it with amap[string]interface{}
. With the same datastructure as above, here is what the result looks like:I believe there are two motivations for that behavior. First, when the value is in an
interface{}
it is no addressable, so it needs to be rebuilt anyway. Second, it is to avoid the output types to change depending on the input (in this example, the program would have to deal withA
containing either aninner
struct or amap[string]interface{}
depending on whether in put has anil
element in the interface or not).go-toml v2's unmarshaler changes the behavior to match
encoding/json
's.Example code: https://play.golang.org/p/VxkwavVlqo1
Values out of array bounds ignored
When decoding into an array, go-toml v1 returns an error when the number of elements contained in the doc is superior to the capacity of the array. For example:
In the same situation,
encoding/json
ignores the last value:go-toml v2 follows
encoding/json
's behavior, on the grounds that providing sticking to the standard library's bevhior reduces surprises when using go-toml. But I can easily be convinced this is a mistake.Support for
toml.Unmarshaler
has been droppedThis method does not seem to be wildly used. This sourcegraph query shows that the main public user of this is influxdata. Apparently this method is being used to parse custom units without having to put them in strings. For example:
duration = 1s
is accepted. A similar effect can be achieved using theencoding.TextUnmarshaler
and keeping the units in strings (note: their code actually accepts both the in-string unquoted versions). Of course not perfect, but we don't have more info on usage.The flexible structure of TOML makes this method difficult to define: what bytes should be passed to the
UnmarshalTOML
method? This question needs to be answered because the canonical signature of this method does not allow to indicate how many bytes were parsed. Departing from this signature would go significantly against the goal of staying close to the behavior ofencoding/json
.To avoid unnecessary complexity this feature has been left out. If there is an ask to bring this back we can reconsider this decision.
Implementation notes
The new parser does one pass over the input bytes to construct an AST. It does not however attempt any parsing or validation of values, which are left as byte slices pointing into the input data. As a result, the unmarshaler chooses to decode each value based on what structure it is building (and entirely skips whole branches of the AST when they do not exist in the target structure).
This two-passes design is reasonably performant, and allows to reuse the parsing structure in both the Unmarshaler and the Document interface. Initially I attempted a one-pass version, but it required to create set up and synchronize two stacks (the parser's and the unmarshaler's), which made the code difficult to follow and debug, especially when trying to skip over unused parts of the document.
As an optimization, the parser only parses one top level expression at a time. Coupled with the fact that the AST is stored in a single array, it allows to reuse storage from one expression to the next, maintaining allocations to a minimum.
When decoding into struct, reflection is used once per type to figure out fields and their acceptable keys. That information is cached in a global concurrent-safe map, which minimizes the overhead to reflecting on structs and should allow the cache to converge to all the types used by the program.
Verification of key uniqueness and types is implemented as a separate pass on the AST. This method decouples the TOML-specific semantic check from building the target structure. It can then be reused efficiently by the Document builder.
Beta Was this translation helpful? Give feedback.
All reactions