We should control the type of ak.Array.attrs
, and it should be immutable
#3277
Labels
policy
Choice of behavior
ak.Array.attrs
, and it should be immutable
#3277
Right now,we just take the
attrs
that a user gives us and pass it around, allowing it to be mutated arbitrarily.This is not how Awkward Array things usually work, usually we
ak.Array
object is mutable, but so far, none of the attributes hanging off of theak.Array
objects have been mutableThis is all just a matter of expectations and different users will have different expectations, so this issue is labeled "policy" and we can brainstorm in the comments. However, I think the current state of affairs becomes a problem when people serialize/deserialize through Arrow/Parquet, since serialization
MyDict
Here's what I think it should do instead. I think that, given an arbitrary
attrs
Mapping, we should copy it into a frozendict-like type that we control, pass that frozendict from arrays to derived arrays, and replace it when modified by__setitem__
, similar to the way thatak.Array.__setitem__
replaces its layout, rather than changing its layout in-place. (Because layout nodes are highly shared among arrays; theak.Array
outer shell is not.)Our
attrs
type would be something likeThe effect of this is that it would still allow
and
attrs
would be passed to derived arraysbut influence would not be propagated backward:
This is what I think Awkward Arrays should do, but it's open for discussion as to whether it's too weird for Python. It fits in better with what serialized data does: if you save an array to a file and open it in another process and start making changes, it absolutely won't affect the original array—this ensures the same semantics for derived arrays regardless of whether they're saved to files.
Side-note:
parameters
should do this, too. This issue would be resolved by ensuring thatattrs
andparameters
both have this semantics, or by ensuring thatparameters
are strictly immutable.A note on performance: frozendicts don't need to be copied deeply. They can be passed around by reference. If the Attrs class is implemented with a reference to a parent, then this is the only thing that needs to change when deriving one array from another.
A note on memory management: the way that I described it above, with a reference to parent, introduces a cyclic reference (because the
ak.Array
has to have a reference to the Attrs). I think this cycle can be broken with two types; one is a short-lived proxy.A note on deprecation: this would change semantics in two things that have already been "in the wild," both
attrs
andparameters
. But there's a natural place to put the deprecation warning—in the__setitem__
call. This can go through the normal two-version deprecation cycle.If this seems to be too big of a change, here's a more conservative suggestion: keep the value/reference semantics that we have now, but at least ensure that the internal representation is a dict. We could refuse to accept a general Mapping or MutableMapping and only accept simple dicts, since we know we'll be able to reconstitute such a thing when deserializing from a file. (This is the opposite of what I told @pfackeldey in #3238.)
The text was updated successfully, but these errors were encountered: