Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Values and Representation chapter #1664

Open
wants to merge 22 commits into
base: master
Choose a base branch
from

Conversation

chorman0773
Copy link
Contributor

This documents the values for most types (where it has been decided), as well as the representation of these values in memory. It also defines what a byte is in Rust (including initialized and uninitialized memory).

The chapter does not define what Provenance carries in Rust. repr(Rust) enums are also not fully elaborated, as there are things undecided.

This makes a normative reference to https://ieeexplore.ieee.org/document/8766229 for floating-point format.

@chorman0773 chorman0773 added S-waiting-on-review Status: The marked PR is awaiting review from a maintainer T-opsem Team: opsem T-spec Team: spec labels Oct 25, 2024
src/values.md Outdated Show resolved Hide resolved
src/values.md Outdated
A floating-point value consists of either a rational number, which is within the range and precision dictated by the type, an infinity, or a NaN value.

r[value.primitive.float-repr]
A floating-point value is represented the same as a value of the unsigned integer type with the same width given by its [IEEE 754-2019] encoding.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We looked at this one sentence in particular in our lang-docs call today, and we were having a lot of trouble parsing it. Perhaps you could say this another way for us to better understand the intent here. Why does this need to reference "an unsigned integer type of the same width" at all, e.g.?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's to inherit the definition of endianess from unsigned integer types (where its the easiest to define). Signed integers also do the same thing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the entire lang team is confused by a sentence, I think it's fair to say that it needs to be rewritten.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reworking this now.

@RalfJung would you know what the state of NaNs are? For example, could I say that rust has One NaN which is encoded non-deterministically?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I attempted to word the correspondance better. I can't get rid of it entirely without just copying the same encoding/decoding step (which IMO is just more confusing and redundant).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NaNs are fully stable bit patterns, see https://rust-lang.github.io/rfcs/3514-float-semantics.html.

So we can either say that f32 has the same values as u32 and leave the "float" part to the higher layers, or we need a concept of "float value" that exactly preserves the sign, quiet bit, and payload of a NaN (and the sign of a 0).

@traviscross
Copy link
Contributor

@rust-lang/opsem: We're interested in your review on this. From the lang-docs side, we're particularly interested to confirm that this text is both correct from your perspective and is not making an new guarantees about the language.

@traviscross
Copy link
Contributor

cc @rust-lang/lang

src/values.md Outdated Show resolved Hide resolved
src/SUMMARY.md Outdated Show resolved Hide resolved
Copy link
Contributor

@ehuss ehuss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm feeling uncertain about this approach of having a chapter specifically for these representations. In the past, we have placed these definitions in the chapters for the types they are defining (for example, the char chapter defines the char). Can we move these rules into those chapters (and avoid any duplication with things that are already there)?

src/values.md Outdated
r[value.pointer]

r[value.pointer.thin]
Each thin pointer consists of an address and an optional provenance. The address refers to which byte the pointer points to. The provenance refers to which bytes the pointer is allowed to access, and the allocation those bytes are within.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This starts using the term provenance without defining what it means. Would it be possible to at least start that in the glossary? I'm also not sure if we'll need a more elaborate introduction, since rfc#3559 is quite weighty, so maybe there will need to be a more dedicated section for that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There will be eventually, but it also an incredibly weakly defined topic right now. The main definition we have today is frankly, "It exists". Anything more requires a ton more litigation from T-opsem. Defining a byte requires referring to provenance, though, at least in the most abstract sense.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added as much as is really decided.

@chorman0773
Copy link
Contributor Author

Can we move these rules into those chapters (and avoid any duplication with things that are already there)?

We still have to define what a byte is, which is relatively small for its own chapter, but doesn't have any other chapter to go in (glossary shouldn't provide normative definitions). It's also going to duplicate content excessively for aggregate types, since tuples and structs have different chapters, but have the exact same values and representation (according to a particular layout).

src/values.md Outdated Show resolved Hide resolved
@chorman0773 chorman0773 added S-waiting-on-author Status: The marked PR is awaiting some action (such as code changes) from the PR author. and removed S-waiting-on-review Status: The marked PR is awaiting review from a maintainer labels Nov 21, 2024
@chorman0773
Copy link
Contributor Author

Need to figure out how to disentangle the aggregate representation rules to move them into a separate chapter.

@chorman0773
Copy link
Contributor Author

@rustbot ready

@rustbot rustbot added S-waiting-on-review Status: The marked PR is awaiting review from a maintainer and removed S-waiting-on-author Status: The marked PR is awaiting some action (such as code changes) from the PR author. labels Nov 22, 2024
@chorman0773
Copy link
Contributor Author

Just talked about this on the Community Discord, apparently there is one guarantee this makes that is "new". Namely, it guarantees that wide pointers are represented like some repr(Rust) struct of the data pointer and its metadata. While this seems like it's doing nothing, one thing it does say is that the fields exist somewhere in the representation (and the compiler isn't doing far more interesting shenanigans than it might do to a tuple or a struct).
Is this a problematic guarantee to make?

Comment on lines +15 to +16
> While bytes in Rust are typically lowered to hardware bytes, they may contain additional values,
> such as being uninitialized, or storing part of a pointer.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
> While bytes in Rust are typically lowered to hardware bytes, they may contain additional values,
> such as being uninitialized, or storing part of a pointer.
> While bytes in Rust are typically lowered to hardware bytes, Rust uses an "abstract"
> notion of bytes that can make distinctions which are absent in hardware,
> such as being uninitialized, or storing part of a pointer.

r[memory.byte]

r[memory.byte.intro]
The most basic unit of memory in Rust is a byte. All values in Rust are computed from 0 or more bytes read from an allocation.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is "value" defined anywhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Values are defined constructively, with each "class" of types. As mentioned in another comment, these are present in different chapters at the request of T-lang-doc and T-spec.

Comment on lines 18 to 22
r[memory.byte.init]
Each byte may be initialized, and contain a value of type `u8`, as well as an optional pointer fragment. When present, the pointer fragment carries [provenance][type.pointer.provenance] information.

r[memory.byte.uninit]
Each byte may be uninitialized.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems quite odd to split the definition of a "byte" over two separate items. It is very crucial that this list is exhaustive, and cannot be extended by some other clause somewhere else saying "a byte may also be X". So IMO this should be a single paragaph saying something like:

Each byte is described by one of the following cases:

  • It is uninitialized.
  • It is initialized, ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separately addressing two points:

  • I'd like there to be separate rule ids for init and uninit so they can be more easily cited separately (though the items of a list can have their own rule ids - there is ample precedent for that),
  • As noted, it may not yet be fully desirable to define the exhaustive list, though it is useful to define the initialized/uninitialized separation on its own (clearly set out that types like integers or raw pointers can have "invalid" representations)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do want "machine-readable" names for the variants, fair. But making them separate paragraphs just feels entirely wrong. This should be a list, formatted like a list. Can we have lists where each item has a "name"?

r[memory.encoding]

r[memory.encoding.intro]
Each type in Rust has 0 or more values, which can have operations performed on them
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be setting up for "memory.encoding", but really it defines the notion of a "value". I think that notion deserves a separate definition, in a r[memory.value], which can then be referenced elsewhere. I am not fully convinced putting this in "memory" is the best choice, since one defining characteristic of values is that they do not exist in memory, but I also don't know the structure well enough to be able to suggest a better place.

This should then also give some sort of definition of a value. Something like:

A value in Rust describes a high-level, "mathematical" view on data at a given type. Examples of values include:

  • mathematical integers
  • Boolean truth values
  • A tuple of values of potentially different types
  • A homogeneous list of values of identical type

This list is not exhaustive; as Rust evolves, more kinds of values can be added.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This used to be one chapter that fully defined value. I was asked to move those definitions to the respective type chapters to avoid duplication.

Copy link
Member

@RalfJung RalfJung Dec 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having the encode/decode defined with each type makes sense IMO. But the concept of a value is a central concept that should be defined somewhere, it can't just be a decentral list. (Think: there is one place that defines trait Value, and then many places define what all the things inhabiting that trait are.) Having the list of "what is a value" spread out is fine (after all, my suggestion already said that the list is non-exhaustive), but the central definition should give some examples of what values are, just to make it understandable and to clearly differentiate it from "sequence of bytes".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that there is a good place for the definition of value.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can put it in memory-model.md for now. It should go together with encode/decode and those already live there.

Comment on lines 38 to 39
Each value of a type can be encoded into a sequence of bytes, and decoded from a sequence of bytes, which has a length equal to the size of the type.
The operation to encode or decode a value is determined by the representation of the type.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The representation relation does not determine encode/decode, since there can be more than one representation of a value -- encode makes the choice of which representation to pick. So it's the other way around, encode/decode determine the representation relation.

I would phrase this to be more centered around types, not values. Something like:

Each type defines an operation to decode a sequence of bytes into a value, and to encode a value into a sequence of bytes. When a sequence of bytes decodes to a value, we say that it represents that value. A value that is valid for a given type may be represented by multiple different sequences of bytes, and a sequence of bytes may represent 0 or 1 values of any given type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only thing is that most of the other chapters will define just representation to get both at once. I'd like to preserve this ability.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although in most cases, simply defining the decoding should suffice as decode(T, encode(T, val)) is defined to round-trip val later.

Copy link
Member

@RalfJung RalfJung Dec 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only thing is that most of the other chapters will define just representation to get both at once. I'd like to preserve this ability.

Formally speaking, that's sloppy. If we say that every type has encode and decode, we have to say explicitly how they are defined anywhere.

Although in most cases, simply defining the decoding should suffice as decode(T, encode(T, val)) is defined to round-trip val later.

So then I would suggest that these types have something like

r[<>.repr.decode]
...

r[<>.repr.encode]
The encoding of a value is the unique sequence of bytes that decodes back to the value

However, note that for most types there is no unique such sequence (since decode often ignores provenance).

With more setup we could define the encoding as the unique minimum sequence of bytes that decodes back to the value, where "minimum" uses an ordering that sets "uninit < init(no-provenance) < init(with-provenance)". In fact, this may even be true for all types... it follows from provenance monotonicity + initialization monotonicity.

Each value of an integer type is a whole number. For unsigned types, this is a positive integer or `0`. For signed types, this can either be a positive integer, negative integer, or `0`.

r[type.numeric.repr.integer-width]
The range of values an integer type can represent depends on its signedness and its width, in bits. The width of type `uN` or `iN` is `N`. The width of type `usize` or `isize` is the value of the `target_pointer_width` property.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this say how it depends on that?

Specifically, unsigned integers carry values in 0 .. 2^N, and signed integers carry values in -2^(N-1) .. 2^(N-1) (left-inclusive, right-exclusive).

@@ -29,6 +29,32 @@ A _unit-like struct_ type is like a struct type, except that it has no fields.
The one value constructed by the associated [struct expression] is the only
value that inhabits such a type.

## Struct and aggregate values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is an "aggregate" here? I would have expected that structs and enums are aggregates, but enums are defined elsewhere and structs seem to be a separate class as well, so I am confused.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tuples are aggregates also. The current definition of a wide pointer also ends up using aggregate representation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tuples have their own section on this, though. So I find the current structure a bit confusing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tuples would have the exact same set of values and the exact same representation formula (given the same chosen field offsets, size, and alignment constraints) as a struct with the same type fields, though, so this would just end up duplicating the same text. The tuple chapter does have a clause that explicitly delegates to this section, though, so it's not implicit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The array section also use to delegate in the same way, but I changed it in a later revision to simply do it manually since unlike structs or tuples, arrays can't shallowly have embedded padding bytes (only padding that exists inside of an element because of the elements representation).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the goal of not duplicating this for tuples. I just think the way it is currently done is confusing.

Even if this here just says "Struct values", we can still say in the tuple section that they use the same values and encode/deocde as structs, can't we?

@@ -29,6 +29,32 @@ A _unit-like struct_ type is like a struct type, except that it has no fields.
The one value constructed by the associated [struct expression] is the only
value that inhabits such a type.

## Struct and aggregate values

r[type.struct.value]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should say somewhere that a value of strict type is a list of values with one value for each field of the struct.

The representation of such a struct contains the representation of the value of each field at its corresponding offset.

r[type.struct.value.padding-uninit]
When a value of an aggregate is encoded, each padding byte is left as uninit
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, this can IMO be defined more nicely without having to talk about padding at all:

To decode a struct value, decode each field at its respective offset in the byte sequence, and use that to form the decoded value. To encode a struct value, encode each field, and place the encoding at the respective offset, leaving all bytes that are not in any field uninitialized.

NOTE: the bytes that are not in any field are also often called "padding bytes". The representation defined above implies that the contents of padding bytes are lost and reset to uninitialized each time a typed copy of a struct value is performed.

(Have we defined "typed copy" yet?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't really need to define typed copy for this - a typed copy is just a decoding of some memory to a value, then re-encoding that value elsewhere. It may be useful to define non-normatively, or elsewhere, but we don't need to define it for this purpose.
As with enums, we need to define padding anyways b/c unions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be useful to define non-normatively, or elsewhere, but we don't need to define it for this purpose.

Fair. I feel like (non-normatively) stating the fact that padding gets reset can be useful for understanding though. So maybe we can have something like

NOTE: As a consequence of this definition, if some sequence of bytes is decoded and then re-encoded, all information stored in padding bytes is lost (reset to uninitialized) as part of this round-trip.

Comment on lines 27 to 28
r[type.union.value]
A value of a union type consists of a sequence of bytes, corresponding to each [value byte][type.struct.value.value-bytes]. The value bytes of a union are represented exactly. Each [padding byte][type.struct.value.padding] is set to uninit when encoded.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that the fields of a union do not have to be structs, it is odd to refer to type.struct here. This definition requires there to be a general notion of "padding bytes in any type", and each type needs to say what its padding bytes are.

> such as being uninitialized, or storing part of a pointer.

r[memory.byte.init]
Each byte may be initialized, and contain a value of type `u8`, as well as an optional pointer fragment. When present, the pointer fragment carries [provenance][type.pointer.provenance] information.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the difference between a "pointer fragment" and "provenance"? Wouldn't it be easier to just say, "[...] as well as optional provenance information"?

@RalfJung
Copy link
Member

RalfJung commented Dec 1, 2024

We're interested in your review on this. From the lang-docs side, we're particularly interested to confirm that this text is both correct from your perspective and is not making an new guarantees about the language.

This makes several new guarantees:

  • We have never defined what Rust's concept of a "byte" is. Uninitialized memory is a thing (though AFAIK this was never explicitly FCP'd either), provenance is a thing, but this PR basically guarantees that that's it, and there's nothing else "odd" about our bytes. I can't immediately think of anything else, but e.g. maybe we need more extra magic state to give an abstract definition of the behavior of coroutines?
  • We have not defined that reading a pointer byte at integer type just ignores the provenance. It'd probably be better if this PR didn't make a commitment here.

@RalfJung
Copy link
Member

RalfJung commented Dec 1, 2024

I should also add that overall I am very happy with the structure and style of the new text here. :) I can finally see how this could hold together as a proper spec also for the operational aspects of the language.

@chorman0773
Copy link
Contributor Author

We have never defined what Rust's concept of a "byte" is. Uninitialized memory is a thing (though AFAIK this was never explicitly FCP'd either), provenance is a thing, but this PR basically guarantees that that's it, and there's nothing else "odd" about our bytes. I can't immediately think of anything else, but e.g. maybe we need more extra magic state to give an abstract definition of the behavior of coroutines?

I don't believe this does confine the definition of byte to the current list. As you pointed out, the current definition (over two separate clauses) is not clearly exhaustive. I don't think there's anything else that would clearly prohibit a new fancy type of byte from coming into existance (though once we do affirm that, I hope to reflect that in the text).

We have not defined that reading a pointer byte at integer type just ignores the provenance. It'd probably be better if this PR didn't make a commitment here.

This part is fair, though part of me wants to just ask for a T-opsem FCP on that (could be done over on ucg though), as I don't think there's any other rule that satisfies monotonicity (other than PVI which has been ruled out through other means).

@RalfJung
Copy link
Member

RalfJung commented Dec 1, 2024

I don't believe this does confine the definition of byte to the current list. As you pointed out, the current definition (over two separate clauses) is not clearly exhaustive. I don't think there's anything else that would clearly prohibit a new fancy type of byte from coming into existance (though once we do affirm that, I hope to reflect that in the text).

I still think "byte" should be defined in a single clause, spreading it out is IMO quite confusing. We can add that the list can be extended in future versions of the spec. (However, this might make formal reasoning about programs impossible, so at some point we have to nail this down.)

This part is fair, though part of me wants to just ask for a T-opsem FCP on that (could be done over on ucg though), as I don't think there's any other rule that satisfies monotonicity (other than PVI which has been ruled out through other means).

I'd rather not block this PR on making that commitment. I agree that if we want provenance monotonicity I don't see another option, but there might be designs that entirely forego the need for provenance monotonicity... OTOH we've already committed to some spec changes for offset and offset_from that may imply that provenance monotonicity is unavoidable at this point.

@rustbot

This comment has been minimized.

> Only certain byte sequences may decode into a value of a given type. For example, a byte sequence consisting of all zeroes does not decode to a value of a reference type.

r[memory.encoding.representation]
A sequence of bytes is said to represent a value of a type, if the decode operation for that type produces that value from that sequence of bytes. The representation of a type is the partial relation between byte sequences and values those sequences represent.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
A sequence of bytes is said to represent a value of a type, if the decode operation for that type produces that value from that sequence of bytes. The representation of a type is the partial relation between byte sequences and values those sequences represent.
A sequence of bytes is said to represent a value of a type, if the decode operation for that type produces that value from that sequence of bytes. The representation of a type is the partial relation between byte sequences and values those sequences represent.
This relation is functional, i.e., a given byte sequence represents at most one value.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do decide to define decoding as the unique minium value, we could possibly even reduce this definition to just say, for each type there is a representation function, which is a partial function that decodes a sequence of bytes into a corresponding value. And then we once-and-forall define the corresponding encode operation across all types.

A sequence of bytes is said to represent a value of a type, if the decode operation for that type produces that value from that sequence of bytes. The representation of a type is the partial relation between byte sequences and values those sequences represent.

> [!NOTE]
> Representation is related to, but is not the same property as, the layout of the type.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this more confusing than helpful. What even exactly is the "layout" of a type, i.e. what is the mathematical type of a "layout"?

I am not sure what you are trying to achieve with this sentence. Here's my attempt at rewording:

Suggested change
> Representation is related to, but is not the same property as, the layout of the type.
> The representation of a type is determined by its layout. For instance, the layout of a struct defines the offsets of all its fields, and this in turn defines the representation of struct values as sequences of bytes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm attempting to explain someting to less mathetmatically-focused consumers of the reference. This is a non-normative note and therefore serves to explain or exemplify something about the preceeding paragraph.

Copy link
Member

@RalfJung RalfJung Dec 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I wrote wasn't even very mathematical. ;)

I find your current note confusing. I don't know what you are trying to say, so I made a guess. I am also fine with removing it entirely, but if you want to keep it then please explain which point you are trying to make. There's no point in keeping a note which is confusing to an expert; it will likely confuse many novices as well.

@rustbot
Copy link
Collaborator

rustbot commented Jan 14, 2025

☔ The latest upstream changes (possibly f80986b) made this pull request unmergeable. Please resolve the merge conflicts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-review Status: The marked PR is awaiting review from a maintainer T-opsem Team: opsem T-spec Team: spec
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants