diff --git a/docs/architecture-decisions/2024-05-21_primitive-bytes-and-strings.md b/docs/architecture-decisions/2024-05-21_primitive-bytes-and-strings.md index c611a06d..595181e0 100644 --- a/docs/architecture-decisions/2024-05-21_primitive-bytes-and-strings.md +++ b/docs/architecture-decisions/2024-05-21_primitive-bytes-and-strings.md @@ -5,20 +5,13 @@ - **Deciders**: Alessandro Cappellato (Algorand Foundation), Joe Polny (Algorand Foundation), Rob Moore (MakerX) - **Date created**: 2024-05-21 - **Date decided**: N/A -- **Date updated**: 2024-05-22 +- **Date updated**: 2024-05-31 ## Context See [Architecture Decision Record - Primitive integer types](./2024-05-21_primitive-integer-types.md) for related decision and context. -The AVM's only non-integer type is a variable length byte array. When *not* being interpreted as a `biguint`, leading zeros are significant and length is constant unless explicitly manipulated. Strings can only be represented in the AVM if they are encoded as bytes. The AVM supports byte literals in the form of base16, base64, and utf8 encoded strings. Once a literal has been parsed, the AVM has no concept of the original encoding or of utf8 characters. As a result, whilst a byte array can be indexed to receive a single byte (or a slice of bytes); it cannot be indexed to return a single utf8 *character* - unless one assumes all characters in the original string were ASCII (ie. single byte) characters. - -EcmaScript provides two relevant types for bytes and strings. - - - **string**: The native string type. Supports arbitrary length, concatenation, indexation/slicing of characters plus many utility methods (upper/lower/startswith/endswith/charcodeat/trim etc). Supports concat with binary `+` operator. - - **Uint8Array**: A variable length mutable array of 8-bit numbers. Supports indexing/slicing of 'bytes'. - -TealScript uses a branded string to represent bytes. Base64/Base16 encoding/decoding is performed with specific ops. The prototype of these objects contains string specific apis that are not implemented. +The AVM's only non-integer type is a variable length byte array. When *not* being interpreted as a `biguint`, leading zeros are significant and length is constant unless explicitly manipulated. Strings can only be represented in the AVM if they are encoded as bytes. The AVM supports byte literals in the form of base16, base64, and UTF-8 encoded strings. Once a literal has been parsed, the AVM has no concept of the original encoding or of UTF-8 characters. As a result, whilst a byte array can be indexed to receive a single byte (or a slice of bytes); it cannot be indexed to return a single UTF-8 *character* - unless one assumes all characters in the original string were ASCII (i.e. single byte) characters. Algorand Python has specific [Bytes and String types](https://algorandfoundation.github.io/puya/lg-types.html#avm-types) that have semantics that exactly match the AVM semantics. Python allows for operator overloading so these types also use native operators (where they align to functionality in the underlying AVM). @@ -26,8 +19,8 @@ Algorand Python has specific [Bytes and String types](https://algorandfoundation ## Requirements - Support bytes AVM type and a string type that supports ASCII UTF-8 strings -- Use idiomatic TypeScript expressions for string expressions, including concatenation operator (`+`) -- Semantic compatibility when executing on Node.js (e.g. in unit tests) and AVM +- Use idiomatic TypeScript expressions for string expressions +- Semantic compatibility between AVM execution and TypeScript execution (e.g. in unit tests) ## Principles @@ -40,6 +33,13 @@ Algorand Python has specific [Bytes and String types](https://algorandfoundation ### Option 1 - Direct use of native EcmaScript types + +EcmaScript provides two relevant types for bytes and strings. + + - **string**: The native string type. Supports arbitrary length, concatenation, indexation/slicing of characters plus many utility methods (upper/lower/startswith/endswith/charcodeat/trim etc). Supports concat with binary `+` operator. + - **Uint8Array**: A variable length mutable array of 8-bit numbers. Supports indexing/slicing of 'bytes'. + + ```ts const b1 = "somebytes" @@ -52,7 +52,29 @@ Whilst binary data is often a representation of a utf-8 string, it is not always The Uint8Array type is fit for purpose as an encoding mechanism but the API is not as friendly as it could be for writing declarative contracts. The `new` keyword feels unnatural for something that is ostensibly a primitive type. The fact that it is mutable also complicates the implementation the compiler produces for the AVM. -### Option 2 - Define a class to represent Bytes + + +### Option 2 - Branded strings (TEALScript approach) + + +TEALScript uses a branded `string` to represent `bytes` and native `string` to represent UTF-8 bytes. Base64/Base16 encoding/decoding is performed with specific methods. + +```typescript +const someString = "foo" +const someHexValue = hex("0xdeadbeef") // branded "bytes" +``` + +Bytes and UTF-8 strings are typed via branded `string` types. UTF-8 strings are the most common use case for strings, thus have the JavaScript `String` prototype functions when working with byteslice, which provides a familiar set of function signatures. This option also enables the usage of `+` for concatenation. + +To differentiate between ABI `string` and AVM `byteslice`, a branded type, `bytes`, can be used to represent non-encoded byteslices that may or may not be UTF-8 strings. + +Additional functions can be used when wanting to have string literals of a specific encoding represent a string or byteslice. + + +The downside of this approach is that the string prototype has a huge range of methods that are not all relevant and some of them return `number`, which is [not a semantically relevant type](./2024-05-21_primitive-integer-types.md). + + +### Option 3 - Define a class to represent Bytes A `Bytes` class is defined with a very specific API tailored to operations which are available on the AVM: @@ -75,12 +97,14 @@ class Bytes { ``` -This solution provides great type safety and requires no transpilation to run _correctly_ on Node.js. However, non-primitive types in node have equality checked by reference. Again the `new` keyword feels unnatural. Due to lack of overloading, `+` will not work as expected however concatenations do not require the same understanding of "order of operations" and nesting as numeric operations, so a concat method isn't as unwieldy (but still isn't idiomatic). +This solution provides great type safety and requires no transpilation to run _correctly_ on Node.js. However, non-primitive types in Node.js have equality checked by reference. Again the `new` keyword feels unnatural. Due to lack of overloading, `+` will not work as expected however concatenations do not require the same understanding of "order of operations" and nesting as numeric operations, so a `concat` method isn't as unwieldy (but still isn't idiomatic). ```ts const a = new Bytes("Hello") const b = new Bytes("World") +const ab = a.concat(b) + function testValue(x: Bytes) { // No compile error, but will work on reference not value switch(x) { @@ -95,7 +119,7 @@ function testValue(x: Bytes) { To have equality checks behave as expected we would need a transpilation step to replace bytes values in certain expressions with a primitive type. -### Option 3 - Implement bytes as a class but define it as a type + factory +### Option 4 - Implement bytes as a class but define it as a type + factory We can iron out some of the rough edges of using a class by only exposing a factory method for `Bytes` and a resulting type `bytes`. This removes the need for the `new` keyword and lets us use a 'primitive looking' type alias (`bytes` versus `Bytes` - much like `string` and `String`). We can use [tagged templates](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Template_literals#tagged_templates) to improve the user experience of multipart concat expressions in lieu of having the `+` operator. diff --git a/docs/architecture-decisions/2024-05-21_primitive-integer-types.md b/docs/architecture-decisions/2024-05-21_primitive-integer-types.md index 7a1a1ddb..3d0ff22d 100644 --- a/docs/architecture-decisions/2024-05-21_primitive-integer-types.md +++ b/docs/architecture-decisions/2024-05-21_primitive-integer-types.md @@ -5,7 +5,7 @@ - **Deciders**: Alessandro Cappellato (Algorand Foundation), Joe Polny (Algorand Foundation), Rob Moore (MakerX) - **Date created**: 2024-05-21 - **Date decided**: N/A -- **Date updated**: 2024-05-22 +- **Date updated**: 2024-05-31 ## Context @@ -30,7 +30,7 @@ Algorand Python has specific [UInt64 and BigUint types](https://algorandfoundati - Support uint64 and biguint AVM types - Use idiomatic TypeScript expressions for numeric expressions, including mathematical operators (`+`, `-`, `*`, `/`, etc.) -- Semantic compatibility when executing on Node.js (e.g. in unit tests) and AVM +- Semantic compatibility between AVM execution and TypeScript execution (e.g. in unit tests) ## Principles @@ -40,15 +40,15 @@ Algorand Python has specific [UInt64 and BigUint types](https://algorandfoundati ## Options -### Option 1 - Direct use of native EcmaScript types +### Option 1 - Native types EcmaScript's `number` type is ill-suited to representing either AVM type reliably as it does not have the safe range to cover the full range of a uint64. Being a floating point number, it would also require truncating after division. EcmaScript's `bigint` is a better fit for both types but does not underflow when presented with a negative number, nor does it overflow at any meaningful limit for the AVM types. -If we solved the over/under flow checking with transpilation we still face an issue that `uint64` and `biguint` would not have discrete types and thus, we would have no type safety against accidentally passing a `biguint` to a method that expects a `uint64` and vice versa. +If we solved the over/under flow checking with a custom TypeScript transformer we still face an issue that `uint64` and `biguint` would not have discrete types for the compiler to know the difference between them and also we would have no type safety against accidentally passing a `biguint` to a method that expects a `uint64` and vice versa. -### Option 2 - Define classes to represent the AVM types +### Option 2 - Wrapper classes A `UInt64` and `BigUint` class could be defined which make use of `bigint` internally to perform maths operations and check for over or under flows after each op. @@ -70,22 +70,22 @@ class UInt64 { ``` -This solution provides the ultimate in type safety and semantic/syntactic compatibility, and requires no transpilation to run _correctly_ on Node.js. The semantics should be obvious to anyone familiar with Object Oriented Programming. The downside is that neither EcmaScript nor TypeScript support operator overloading which results in more verbose and unwieldy math expressions. +This solution provides the ultimate in type safety and semantic/syntactic compatibility, and requires no custom TypeScript transformer to run _correctly_ on Node.js. The semantics should be obvious to anyone familiar with Object Oriented Programming. The downside is that neither EcmaScript nor TypeScript support operator overloading which results in more verbose and unwieldy math expressions. The lack of idiomatic TypeScript mathematical operators is a deal breaker that rules this option out. ```ts const a = UInt64(500n) const b = Uint64(256) -// Not supported (a compile error in TS, unhelpful behaviour in ES) +// Not supported (a compile error in TS) const c1 = a + b // Works, but is verbose and unwieldy for more complicated expressions and isn't idiomatic TypeScript const c2 = a.add(b) ``` -### Option 3 - Use tagged/branded number types +### Option 3 - Branded `bigint` -TypeScript allows you to intersect primitive types with a simple interface to brand a value in a way which is incompatible with another primitive branded with a different value within the type system. +TypeScript allows you to intersect primitive types with a simple interface to brand a value in a way which is incompatible with another primitive branded with a different value within the type system. In this option the base type that is branded is `bigint`, which aligns to th discussion in Option 1 about the logical type to represent `uint64` and `biguint`. ```ts // Constructors @@ -97,34 +97,37 @@ type uint64 = bigint & { __type?: 'uint64' } type biguint = bigint & { __type?: 'biguint' } -const a: uint64 = 323n // Declare with type annotation +const a: uint64 = 323n // Declare with type annotation and raw `bigint` literal const b = UInt64(12n) // Declare with factory +const b2 = UInt64(12) // Factory could also take `number` literals (compiler could check they aren't negative and are integers) // c1 type is `bigint`, but we can mandate a type hint with the compiler (c2) const c1 = a + b const c2: uint64 = a + b + // No TypeScript type error, but semantically ambiguous - is a+b performed as a biguint op or a uint64 one and then converted? // (We could detect this as a compiler error though) const c3: biguint = a + b // Type error on b: Argument of type 'uint64' is not assignable to parameter of type 'biguint'. Nice! test(a, b) - function test(x: uint64, y: biguint) { // ... } ``` -This solution looks most like natural TypeScript / EcmaScript and results in math expressions that are much easier to read. The factory methods mimic native equivalents and should be familiar to existing developers. +This solution looks like normal TypeScript and results in math expressions that are much easier to read. The factory methods (e.g. `UInt64(4n)`) mimics native equivalents and should be familiar to existing developers. The drawbacks of this solution are: - - Less implicit type safety as TypeScript will infer the type of any binary math expression to be the base numeric type (`number`). A type annotation will be required where ever an identifier is declared and additional type checking will be required by the compiler to catch instances of assigning one numeric type to the other. - - In order to have 'run on Node.js' semantics of a `uint64` or `biguint` match 'run on the AVM', a transpiler will be required to wrap numeric operations in logic that checks for over and under flows. + - Less implicit type safety for branded types as TypeScript will infer the type of any binary math expression to be the base numeric type (a type annotation will be required where ever an identifier is declared, and the compiler will need to enforce this) + - In order to have TypeScript execution semantics of a `uint64` or `biguint` match the AVM, a custom TypeScript transformer will be required to wrap numeric operations in logic that checks for over and under flows line-by-line; this is straightforward to write though and has been successfully spiked out + - Additional type checking will be required by the compiler to catch instances of assigning one numeric type to the other (accidental implicit assignment) e.g. assigning a `uint64` value to `biguint`. + -TEALScript uses a similar approach to this, but uses `number` as the underlying type rather than `bigint`, which has the aforementioned downside of not being able to safely represent a 64-bit unsigned integer, but does has the advantage of being directly compatible with raw numbers (e.g. `3` rather than `3n`) and JavaScript prototype methods that return `number` like `array.length`. The requirement for semantic compatibility dictates that we need to use `bigint` rather than `number` since it's the correct type to represent the data and we will be able to create wrapper classes for things like arrays that have more explicitly typed methods for things like length. +### Option 4 Explicitly tagged brand types -A variation of the above with non-optional `__type` tags would prevent accidental implicit assignment errors, but require explicit casting on all ops and any methods that return a `number` such as the `length` method on arrays. +A variation of option 3 with non-optional `__type` tags would prevent accidental implicit assignment errors when assigning between (say) `uint64` and `biguint`, but require explicit casting on all ops and any methods that return the base type. ```ts declare function Uint64(v): uint64 @@ -145,13 +148,29 @@ c2 = Uint64(a + b) // ok c2 = (a + b) as uint64 // ok ``` -This introduces a degree of type safety with the in-built TypeScript type system at the expense of legibility. +This introduces a degree of type safety with the in-built TypeScript type system at the significant expense of legibility and writability. + + +### Option 5 Branded `number` (TEALScript approach) + +TEALScript uses a similar approach to option 3, but uses `number` as the underlying type rather than `bigint`. This has the advantage of being directly compatible with casting raw numbers (e.g. `const x: uint64 = 3` rather than `const x: uint64 = 3n`). +Furthermore, any JavaScript prototype methods that return `number` like `array.length` will be similarly able to be directly used and casted to `uint64` rather than wrapping in a factory method (e.g. `const x: uint64 = [1, 2, 3].length` rather than `const x = UInt64([1, 2, 3].length)`). It's not currently clear if any such methods will be exposed within the stub types that emerge in Algorand TypeScript though; if option 1 isn't chosen then ideally we would want to avoid exposing `number` within Algorand TypeScript altogether. Key prototypes that have `number` include `string` (see [Primitive bytes and strings](./2024-05-21_primitive-bytes-and-strings.md)) and array (but TypeScript allows you to define wrapper classes that support [iteration](https://www.typescriptlang.org/docs/handbook/iterators-and-generators.html) [and](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Spread_syntax) [spreading](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Symbol/isConcatSpreadable) so we can likely avoid `Array` prototype). + +If `number` is used as the base brand type for `uint64` and `bigint` is used as the base brand type for `biguint` (a type that TEALScript doesn't implement, so not a breaking change) then accidental implicit assignment errors are prevented by the TypeScript type system. + +A key issue with using `number` as the base type is that per option 1, it's semantically a floating point number, not an integer. It is possible for the compiler to check for and disallow non-integer constant literals though, which would prevent a non-integer value appearing outside of division. A custom TypeScript transformer will need to wrap division operations to allow the result to be truncated as an integer; this is a violation of the semantic compatibility principle, but given a branded type would be used rather than `number` (the fact the base type is `number` is largely hidden from the developer) it probably doesn't violate the principle of least surprise and may be considered an acceptable compromise. + +The other problem with use of `number` as the base brand type is that you will lose precision and get linting errors when representing a number greater than 53-bits as a constant literal e.g. `const x: uint64 = 9007199254740992`. It *may* be possible for a custom TypeScript transformer to get the value before precision is lost (needs investigation) and then disable that particular linting tool, but that is a fairly clear violation of semantic compatibility. The workaround would have to be that the compiler detects numbers > `Number.MAX_SAFE_INTEGER` and complains and instead you would have to use the factory syntax with a `bigint` constant literal e.g. `const x = UInt64(9007199254740992n)`. ## Preferred option -TBD +Either option 3 or option 5 depending on comfort level in using a floating point number type as the base type for `uint64`, requiring extra compiler checks & more complex custom transformers to overcome this, and not being able to cleanly represent very large integers as a constant literal vs lack of TypeScript protection against accidental implicit assignment of `uint64` and `biguint` (but can be checked by the compiler), and needing to avoid prototype methods that return `number` (although this matches semantic compatibility so may be a good idea anyway). + +Option 3 is also a breaking change for TEALScript, which would require `number` literals to either be suffixed with the `bigint` suffix (`n`) or be wrapped in a `UInt64()` factory call. + +Option 1 and 2 are excluded because they don't meet the requirements of semantic compatibility and least surprise. Option 4 is excluded, because the resulting syntax is unpractical. ## Selected option