Skip to content

Latest commit

 

History

History
424 lines (337 loc) · 21.3 KB

Rationale.md

File metadata and controls

424 lines (337 loc) · 21.3 KB

Design Rationale

This document describes rationales for WebAssembly's design decisions, acting as footnotes to the main design text, keeping the main specification easier to read, and making it easier to revisit decisions later without having to plow through all the issues and pull requests. This rationale document tries to list how decisions were made, and where tradeoffs were made for the sake of language ergonomics, portability, performance, security, and Getting Things Done.

WebAssembly was designed incrementally, with multiple implementations being pursued concurrently. As the MVP stabilizes and we get experience from real-world codebases, we'll revisit the alternatives listed below, reevaluate the tradeoffs and update the design before the MVP is finalized.

Why AST?

Why not a stack-, register- or SSA-based bytecode?

Basic Types Only

WebAssembly only represents a few types.

  • More complex types can be formed from these basic types. It's up to the source language compiler to express its own types in terms of the basic machine types. This allows WebAssembly to present itself as a virtual ISA, and lets compilers target it as they would any other ISA.
  • These types are efficiently executed by all modern architectures.
  • Smaller types (such as i8 and i16) are usually no more efficient and in languages like C/C++ are only semantically meaningful for memory accesses since arithmetic get widened to i32 or i64. Avoiding them at least for MVP makes it easier to implement a WebAssembly VM.
  • Other types (such as f16, i128) aren't widely supported by existing hardware and can be supported by runtime libraries if developers wish to use them. Hardware support is sometimes uneven, e.g. some support load/store of f16 only whereas other hardware also supports scalar arithmetic on f16, and yet other hardware only supports SIMD arithmetic on f16. They can be added to WebAssembly later without compromising MVP.
  • More complex object types aren't semantically useful for MVP: WebAssembly seeks to provide the primitive building blocks upon which higher-level constructs can be built. They may become useful to support other languages, especially when considering garbage collection.

Load/Store Addressing

Load/store instructions include an immediate offset used for addressing. This is intended to simplify folding of offsets into complex address modes in hardware, and to simplify bounds checking optimizations. It offloads some of the optimization work to the compiler that targets WebAssembly, executing on the developer's machine, instead of performing that work in the WebAssembly compiler on the user's machine.

Alignment Hints

Load/store instructions contain alignment hints. This makes it easier to generate efficient code on certain hardware architectures.

Either tooling or an explicit opt-in "debug mode" in the spec could allow execution of a module in a mode that threw exceptions on misaligned access. This mode would incur some runtime cost for branching on most platforms which is why it isn't the specified default.

Out of Bounds

The ideal semantics is for out-of-bounds accesses to trap, but the implications are not yet fully clear.

There are several possible variations on this design being discussed and experimented with. More measurement is required to understand the associated tradeoffs.

  • After an out-of-bounds access, the instance can no longer execute code and any outstanding JavaScript ArrayBuffer aliasing the linear memory are detached.
    • This would primarily allow hoisting bounds checks above effectful operators.
    • This can be viewed as a mild security measure under the assumption that while the sandbox is still ensuring safety, the instance's internal state is incoherent and further execution could lead to Bad Things (e.g., XSS attacks).
  • To allow for potentially more-efficient memory sandboxing, the semantics could allow for a nondeterministic choice between one of the following when an out-of-bounds access occurred.
    • The ideal trap semantics.
    • Loads return an unspecified value.
    • Stores are either ignored or store to an unspecified location in the linear memory.
    • Either tooling or an explicit opt-in "debug mode" in the spec should allow execution of a module in a mode that threw exceptions on out-of-bounds access.

Resizing

To allow efficient engines to employ virtual-memory based techniques for bounds checking, memory sizes are required to be page-aligned. For portability across a range of CPU architectures and operating systems, WebAssembly defines a fixed page size. Programs can depend on this fixed page size and still remain portable across all WebAssembly engines. 64KiB represents the least common multiple of many platforms and CPUs. In the future, WebAssembly may offer the ability to use larger page sizes on some platforms for increased TLB efficiency.

Linear memory disabled if no linear memory section

See #107.

Control Flow

Structured control flow provides simple and size-efficient binary encoding and compilation. Any control flow—even irreducible—can be transformed into structured control flow with the Relooper algorithm, with guaranteed low code size overhead, and typically minimal throughput overhead (except for pathological cases of irreducible control flow). Alternative approaches can generate reducible control flow via node splitting, which can reduce throughput overhead, at the cost of increasing code size (potentially very significantly in pathological cases). Also, more expressive control flow constructs may be added in the future.

Locals

C/C++ makes it possible to take the address of a function's local values and pass this pointer to callees or to other threads. Since WebAssembly's local variables are outside the address space, C/C++ compilers implement address-taken variables by creating a separate stack data structure within linear memory. This stack is sometimes called the "aliased" stack, since it is used for variables which may be pointed to by pointers.

This prevents WebAssembly from performing clever optimizations on the stack and liveness of such variables, but this loss isn't expected to be consequential. Common C compiler optimizations such as LLVM's global value numbering effectively split address-taken variables into parts, shrinking the range where they actually need to have their address taken, and creating new ranges where they can be allocated as local variables.

Conversely, non-address taken values which are usually on the stack are instead represented as locals inside functions. This effectively means that WebAssembly has an infinite set of registers, and can choose to spill values as it sees fit in a manner unobservable to the hosted code. This implies that there's a separate stack, unaddressable from hosted code, which is also used to spill return values. This allows strong security properties to be enforced, but does mean that two stacks are maintained (one by the VM, the other by the compiler which targets WebAssembly) which can lead to some inefficiencies.

Local variables are not in Static Single Assignment (SSA) form, meaning that multiple incoming SSA values which have separate liveness can "share" the storage represented by a local through the set_local operator. From an SSA perspective, this means that multiple independent values can share a local variable in WebAssembly, which is effectively a kind of pre-coloring that clever producers can use to pre-color variables and give hints to a WebAssembly VM's register allocation algorithms, offloading some of the optimization work from the WebAssembly VM.

Variable-Length Argument Lists

C and C++ compilers are expected to implement variable-length argument lists by storing arguments in a buffer in linear memory and passing a pointer to the buffer. This greatly simplifies WebAssembly VM implementations by punting this ABI consideration to the front-end compiler. It does negatively impact performance, but variable-length calls are already somewhat slow.

Multiple Return Values

TODO

Indirect Calls

The table-based scheme for indirect function calls was motivated by the need to represent function pointers as integer values that can be stored into the linear memory, as well as to enforce basic safety properties such as calling a function with the wrong signature does not destroy the safety guarantees of WebAssembly. In particular, an exact signature match implies an internal machine-level ABI match, which some engines require to ensure safety. An indirection also avoids a possible information leak through raw code addresses.

Languages like C and C++ that compile to WebAssembly also imposed requirements, such as the uniqueness of function pointers and the ability to compare function pointers to data pointers, or treat data as function pointers.

Several alternatives to direct indices with a heterogeneous indirect function table were considered, from alternatives with multiple tables to statically typed function pointers that can be mapped back and forth to integers. With the added complication of dynamic linking and dynamic code generation, none of these alternatives perfectly fit the requirements.

The current design requires two dynamic checks when invoking a function pointer: a bounds check against the size of the indirect function table and a signature check for the function at that index against an expected signature. Some dynamic optimization techniques (e.g. inline caches, or a one-element cache), can reduce the number of checks in common cases. Other techniques such as trading a bounds check for a mask or segregating the table per signature to require only a bounds check could be considered in the future. Also, if tables are small enough, an engine can internally use per-signature tables filled with failure handlers to avoid one check.

Expressions with Control Flow

Expression trees offer significant size reduction by avoiding the need for set_local/get_local pairs in the common case of an expression with only one immediate use. Control flow "statements" are in fact expressions with result values, thus allowing even more opportunities to build bigger expression trees and further reduce set_local/get_local usage (which constitute 30-40% of total bytes in the polyfill prototype). Additionally, these primitives are useful building blocks for WebAssembly-generators (including the JavaScript polyfill prototype).

Limited Local Nondeterminism

There are a few obvious cases where nondeterminism is essential to the API, such as random number generators, date/time functions or input events. The WebAssembly specification is strict when it comes to other sources of limited local nondeterminism of operators: it specifies all possible corner cases, and specifies a single outcome when this can be done reasonably.

Ideally, WebAssembly would be fully deterministic because a fully deterministic platform is easier to:

  • Reason about.
  • Implement.
  • Test portably.

Nondeterminism is only specified as a compromise when there is no other practical way to:

  • Achieve portable native performance.
  • Lower resource usage.
  • Reduce implementation complexity (both of WebAssembly VMs as well as compilers generating WebAssembly binaries).
  • Allow usage of new hardware features.
  • Allows implementations to security-harden certain usecases.

When nondeterminism is allowed into WebAssembly it is always done in a limited and local manner. This prevents the entire program from being invalid, as would be the case with C++ undefined behavior.

As WebAssembly gets implemented and tested with multiple languages on multiple architectures there may be a need to revisit some of the decisions:

  • When all relevant hardware implement features the same way then there's no need to add nondeterminism to WebAssembly when realistically there's only one mapping from WebAssembly expression to ISA-specific operators. One such example is floating-point: at a high-level most basic instructions follow IEEE-754 semantics, it is therefore not necessary to specify WebAssembly's floating-point operators differently from IEEE-754.
  • When different languages have different expectations then it's unfortunate if WebAssembly measurably penalizes one's performance by enforcing determinism which that language doesn't care about, but which another language may want.

NaN bit pattern propagation

In general, WebAssembly's floating point operations propagate NaN bit patterns. When an operation has a NaN operand, it returns a NaN result with the same bit pattern. This is done in accordance with IEEE 754-2008, and it also has the desirable property of making it easier to port interpreters to WebAssembly that use NaN-boxing, because they can rely on the property that if an arithmetic operation has no non-canonical NaNs as input, its output is also canonical.

The specific bit-pattern rules are modeled after what numerous popular hardware architectures do. Note that IEEE 754-1985 had looser rules for NaN bit pattern encodings than IEEE 754-2008, and some hardware architectures, notably MIPS, historically behaved differently than other architectures. However, since the publication of IEEE 754-2008, MIPS has added a configuration mode (NAN2008) which enables support for the new rules.

In particular, the sign bit of generated NaNs is nondeterministic since x86 generates NaNs with it set to 1 while other architectures generate NaNs with it set to 0.

Integer operations

WebAssembly's signed integer divide rounds its result toward zero. This is not because of a lack of sympathy for better alternatives, but out of practicality. Because all popular hardware today implements rounding toward zero, and because C and many other languages now specify rounding to zero, having WebAssembly in the middle doing something different would mean divisions would have to be doubly complicated.

Similarly, WebAssembly's shift operators mask their shift counts to the number of bits in the shifted value. Confusingly, this means that shifting a 32-bit value by 32 bits is an identity operation, and that a left shift is not equivalent to a multiplication by a power of 2 because the overflow behavior is different. Nevertheless, because several popular hardware architectures today implement this masking behavior, and those that don't can typically emulate it with a single extra mask instruction, and because several popular source languages, including JavaScript and C#, have come to specify this behavior too, we reluctantly adopt this behavior as well.

Motivating Scenarios for Feature Testing

  1. Post-MVP, i32.min_s is introduced. A WebAssembly developer updates their toolkit so that the compiler may leverage i32.min_s. The developer's WebAssembly module works correctly both on execution environments at MVP, as well as those supporting i32.min_s.
  • A variant of this, where a few more new opcodes are available, the compiler is updated to be able to leverage all of them, but not all execution targets support all of them. The developer wants to reach as many of their customers as possible, while at the same time providing them with the best experience possible. The developer has to balance the cost of the test matrix resulting from the combinations of possible feature configurations.
  1. Post-MVP, module authors may now use Threading APIs in the browser. A developer wants to leverage multithreading in their module.
  • In one variant of the scenario, our developer does not want to pay the engineering cost of developing and supporting a threaded and non-threaded version of their code. They opt not to support MVP targets, and only support post-MVP targets. End-users (browser users) get some message indicating they need MVP support.

  • In another variant, our developer explicitly authors both MVP-only and post- MVP (with threads) code.

  1. SIMD support is not universally equivalent on all targets. While polyfill variants of SIMD APIs are available, a developer prefers writing dedicated SIMD and non-SIMD versions of their compression algorithm, because the non-SIMD version performs better in environments without SIMD support, when compared to the SIMD polyfill. They package their compression code for reuse by third parties.

  2. An application author is assembling together an application by reusing modules such as those developed in the scenarios above. The application author's development environment is able to quickly and correctly identify the platform dependencies (e.g. threading, SIMD) and communicate back to the application author the implications these dependencies have on the end-application. Some APIs exposed from the threading-aware module are only pertinent to environments supporting threading. As a consequence, the application author needs to write specialized code when threads are/are not supported. (Note: we should understand this scenario for both forms of WebAssembly reuse currently imagined: dynamic linking and static imports.)

  3. The compression algorithm described in scenario 3 is deployed on a restrictive execution environment, as part of an application. In this environment, a process may not change memory page access protection flags (e.g. certain gaming consoles, to investigate server side deployment scenarios). The compression module is compiled by the WebAssembly environment, enabling the configuration most specific to the target (i.e. with/without Threads, SIMD, etc).

  • A variant of this scenario where the environment is additionally separating storage into system-visible and application-visible, the latter not being able to contain machine-executable code (certain phones, to investigate if gaming consoles or server side have a similar sandboxing mechanism).

Why a binary encoding?

Given that text is so compressible and it is well known that it is hard to beat gzipped source, is there any win from having a binary format over a text format? Yes:

  • Large reductions in payload size can still significantly decrease the compressed file size.
    • Experimental results from a polyfill prototype show the gzipped binary format to be about 20-30% smaller than the corresponding gzipped asm.js.
  • A binary format that represents the names of variables and functions with raw indices instead of strings is much faster to decode: array indexing vs. dictionary lookup.
    • Experimental results from a polyfill prototype show that decoding the binary format is about 23× faster than parsing the corresponding asm.js source (using this demo, comparing just parsing in SpiderMonkey (no validation, IR generation) to just decoding in the polyfill (no asm.js code generation).
  • A binary format enables optimizations that reduce the memory usage of decoded ASTs without increasing size or reducing decode speed.

Why a layered binary encoding?

  • We can do better than generic compression because we are aware of the AST structure and other details:
    • For example, macro compression that deduplicates AST trees can focus on AST nodes + their children, thus having O(nodes) entities to worry about, compared to generic compression which in principle would need to look at O(bytes*bytes) entities. Such macros would allow the logical equivalent of #define ADD1(x) (x+1), i.e., to be parametrized. Simpler macros (#define ADDX1 (x+1)) can implement useful features like constant pools.
    • Another example is reordering of functions and some internal nodes, which we know does not change semantics, but can improve general compression.
  • JITs and simple developer tooling do not benefit from compression, so layering allows the related development and maintenance burden to be offloaded to reusable tools/libraries.
  • Each of the layers works to find compression opportunities to the best of its abilities, without encroaching upon the subsequent layer's compression opportunities.
  • Existing web standards demonstrate many of the advantages of a layered encoding strategy.