Jeff Muizelaar edited this page Apr 11, 2017 · 4 revisions

In this document, we'll describe the path WR (short for WebRender) takes to get actual pages on screen.

Initializing and working with the frontend

Renderer::new() creates the WR instance with the following components:

  • RenderBackend (RB). It's not returned from the function, but is instead put to work in a separate thread, communicating via messages with the following objects.
  • RenderApiSender, needed to produce RenderApi instances, each with a unique namespace.
  • Renderer owns the graphics context and does the actual rendering.

When issuing commands through RenderApi, they get serialized and sent over a channel (IPC channel in Servo, MPSC channel in Gecko) to the RenderBackend.
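
The command flow can be sketched with a plain MPSC channel and a worker thread. The `ApiMsg` variants below are illustrative stand-ins, not WebRender's actual message enum:

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical messages the API side sends to the backend; the real
// message enum in WebRender has many more variants.
enum ApiMsg {
    SetRootPipeline(u32),
    GenerateFrame,
}

fn main() {
    let (tx, rx) = mpsc::channel();

    // The backend drains the channel in its own thread, like RenderBackend.
    let backend = thread::spawn(move || {
        let mut frames = 0;
        for msg in rx {
            match msg {
                ApiMsg::SetRootPipeline(_id) => {}
                ApiMsg::GenerateFrame => frames += 1,
            }
        }
        frames
    });

    // The API side just serializes commands onto the channel.
    tx.send(ApiMsg::SetRootPipeline(1)).unwrap();
    tx.send(ApiMsg::GenerateFrame).unwrap();
    drop(tx); // closing the sender lets the backend loop terminate
    assert_eq!(backend.join().unwrap(), 1);
}
```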

Resource management

RenderApi attempts to be consistent with regard to the resource API:

  1. generate_something_key() - get a new unique key that is not yet associated with any data.
  2. add_something(key, ...) - associate the key with actual data (e.g. texels of an image).
  3. push_something_else - there can be several methods that use our key in one way or another. For example, an image can be provided as a part of ClipRegion for any primitive.
  4. update_something(key, ...) - update a part of a resource associated with the key.
  5. delete_*(key) - destroy the associated resource with all its contents.
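
The key lifecycle above can be mocked in a few lines. `MockApi` and its methods are illustrative stand-ins that mirror the shape of the RenderApi, not the real WebRender types:

```rust
use std::collections::HashMap;

// Hypothetical key type; real keys also carry the namespace of the RenderApi.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct ImageKey(u32);

#[derive(Default)]
struct MockApi {
    next_key: u32,
    images: HashMap<ImageKey, Vec<u8>>,
}

impl MockApi {
    // 1. generate a unique key that is not yet associated with any data
    fn generate_image_key(&mut self) -> ImageKey {
        let key = ImageKey(self.next_key);
        self.next_key += 1;
        key
    }
    // 2. associate the key with actual texel data
    fn add_image(&mut self, key: ImageKey, texels: Vec<u8>) {
        self.images.insert(key, texels);
    }
    // 4. replace the data behind an existing key
    fn update_image(&mut self, key: ImageKey, texels: Vec<u8>) {
        self.images.insert(key, texels);
    }
    // 5. destroy the resource together with its contents
    fn delete_image(&mut self, key: ImageKey) {
        self.images.remove(&key);
    }
}

fn main() {
    let mut api = MockApi::default();
    let key = api.generate_image_key();
    api.add_image(key, vec![255; 4]);
    api.update_image(key, vec![0; 4]);
    api.delete_image(key);
    assert!(api.images.is_empty());
}
```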

Setting pipelines and display lists (DL)

A frame is considered complete when it has a root pipeline, set by set_root_pipeline, and a display list, set by set_display_list. The frame doesn't get rendered until generate_frame is called, or any sort of scrolling is requested (scroll, scroll_layer_with_id, etc).

When calling generate_frame, the user can provide a list of new values for the animated properties of a frame. This doesn't force the RenderBackend to re-generate the frame; instead, the last frame the Renderer received is re-used with the new values.

Render Backend (RB)

The backend listens to the user commands (issued via RenderApi) via a channel. Its task is to transform the data from the representation the user provides into one the GPU can consume directly in order to draw it on screen. The result of this process is a Frame object that gets sent to the Renderer.

The threading of the backend, and even its existence, is opaque to the user. It currently runs in a separate thread, but we may transition towards a thread pool with some sort of job system in the future.

Flattening the contexts

The data is provided in the form of a tree, where the nodes are stacking contexts and iframes, and the leaves are primitives. Flattening is the first step the RB performs when processing a scene: it records layers and primitives, saving all the associated CPU/GPU data into a number of containers:

  • stacking_context_store - records all stacking contexts information for the CPU access
  • packed_layers - all GPU information about transformations and bounds of document layers
  • prim_store - CPU and GPU data for the primitives themselves (see the Primitive Store section below)
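
A toy version of the flattening walk might look like the following. The `Node` type and the two flat stores are simplified stand-ins for the real structures:

```rust
// A tree whose nodes are stacking contexts and whose leaves are primitives,
// flattened depth-first into two flat stores.
enum Node {
    StackingContext { name: &'static str, children: Vec<Node> },
    Primitive(&'static str),
}

fn flatten(node: &Node, contexts: &mut Vec<&'static str>, prims: &mut Vec<&'static str>) {
    match node {
        Node::StackingContext { name, children } => {
            contexts.push(*name); // record the context first...
            for child in children {
                flatten(child, contexts, prims); // ...then recurse into its subtree
            }
        }
        Node::Primitive(name) => prims.push(*name),
    }
}

fn main() {
    let tree = Node::StackingContext {
        name: "root",
        children: vec![
            Node::Primitive("rect"),
            Node::StackingContext {
                name: "iframe",
                children: vec![Node::Primitive("text")],
            },
        ],
    };
    let (mut contexts, mut prims) = (Vec::new(), Vec::new());
    flatten(&tree, &mut contexts, &mut prims);
    assert_eq!(contexts, ["root", "iframe"]);
    assert_eq!(prims, ["rect", "text"]);
}
```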

Primitive Store (PS)

PS contains all the information about actual primitives. On the CPU side, it knows about bounding rectangles, common meta-data, as well as specific bits of information for text runs, images, gradients, etc. Meta-data hooks up each primitive with the relevant clip information, CPU & GPU primitive indices, GPU data addresses, and required render tasks.

On the GPU side, the data is split into:

  • geometry - contains actual rectangle shapes and clipping bounds.
  • resource_rects - stores rectangle coordinates for various primitives. This is required for late updates of those coordinates, for example when an external image is updated right before being shown on screen.
  • generic blocks of data of 16, 32, 64, and 128 bytes in length.

Each of these GPU containers is associated with a texture on the shader side. These textures are typically accessed from the vertex shader, in order to read the data about the primitive and place it properly.
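
As a rough sketch of how such a data texture is addressed, a flat block address can be turned into a texel coordinate. The 1024-texel width below is an assumed value for illustration, not WR's actual layout:

```rust
// Assumed width of the data texture backing one of the GPU containers.
const TEXTURE_WIDTH: u32 = 1024;

// Map a flat data address to an (x, y) texel coordinate the vertex shader
// could fetch from.
fn address_to_texel(address: u32) -> (u32, u32) {
    (address % TEXTURE_WIDTH, address / TEXTURE_WIDTH)
}

fn main() {
    assert_eq!(address_to_texel(0), (0, 0));
    assert_eq!(address_to_texel(1025), (1, 1)); // wraps to the second row
}
```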

Building a frame

When all the data is nicely laid out in arrays, we can start converting it into an actual task tree. This is what FrameBuilder::build is doing.

We recalculate the clip/scroll groups and nodes, compute the visibility of stacking contexts, and then derive the following information for all visible items:

  • clipping mask that needs to be computed beforehand and applied when rendering
  • box shadow tasks
  • bounding rectangles
  • actual text glyphs for visible fragments of text
  • gradient stops

All the missing data, like the contents of images being loaded or rasterized glyphs, is requested simultaneously and then waited upon during the build procedure. Some primitives need a separate update pass in order to patch the bits of data (e.g. texture IDs and GPU rectangles) that depended on unknowns during the previous phases; this is done in resolve_primitives.

The purpose of build() is to produce a single root render task. When primitives are registered, they get added to the relevant batchers, which split the primitives into homogeneous groups and lay out the unique data sequentially, allowing many original primitives to be drawn in just a few batched render calls.

Some stacking contexts establish a composite operation, meaning that some processing of the temporary render result needs to occur. For example, a context may need to be alpha-blended on top of the already rendered items, or it may mix two other contexts in a specific way.

You can find more information about the task lifetime on the Life of a Task page.

Depth-sorting

Most of the primitives end up in one of two rendering groups: opaque and transparent. We consider the Z index to correspond to the order a primitive came in with, so that primitives added later are drawn on top of earlier ones.

First, opaque primitives are drawn with Z testing and writing, in the order from front to back. This allows the rasterizer to quickly skip pixels already covered by preceding primitives. We strive to draw as much as possible in the opaque pass.

Second, transparent primitives are drawn with Z testing only, in the order from back to front. This is where we can get alpha blending as well as overdraw, depending on the efficiency of the opaque pass.
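
The two-pass ordering can be sketched on the CPU side. The `Prim` type here is hypothetical; `z` is the submission-order index described above:

```rust
struct Prim {
    z: u32,      // submission order; higher means submitted later (on top)
    opaque: bool,
}

// Returns the z indices in the order the primitives would be drawn:
// opaque front to back first, then transparent back to front.
fn draw_order(prims: &[Prim]) -> Vec<u32> {
    let mut opaque: Vec<_> = prims.iter().filter(|p| p.opaque).collect();
    let mut transparent: Vec<_> = prims.iter().filter(|p| !p.opaque).collect();
    opaque.sort_by(|a, b| b.z.cmp(&a.z));      // front to back (Z test + write)
    transparent.sort_by(|a, b| a.z.cmp(&b.z)); // back to front (Z test only)
    opaque.iter().chain(transparent.iter()).map(|p| p.z).collect()
}

fn main() {
    let prims = [
        Prim { z: 0, opaque: true },
        Prim { z: 1, opaque: false },
        Prim { z: 2, opaque: true },
        Prim { z: 3, opaque: false },
    ];
    // Opaque 2 then 0, then transparent 1 then 3.
    assert_eq!(draw_order(&prims), [2, 0, 1, 3]);
}
```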

One of the optimizations we do is splitting a simple rounded-corner rectangle into one opaque rectangle and a number of smaller transparent ones. This fills more screen space in the opaque pass and improves the shading performance of a document.

Assigning to render passes

The task tree represents all the work the GPU needs to do in the shape of a tree, where child nodes are dependencies. The depth of this tree determines the number of passes that need to occur in order to execute all the tasks. Each pass is a chunk of work that has no inner dependencies and only depends on the result of the previous pass (if there is one).

For example, considering this task tree:

```
A -> B -> C
  -> D -> E -> F
```

The number of passes will be 4, with the following tasks: [[F], [C, E], [B, D], [A]]. This is what the assign_to_passes method does - flattening the task tree into passes.
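
Under the simplifying assumption that a task's depth in the tree alone picks its pass, the flattening can be sketched as follows. The real assign_to_passes differs in detail, but this reproduces the example grouping:

```rust
struct Task {
    name: char,
    children: Vec<Task>, // children are dependencies of this task
}

fn task(name: char, children: Vec<Task>) -> Task {
    Task { name, children }
}

// Height of the tree, which equals the number of passes needed.
fn max_depth(t: &Task) -> usize {
    1 + t.children.iter().map(max_depth).max().unwrap_or(0)
}

fn collect(t: &Task, level: usize, passes: &mut Vec<Vec<char>>) {
    // The deepest tasks land in pass 0; the root lands in the last pass,
    // so every task runs after the passes containing its dependencies.
    let pass = passes.len() - 1 - level;
    passes[pass].push(t.name);
    for child in &t.children {
        collect(child, level + 1, passes);
    }
}

fn assign_to_passes(root: &Task) -> Vec<Vec<char>> {
    let mut passes = vec![Vec::new(); max_depth(root)];
    collect(root, 0, &mut passes);
    passes
}

fn main() {
    // The example tree: A depends on B and D, B on C, D on E, E on F.
    let tree = task('A', vec![
        task('B', vec![task('C', vec![])]),
        task('D', vec![task('E', vec![task('F', vec![])])]),
    ]);
    assert_eq!(
        assign_to_passes(&tree),
        vec![vec!['F'], vec!['C', 'E'], vec!['B', 'D'], vec!['A']]
    );
}
```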

Each pass is associated with a multi-layer texture that stores its results, except for the last pass, which always draws to the screen. The texture needs to be multi-layered, since we don't statically know how much texel space we'll need to assign to it. When traversing the task tree, we gather all required render targets, and then call RenderPass::allocate_target for each.

Texture allocation

The texture cache is part of the Render Backend. It stores a set of texture pages that are separated by format: A8, RGB8, and RGBA8. There may be more in the future.

When the RB receives a request to add an image, for example, the set of texture pages with the relevant format is considered by TextureCache::allocate. If we aren't able to fit the data into one of the existing pages, we allocate a new one.

Each page serves as a texture atlas holding multiple textures. If a requested texture is too big for even an empty page, it gets split into multiple tiles.
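
A heavily simplified sketch of the page-selection logic: here a page only tracks its remaining area, whereas real texture-cache allocation packs 2D rectangles, so this is illustrative only (the 2048x2048 page size is also an assumption):

```rust
// Assumed page dimensions for the sketch.
const PAGE_AREA: u32 = 2048 * 2048;

struct Page {
    free_area: u32,
}

// Returns the index of the page the w x h request was placed in,
// opening a new page when nothing fits.
fn allocate(pages: &mut Vec<Page>, w: u32, h: u32) -> usize {
    let needed = w * h;
    if let Some(i) = pages.iter().position(|p| p.free_area >= needed) {
        pages[i].free_area -= needed;
        return i;
    }
    pages.push(Page { free_area: PAGE_AREA - needed });
    pages.len() - 1
}

fn main() {
    let mut pages = Vec::new();
    assert_eq!(allocate(&mut pages, 1024, 1024), 0); // opens the first page
    assert_eq!(allocate(&mut pages, 2048, 1024), 0); // still fits in page 0
    assert_eq!(allocate(&mut pages, 2048, 1024), 1); // page 0 is full, open page 1
}
```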

Texture data is stored within the cache as-is. No safe borders are added around the edges, so any sampling of the texture data needs to take the actual bounds into account and force the pixel shader to avoid sampling outside of the allocated rectangle. Failure to do so will result in artefacts across the image borders.

One way to prevent sampling outside the bounds is to clamp the texture coordinates to the rectangle half a texel inside the allocated region, while making sure that only lod[0] is getting sampled (e.g. with textureLod). Here is an extract from the image rendering code showing this:

```glsl
// in VS, having `st0` and `st1` as the original image bounds
vStRect = vec4(min(st0, st1) + half_texel, max(st0, st1) - half_texel);
// in FS, having `st` as the texture coordinate
st = clamp(st, vStRect.xy, vStRect.zw);
```

Renderer

The renderer receives a RendererFrame object from RB, which contains a list of RenderPass objects and all the GPU data that needs to be consumed by the passes.

Texture cache also provides a list of requests to update chunks of the cached textures. We go through these requests in update_texture_cache and issue GPU commands to upload data into the textures.

The render passes know how much target space they need, and the tasks have already been assigned exact rectangles to render into. All the Renderer needs to do is actually allocate the texture space on the GPU and clear it. This is the first thing that happens in draw_tile_frame, after which we go through the actual batches in each pass and issue the corresponding instanced draw calls with the right shaders by calling draw_instanced_batch.

Fetching primitive data

Essential instance data (that differentiates one instance from another in an instanced draw call) is provided via instanced vertex attributes: aGlobalPrimId, aPrimitiveAddress, aTaskIndex, etc.

The first thing a vertex shader typically does is read that instance data into a PrimitiveInstance struct.

Then, the Primitive is composed by fetching the relevant data from the layer storage, clip areas, geometry, and others:

```glsl
Primitive load_primitive_custom(PrimitiveInstance pi) {
    Primitive prim;

    prim.layer = fetch_layer(pi.layer_index);
    prim.clip_area = fetch_clip_area(pi.clip_task_index);
    prim.task = fetch_alpha_batch_task(pi.render_task_index);

    PrimitiveGeometry pg = fetch_prim_geometry(pi.global_prim_index);
    prim.local_rect = pg.local_rect;
    prim.local_clip_rect = pg.local_clip_rect;

    prim.prim_index = pi.specific_prim_index;
    prim.sub_index = pi.sub_index;
    prim.user_data = pi.user_data;
    prim.z = float(pi.z);

    return prim;
}
```

Some vertex shaders then fetch more specific GPU data from the texture storages, e.g. gradients, dependent tasks, glyphs, etc. Most of the actual transformation work is done in write_vertex, which is defined in prim_shared.glsl.

Each shader may have different variants depending on the features passed in during shader linking. WR prepends the shader code with the feature declarations before passing the code for linking.

WR_FEATURE_TRANSFORM, for example, assumes the layer transform is no longer axis-aligned, and provides the write_transform_vertex alternative to write_vertex. The client is also expected to call init_transform_fs from the fragment shader to obtain the transparency value, which depends on whether the pixel is on the transformed primitive or not.

WR_FEATURE_CLIP computes the coordinate of a pixel within the clip task in write_clip, if there is one. A subsequent do_clip() call from the fragment shader returns the clip transparency value. A special task index is reserved for non-clipped primitives, which write_clip recognizes.

Generating clip masks

A primitive may be associated with multiple clip items, which can be image masks or rounded corners. The way we combine these clips is not trivial.

First, we intersect all of the clip items and find the bounding box of the intersection. Space for this box is allocated in render-target space, from the A8-format heap, and a dependent clip task is created to fill it up.

Considering that the target box starts out filled with the value 1, we draw each clip instance into it, with the vertex shader ensuring that the whole box is covered. The resulting values from the pixel shader (0 for transparent, 1 for opaque) are then multiplicatively blended on top of the previous contents of the intersection box. The shaders used for clip rendering carry the cs_clip_ prefix.
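
The multiplicative accumulation can be sketched on the CPU with flat per-pixel masks. The `combine_clips` helper is hypothetical, not WR code:

```rust
// Combine several per-pixel clip masks (0.0 transparent, 1.0 opaque)
// by multiplicative blending over a box that starts fully opaque.
fn combine_clips(masks: &[Vec<f32>]) -> Vec<f32> {
    let mut out = vec![1.0; masks[0].len()]; // the target box starts at 1
    for mask in masks {
        for (o, &m) in out.iter_mut().zip(mask) {
            *o *= m; // each clip item multiplies in its coverage
        }
    }
    out
}

fn main() {
    let rounded_corner = vec![1.0, 0.5, 0.0];
    let image_mask = vec![1.0, 1.0, 1.0];
    assert_eq!(combine_clips(&[rounded_corner, image_mask]), [1.0, 0.5, 0.0]);
}
```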

As a result, the box contains alpha values for the individual pixels of the resulting image. When drawing a primitive, the actual pixel shader then checks whether its coordinate has an associated value in the clip task, and either reads it to produce the final alpha, or assumes full transparency if the coordinate falls outside.