feat: use `TextEncoder` and `TextDecoder` for utf8 strings #4513

seia-soto · 2024-12-10T08:29:19Z

This PR replaces the punycode encoder and decoder with `TextEncoder` and `TextDecoder` for UTF-8 strings.

- The BOM character `\ufeff` should be skipped when decoding to ensure the original form.
- `Uint8Array.subarray` doesn't copy the array; it returns a view over the same underlying memory.
- `TextEncoder.encodeInto` doesn't append a terminating NULL character (U+0000), but that doesn't matter because `TextDecoder` simply stops at the end of the buffer it is given.
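
The three behaviours noted above can be checked with a short, standalone sketch. It uses only the standard `TextEncoder`, `TextDecoder`, and `Uint8Array` APIs; the buffer size, offsets, and sample strings are arbitrary choices for illustration and are not taken from the PR. Which BOM option the PR itself passes to `TextDecoder` is not shown in this conversation, so both variants are demonstrated.

```typescript
const encoder = new TextEncoder();

// 1. subarray() returns a view over the same memory, not a copy.
const buffer = new Uint8Array(16);
const view = buffer.subarray(4);
view[0] = 0xff;
const sharesMemory = buffer[4] === 0xff;

// 2. encodeInto() reports how much it consumed and produced, and appends
//    no terminator: the byte after the payload keeps its previous value.
buffer.fill(0xaa);
const { read, written } = encoder.encodeInto('héllo', buffer.subarray(4));
// 'héllo' is 5 UTF-16 code units but 6 UTF-8 bytes ('é' takes 2 bytes).
const byteAfterPayload = buffer[4 + written];

// 3. Decoding a bounded view stops at the end of that view.
const decoded = new TextDecoder().decode(buffer.subarray(4, 4 + written));

// 4. By default TextDecoder drops a leading BOM; `ignoreBOM: true` keeps it,
//    so the decoded string is identical to the one that was encoded.
const bomBytes = encoder.encode('\ufeffabc');
const stripped = new TextDecoder('utf-8').decode(bomBytes);
const preserved = new TextDecoder('utf-8', { ignoreBOM: true }).decode(bomBytes);

console.log({ sharesMemory, read, written, byteAfterPayload, decoded, stripped, preserved });
```

Because no terminator is written, the reader has to learn the payload length some other way, which is what the length-prefix discussion further down is about.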

To be safe:

- The performance should be evaluated carefully.
- The output binary size needs to be evaluated carefully.

seia-soto · 2024-12-11T07:22:57Z

> (6635933 / 6634657) * 100 // raw > ads + trackers + annoyances (104550 network + 69076 hide)
100.01923234313395

master, af1a9c9

aa@MacBookPro adblocker % yarn tsx ./tools/engine-size.ts 
> ads (49534 network + 38171 hide)
 + raw 3358745 bytes
 + gzip 1783830 bytes
 + brotli 1467652 bytes
> ads (0 network + 38171 hide)
 + raw 1825421 bytes
 + gzip 880805 bytes
 + brotli 708974 bytes
> ads (49534 network + 0 hide)
 + raw 1533629 bytes
 + gzip 905016 bytes
 + brotli 765458 bytes
> ads + trackers (102405 network + 38324 hide)
 + raw 4958193 bytes
 + gzip 2635039 bytes
 + brotli 2172546 bytes
> ads + trackers (0 network + 38324 hide)
 + raw 1847893 bytes
 + gzip 890432 bytes
 + brotli 715577 bytes
> ads + trackers (102405 network + 0 hide)
 + raw 3110601 bytes
 + gzip 1745248 bytes
 + brotli 1459959 bytes
> ads + trackers + annoyances (104585 network + 69043 hide)
 + raw 6635933 bytes
 + gzip 3468626 bytes
 + brotli 2862312 bytes
> ads + trackers + annoyances (0 network + 69043 hide)
 + raw 3455897 bytes
 + gzip 1674036 bytes
 + brotli 1363518 bytes
> ads + trackers + annoyances (104585 network + 0 hide)
 + raw 3180345 bytes
 + gzip 1793551 bytes
 + brotli 1501761 bytes

seia-soto:textencoder, 1de005e

aa@MacBookPro adblocker % yarn tsx ./tools/engine-size.ts 
> ads (49501 network + 38172 hide)
 + raw 3357509 bytes
 + gzip 1782885 bytes
 + brotli 1468373 bytes
> ads (0 network + 38172 hide)
 + raw 1824997 bytes
 + gzip 880743 bytes
 + brotli 706714 bytes
> ads (49501 network + 0 hide)
 + raw 1532817 bytes
 + gzip 904485 bytes
 + brotli 764578 bytes
> ads + trackers (102370 network + 38325 hide)
 + raw 4957369 bytes
 + gzip 2634088 bytes
 + brotli 2171825 bytes
> ads + trackers (0 network + 38325 hide)
 + raw 1847925 bytes
 + gzip 890324 bytes
 + brotli 719134 bytes
> ads + trackers (102370 network + 0 hide)
 + raw 3109745 bytes
 + gzip 1744659 bytes
 + brotli 1459817 bytes
> ads + trackers + annoyances (104550 network + 69076 hide)
 + raw 6634657 bytes
 + gzip 3468231 bytes
 + brotli 2859315 bytes
> ads + trackers + annoyances (0 network + 69076 hide)
 + raw 3455473 bytes
 + gzip 1674098 bytes
 + brotli 1362984 bytes
> ads + trackers + annoyances (104550 network + 0 hide)
 + raw 3179489 bytes
 + gzip 1793145 bytes
 + brotli 1501288 bytes

remusao · 2024-12-11T13:02:21Z

packages/adblocker/src/data-view.ts

-      this.buffer[this.pos++] = str.charCodeAt(i);
-    }
+    const { written } = TEXT_ENCODER.encodeInto(raw, this.buffer.subarray(this.pos + 4));
+    this.pushUint32(written);


Isn't that wasteful in terms of final engine size?

`encodeInto` doesn't emit a C-style NULL character to mark the end of the string. We could detect the end of the string dynamically when reading the binary, but I think it's simpler to store the final length of the string in front of it.

My comment was about using 32 bits instead of potentially 8 bits or 16 bits for smaller strings.

I think we can change this to 16 bits (= ~65535 chars). 👍

Reduced in d6032eb (#4513)

I think they're barely called, which is not what I expected. I only got 5 logs:

17 65 989 117 46 93

I prefer using `getLength` and `pushLength`, since `pushUint16` breaks the test.

> I think they're barely called, which is not what I expected. I only got 5 logs:
>
> 17 65 989 117 46 93

@seia-soto Did you figure out why they are barely called and whether that's expected? If this method is almost never called, it may be preferable to keep simpler code for the variable-length part.

@remusao I collected all `pushUTF8` calls and these are the entries:

- Tracker DB properties
- Cosmetic selectors with unicode
- Use of `optionValue` in network filters (`csp`, `redirect`, and `replace`)
- Resource content (including scriptlets)
- Preprocessor conditions

I don't think we will see those frequently in a testing or benchmarking setup.

> I don't think we will see those frequently in a testing or benchmarking setup.

We need to check with production assets being loaded. I don't think you will learn anything otherwise. With which lists did you do the measurement above?

The point being: if this method is not called a lot when loading full assets with all lists, resources, etc., then I think it's best to keep the simpler version without the variable-length logic, which only makes sense if it has a measurable impact on the final binary size.
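
To make the trade-off in this thread concrete, here is a minimal sketch of a variable-length length prefix compared with fixed 16-bit and 32-bit prefixes. The encoding used (one byte for lengths below 255, otherwise a marker byte followed by a 32-bit little-endian value) is an assumption chosen for illustration; the actual `pushLength` / `getLength` implementation in `data-view.ts` is not shown in this conversation and may differ. `writeLength` and `readLength` are hypothetical helper names, and `sizeOfLength` only borrows its name from a commit title. The sample lengths are the six values logged earlier in the thread.

```typescript
// Hypothetical scheme, NOT necessarily what the adblocker uses:
// lengths below 255 take one byte, anything larger takes a 0xff marker
// byte followed by a 32-bit little-endian length.
const MARKER = 255;

function sizeOfLength(length: number): number {
  return length < MARKER ? 1 : 5;
}

// Writes the prefix at `pos` and returns the position right after it.
function writeLength(buf: Uint8Array, pos: number, length: number): number {
  if (length < MARKER) {
    buf[pos] = length;
    return pos + 1;
  }
  buf[pos] = MARKER;
  buf[pos + 1] = length & 0xff;
  buf[pos + 2] = (length >>> 8) & 0xff;
  buf[pos + 3] = (length >>> 16) & 0xff;
  buf[pos + 4] = (length >>> 24) & 0xff;
  return pos + 5;
}

// Reads a prefix at `pos`; `next` is the position of the payload.
function readLength(buf: Uint8Array, pos: number): { length: number; next: number } {
  const first = buf[pos];
  if (first !== MARKER) {
    return { length: first, next: pos + 1 };
  }
  const length =
    (buf[pos + 1] | (buf[pos + 2] << 8) | (buf[pos + 3] << 16) | (buf[pos + 4] << 24)) >>> 0;
  return { length, next: pos + 5 };
}

// Prefix overhead for the string lengths logged in the thread.
const observedLengths = [17, 65, 989, 117, 46, 93];
const variableBytes = observedLengths.reduce((sum, n) => sum + sizeOfLength(n), 0);
const fixed16Bytes = observedLengths.length * 2;
const fixed32Bytes = observedLengths.length * 4;

console.log({ variableBytes, fixed16Bytes, fixed32Bytes });
```

With so few calls, the difference between the schemes is a handful of bytes, which is the reason the thread asks for measurements on production assets before keeping the extra logic.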

seia-soto · 2024-12-12T07:37:08Z

> 147.50420889870574 / 156.1296767089117 // benchEngineDeserialization
0.944754463135876
> 147.50420889870574 / 148.7394726802865 // benchEngineSerialization
0.9916951179177841

seia-soto:textencoder, 65764e9

benchEngineDeserialization: 147.50420889870574 op/s
benchEngineSerialization: 147.50420889870574 op/s

master, de7bfb5

benchEngineDeserialization: 156.1296767089117 op/s
benchEngineSerialization: 148.7394726802865 op/s

seia-soto · 2024-12-24T05:22:07Z

This PR is awaiting final review. Any further changes are expected to be categorized as performance improvements.

packages/adblocker/src/data-view.ts

chrmod

What about other uses of encode in this file?

chrmod · 2025-01-08T13:53:57Z

packages/adblocker/src/data-view.ts

+    // Restore pos to push length
+    this.setPos(pos);
+    this.pushLength(written);
+    // Reflect written bytes to pos
+    this.setPos(this.pos + written);


Suggested change

-    // Restore pos to push length
-    this.setPos(pos);
-    this.pushLength(written);
-    // Reflect written bytes to pos
-    this.setPos(this.pos + written);
+    this.pushLength(written);
+    this.setPos(pos + written);

We can't do that here. Although `pushLength` is called after `encodeInto`, the length prefix still has to sit before the actual payload in the buffer. With the suggested change, we would be writing the length after the payload. Therefore, we need to restore the `pos` we had before the payload was written.
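
The ordering constraint described in this reply can be sketched as follows. This is a simplified stand-in, not the code from `data-view.ts`: `MiniView` is a hypothetical class, and it uses a fixed 16-bit length prefix (as discussed earlier in the thread) instead of `pushLength`, so that the reserved space has a known size. The idea is the same: encode the payload past the reserved bytes first, then go back and write the length in front of it.

```typescript
// Hypothetical, simplified view; a fixed 2-byte little-endian length prefix
// is assumed here for illustration.
class MiniView {
  public pos = 0;
  private readonly encoder = new TextEncoder();
  private readonly decoder = new TextDecoder('utf-8', { ignoreBOM: true });

  constructor(public readonly buffer: Uint8Array) {}

  pushUint16(value: number): void {
    this.buffer[this.pos++] = value & 0xff;
    this.buffer[this.pos++] = (value >>> 8) & 0xff;
  }

  getUint16(): number {
    const value = this.buffer[this.pos] | (this.buffer[this.pos + 1] << 8);
    this.pos += 2;
    return value;
  }

  pushUTF8(str: string): void {
    const start = this.pos;
    // The byte length is only known after encoding, so encode first,
    // leaving 2 bytes free at `start` for the prefix.
    // Note: encodeInto silently stops when the destination is full, and this
    // sketch does not guard against payloads longer than 65535 bytes.
    const { written } = this.encoder.encodeInto(str, this.buffer.subarray(start + 2));
    // `pos` still equals `start`, so the length lands before the payload.
    this.pushUint16(written);
    // Skip over the payload that is already in place.
    this.pos = start + 2 + written;
  }

  getUTF8(): string {
    const length = this.getUint16();
    const str = this.decoder.decode(this.buffer.subarray(this.pos, this.pos + length));
    this.pos += length;
    return str;
  }
}

const miniView = new MiniView(new Uint8Array(64));
miniView.pushUTF8('héllo');
miniView.pushUTF8('\ufeffx');
miniView.pos = 0;
const first = miniView.getUTF8();
const second = miniView.getUTF8();
console.log({ first, second });
```

With a variable-length prefix, the number of bytes to reserve depends on the encoded length itself, so the real implementation has more to work out than this sketch does.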

chrmod · 2025-01-08T13:57:42Z

packages/adblocker/src/data-view.ts

@@ -20,6 +20,9 @@ export const EMPTY_UINT32_ARRAY = new Uint32Array(0);
 // Check if current architecture is little endian
 const LITTLE_ENDIAN: boolean = new Int8Array(new Int16Array([1]).buffer)[0] === 1;

+// TextEncoder doesn't need to be recreated every time unlike TextDecoder
+const TEXT_ENCODER = new TextEncoder();


Alternatively, you could do:

const encoder = new TextEncoder();
const encode = encoder.encode.bind(encoder);

feat: use TextEncoder and TextDecoder for utf8 strings

05063b5

seia-soto added the PR: Internal 🏠 Changes only affect internals label Dec 10, 2024

seia-soto self-assigned this Dec 10, 2024

seia-soto requested a review from remusao as a code owner December 10, 2024 08:29

refactor: pos calculation in pushUTF8

1de005e

seia-soto requested a review from chrmod December 11, 2024 07:45

chrmod reviewed Dec 11, 2024

chrmod approved these changes Dec 11, 2024

philipp-classen approved these changes Dec 11, 2024

seia-soto commented Dec 11, 2024

remusao reviewed Dec 11, 2024

seia-soto added 2 commits December 11, 2024 22:11

chore: save length of string in 16 bits unsigned integer

d6032eb

- ~65535 ASCII only characters

fix: use getLength and pushLength for utf8

65764e9

remusao reviewed Dec 12, 2024

seia-soto added 3 commits December 12, 2024 17:53

fix: calculate length of utf8 encoded string

0339bdc

chore: drop useless fast exit

3270e6a

refactor: reuse sizeOfLength

47dc63a

seia-soto requested review from chrmod, remusao and philipp-classen December 12, 2024 09:23

chrmod added PR: Bug Fix 🐛 Increment patch version when merged and removed PR: Internal 🏠 Changes only affect internals labels Dec 23, 2024

chrmod reviewed Jan 8, 2025

chrmod requested changes Jan 8, 2025

seia-soto and others added 3 commits January 10, 2025 22:59

Update packages/adblocker/src/data-view.ts

72e5b0d

Co-authored-by: Krzysztof Modras <[email protected]>

Update packages/adblocker/src/data-view.ts

dd8332a

Co-authored-by: Krzysztof Modras <[email protected]>

chore: remove unused ts-ignore

c7b3cbc
