Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: use TextEncoder and TextDecoder for utf8 strings #4513

Open
wants to merge 10 commits into
base: master
Choose a base branch
from
28 changes: 12 additions & 16 deletions packages/adblocker/src/data-view.ts
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,9 @@ export const EMPTY_UINT32_ARRAY = new Uint32Array(0);
// Check if current architecture is little endian
const LITTLE_ENDIAN: boolean = new Int8Array(new Int16Array([1]).buffer)[0] === 1;

// TextEncoder doesn't need to be recreated every time unlike TextDecoder
const TEXT_ENCODER = new TextEncoder();
seia-soto marked this conversation as resolved.
Show resolved Hide resolved
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alternatively you could do:

const encoder = new TextEncoder();
const encode = encoder.encode.bind(encoder);


// Store compression in a lazy, global singleton
let getCompressionSingleton: () => Compression = () => {
const COMPRESSION = new Compression();
Expand Down Expand Up @@ -87,8 +90,7 @@ export function sizeOfASCII(str: string): number {
* Return number of bytes needed to serialize `str` UTF8 string.
*/
export function sizeOfUTF8(str: string): number {
const encodedLength = encode(str).length;
return encodedLength + sizeOfLength(encodedLength);
return 4 + TEXT_ENCODER.encode(str).length;
seia-soto marked this conversation as resolved.
Show resolved Hide resolved
}

/**
Expand Down Expand Up @@ -389,23 +391,17 @@ export class StaticDataView {
}

public pushUTF8(raw: string): void {
const str = encode(raw);
this.pushLength(str.length);

for (let i = 0; i < str.length; i += 1) {
this.buffer[this.pos++] = str.charCodeAt(i);
}
const { written } = TEXT_ENCODER.encodeInto(raw, this.buffer.subarray(this.pos + 4));
this.pushUint32(written);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't that wasteful in terms of final engine size?

Copy link
Member Author

@seia-soto seia-soto Dec 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

encodeInto doesn't emit C-style NULL character which indicates the end of string at the end. We can consume the string dynamically when reading the binary but I think it's simpler to have a final length of string at the front.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My comment was about using 32-bits instead of potentially 8-bits or 16-bits for smaller strings.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can change this to 16 bits (= ~65535 chars). 👍

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reduced in d6032eb (#4513)

Copy link
Member Author

@seia-soto seia-soto Dec 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they're barely called which is not my expect. I only got 5 logs:

17
65
989
117
46
93

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Preferring to use getLength and pushLength as pushUint16 is breaking the test.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they're barely called which is not my expect. I only got 5 logs:

17
65
989
117
46
93

@seia-soto Did you figure out why they are barely called and if it's expected? Because if this method is almost not called maybe it's preferable to keep a simpler code for the variable length part.

Copy link
Member Author

@seia-soto seia-soto Jan 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@remusao I got all pushUTF8 calls and these are the entries:

  1. Tracker DB properties
  2. Cosmetic selectors with unicode
  3. Use of optionValue in Network filters (csp, redirect, and replace)
  4. Resource content (including scriptlets)
  5. Preprocessor conditions

I don't think we will see those frequently in testing or benchmarking setup.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we will see those frequently in testing or benchmarking setup.

We need to check with production assets being loaded. I don't think you will learn anything otherwise. With which lists did you do the measurement above?

The point being: if when loading full assets with all lists/resources/etc. this method is not called a lot, then I think it's best to keep the simpler version without the variadic length logic (that only makes sense if it has a measurable impact on final binary size)

this.setPos(this.pos + written);
}

public getUTF8(): string {
const byteLength = this.getLength();
this.pos += byteLength;
return decode(
String.fromCharCode.apply(
null,
// @ts-ignore
this.buffer.subarray(this.pos - byteLength, this.pos),
),
const byteLength = this.getUint32();
const pos = this.getPos();
this.setPos(pos + byteLength);
return new TextDecoder('utf8', { ignoreBOM: true }).decode(
this.buffer.subarray(pos, pos + byteLength),
seia-soto marked this conversation as resolved.
Show resolved Hide resolved
);
}

Expand Down
Loading