ParseTTC duplicates work for tables shared between fonts #147

dominikh · 2024-03-25T20:42:07Z

In an OpenType font collection, some tables might be referenced by multiple fonts. For example, in the Noto Sans CJK font collection, all fonts refer to the same CFF2 table (and several others, but the CFF2 table is by far the largest). However, ParseTTC treats each font as an individual object, loading and parsing the same tables repeatedly. For Noto Sans CJK, this results in a 5x increase in I/O and processing time, loading the 30 MB CFF2 table five times, once per font.

I'm not sure that the ParseTTC API is a good idea in the first place (we may only ever want one font from the collection), but if it is to exist, it should at least exploit data deduplication.
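
For context, the sharing described above is visible in the file structure itself: each font in a collection has its own table directory, and every record in it stores an absolute file offset and a length, so two fonts share a table when their records point at the same bytes. Below is a minimal sketch of how that could be detected, assuming the standard `ttcf` header layout from the OpenType specification. The names `tableRef`, `countTableUses`, and `syntheticTTC` are made up for illustration and are not part of this library; the input is a hand-built two-font collection rather than a real font file.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// tableRef identifies a table by tag and by where its bytes live in the file.
// Fonts whose table records yield the same tableRef share those bytes.
type tableRef struct {
	Tag    string
	Offset uint32
	Length uint32
}

var errTruncated = errors.New("ttc: truncated data")

// countTableUses walks the TTC header and each font's table directory and
// returns how many fonts reference each distinct table.
func countTableUses(data []byte) (map[tableRef]int, error) {
	if len(data) < 12 || string(data[:4]) != "ttcf" {
		return nil, errors.New("ttc: not a font collection")
	}
	be := binary.BigEndian
	size := uint64(len(data))
	numFonts := uint64(be.Uint32(data[8:]))
	uses := make(map[tableRef]int)
	for i := uint64(0); i < numFonts; i++ {
		p := 12 + 4*i // entry i of the table directory offset array
		if p+4 > size {
			return nil, errTruncated
		}
		dir := uint64(be.Uint32(data[p:]))
		if dir+12 > size {
			return nil, errTruncated
		}
		numTables := uint64(be.Uint16(data[dir+4:]))
		for j := uint64(0); j < numTables; j++ {
			rec := dir + 12 + 16*j // 16-byte record: tag, checksum, offset, length
			if rec+16 > size {
				return nil, errTruncated
			}
			uses[tableRef{
				Tag:    string(data[rec : rec+4]),
				Offset: be.Uint32(data[rec+8:]),
				Length: be.Uint32(data[rec+12:]),
			}]++
		}
	}
	return uses, nil
}

// syntheticTTC builds a tiny two-font collection (hypothetical data): both
// fonts point at the same "CFF " table, and each has its own "GSUB".
func syntheticTTC() []byte {
	be := binary.BigEndian
	buf := make([]byte, 124)
	copy(buf, "ttcf")
	be.PutUint16(buf[4:], 1)   // majorVersion
	be.PutUint32(buf[8:], 2)   // numFonts
	be.PutUint32(buf[12:], 20) // table directory of font 0
	be.PutUint32(buf[16:], 64) // table directory of font 1
	writeDir := func(at int, gsubOffset uint32) {
		be.PutUint32(buf[at:], 0x4F54544F) // sfntVersion 'OTTO'
		be.PutUint16(buf[at+4:], 2)        // numTables
		copy(buf[at+12:], "CFF ")
		be.PutUint32(buf[at+20:], 108) // offset, identical in both fonts
		be.PutUint32(buf[at+24:], 8)   // length
		copy(buf[at+28:], "GSUB")
		be.PutUint32(buf[at+36:], gsubOffset)
		be.PutUint32(buf[at+40:], 4)
	}
	writeDir(20, 116)
	writeDir(64, 120)
	return buf
}

func main() {
	uses, err := countTableUses(syntheticTTC())
	if err != nil {
		panic(err)
	}
	for ref, n := range uses {
		fmt.Printf("%s : %d bytes -> used %d times\n", ref.Tag, ref.Length, n)
	}
}
```

Any entry with a count above one is a table that a per-font loader would read and parse more than once.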

benoitkugler · 2024-03-26T08:44:13Z

Huh..

We could do it, but it would involve a new API for loading collections, since we would need to track the shared tables. And the NewFont constructor would have to be adapted quite heavily..

whereswaldon · 2024-03-28T13:35:40Z

It seems worth doing given the potential savings. We could (potentially) still offer the simpler, less performant API for use cases that don't need the extra complexity.

dominikh · 2024-03-29T01:49:09Z

> but it would involve a new API for loading collections

Which is IMO warranted, anyway, to make it easier to load fonts from a collection on demand, instead of all at once.

I'm currently tinkering with such a new API; I can send an RFC PR in a couple of days if you'd like.

Edit: I take that back. The parsing of some tables depends on other tables, which makes it harder to implement table reuse cleanly, as different tables would need different cache keys to encode the dependencies. Being on the "receiving end" of trying to implement it, I'd probably want to see some stats as to how often large tables get reused. My intuition tells me that this is only really the case for CJK fonts with language defaults. Most uses of collections vary fonts by weight, width, slant, etc, which all require unique glyphs.
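
To illustrate the cache-key problem with one concrete case: decoding `hmtx` needs `numberOfHMetrics` from `hhea` and `numGlyphs` from `maxp`, so a parsed `hmtx` can only be reused by a font that shares all three tables. The sketch below is one possible shape for such a key, not what was actually prototyped; `loc`, `hmtxKey`, and `hmtxCache` are hypothetical names, and the parser is a placeholder.

```go
package main

import "fmt"

// loc is the location of a table's bytes within the collection file.
type loc struct{ Offset, Length uint32 }

// hmtxKey is a hypothetical cache key for a parsed hmtx table. It includes
// the locations of hhea and maxp because hmtx cannot be decoded without them.
type hmtxKey struct{ Hmtx, Hhea, Maxp loc }

// hmtxCache memoizes parsed results; []uint16 stands in for real metrics.
type hmtxCache struct {
	entries map[hmtxKey][]uint16
	parses  int // number of cache misses, kept for illustration
}

// get returns the cached value for k, calling parse only on a miss.
func (c *hmtxCache) get(k hmtxKey, parse func() []uint16) []uint16 {
	if v, ok := c.entries[k]; ok {
		return v
	}
	if c.entries == nil {
		c.entries = make(map[hmtxKey][]uint16)
	}
	c.parses++
	v := parse()
	c.entries[k] = v
	return v
}

func main() {
	var cache hmtxCache
	parse := func() []uint16 { return []uint16{500, 600} } // placeholder parser

	shared := hmtxKey{Hmtx: loc{1000, 64}, Hhea: loc{200, 36}, Maxp: loc{300, 6}}
	cache.get(shared, parse) // font 0: miss
	cache.get(shared, parse) // font 1: hit, same hmtx and same dependencies

	other := shared
	other.Hhea = loc{400, 36} // same hmtx bytes, different hhea
	cache.get(other, parse)   // miss: the result depends on hhea

	fmt.Println("parses:", cache.parses) // prints "parses: 2"
}
```

Every table with dependencies would need its own key type along these lines, which is the bookkeeping cost mentioned above.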

benoitkugler · 2024-03-29T10:54:18Z

Here are some numbers to illustrate @dominikh's point.

Details

/usr/share/fonts/opentype/noto/NotoSansCJK-Bold.ttc 10 faces
CFF : 16023 KB -> used 10 times
hmtx : 262 KB -> used 10 times
vmtx : 261 KB -> used 10 times
VORG : 0 KB -> used 10 times
BASE : 0 KB -> used 10 times
vhea : 0 KB -> used 10 times
hhea : 0 KB -> used 10 times
post : 0 KB -> used 10 times
GDEF : 0 KB -> used 10 times
maxp : 0 KB -> used 10 times
OS/2 : 0 KB -> used 6 times
OS/2 : 0 KB -> used 4 times
GSUB : 177 KB -> used 2 times
GSUB : 171 KB -> used 2 times
GSUB : 167 KB -> used 2 times
GSUB : 166 KB -> used 2 times
GSUB : 166 KB -> used 2 times

/usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc 10 faces
CFF : 15458 KB -> used 10 times
hmtx : 262 KB -> used 10 times
vmtx : 261 KB -> used 10 times
VORG : 0 KB -> used 10 times
BASE : 0 KB -> used 10 times
hhea : 0 KB -> used 10 times
vhea : 0 KB -> used 10 times
post : 0 KB -> used 10 times
GDEF : 0 KB -> used 10 times
maxp : 0 KB -> used 10 times
OS/2 : 0 KB -> used 6 times
OS/2 : 0 KB -> used 4 times
GSUB : 177 KB -> used 2 times
GSUB : 171 KB -> used 2 times
GSUB : 167 KB -> used 2 times
GSUB : 166 KB -> used 2 times
GSUB : 166 KB -> used 2 times

/usr/share/fonts/opentype/noto/NotoSerifCJK-Bold.ttc 5 faces
CFF : 24427 KB -> used 5 times
hmtx : 261 KB -> used 5 times
vmtx : 261 KB -> used 5 times
VORG : 0 KB -> used 5 times
BASE : 0 KB -> used 5 times
hhea : 0 KB -> used 5 times
vhea : 0 KB -> used 5 times
post : 0 KB -> used 5 times
GDEF : 0 KB -> used 5 times
maxp : 0 KB -> used 5 times
OS/2 : 0 KB -> used 3 times
OS/2 : 0 KB -> used 2 times

/usr/share/fonts/opentype/noto/NotoSerifCJK-Regular.ttc 5 faces
CFF : 23442 KB -> used 5 times
hmtx : 261 KB -> used 5 times
vmtx : 261 KB -> used 5 times
VORG : 1 KB -> used 5 times
BASE : 0 KB -> used 5 times
vhea : 0 KB -> used 5 times
hhea : 0 KB -> used 5 times
post : 0 KB -> used 5 times
GDEF : 0 KB -> used 5 times
maxp : 0 KB -> used 5 times
OS/2 : 0 KB -> used 3 times
OS/2 : 0 KB -> used 2 times

(I've not found any other collections on my system though.)

Perhaps a first step would be to only consider the CFF, CFF2, and glyf tables (which are by far the heaviest ones)?
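
A rough sketch of what that first step might look like: a per-collection cache that memoizes only the heavy tables, keyed by tag and file offset, and parses everything else per font as today. All names here (`heavyTags`, `collectionCache`, `load`) are hypothetical and the parse function is a stand-in. Keying by offset alone also ignores cross-table dependencies (`glyf`, for instance, is decoded together with `loca`), so this assumes fonts that share a heavy table also share whatever it depends on.

```go
package main

import (
	"fmt"
	"sync"
)

// heavyTags lists the tables worth sharing in this sketch.
// Note that the OpenType tag for CFF is "CFF " with a trailing space.
var heavyTags = map[string]bool{"CFF ": true, "CFF2": true, "glyf": true}

// tableKey identifies a table by tag and by its offset in the collection file.
type tableKey struct {
	Tag    string
	Offset uint32
}

// collectionCache holds parsed heavy tables for one collection.
type collectionCache struct {
	mu     sync.Mutex
	parsed map[tableKey]any
}

// load returns the parsed form of a table. Heavy tables are parsed once per
// distinct (tag, offset); all other tables are parsed on every call.
func (c *collectionCache) load(tag string, offset uint32, parse func() any) any {
	if !heavyTags[tag] {
		return parse()
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	k := tableKey{Tag: tag, Offset: offset}
	if v, ok := c.parsed[k]; ok {
		return v
	}
	if c.parsed == nil {
		c.parsed = make(map[tableKey]any)
	}
	v := parse()
	c.parsed[k] = v
	return v
}

func main() {
	var cache collectionCache
	calls := 0
	parse := func() any { calls++; return "parsed" } // placeholder parser

	for font := 0; font < 10; font++ {
		cache.load("CFF ", 4096, parse) // shared by all fonts: parsed once
		cache.load("GSUB", 8192, parse) // not in heavyTags: parsed per font
	}
	fmt.Println("parse calls:", calls) // prints "parse calls: 11"
}
```

The mutex is held while parsing, which keeps the sketch simple at the cost of serializing concurrent loads.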

andydotxyz · 2024-04-02T13:44:23Z

> Which is IMO warranted, anyway, to make it easier to load fonts from a collection on demand, instead of all at once.

I appreciate that this is complex, but I agree that a collection-based API may be a good thing, so that we can lazily load less than a full collection.

I recently found that many OSes ship fonts that cover all languages in a single file, including the glyphs for every script, which makes for big files and not particularly fast parses.
