This is needed to get #473 to work reasonably well, especially when one user-perceived character is made up of many Unicode codepoints. Word splitting is also a lot better than trying to do it by codepoint category as the existing unicode61 tokenizer does (e.g. that messes up "don't"), and sentence splitting is good for a better snippet function.
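To illustrate both problems, here is a minimal sketch using only Python's standard `unicodedata` module. The helper `category_tokens` is hypothetical and only approximates a category-based tokenizer (it keeps runs of letter and number codepoints); it is not the actual unicode61 implementation.

```python
import unicodedata


def category_tokens(text):
    """Split text into runs of letter/number codepoints.

    Hypothetical simplification of category-based tokenizing:
    anything that is not category L* or N* ends the current token.
    """
    tokens, current = [], []
    for ch in text:
        if unicodedata.category(ch)[0] in ("L", "N"):
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


# The apostrophe is punctuation (category Po), so the word is cut in two
print(category_tokens("don't stop"))  # ['don', 't', 'stop']

# One user-perceived character (a family emoji) built from
# man + ZWJ + woman + ZWJ + girl
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
print(len(family))  # 5 codepoints
```

TR-29 word rules keep "don't" together, and its grapheme cluster rules treat the emoji sequence as a single unit.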
Generate break tables
Test code from tables
Grapheme cluster
Word
Sentence
Unicode categories
Equivalent_Unified_Ideograph
Emoji
Regional Indicator
Case folding
Investigate TR-14 line breaking for implementing textwrap
textwrap
Wide codepoints (east asian width == F or W)
Grapheme cluster count
Grapheme cluster range substring
Grapheme cluster width
Startswith / endswith
find/index
Set doc order to bysource and rearrange functions into logical doc order
Grapheme cluster base char (remove diacritics equivalent)
Compatibility codepoints (eg roman numeral ⅲ becomes latin iii)
Update apsw.ext.format_query_table to use textwrap
Update tests
Convert important bits to C
Convert other bits to C
Rename module to unicode
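Some of the checklist items above (wide codepoints, base character, compatibility codepoints) can be roughly approximated with the standard library. The sketch below is an assumption-laden illustration, not the module's API: the function names `text_width`, `strip_diacritics`, and `compat` are made up, and the width calculation works per codepoint, so it does not handle ZWJ sequences or variation selectors the way a grapheme cluster aware version would.

```python
import unicodedata


def text_width(text):
    """Approximate terminal columns for text (hypothetical helper).

    Wide codepoints (East Asian Width F or W) count as 2, combining
    marks as 0, everything else as 1.
    """
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks take no extra column
        width += 2 if unicodedata.east_asian_width(ch) in ("F", "W") else 1
    return width


def strip_diacritics(text):
    """Decompose with NFD, then drop mark codepoints (category M*)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed if not unicodedata.category(ch).startswith("M")
    )


def compat(text):
    """Replace compatibility codepoints using NFKC normalization."""
    return unicodedata.normalize("NFKC", text)


print(text_width("abc"))         # 3
print(text_width("日本語"))       # 6
print(strip_diacritics("café"))  # cafe
print(compat("ⅲ"))               # iii
```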
rogerbinns changed the title from "Consider implementing tr-29 since anything less is problematic. Of note implementation needs to be able track byte/character offsets" to "Implement Unicode TR-29" on Feb 21, 2024
rogerbinns changed the title from "Implement Unicode TR-29" to "Implement Unicode TR-29 and TR-14" on Mar 23, 2024