Implement Unicode TR-29 and TR-14 #509

Closed
26 tasks done
Tracked by #473
rogerbinns opened this issue Feb 21, 2024 · 1 comment

Comments

rogerbinns (Owner) commented Feb 21, 2024

This is needed to get #473 to work reasonably well, especially when one user-perceived character is made up of many Unicode codepoints. The word splitting is also a lot better than trying to do it by codepoint category, as the existing unicode61 tokenizer does (e.g. that approach messes up "don't"), and sentence splitting is good for a better snippet function.
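
To illustrate both points, here is a minimal sketch using only the standard library. The `split_by_category` helper is hypothetical and only approximates a category-based splitter; it is not the actual unicode61 implementation.

```python
import unicodedata

# One user-perceived character built from several codepoints:
# WOMAN + ZERO WIDTH JOINER + ROCKET renders as a single astronaut emoji.
astronaut = "\U0001F469\u200D\U0001F680"
# "e" followed by COMBINING ACUTE ACCENT renders as a single accented letter.
e_acute = "e\u0301"

def split_by_category(text):
    """Naive splitter: a word is a run of letter (L*) or number (N*) codepoints."""
    words, current = [], []
    for ch in text:
        if unicodedata.category(ch)[0] in ("L", "N"):
            current.append(ch)
        else:
            # Anything else (punctuation, spaces, ...) ends the current word.
            if current:
                words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words

# len() counts codepoints, not user-perceived characters.
codepoint_counts = (len(astronaut), len(e_acute))
# The apostrophe is punctuation, so "don't" is cut in two.
naive_words = split_by_category("don't stop")
```

Here `codepoint_counts` is `(3, 2)` even though each string displays as one character, and `naive_words` is `['don', 't', 'stop']`. TR-29 word rules keep "don't" together as one word.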

  • Generate break tables
  • Test code from tables
  • Grapheme cluster
  • Word
  • Sentence
  • Unicode categories
  • Equivalent_Unified_Ideograph
  • Emoji
  • Regional Indicator
  • Case folding
  • Investigate TR-14 line breaking for implementing textwrap
  • textwrap
  • Wide codepoints (east asian width == F or W)
  • Grapheme cluster count
  • Grapheme cluster range substring
  • Grapheme cluster width
  • Startswith / endswith
  • find/index
  • Set doc order to bysource and rearrange functions into logical doc order
  • Grapheme cluster base char (remove diacritics equivalent)
  • Compatibility codepoints (eg roman numeral ⅲ becomes latin iii)
  • Update apsw.ext.format_query_table to use textwrap
  • Update tests
  • Convert important bits to C
  • Convert other bits to C
  • Rename module to unicode
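
Several of the items above (wide codepoints, case folding, removing diacritics, compatibility codepoints) can be roughly sketched per codepoint with Python's standard `unicodedata` module. The helper names here are made up for illustration and are not the module's API; a real implementation would work on grapheme clusters rather than individual codepoints.

```python
import unicodedata

def is_wide(ch):
    """True when the codepoint's East Asian Width is F (fullwidth) or W (wide)."""
    return unicodedata.east_asian_width(ch) in ("F", "W")

def strip_diacritics(text):
    """Canonically decompose, then drop combining marks (categories Mn, Mc, Me)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed if not unicodedata.category(ch).startswith("M")
    )

def compatibility(text):
    """Replace compatibility codepoints with their NFKC equivalents."""
    return unicodedata.normalize("NFKC", text)

# Latin a, CJK ideograph U+6F22, FULLWIDTH LATIN CAPITAL LETTER A
wide_flags = [is_wide(ch) for ch in "a\u6f22\uff21"]
stripped = strip_diacritics("caf\u00e9")   # precomposed e with acute
roman = compatibility("\u2172")            # SMALL ROMAN NUMERAL THREE
folded = "Stra\u00dfe".casefold()          # case folding expands sharp s
```

This gives `[False, True, True]`, `'cafe'`, `'iii'` and `'strasse'` respectively.
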
rogerbinns changed the title from "Consider implementing tr-29 since anything less is problematic. Of note implementation needs to be able track byte/character offsets" to "Implement Unicode TR-29" on Feb 21, 2024
rogerbinns changed the title from "Implement Unicode TR-29" to "Implement Unicode TR-29 and TR-14" on Mar 23, 2024
rogerbinns added a commit that referenced this issue Apr 19, 2024
rogerbinns (Owner, Author) commented

Equivalent_Unified_Ideograph should be investigated. It looks useful in the stripped function for getting the compatibility codepoint.
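
One possible reason the property is useful (an assumption about the motivation, not something stated above) is that normalization alone does not cover every radical. A small sketch with the standard `unicodedata` module shows the gap; it does not read the Equivalent_Unified_Ideograph data itself, which is not exposed by the standard library.

```python
import unicodedata

kangxi_one = "\u2F00"     # KANGXI RADICAL ONE
radical_cliff = "\u2E81"  # CJK RADICAL CLIFF

# Kangxi radicals carry a compatibility decomposition, so NFKC maps this
# one to the unified ideograph U+4E00.
kangxi_folded = unicodedata.normalize("NFKC", kangxi_one)

# This CJK Radicals Supplement codepoint has no decomposition, so NFKC
# returns it unchanged and a separate mapping would be needed.
cliff_folded = unicodedata.normalize("NFKC", radical_cliff)
```

`kangxi_folded` is `'\u4e00'`, while `cliff_folded` is still `'\u2e81'`.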
