Explicitly express that any core-strings are in UTF-8 #1788

375gnu · 2023-07-21T12:12:42Z

Celestia runs on different platforms with different locales. While most desktop GNU/Linux installations prefer UTF-8 locales, some other systems, e.g. older Windows, prefer pure 8-bit encodings such as CP1251. Our core code, on the other hand, always assumes that all strings are in UTF-8, so the win32 frontend has conversion functions here and there.

C++20 introduced a new primitive type, char8_t: a type for UTF-8 character representation, required to be large enough to represent any UTF-8 code unit (8 bits). It has the same size, signedness, and alignment as unsigned char (and therefore the same size and alignment as char and signed char), but is a distinct type. C++20 also added new std::basic_string<T>/std::basic_string_view<T> specializations using char8_t: std::u8string and std::u8string_view.

So my proposal is to use these types in our core routines and to use std::string and std::string_view only in frontends, i.e. wherever non-UTF-8 characters can occur. Of course, as we target C++17, we need to implement the required types ourselves.
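To make the proposed boundary concrete, here is a minimal C++17 sketch, not Celestia code. The names Utf8Text, latin1ToUtf8 and coreByteLength are hypothetical, and ISO-8859-1 is used instead of CP1251 only because it converts without a lookup table. The point is that a distinct type forces frontends to convert explicitly before calling into the core.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical stand-in for std::u8string on a C++17 toolchain.
// It is a distinct type, so a frontend std::string cannot reach a core
// routine without going through an explicit conversion.
class Utf8Text
{
public:
    Utf8Text() = default;

    // Named constructor: the caller promises the bytes are already UTF-8.
    // No validation is performed in this sketch.
    static Utf8Text fromUtf8Unchecked(std::string bytes)
    {
        Utf8Text t;
        t.m_bytes = std::move(bytes);
        return t;
    }

    const std::string& bytes() const noexcept { return m_bytes; }

private:
    std::string m_bytes;
};

// Frontend-side conversion from ISO-8859-1 (an assumption made for brevity;
// CP1251 would need a mapping table instead of bit arithmetic).
inline Utf8Text latin1ToUtf8(std::string_view input)
{
    std::string out;
    out.reserve(input.size());
    for (char c : input)
    {
        const auto b = static_cast<unsigned char>(c);
        if (b < 0x80)
        {
            // ASCII bytes are identical in UTF-8.
            out.push_back(static_cast<char>(b));
        }
        else
        {
            // U+0080..U+00FF encode as two UTF-8 bytes.
            out.push_back(static_cast<char>(0xC0 | (b >> 6)));
            out.push_back(static_cast<char>(0x80 | (b & 0x3F)));
        }
    }
    return Utf8Text::fromUtf8Unchecked(std::move(out));
}

// Hypothetical core routine: it accepts only the UTF-8 type.
inline std::size_t coreByteLength(const Utf8Text& name)
{
    return name.bytes().size();
}
```

With this shape, coreByteLength(std::string("abc")) does not compile, while coreByteLength(latin1ToUtf8("abc")) does, so the encoding assumption is visible in the signature.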

@ajtribick @levinli303 what do you think?

ajtribick · 2023-07-21T19:42:04Z

I think the usual convention pre-C++20 is to use std::basic_string<unsigned char, CustomTraitsType> for this purpose, which makes sense to me, then ensure we use the various workarounds detailed in P2513R4 Compatibility and Portability Fix.
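For reference, a rough sketch of what such a type could look like in C++17. The names Utf8Traits, U8String, U8StringView and toU8String are hypothetical; the traits members follow the standard character-traits requirements, and the char8_t workarounds from P2513R4 are not covered here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>   // std::mbstate_t
#include <ios>      // std::streamoff, std::streampos
#include <string>
#include <string_view>

// Hypothetical traits class for unsigned char code units.
struct Utf8Traits
{
    using char_type  = unsigned char;
    using int_type   = int;
    using off_type   = std::streamoff;
    using pos_type   = std::streampos;
    using state_type = std::mbstate_t;

    static void assign(char_type& dst, const char_type& src) noexcept { dst = src; }
    static bool eq(char_type a, char_type b) noexcept { return a == b; }
    static bool lt(char_type a, char_type b) noexcept { return a < b; }

    // memcmp compares bytes as unsigned char, which matches lt() above.
    static int compare(const char_type* a, const char_type* b, std::size_t n) noexcept
    {
        return n == 0 ? 0 : std::memcmp(a, b, n);
    }
    static std::size_t length(const char_type* s) noexcept
    {
        std::size_t n = 0;
        while (s[n] != 0)
            ++n;
        return n;
    }
    static const char_type* find(const char_type* s, std::size_t n, const char_type& c) noexcept
    {
        return n == 0 ? nullptr : static_cast<const char_type*>(std::memchr(s, c, n));
    }
    static char_type* move(char_type* dst, const char_type* src, std::size_t n) noexcept
    {
        if (n != 0)
            std::memmove(dst, src, n);
        return dst;
    }
    static char_type* copy(char_type* dst, const char_type* src, std::size_t n) noexcept
    {
        if (n != 0)
            std::memcpy(dst, src, n);
        return dst;
    }
    static char_type* assign(char_type* dst, std::size_t n, char_type c) noexcept
    {
        if (n != 0)
            std::memset(dst, c, n);
        return dst;
    }

    static int_type to_int_type(char_type c) noexcept { return c; }
    static char_type to_char_type(int_type i) noexcept { return static_cast<char_type>(i); }
    static bool eq_int_type(int_type a, int_type b) noexcept { return a == b; }
    static int_type eof() noexcept { return -1; }
    static int_type not_eof(int_type i) noexcept { return i == eof() ? 0 : i; }
};

using U8String     = std::basic_string<unsigned char, Utf8Traits>;
using U8StringView = std::basic_string_view<unsigned char, Utf8Traits>;

// Copies bytes that the caller already knows to be UTF-8 into a U8String.
inline U8String toU8String(std::string_view bytes)
{
    const auto* p = reinterpret_cast<const unsigned char*>(bytes.data());
    return U8String(p, bytes.size());
}
```

Because the character and traits types both differ from those of std::string, the two string types do not convert into each other implicitly, which is the property we are after.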

Is it too early to bump the required standard to C++20?

375gnu · 2023-07-21T20:45:17Z

> Is it too early to bump the required standard to C++20?

Definitely. I suppose we can switch (if we need to) no earlier than 2025. Personally I only want designated initializers and maybe concepts, though sometimes they're too ugly (the infamous requires requires), and maybe modules, but gcc doesn't support them.

> I think the usual convention pre-C++20 is to use std::basic_string<unsigned char, CustomTraitsType> for this purpose, which makes sense to me, then ensure we use the various workarounds detailed in P2513R4 Compatibility and Portability Fix.

I hoped such a std::basic_string<unsigned char, CustomTraitsType> could be compatible with char8_t, but it seems not. Maybe in C++26 they will fix all the issues, or make everything so bad that rewriting in Rust/Zig/Carbon will be the best solution.

But anyway, it makes sense to evaluate std::basic_string<unsigned char, CustomTraitsType>, especially since char is signed on some platforms and unsigned on others, which may lead to weird bugs.
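A small illustration of the kind of bug meant here, as a sketch with hypothetical function names. Widening a plain char that holds a UTF-8 byte gives different integers depending on whether char is signed, while going through unsigned char gives the same value everywhere.

```cpp
#include <cassert>
#include <type_traits>

// Naive widening: the result depends on the platform's char signedness.
// For the byte 0xD0 this is 208 where char is unsigned, but a negative
// value where char is signed.
inline int naiveCodeUnit(char c)
{
    return c;
}

// Portable widening: always yields a value in 0..255.
inline int codeUnit(char c)
{
    return static_cast<unsigned char>(c);
}

// True if the byte starts a two-byte UTF-8 sequence (bit pattern 110xxxxx).
inline bool isTwoByteLead(char c)
{
    return (codeUnit(c) & 0xE0) == 0xC0;
}
```

A lookup table indexed with naiveCodeUnit would read out of bounds on signed-char platforms; with unsigned char as the string's character type, the cast becomes unnecessary.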

375gnu added the "question" (Further information is requested) label on Jul 21, 2023
