author | title | date |
---|---|---|
Lukas Prokop | Rust Graz – 06 Unicode | 18th of December, 2019 |
fn count_calls(n: u64) -> u64 {
println!("{:p}", &n);
if n < 1 {
0
} else {
1 + count_calls(n - 1)
}
}
fn main() {
println!("{}", count_calls(174470))
}
% cargo run
…
0x7ffc9324f6b0
0x7ffc9324f610
0x7ffc9324f570
thread 'main' has overflowed its stack
fatal runtime error: stack overflow
[1] 11645 abort cargo run
⇒ result of last time: 160 bytes per stackframe
% cargo run --release
…
0x7ffe628fe5a8
0x7ffe628fe548
0x7ffe628fe4e8
thread 'main' has overflowed its stack
fatal runtime error: stack overflow
[1] 11803 abort cargo run --release
⇒ 96 bytes per stackframe
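The per-frame figures can be reproduced from the printed addresses: every call prints the address of its own `n`, so the distance between two consecutive lines is the size of one stack frame. A minimal sketch using the addresses from the two runs above; the helper name `frame_size` is made up for illustration.

```rust
/// Bytes between the addresses printed by two consecutive calls.
/// The stack grows towards lower addresses here, so the caller's
/// address is the larger one.
fn frame_size(caller: u64, callee: u64) -> u64 {
    caller - callee
}

fn main() {
    // first two addresses of the debug run
    println!("{}", frame_size(0x7ffc9324f6b0, 0x7ffc9324f610)); // 160
    // first two addresses of the release run
    println!("{}", frame_size(0x7ffe628fe5a8, 0x7ffe628fe548)); // 96
}
```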
“The Manifest Format” via Cargo book
# The development profile, used for `cargo build`.
[profile.dev]
# controls the `--opt-level` the compiler builds with.
# 0-1 is good for debugging. 2 is well-optimized. Max is 3.
# 's' attempts to reduce size, 'z' reduces size even more.
opt-level = 0
# (u32 or bool) Include debug information (debug symbols).
# Equivalent to `-C debuginfo=2` compiler flag.
debug = true
# Link Time Optimization usually reduces size of binaries
# and static libraries. Increases compilation time.
# If true, passes `-C lto` flag to the compiler, and if a
# string is specified like 'thin' then `-C lto=thin` will
# be passed.
lto = false
# The release profile, used for `cargo build --release`
# (and the dependencies for `cargo test --release`,
# including the local library or binary).
[profile.release]
opt-level = 3
debug = false
lto = false
# The testing profile, used for `cargo test` (for `cargo
# test --release` see the `release` and `bench` profiles).
[profile.test]
opt-level = 0
debug = 2
lto = false
# The benchmarking profile, used for `cargo bench` (and the
# test targets and unit tests for `cargo test --release`).
[profile.bench]
opt-level = 3
debug = false
lto = false
debug build (160 bytes, 2.4MB):
% ls -l ./target/debug/buildtest
-rwxrwxr-x 2 user user 2514680 Dec 17 22:15 ./target/debug/buildtest
release build (96 bytes, 2.4MB):
% ls -l ./target/release/buildtest
-rwxrwxr-x 2 user user 2497912 Dec 17 22:22 ./target/release/buildtest
Old Cargo.toml:
[package]
name = "buildtest"
version = "0.1.0"
authors = ["meisterluk <[email protected]>"]
edition = "2018"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[dependencies]
New Cargo.toml:
[package]
name = "buildtest"
version = "0.1.0"
authors = ["meisterluk <[email protected]>"]
edition = "2018"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[profile.release]
opt-level = 3
debug = true
lto = true
[dependencies]
custom release build (always opt-level=3)
- debug=true, lto=false: 96 bytes stackframe & 2507520 bytes executable
- debug=true, lto=true: 96 & 974440 bytes
- debug=false, lto=true: 96 & 965904 bytes
I don't know how to go below 96 bytes stack frames.
PartialEq
- symmetric: a == b implies b == a; and
- transitive: a == b and b == c implies a == c.
Eq
- additionally reflexive: a == a;
use std::f64;
fn main() {
println!("{}", f64::NAN == f64::NAN); // false
}
⇒ If you implement PartialEq then #[derive(Eq)] as well unless you can’t
Related traits: Hash, PartialOrd, Ord
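As a sketch of that rule of thumb: a type made only of integers can derive `Eq` (and the related traits), while a type holding an `f64` has to stop at `PartialEq` because NaN is not equal to itself. The types `Version` and `Measurement` are invented for this example.

```rust
use std::collections::HashSet;

// All fields are integers, so equality is reflexive: derive Eq as well.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, PartialOrd, Ord)]
struct Version {
    major: u32,
    minor: u32,
}

// Contains an f64 and NaN != NaN, so Eq cannot be derived here.
#[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
struct Measurement {
    value: f64,
}

fn main() {
    let v = Version { major: 1, minor: 2 };
    println!("{}", v == v); // true

    let nan = Measurement { value: f64::NAN };
    println!("{}", nan == nan); // false

    // Eq + Hash is what HashSet requires of its elements.
    let mut seen = HashSet::new();
    seen.insert(v);
    println!("{}", seen.contains(&Version { major: 1, minor: 2 })); // true
}
```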
Unicode maps characters to numbers (code points).
Unicode 12.1 assigns code points to 137,994 characters.
How can we encode 137,994 Unicode code points to bytes? ⇒ Unicode Transformation Format (UTF).
- 2019/04/30: Emperor Akihito abdicated. 2019/05/01: Emperor Naruhito ascended the throne. 1 character added in 12.1
- Previously 平成 ⇒ ㍻: U+5E73 CJK UNIFIED IDEOGRAPH-5E73 and U+6210 CJK UNIFIED IDEOGRAPH-6210 merged into U+337B SQUARE ERA NAME HEISEI
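To make "encoding code points to bytes" concrete, the standard library can show what UTF-8 and UTF-16 produce for a single code point. This is only an illustration of the two standard UTFs; the helper names `utf8_bytes` and `utf16_units` are made up here.

```rust
/// UTF-8 bytes of a single code point.
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4]; // a code point needs at most 4 bytes in UTF-8
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

/// UTF-16 code units of a single code point.
fn utf16_units(c: char) -> Vec<u16> {
    let mut buf = [0u16; 2]; // at most 2 code units in UTF-16
    c.encode_utf16(&mut buf).to_vec()
}

fn main() {
    for c in ['A', 'ß', '気', '💩'].iter() {
        println!(
            "U+{:04X}: UTF-8 {:x?}, UTF-16 {:x?}",
            *c as u32,
            utf8_bytes(*c),
            utf16_units(*c)
        );
    }
}
```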
I came up with some UTF. I will introduce 5 versions of a “Complementary Properties Encoding” (CPE). Let's discuss its properties.
2 bytes = 16 bits. Fixed-width encoding. What are potential problems?
- Backward/ASCII compatibility
  - With one special bit of a single byte set, the 7 remaining bits have the same assignment as in ASCII
- Extended ASCII detection/fallback
  - UTF-8 multibyte strings are rarely linguistically legit Extended ASCII strings: þ ⇒ Ã¾, ø ⇒ Ã¸, ß ⇒ ÃŸ
- Prefix freedom
  - No code word in the system is a prefix of any other code word in the system
- Self-synchronization
  - If we jump to some byte, we can easily determine the start of the next character
- Sorting order
  - Lexicographical order of the bytes equals Unicode code point order
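Two of these properties can be tried out on UTF-8 itself (not on the CPE versions, whose definitions are on the slides). A rough sketch: continuation bytes always look like `0b10xxxxxx`, so from any offset we can skip forward to the next character start, and sorting strings by their bytes yields code point order. The function names are assumptions for this example.

```rust
/// True for UTF-8 continuation bytes, which have the form 0b10xxxxxx.
fn is_continuation(byte: u8) -> bool {
    byte & 0b1100_0000 == 0b1000_0000
}

/// Starting at an arbitrary byte offset, skip continuation bytes until
/// the start of the next character (or the end of the input).
fn next_char_start(bytes: &[u8], mut pos: usize) -> usize {
    while pos < bytes.len() && is_continuation(bytes[pos]) {
        pos += 1;
    }
    pos
}

fn main() {
    let bytes = "aß💩".as_bytes(); // 1 + 2 + 4 bytes
    println!("{}", next_char_start(bytes, 2)); // 3: second byte of ß is skipped
    println!("{}", next_char_start(bytes, 5)); // 7: inside 💩, next start is the end

    // `str` comparison is bytewise, which matches code point order in UTF-8.
    let mut words = vec!["💩", "ß", "a", "気"];
    words.sort();
    println!("{:?}", words); // ["a", "ß", "気", "💩"]
}
```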
via Wikipedia
UTF-8 encoded Japanese Wikipedia rendered in cp1252
use std::fs::File;
use std::io::prelude::*;
fn main() -> std::io::Result<()> {
let mut fd = File::create("pile_of_poo.html")?;
fd.write_all(b"<!DOCTYPE html>\n<head><title>\
\xf0\x9f\x92\xa9</title>\n")?;
Ok(())
}
use std::fs::File;
use std::io::prelude::*;
fn main() -> std::io::Result<()> {
let mut fd = File::create("mojibake.html")?;
fd.write_all(b"<!DOCTYPE html>\n<head><title>\xda \
\xf0\x9f\x92\xa9</title>\n")?;
Ok(())
}
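The difference between the two files is that the extra `\xda` byte makes the second one invalid UTF-8, which is presumably why it is named mojibake.html. A small sketch that checks both byte strings with `std::str::from_utf8`; the helper `first_invalid_offset` is a name invented for this example, and the byte strings are shortened to the `<title>` part.

```rust
use std::str;

/// Offset of the first byte that breaks UTF-8 validity, if any.
fn first_invalid_offset(bytes: &[u8]) -> Option<usize> {
    str::from_utf8(bytes).err().map(|e| e.valid_up_to())
}

fn main() {
    let poo = b"<title>\xf0\x9f\x92\xa9</title>";
    let broken = b"<title>\xda \xf0\x9f\x92\xa9</title>";

    println!("{:?}", first_invalid_offset(poo)); // None
    println!("{:?}", first_invalid_offset(broken)); // Some(7)

    // Lossy decoding replaces the offending byte with U+FFFD.
    println!("{}", String::from_utf8_lossy(broken));
}
```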
version | ASCII compat | fallback | prefix-free | self-sync | sort |
---|---|---|---|---|---|
5 | ❌ | ? | ✓ | ❌ | ✓ |
4 | ✓ | ? | ❌ | ❌ | ❌ |
3 | ✓ | ? | ✓ | ❌ | ❌ |
2 | ❌ | ? | ✓ | ❌ | ✓ |
1 | ✓ | ? | ✓ | ✓ | ❌ |
- Mojibake
  - Characters rendered in the wrong encoding
- Han unification
  - Korean and Japanese writing systems are based on Chinese characters ⇒ huge overlap ⇒ merge different writing systems
- Overlong encoding
  - Remove leading zeros from your binary string, then cram those bits into 1-4 UTF-8 bytes, as few as needed. If you take more bytes, you have an overlong encoding, which is disallowed.
One possible rationale is the desire to limit the size of the full Unicode character set, where CJK characters as represented by discrete ideograms may approach or exceed 100,000 characters. Version 1 of Unicode was designed to fit into 16 bits and only 20,940 characters (32%) out of the possible 65,536 were reserved for these CJK Unified Ideographs.
TRON Code is a multi-byte character encoding used in the TRON project. It is similar to Unicode but does not use Unicode's Han unification process: each character from each CJK character set is encoded separately, including archaic and historical equivalents of modern characters
- Surrogates
  - In UTF-16: extension to encode 2.5 bytes in a 2-byte fixed-width encoding.
  - In UTF-8: invalid bit patterns for compatibility with UTF-16.
- Basic Multilingual Plane
  - Plane of commonly used characters (65,536 code points)
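Overlong encodings and surrogates can be observed directly with the standard library. The following is a sketch, not an exhaustive validity test; `is_valid_utf8` is a helper name chosen for this example.

```rust
use std::char;
use std::str;

/// Thin wrapper around the standard library's UTF-8 validation.
fn is_valid_utf8(bytes: &[u8]) -> bool {
    str::from_utf8(bytes).is_ok()
}

fn main() {
    // U+0000 fits into one byte; the two-byte form C0 80 is overlong.
    println!("{}", is_valid_utf8(&[0x00])); // true
    println!("{}", is_valid_utf8(&[0xC0, 0x80])); // false

    // ED A0 80 would be the three-byte pattern for the surrogate U+D800.
    println!("{}", is_valid_utf8(&[0xED, 0xA0, 0x80])); // false
    // Surrogates are not valid `char` values either.
    println!("{}", char::from_u32(0xD800).is_none()); // true

    // Outside the Basic Multilingual Plane, UTF-16 needs a surrogate pair.
    let mut buf = [0u16; 2];
    println!("{:x?}", '💩'.encode_utf16(&mut buf)); // [d83d, dca9]
    // Inside the BMP, one code unit is enough.
    println!("{}", '気'.len_utf16()); // 1
}
```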
compare with Wikipedia
via twitter
- shapecatcher.com
- joelonsoftware.com: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets
const TEXT: &str = "Héllö Wør̶l̶d";
fn main() {
println!("{}", TEXT);
}
fn main() {
let s = String::from("Hello Graz");
println!("{}", s);
}
fn main() {
// let s = String::with_capacity(11);
let s = String::from("Hello Graz");
s += "!"; // also: s.push_str("!");
println!("{}", s);
}
Does it compile? No, it's immutable.
fn main() {
let mut s = String::from("Hello Graz");
s += "!";
println!("{}", s);
}
Does it compile? Yes.
- data must be a valid UTF-8 string
- owns its data (dropping a String deallocates its data)
- ⇒ “owned string”
- does not implement Copy, thus move semantics apply
- consists of {&data, length, capacity}
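The three parts of a `String` can be inspected at runtime. A minimal sketch, building on the commented-out `String::with_capacity(11)` from the earlier slide; the function `greeting` is made up for illustration.

```rust
/// Builds "Hello Graz!" in a buffer that is reserved up front.
fn greeting() -> String {
    let mut s = String::with_capacity(11); // room for 11 bytes, length 0
    s.push_str("Hello Graz");
    s += "!"; // 11 bytes in total, fits into the reserved capacity
    s
}

fn main() {
    let s = greeting();
    println!("{}", s);
    println!(
        "data at {:p}, length {}, capacity {}",
        s.as_ptr(),
        s.len(),
        s.capacity()
    );
} // `s` is dropped here and its heap buffer is deallocated
```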
// \xD8 is not valid UTF-8 on its own (U+D800 is a surrogate code point)
fn main() {
let s = String::from("Hello\xD8Graz!");
println!("{}", s);
}
error: this form of character escape may only
be used with characters in the range [\x00-\x7f]
--> src/main.rs:3:32
|
2 | let s = String::from("Hello\xD8Graz!");
| ^^^^
fn sub(arg: String) {
println!("{}", arg);
}
fn main() {
let s = String::from("Hello Graz");
sub(s);
println!("{}", s);
}
Does it compile?
error[E0382]: borrow of moved value: `s`
--> src/main.rs:8:20
|
6 | let s = String::from("Hello Graz");
| - move occurs because `s` has type
| `std::string::String`, which does
| not implement the `Copy` trait
7 | sub(s);
| - value moved here
8 | println!("{}", s);
| ^ value borrowed here after move
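Two possible ways around this error while keeping the `String` parameter, sketched under the assumption that `sub` may be changed to return its argument: move a clone and keep the original, or move the value in and take ownership back.

```rust
fn sub(arg: String) -> String {
    println!("{}", arg);
    arg // hand ownership back to the caller
}

fn main() {
    let s = String::from("Hello Graz");
    sub(s.clone()); // option 1: move a copy, `s` stays usable
    let s = sub(s); // option 2: move `s` in and take it back
    println!("{}", s);
}
```

Cloning allocates a second buffer, so for read-only access borrowing is usually the cheaper choice.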
- data must be valid UTF-8 string
- string literals are stored in .data/.rodata and are never deallocated
- ⇒ “borrowed string”; a string literal lives as long as the program
- consists of {&data, length}
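Because a `&str` is only a pointer plus a length, a substring is a view into the same data and nothing is copied. A small sketch; `first_word` is a name invented for this example.

```rust
/// Returns the part of `s` before the first space (or all of `s`).
fn first_word(s: &str) -> &str {
    match s.find(' ') {
        Some(idx) => &s[..idx],
        None => s,
    }
}

fn main() {
    let s = "Hello Graz";
    let w = first_word(s);
    println!("{} {}", w, w.len()); // Hello 5
    // Same start address as the original: only the length differs.
    println!("{}", w.as_ptr() == s.as_ptr()); // true
}
```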
fn main() {
let s = "H\x65\u{6C}lo \
Graz";
println!("{}", s);
}
fn main() {
println!("\u{1F4A9}");
}
💩
fn main() {
let s: &str = "Hello Graz";
s += "!";
println!("{}", s);
}
error[E0368]: binary assignment operation
`+=` cannot be applied to type `&str`
--> src/main.rs:3:5
|
3 | s += "!";
| -^^^^^^^
| |
| cannot use `+=` on type `&str`
fn sub(arg: &str) {
println!("{}", arg);
}
fn main() {
let s: &str = "Hello Graz!";
sub(s);
println!("{}", s);
}
Does it compile? Yes.
fn main() {
let s = "Hello Graz!";
let b = "Hello Graz!";
println!("{:p} {:p}", s, b);
// 0x557ab7f09cc0 0x557ab7f09cc0
}
use std::mem;
fn main() {
println!("{}", mem::size_of::<A>());
}
where A is u8 (1), u32 (4), f64 (8), &u8 (8), String (24), &str (16), Vec<u8> (24) or &[char] (16).
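Since `A` is only a placeholder, here is a runnable variant that prints the size of every listed type. The pointer-based sizes in the comments assume a 64-bit target.

```rust
use std::mem::size_of;

fn main() {
    println!("u8      {}", size_of::<u8>()); // 1
    println!("u32     {}", size_of::<u32>()); // 4
    println!("f64     {}", size_of::<f64>()); // 8
    println!("char    {}", size_of::<char>()); // 4
    // The following depend on the pointer width (values for 64 bit).
    println!("&u8     {}", size_of::<&u8>()); // 8: one pointer
    println!("&str    {}", size_of::<&str>()); // 16: pointer + length
    println!("&[char] {}", size_of::<&[char]>()); // 16: pointer + length
    println!("String  {}", size_of::<String>()); // 24: pointer + length + capacity
    println!("Vec<u8> {}", size_of::<Vec<u8>>()); // 24: pointer + length + capacity
}
```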
- Vec<u8> can contain an arbitrary non-UTF-8 string
- char is always 4 bytes and thus can contain any Unicode code point except surrogates
- [u8] is a slice of u8. Cumbersome to handle
- OsString, ffi::CString … if you need compatibility strings.
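A short sketch of how these alternatives behave next to `String`, using only standard library types. The byte values are chosen for illustration.

```rust
use std::ffi::{CString, OsString};

fn main() {
    // Vec<u8> accepts any bytes; converting to String validates UTF-8.
    let raw: Vec<u8> = vec![b'H', b'i', 0xD8];
    println!("{}", String::from_utf8(raw).is_err()); // true

    // One `char` per code point, 4 bytes each.
    let chars: Vec<char> = "Graz".chars().collect();
    println!("{:?}", chars); // ['G', 'r', 'a', 'z']

    // CString appends the NUL terminator that C APIs expect.
    let c = CString::new("Graz").expect("no interior NUL byte");
    println!("{:?}", c.as_bytes_with_nul()); // [71, 114, 97, 122, 0]

    // OsString holds whatever the platform uses for paths and arguments.
    let os = OsString::from("Graz");
    println!("{:?}", os);
}
```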
fn main() {
println!("{}", "ß".to_uppercase());
}
Output: SS
Compare with Unicode casemap F.A.Q.
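One consequence of this mapping is that case conversion can change the number of characters and does not round-trip. A minimal sketch with an example word of my own choosing:

```rust
fn main() {
    let lower = "straße";
    let upper = lower.to_uppercase();
    println!("{} -> {}", lower, upper); // straße -> STRASSE
    println!("{} -> {}", lower.chars().count(), upper.chars().count()); // 6 -> 7

    // For a single char, to_uppercase() returns an iterator for this reason.
    println!("{:?}", 'ß'.to_uppercase().collect::<Vec<char>>()); // ['S', 'S']

    // Lowercasing again does not restore the ß.
    println!("{}", upper.to_lowercase()); // strasse
}
```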
fn main() {
let s = {
let alt: &str = "Graz";
alt
};
println!("{}", s);
}
Does it compile? Yes.
fn main() {
let s = {
let alt: String = "Graz".to_string();
alt
};
println!("{}", s);
}
Does it compile? Yes.
fn main() {
let s = {
let alt: String = "Graz".to_string();
alt.as_str()
};
println!("{}", s);
}
Does it compile?
error[E0597]: `alt` does not live long enough
--> src/main.rs:4:9
|
2 | let s = {
| - borrow later stored here
3 | let alt: String = "Graz".to_string();
4 | alt.as_str()
| ^^^ borrowed value does not live long enough
5 | };
| - `alt` dropped here while still borrowed
fn takes_str(_s: &str) {}
fn main() {
let s = String::from("Hello");
takes_str(&s); // &String coerces to &str
}
fn main() {
let mut s = "合気道";
println!("{}", s[1]);
}
error[E0277]: the type `str` cannot be indexed by `{integer}`
--> src/main.rs:3:20
|
3 | println!("{}", s[1]);
| ^^^^
| string indices are ranges of `usize`
fn main() {
let mut s = "合気道".chars();
println!("{}", s.nth(1).unwrap()); // 気
}
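Indexing by integer is rejected because string offsets are byte offsets, and one character can span several bytes. A sketch of what is possible instead; `nth_char` is a helper name made up for this example.

```rust
/// The n-th code point of `s`, found by walking the string from the start.
fn nth_char(s: &str, n: usize) -> Option<char> {
    s.chars().nth(n)
}

fn main() {
    let s = "合気道";
    println!("{} bytes, {} chars", s.len(), s.chars().count()); // 9 bytes, 3 chars
    println!("{:?}", nth_char(s, 1)); // Some('気')

    // Byte ranges work if they fall on character boundaries.
    println!("{}", &s[3..6]); // 気
    // `get` returns None instead of panicking for a range inside a character.
    println!("{:?}", s.get(1..2)); // None

    // char_indices() yields the byte offset of every character.
    for (offset, c) in s.char_indices() {
        println!("{} {}", offset, c);
    }
}
```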
fn main() {
for s in "देवनागरी".chars() {
println!("{}", s);
}
}
द
े
व
न
ा
ग
र
ी
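The output above shows that `chars()` yields code points, not what a reader perceives as characters: vowel signs and other combining marks come out as separate items. A small sketch of the same effect, with an accented Latin letter added as a second example; grouping code points into user-perceived characters is not something the standard library offers.

```rust
fn main() {
    let s = "देवनागरी";
    println!("{} bytes, {} code points", s.len(), s.chars().count()); // 24 bytes, 8 code points

    // The same visible letter, once precomposed and once as e + combining acute.
    let precomposed = "\u{e9}";
    let decomposed = "e\u{301}";
    println!("{} {}", precomposed, decomposed);
    println!("{}", precomposed == decomposed); // false
    println!(
        "{} vs {}",
        precomposed.chars().count(),
        decomposed.chars().count()
    ); // 1 vs 2
}
```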
Interesting read: Stackoverflow “Why does modern Perl avoid UTF-8 by default?”
- ASCII is a __-bit encoding
  - ASCII is a 7-bit encoding
- Maximum number of bytes of a UTF-8 code point?
  - 4
- Which string types does rust define?
  - &str, String
- std::mem::size_of::<char>() gives?
  - 4
- How to iterate over characters of a string?
  - let mut s = "Hello world".chars();
Wed, 2020/01/29 19:00
Topic: traits