Non UTF-8 encoding support #117

danielpclark · 2017-06-01T22:18:34Z

The Ruby spec has a test with a Windows-encoded string (basename_spec.rb#L162-L166). This encoding is not UTF-8 compatible and is likely a variation on UTF-16 or UCS-2. Rust's standard String and &str types only hold UTF-8, so custom types would need to be written to support such encodings.

The occurrence of these encodings should be virtually non-existent in web frameworks so problems would likely only arise in Windows specific applications.

Work done in the community towards a working solution includes:

The WTF-8 encoding standard with the Rust crate implementation rust-wtf8.
The rust-encoding crate — Character encoding support for Rust.

This would make much more sense to implement in FasterPath once Windows support has been added and the code compiles specifically for Windows, so it should be considered after #102.

danielpclark · 2017-06-01T22:29:13Z

The test in question is

  it "returns the basename with the same encoding as the original" do
    basename = File.basename('C:/Users/Scuby Pagrubý'.encode(Encoding::Windows_1250))
    basename.should == 'Scuby Pagrubý'.encode(Encoding::Windows_1250)
    basename.encoding.should == Encoding::Windows_1250
  end

To make Rust happy the following works, but some character data is lost in translation:

  def self.basename(pth, ext="")
    Rust.basename(
      pth.encode(Encoding::UTF_8),
      ext.encode(Encoding::UTF_8)
    ).force_encoding(pth.encoding)
  end

The test output result is

File.basename returns the basename with the same encoding as the original FAILED
Expected "Scuby Pagrub\xC3\xBD"
 to equal "Scuby Pagrub\xFD"
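The mismatch above comes from `force_encoding`, which only relabels the existing bytes: the UTF-8 bytes `\xC3\xBD` stay in place instead of being converted to the single Windows-1250 byte `\xFD`. A minimal sketch of one possible fix is to label the returned bytes as UTF-8 and then transcode them back with `encode`. The helper names `utf8_basename` and `basename_preserving_encoding` are hypothetical, and `File.basename` stands in for the Rust-backed call, which is not reproduced here.

```ruby
# Hypothetical stand-in for the Rust-backed call, which only accepts UTF-8.
def utf8_basename(pth, ext = "")
  File.basename(pth, ext)
end

# Hypothetical wrapper: transcode to UTF-8 on the way in and back on the way out.
def basename_preserving_encoding(pth, ext = "")
  original = pth.encoding
  result = utf8_basename(pth.encode(Encoding::UTF_8), ext.encode(Encoding::UTF_8))
  # The returned bytes are UTF-8, so label them as such and then convert
  # them to the caller's encoding instead of only relabeling them.
  result.force_encoding(Encoding::UTF_8).encode(original)
end

path = "C:/Users/Scuby Pagrub\u00FD".encode(Encoding::Windows_1250)
name = basename_preserving_encoding(path)
puts name.encoding
puts name.bytes.last.to_s(16)
```

This costs two transcoding passes per call and raises `Encoding::UndefinedConversionError` for characters the target encoding cannot represent, so it is a workaround rather than native encoding support.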

glebm · 2017-09-11T23:07:09Z

> This encoding is not UTF-8 compatible and is likely a variation on UTF-16 or UCS-2.

Not that this is relevant, but just an FYI: Windows-1250 is a single-byte encoding. It only encodes 256 possible characters. The first 128 characters match 7-bit ASCII, and the second half is mostly accented Latin letters.
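As a small illustration of the single-byte point, the following sketch compares the byte representation of the same text in UTF-8 and in Windows-1250, using only Ruby's built-in `String#encode` and `String#bytes`. The variable names are illustrative.

```ruby
y_acute = "\u00FD" # the "ý" from the spec, as a UTF-8 string

utf8_bytes    = y_acute.bytes
win1250_bytes = y_acute.encode(Encoding::Windows_1250).bytes

# Two bytes in UTF-8, a single byte in Windows-1250.
puts utf8_bytes.map { |b| format("%02X", b) }.join(" ")
puts win1250_bytes.map { |b| format("%02X", b) }.join(" ")

# The ASCII range is byte-for-byte identical in both encodings.
ascii_same = "Scuby".encode(Encoding::Windows_1250).bytes == "Scuby".bytes
puts ascii_same
```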

danielpclark · 2017-09-12T11:27:15Z

Thanks @glebm. Since I've rewritten this project in ruru, the point of this issue is now to update the capabilities of RString in ruru. New error on TravisCI:

- returns the basename with the same encoding as the original
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Utf8Error { valid_up_to: 21, error_len: Some(1) }', /checkout/src/libcore/result.rs:906:4
stack backtrace:
   0: std::sys::imp::backtrace::tracing::imp::unwind_backtrace
             at /checkout/src/libstd/sys/unix/backtrace/tracing/gcc_s.rs:49
   1: std::sys_common::backtrace::_print
             at /checkout/src/libstd/sys_common/backtrace.rs:71
   2: std::panicking::default_hook::{{closure}}
             at /checkout/src/libstd/sys_common/backtrace.rs:60
             at /checkout/src/libstd/panicking.rs:381
   3: std::panicking::default_hook
             at /checkout/src/libstd/panicking.rs:397
   4: std::panicking::rust_panic_with_hook
             at /checkout/src/libstd/panicking.rs:577
   5: std::panicking::begin_panic
             at /checkout/src/libstd/panicking.rs:538
   6: std::panicking::begin_panic_fmt
             at /checkout/src/libstd/panicking.rs:522
   7: rust_begin_unwind
             at /checkout/src/libstd/panicking.rs:498
   8: core::panicking::panic_fmt
             at /checkout/src/libcore/panicking.rs:71
   9: core::result::unwrap_failed
  10: ruru::class::string::RString::to_str
  11: r_basename

That is, unless ruru has encoding support through some alternate means. I still need to look into this.
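The `valid_up_to: 21` in the panic lines up with the spec string: the first 21 bytes of the Windows-1250 path are plain ASCII, and the byte at offset 21 is `0xFD`, which can never appear in valid UTF-8. A minimal Ruby sketch that reproduces that offset, assuming the bytes reach Rust untranscoded (variable names are illustrative):

```ruby
path = "C:/Users/Scuby Pagrub\u00FD".encode(Encoding::Windows_1250)

# Reinterpret the Windows-1250 bytes as UTF-8 without converting them,
# which mirrors treating the raw bytes as a UTF-8 string.
raw = path.dup.force_encoding(Encoding::UTF_8)
puts raw.valid_encoding?

# Count the bytes of the leading run of valid characters.
valid_up_to = raw.each_char.take_while(&:valid_encoding?).sum(&:bytesize)
puts valid_up_to
puts format("%02X", raw.bytes[valid_up_to])
```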

danielpclark · 2018-05-01T20:24:59Z

I've been thinking that looking at FFI and Fiddle may give insight into where to integrate encoding from Ruby's C code.

danielpclark · 2018-12-22T02:30:36Z

Good News

With the addition of encoding support in Rutie and the CodepointIterator we can move forward more easily with adding encoding support. Many of the algorithms will need to be redesigned to work by individual codepoint rather than by individual char.

danielpclark added tentative Windows wontfix labels Jun 1, 2017

danielpclark mentioned this issue Jun 1, 2017

Unicode UTF-8 support #107

Closed

danielpclark added the encoding label Jun 1, 2017

danielpclark added this to the 1.0.0 milestone Jun 1, 2017

danielpclark added the enhancement label Jun 2, 2017

danielpclark removed the wontfix label Sep 12, 2017

danielpclark removed the tentative label Mar 5, 2018
