Non UTF-8 encoding support #117

Open
danielpclark opened this issue Jun 1, 2017 · 5 comments

Comments

@danielpclark
Owner

The Ruby spec has a test with a Windows-encoded string: basename_spec.rb#L162-L166. This encoding is not UTF-8 compatible and is likely a variation on UTF-16 or UCS-2. Rust's standard String and &str hold only UTF-8, so custom types would need to be written to support such encodings.

Such encodings should be virtually non-existent in web frameworks, so problems would likely arise only in Windows-specific applications.

Work that has been done in the community towards making a working solution includes

This would make much more sense to implement in FasterPath once Windows support has been added and the code compiles specifically for Windows, so it should be considered after #102.

@danielpclark
Owner Author

The test in question is

  it "returns the basename with the same encoding as the original" do
    basename = File.basename('C:/Users/Scuby Pagrubý'.encode(Encoding::Windows_1250))
    basename.should == 'Scuby Pagrubý'.encode(Encoding::Windows_1250)
    basename.encoding.should == Encoding::Windows_1250
  end

To make Rust happy, the following runs, but the character data does not survive the round trip back to the original encoding:

  def self.basename(pth, ext="")
    Rust.basename(
      pth.encode(Encoding::UTF_8),
      ext.encode(Encoding::UTF_8)
    ).force_encoding(pth.encoding)
  end 

The test output is

File.basename returns the basename with the same encoding as the original FAILED
Expected "Scuby Pagrub\xC3\xBD"
 to equal "Scuby Pagrub\xFD"
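
For readers following along, the mismatch above (`\xC3\xBD` versus `\xFD`) is consistent with `force_encoding` only relabelling the UTF-8 bytes rather than converting them. A minimal sketch of the difference, assuming `File.basename` as a stand-in for `Rust.basename` and a hypothetical helper name `basename_via_utf8` (neither is the project's actual code):

```ruby
# Hypothetical helper: File.basename stands in for Rust.basename here.
def basename_via_utf8(pth, ext = "")
  File.basename(
    pth.encode(Encoding::UTF_8),
    ext.encode(Encoding::UTF_8)
  ).encode(pth.encoding) # transcode back instead of relabelling
end

original = "C:/Users/Scuby Pagrub\u00FD".encode(Encoding::Windows_1250)
utf8_name = File.basename(original.encode(Encoding::UTF_8))

# force_encoding keeps the two UTF-8 bytes of the accented letter and only changes the label.
relabelled = utf8_name.dup.force_encoding(original.encoding)
# encode converts the bytes, so the accented letter becomes one Windows-1250 byte.
transcoded = basename_via_utf8(original)

puts relabelled.bytes.last(2).inspect
puts transcoded.bytes.last(1).inspect
puts transcoded.encoding
```

This still pays for two transcoding passes per call, so it is a workaround rather than native encoding support.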

@glebm
Contributor

glebm commented Sep 11, 2017

This encoding is not UTF-8 compatible and is likely a variation on UTF-16 or UCS-2.

Not that this is relevant, but just an FYI: Windows-1250 is a single-byte encoding. It can encode at most 256 characters. The first 128 match 7-bit ASCII, and the second half is mostly accented Latin letters.
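
A quick way to check this from Ruby, as a small sketch using only core `String` methods (the sample strings are taken from the spec above):

```ruby
utf8 = "Pagrub\u00FD"                       # literal is UTF-8
win  = utf8.encode(Encoding::Windows_1250)  # same text, single-byte encoding

# In Windows-1250 every character is exactly one byte.
puts win.length
puts win.bytesize

# In UTF-8 the accented letter takes two bytes.
puts utf8.bytesize

# The ASCII half is byte-for-byte identical in both encodings.
ascii_same = "Scuby".encode(Encoding::Windows_1250).bytes == "Scuby".bytes
puts ascii_same
```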

@danielpclark
Owner Author

danielpclark commented Sep 12, 2017

Thanks @glebm. Since I've rewritten this project in ruru, the point of this issue is now to update the capabilities of RString in ruru. New error on TravisCI:

- returns the basename with the same encoding as the originalthread '<unnamed>' panicked
at 'called `Result::unwrap()` on an `Err` value: Utf8Error { valid_up_to: 21, error_len: Some(1) }', /checkout/src/libcore/result.rs:906:4
stack backtrace:
   0: std::sys::imp::backtrace::tracing::imp::unwind_backtrace
             at /checkout/src/libstd/sys/unix/backtrace/tracing/gcc_s.rs:49
   1: std::sys_common::backtrace::_print
             at /checkout/src/libstd/sys_common/backtrace.rs:71
   2: std::panicking::default_hook::{{closure}}
             at /checkout/src/libstd/sys_common/backtrace.rs:60
             at /checkout/src/libstd/panicking.rs:381
   3: std::panicking::default_hook
             at /checkout/src/libstd/panicking.rs:397
   4: std::panicking::rust_panic_with_hook
             at /checkout/src/libstd/panicking.rs:577
   5: std::panicking::begin_panic
             at /checkout/src/libstd/panicking.rs:538
   6: std::panicking::begin_panic_fmt
             at /checkout/src/libstd/panicking.rs:522
   7: rust_begin_unwind
             at /checkout/src/libstd/panicking.rs:498
   8: core::panicking::panic_fmt
             at /checkout/src/libcore/panicking.rs:71
   9: core::result::unwrap_failed
  10: ruru::class::string::RString::to_str
  11: r_basename

That is, unless ruru offers encoding support through some other means. I still need to look into this.
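
The `valid_up_to: 21` in the panic lines up with the spec string: the first 21 bytes are plain ASCII and byte 21 is the Windows-1250 accented letter, which is not valid UTF-8 on its own. A rough Ruby sketch that reproduces the same offset by reinterpreting the bytes as UTF-8 (this only mirrors what the trace reports; it does not call ruru):

```ruby
original = "C:/Users/Scuby Pagrub\u00FD".encode(Encoding::Windows_1250)

# Treat the Windows-1250 bytes as if they were UTF-8, without converting them.
as_utf8 = original.dup.force_encoding(Encoding::UTF_8)

valid = as_utf8.valid_encoding?

# Count the bytes of the leading run of valid characters.
valid_up_to = as_utf8.each_char.take_while(&:valid_encoding?).sum(&:bytesize)
bad_byte = as_utf8.bytes[valid_up_to]

puts valid
puts valid_up_to
puts format("0x%02X", bad_byte)
```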

@danielpclark
Owner Author

I've been thinking that looking at FFI and Fiddle may give insight into where to integrate encoding support from Ruby's C code.

@danielpclark
Owner Author

danielpclark commented Dec 22, 2018

Good News

With the addition of encoding support in Rutie and the CodepointIterator, we can move forward more easily with adding encoding support. Many of the algorithms will need to be redesigned to work by individual codepoint rather than by individual char.
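
To illustrate the idea in Ruby terms only (this is not Rutie's CodepointIterator, and `basename_by_char` is a hypothetical name), a deliberately simplified basename that works on character positions never needs to leave the string's own encoding. It ignores trailing separators and the extension argument:

```ruby
# Hypothetical, simplified sketch: Ruby's String#rindex and String#[] operate on
# characters in the string's own encoding, so no UTF-8 round trip is needed.
def basename_by_char(path)
  idx = path.rindex('/')
  idx ? path[(idx + 1)..-1] : path.dup
end

original = "C:/Users/Scuby Pagrub\u00FD".encode(Encoding::Windows_1250)
name = basename_by_char(original)

puts name.encoding
puts name.bytes.last(1).inspect
```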


2 participants