Inconsistent handling of Unicode characters in String theory #412

kfriedberger · 2024-11-24T13:43:00Z

Different solvers return different results when using Unicode characters in String theory.
This should be analyzed. Maybe we need to fix JavaSMT or report to the solvers' developers.

Details: #391 (comment)

daniel-raffler · 2024-11-25T13:35:43Z

Thanks for opening the issue!

I've now added some tests:

When creating String constants in Z3 all Unicode characters must be (SMTLIB) escaped. The solver will also escape all its output.
Princess will accept any Java String and does not recognize SMTLIB escape sequences for Unicode characters. We added code in our backend to handle the conversion ourselves, but Strings from the model don't get converted back to SMTLIB format when they are printed.
CVC4 behaves similar to Z3, but actually throws an exception when the input contains Unicode characters.
CVC5 is closer to Princess, even though it does not allow Unicode characters when creating String constants

It's not too difficult to convert between UTF-16 and the SMTLIB escape format, but we'll have to decide which encoding we want to use internally. Specifically, should the Strings in StringFormula StringFormulaManager.makeString(String value) and String Evaluator.evaluate(StringFormula formula) only allow SMTLIB encoded Unicode characters, or do we want to support all Java Strings?

daniel-raffler · 2024-11-26T10:04:00Z

I've added some more conversions and all solvers should now behave the same. The current format got makeString(...) and evaluate(StringFormula formula) is is SMTLIB escaped Strings, but we may still change this now. The argument for keeping it that way is that we want to stay as close to the SMTLIB standard as possible. This may help interoperability, for instance when someone reads in a SMTLIB script and then tries to recreate it with JavaSMT. In that case it might be confusing if escaped Unicode characters are not properly recognized.

On the other hand JavaSMT is written in Java, and the type String in makeString(...) and evaluate(StringFormula formla) suggest that any Java String should be valid input. The conversion from Java String to the SMTLIB format (and back) is easy enough and can be handled by JavaSMT automatically, so there is no reason why we would be bound by the SMTLIB standard on this issue.

Either choice will break the API, although I'd argue that keeping SMTLIB as format is more in line with how the functions used to work so far (we just didn't document it).

In either case I would also suggest we also make escapeString(...) and unescapeString(...) available through the public API somehow. Theses methods will be needed often enough, and users shouldn't have to reimplement the rather error-prone conversion themselves. I've put both methods in FormulaCreator for now, but maybe they can be made available as default methods in the StringFormulaManager interface?

@kfriedberger, @baierd:
What's your opinion on this?

kfriedberger · 2024-12-30T22:47:19Z

The current implementation (and the implementation from #422) still avoid special cases like invalid Unicode escaping like "\u" (without a digit) or "\u{123456789}" (too long escape sequence). Additionally, SMTLIB has multiple ways to represent a single string constant and JavaSMT copies some behaviour, e.g., input as plain Java string using UTF16, or also escaped versions. It is unclear, how to double-escape an escape sequence in SMTLIB.

With #422, JavaSMT should allow all Java-letters from UTF16 for all solvers and escape them if needed. We also provide Java-letters in UTF16 in the model.

daniel-raffler · 2025-01-01T20:02:58Z

The current implementation (and the implementation from #422) still avoid special cases like invalid Unicode escaping like "\u" (without a digit) or "\u{123456789}" (too long escape sequence). Additionally, SMTLIB has multiple ways to represent a single string constant and JavaSMT copies some behaviour, e.g., input as plain Java string using UTF16, or also escaped versions. It is unclear, how to double-escape an escape sequence in SMTLIB.

With #422, JavaSMT should allow all Java-letters from UTF16 for all solvers and escape them if needed. We also provide Java-letters in UTF16 in the model.

Thanks for the explanation! I've moved the remaining Z3/CVC4 issue from #420 (link) to this branch and solved it, along with some other minor points. Double escaping is possible by substituting one (or all) of the letters from the escape sequence, for instance:

escape(escape("ꯍ")) = escape("\u{abcd}") = "\u{5c}u{abcd}"

Escaping the \ is also needed to handle broken Unicode sequences:

Term a = makeString("\u{abcd")
Term b = makeString("}")
Term r = makeConcat(a,b) // This should be "\u{abcd}" and not just "ꯍ"

Here we need to substitute \ with its escape sequence \u{5c} when creating Term a. Otherwise the concatenation will give us "\u{abcd}", which then gets converted to "ꯍ" when reading back the result from the model.

Even with this fix there still is a bit of an issue if we continue to use the concatenated Strings:

Term a = makeString("\u{abcd")
Term b = makeString("}")
Term r = makeConcat(a,b)
String s = eval(r) // Still "\u{abcd}" as we escaped the original String to "\u{5c}u{abcd}"
String t = eval(makeString(a')) // Now it's "ꯍ" since makeString() applied unescape()

It might be a bit unexpected that eval(makeString(x)) is not just x again. Especially if x is already an intermediate result were the escape sequences might not have been added intentionally.

Maybe one solution would be to simply ignore escape sequences in makeString()? We could still provide functions to the user that will handle escaping/unescaping, but those would have to be called manually:

Term u = makeString("\u{abcd}") // Just "\u{abcd}"
Term v = makeString(unescape("\u{abcd}")) // Now it's "ꯍ"

The same could also be used when getting values from the model:

Term z = makeString("ꯍ")
String s = eval(z) // "ꯍ"
String t = escape(eval(z)) // "\u{abcd}"

This could for instance be useful when trying to print the result to a SMTLIB file where "ꯍ" is not allowed as a String literal. The downside, of course, is that we'd be breaking the API again...

kfriedberger mentioned this issue Nov 24, 2024

Add support for Strings and Rationals to the Princess backend #391

Merged

daniel-raffler mentioned this issue Dec 30, 2024

#420: More operations in string theory #422

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inconsistent handling of Unicode characters in String theory #412

Inconsistent handling of Unicode characters in String theory #412

kfriedberger commented Nov 24, 2024

daniel-raffler commented Nov 25, 2024 •

edited

Loading

daniel-raffler commented Nov 26, 2024

kfriedberger commented Dec 30, 2024

daniel-raffler commented Jan 1, 2025

Inconsistent handling of Unicode characters in String theory #412

Inconsistent handling of Unicode characters in String theory #412

Comments

kfriedberger commented Nov 24, 2024

daniel-raffler commented Nov 25, 2024 • edited Loading

daniel-raffler commented Nov 26, 2024

kfriedberger commented Dec 30, 2024

daniel-raffler commented Jan 1, 2025

daniel-raffler commented Nov 25, 2024 •

edited

Loading