latin-camelcase feature make wrong segmentation #289

hamano · 2024-05-01T00:40:49Z

The default feature has an issue with the segmentation of proper nouns like OpenSSL.

main.rs:

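use std::env;
use charabia::Segment;

fn main() {
    // Text to segment, taken from the first command-line argument.
    let arg = env::args().nth(1).expect("usage: cargo run <text>");
    // Join the segments with '|' so the boundaries are visible.
    let segments = arg.as_str().segment_str().collect::<Vec<&str>>().join("|");
    println!("{}", segments);
}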

default feature:

$ cargo run "OpenSSL OpenSsl openSsl open_ssl"
Open|S|S|L| |Open|Ssl| |open|Ssl| |open|_|ssl

default features disabled:

$ cargo run "OpenSSL OpenSsl openSsl open_ssl"
OpenSSL| |OpenSsl| |openSsl| |open|_|ssl

ManyTheFish · 2024-06-03T06:46:13Z

Hello @hamano, What do you expect in terms of segmentation?

Thank you!

hamano · 2024-06-04T01:35:51Z

~~Ideally, proper nouns would not be split, but if a word is not in the dictionary, it can't be avoided that OpenSSL is split as Open|SSL.~~
The issue I am reporting here is that it is split as Open|S|S|L, so the word SSL disappears.

hamano · 2024-06-04T01:46:07Z

The inability to segment words not found in the dictionary is a characteristic of Japanese. Please ignore that.
In any case, I believe it is desirable that 'OpenSSL' not be segmented, even in Latin languages.

hamano · 2024-06-04T06:12:21Z

CloudCannon/pagefind#591

hamano · 2024-06-04T06:16:51Z

The word open_ssl is not common, but for example, in the case of a module like apache mod_ssl, it is expected not to be segmented.

ManyTheFish · 2024-06-04T08:58:33Z

Hello @hamano,
I think the best way to solve this issue is to add the word SSL to the tokenizer dictionary. It seems to be a specific case rather than a general one.

hamano · 2024-06-05T01:58:02Z

I'm concerned not just about the term "OpenSSL" but about countless similar terms.

OpenVPN
OpenBSD
OpenJDK
OpenCV
FreeRADIUS
FreeBSD
PostgreSQL
MySQL
MongoDB

I don't think all these terms should be added to the dictionary. New terms emerge one after another.

hamano · 2024-06-05T02:10:28Z

If you say this is the expected behavior of the latin-camelcase feature, then that's fine. I'll disable it. However, it's too inconvenient as a default feature, so I reported it, especially for use in technical documentation.

ManyTheFish · 2024-06-05T07:47:35Z

Understood, we may remove it from the default features.
Thinking about it, this segmentation is mainly meant for the cases you are describing, so if you want to change the behavior of the segmenter, I would gladly accept a new PR.
Moreover, the change seems easy, maybe replacing the highlighted code block with:

        let should_group = if last_char_was_lowercase && char.is_letter_uppercase() {
            false
        } else {
            true
        };

        last_char_was_lowercase = char.is_letter_lowercase();
        should_group

or even

        let should_group = !(last_char_was_lowercase && char.is_letter_uppercase());
        last_char_was_lowercase = char.is_letter_lowercase();
        should_group

That should solve your issue ☺️
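To make the effect of that rule concrete, here is a minimal standalone sketch. It assumes plain `char` methods from the standard library instead of charabia's internals, and `split_camel_case` is a hypothetical helper that does not exist in the crate. It only splits where a lowercase letter is directly followed by an uppercase one:

```rust
/// Hypothetical helper, not part of charabia: splits `s` wherever a
/// lowercase letter is immediately followed by an uppercase letter.
fn split_camel_case(s: &str) -> Vec<&str> {
    let mut segments = Vec::new();
    let mut start = 0;
    let mut last_char_was_lowercase = false;

    for (i, c) in s.char_indices() {
        // Same predicate as the suggestion above: keep grouping unless
        // the previous char was lowercase and this one is uppercase.
        let should_group = !(last_char_was_lowercase && c.is_uppercase());
        if !should_group {
            segments.push(&s[start..i]);
            start = i;
        }
        last_char_was_lowercase = c.is_lowercase();
    }
    // Push the trailing segment, if any.
    if start < s.len() {
        segments.push(&s[start..]);
    }
    segments
}

fn main() {
    for word in ["OpenSSL", "OpenSsl", "openSsl", "OpenSSLError"] {
        println!("{}", split_camel_case(word).join("|"));
    }
}
```

With this rule a run of capitals such as SSL stays in one piece, because a boundary is only placed on a lowercase-to-uppercase transition.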

However, the word OpenSSLError would be segmented as ["Open", "SSLError"] with this change; it would need a bit more effort to make it segment like ["Open", "SSL", "Error"]. 🤔
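One possible way to get the three-part split is to look one character ahead and also cut before the last capital of an uppercase run when a lowercase letter follows it. The sketch below is only an illustration under that assumption. `split_camel_case_acronyms` is a hypothetical name, and the code is independent of how charabia's segmenter is actually structured:

```rust
/// Hypothetical helper, not part of charabia: camelCase splitting that
/// also separates an acronym from the capitalized word that follows it.
fn split_camel_case_acronyms(s: &str) -> Vec<&str> {
    let chars: Vec<(usize, char)> = s.char_indices().collect();
    let mut segments = Vec::new();
    let mut start = 0;

    for k in 1..chars.len() {
        let (_, prev) = chars[k - 1];
        let (i, cur) = chars[k];
        let next = chars.get(k + 1).map(|&(_, c)| c);

        // Boundary 1: lowercase followed by uppercase ("n|S" in "OpenSSL").
        let lower_to_upper = prev.is_lowercase() && cur.is_uppercase();
        // Boundary 2: end of an uppercase run, where the last capital
        // starts the next word ("L|E" in "SSLError").
        let acronym_end = prev.is_uppercase()
            && cur.is_uppercase()
            && next.map_or(false, |c| c.is_lowercase());

        if lower_to_upper || acronym_end {
            segments.push(&s[start..i]);
            start = i;
        }
    }
    // Push the trailing segment, if any.
    if start < s.len() {
        segments.push(&s[start..]);
    }
    segments
}

fn main() {
    for word in ["OpenSSL", "OpenSSLError"] {
        println!("{}", split_camel_case_acronyms(word).join("|"));
    }
}
```

The lookahead is the extra effort: a single `last_char_was_lowercase` flag cannot tell that the E in SSLError begins a new word until the following lowercase r has been seen.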

hamano · 2024-06-06T01:29:25Z

I am unsure how "OpenSSLError" should be segmented. Perhaps, in the context of a program constant name appearing in documentation, it is expected not to be segmented.
Anyway, I don't understand the use case for the latin-camelcase feature, so I will simply disable it.
What I wanted to report here is that the excessive segmentation of "OpenSSL" into "Open|S|S|L" is likely not expected by anyone.

ManyTheFish · 2024-06-06T07:55:02Z

Hey @hamano,
no problem. I just wanted to say that if you want to use the feature, you can create a PR enhancing the behavior and I will accept it.
See you!

ManyTheFish added the "good first issue" label on Jun 6, 2024
