Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

latin-camelcase feature make wrong segmentation #289

Open
hamano opened this issue May 1, 2024 · 11 comments
Open

latin-camelcase feature make wrong segmentation #289

hamano opened this issue May 1, 2024 · 11 comments
Labels
good first issue Good for newcomers

Comments

@hamano
Copy link

hamano commented May 1, 2024

The default featre has issue with proper noun segmentation like OpenSSL.

main.rs:

use std::env;
use charabia::Segment;

fn main() {
    let arg = env::args().nth(1).unwrap();
    let segments = arg.as_str().segment_str().collect::<Vec<&str>>().join("|");
    println!("{}", segments)
}

default feature:

$ cargo run "OpenSSL OpenSsl openSsl open_ssl"
Open|S|S|L| |Open|Ssl| |open|Ssl| |open|_|ssl

disable default feature

$ cargo run "OpenSSL OpenSsl openSsl open_ssl"
OpenSSL| |OpenSsl| |openSsl| |open|_|ssl
@ManyTheFish
Copy link
Member

Hello @hamano, What do you expect in terms of segmentation?

Thank you!

@hamano
Copy link
Author

hamano commented Jun 4, 2024

It is ideal that proper nouns are not split, but if it is not in the dictionary, it can't avoided that OpenSSL would be split as Open|SSL.
Here, I recognize the issue as splitting it as Open|S|S|L. The word SSL would disappear.

@hamano
Copy link
Author

hamano commented Jun 4, 2024

The inability to segment words not found in the dictionary was a characteristic in Japanese. Please ignore that.
In any case, I believe it is desirable that 'OpenSSL' not be segment, even in Latin languages

@hamano
Copy link
Author

hamano commented Jun 4, 2024

CloudCannon/pagefind#591

@hamano
Copy link
Author

hamano commented Jun 4, 2024

The word open_ssl is not common, but for example, in the case of a module like apache mod_ssl, it is expected not to be segmented.

@ManyTheFish
Copy link
Member

Hello @hamano,
I think the better way to solve this issue is to add the word SSL to the tokenizer dictionary. It seems to be a specific case more than a generality.

@hamano
Copy link
Author

hamano commented Jun 5, 2024

I'm concerned not just about the term "OpenSSL" but about countless similar terms.

OpenVPN
OpenBSD
OpenJDK
OpenCV
FreeRADIUS
FreeBSD
PostgreSQL
MySQL
MongoDB

I don't think all these terms should be added to the dictionary. New terms emerge one after another.

@hamano
Copy link
Author

hamano commented Jun 5, 2024

If you say this is the expected behavior of the latin-camelcase feature, then that's fine. I'll disable it. However, it's too inconvenient as a default feature, so I reported it, especially for use in technical documentation.

@ManyTheFish
Copy link
Member

ManyTheFish commented Jun 5, 2024

Understood, we may disable it from the default features,
thinking of it, this segmentation is mainly used for the cases you are describing, so if you want to change the behavior of the segmenter, I would gladly accept a new PR.
Moreover, the change seems easy, maybe replacing the highlighted code block by:

        let should_group = if last_char_was_lowercase && char.is_letter_uppercase() {
            false
        } else {
            true
        };

        last_char_was_lowercase = char.is_letter_lowercase();
        should_group

or even

        let should_group = !(last_char_was_lowercase && char.is_letter_uppercase());
        last_char_was_lowercase = char.is_letter_lowercase();
        should_group

Should solve your issue ☺️

However, the word OpenSSLError would be segmented as ["Open", "SSLError"] with this change, it would need a bit more efforts to make it segment like ["Open", "SSL", "Error"]. 🤔

@hamano
Copy link
Author

hamano commented Jun 6, 2024

I am unsure how "OpenSSLError" should be segment. Perhaps in the context of a program constant name appearing in documentation, it is expected not to be segment.
Anyway, I don't understand the use case for the latin-camelcase feature. So, I will simply disable it.
What I wanted to report here is that the excessive segmentation of "OpenSSL" into "Open|S|S|L" is likely not expected by anyone.

@ManyTheFish
Copy link
Member

Hey @hamano,
no problem. I just wanted to say that if you want to use the feature, you can create a PR enhancing the behavior and I will accept it.
See you!

@ManyTheFish ManyTheFish added the good first issue Good for newcomers label Jun 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers
Projects
None yet
Development

No branches or pull requests

2 participants