Preprocess text, Corpus or new separate widget: provide tool to convert British English to American or vice versa #1078

Open
wvdvegte opened this issue Aug 15, 2024 · 1 comment


wvdvegte commented Aug 15, 2024

Is your feature request related to a problem? Please describe.
When I'm working with a corpus that mixes documents in American and British English spelling, the two versions of the same word (e.g., behavior and behaviour) can influence analyses such as clustering because they may be treated as different words. Stemming might help in some cases, but it's hard to tell when it works and when it doesn't.
As an example, I had a case where, in Annotated Corpus Map, both "organize" and "organise" were identified as keywords within a cluster. It would be better if only one version were identified, as an even more significant keyword.
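To illustrate the effect with a toy example (the sentences below are made up and not taken from the corpus in question), a plain word count treats the two spellings as unrelated tokens:

```python
from collections import Counter

# Hypothetical mini-corpus mixing British and American spelling.
docs = [
    "we organise the data and study behaviour",
    "we organize the data and study behavior",
    "teams organize work around behavior",
]

# Simple whitespace tokenization, just for illustration.
counts = Counter(word for doc in docs for word in doc.split())

# Each spelling is counted separately, so the evidence for one concept is split.
print(counts["organise"], counts["organize"])
print(counts["behaviour"], counts["behavior"])
```

With a single spelling, each concept would get one combined count instead of two smaller ones.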

Describe the solution you'd like
It would be better to have an option to automatically convert all documents so that they are analyzed as if written in a single version of English. I'm not sure whether this should be an option in Corpus (where the language is selected first), in Preprocess Text (although this widget may be skipped if Document Embedding is used, as suggested here), or a separate widget altogether.
The conversion can be implemented easily using the code suggested here on Stack Overflow, which relies on a word list that is no longer available at its original location but is still available in the web archive here.
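A minimal sketch of such a dictionary-based conversion is shown here. It is not the Stack Overflow code itself; the function name `to_american` and the five-entry mapping are assumptions for illustration, and a real word list would be far longer.

```python
import re

# Tiny illustrative mapping (an assumption); a real list would be far longer.
UK_TO_US = {
    "behaviour": "behavior",
    "organise": "organize",
    "organisation": "organization",
    "colour": "color",
    "centre": "center",
}

# Matches runs of ASCII letters, so each word is looked up as a whole.
WORD_RE = re.compile(r"[A-Za-z]+")


def to_american(text, mapping=UK_TO_US):
    """Replace whole words found in `mapping`; leave everything else alone."""
    def swap(match):
        word = match.group(0)
        # Lookup is case-insensitive; a replaced word comes back in lowercase.
        return mapping.get(word.lower(), word)

    return WORD_RE.sub(swap, text)


print(to_american("The organisation studies colour and behaviour."))
```

Because the lookup is per word, a word that is not in the mapping (for example "colourful" here) is left unchanged, so coverage depends entirely on how complete the list is.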

Describe alternatives you've considered
In the case I described above, I ended up with a quick fix: going back to the source data (which, fortunately, was already in a table rather than in separate documents), finding and replacing "organis" with "organiz", and reloading the data into Orange. But this is not a comprehensive solution to the problem.


wvdvegte commented Aug 20, 2024

Addendum: the suggested code has an error in its function definition, and the dictionary is incomplete. There is, of course, another alternative to consider: writing a Python script to do the translation. The Python script in the attached workflow contains the corrected code and the complete dictionary.
Nevertheless, it would be nice to have harmonization/harmonisation of English spelling as an easier-to-access option in the Text add-on. Also, this script works on text in a table; it cannot process a corpus (I have no idea how to address a corpus in Python).

UK-US conversion.ows.zip

(edit: code adapted to replace whole words only, and convert to lowercase first. Added remark about corpus as input)
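As a small illustration of why whole-word matching matters (a sketch with a made-up sentence, not the code from the attached workflow), compare plain substring replacement with a word-boundary regular expression:

```python
import re

text = "The Programme hired a programmer."

# Lowercase first, so "Programme" and "programme" are handled alike.
lowered = text.lower()

# Naive substring replacement also rewrites the inside of "programmer".
naive = lowered.replace("programme", "program")

# Whole-word replacement: \b anchors prevent matches inside longer words.
whole = re.sub(r"\bprogramme\b", "program", lowered)

print(naive)
print(whole)
```

The substring version damages "programmer", while the word-boundary version only touches the standalone word.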
