word cloud in foreign language. #367

Open
ghost opened this issue Apr 28, 2018 · 18 comments

Comments

@ghost

ghost commented Apr 28, 2018

Description

I am trying to create a word cloud in a foreign language.

I have documented the problem here:

https://stackoverflow.com/questions/50080183/word-cloud-or-visualization-in-foreign-languages

string1="""आजको छापा English Logo गृहपृष्ठ राजनीति समाज विचार किनमेल कला खेलकुद घुमफिर ब्लग साहित्यपाटी ग्लोबल फोटो ग्यालरी कस्तो छ प्रधानमन्त्रीको स्वास्थ्य? सरकारले सिण्डिकेट हटाएपछि देशैभरका टिकट काउन्टर बन्द उपेन्द्र यादवले फेरि दिए स\u200cंविधान नस्वीकारेको धम्की पौडेलले देउवालाई भने– प्रधानमन्त्री नभए पनि गणेशमानलाई जनताले पूज्छन्, तपाईंलाई कस्ले पुज्छ? ३३ किलो सुन गायब प्रकरण : यस्तो छ गोरे – प्रहरी ‘कनेक्सन’ चीनलाई उपहार दिने गैँडा फेला परेन काठमाडौंमा भारतका ३ पूर्वराजदूत गाउँ चम्किए, सदरमुकाम खस्किए यी हुन् मोबाइल नबोक्ने ‘ठूला मान्छे’ विगतको पोल खुल्ने डरले भगाइयो गोरेलाई गुराँस टिप्नेलाई ‘जंगलमै कारबाही’ नेपाल भ्रमणमा आफ्नै कार ल्याउँदैछन् मोदीले सिंहदरबारभित्र कोठा खोज्दै प्रधानमन्त्री कार्यालय डाक्टरले ‘भ्वाइस रेस्ट’ गर्न भनेका गच्छदार ३ घन्टा ५ मिनेट बोले, शुक्रबार थप १ घन्टा बोल्ने अभियुक्तसँग नाम थर मिल्दा निर्दोषलार्इ जेल सांसदहरूले व्यापार-व्यावसाय गर्न नपाउने सरकारको नीति तथा कार्यक्रम तयार, ८ प्रतिशतको आर्थिक वृद्धिको लक्ष्य स्वतन्त्र हुन सम्बन्धविच्छेद गर्ने क्रम बढ्यो मोदीको भ्रमण तालिका बनाउनै हम्मे बोली फेरिएन प्रधानमन्त्रीको: दुई बर्षपछि पनि उस्तै भाषण पञ्चायतदेखि नै\xa0सुन र शक्तिको सम्बन्ध! यी हुन् सुन तस्करीका ७ घुम्ती एमाले–माओवादीले १० हजार युवालाई मार्क्सवाद पढाउने साउदीमा नेपाली युवालाई मृत्युदण्डको फैसला सरकारी निकायले १३ अर्ब नतिर्दा गुठी थला ‘पूर्वी नेपाल भूकम्प उच्च जोखिममा’ सामुदायिक स्कुलमा पनि निजीजस्तै शुल्क सिंचाई विभागमा दिनहुँ चल्छ जुवातास गृहपृष्ठ ब्लग साहित्यपाटी पाठक विचार दसैं सामग्री छापाबाट फिड """

Steps/Code to Reproduce

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# string1 is the Nepali text defined in the Description above
wordcloud = WordCloud(max_font_size=300, background_color='white', relative_scaling=1,
                      width=1500, height=1000, colormap='plasma').generate(string1)  # generate_from_frequencies(linklist)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

Produces the cloud as attached.
https://stackoverflow.com/questions/50080183/word-cloud-or-visualization-in-foreign-languages

@amueller
Owner

See discussion in #315 and #238

@ghost
Author

ghost commented May 6, 2018

Yes, it looks like there is not a whole lot of support for the Nepali language.

Using popular fonts like Preeti, Kantipur, etc. still generates squares or blocks. However, when I used an available Devanagari font, it rendered some of the text, but I cannot find a full solution for Nepali (see the partial solution below).

https://stackoverflow.com/questions/50080183/word-cloud-or-visualization-in-foreign-languages/50081321#50081321

@amueller
Owner

amueller commented May 7, 2018

Have you tried Noto? And what's the problem you're currently facing?

@kneupaneRecordedBooks

Yes, it prints the letters, but it does not print the words correctly. The word cloud does not make a whole lot of sense.

For instance,
image

is not what I would expect from this text:

"""हिन्दू धर्मगुरु आचार्य श्रीनिवास पक्राउ परेका छन्। आफूले आफैंलाई गोली हान्न लगाएको अभियोगमा मोरङ प्रहरीको टोलीले श्रीनिवासलाई सोमबार काठमाडौंबाट पक्राउ गरेको हो।"""

@amueller
Owner

amueller commented May 7, 2018

If you can't give me more details on what the problem is, I won't be able to help you. So you're saying it's not splitting up the string into words correctly? You can provide your own regular expression. See #272 for a discussion on Thai.
Also, I highly recommend using Python 3 for this. If you figure it out, I'd be happy to add more language-specific documentation or examples.

@kneupaneRecordedBooks
Copy link

Your answer to #272 seems reasonable to me, and in fact the problem is not caused by wordcloud. I know that for a fact. It's a font issue.

@amueller
Owner

amueller commented May 7, 2018

Well the current problem doesn't seem to be a font issue but a tokenization issue (I think) but I can't tell because I can't read the language.
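One way to check whether this is a tokenization issue without reading the language is to compare a `\w`-based pattern against a pattern built on the Devanagari Unicode block. The sketch below uses only the standard library. It assumes the tokenizer in question relies on `\w`, and the exact character range (Devanagari block minus the danda punctuation marks) is an assumption that may need adjusting, for example to allow the zero-width joiners that appear in the original text. In Python's `re` module, `\w` does not match combining vowel signs or the virama, so words break apart at those characters.

```python
import re
import unicodedata

# Nepali sample taken from the text quoted earlier in this thread.
sample = "पक्राउ परेका छन्।"

# A plain \w+ pattern, standing in for a \w-based default tokenizer.
naive_tokens = re.findall(r"\w+", sample)

# Devanagari block U+0900-U+097F, skipping the danda and double danda
# (U+0964, U+0965), which are punctuation. The exact range is an assumption.
DEVANAGARI_WORD = re.compile(r"[\u0900-\u0963\u0966-\u097F]+")
block_tokens = DEVANAGARI_WORD.findall(sample)

print(naive_tokens)  # fragments of words
print(block_tokens)  # whole words

# Show which characters \w rejects, with their Unicode category.
for ch in "छन्":
    matches_w = re.match(r"\w", ch) is not None
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), matches_w)
```

If the first list contains fragments and the second contains whole words, the boxes or nonsense words in the cloud are more likely a tokenization problem than a font problem.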

@kneupaneRecordedBooks

This being my first time trying to implement this in my own mother tongue, I feel totally embarrassed about this tokenization issue. What are the best resources to learn about tokenization for a foreign language? I am sure you have been made aware of this fact.

@amueller
Owner

amueller commented May 7, 2018

I don't know. Maybe look at https://nlp.stanford.edu/IR-book/
Also: "foreign" is not really the right word to use here. It implies a contrast with a native language. I think neither your native language nor mine is English, so I wouldn't consider other languages foreign (for me, English is a foreign language).

@kneupaneRecordedBooks

Yes, you are probably right. I should have said "Nepali" specifically. Yes, English is my second language. :-)

@SilentFlame

@amueller Is this issue solved for Devanagari script? Even after adding fonts, I can't get the words to display; I'm still getting rectangular boxes.

@amueller
Owner

@SilentFlame that probably means the font doesn't contain the symbols. Try rendering with PIL/pillow directly.
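A minimal sketch of what rendering with Pillow directly could look like is given here. The function names `needs_shaping` and `render_sample`, the image size, and the font file name are hypothetical and chosen only for illustration. The Raqm check is included on the assumption that misplaced vowel signs come from Pillow's basic layout engine, which does not perform complex text shaping.

```python
import unicodedata


def needs_shaping(text: str) -> bool:
    """Return True if text contains combining marks (Unicode categories Mn/Mc)."""
    return any(unicodedata.category(ch) in ("Mn", "Mc") for ch in text)


def render_sample(text: str, font_path: str, out_path: str = "sample.png") -> None:
    """Draw text with Pillow alone so the font can be judged without wordcloud."""
    # Imported here so the helper above works even without Pillow installed.
    from PIL import Image, ImageDraw, ImageFont, features

    if needs_shaping(text) and not features.check("raqm"):
        print("Pillow was built without Raqm; combining marks may be misplaced.")

    font = ImageFont.truetype(font_path, 64)
    image = Image.new("RGB", (900, 160), "white")
    ImageDraw.Draw(image).text((20, 30), text, font=font, fill="black")
    image.save(out_path)


# Hypothetical usage; the font path is an assumption, point it at any
# Devanagari-capable .ttf file on your machine:
# render_sample("हिन्दू धर्मगुरु", "NotoSansDevanagari-Regular.ttf")
```

If the saved image shows boxes, the font lacks the glyphs. If it shows the right letters in the wrong positions, the layout engine is the more likely cause.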

@SilentFlame

@amueller thanks, I tried the suggestion above and it works; I just had to write my own regex for words.

@Shorotshishir

Hello! I am having trouble using Bangla text in a word cloud. It is not tokenizing correctly: it breaks up complex words and separates all the vowel signs. I tried CTLK to tokenize, but still no luck.

@amueller
Owner

@Shorotshishir you might need a custom regexp. Tokenization is unfortunately beyond the scope of this package. Let me know if you find a solution. spacy might help.
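Since tokenization is out of scope for the package, one option is to tokenize and count the words yourself and hand the counts to `generate_from_frequencies`, which skips the built-in tokenizer. This is a sketch under assumptions: the helper names `word_frequencies` and `build_cloud` are hypothetical, the regex simply matches runs of characters in the Bengali Unicode block (U+0980-U+09FF), and a Bengali-capable font file must be supplied by the caller.

```python
import re
from collections import Counter

# Runs of characters in the Bengali Unicode block; a simple stand-in for a
# real tokenizer, not a linguistically complete one.
BENGALI_WORD = re.compile(r"[\u0980-\u09FF]+")


def word_frequencies(text: str) -> Counter:
    """Tokenize with the block regex and count each token."""
    return Counter(BENGALI_WORD.findall(text))


def build_cloud(frequencies, font_path: str):
    """Lay out a cloud from precomputed counts, bypassing the built-in tokenizer."""
    # Imported lazily; requires the wordcloud package to be installed.
    from wordcloud import WordCloud

    return WordCloud(font_path=font_path).generate_from_frequencies(frequencies)


freqs = word_frequencies("বাংলা ভাষা, বাংলা গান")
print(freqs.most_common(1))
```

The counting step can be swapped for any other tokenizer, since `generate_from_frequencies` only needs a mapping from word to weight.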

@riyadhrazzaq

Hello! I am having trouble using Bangla text in a word cloud. It is not tokenizing correctly: it breaks up complex words and separates all the vowel signs. I tried CTLK to tokenize, but still no luck.

Use a custom regex and a Bengali font. You can use something from Omicron Lab. Here's my code, but I am still facing an incorrect glyph placement problem.

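import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Match runs of characters in the Bengali Unicode block (U+0980-U+09FF)
rgx = r"[\u0980-\u09FF]+"
# customFontPath: path to a Bengali-capable font file; text: the input string
wordcloud = WordCloud(font_path=customFontPath, regexp=rgx).generate(text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()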

@fuad021

fuad021 commented Apr 30, 2020

@riyadhrazzaq vai, did you solve the glyph placement problem?

@riyadhrazzaq

@riyadhrazzaq vai, did you solve the glyph placement problem?

I did not.


7 participants