word cloud in foreign language. #367
Comments
Yes, it looks like there is not much support for the Nepali language. Popular fonts such as Preeti and Kantipur still produce squares or blocks. With an available Devanagari font I got partial output, but I cannot find a full solution for Nepali (see the partial solution below).
Have you tried Noto? And what problem are you currently facing?
Yes, it prints the letters, but it does not print the words correctly. The word cloud does not make much sense; it is not what I expect from this text: """हिन्दू धर्मगुरु आचार्य श्रीनिवास पक्राउ परेका छन्। आफूले आफैंलाई गोली हान्न लगाएको अभियोगमा मोरङ प्रहरीको टोलीले श्रीनिवासलाई सोमबार काठमाडौंबाट पक्राउ गरेको हो।"""
If you can't give me more details on what the problem is, I won't be able to help you. So you're saying it's not splitting up the string into words correctly? You can provide your own regular expression. See #272 for a discussion on Thai.
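A minimal sketch of the custom regular expression suggestion, assuming the root cause is that Python's `\w` does not match Devanagari combining marks (vowel signs, virama), so words get cut into fragments. The pattern `DEVANAGARI_WORD` and the variable names are illustrative assumptions, not part of wordcloud; `DEFAULT_STYLE` mirrors what I believe wordcloud falls back to when no `regexp` is given.

```python
import re

# Default-style pattern (assumed to match wordcloud's fallback). In Python's
# re module, \w does not match combining marks such as vowel signs or the
# virama, so Devanagari words are split at those characters.
DEFAULT_STYLE = r"\w[\w']+"

# Hypothetical replacement: runs of Devanagari code points, skipping the
# danda and double danda (U+0964, U+0965), plus ZWNJ/ZWJ used inside words.
DEVANAGARI_WORD = r"[\u0900-\u0963\u0966-\u097F\u200C\u200D]+"

sample = "हिन्दू धर्मगुरु आचार्य श्रीनिवास पक्राउ परेका छन्।"

default_tokens = re.findall(DEFAULT_STYLE, sample)
devanagari_tokens = re.findall(DEVANAGARI_WORD, sample)

print("default-style:", default_tokens)
print("custom:", devanagari_tokens)
```

If this matches your case, the pattern could be passed as `WordCloud(regexp=DEVANAGARI_WORD, font_path=...)`, with `font_path` pointing at a font that covers Devanagari.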
Your answer to #272 seems reasonable to me, and in fact the problem is not caused by wordcloud. I know that for a fact. It's a font issue.
Well, the current problem doesn't seem to be a font issue but a tokenization issue (I think), though I can't tell because I can't read the language.
Since this is my first time trying to implement this in my own mother tongue, I feel embarrassed about this tokenization issue. What are the best resources for learning about tokenization for foreign languages? I am sure you have been made aware of this fact.
I don't know. Maybe look at https://nlp.stanford.edu/IR-book/
Yes, you are probably right. I should have made it specific to "Nepali". Yes, English is my second language. :-)
@amueller Is this issue solved for Devanagari script? Even after adding fonts, I'm unable to get the words to display; I'm still getting rectangular boxes.
@SilentFlame that probably means the font doesn't contain the symbols. Try rendering with PIL/pillow directly.
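One way to try the suggestion above, as a rough sketch: draw the text with Pillow alone, so that boxes in the saved image point at the font rather than at wordcloud. The helper names `describe_characters` and `render_with_pillow`, the output file name, and the image size are assumptions made for illustration; the font path must be supplied by you.

```python
import unicodedata


def describe_characters(text):
    """Return (char, Unicode name) pairs for each distinct non-space character."""
    chars = sorted({ch for ch in text if not ch.isspace()})
    return [(ch, unicodedata.name(ch, "UNKNOWN")) for ch in chars]


def render_with_pillow(text, font_path, out_path="glyph_check.png", size=48):
    """Draw text with Pillow only; boxes in the image suggest missing glyphs."""
    from PIL import Image, ImageDraw, ImageFont  # requires Pillow

    font = ImageFont.truetype(font_path, size)
    image = Image.new("RGB", (1200, size * 3), "white")
    ImageDraw.Draw(image).text((10, size // 2), text, font=font, fill="black")
    image.save(out_path)
    return out_path


sample = "पक्राउ परेका छन्"

# Characters the chosen font has to cover for this sample.
names = dict(describe_characters(sample))
for char, name in names.items():
    print(repr(char), name)
```

Calling `render_with_pillow(sample, "/path/to/your-devanagari-font.ttf")` (a placeholder path) writes a PNG you can inspect by eye.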
@amueller thanks, I tried the suggestion above and it works; I just had to write my own regex for words.
Hello! I am having trouble using Bangla text in a word cloud. It is not tokenizing correctly: it breaks up complex words and all the vowel signs. I tried CTLK to tokenize but still no luck.
@Shorotshishir you might need a custom regexp. Tokenization is unfortunately beyond the scope of this package. Let me know if you find a solution. spaCy might help.
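Since tokenization is out of scope for the package, one option is to tokenize outside wordcloud and hand it word counts. A minimal sketch, assuming a simple pattern over the Bengali Unicode block is good enough for your text; `BENGALI_WORD`, `count_words`, and `build_cloud` are hypothetical names, and the sample sentence is only there for illustration.

```python
import re
from collections import Counter

# Hypothetical pattern: runs of code points from the Bengali block
# (U+0980 to U+09FF) plus ZWNJ/ZWJ. The danda (U+0964) lies outside the
# block, so it acts as a separator just like whitespace.
BENGALI_WORD = re.compile(r"[\u0980-\u09FF\u200C\u200D]+")


def count_words(text):
    """Count Bengali words without splitting off vowel signs or conjuncts."""
    return Counter(BENGALI_WORD.findall(text))


def build_cloud(frequencies, font_path):
    """Build a cloud from precomputed counts; font_path must cover Bengali."""
    from wordcloud import WordCloud  # requires the wordcloud package

    cloud = WordCloud(font_path=font_path, background_color="white")
    return cloud.generate_from_frequencies(dict(frequencies))


sample = "আমি বাংলায় গান গাই। আমি বাংলার গান গাই।"
freqs = count_words(sample)
print(freqs.most_common())
```

This only addresses tokenization; it does not by itself fix glyph placement, which depends on how the text is shaped at render time.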
Use a custom regex and a Bengali font. You can use one from Omicron Lab. Here's my code, but I am still facing an incorrect glyph placement problem.
@riyadhrazzaq vai, did you solve the glyph placement problem?
I did not.
Description
Trying to create word cloud in a foreign language
I have documented the problem here:
https://stackoverflow.com/questions/50080183/word-cloud-or-visualization-in-foreign-languages
string1="""आजको छापा English Logo गृहपृष्ठ राजनीति समाज विचार किनमेल कला खेलकुद घुमफिर ब्लग साहित्यपाटी ग्लोबल फोटो ग्यालरी कस्तो छ प्रधानमन्त्रीको स्वास्थ्य? सरकारले सिण्डिकेट हटाएपछि देशैभरका टिकट काउन्टर बन्द उपेन्द्र यादवले फेरि दिए स\u200cंविधान नस्वीकारेको धम्की पौडेलले देउवालाई भने– प्रधानमन्त्री नभए पनि गणेशमानलाई जनताले पूज्छन्, तपाईंलाई कस्ले पुज्छ? ३३ किलो सुन गायब प्रकरण : यस्तो छ गोरे – प्रहरी ‘कनेक्सन’ चीनलाई उपहार दिने गैँडा फेला परेन काठमाडौंमा भारतका ३ पूर्वराजदूत गाउँ चम्किए, सदरमुकाम खस्किए यी हुन् मोबाइल नबोक्ने ‘ठूला मान्छे’ विगतको पोल खुल्ने डरले भगाइयो गोरेलाई गुराँस टिप्नेलाई ‘जंगलमै कारबाही’ नेपाल भ्रमणमा आफ्नै कार ल्याउँदैछन् मोदीले सिंहदरबारभित्र कोठा खोज्दै प्रधानमन्त्री कार्यालय डाक्टरले ‘भ्वाइस रेस्ट’ गर्न भनेका गच्छदार ३ घन्टा ५ मिनेट बोले, शुक्रबार थप १ घन्टा बोल्ने अभियुक्तसँग नाम थर मिल्दा निर्दोषलार्इ जेल सांसदहरूले व्यापार-व्यावसाय गर्न नपाउने सरकारको नीति तथा कार्यक्रम तयार, ८ प्रतिशतको आर्थिक वृद्धिको लक्ष्य स्वतन्त्र हुन सम्बन्धविच्छेद गर्ने क्रम बढ्यो मोदीको भ्रमण तालिका बनाउनै हम्मे बोली फेरिएन प्रधानमन्त्रीको: दुई बर्षपछि पनि उस्तै भाषण पञ्चायतदेखि नै\xa0सुन र शक्तिको सम्बन्ध! यी हुन् सुन तस्करीका ७ घुम्ती एमाले–माओवादीले १० हजार युवालाई मार्क्सवाद पढाउने साउदीमा नेपाली युवालाई मृत्युदण्डको फैसला सरकारी निकायले १३ अर्ब नतिर्दा गुठी थला ‘पूर्वी नेपाल भूकम्प उच्च जोखिममा’ सामुदायिक स्कुलमा पनि निजीजस्तै शुल्क सिंचाई विभागमा दिनहुँ चल्छ जुवातास गृहपृष्ठ ब्लग साहित्यपाटी पाठक विचार दसैं सामग्री छापाबाट फिड """
Steps/Code to Reproduce
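import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Build the cloud from the raw text. No font_path or regexp is set here,
# so wordcloud uses its default font and default tokenizer.
wordcloud = WordCloud(max_font_size=300, background_color='white', relative_scaling=1,
                      width=1500, height=1000, colormap='plasma').generate(string1)  # generate_from_frequencies(linklist)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()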
Produces the cloud as attached.
https://stackoverflow.com/questions/50080183/word-cloud-or-visualization-in-foreign-languages