
Igbo Tokenizer Project #3

Open · wants to merge 4 commits into base: main
Conversation

chrisemezue

I have finally kicked off the Igbo tokenizer project to extract words from the scraped articles.
Sorry for the lateness.
There are still a number of TODOs, but this is a kick-off.

@@ -0,0 +1,14 @@
# 🧼 webscraper
Collaborator

so just to clarify, the tokenizer code will live in this repo under the tokenizer directory?

Author

That's the idea, yes.

Author

Or do you want just the extracted words?

# License: MIT


def cleanhtml(raw_html):
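The diff excerpt shows only the function signature. A common implementation of this kind of helper strips markup with a regular expression; a minimal sketch, assuming that is the intent here (the pattern and docstring are not from the PR):

```python
import re

# Matches any HTML tag, e.g. <p>, </div>, <a href="...">
TAG_RE = re.compile(r"<[^>]+>")


def cleanhtml(raw_html):
    """Strip HTML tags from a string, returning the plain text."""
    return TAG_RE.sub("", raw_html)
```

For example, `cleanhtml("<p>Ndewo</p>")` returns `"Ndewo"`. Note that a regex is fine for simple scraped pages, but a proper parser (e.g. BeautifulSoup) is more robust against malformed markup.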
Collaborator

Could you write up some documentation on the following:

  • What does this script do?
  • How to use this script (i.e. what is the format of input that can be handled by this script, how to run the script, etc.)

This new documentation should be a separate .md file that the public can read through.

Author

This is partly what I wanted us to discuss further. I was planning a kind of GitHub Action that can automatically extract words from the curated articles. Due to my limited experience with GitHub and limited time, this has not been finalized. Is this something that would be useful, or is there not much need for it?
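Whether it runs as a GitHub Action or locally, the extraction step itself is small. A sketch of what the word-extraction core might look like (the function name and example sentence are hypothetical, not from the PR; `\w+` with Unicode matching keeps dotted Igbo letters such as ọ, ụ, ị inside tokens):

```python
import re


def extract_words(text):
    """Split cleaned article text into lowercase word tokens.

    Python 3's re module matches \\w against Unicode letters by
    default, so tonal/dotted Igbo characters stay part of a word.
    """
    return [w.lower() for w in re.findall(r"\w+", text)]


# Hypothetical usage on a cleaned article:
words = extract_words("Ndewo, kedu ka ị mere?")
unique_words = sorted(set(words))
```

A workflow could then write `unique_words` out to a file in the repo, which is the part an Action would automate.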

Collaborator

@ijemmao left a comment

This is a solid start! My main question about this PR, though: why did you copy the words data over into the tokenizer directory?

3 participants