Igbo Tokenizer Project #3
@@ -0,0 +1,14 @@
# 🧼 webscraper
So just to clarify, the tokenizer code will live in this repo under the `tokenizer` directory?
That's the idea, yes.
Or do you want just the extracted words?
# License: MIT


def cleanhtml(raw_html):
Could you write up some documentation on the following:
- What does this script do?
- How to use this script (e.g. what input format the script can handle, how to run the script, etc.)
This new documentation should be another `.md` file that the public can read through.
This is partly what I wanted us to discuss further. I was planning a GitHub Action that automatically extracts words from the curated articles. Because of my limited GitHub experience and limited time, this has not been finalized. Would this be useful, or is there not much need for it?
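To make the discussion concrete, here is a minimal sketch of the word-extraction step such an automated job might run. It is not the script in this PR: `_TextExtractor` and `extract_words` are hypothetical names, and it assumes each curated article is available as an HTML string. It uses only the Python standard library, and the word pattern keeps combining tone marks so Igbo letters such as ọ̀ stay whole.

```python
import re
import unicodedata
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


# A letter followed by any run of letters or combining diacritics (U+0300-U+036F).
WORD_RE = re.compile(r"[^\W\d_](?:[^\W\d_]|[\u0300-\u036f])*")


def extract_words(raw_html):
    """Return lowercased, NFC-normalized words found in an HTML string (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    text = unicodedata.normalize("NFC", " ".join(parser.parts))
    return [word.lower() for word in WORD_RE.findall(text)]


if __name__ == "__main__":
    print(extract_words("<p>Nnọọ, ụwa!</p><script>var x = 1;</script>"))
    # ['nnọọ', 'ụwa']
```

A scheduled or push-triggered workflow could call a function like this on each new article and commit the resulting word list, but whether that is worth automating is the open question here.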
This is a solid start! My main question about this PR, though, is: why did you copy the word data into the `tokenizer` directory?
I have finally kicked off the Igbo tokenizer project to extract words from the scraped articles.
Sorry for the delay.
There are still a number of TODOs, but this is a start.