
Igbo Tokenizer Project #3

Open · wants to merge 4 commits into base: main
Conversation

chrisemezue

I have finally kicked off the Igbo tokenizer project to extract words from the scraped articles.
Sorry for the lateness.
There are still a number of TODOs, but this is a kick-off.

@@ -0,0 +1,14 @@
# 🧼 webscraper
Collaborator

so just to clarify, the tokenizer code will live in this repo under the tokenizer directory?

Author

That's the idea, yes.

Author

Or do you want just the extracted words?

# License: MIT


def cleanhtml(raw_html):
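The diff excerpt shows only the function signature. A common implementation of this kind of helper strips markup with a regular expression; a minimal sketch, assuming that is the intent here (the pattern and docstring are not from the PR):

```python
import re

# Matches any HTML tag, e.g. <p>, </div>, <a href="...">
TAG_RE = re.compile(r"<[^>]+>")


def cleanhtml(raw_html):
    """Strip HTML tags from a string, returning the plain text."""
    return TAG_RE.sub("", raw_html)
```

For example, `cleanhtml("<p>Ndewo</p>")` returns `"Ndewo"`. Note that a regex is fine for simple scraped pages, but a proper parser (e.g. BeautifulSoup) is more robust against malformed markup.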
Collaborator

Could you write up some documentation on the following:

  • What does this script do?
  • How to use this script (i.e. what is the format of input that can be handled by this script, how to run the script, etc.)

This new documentation should be a separate .md file that the public can read through.

Author

This is partly what I wanted us to discuss further. I was planning a kind of GitHub Action that can automatically extract words from the curated articles. Due to my limited experience with GitHub and limited time, this has not been finalized. Is this something that would be useful, or is there not much need for it?
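Whether it runs as a GitHub Action or locally, the extraction step itself is small. A sketch of what the word-extraction core might look like (the function name and example sentence are hypothetical, not from the PR; `\w+` with Unicode matching keeps dotted Igbo letters such as ọ, ụ, ị inside tokens):

```python
import re


def extract_words(text):
    """Split cleaned article text into lowercase word tokens.

    Python 3's re module matches \\w against Unicode letters by
    default, so tonal/dotted Igbo characters stay part of a word.
    """
    return [w.lower() for w in re.findall(r"\w+", text)]


# Hypothetical usage on a cleaned article:
words = extract_words("Ndewo, kedu ka ị mere?")
unique_words = sorted(set(words))
```

A workflow could then write `unique_words` out to a file in the repo, which is the part an Action would automate.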

Collaborator

@ijemmao left a comment

This is a solid start! My main question about this PR, though: why did you copy the words data over into the tokenizer directory?

3 participants