Add keyword extraction logic and rename source text collector directory #58
base: main
Conversation
If we pass `keywords` as a property, it's not clear how it was generated; I would prefer to declare `bert_keywords` (in whatever case), since it's likely we'll have other kinds of keywords in the future (people love labeling stuff). Otherwise, this looks like a worthy enhancement / worth a try to see how much more fidelity we get.
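For illustration only, a minimal sketch of what that naming might look like on a collected record; the field names, sample text, and URL below are hypothetical and not taken from this PR:

```python
# Hypothetical sketch: keep keyword provenance explicit by naming the field
# after the technique that produced it (bert_keywords), leaving room for other
# keyword sources later.
from keybert import KeyBERT

model = KeyBERT()
text = "Daily arrest logs and incident reports published by the county sheriff."

# extract_keywords returns (keyword, score) tuples; keep just the strings
bert_keywords = [kw for kw, _ in model.extract_keywords(text, top_n=5)]

record = {
    "url": "https://example.com/arrest-logs",  # placeholder URL
    "bert_keywords": bert_keywords,            # named for how they were generated
    # e.g. "human_keywords" or "yake_keywords" could sit alongside in the future
}
print(record)
```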
```python
import requests
import yake
from bs4 import BeautifulSoup
from keybert import KeyBERT
```
do we need this in the requirements.txt?
@EvilDrPurple would you want to test this one / leave your review?
I tested it and it looks very good. One thing I should note is that it took about 10 minutes to execute for 18 URLs, meaning it would take approximately 37 hours to run on our current list of over 4000 URLs. Perhaps we could have the keyword extraction disabled by default, sort of like how it's done with the render-javascript feature. That way, when we don't need the keywords generated, we don't have to wait a very long time to get the tags.
Adding onto this, maybe we should modify the logic of the collector so that it writes data to a file as it goes; that way, if there is a failure during execution, it can easily pick up where it left off. I can only imagine the frustration of losing everything after the program has been running for over 24 hours. This probably warrants opening an issue.
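A rough sketch combining both suggestions (an off-by-default flag plus incremental, resumable output). The flag name, output file, and record shape here are hypothetical stand-ins, not the collector's actual interface:

```python
# Hypothetical sketch: keyword extraction only runs when explicitly requested,
# and results are appended one JSON line per URL so an interrupted run can
# resume where it left off.
import argparse
import json
from pathlib import Path

OUTPUT_PATH = Path("collector_output.jsonl")  # hypothetical output file


def finished_urls() -> set[str]:
    """URLs that already have a line in the output file from a previous run."""
    if not OUTPUT_PATH.exists():
        return set()
    with OUTPUT_PATH.open() as f:
        return {json.loads(line)["url"] for line in f if line.strip()}


def collect(url: str, extract_keywords: bool) -> dict:
    """Placeholder for the real per-URL collection and (optional) keywording."""
    record = {"url": url, "text": "..."}
    if extract_keywords:
        record["bert_keywords"] = []  # the slow KeyBERT step would go here
    return record


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--extract-keywords", action="store_true",
                        help="Run keyword extraction (slow); disabled by default")
    args = parser.parse_args()

    urls = ["https://example.com/a", "https://example.com/b"]  # placeholder list
    done = finished_urls()
    with OUTPUT_PATH.open("a") as out:
        for url in urls:
            if url in done:
                continue  # already processed in an earlier, interrupted run
            out.write(json.dumps(collect(url, args.extract_keywords)) + "\n")
            out.flush()  # keep progress on disk in case of a crash


if __name__ == "__main__":
    main()
```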
Since we don't know how effective keywords will be, I'm hesitant to say we should include such a lengthy pre-keywording step in the process. Maybe we could generate them as their own step, the same way we generate
I'd like to take a second look at how I'm implementing it: considering the model is pre-trained, in theory labeling shouldn't take so long after the initial download of the model. I also need to think about how to test it in a way that accounts for the sort of hardware we would run it on, since I may get faster run times because of my personal hardware, which could bias my assessment of its performance in the cloud. I also confess that I tested it primarily for functionality rather than performance, which is worthwhile for me to take into consideration. Relatedly, @josh-chamberlain, I wonder if the PR template might benefit from a header called "Performance Impacts" or something similar, to encourage disclosing how a change affects execution time and other metrics.
@maxachis I added a line about it 👍
@josh-chamberlain @EvilDrPurple I tested the execution time on my current setup.

Code

```python
import timeit

import nltk
from keyword_extractor import KeywordExtractor

nltk.download('brown')  # Download the Brown Corpus
from nltk.corpus import brown

# Get text data as a single string
text = ' '.join(brown.words(categories='news'))
text_to_process = text[:5000]
print(text_to_process)

ke = KeywordExtractor()
total_time = timeit.timeit(lambda: ke.extract_keywords(text_to_process), number=50)
print(f"Total execution time over 50 runs: {total_time} seconds")
print(f"Execution time per run: {total_time / 50} seconds")
```

My Hardware

Processor: Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz
Discussion

Note that this is testing the performance cost of the keyword extraction function alone. Extrapolating to 4000 URLs, this becomes 363 minutes, or 6.05 hours; again, that is if we are running this function alone, on hardware similar to my own. Determining how prohibitive this might be would depend on what hardware we use and where in the pipeline it's located. My gut reaction (always fallible) is that this should be comparable to the performance cost of getting data from a web page initially, especially if JavaScript is enabled. I'd need to run a broader performance test on this whole component to better determine its impact, which we may need to do anyway if we're determining the anticipated costs on a DigitalOcean Droplet.
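As a quick arithmetic check of that extrapolation, using only the figures quoted in this thread (363 minutes for 4000 URLs):

```python
# Back-of-the-envelope check of the extrapolation above; input figures come
# from the comment, not from a new measurement.
total_minutes = 363
url_count = 4000

seconds_per_url = total_minutes * 60 / url_count
print(f"{seconds_per_url:.2f} s per URL")         # ≈ 5.45 s per URL
print(f"{total_minutes / 60:.2f} hours overall")  # ≈ 6.05 hours
```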
@josh-chamberlain In terms of implementation, I see several possible options. There are others, but these stand out to me as most apparent:

Option 1: Inclusion in source_text_collector.py, under all conditions

Pros
Cons

Option 2: Inclusion in source_text_collector.py, under select conditions

Pros
Cons

Option 3: Implement at a separate, later point in the pipeline (sketched below)

Pros
Cons
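For concreteness, a minimal sketch of what Option 3 might look like as a standalone stage; the file names and record layout are hypothetical, not part of this PR:

```python
# Hypothetical sketch of Option 3: a separate pipeline stage that reads the
# collector's existing output and writes an enriched copy with keywords,
# leaving the collector itself untouched.
import json
from pathlib import Path

from keybert import KeyBERT

model = KeyBERT()

with Path("collected_text.jsonl").open() as src, Path("with_keywords.jsonl").open("w") as dst:
    for line in src:
        record = json.loads(line)
        keywords = model.extract_keywords(record["text"], top_n=5)
        record["bert_keywords"] = [kw for kw, _ in keywords]
        dst.write(json.dumps(record) + "\n")
```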
@maxachis got it, so we're adding 5.5 seconds per URL on your machine. As a basis for comparison, how long does each URL take on your machine without adding keywords? Either way, I would probably recommend we test whether adding keywords improves accuracy before merging this. Hard to know whether it's worth it otherwise.
@bonjarlow Per your discussion in the last working meeting about how removing the URLs seems to have a negative impact on accuracy, have you looked at how keywords impact accuracy when URLs are removed from the picture?
@maxachis yeah, I've played around with different combinations of features; it seems the model tops out at ~70% with or without keywords. I'll make an issue now.
For additional clarity, the issue in question is #75
@bonjarlow @EvilDrPurple Coming back to this: What are your thoughts on including this keyword extraction as a part of the tag collector function? Is this useful/potentially useful enough to include?
@maxachis What if we only generate keywords from the human-annotated data? We'd have fewer URLs, guaranteed to be used for training. I think we could use those keywords to scope common crawl or otherwise generate better batches, too.
@josh-chamberlain I'm intrigued by these thoughts, but I'm unclear on how the process would work. Does this mean that humans come up with the keywords, or that we apply automated keyword generation to only human-annotated data? Or some fancy third option?
@maxachis we could ask people for keywords, but I was thinking it would work the same way you had in mind, except only applied to annotated sources. It's a way to worry about a lot fewer keywords with maybe better quality. It does mean more editing data mid-stream... we'd have to get really good at modifying Label Studio / Hugging Face data in a sensible way.
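A small sketch of how that scoping could look, assuming records carry a human label field; the field names and sample data below are hypothetical:

```python
# Hypothetical sketch: run keyword extraction only on sources that already have
# a human annotation, so far fewer URLs pay the extraction cost.
from keybert import KeyBERT

records = [
    {"url": "https://example.com/a", "text": "arrest logs and incident reports", "label": "Incident Reports"},
    {"url": "https://example.com/b", "text": "city council meeting minutes", "label": None},
]

model = KeyBERT()
for record in records:
    if record["label"] is None:
        continue  # skip unannotated sources
    record["bert_keywords"] = [kw for kw, _ in model.extract_keywords(record["text"], top_n=5)]
```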
With that in mind, I'm wondering if this PR is ready for prime time, or if it might need some adjustments, because as of right now it's designed to extract keywords from everything being pulled.
@josh-chamberlain Got it. In that case, I'll convert this back to a draft for the time being and move it from Intake to In-Progress |
Fixes
Nothing, but partially addresses #55
Description
Testing