
Add keyword extraction logic and rename source text collector directory #58

Draft · wants to merge 6 commits into main

Conversation

maxachis (Collaborator)

Fixes

Nothing, but partially addresses #55

Description

  • Renames html_tag_collector directory (and associated references) to source_tag_collector
  • Creates KeywordExtractor, a tool that uses the KeyBERT library to extract keywords from a web page's HTML
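For illustration, here is a minimal sketch of what such an extractor might look like. The class shape, helper name, and use of the KeyBERT defaults are assumptions, not necessarily the PR's actual implementation:

```python
from bs4 import BeautifulSoup


def html_to_text(html: str) -> str:
    # Strip markup so the keyword model only sees visible text
    return BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)


class KeywordExtractor:
    def __init__(self):
        # Imported lazily (an assumption) so the collector can still run
        # when keyword extraction is disabled and keybert is not installed
        from keybert import KeyBERT
        self.model = KeyBERT()

    def extract_keywords(self, html: str, top_n: int = 10) -> list[str]:
        # KeyBERT returns (keyword, relevance_score) pairs
        pairs = self.model.extract_keywords(html_to_text(html), top_n=top_n)
        return [keyword for keyword, _score in pairs]
```

The first call downloads the underlying sentence-transformers model, which is a one-time cost; subsequent calls only pay for embedding the page text.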

Testing

  • A unit test has been created in test_source_text_collector_unit.py, which can be run to demonstrate basic functionality of the added method
  • The integration test in test_source_text_collector_integration.py can be run on the given urls (or others) to see how the keywords appear alongside other included elements

@josh-chamberlain (Contributor) left a comment

If we pass keywords as a property, it's not clear how it was generated—I would prefer to declare bert_keywords (in whatever case) since it's likely we'll have other kinds of keywords in the future (people love labeling stuff). Otherwise, this looks like a worthy enhancement / worth a try to see how much more fidelity we get

import requests
import yake
from bs4 import BeautifulSoup
from keybert import KeyBERT
Contributor commented on the diff:

do we need this in the requirements.txt?

@josh-chamberlain (Contributor)

@EvilDrPurple would you want to test this one / leave your review?

@EvilDrPurple (Contributor) left a comment

I tested it and it looks very good. One thing I should note: it took about 10 minutes to execute for 18 URLs, meaning it would take approximately 37 hours to run on our current list of over 4,000 URLs. Perhaps we could have keyword extraction disabled by default, the way it's done with the render-javascript feature. That way, when we don't need keywords generated, we don't have to wait a very long time to get the tags.
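A minimal sketch of the opt-in flag idea, mirroring the existing render-javascript opt-in (the flag name here is hypothetical):

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Collect tags from source URLs")
    # Off by default, so the slow model only runs when explicitly requested
    parser.add_argument(
        "--extract-keywords",
        action="store_true",
        help="also run KeyBERT keyword extraction for each page (slow)",
    )
    return parser.parse_args(argv)
```

With `action="store_true"` the flag defaults to False, so existing invocations of the collector keep their current runtime.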

@EvilDrPurple (Contributor)

Adding onto this: maybe we should modify the collector so that it writes results to a file as it goes; that way, if there is a failure during execution, it can easily pick up where it left off. I can only imagine the frustration of losing everything after the program has been running for over 24 hours. This probably warrants opening an issue.
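The write-as-you-go idea could look something like this. The helper name, CSV format, and record shape are all assumptions for illustration:

```python
import csv
import os


def collect_with_resume(urls, process, out_path="tag_results.csv"):
    # Hypothetical helper: append each result to disk immediately, so a
    # crash mid-run loses at most the URL currently being processed
    done = set()
    if os.path.exists(out_path):
        with open(out_path, newline="") as f:
            done = {row[0] for row in csv.reader(f)}
    with open(out_path, "a", newline="") as f:
        writer = csv.writer(f)
        for url in urls:
            if url in done:
                continue  # finished on a previous run; skip
            writer.writerow([url, process(url)])
            f.flush()  # persist before moving to the next URL
```

Re-running after a crash skips every URL already present in the output file, so a 24-hour run never has to start over from scratch.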

@josh-chamberlain (Contributor)

Since we don't know how effective keywords will be, I'm hesitant to say we should include such a lengthy pre-keywording step in the process. Maybe we could generate them as their own step, the same way we generate name and description. Maybe we generate keywords later on in the process.

@maxachis (Collaborator, Author)

I'd like to take a second look at how I'm implementing it -- since the model is pre-trained, in theory labeling shouldn't take so long after the initial download of the model.

I may also need to think about how to test it to account for the sort of hardware we would run it on: my personal hardware may give faster run times, which could bias my assessment of its performance in the cloud.

I also confess that I had tested it primarily for functionality and not performance, which is worthwhile for me to take into consideration. Relatedly, @josh-chamberlain, I wonder if the PR template might benefit from the inclusion of a header called "Performance Impacts" or something similar, to encourage disclosing how a change impacts execution time and other metrics.

@josh-chamberlain (Contributor)

@maxachis I added a line about it 👍

@maxachis (Collaborator, Author)

maxachis commented Mar 27, 2024

@josh-chamberlain @EvilDrPurple I tested the execution time on my current setup

Code

import timeit
import nltk
from keyword_extractor import KeywordExtractor

nltk.download('brown')  # Download the Brown Corpus
from nltk.corpus import brown

# Get text data as a single string
text = ' '.join(brown.words(categories='news'))  
text_to_process = text[:5000]
print(text_to_process)
ke = KeywordExtractor()
# Note: timeit.timeit returns the TOTAL time across all 50 runs,
# so the first figure printed below is a sum, not an average
time = timeit.timeit(lambda: ke.extract_keywords(text_to_process), number=50)
print(f"Average execution time over 50 runs: {time} seconds")
print(f"Execution time per run: {time / 50} seconds")

My Hardware

Processor Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz 2.90 GHz
Installed RAM 16.0 GB (15.8 GB usable)
System type 64-bit operating system, x64-based processor
GPU Intel UHD Graphics 630 (Integrated)
Edition Windows 11 Pro
Version 23H2
OS build 22631.3296
Experience Windows Feature Experience Pack 1000.22687.1000.0

Output

Average execution time over 50 runs: 272.2858916000114 seconds
Execution time per run: 5.445717832000228 seconds

Discussion

Note that this tests the cost of the extract_keywords() function alone, and assumes there are no fundamental structural differences between the Brown corpus and the web pages we process that would affect performance.

Extrapolating to 4,000 URLs, this becomes 363 minutes, or 6.05 hours -- again, assuming we run this function alone, on hardware similar to my own.
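The extrapolation can be checked directly from the numbers above:

```python
total_s_50_runs = 272.2858916        # timeit total over 50 runs, from the output above
per_url_s = total_s_50_runs / 50     # ≈ 5.45 s per URL
total_hours = per_url_s * 4000 / 3600
print(f"{per_url_s * 4000 / 60:.0f} minutes, {total_hours:.2f} hours")
# prints: 363 minutes, 6.05 hours
```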

How prohibitive this is will depend on what hardware we use and where in the pipeline it sits. My gut reaction (always fallible) is that it should be comparable to the cost of fetching a web page in the first place, especially with Javascript rendering enabled. I'd need to run a broader performance test on the whole component to better determine its impact, which we may need to do anyway to estimate the anticipated costs on a Digital Ocean Droplet.

@maxachis (Collaborator, Author)

maxachis commented Mar 27, 2024

@josh-chamberlain In terms of implementation, I see several possible options. There are others, but these stand out to me as most apparent:

Option 1: Inclusion in source_text_collector.py, under all conditions

Pros

  • Code is already designed for that, so minimal alterations would be required
  • Ensures consistent output in data

Cons

  • Permanently increases execution time of source_text_collector.py
  • The impact of keywordization may vary by page -- for pages with little content in the expected structures, keywords may substantially improve data quality; for pages that already have a lot of content, keywords may not add much

Option 2: Inclusion in source_text_collector.py, under select conditions

Pros

  • Allows flexibility in when to apply it, saving time on data that is unlikely to need it
  • Dynamic application of keywordization can potentially equalize quality between high-content and low-content web pages.

Cons

  • Will require modifications to setup, both for obtaining metrics to determine when to apply keywordization and the conditional logic to execute it
  • Dataset becomes more inconsistent -- some rows may have data columns which others lack, which could have unpredictable effects on performance.
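Option 2 could be as simple as a word-count gate: only keyword pages whose extracted text is sparse. The helper name and threshold below are hypothetical:

```python
def maybe_extract_keywords(text: str, extractor, min_words: int = 200):
    # Only keyword sparse pages, where the extra signal is most likely
    # to matter; return None for pages that are already content-rich
    if len(text.split()) >= min_words:
        return None
    return extractor(text)
```

Returning None for skipped pages also makes the Option 2 con above concrete: some rows would carry a keywords value and others would not.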

Option 3: Implement at separate, later point in pipeline

Pros

  • Decouples keywordization from other parts of process, enabling greater testability and the ability to more easily disable its functionality if the need arises
  • Enables keywordization to be implemented at a separate time from other parts of pipeline, improving flexibility and possibly resilience to failure (i.e. if keywordization fails, the rest of source_text_collector.py won't be halted)

Cons

  • Data for keywordization will either need to be saved for execution (using extra storage space) or re-retrieved when performing keywordization (increasing execution cost as opposed to including it within the other code).
  • Will require more substantial revamping of code, as well as clarification on where to include the code

@josh-chamberlain (Contributor)

@maxachis got it, so we're adding 5.5 seconds per URL on your machine. As a basis for comparison, how long does each URL take on your machine without adding keywords?

Either way, I would probably recommend we test whether adding keywords improves accuracy before merging this. Hard to know whether it's worth it otherwise

@maxachis (Collaborator, Author)

maxachis commented Apr 8, 2024

@bonjarlow Per your discussion in the last working meeting about how removing the URLs seems to have a negative impact on accuracy, have you looked at how keywords impact accuracy when URLs are removed from the picture?

@bonjarlow (Contributor)

@maxachis yeah, I've played around with different combinations of features; it seems the model tops out at ~70% with or without keywords. I'll make an issue now

@maxachis (Collaborator, Author)

For additional clarity, the issue in question is #75

@maxachis (Collaborator, Author)

@bonjarlow @EvilDrPurple Coming back to this: What are your thoughts on including this keyword extraction as a part of the tag collector function? Is this useful/potentially useful enough to include?

@josh-chamberlain (Contributor)

josh-chamberlain commented May 14, 2024

@maxachis What if we only generate keywords from the human-annotated data? We'd have fewer URLs, guaranteed to be used for training. I think we could use those keywords to scope common crawl or otherwise generate better batches, too.

@maxachis (Collaborator, Author)

@josh-chamberlain I'm intrigued by these thoughts, but I'm unclear on how the process would work. Does this mean that humans come up with the keywords, or that we apply automated keyword generation to only human-annotated data? Or some fancy third option?

@josh-chamberlain (Contributor)

@maxachis we could ask people for keywords, but I was thinking it would work the same way you had in mind—except only applied to annotated sources. It's a way to worry about a lot fewer keywords with maybe better quality.

It does mean more editing data mid-stream...we'd have to get really good at modifying label studio / hugging face data in a sensible way.
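One way to read this suggestion: keep the extractor as-is but scope it to annotated sources only. The record fields and helper name here are assumptions for illustration:

```python
def keywords_for_annotated(sources, extractor):
    # Run the (slow) extractor only on sources a human has annotated
    return {
        source["url"]: extractor(source["text"])
        for source in sources
        if source.get("annotated")
    }
```

This keeps the keyword set small and tied to training data, at the cost of the mid-stream data editing mentioned above.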

@josh-chamberlain removed the request for review from mbodeantor · May 18, 2024
@maxachis (Collaborator, Author)

With that in mind, I'm wondering whether this PR is ready for prime time or needs adjustments, because as it stands it's designed to extract keywords from every page we pull.

@josh-chamberlain (Contributor)

@maxachis yeah, I just updated #49 to mention it. I think we should only do it for labeled sources; we could also generate descriptions at the same time.

@maxachis marked this pull request as draft · May 22, 2024
@maxachis (Collaborator, Author)

@josh-chamberlain Got it. In that case, I'll convert this back to a draft for the time being and move it from Intake to In-Progress

Labels: none · Project status: Work in progress · 4 participants