
Add keyword extraction logic and rename source text collector directory #58

Draft · wants to merge 6 commits into main

Conversation

maxachis (Collaborator)

Fixes

Nothing, but partially addresses #55

Description

  • Renames html_tag_collector directory (and associated references) to source_tag_collector
  • Creates KeywordExtractor, a tool that uses the KeyBERT library to extract keywords from a web page's HTML
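For illustration, here is a minimal sketch of what such an extractor might look like. The class shape, helper name, and use of the KeyBERT defaults are assumptions, not necessarily the PR's actual implementation:

```python
from bs4 import BeautifulSoup


def html_to_text(html: str) -> str:
    # Strip markup so the keyword model only sees visible text
    return BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)


class KeywordExtractor:
    def __init__(self):
        # Imported lazily (an assumption) so the collector can still run
        # when keyword extraction is disabled and keybert is not installed
        from keybert import KeyBERT
        self.model = KeyBERT()

    def extract_keywords(self, html: str, top_n: int = 10) -> list[str]:
        # KeyBERT returns (keyword, relevance_score) pairs
        pairs = self.model.extract_keywords(html_to_text(html), top_n=top_n)
        return [keyword for keyword, _score in pairs]
```

The first call downloads the underlying sentence-transformers model, which is a one-time cost; subsequent calls only pay for embedding the page text.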

Testing

  • A unit test has been created in test_source_text_collector_unit.py, which can be run to demonstrate basic functionality of the added method
  • The integration test in test_source_text_collector_integration.py can be run on the given urls (or others) to see how the keywords appear alongside other included elements

@josh-chamberlain (Contributor) left a comment

If we pass keywords as a property, it's not clear how it was generated—I would prefer to declare bert_keywords (in whatever case) since it's likely we'll have other kinds of keywords in the future (people love labeling stuff). Otherwise, this looks like a worthy enhancement / worth a try to see how much more fidelity we get

import requests
import yake
from bs4 import BeautifulSoup
from keybert import KeyBERT
Contributor commented on the diff:

do we need this in the requirements.txt?

@josh-chamberlain (Contributor)

@EvilDrPurple would you want to test this one / leave your review?

@EvilDrPurple (Contributor) left a comment

I tested it and it looks very good. One thing I should note: it took about 10 minutes to execute for 18 URLs, meaning it would take approximately 37 hours to run on our current list of over 4,000 URLs. Perhaps we could have keyword extraction disabled by default, the way it's done with the render-javascript feature. That way, when we don't need keywords generated, we don't have to wait a very long time to get the tags.
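A minimal sketch of the opt-in flag idea, mirroring the existing render-javascript opt-in (the flag name here is hypothetical):

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Collect tags from source URLs")
    # Off by default, so the slow model only runs when explicitly requested
    parser.add_argument(
        "--extract-keywords",
        action="store_true",
        help="also run KeyBERT keyword extraction for each page (slow)",
    )
    return parser.parse_args(argv)
```

With `action="store_true"` the flag defaults to False, so existing invocations of the collector keep their current runtime.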

@EvilDrPurple (Contributor)

Adding onto this: maybe we should modify the collector so that it writes results to a file as it goes; that way, if there is a failure during execution, it can easily pick up where it left off. I can only imagine the frustration of losing everything after the program has been running for over 24 hours. This probably warrants opening an issue.
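The write-as-you-go idea could look something like this. The helper name, CSV format, and record shape are all assumptions for illustration:

```python
import csv
import os


def collect_with_resume(urls, process, out_path="tag_results.csv"):
    # Hypothetical helper: append each result to disk immediately, so a
    # crash mid-run loses at most the URL currently being processed
    done = set()
    if os.path.exists(out_path):
        with open(out_path, newline="") as f:
            done = {row[0] for row in csv.reader(f)}
    with open(out_path, "a", newline="") as f:
        writer = csv.writer(f)
        for url in urls:
            if url in done:
                continue  # finished on a previous run; skip
            writer.writerow([url, process(url)])
            f.flush()  # persist before moving to the next URL
```

Re-running after a crash skips every URL already present in the output file, so a 24-hour run never has to start over from scratch.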

@josh-chamberlain (Contributor)

Since we don't know how effective keywords will be, I'm hesitant to say we should include such a lengthy pre-keywording step in the process. Maybe we could generate them as their own step, the same way we generate name and description. Maybe we generate keywords later on in the process.

@maxachis (Collaborator, Author)

I'd like to take a second look at how I'm implementing it -- since the model is pre-trained, in theory labeling shouldn't take so long after the initial download of the model.

I may also need to think about how to test it to account for the sort of hardware we would run it on: my personal hardware may give faster run times, which could bias my assessment of its performance in the cloud.

I also confess that I had tested it primarily for functionality and not performance, which is worthwhile for me to take into consideration. Relatedly, @josh-chamberlain, I wonder if the PR template might benefit from the inclusion of a header called "Performance Impacts" or something similar, to encourage disclosing how a change impacts execution time and other metrics.

@josh-chamberlain (Contributor)

@maxachis I added a line about it 👍

@maxachis (Collaborator, Author)

maxachis commented Mar 27, 2024

@josh-chamberlain @EvilDrPurple I tested the execution time on my current setup

Code

import timeit
import nltk
from keyword_extractor import KeywordExtractor

nltk.download('brown')  # Download the Brown Corpus
from nltk.corpus import brown

# Get text data as a single string
text = ' '.join(brown.words(categories='news'))  
text_to_process = text[:5000]
print(text_to_process)
ke = KeywordExtractor()
# Note: timeit.timeit returns the TOTAL time across all 50 runs,
# so the first figure printed below is a sum, not an average
time = timeit.timeit(lambda: ke.extract_keywords(text_to_process), number=50)
print(f"Average execution time over 50 runs: {time} seconds")
print(f"Execution time per run: {time / 50} seconds")

My Hardware

Processor Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz 2.90 GHz
Installed RAM 16.0 GB (15.8 GB usable)
System type 64-bit operating system, x64-based processor
GPU Intel UHD Graphics 630 (Integrated)
Edition Windows 11 Pro
Version 23H2
OS build 22631.3296
Experience Windows Feature Experience Pack 1000.22687.1000.0

Output

Average execution time over 50 runs: 272.2858916000114 seconds
Execution time per run: 5.445717832000228 seconds

Discussion

Note that this tests the cost of the extract_keywords() function alone, and assumes there are no fundamental structural differences between the Brown corpus and the web pages we process that would affect performance.

Extrapolating to 4,000 URLs, this becomes 363 minutes, or 6.05 hours -- again, assuming we run this function alone, on hardware similar to my own.
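The extrapolation can be checked directly from the numbers above:

```python
total_s_50_runs = 272.2858916        # timeit total over 50 runs, from the output above
per_url_s = total_s_50_runs / 50     # ≈ 5.45 s per URL
total_hours = per_url_s * 4000 / 3600
print(f"{per_url_s * 4000 / 60:.0f} minutes, {total_hours:.2f} hours")
# prints: 363 minutes, 6.05 hours
```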

How prohibitive this is will depend on what hardware we use and where in the pipeline it sits. My gut reaction (always fallible) is that it should be comparable to the cost of fetching a web page in the first place, especially with Javascript rendering enabled. I'd need to run a broader performance test on the whole component to better determine its impact, which we may need to do anyway to estimate the anticipated costs on a Digital Ocean Droplet.

@maxachis (Collaborator, Author)

maxachis commented Mar 27, 2024

@josh-chamberlain In terms of implementation, I see several possible options. There are others, but these stand out to me as most apparent:

Option 1: Inclusion in source_text_collector.py, under all conditions

Pros

  • Code is already designed for that, so minimal alterations would be required
  • Ensures consistent output in data

Cons

  • Permanently increases execution time of source_text_collector.py
  • The impact of keywordization may vary by page -- for pages with little content in the expected structures, keywords may substantially improve data quality; for pages that already have a lot of content, keywords may not add much

Option 2: Inclusion in source_text_collector.py, under select conditions

Pros

  • Allows flexibility in when to apply it, saving time on data that is unlikely to need it
  • Dynamic application of keywordization can potentially equalize quality between high-content and low-content web pages.

Cons

  • Will require modifications to setup, both for obtaining metrics to determine when to apply keywordization and the conditional logic to execute it
  • Dataset becomes more inconsistent -- some rows may have data columns which others lack, which could have unpredictable effects on performance.
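Option 2 could be as simple as a word-count gate: only keyword pages whose extracted text is sparse. The helper name and threshold below are hypothetical:

```python
def maybe_extract_keywords(text: str, extractor, min_words: int = 200):
    # Only keyword sparse pages, where the extra signal is most likely
    # to matter; return None for pages that are already content-rich
    if len(text.split()) >= min_words:
        return None
    return extractor(text)
```

Returning None for skipped pages also makes the Option 2 con above concrete: some rows would carry a keywords value and others would not.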

Option 3: Implement at separate, later point in pipeline

Pros

  • Decouples keywordization from other parts of process, enabling greater testability and the ability to more easily disable its functionality if the need arises
  • Enables keywordization to be implemented at a separate time from other parts of pipeline, improving flexibility and possibly resilience to failure (i.e. if keywordization fails, the rest of source_text_collector.py won't be halted)

Cons

  • Data for keywordization will either need to be saved for execution (using extra storage space) or re-retrieved when performing keywordization (increasing execution cost as opposed to including it within the other code).
  • Will require more substantial revamping of code, as well as clarification on where to include the code

@josh-chamberlain (Contributor)

@maxachis got it, so we're adding 5.5 seconds per URL on your machine. As a basis for comparison, how long does each URL take on your machine without adding keywords?

Either way, I would probably recommend we test whether adding keywords improves accuracy before merging this. Hard to know whether it's worth it otherwise

@maxachis (Collaborator, Author)

maxachis commented Apr 8, 2024

@bonjarlow Per your discussion in the last working meeting about how removing the URLs seems to have a negative impact on accuracy, have you looked at how keywords impact accuracy when URLs are removed from the picture?

@bonjarlow (Contributor)

@maxachis yeah, I've played around with different combinations of features; it seems the model tops out at ~70% with or without keywords. I'll make an issue now

@maxachis (Collaborator, Author)

For additional clarity, the issue in question is #75

@maxachis (Collaborator, Author)

@bonjarlow @EvilDrPurple Coming back to this: What are your thoughts on including this keyword extraction as a part of the tag collector function? Is this useful/potentially useful enough to include?

@josh-chamberlain (Contributor)

josh-chamberlain commented May 14, 2024

@maxachis What if we only generate keywords from the human-annotated data? We'd have fewer URLs, guaranteed to be used for training. I think we could use those keywords to scope common crawl or otherwise generate better batches, too.

@maxachis (Collaborator, Author)

@josh-chamberlain I'm intrigued by these thoughts, but I'm unclear on how the process would work. Does this mean that humans come up with the keywords, or that we apply automated keyword generation to only human-annotated data? Or some fancy third option?

@josh-chamberlain (Contributor)

@maxachis we could ask people for keywords, but I was thinking it would work the same way you had in mind—except only applied to annotated sources. It's a way to worry about a lot fewer keywords with maybe better quality.

It does mean more editing data mid-stream...we'd have to get really good at modifying label studio / hugging face data in a sensible way.
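One way to read this suggestion: keep the extractor as-is but scope it to annotated sources only. The record fields and helper name here are assumptions for illustration:

```python
def keywords_for_annotated(sources, extractor):
    # Run the (slow) extractor only on sources a human has annotated
    return {
        source["url"]: extractor(source["text"])
        for source in sources
        if source.get("annotated")
    }
```

This keeps the keyword set small and tied to training data, at the cost of the mid-stream data editing mentioned above.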

@josh-chamberlain removed the request for review from mbodeantor · May 18, 2024
@maxachis (Collaborator, Author)

With that in mind, I'm wondering whether this PR is ready for prime time or needs adjustments, because as it stands it's designed to extract keywords from every page we pull.

@josh-chamberlain (Contributor)

@maxachis yeah, I just updated #49 to mention it. I think we should only do it for labeled sources; we could also generate descriptions at the same time.

@maxachis marked this pull request as draft · May 22, 2024
@maxachis (Collaborator, Author)

@josh-chamberlain Got it. In that case, I'll convert this back to a draft for the time being and move it from Intake to In-Progress

Labels: none · Project status: Work in progress · 4 participants