Feedback #19
Hi Jonas, love the feedback. Thanks for taking the time. I might need to check more thoroughly, but here are some thoughts on things to be fixed/improved on my side:

Resulting issues and enhancements:
Just saw that

Oops, my bad.

Re: pip, it installs

Okay, issue identified, cause still unclear. You would need the 1.0.0rc2 version. Maybe because 1.0 requires Python 3.9+? If that's not it, I'm out of ideas. Just tried with Docker and ubuntu-latest, worked like a charm.
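For anyone landing here with the same symptom: pre-releases have to be requested explicitly. The --pre flag appears later in this thread; pinning the exact 1.0.0rc2 build is shown as an alternative, with the version string taken from the comment above:

pip install --pre mlscraper
# or pin the release candidate explicitly:
pip install mlscraper==1.0.0rc2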
Yep, that's the cause. User error, case closed :)

While fixing, found #23

Have added the GitHub profiles as a test case and re-worked training; it should now work reasonably fast. CSS selectors are flaky at times; I need to find a reasonable heuristic to prefer good ones.
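For what it's worth, here is a toy sketch of the kind of scoring heuristic that could prefer robust selectors. The function name and all weights are made up for illustration; they are not mlscraper internals:

def score_selector(selector, matched_nodes, target_nodes):
    """Toy ranking for a candidate CSS selector (illustrative weights only)."""
    # a selector is only usable if it matches exactly the annotated nodes
    if set(matched_nodes) != set(target_nodes):
        return float('-inf')
    score = 0.0
    score -= 0.1 * selector.count(' ')                 # prefer shallow selectors
    score += 0.5 if '.' in selector else 0.0           # class hooks tend to be stable
    score -= 1.0 if ':nth-child' in selector else 0.0  # positional selectors are brittle
    return score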
Here's another example that doesn't work, in case you're looking for work :-D

import requests

from mlscraper.html import Page
from mlscraper.samples import Sample, TrainingSet
from mlscraper.training import train_scraper

# annotate one article with the values visible on the page
article1_url = "https://www.spiegel.de/politik/kristina-haenel-nach-abstimmung-ueber-219a-im-bundestag-dieser-kampf-ist-vorbei-a-f3c04fb2-8126-4831-bc32-ac6c58e1e520"
resp = requests.get(article1_url)
resp.raise_for_status()
page = Page(resp.content)
sample = Sample(
    page,
    {
        "title": "»Dieser Kampf ist vorbei«",
        "subtitle": "Ärztin Kristina Hänel nach Abstimmung über 219a",
        "teaser": "Der umstrittene Paragraf zum »Werbeverbot« für Abtreibung ist seit heute Geschichte – und die Gießenerin Kristina Hänel, die seit Jahren dafür gekämpft hat, kann aufatmen. Wie geht es für die Medizinerin jetzt weiter?",
        "author": "Nike Laurenz",
        "published": "24.06.2022, 14.26 Uhr",
    },
)

# train a scraper on the single annotated sample
training_set = TrainingSet()
training_set.add_sample(sample)
scraper = train_scraper(training_set)

# apply the trained scraper to a different article
resp = requests.get("https://www.spiegel.de/politik/deutschland/abtreibung-abschaffung-von-paragraf-219a-fuer-die-muendige-frau-kommentar-a-784cd403-f279-4124-a216-e320042d1719")
result = scraper.get(Page(resp.content))
print(result)
What does "doesn't work" mean in that context? I think it's impossible to get it right with one sample (and especially for two slightly different pages). I would most likely fail to write a scraper myself just by looking at one page, too.

It crashes (but with 1 sample only, haven't tested more)
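For reference, extending the snippet above with a second annotated page usually gives training something to generalize over. The second URL and its values below are placeholders to fill in yourself, not real annotations:

# placeholder URL and values: annotate a second article by hand
article2_url = "https://www.spiegel.de/..."
resp2 = requests.get(article2_url)
resp2.raise_for_status()
sample2 = Sample(Page(resp2.content), {
    "title": "...",   # the title exactly as it appears on that page
    "author": "...",  # and so on for the other fields
})
training_set.add_sample(sample2)
scraper = train_scraper(training_set)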
So regarding Spiegel Online, this was quite some work, as articles have different layouts. It took me some major performance tweaks to get it running in a sensible amount of time without sacrificing correctness. I still have issues with missing authors because the scraper class raises an error instead of assuming None if no author is found, but that's fixable (issue #25). Here's the code: https://gist.github.com/lorey/fdb88d6c8e41b9b6bc8df264cffc68e1
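Until #25 lands, a defensive wrapper is one way to keep a batch run alive. This is a sketch only; since the thread doesn't show which exception type extraction raises, Exception is caught broadly:

def scrape_or_none(scraper, page):
    """Return the scraped dict, or None if extraction fails for this page."""
    try:
        return scraper.get(page)
    except Exception:  # the concrete mlscraper exception type isn't shown in the thread
        return None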
Fixed the authors issue, now takes around 30s on my machine. Formatting by me:

Impressive work 🤩
Example from a commercial application:

"""
To use this:
pip install requests
pip install --pre mlscraper

To automatically build any scraper, check out https://github.com/lorey/mlscraper
"""
import logging

import requests

from mlscraper.html import Page
from mlscraper.samples import Sample, TrainingSet
from mlscraper.training import train_scraper

# annotated sample pages: URL plus the values to extract from each page
ARTICLES = (
    {
        'url': 'https://www.rahm24.de/schlafen-und-wohnen/komfortmatratzen/schaumstoffmatratze-burmeier-basic-fit',
        'title': "Schaumstoffmatratze Burmeier Basic-Fit",
        # 'price': '230,00 € *',
        'manufacturer': 'Burmeier',
    },
    {
        'url': 'https://www.rahm24.de/medizintechnik/inhalationstherapie/inhalationsgeraet-omron-ne-c28p',
        'title': 'Inhalationsgerät Omron NE-C28P',
        # 'price': '87,00 € *',
        'manufacturer': 'Omron',
    },
    {
        'url': 'https://www.rahm24.de/schlafen-und-wohnen/aufstehsessel/ruhe-und-aufstehsessel-innov-cocoon',
        'title': 'Ruhe- und Aufstehsessel Innov Cocoon',
        # 'price': '1.290,00 € *',
        'manufacturer': 'Innov`Sa',
    },
)


def train_and_scrape():
    """
    This trains the scraper and scrapes another page.
    """
    scraper = train_medical_aid_scraper()

    urls_to_scrape = [
        'https://www.rahm24.de/pflegeprodukte/stoma/stoma-vlieskompressen-saliomed',
    ]
    for url in urls_to_scrape:
        # fetch page
        article_resp = requests.get(url)
        article_resp.raise_for_status()
        page = Page(article_resp.content)

        # extract result
        result = scraper.get(page)
        print(result)


def train_medical_aid_scraper():
    training_set = make_training_set_for_articles(ARTICLES)
    scraper = train_scraper(training_set, complexity=2)
    return scraper


def make_training_set_for_articles(articles):
    """
    This creates a training set to automatically derive selectors based on the given samples.
    """
    training_set = TrainingSet()
    for article in articles:
        # fetch page
        article_url = article['url']
        html_raw = requests.get(article_url).content
        page = Page(html_raw)

        # create and add sample
        sample = Sample(page, article)
        training_set.add_sample(sample)
    return training_set


if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)
    train_and_scrape()
There's some weird whitespace causing issues. But it works if you change the price to a proper price in dot notation (which is hidden in the HTML):

ARTICLES = (
    {
        'url': 'https://www.rahm24.de/schlafen-und-wohnen/komfortmatratzen/schaumstoffmatratze-burmeier-basic-fit',
        'title': "Schaumstoffmatratze Burmeier Basic-Fit",
        'price': '230.00',
        'manufacturer': 'Burmeier',
    },
    {
        'url': 'https://www.rahm24.de/medizintechnik/inhalationstherapie/inhalationsgeraet-omron-ne-c28p',
        'title': 'Inhalationsgerät Omron NE-C28P',
        'price': '87.00',
        'manufacturer': 'Omron',
    },
    {
        'url': 'https://www.rahm24.de/schlafen-und-wohnen/aufstehsessel/ruhe-und-aufstehsessel-innov-cocoon',
        'title': 'Ruhe- und Aufstehsessel Innov Cocoon',
        'price': '1290.00',
        'manufacturer': 'Innov`Sa',
    },
)

returns:
I think this generally needs to be fixed by #15
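Until then, a normalization pass along these lines can make annotated values comparable to page text. Whether #15 takes exactly this approach isn't stated here, so treat it as a sketch:

import re

def normalize_whitespace(text):
    # collapse non-breaking spaces and runs of whitespace to single spaces
    return re.sub(r'\s+', ' ', text.replace('\xa0', ' ')).strip()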
Gave this a try :-)

Feedback:

- mlscraper.html is missing from the PyPI package.
- mlscraper.training.NoScraperFoundException: did not find scraper. It would be nice if the error message gave some guidance as to what fields couldn't be found in the HTML. Even with DEBUG log level it's not really helpful.
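On the error-message point, a quick pre-flight check can at least reveal which annotated values occur verbatim in the raw HTML. report_missing_values is a hypothetical helper, and mlscraper matches more flexibly than a literal substring test, so misses are only a hint:

import requests

def report_missing_values(url, expected):
    """List annotation keys whose values don't occur verbatim in the page source."""
    html = requests.get(url).text
    return [key for key, value in expected.items() if value not in html]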