Hello,
I have cloned the project and created a config file using the corresponding Python script.
With wg-gesucht and immowelt, scraping works perfectly fine.
However, when scraping kleinanzeigen, I receive the following error:
urllib.error.HTTPError: HTTP Error 404: Not Found
Below is the full stack trace:
/Users/jack/.local/share/virtualenvs/flathunter-Uqle8Kpe/bin/python /Users/jack/Development/flathunter/flathunt.py
[2024/02/29 21:04:21|config.py |INFO ]: Using config path /Users/jack/Development/flathunter/config.yaml
[2024/02/29 21:04:21|chrome_wrapper.py |INFO ]: Initializing Chrome WebDriver for crawler...
Traceback (most recent call last):
  File "/Users/jack/Development/flathunter/flathunt.py", line 99, in <module>
    main()
  File "/Users/jack/Development/flathunter/flathunt.py", line 95, in main
    launch_flat_hunt(config, heartbeat)
  File "/Users/jack/Development/flathunter/flathunt.py", line 35, in launch_flat_hunt
    hunter.hunt_flats()
  File "/Users/jack/Development/flathunter/flathunter/hunter.py", line 56, in hunt_flats
    for expose in processor_chain.process(self.crawl_for_exposes(max_pages)):
  File "/Users/jack/Development/flathunter/flathunter/hunter.py", line 35, in crawl_for_exposes
    return chain(*[try_crawl(searcher, url, max_pages)
  File "/Users/jack/Development/flathunter/flathunter/hunter.py", line 35, in <listcomp>
    return chain(*[try_crawl(searcher, url, max_pages)
  File "/Users/jack/Development/flathunter/flathunter/hunter.py", line 27, in try_crawl
    return searcher.crawl(url, max_pages)
  File "/Users/jack/Development/flathunter/flathunter/abstract_crawler.py", line 151, in crawl
    return self.get_results(url, max_pages)
  File "/Users/jack/Development/flathunter/flathunter/abstract_crawler.py", line 139, in get_results
    soup = self.get_page(search_url)
  File "/Users/jack/Development/flathunter/flathunter/crawler/kleinanzeigen.py", line 56, in get_page
    return self.get_soup_from_url(search_url, driver=self.get_driver())
  File "/Users/jack/Development/flathunter/flathunter/crawler/kleinanzeigen.py", line 44, in get_driver
    self.driver = get_chrome_driver(driver_arguments)
  File "/Users/jack/Development/flathunter/flathunter/chrome_wrapper.py", line 69, in get_chrome_driver
    driver = uc.Chrome(version_main=chrome_version, options=chrome_options) # pylint: disable=no-member
  File "/Users/jack/.local/share/virtualenvs/flathunter-Uqle8Kpe/lib/python3.10/site-packages/undetected_chromedriver/__init__.py", line 258, in __init__
    self.patcher.auto()
  File "/Users/jack/.local/share/virtualenvs/flathunter-Uqle8Kpe/lib/python3.10/site-packages/undetected_chromedriver/patcher.py", line 178, in auto
    self.unzip_package(self.fetch_package())
  File "/Users/jack/.local/share/virtualenvs/flathunter-Uqle8Kpe/lib/python3.10/site-packages/undetected_chromedriver/patcher.py", line 287, in fetch_package
    return urlretrieve(download_url)[0]
  File "/Users/jack/anaconda3/lib/python3.10/urllib/request.py", line 241, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/Users/jack/anaconda3/lib/python3.10/urllib/request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
  File "/Users/jack/anaconda3/lib/python3.10/urllib/request.py", line 525, in open
    response = meth(req, response)
  File "/Users/jack/anaconda3/lib/python3.10/urllib/request.py
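The trace shows the 404 is raised inside undetected_chromedriver's patcher while it downloads a chromedriver binary, i.e. before kleinanzeigen is ever contacted; wg-gesucht and immowelt don't go through the Chrome WebDriver path, which would explain why they still work. A common cause is that driver downloads for Chrome 115 and later moved to the Chrome for Testing endpoints, so older undetected_chromedriver releases build a legacy download URL that no longer exists. A minimal sketch of that version cut-over (the function name and URL constants are illustrative assumptions, not flathunter or undetected_chromedriver code):

```python
# Illustrative sketch only: which release-lookup endpoint applies
# depends on the installed Chrome major version. From Chrome 115
# onwards, driver builds are published via Chrome for Testing;
# the legacy storage bucket stops at 114, so requests for newer
# versions come back as HTTP 404.
LEGACY_BASE = "https://chromedriver.storage.googleapis.com"
CFT_BASE = "https://googlechromelabs.github.io/chrome-for-testing"

def driver_release_endpoint(chrome_major: int) -> str:
    """Return the release-lookup endpoint for a given Chrome major version."""
    if chrome_major < 115:
        # Legacy scheme: a LATEST_RELEASE_<major> text file per version
        return f"{LEGACY_BASE}/LATEST_RELEASE_{chrome_major}"
    # Chrome for Testing scheme (Chrome 115+): JSON manifest per milestone
    return f"{CFT_BASE}/latest-versions-per-milestone.json"

if __name__ == "__main__":
    for major in (114, 122):
        print(major, driver_release_endpoint(major))
```

If an outdated undetected_chromedriver is ruled out, pinning the detected version via `uc.Chrome(version_main=...)` is another thing worth checking.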
I found the related issues #538 and #439.
There, the problem seems to be related to the chromedriver version.
Since in my case flathunter works for wg-gesucht and immowelt,
I assume this issue is different and may be specific to kleinanzeigen.
And here is my config file:
# Enable verbose mode (print DEBUG log messages)
# verbose: true
# Should the bot endlessly loop through the URLs?
# Between each loop it waits for <sleeping_time> seconds.
# Note that Ebay will (temporarily) block your IP if you
# poll too often - don't lower this below 600 seconds if you
# are crawling Ebay.
loop:
  active: yes
  sleeping_time: 600
# Location of the Database to store already seen offerings
# Defaults to the current directory
#database_location: /path/to/database
# List the URLs containing your filter properties below.
# Currently supported services: www.immobilienscout24.de,
# www.immowelt.de, www.wg-gesucht.de, www.kleinanzeigen.de, meinestadt.de and vrm-immo.de.
# List the URLs in the following format:
# urls:
# - https://www.immobilienscout24.de/Suche/...
# - https://www.wg-gesucht.de/...
urls:
  - https://www.kleinanzeigen.de/s-wohnung-mieten/schoeneberg/c203l3443
  #- https://www.wg-gesucht.de/wohnungen-in-Muenchen.90.2.1.0.html
  #- https://www.immowelt.de/suche/berlin/wohnungen/mieten?d=true&pma=1200&rmi=2&sd=DESC&sf=TIMESTAMP&sp=1
# Define filters to exclude flats that don't meet your criteria.
# Supported filters include 'max_rooms', 'min_rooms', 'max_size', 'min_size',
# 'max_price', 'min_price', and 'excluded_titles'.
#
# 'excluded_titles' takes a list of regex patterns that match against
# the title of the flat. Any matching titles will be excluded.
# More on Python regex here: https://docs.python.org/3/library/re.html
#
# Example:
# filters:
# excluded_titles:
# - "wg"
# - "zwischenmiete"
# min_price: 700
# max_price: 1000
# min_size: 50
# max_size: 80
# max_price_per_square: 1000
filters:
# There are often city districts in the address which
# Google Maps does not like. Use this blacklist to remove
# districts from the search.
#
# blacklist:
# - Innenstadt
# If an expose includes an address, the bot is capable of
# displaying the distance and time to travel (duration) to
# some configured other addresses, for specific kinds of
# travel.
#
# Available kinds of travel ('gm_id') can be found in the
# Google Maps API documentation, but basically there are:
# - "bicycling"
# - "transit" (public transport)
# - "driving"
# - "walking"
#
# The example configuration below includes a place for
# "John", located at the main train station of Munich.
# Two kinds of travel (bicycle and transit) are requested,
# each with a different label. Furthermore, a place for
# "Jane" is included, located at the given destination and
# with the same kinds of travel.
# durations:
# - name: John
# destination: Hauptbahnhof, München
# modes:
# - gm_id: transit
# title: "Öff."
# - gm_id: bicycling
# title: "Rad"
# - name: Jane
# destination: Karlsplatz, München
# modes:
# - gm_id: transit
# title: "Öff."
# - gm_id: driving
# title: "Auto"
# Multiline message (yes, the | is supposed to be there),
# to format the message received from the Telegram bot.
#
# Available placeholders:
# - {title}: The title of the expose
# - {rooms}: Number of rooms
# - {price}: Price for the flat
# - {durations}: Durations calculated by GMaps, see above
# - {url}: URL to the expose
message: |
  {title}
  Zimmer: {rooms}
  Größe: {size}
  Preis: {price}
  Ort: {address}
  {url}
# Calculating durations requires access to the Google Maps API.
# Below you can configure the URL to access the API, with placeholders.
# The URL can most probably be kept as it is.
# To use the Google Maps API, an API key is required. You can obtain one
# free of charge from the Google App Console (just google for it).
# Additionally, to enable the API calls in the code, set the 'enable' key to True
#
# google_maps_api:
# key: YOUR_API_KEY
# url: https://maps.googleapis.com/maps/api/distancematrix/json?origins={origin}&destinations={dest}&mode={mode}&sensor=true&key={key}&arrival_time={arrival}
# enable: False
# If you are planning to scrape immoscout24.de, the bot will need
# to circumvent the sites captcha protection by using a captcha
# solving service. Register at either imagetyperz or 2captcha
# (the former is preferred), deposit some funds, uncomment the
# corresponding lines below and fill in your API key/token.
# Use driver_arguments to provide options for Chrome WebDriver.
# captcha:
# imagetyperz:
# token: alskdjaskldjfklj
# 2captcha:
# api_key: alskdjaskldjfklj
# driver_arguments:
# - "--headless"
captcha:
# You can select whether to be notified by telegram, apprise or by mattermost
# or Slack webhooks. For all notifiers selected here a configuration must be
# provided below.
# notifiers:
# - telegram
# - apprise
# - mattermost
# - slack
notifiers:
  - telegram
# Sending messages using Telegram requires a Telegram Bot configured.
# Telegram.org offers good documentation on how to create a bot.
# Once you have read it, this will make sense. Still: bot_token should hold the
# access token of your bot and receiver_ids should list the client ids
# of receivers. Note that those receivers are required to already have
# started a conversation with your bot.
#
# telegram:
# bot_token: 160165XXXXXXX....
# notify_with_images: true
# receiver_ids:
# - 12345....
# - 67890....
telegram:
  bot_token: 6896489191:AAGvdqFTdJWUDHhT6qOzWSSZhrJ23WZkopg
  receiver_ids:
    - '16861054'
# Sending messages via mattermost requires a webhook url provided by a
# mattermost server. You can find a description of how to set up a webhook in
# the official mattermost documentation:
# https://docs.mattermost.com/developer/webhooks-incoming.html
# mattermost:
# webhook_url: https://mattermost.example.com/signup_user_complete/?id=abcdef12356
mattermost:
# Sending messages using Apprise requires an Apprise url.
# Apprise lets you send notifications to a wide variety of services.
# You can find a description of how to set up an Apprise url in the official
# documentation: https://github.com/caronc/apprise
# Signal notifications are documented here https://github.com/caronc/apprise/wiki/Notify_signal
#
# apprise:
# - gotifys://...
# - mailto://..
# - signal://localhost:9922/{FromPhoneNo}
apprise:
# Sending messages to a Slack channel requires a webhook url. You can find
# a guide on how to set up a Slack webhook in the official documentation:
# https://api.slack.com/messaging/webhooks
#
# slack:
# webhook_url: https://hooks.slack.com/services/T00000000/B00000000/XXXXXX...
slack:
# If you are running the web interface, you can configure Login with Telegram support
# Follow the instructions here to register your domain with the Telegram bot:
# https://core.telegram.org/widgets/login
#
# website:
# bot_name: bot_name_xxx
# domain: flathunter.example.com
# session_key: SomeSecretValue
# listen:
# host: 127.0.0.1
# port: 8080
# If you are deploying to google cloud,
# uncomment this and set it to your project id. More info in the readme.
# google_cloud_project_id: my-flathunters-project-id
# For websites like idealista.it, there are anti-crawler measures that can be
# circumvented using proxies.
# use_proxy_list: True
# If you are having bot detection issues with immobilienscout24,
# you can set the cookie that you get from your logged in account
# Go to the immobilienscout24.de website, log in, and then in the developer tools
# (F12) go to the "Network" tab, then "Cookies" and copy the value of the
# "reese84" cookie.
immoscout_cookie: ''
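As an aside, the placeholders like {title} in the message template above are standard Python str.format fields. A minimal sketch of how such a template gets filled (the field values are invented for illustration, and flathunter's real expose keys may differ):

```python
# Sketch of filling a notification template like the one in the
# config via str.format. The expose dict below is made up for
# illustration; it is not flathunter's actual data structure.
template = (
    "{title}\n"
    "Zimmer: {rooms}\n"
    "Größe: {size}\n"
    "Preis: {price}\n"
    "Ort: {address}\n"
    "{url}"
)

expose = {
    "title": "Helle 2-Zimmer-Wohnung",
    "rooms": "2",
    "size": "55 m²",
    "price": "900 €",
    "address": "Schöneberg, Berlin",
    "url": "https://www.kleinanzeigen.de/s-anzeige/...",
}

# Every placeholder in the template must have a matching key,
# otherwise str.format raises KeyError.
message = template.format(**expose)
print(message)
```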
I appreciate any help on that!
Please let me know if any further information is required.