Make Capmonster default and make the tests pass
codders committed Dec 16, 2024
1 parent b151af9 commit ee3f63a
Showing 7 changed files with 25 additions and 14 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/tests.yml
@@ -39,7 +39,7 @@ jobs:
         run: coverage run
         env:
           FLATHUNTER_HEADLESS_BROWSER: true
-          FLATHUNTER_2CAPTCHA_KEY: ${{ secrets.TWOCAPTCHA_API_KEY }}
+          FLATHUNTER_CAPMONSTER_KEY: ${{ secrets.CAPMONSTER_API_KEY }}
           WDM_LOCAL: 1
           CHROMIUM_BIN: /opt/hostedtoolcache/chromium/stable/x64/chrome
       - name: Run codecov
4 changes: 2 additions & 2 deletions README.md
@@ -32,7 +32,7 @@ Currently available messaging services are [Telegram](https://telegram.org/), [M
 - [Configuration](#configuration)
 - [URLs](#urls)
 - [Telegram](#telegram)
-- [2Captcha](#2captcha)
+- [Capmonster](#capmonster)
 - [Proxy](#proxy)
 - [Google API](#google-api)
 - [Command-line Interface](#command-line-interface)
@@ -170,7 +170,7 @@ Some sites (including Kleinanzeigen and ImmoScout24) implement bot detection to

 #### Captchas

-Some sites (including ImmoScout24) implement a Captcha to avoid being crawled by evil web scrapers. Since our crawler is not an evil one, the people at [2Captcha](https://2captcha.com) and [Imagetyperz](https://imagetyperz.com/) provide services that help you solve them. You can head over to one of those services and buy some credit for captcha solving. You will need to install the API key for your captcha-solving account in the `config.yaml`. Check out `config.yaml.dist` to see how to configure `2Captcha` or `Imagetyperz` with Flathunter. **At this time, ImmoScout24 can not be crawled by Flathunter without using 2Captcha/Imagetyperz. Buying captcha solutions does not guarantee that you will get past the ImmoScout24 bot detection (see [#296](https://github.com/flathunters/flathunter/issues/296), [#302](https://github.com/flathunters/flathunter/issues/302))**.
+Some sites (including ImmoScout24) implement a Captcha to avoid being crawled by evil web scrapers. Since our crawler is not an evil one, the people at [2Captcha](https://2captcha.com), [Imagetyperz](https://imagetyperz.com/) and [Capmonster](https://capmonster.cloud/) provide services that help you solve them. You can head over to one of those services and buy some credit for captcha solving. You will need to install the API key for your captcha-solving account in the `config.yaml`. Check out `config.yaml.dist` to see how to configure `2Captcha`, `Imagetyperz` or `Capmonster` with Flathunter. **At this time, ImmoScout24 can not be crawled by Flathunter without using Capmonster. Buying captcha solutions does not guarantee that you will get past the ImmoScout24 bot detection (see [#296](https://github.com/flathunters/flathunter/issues/296), [#302](https://github.com/flathunters/flathunter/issues/302))**.

 #### ImmoScout24 Cookie Override

10 changes: 6 additions & 4 deletions config.yaml.dist
@@ -123,15 +123,17 @@ title: "{crawler}: {title}"

 # If you are planning to scrape immoscout24.de, the bot will need
 # to circumvent the sites captcha protection by using a captcha
-# solving service. Register at either imagetypers or 2captcha
-# (the former is prefered), desposit some funds, uncomment the
-# corresponding lines below and replace your API key/token.
-# Use driver_arguments to provide options for Chrome WebDriver.
+# solving service. Register at either imagetypers, 2captcha or
+# capmonster (as at December 2024, Capmonster is prefered), desposit
+# some funds, uncomment the corresponding lines below and replace your
+# API key/token. Use driver_arguments to provide options for Chrome WebDriver.
 # captcha:
 #   imagetyperz:
 #     token: alskdjaskldjfklj
 #   2captcha:
 #     api_key: alskdjaskldjfklj
+#   capmonster:
+#     api_key: alskdjaskldjfklj
 #   driver_arguments:
 #     - "--headless"
 captcha:
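For orientation, here is a minimal sketch of how the `captcha.capmonster.api_key` entry above and the `FLATHUNTER_CAPMONSTER_KEY` variable from the workflow file might be resolved into one key. The helper name `resolve_capmonster_key` and the rule that the environment wins over the file are assumptions made for illustration, not Flathunter's actual implementation.

```python
from typing import Mapping, Optional


def resolve_capmonster_key(config: Mapping, env: Mapping[str, str]) -> Optional[str]:
    """Return the Capmonster API key, or None if none is configured.

    Hypothetical helper: the environment-over-file precedence is an assumption.
    """
    env_key = env.get("FLATHUNTER_CAPMONSTER_KEY")
    if env_key:
        return env_key
    # Tolerate a missing or empty `captcha:` section (as in config.yaml.dist)
    captcha = config.get("captcha") or {}
    capmonster = captcha.get("capmonster") or {}
    return capmonster.get("api_key")


# Parsed form of the example block once the capmonster lines are uncommented
parsed = {"captcha": {"capmonster": {"api_key": "alskdjaskldjfklj"}}}
print(resolve_capmonster_key(parsed, {}))
```

Passing the environment in as a plain mapping keeps the sketch testable; a caller would typically hand it `os.environ`.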
10 changes: 5 additions & 5 deletions config_wizard.py
@@ -227,22 +227,22 @@ def configure_captcha(urls: List[str], config: YamlConfig) -> Optional[Dict[str,
           "To crawl ImmoScout, we need to browse the site with a real Chrome browser instance\n"
           "and solve the Captcha that shows up on the ImmoScout site.\n")
     print("You WILL NEED TO INSTALL google-chrome / chromium to solve Captchas\n")
-    print("We recommend using 2captcha (https://2captcha.com/) as your captcha-solving\n"
+    print("We recommend using Capmonster (https://capmonster.cloud/) as your captcha-solving\n"
           "service. You will need an account there with some credit on it.\n"
           "IMPORTANT NOTICE: Buying captcha credit does not guarantee that Flathunter will be\n"
           "able to bypass the bot detection on the ImmoScout site - pay at your own risk!!\n")
     print("Once you have an account and have paid, enter the API Key here (or hit Enter\n"
           "to skip Captcha configuration, but be aware that ImmoScout scraping will fail...)\n")
-    if config.get_twocaptcha_key() is not None:
-        api_key = prompt("Enter 2Captcha API Key: ", default=config.get_twocaptcha_key())
+    if config.get_capmonster_key() is not None:
+        api_key = prompt("Enter Capmonster API Key: ", default=config.get_capmonster_key())
     else:
-        api_key = prompt("Enter 2Captcha API Key: ")
+        api_key = prompt("Enter Capmonster API Key: ")

     if len(api_key) == 0:
         return None
     return {
         "captcha": {
-            "2captcha": {
+            "capmonster": {
                 "api_key": api_key
             },
             "driver_arguments": [
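The prompt-and-return logic in this hunk can be sketched in isolation as follows. The function name `build_captcha_section` and the injectable `ask` callable are hypothetical stand-ins for `configure_captcha` and `prompt`. The `driver_arguments` list is cut off in the hunk, so the sketch leaves it out rather than guess its contents.

```python
from typing import Callable, Dict, Optional


def build_captcha_section(ask: Callable[..., str],
                          existing_key: Optional[str] = None) -> Optional[Dict]:
    """Ask for a Capmonster key and return the config fragment, or None if skipped.

    Hypothetical sketch: `ask` stands in for the wizard's `prompt` function.
    """
    if existing_key is not None:
        # Offer the key that is already configured as the default answer
        api_key = ask("Enter Capmonster API Key: ", default=existing_key)
    else:
        api_key = ask("Enter Capmonster API Key: ")

    if len(api_key) == 0:
        # The user hit Enter: skip captcha configuration entirely
        return None
    return {"captcha": {"capmonster": {"api_key": api_key}}}


# A fake prompt that always answers "12345", mirroring the wizard's unit test
section = build_captcha_section(lambda message, default="": "12345")
print(section)
```

Injecting the prompt function is one way to exercise both branches without patching; the project's own test instead mocks `prompt`.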
7 changes: 7 additions & 0 deletions flathunter/crawler/immobilienscout.py
@@ -11,6 +11,7 @@
 from flathunter.abstract_crawler import Crawler
 from flathunter.logging import logger
 from flathunter.chrome_wrapper import get_chrome_driver
+from flathunter.captcha.twocaptcha_solver import TwoCaptchaSolver
 from flathunter.exceptions import DriverLoadException

 STATIC_URL_PATTERN = re.compile(r'https://www\.immobilienscout24\.de')
@@ -183,6 +184,12 @@ def get_page(self, search_url, driver=None, page_no=None):

     def get_expose_details(self, expose):
         """Loads additional details for an expose by processing the expose detail URL"""
+        if self.config.captcha_enabled():
+            # Currently (December 2024) the captcha triggers on every page request when
+            # solving with Capmonster (2captcha isn't working). It would be very expensive
+            # to solve a captcha for every single expose URL, so we skip here in the interests
+            # of saving money
+            return expose
         soup = self.get_soup_from_url(expose['url'])
         date = soup.find('dd', {"class": "is24qa-bezugsfrei-ab"})
         expose['from'] = datetime.datetime.now().strftime("%2d.%2m.%Y")
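The effect of the early return can be illustrated with a small self-contained sketch. `StubConfig`, `SketchCrawler` and the `fetch_details` callable are hypothetical stand-ins for the real config object, crawler and `get_soup_from_url` parsing. Only the guard itself mirrors the hunk.

```python
class StubConfig:
    """Hypothetical stand-in exposing only captcha_enabled()."""

    def __init__(self, captcha_enabled):
        self._captcha_enabled = captcha_enabled

    def captcha_enabled(self):
        return self._captcha_enabled


class SketchCrawler:
    """Hypothetical crawler that counts how many detail pages it requests."""

    def __init__(self, config, fetch_details):
        self.config = config
        self.fetch_details = fetch_details
        self.detail_requests = 0

    def get_expose_details(self, expose):
        if self.config.captcha_enabled():
            # Each detail page would cost one solved captcha, so return the
            # search-result data unchanged instead of requesting the page
            return expose
        self.detail_requests += 1
        expose.update(self.fetch_details(expose["url"]))
        return expose


def fake_fetch(url):
    # Pretend the detail page yielded an availability date
    return {"from": "01.01.2025"}


guarded = SketchCrawler(StubConfig(True), fake_fetch)
unguarded = SketchCrawler(StubConfig(False), fake_fetch)
for crawler in (guarded, unguarded):
    crawler.get_expose_details({"url": "https://example.invalid/expose/1"})
print(guarded.detail_requests, unguarded.detail_requests)
```

The trade-off is that exposes crawled with captcha solving enabled keep only the fields available from the search results page.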
4 changes: 3 additions & 1 deletion test/crawler/test_crawl_immobilienscout.py
@@ -55,9 +55,11 @@ def test_process_expose_fetches_details(crawler):
     for attr in [ 'title', 'price', 'size', 'rooms', 'address', 'from' ]:
         assert expose[attr] is not None

-def test_captcha_error_no_balance(crawler):
+def test_twocaptcha_error_no_balance(crawler):
     if not test_config.captcha_enabled():
         pytest.skip("Captcha solving is not enabled - skipping immoscout tests. Setup captcha solving")
+    if not isinstance(test_config.get_captcha_solver(), TwoCaptchaSolver):
+        pytest.skip("Captcha solver is not 2captcha - skipping 2captcha balance check")
     with requests_mock.mock() as m:
         with open(os.path.join(os.path.dirname(os.path.realpath(__file__)), "fixtures", "immo-scout-IS24-response.html")) as fixture:
             immo_scout_matcher = re.compile('www.immobilienscout24.de')
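The solver-specific skip added in this hunk follows a common pattern: skip a test when the configured backend is not the one under test. Below is a dependency-free sketch of that pattern using the standard library's `unittest.SkipTest` in place of `pytest.skip`. The two solver classes are empty stand-ins and `skip_unless_solver` is a hypothetical helper, not part of Flathunter.

```python
import unittest


class TwoCaptchaSolver:
    """Empty stand-in for flathunter's 2captcha solver class."""


class CapmonsterSolver:
    """Empty stand-in for a Capmonster solver class (name assumed)."""


def skip_unless_solver(solver, solver_type):
    """Skip the calling test unless `solver` is an instance of `solver_type`."""
    if not isinstance(solver, solver_type):
        raise unittest.SkipTest(
            f"Captcha solver is not {solver_type.__name__} - skipping")


class BalanceCheck(unittest.TestCase):
    # Capmonster is the configured solver in this sketch
    solver = CapmonsterSolver()

    def test_twocaptcha_error_no_balance(self):
        skip_unless_solver(self.solver, TwoCaptchaSolver)
        self.fail("not reached while Capmonster is the configured solver")


suite = unittest.defaultTestLoader.loadTestsFromTestCase(BalanceCheck)
result = unittest.TestResult()
suite.run(result)
print(len(result.skipped), result.wasSuccessful())
```

Skipping rather than failing keeps the 2captcha balance test in the suite, so it runs again if 2captcha is configured.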
2 changes: 1 addition & 1 deletion test/test_config_wizard.py
@@ -106,7 +106,7 @@ def test_configure_captcha(self, prompt_mock):
             "https://www.immobilienscout24.de/Suche/de/berlin/berlin/wohnung-mieten?sorting=2"
         ]
         res = config_wizard.configure_captcha(urls, self.config)
-        self.assertEqual((res or {}).get("captcha", {}).get("2captcha", {}).get("api_key"), "12345")
+        self.assertEqual((res or {}).get("captcha", {}).get("capmonster", {}).get("api_key"), "12345")

     def test_configure_captcha_is_none(self):
         urls = [
