Merge pull request #634 from codders/feat/capmonster-pr-20240917
Add Capmonster support per PR from @DerLeole
codders authored Dec 16, 2024
2 parents 6f43447 + ef81250 commit fb87f57
Showing 14 changed files with 257 additions and 17 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/tests.yml
@@ -39,7 +39,7 @@ jobs:
run: coverage run
env:
FLATHUNTER_HEADLESS_BROWSER: true
FLATHUNTER_2CAPTCHA_KEY: ${{ secrets.TWOCAPTCHA_API_KEY }}
FLATHUNTER_CAPMONSTER_KEY: ${{ secrets.CAPMONSTER_API_KEY }}
WDM_LOCAL: 1
CHROMIUM_BIN: /opt/hostedtoolcache/chromium/stable/x64/chrome
- name: Run codecov
9 changes: 7 additions & 2 deletions README.md
@@ -32,7 +32,7 @@ Currently available messaging services are [Telegram](https://telegram.org/), [M
- [Configuration](#configuration)
- [URLs](#urls)
- [Telegram](#telegram)
- [2Captcha](#2captcha)
- [Capmonster](#capmonster)
- [Proxy](#proxy)
- [Google API](#google-api)
- [Command-line Interface](#command-line-interface)
@@ -170,7 +170,12 @@ Some sites (including Kleinanzeigen and ImmoScout24) implement bot detection to

#### Captchas

Some sites (including ImmoScout24) implement a Captcha to avoid being crawled by evil web scrapers. Since our crawler is not an evil one, the people at [2Captcha](https://2captcha.com) and [Imagetyperz](https://imagetyperz.com/) provide services that help you solve them. You can head over to one of those services and buy some credit for captcha solving. You will need to install the API key for your captcha-solving account in the `config.yaml`. Check out `config.yaml.dist` to see how to configure `2Captcha` or `Imagetyperz` with Flathunter. **At this time, ImmoScout24 can not be crawled by Flathunter without using 2Captcha/Imagetyperz. Buying captcha solutions does not guarantee that you will get past the ImmoScout24 bot detection (see [#296](https://github.com/flathunters/flathunter/issues/296), [#302](https://github.com/flathunters/flathunter/issues/302))**.
Some sites (including ImmoScout24) implement a Captcha to avoid being crawled by evil web scrapers. Since our crawler is not an evil one, the people at [2Captcha](https://2captcha.com), [Imagetyperz](https://imagetyperz.com/) and [Capmonster](https://capmonster.cloud/) provide services that help you solve them. You can head over to one of those services and buy some credit for captcha solving. You will need to install the API key for your captcha-solving account in the `config.yaml`. Check out `config.yaml.dist` to see how to configure `2Captcha`, `Imagetyperz` or `Capmonster` with Flathunter. **At this time, ImmoScout24 can not be crawled by Flathunter without using Capmonster. Buying captcha solutions does not guarantee that you will get past the ImmoScout24 bot detection (see [#296](https://github.com/flathunters/flathunter/issues/296), [#302](https://github.com/flathunters/flathunter/issues/302))**.

#### Capmonster

Currently, [Capmonster](https://capmonster.cloud/) is the only implemented captcha-solving service that solves the captchas on ImmoScout24. You will need to set
the `FLATHUNTER_CAPMONSTER_KEY` environment variable or add the key to your `config.yaml` to solve the captchas.
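
A minimal sketch of the key lookup described above, assuming the environment variable takes precedence over `config.yaml`. The helper name `resolve_capmonster_key` and the precedence order are illustrative assumptions, not necessarily how Flathunter resolves the key; the `captcha.capmonster.api_key` path follows `config.yaml.dist`.

```python
import os
from typing import Any, Mapping, Optional


def resolve_capmonster_key(
    config: Mapping[str, Any],
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return the Capmonster API key, or None if it is not configured.

    Hypothetical helper: checks FLATHUNTER_CAPMONSTER_KEY first (assumed
    precedence), then falls back to captcha.capmonster.api_key in the config.
    """
    env_key = env.get("FLATHUNTER_CAPMONSTER_KEY")
    if env_key:
        return env_key
    # Tolerate a missing or empty "captcha" / "capmonster" section.
    captcha = config.get("captcha") or {}
    capmonster = captcha.get("capmonster") or {}
    return capmonster.get("api_key")
```

Passing `env` explicitly keeps the lookup easy to test without touching the real process environment.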

#### ImmoScout24 Cookie Override

10 changes: 6 additions & 4 deletions config.yaml.dist
@@ -123,15 +123,17 @@ title: "{crawler}: {title}"

# If you are planning to scrape immoscout24.de, the bot will need
# to circumvent the sites captcha protection by using a captcha
# solving service. Register at either imagetypers or 2captcha
# (the former is prefered), desposit some funds, uncomment the
# corresponding lines below and replace your API key/token.
# Use driver_arguments to provide options for Chrome WebDriver.
# solving service. Register at one of imagetyperz, 2captcha or
# capmonster (as of December 2024, Capmonster is preferred), deposit
# some funds, uncomment the corresponding lines below and replace your
# API key/token. Use driver_arguments to provide options for Chrome WebDriver.
# captcha:
# imagetyperz:
# token: alskdjaskldjfklj
# 2captcha:
# api_key: alskdjaskldjfklj
# capmonster:
# api_key: alskdjaskldjfklj
# driver_arguments:
# - "--headless"
captcha:
10 changes: 5 additions & 5 deletions config_wizard.py
@@ -227,22 +227,22 @@ def configure_captcha(urls: List[str], config: YamlConfig) -> Optional[Dict[str,
"To crawl ImmoScout, we need to browse the site with a real Chrome browser instance\n"
"and solve the Captcha that shows up on the ImmoScout site.\n")
print("You WILL NEED TO INSTALL google-chrome / chromium to solve Captchas\n")
print("We recommend using 2captcha (https://2captcha.com/) as your captcha-solving\n"
print("We recommend using Capmonster (https://capmonster.cloud/) as your captcha-solving\n"
"service. You will need an account there with some credit on it.\n"
"IMPORTANT NOTICE: Buying captcha credit does not guarantee that Flathunter will be\n"
"able to bypass the bot detection on the ImmoScout site - pay at your own risk!!\n")
print("Once you have an account and have paid, enter the API Key here (or hit Enter\n"
"to skip Captcha configuration, but be aware that ImmoScout scraping will fail...)\n")
if config.get_twocaptcha_key() is not None:
api_key = prompt("Enter 2Captcha API Key: ", default=config.get_twocaptcha_key())
if config.get_capmonster_key() is not None:
api_key = prompt("Enter Capmonster API Key: ", default=config.get_capmonster_key())
else:
api_key = prompt("Enter 2Captcha API Key: ")
api_key = prompt("Enter Capmonster API Key: ")

if len(api_key) == 0:
return None
return {
"captcha": {
"2captcha": {
"capmonster": {
"api_key": api_key
},
"driver_arguments": [
75 changes: 75 additions & 0 deletions flathunter/abstract_crawler.py
@@ -3,6 +3,7 @@
import re
from time import sleep
from typing import Optional, Any
import json

import backoff
import requests
@@ -70,6 +71,8 @@ def get_soup_from_url(
driver.get(url)
if re.search("initGeetest", driver.page_source):
self.resolve_geetest(driver)
elif re.search("awswaf-captcha", driver.page_source):
self.resolve_awsawf(driver)
elif re.search("g-recaptcha", driver.page_source):
self.resolve_recaptcha(
driver, checkbox, afterlogin_string or "")
@@ -193,6 +196,78 @@ def resolve_geetest(self, driver):
driver.refresh()
raise

@backoff.on_exception(wait_gen=backoff.constant,
exception=CaptchaUnsolvableError,
max_tries=3)
def resolve_awsawf(self, driver):
"""Resolve AWS WAF Captcha"""

# Intercept background network traffic via log sniffing
sleep(2)
logs = [json.loads(lr["message"])["message"] for lr in driver.get_log("performance")]

def log_filter(log_):
return (
# is an actual response
log_["method"] == "Network.responseReceived"
# and json
and "json" in log_["params"]["response"]["mimeType"]
)

context = None
iv = None
for log in filter(log_filter, logs):
request_id = log["params"]["requestId"]
resp_url = log["params"]["response"]["url"]
if "problem" in resp_url and "awswaf" in resp_url:
response = driver.execute_cdp_cmd(
"Network.getResponseBody", {"requestId": request_id}
)
response_json = json.loads(response["body"])
iv = response_json["state"]["iv"]
context = response_json["state"]["payload"]
sitekey = response_json["key"]
if context is None or iv is None:
raise CaptchaUnsolvableError("Unable to find captcha data in logs")

sitekey = re.findall(
r"apiKey: \"(.*?)\"", driver.page_source)[0]

challenge = None
challenge_matches = re.findall(r'src="([^"]*challenge\.js)"', driver.page_source)
for match in challenge_matches:
logger.debug('Challenge SRC Value: %s', match)
challenge = match

jsapi = None
jsapi_matches = re.findall(r'src="([^"]*jsapi\.js)"', driver.page_source)
for match in jsapi_matches:
logger.debug('JsApi SRC Value: %s', match)
jsapi = match

if challenge is None or jsapi is None:
raise CaptchaUnsolvableError("Unable to find challenge or JSApi value in page source")

try:
captcha = self.captcha_solver.solve_awswaf(
sitekey,
iv,
context,
challenge,
jsapi,
driver.current_url
)
old_cookie = driver.get_cookie('aws-waf-token')
new_cookie = old_cookie
new_cookie['value'] = captcha.token
driver.delete_cookie('aws-waf-token')
driver.add_cookie(new_cookie)
sleep(1)
driver.refresh()
except CaptchaUnsolvableError:
driver.refresh()
raise
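
The log-sniffing step in `resolve_awsawf` can be isolated as a pure function, which makes it testable without a browser. This is a sketch under assumptions: the function name `find_awswaf_problem_request_ids` is hypothetical, and the entry shape mirrors what the code above reads from `driver.get_log("performance")` (a JSON string under `"message"` wrapping a DevTools event).

```python
import json
from typing import Any, Dict, Iterable, List


def find_awswaf_problem_request_ids(raw_logs: Iterable[Dict[str, Any]]) -> List[str]:
    """Return request ids of JSON responses whose URL looks like an AWS WAF problem."""
    request_ids = []
    for raw in raw_logs:
        # Each performance log entry wraps a DevTools event as a JSON string.
        event = json.loads(raw["message"])["message"]
        if event.get("method") != "Network.responseReceived":
            continue
        response = event["params"]["response"]
        if "json" not in response.get("mimeType", ""):
            continue
        url = response.get("url", "")
        # Same URL heuristic as the crawler: both markers must be present.
        if "problem" in url and "awswaf" in url:
            request_ids.append(event["params"]["requestId"])
    return request_ids
```

Each returned id could then be passed to `driver.execute_cdp_cmd("Network.getResponseBody", {"requestId": request_id})`, as the method above does.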

@backoff.on_exception(wait_gen=backoff.constant,
exception=CaptchaUnsolvableError,
max_tries=3)
87 changes: 87 additions & 0 deletions flathunter/captcha/capmonster_solver.py
@@ -0,0 +1,87 @@
"""Captcha solver for CapMonster Captcha Solving Service (https://capmonster.cloud)"""
import json
from typing import Dict
from time import sleep
import backoff
import requests

from flathunter.logging import logger
from flathunter.captcha.captcha_solver import (
CaptchaSolver,
CaptchaBalanceEmpty,
CaptchaUnsolvableError,
GeetestResponse,
AwsAwfResponse,
RecaptchaResponse,
)

class CapmonsterSolver(CaptchaSolver):
"""Implementation of Captcha solver for CapMonster"""

def solve_geetest(self, geetest: str, challenge: str, page_url: str) -> GeetestResponse:
"""Should be implemented in subclass"""
raise NotImplementedError("Geetest captcha solving is not implemented for CapMonster")

def solve_recaptcha(self, google_site_key: str, page_url: str) -> RecaptchaResponse:
"""Should be implemented in subclass"""
raise NotImplementedError("Recaptcha captcha solving is not implemented for Capmonster")

def solve_awswaf(
self,
sitekey: str,
iv: str,
context: str,
challenge_script: str,
captcha_script: str,
page_url: str
) -> AwsAwfResponse:
"""Solves AWS WAF Captcha"""
logger.info("Trying to solve AWS WAF.")
params = {
"clientKey": self.api_key,
"task": {
"type": "AmazonTaskProxyless",
"websiteURL": page_url,
"challengeScript": "",
"captchaScript": captcha_script,
"websiteKey": sitekey,
"context": "",
"iv": "",
"cookieSolution": True
}
}
captcha_id = self.__submit_capmonster_request(params)
untyped_result = self.__retrieve_capmonster_result(captcha_id)
return AwsAwfResponse(untyped_result)

@backoff.on_exception(**CaptchaSolver.backoff_options)
def __submit_capmonster_request(self, params: Dict[str, str]) -> str:
submit_url = "https://api.capmonster.cloud/createTask"
submit_response = requests.post(submit_url, json=params, timeout=30)
logger.info("Got response from capmonster: %s", submit_response.text)

response_json = submit_response.json()

return response_json["taskId"]

@backoff.on_exception(**CaptchaSolver.backoff_options)
def __retrieve_capmonster_result(self, captcha_id: str):
retrieve_url = "https://api.capmonster.cloud/getTaskResult"
params = {
"clientKey": self.api_key,
"taskId": captcha_id
}
while True:
retrieve_response = requests.post(retrieve_url, json=params, timeout=30)
logger.debug("Got response from capmonster: %s", retrieve_response.text)

response_json = retrieve_response.json()
if "status" not in response_json:
raise requests.HTTPError(response=response_json["errorCode"])

if response_json["status"] == "processing":
logger.info("Captcha is not ready yet, waiting...")
sleep(5)
continue
if response_json["status"] == "ready":
return response_json["solution"]["cookies"]["aws-waf-token"]
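
The submit-then-poll flow used by `CapmonsterSolver` can be sketched with a bounded number of polls, so that a task which never becomes ready eventually raises instead of looping. This is an illustrative sketch, not the solver's actual code: `solve_task`, `post_json`, `max_polls` and `poll_interval` are hypothetical names, and the response fields (`taskId`, `status`, `solution`, `errorCode`) are assumed to match what the class above reads from the API.

```python
from time import sleep
from typing import Any, Callable, Dict

CREATE_TASK_URL = "https://api.capmonster.cloud/createTask"
GET_RESULT_URL = "https://api.capmonster.cloud/getTaskResult"


def solve_task(
    post_json: Callable[[str, Dict[str, Any]], Dict[str, Any]],
    api_key: str,
    task: Dict[str, Any],
    max_polls: int = 24,
    poll_interval: float = 5.0,
) -> Dict[str, Any]:
    """Create a task, then poll until it is ready or the poll budget runs out."""
    created = post_json(CREATE_TASK_URL, {"clientKey": api_key, "task": task})
    if "taskId" not in created:
        raise RuntimeError(f"createTask failed: {created.get('errorCode')}")

    query = {"clientKey": api_key, "taskId": created["taskId"]}
    for _ in range(max_polls):
        result = post_json(GET_RESULT_URL, query)
        status = result.get("status")
        if status == "ready":
            return result["solution"]
        if status != "processing":
            # Anything other than "processing" is treated as an error here.
            raise RuntimeError(f"getTaskResult failed: {result.get('errorCode')}")
        sleep(poll_interval)
    raise TimeoutError("Captcha was not solved within the polling budget")
```

The transport is injected so the flow can be exercised without network access; in real use it might be something like `lambda url, payload: requests.post(url, json=payload, timeout=30).json()`.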
24 changes: 22 additions & 2 deletions flathunter/captcha/captcha_solver.py
@@ -17,6 +17,11 @@ class RecaptchaResponse:
"""Response from reCAPTCHA"""
result: str

@dataclass
class AwsAwfResponse:
"""Response from AWS WAF"""
token: str


class CaptchaSolver:
"""Interface for Captcha solvers"""
@@ -34,15 +39,30 @@ def solve_geetest(self, geetest: str, challenge: str, page_url: str) -> GeetestR
"""Should be implemented in subclass"""
raise NotImplementedError()

def solve_awswaf(
self,
sitekey: str,
iv: str,
context: str,
challenge_script: str,
captcha_script: str,
page_url: str
) -> AwsAwfResponse:
"""Should be implemented in subclass"""
raise NotImplementedError()

def solve_recaptcha(self, google_site_key: str, page_url: str) -> RecaptchaResponse:
"""Should be implemented in subclass"""
raise NotImplementedError()

class CaptchaUnsolvableError(Exception):
"""Raised when Captcha was unsolvable"""
def __init__(self):
def __init__(self, message = None):
super().__init__()
self.message = "Failed to solve captcha."
if message is not None:
self.message = message
else:
self.message = "Failed to solve captcha."

class CaptchaBalanceEmpty(Exception):
"""Raised when Captcha account is out of credit"""
12 changes: 12 additions & 0 deletions flathunter/captcha/imagetyperz_solver.py
@@ -11,6 +11,7 @@
CaptchaSolver,
CaptchaUnsolvableError,
GeetestResponse,
AwsAwfResponse,
RecaptchaResponse,
)

@@ -58,6 +59,17 @@ def solve_recaptcha(self, google_site_key: str, page_url: str) -> RecaptchaRespo
)
return RecaptchaResponse(self.__retrieve_imagetyperz_result(captcha_id))

def solve_awswaf(
self,
sitekey: str,
iv: str,
context: str,
challenge_script: str,
captcha_script: str,
page_url: str
) -> AwsAwfResponse:
"""Should be implemented at some point"""
raise NotImplementedError("AWS WAF captchas not supported for Imagetyperz")

@backoff.on_exception(**CaptchaSolver.backoff_options)
def __submit_imagetyperz_request(self, submit_url: str, params: Dict[str, str]) -> str:
15 changes: 14 additions & 1 deletion flathunter/captcha/twocaptcha_solver.py
@@ -11,6 +11,7 @@
CaptchaBalanceEmpty,
CaptchaUnsolvableError,
GeetestResponse,
AwsAwfResponse,
RecaptchaResponse,
)

@@ -46,12 +47,23 @@ def solve_recaptcha(self, google_site_key: str, page_url: str) -> RecaptchaRespo
captcha_id = self.__submit_2captcha_request(params)
return RecaptchaResponse(self.__retrieve_2captcha_result(captcha_id))

def solve_awswaf(
self,
sitekey: str,
iv: str,
context: str,
challenge_script: str,
captcha_script: str,
page_url: str
) -> AwsAwfResponse:
"""Should be implemented at some point"""
raise NotImplementedError("AWS WAF captchas not supported for 2Captcha")

@backoff.on_exception(**CaptchaSolver.backoff_options)
def __submit_2captcha_request(self, params: Dict[str, str]) -> str:
submit_url = "http://2captcha.com/in.php"
submit_response = requests.post(submit_url, params=params, timeout=30)
logger.debug("Got response from 2captcha/in: %s", submit_response.text)
logger.info("Got response from 2captcha/in: %s", submit_response.text)

if not submit_response.text.startswith("OK"):
raise requests.HTTPError(response=submit_response)
@@ -66,6 +78,7 @@ def __retrieve_2captcha_result(self, captcha_id: str):
"key": self.api_key,
"action": "get",
"id": captcha_id,
"json": 0,
}
while True:
retrieve_response = requests.get(retrieve_url, params=params, timeout=30)
1 change: 1 addition & 0 deletions flathunter/chrome_wrapper.py
@@ -66,6 +66,7 @@ def get_chrome_driver(driver_arguments):
chrome_options.add_argument(driver_argument)
chrome_version = get_chrome_version()
chrome_options.add_argument("--headless=new")
chrome_options.set_capability('goog:loggingPrefs', {'performance': 'ALL'})
driver = uc.Chrome(version_main=chrome_version, options=chrome_options) # pylint: disable=no-member

driver.execute_cdp_cmd(