No flats coming through on Immoscout #577
Comments
Same thing here. Sorry, I don't have a solution yet. I also get this error: I guess this is the main issue, and then the page just times out or something and gets detected as a bot. |
I think I have the same problem - Immoscout stopped working 2 days ago. I don't think it has to do with 2captcha, though, and I can't really wrap my head around what the error message I'm getting tells us. I'm using docker-compose on a Linux server with 2captcha enabled. This is the log from after starting the container:
|
I am not running flathunter but my own crawler for Immoscout. I noticed that they seem to have switched from Geetest to AWS WAF. I haven't had time to look into it yet, but 2captcha seems to support that kind of challenge. It just needs to be implemented. |
When I run the same configuration (I am running seleniumbase instead of uc, but that shouldn't matter too much) from my local machine via WSL, it works and asks for a Geetest captcha, but on my VPS running directly on Linux it returns this page source:
As you said, it gives us an AWS WAF captcha instead of the Geetest one. Could it be that for 'bad IPs' there is another layer of captcha protection? Does something like this exist? I am not really deeply into it. |
I'm still seeing Geetest: According to the 2captcha docs, we need to capture these details in order to solve a WAF captcha:
I can see |
There seem to be different versions of the captcha challenge: one with key, context and iv, and another one with only the challengeScript. I checked the 2captcha docs and did not spot an API call for the latter one, but I also only checked briefly. I will attempt an implementation early next week and share my results here. But maybe someone else is keen on doing so already :-) |
To reproduce it I just need to use my VPS. I think the reason might be that it got flagged as a bad actor due to the number of requests. Locally it still works with Geetest, but on the VPS I only get the AWS captcha. The problem (I think; I have no experience there) is that the captcha only appears after the corresponding JavaScript has run (when I click inspect element). Then you get the needed information:
One would need to map the type of tests to have Capsolver solve the captcha: https://docs.capsolver.com/guide/recognition/AwsWafClassification.html Ah, and the challenge rotates pretty fast (roughly every 60 seconds). Hopefully that won't be another issue. |
You only really need the challengeScript, which is available when the challenge is AWS WAF. I've created a basic example to dissect the challenge, submit it to Capsolver, and retrieve the result in the form of a cookie. I'll just post the example here in case anyone would like to pick it up.
from seleniumbase import Driver
import re
import requests
from time import sleep
from selenium_stealth import stealth
from selenium import webdriver

CAPSOLVER_API_KEY = "XXX"
CAPSOLVER_API_ENDPOINT = "https://api.capsolver.com/createTask"
url = "https://www.immobilienscout24.de/Suche/de/wohnung-mieten?sorting=2&pagenumber=1"
client = requests.Session()

# Note: these options are never passed to Driver() below, so they have no effect.
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument("--headless")

driver = Driver(uc=True, headless=True, agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36", uc_cdp_events=True)
driver.set_page_load_timeout(20)
driver.execute_cdp_cmd('Network.enable', {})
stealth(driver,
        languages=["en-US", "en"],
        vendor="Google Inc.",
        platform="Win32",
        webgl_vendor="Intel Inc.",
        renderer="Intel Iris OpenGL Engine",
        fix_hairline=True,
        )

def delete_cookies():
    for cookie in driver.get_cookies():
        driver.delete_cookie(cookie['name'])

def get_aws_cookie():
    for cookie in driver.get_cookies():
        if cookie['name'] == 'aws-waf-token':
            return cookie
    return None

def resolve_aws(iter=0):
    script_content = driver.page_source
    key_match = re.search(r'"key":"([^"]+)"', script_content)
    iv_match = re.search(r'"iv":"([^"]+)"', script_content)
    context_match = re.search(r'"context":"([^"]+)"', script_content)
    jschallange_match = re.search(r'<script src="(.*?challenge.js.*?)".*?></script>', script_content)
    if not jschallange_match:
        # Without challenge.js there is nothing to submit to the solver
        return False
    task = {
        "type": "AntiAwsWafTaskProxyLess",
        "websiteURL": driver.current_url,
        "awsChallengeJS": jschallange_match.group(1),
    }
    # key/iv/context only exist in one variant of the challenge page
    if key_match and iv_match and context_match:
        task["awsKey"] = key_match.group(1)
        task["awsIv"] = iv_match.group(1)
        task["awsContext"] = context_match.group(1)
    data = {"clientKey": CAPSOLVER_API_KEY, "task": task}
    try:
        task_id_response = client.post(CAPSOLVER_API_ENDPOINT, json=data)
        task_id = task_id_response.json()['taskId']
        try_cnt = 0
        while True:
            cookie_response = client.post("https://api.capsolver.com/getTaskResult", json={"clientKey": CAPSOLVER_API_KEY, "taskId": task_id}).json()
            sleep(5)
            if cookie_response["status"] == "ready":
                cookie = cookie_response["solution"]["cookie"]
                # Replace the old cookie with the newly obtained one
                # (fall back to a bare cookie if the site has not set one yet)
                new_cookie = get_aws_cookie() or {'name': 'aws-waf-token'}
                new_cookie['value'] = cookie
                delete_cookies()
                driver.add_cookie(new_cookie)
                driver.uc_open_with_reconnect(driver.current_url, reconnect_time=3)
                return True
            elif cookie_response["status"] == "failed":
                return False
            else:
                # Still processing: give up after a few polls
                try_cnt += 1
                if try_cnt > 5:
                    return False
    except Exception as e:
        print(e)
        return False

# First delete all cookies, fetch IS24 page and solve AWS if presented
delete_cookies()
driver.uc_open(url)
if re.search("awswaf", driver.page_source):
    resolve_aws(0)
|
If this is really the case and 'bad ips' get flagged and shown the AWS captcha, wouldn't one temporary solution be to randomise the crawling interval by a few secs/mins? I don't have any other idea as to how they would sense it's a 'bad ip', as ~10mins seems like a totally reasonable refreshing time for an actual human. It might just be about the regularity? |
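The randomisation idea from the comment above could be sketched roughly like this. It is only an illustration: `jittered_interval`, `BASE_INTERVAL_S` and `JITTER_S` are made-up names, not flathunter settings, and nothing in this thread shows that jitter alone keeps an IP from being flagged.

```python
import random

BASE_INTERVAL_S = 600  # ~10 minutes, the refresh time mentioned above (assumed value)
JITTER_S = 120         # hypothetical: up to +/- 2 minutes of randomness


def jittered_interval(base=BASE_INTERVAL_S, jitter=JITTER_S, rng=random):
    """Return a sleep time in seconds, spread uniformly around `base`."""
    # Never return less than one second, even for tiny base values.
    return max(1.0, base + rng.uniform(-jitter, jitter))
```

Between two crawl runs one would then call something like `time.sleep(jittered_interval())` instead of sleeping for a fixed interval.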
@jukoson Thanks so much for the sample code! It shouldn't be too hard to integrate that into the crawlers that we have. I don't know if I'll get around to that this week, but if someone else wants to give it a go that would be very welcome! |
You can solve using the documentation: |
The code above already resolves the captcha - however it does not yet bring you to the actual page you wanted to land on. I've progressed a little in that I can now solve the challenge and land on the IS24 main website. From there, I accept cookies, type a search and then click the Search button. However, another challenge pops up, after which I'm redirected to the main page again.
UPDATE
from seleniumbase import Driver
import re
import requests
from time import sleep
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from random import randint

url = "https://www.immobilienscout24.de/Suche/de/wohnung-mieten?sorting=2&pagenumber=1"
CAPSOLVER_API_ENDPOINT = "https://api.capsolver.com/createTask"
CAPSOLVER_API_KEY = "XXX"
client = requests.Session()
DEFAULT_IS24_URL = 'https://www.immobilienscout24.de/'

# Note: these options are never passed to Driver() below, so they have no effect.
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument("--headless")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)

userAgent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
driver = Driver(uc=True, headless2=True, agent=userAgent, uc_cdp_events=True)
driver.set_page_load_timeout(20)
driver.execute_cdp_cmd('Network.setBlockedURLs',
                       {"urls": ["https://api.geetest.com/get.*"]})
driver.execute_cdp_cmd('Network.enable', {})

def delete_cookies():
    for cookie in driver.get_cookies():
        driver.delete_cookie(cookie['name'])

def get_aws_cookie():
    for cookie in driver.get_cookies():
        if cookie['name'] == 'aws-waf-token':
            return cookie
    return None

def resolve_aws(iter=0):
    print(f"--------------------- ITER {iter}")
    script_content = driver.page_source
    key_match = re.search(r'"key":"([^"]+)"', script_content)
    iv_match = re.search(r'"iv":"([^"]+)"', script_content)
    context_match = re.search(r'"context":"([^"]+)"', script_content)
    jschallange_match = re.search(r'<script src="(.*?challenge.js.*?)".*?></script>', script_content)
    if not jschallange_match:
        # Without challenge.js there is nothing to submit to the solver
        return False
    task = {
        "type": "AntiAwsWafTaskProxyLess",
        "websiteURL": driver.current_url,
        "awsChallengeJS": jschallange_match.group(1),
    }
    # key/iv/context only exist in one variant of the challenge page
    if key_match and iv_match and context_match:
        task["awsKey"] = key_match.group(1)
        task["awsIv"] = iv_match.group(1)
        task["awsContext"] = context_match.group(1)
    data = {"clientKey": CAPSOLVER_API_KEY, "task": task}
    try:
        task_id_response = client.post(CAPSOLVER_API_ENDPOINT, json=data)
        task_id = task_id_response.json()['taskId']
        try_cnt = 0
        while True:
            cookie_response = client.post("https://api.capsolver.com/getTaskResult", json={"clientKey": CAPSOLVER_API_KEY, "taskId": task_id}).json()
            sleep(3)
            if cookie_response["status"] == "ready":
                # Get the cookie (AWS WAF token) from the CAPSOLVER response
                cookie = cookie_response["solution"]["cookie"]
                new_cookie = get_aws_cookie() or {'name': 'aws-waf-token'}
                driver.delete_cookie('aws-waf-token')
                new_cookie['value'] = cookie
                driver.add_cookie(new_cookie)
                captcha_container = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "captcha-container")))
                # The XPath starts with "//", so it searches the whole document, not just the container
                interactive_element = WebDriverWait(captcha_container, 10).until(EC.element_to_be_clickable((By.XPATH, "//button | //input | //a")))
                interactive_element.click()  # <---------- There is a mistake here, it's not clicking the submit button!
                print("Capsolver Solved")
                if driver.current_url == DEFAULT_IS24_URL:
                    print(f"Landed on DEFAULT_IS24_URL {DEFAULT_IS24_URL}")
                    driver.sleep(randint(1, 3))
                    try:
                        # The cookie banner lives inside a shadow root
                        shadow_root_ele = driver.find_element(By.CSS_SELECTOR, "#usercentrics-root").shadow_root
                        shadow_root_ele.find_element(By.CSS_SELECTOR, "button[data-testid='uc-accept-all-button']").click()
                    except Exception as e:
                        print(f"No cookie banner? e: {e}")
                    print("past the cookie banner!")
                    driver.sleep(randint(1, 3))
                    if driver.execute_script("return document.querySelectorAll('#oss-location')[1].value;") != 'Berlin':
                        driver.type("(//input[@id='oss-location'])[2]", "Berlin")
                        driver.sleep(randint(1, 3))
                    try:
                        # No idea why single click doesnt work
                        print("Clicking search")
                        driver.click('button.oss-main-criterion.oss-button.button-primary.one-whole.vertical-center-container')
                        driver.click('button.oss-main-criterion.oss-button.button-primary.one-whole.vertical-center-container')
                    except Exception as e:
                        print(f"Couldnt click? e: {e}")
                    driver.sleep(randint(1, 3))
                return True
            elif cookie_response["status"] == "failed":
                print("capsolver failed")
                return False
            else:
                print(f"capsolver not ready yet.... Status: {cookie_response['status']}")
                try_cnt += 1
                if try_cnt > 5:
                    print("capsolver did not process in time for the loop")
                    return False
    except Exception as e:
        print(f"Resolve AWS WAF failed with {e}")
        return False

delete_cookies()
driver.uc_open(url)
if re.search("awswaf", driver.page_source):
    print("AWS WAF Challenge")
    resolve_aws(0)
if re.search("awswaf", driver.page_source):
    resolve_aws(1)
if re.search("awswaf", driver.page_source):
    resolve_aws(2)
|
is it by design that you
Not criticizing or trying to gotcha you or anything, just wanting to understand the code. I am not sure the Capsolver solution actually worked. Have you tested it anywhere else? Instead of redirecting to the main page, I tried to use the 'submit' button after switching the cookies. But this didn't work; it said wrong answer. I replaced captcha_container and interactive_element with:
I am not sure how the solver is supposed to work, but changing the cookie and then clicking ok is apparently not it |
@fmmix It was never intended to return to the webpage - to be honest, I gave ChatGPT parts of the website and asked it to click that button for me. I didn't question it since I also did not notice any other buttons on the page. I could not explain why resolving the challenge would bring me to the main page, so I thought it was just some extra layer or sanity check that I'm not a bot. Now it all makes sense, however. |
I don't have any experience with stuff like this myself, digging through website elements etc. That line will tell the driver to click on the top-left symbol, as you already guessed. ` ` The issue with the page is that the submit/bestätigen ("confirm") button sits behind a #shadow-root, which hides its elements from the normal find-element functions. But it looks like the idea is that you don't have to press any buttons: replacing the cookie and refreshing the page should be enough to pass the challenge. I tried it using the 202 example from the blog: I played a bit more and almost got it to work.
Not a full solution, just the page_source part. I changed the while loop into a longer wait. And I run everything headed since I want to see what it does :-D; with headless2=True under with SB (...) it works just the same for getting the page_source. Maybe someone else can solve the missing puzzle piece? Or find another captcha solver for AWS. Update: Another thing I found out is that my manual cookie values have a length of 326, while the ones from the captcha solver are only 262 long. Looks like the solver is simply not working correctly |
Thanks @fmmix , really good work. I feel this is getting somewhat closer to a solution. I just checked here as well, and the cookie length that I receive from capsolver is 262 compared to 326 with manual solving. (That's 64 apart ... as a person that counts in the binary system: coincidence ?) For now, I've reached out to 2Captcha support to see if they can resolve the 202 response syntax (jschallenge only) and also to Capsolver to ask whether they have an explanation for what could go wrong. Will update here. |
I've got some good news over here. I switched to a different captcha solving service and it ... just works. Attached is some example code that works for me. Please let me know if it works for others here too.
import re
import sys
import requests
from selenium import webdriver
from seleniumbase import SB

URL = "https://www.immobilienscout24.de/Suche/de/berlin/berlin/wohnung-mieten?numberofrooms=4.0-&price=-3500.0&exclusioncriteria=swapflat&pricetype=rentpermonth&sorting=2&enteredFrom=result_list"
SOLVER_API_ENDPOINT_CREATE = "https://api.capmonster.cloud/createTask"
SOLVER_API_ENDPOINT_GET = "https://api.capmonster.cloud/getTaskResult"
SOLVER_API_KEY = "XYZ"
client = requests.Session()

# Note: these options are never passed to SB() below, so they have no effect.
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)

def open_page(sb, url):
    sb.driver.uc_open_with_reconnect(url, reconnect_time=2)

def get_aws_cookie(sb):
    for cookie in sb.get_cookies():
        if cookie['name'] == 'aws-waf-token':
            return cookie
    return None

def resolve_aws(sb, iter=0):
    # Locate the captcha script (jsapi.js); if there are several matches, the last one is used
    patternJsApi = r'src="([^"]*jsapi\.js)"'
    jsapi = None
    for match in re.findall(patternJsApi, sb.driver.page_source):
        print(f'SRC Value: {match}')
        jsapi = match
    if jsapi is None:
        print('No jsapi.js script found.')
        sys.exit(1)
    # The website key is embedded in the page as apiKey: "..."
    patternKey = r'apiKey:\s*"([^"]+)"'
    match = re.search(patternKey, sb.driver.page_source)
    if match:
        api_key = match.group(1)
        print(f'apiKey: {api_key}')
    else:
        print('No apiKey found.')
        sys.exit(1)
    data = {
        "clientKey": SOLVER_API_KEY,
        "task": {
            "type": "AmazonTaskProxyless",
            "websiteURL": URL,
            "userAgent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
            "captchaScript": jsapi,
            "websiteKey": api_key,
            "challengeScript": "",
            "context": "",
            "iv": "",
            "cookieSolution": True
        }
    }
    try:
        task_id_response = client.post(SOLVER_API_ENDPOINT_CREATE, json=data)
        task_id = task_id_response.json()['taskId']
        try_cnt = 0
        while True:
            sb.driver.sleep(5)
            cookie_response = client.post(SOLVER_API_ENDPOINT_GET, json={"clientKey": SOLVER_API_KEY, "taskId": task_id}).json()
            if cookie_response["status"] == "ready":
                print(f'ready: {cookie_response}')
                cookie = cookie_response["solution"]["cookies"]['aws-waf-token']
                # Swap the token value in the existing cookie, then reload the page
                new_cookie = get_aws_cookie(sb) or {'name': 'aws-waf-token'}
                sb.driver.delete_cookie('aws-waf-token')
                new_cookie['value'] = cookie
                sb.driver.add_cookie(new_cookie)
                sb.driver.sleep(3)
                sb.driver.refresh()
                sb.driver.sleep(5)
                return True
            elif cookie_response["status"] == "failed":
                print(f"solver failed: {cookie_response}")
                sys.exit(1)
            else:
                print(f"solver not ready yet.... Status: {cookie_response}")
                try_cnt += 1
                if try_cnt > 5:
                    print("solver did not process in time for the loop")
                    sys.exit(1)
    except Exception as e:
        print(f"Resolve AWS WAF failed with {e}")

def is_aws_waf(sb):
    # The challenge page mentions both "awswaf" and "Roboter" (German for "robot")
    is_awswaf = re.search("awswaf", sb.driver.page_source)
    is_roboter = re.search("Roboter", sb.driver.page_source)
    return is_awswaf and is_roboter

with SB(uc=True, headed=True) as sb:
    open_page(sb, URL)
    if is_aws_waf(sb):
        resolve_aws(sb)
        if is_aws_waf(sb):
            print(".... STILL AWS")
        else:
            print("Resolved !!!")
    else:
        sys.exit()
|
I was eagerly waiting for your post hehe. I can confirm it works for me too🥳 Amazing stuff! (In the name of science I sacrificed 7 dollars since it was the lowest amount I could paypal and didn't find the free trial at first, looks like you can request it from support if you want a free trial without paying - oh well, gonna use it up eventually 😅 ) Looks like scrolling isn't even needed which is great since I only used the SB context for the scroll function. In flathunter the driver gets passed around to different parts of the code and just using that instead of the context might be easier to implement here. |
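For the idea of working on the plain driver instead of the SB context, the cookie swap could be pulled out into a small helper, roughly like this. This is just a sketch: `replace_waf_cookie` is a made-up name, not existing flathunter code, and it only assumes the standard WebDriver methods `get_cookies`, `delete_cookie`, `add_cookie` and `refresh`.

```python
WAF_COOKIE_NAME = "aws-waf-token"


def replace_waf_cookie(driver, token, name=WAF_COOKIE_NAME):
    """Swap the WAF token cookie on a Selenium-style driver and reload the page."""
    old = next((c for c in driver.get_cookies() if c["name"] == name), None)
    # Reuse domain/path/etc. from the old cookie if the site already set one.
    new_cookie = dict(old) if old else {"name": name}
    new_cookie["value"] = token
    driver.delete_cookie(name)
    driver.add_cookie(new_cookie)
    driver.refresh()
    return new_cookie
```

Any object that offers those four methods can be passed in, so the helper does not care whether the driver comes from an SB context or is handed around elsewhere in the code.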
Has anyone committed the fix? Can't use Immoscout :( |
Thank you very much @jukoson, flathunter works for immoscout again. Is there any discussion on how to bring this fix into the repository? Probably the question right now is how to continue with the captcha solving. Options are
Either way thanks to everyone for maintaining this project! |
Hi @Oli4, all, I am one of the maintainers of flathunter. From my side, I am very open to well-formed pull requests - I am happy to review them and provide feedback. I don't have the capacity to work on it myself right now, but I will make the time to seriously look at and test and merge good PRs. We can support multiple captcha engines - we already do - so adding a new engine shouldn't be any trouble. I would be reluctant to remove existing support since it's really hard to say what users are using which features, and backward-compatibility is important. Happy also to answer high-level questions about approach here. If we need to change the crawler architecture a bit to support a new captcha solver, we can do that. But someone needs to write the PR and not break existing configs. Thanks, Arthur |
Proposing a PR has been on my ToDo ever since. I am not too familiar with the flathunter codebase and due to a lack of time I couldn't manage. If anyone wants to pick it up, let me know. Otherwise I'm positive to provide something within a week or two. For anyone struggling with captcha recognition or in desire for a no-care solution I would like to recommend taking a look at my own project www.immobilien-bot.de or straight on telegram: @codders I am happy to remove the reference if this is inappropriate. By the way Flathunter is recommended on the website :-) |
Small sad update: Seems like the changes worked on my home machine, but deploying them to my server within a docker container still resulted in failure. After some tests, it looks like the entire script tag that contains the captcha script isn't even included in the original HTML response, so there must be some kind of secondary safeguard that prevents the captcha from even appearing in some cases. Using Mullvad VPN on the container also made no difference, neither did some of the driver options mentioned here before. I'm at my wits' end for now. The weirdest part is that afaik the AWS WAF documentation doesn't include any features that would hide their scripts server-side in some cases. Very curious. |
@jukoson Thanks so much for contributing the patches and info that @DerLeole then developed into a PR, and sharing your solution to the riddle of the Immoscout captcha. I don't have a problem with you mentioning immobilien-bot here - I'm happy to finally know who it is that's behind the site. And yes - thanks for the link to Flathunter! If there are ways to bring your code and our code closer together so that collaboration is easier and we both do less maintenance work, I'm happy to talk about it. |
I'm currently also working on a fix for 2captcha. The IV and context are rotated via additional network requests, which I captured with selenium-wire. I could actually get a solved captcha back from 2captcha, but for some reason the response from 2captcha did not work. I contacted them, and they sent me a solution using their coordinate solver (generic picture capture solver). I will try to integrate it and open a PR |
Awesome! I got to that point as well, but their solution never worked and the length was significantly shorter than the length of a correctly manually solved one. If you don't want to use selenium-wire, there is a way to get all background request IDs through logging and then use the Chrome DevTools Protocol to access all the request data. That's what I did in the PR in my initial attempt. |
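The logging route could look roughly like the sketch below. It is not the code from the PR: `find_request_ids` and `fetch_bodies` are made-up names, and it assumes Chrome's performance log was enabled when the driver was created (for example via the `goog:loggingPrefs` capability set to `{"performance": "ALL"}`), so that `driver.get_log("performance")` returns entries whose `message` field is a JSON string.

```python
import json


def find_request_ids(perf_log_entries, url_substring):
    """Collect request IDs of responses whose URL contains `url_substring`."""
    request_ids = []
    for entry in perf_log_entries:
        # Each log entry wraps the actual DevTools event in a JSON string.
        message = json.loads(entry["message"])["message"]
        if message.get("method") != "Network.responseReceived":
            continue
        params = message["params"]
        if url_substring in params["response"]["url"]:
            request_ids.append(params["requestId"])
    return request_ids


def fetch_bodies(driver, request_ids):
    """Fetch response bodies through the DevTools protocol (Chromium drivers only)."""
    bodies = {}
    for request_id in request_ids:
        result = driver.execute_cdp_cmd("Network.getResponseBody", {"requestId": request_id})
        bodies[request_id] = result["body"]
    return bodies
```

A call such as `fetch_bodies(driver, find_request_ids(driver.get_log("performance"), "awswaf"))` would then return the bodies of the matching background requests; the `"awswaf"` filter is only a guess at what to match on.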
Hey @DerLeole thank you for implementing your solution in a PR. I'd really like to see this merged as a fellow immoscout enjoyer, so I added small fixes like code style changes to your fork, that (hopefully) will speed up this process. Please check it and merge so it can be propagated to the main repo, when you have time. Thanks in advance! |
@AntonKorobkov Thank you very much, and sorry for the delay on this, the past 2 weeks got busy. I actually got an answer from 2captcha support on how to implement captcha solving using a universal approach that I want to try. Support also said they are working on support for the new AWS implementation. Looking into that next week. |
@DerLeole I already implemented the coordinate solver for 2captcha and it does work. But currently it raises an exception on the second try, after one hour. One could probably fix this by just using a while loop instead of the backoff package, but I wanted it to be consistent with the rest of the project. |
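The plain-loop alternative mentioned above could be as small as the following sketch. `solve_with_retries` and its parameters are hypothetical names, not part of flathunter, and `solve_once` stands for whatever function submits the captcha and returns the solution.

```python
import time


def solve_with_retries(solve_once, max_attempts=3, delay_s=5.0, sleep=time.sleep):
    """Call `solve_once()` until it succeeds or `max_attempts` is used up."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return solve_once()
        except Exception as error:  # a real crawler would catch a narrower type
            last_error = error
            if attempt < max_attempts:
                sleep(delay_s)
    raise RuntimeError(f"captcha not solved after {max_attempts} attempts") from last_error
```

The downside compared to a decorator-based retry is exactly the consistency point raised above: the rest of the project would handle retries one way and this solver another.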
Has anyone fixed the PR?
Resolved in #634 with the implementation of Capmonster support. Please let me know / re-open the ticket if that's not working for you. |
Since May 2, 2024, there haven't been any ads coming through on my bot that is hosted on Google Cloud. The cloud project is set up exactly as in the tutorial and had been working fine for over a month now. I have logs enabled on the cloud app and every time the application is run, I get the error "IS24 bot detection has identified our script as a bot - we've been blocked", but instead of attempting the captcha with 2captcha as it used to do before, it just closes the application.
Has anyone also encountered this and/or know how to fix it?