Immoscout: Bot detection/No captcha necessary #302
Comments
Hi @phi1eas , I've definitely had the same experience as you before: the URL in the chrome-driver frame gets detected, but the same URL in a normal browser window works fine. We used to have that regularly before we switched to the @ozeidan made a comment in #272 that they have been working on a solution based on an What I can recommend, if you are just doing a simple search in Berlin, is to use the hosted version at https://flathunter.codders.de . You can log in there with your Telegram account and set up a basic filter, and you will get messages about new flats in Berlin. No setup is required on your side, and the Immoscout crawling is working at the time of writing. The Hope that helps! |
Thank you so much for your quick and helpful response! I will look into your references and try to contribute where I can. All the best! |
Just to make sure I'm not missing something: Running flathunter without --headless driver argument, I get this site: Now if I copy the link and open a new tab in the same window, I get this site with a captcha: Doesn't this mean that there must be some different information passed by the browser if I manually open the link, as opposed to opening it within flathunter? Maybe we could use that? Thanks again! |
Yeah, I mean, obviously somewhere there there must be a difference. The tricky part is working out where. You could try and spy on the traffic between the browsers and immoscout to see what the difference in requests is, but it might also be that some Javascript is running in the page after it loads to decide whether or not to show the captcha. It could be about the position of your mouse, or the size of the window, or pretty much any property of the application (browsers running javascript give away a lot of clues). But the fact that you can reproduce it, and that you have a good case and a bad case on the same machine, is already a solid start for investigating. |
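One way to start on the "spy on the traffic" suggestion is to export the request headers from the good case and the bad case (for example from the browser's network tab) and compare them. The sketch below is only an illustration: the function name and the sample header values are made up, and it will not catch differences that come from JavaScript running in the page.

```python
def diff_headers(good, bad):
    """Return the headers whose values differ between two captured requests."""
    # HTTP header names are case-insensitive, so normalise them before comparing.
    good_n = {name.lower(): value for name, value in good.items()}
    bad_n = {name.lower(): value for name, value in bad.items()}
    diff = {}
    for name in sorted(good_n.keys() | bad_n.keys()):
        if good_n.get(name) != bad_n.get(name):
            # Each entry is (value in the good case, value in the bad case);
            # None means the header was missing on that side.
            diff[name] = (good_n.get(name), bad_n.get(name))
    return diff


if __name__ == "__main__":
    # Hypothetical header sets, for illustration only.
    manual = {"User-Agent": "Browser/1.0", "Accept-Language": "de-DE"}
    automated = {"user-agent": "Browser/1.0 Headless"}
    for header, (good_value, bad_value) in diff_headers(manual, automated).items():
        print(f"{header}: manual={good_value!r} automated={bad_value!r}")
```

If the headers turn out to be identical, that points towards in-page fingerprinting rather than the request itself.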
Ah. I should also say. The code we use to launch the window also blocks the GeeTest API call (I think the captcha is powered by GeeTest). We do this so that we can request the Captcha from Python without re-using the same captcha token twice. So that is obviously one difference between the automated browser and the manual browser. You can try disabling that (https://github.com/flathunters/flathunter/blob/main/flathunter/chrome_wrapper.py#L46) and see if that makes a difference. Flathunter won't be able to solve the captcha, but you'll be able to see if that's what's tripping the bot detection. |
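For anyone who wants to toggle the blocking while experimenting, here is a minimal sketch of how URL blocking can be switched on and off through the Chrome DevTools Protocol from Selenium. It is not a copy of the code in chrome_wrapper.py: the helper name and the GeeTest URL pattern are assumptions, so check the linked line for what Flathunter actually blocks.

```python
# Assumed pattern for the GeeTest API call; the real value lives in chrome_wrapper.py.
GEETEST_PATTERN = "https://api.geetest.com/get.*"


def set_url_blocking(driver, patterns):
    """Block requests matching `patterns` in a Chrome session driven by Selenium.

    `driver` is expected to be a selenium.webdriver.Chrome instance (or any
    object that exposes execute_cdp_cmd). Passing an empty list clears the
    block list again.
    """
    # Network.setBlockedURLs is a Chrome DevTools Protocol command.
    driver.execute_cdp_cmd("Network.setBlockedURLs", {"urls": list(patterns)})
    # The Network domain has to be enabled for the block list to take effect.
    driver.execute_cdp_cmd("Network.enable", {})
```

Calling `set_url_blocking(driver, [GEETEST_PATTERN])` reproduces the automated case, and `set_url_blocking(driver, [])` lets the captcha request through so the two behaviours can be compared in the same window.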
Just merged in #313, which bumps |
I tried the updated version but had no luck. The output remains the same that @phi1eas described. |
I set everything up today (Feb 28, running on Mac OSX 10.14 & sending notifications via Telegram, captchas solved w/ Imagetyperz), and I had the same problem (first without any However, I got it working (no longer detecting me as a bot) after I added the additional
UPDATE: Never mind, I guess it really is somehow stochastic / traffic-dependent? Because now I'm running it and being detected as a bot again (without any change to the |
@conorheins Damn - nice try! Thanks for the updates, and sorry to hear that you're struggling with the bot detection. I don't know if it would help you to turn down the looping frequency. It's really hard to see from here what makes a difference. As far as I can tell, it works okay most of the time for most users, but it's for sure not working for everyone all the time. |
Thanks for the quick reply @codders -- good to know, I'll try messing with the looping frequency. To be clear, by that you mean decreasing the count in |
Increasing the |
Is there anything else I could try changing / playing with to make IS24 crawler work in Google Cloud Deployment? It doesn't work for me at all (gets blocked all the time) |
@infctr If you've tried everything here, I'm not sure what else. What deployment region are you using in Google Cloud? For me, it's working reliably out of |
I also face the same issue (local run on a Windows 10 laptop), so I tried commenting out this line. Flathunter still reports "Unable to find IS24 variable in window" and "IS24 bot detection has identified our script as a bot - we've been blocked". In the browser it shows the "Gleich geht's weiter" page, which quickly redirects to the "Ich bin kein Roboter" page without a captcha, and then the captcha appears after a second or so. With this line uncommented, the captcha does not appear. So there is indeed some relation, but unfortunately the script can't pass it anyway. |
Hi, I just read here about this problem: I wrote my own script with headless chrome and a php-wrapper for immoscout. I do not crawl the html-version, but the json-url they use for the map. It looks like this: https://www.immobilienscout24.de/Suche/controller/mapResults.go?searchUrl=/Suche/radius/wohnung-mieten? My crawler gets blocked initially and then periodically after about 20 minutes. The blocker page from above then shows up without the captcha; the captcha is only displayed on the web-version. I think they do some kind of browser-fingerprinting with the script they load from https://www.immobilienscout24.de/assets/immo-1-17 (I think an antibot-script from distil network?) However, you can simply open the json-page in a new incognito window and reload it without solving any captcha and you will get through. So my workaround right now is very silly: I copy the value of the cookie "reese84" from the incognito window to my script, then it runs again for about 20 minutes. I think immoscout just does some kind of whitelisting for your browser with the distil-script and sets a fresh reese84 cookie when the script does not detect you as a bot. And: Solving a captcha on the web-version does not help in this case, you still get blocked on the json-version, and vice versa. (test-case: if you open the web-version with headless chrome (in non-headless mode) and pass the captcha, the data for the map from the json-url does not load). Anyway, your script works differently I suppose but maybe this info is helpful (or old for you, then sorry for the interruption) ... |
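The cookie workaround described above could look roughly like this in Python. This is a sketch under assumptions, not the commenter's PHP code: the helper name is made up, the cookie value and user agent have to be copied by hand from a browser session that was let through, and according to the comment the cookie stops working after about 20 minutes.

```python
import urllib.request


def build_request(url, reese84, user_agent):
    """Build a GET request that carries a reese84 cookie copied from a browser."""
    headers = {
        # Cookie value copied manually from a browser that was not blocked.
        "Cookie": f"reese84={reese84}",
        # Assumption: sending the same user agent as that browser is sensible,
        # since the cookie may be tied to the browser fingerprint.
        "User-Agent": user_agent,
    }
    return urllib.request.Request(url, headers=headers)


# Usage (not executed here): pass the result to urllib.request.urlopen()
# and decode the JSON body of the response.
```

Whether the server also checks the IP range or other fingerprint data is unknown, so a valid cookie alone may not be enough.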
So maybe it is a very big misconception on my side, but the idea is that you check on your side whether a fresh cookie in your script solves your problem, and if so (probably not, because there might be some additional IP-range blocking), we could search for a service to automate this (=> send-url-and-return-fresh-cookie-api)? I did not find such a service on 2captcha or imagetyperz.... |
Hi @trendschau , Thanks for the detailed investigation and information. Is your code up on Github anywhere? I'm not sure if what you describe relates to the problem that our users encounter or not. Right now, for many users, the captcha solving works "just fine" - I have an instance running on Google Cloud that has been scraping ImmoScout for years without problems using the Flathunter code. I have also noticed the It seems like ticket_tracker_api solved this with JS injection - that might be something we could try or investigate. |
@codders totally agree, I don't know if it is related to the problem described in this ticket, but you can easily prove it by adding a valid reese84 cookie to headless chrome. Since Flathunter works fine for all other users, the reason for the blocking page might be totally different, but maybe the solution is similar. My code is probably not of interest (very basic), but I cleaned it of all captcha-solving parts (not needed anymore) and pushed it to github. I never planned to publish it, so I am sorry for the spaghetti ... I think a super simplistic workaround might be a browser extension in another window that stores cookies periodically on the file system, in combination with a page refresh extension (something like https://github.com/ktty1220/export-cookie-for-puppeteer but without manual action). But I have to stop coding and start searching for a flat now ... |
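On the crawler side, picking up a cookie that an extension has written to disk could be as simple as the sketch below. The file layout is an assumption (a JSON list of objects with "name" and "value" keys, as Puppeteer-style cookie exports typically use), and the function name is hypothetical.

```python
import json
from pathlib import Path


def load_cookie_value(path, name="reese84"):
    """Return the value of the named cookie from an exported JSON file, or None."""
    # Assumed format: [{"name": "...", "value": "...", ...}, ...]
    cookies = json.loads(Path(path).read_text(encoding="utf-8"))
    for cookie in cookies:
        if cookie.get("name") == name:
            return cookie.get("value")
    return None
```

Re-reading the file before each crawl would let the script pick up a refreshed cookie without a restart.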
@trendschau Thanks for the tip and for the code! Yes - any hints are welcome to resolve this, and I'll be happy to try this (or even happier if someone else on the thread wants to make a PR). If it fixes the issue for the users that are struggling, it would be an amazing find. Best of luck with your search! |
@codders just to finish this: I found a way to automate the process with two browser extensions. Very dirty but it seems to work for now, so immobilienscout has some open data there :D Btw the archive-part of their website is completely unprotected as well, although not very helpful for flat searchers. Pushed the code in case it is of interest. Good luck to you all, too! |
I solved this problem by injecting my cookie into the header of the GET request in abstract_crawler.py. It seems that if you have a valid cookie from one of your logged-in IS24 sessions, you can bypass the robot check. Btw I have a premium account, so that might be a thing for the paid users. I see that @trendschau already pointed out a similar solution |
So I’ve been playing with this as well and I noticed that when I got detected as a bot (with no captcha showing, as above) I can log into IS24 in the running Chrome session with my user account (plus) and then the subsequent reloads work fine. Don't know yet for how long. Will report back. |
@yanone did you try to use the set cookie feature? |
Yes, I did, and it wouldn’t work, still blocked. |
And that's probably because the Selenium app is a separate process. The Chrome that I opened manually and got the cookie from and the Chrome that the bot opens are two different instances. |
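Because the two Chrome instances do not share a cookie store, the cookie has to be set inside the Selenium-driven session itself. The sketch below shows the general Selenium pattern. It is not Flathunter's own "set cookie" feature, and the helper name and default origin are assumptions.

```python
def inject_cookie(driver, value, name="reese84",
                  origin="https://www.immobilienscout24.de/"):
    """Set a cookie inside a Selenium-driven browser session.

    `driver` is expected to be a Selenium WebDriver (anything exposing
    get() and add_cookie() works for this sketch).
    """
    # Selenium only accepts cookies for the domain of the page that is
    # currently loaded, so visit the site first. Note that this first
    # visit may itself be flagged by the bot detection.
    driver.get(origin)
    driver.add_cookie({"name": name, "value": value})
```

After the call, load the search URL again so the request is sent with the new cookie. `driver.get_cookie("reese84")` can then be used to confirm what the session actually holds.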
Are you sure you are copying the correct cookie in the correct format? If it seems harder than logging in manually, probably there is something wrong :D |
Update: It ran for about an hour, and now they've logged me out and are showing me the captcha page without a captcha again. |
an hour is not so bad :D |
Hello Hello wonderful people! I get almost the same error. My coding skills are intermediate/low, so I tried to play a little with the settings. [2023/07/14 09:49:21|config.py |INFO ]: Using config path C:\Users\asus\flathunter/config.yaml |
I'm seeing that the headed chromium browser isn't setting the same reese84 cookie that I have in my config file. Anyone else able to see that it is being set correctly? |
I am also observing the same issue as reported in the ticket when I start the script. I have tried also the reese84 cookie approach but still it gets detected from the beginning. |
Does anyone know how to resolve this issue? I've tried everything discussed here, different reese84 cookie values, etc... |
Same for me. Would love an update! Getting blocked right out of the gate, even with my normal browser's reese84 cookie... |
Same here, at first it lasted at least a day, now I am getting blocked basically right away |
To the commenters who are struggling, it would be great if you could leave some info about your setup - what OS, docker or direct, chromedriver arguments etc. |
+1 my config running on windows docker desktop with reese84 cookie variable set:
2023-10-31 09:00:21 [2023/10/31 08:00:21|abstract_crawler.py |INFO ]: Timeout waiting for iframe element - no captcha verification necessary? same in WSL (Windows Subsystem for Linux) environment: |
Not a flathunter issue, but I'm working on a project that has the same issues. I noticed that with headless chrome via puppeteer, the browser gets locked out without showing a captcha. With headed chrome, I was able to bypass bot detection using the paid capsolver.com API (https://www.capsolver.com/blog/The-other-captcha/bypass-imperva-nodejs) and 2captcha for the geetest captcha. Guess I'll just keep running in headed mode for now, although it's probably a bit more resource hungry. I deciphered the |
For me it never worked once with any kind of driver arguments or reese values |
Maybe #514 will help some of you... |
Marking this as resolved with the recent changes in #634 . Please feel free to open a new ticket if this continues to be a problem. |
Hi,
I am trying to run flathunter on Immoscout24 using imagetyperz. I run into the following issue:
What I think is weird is this: If I do not pass "--headless" as a driver_argument, a Chromium window opens. This window has the immoscout bot detection page loaded. If I copy the URL from that window, and open this URL in a new tab in Chromium, I get the same page, but this time with the Captcha.
Is this because immoscout24 classified me as a bot, or is there something else going on?
This is my config.yaml:
Thank you so much!