
selenium / chromedrive should clean up on exit #155

Open
step21 opened this issue Feb 14, 2022 · 6 comments
Labels
enhancement New feature or request

Comments

@step21

step21 commented Feb 14, 2022

In my observations, when quitting or restarting flathunter, the selenium/google-chrome processes are not closed properly. It would likely need something like chromedriver.quit() somewhere, but I'm not sure where.
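One way to get a `driver.quit()` call on shutdown, as suggested above, is to register it as an exit handler. A minimal sketch, assuming a selenium `driver` object with a `quit()` method (the `install_cleanup` helper is hypothetical, not part of flathunter):

```python
import atexit
import signal
import sys

def install_cleanup(driver):
    # atexit covers normal interpreter shutdown and unhandled exceptions;
    # a SIGTERM handler covers `kill` / service-manager stops, which would
    # otherwise bypass the atexit machinery entirely.
    atexit.register(driver.quit)

    def _on_sigterm(signum, frame):
        sys.exit(0)  # raises SystemExit, which in turn runs the atexit handlers

    signal.signal(signal.SIGTERM, _on_sigterm)
    return driver
```

Note that no handler can run on SIGKILL (`kill -9`) or a hard machine crash, so this reduces but does not eliminate leftover processes.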

@iwasherefirst2

I think this is the reason why the script starts failing after two weeks, see #145 and also https://stackoverflow.com/questions/71351792/after-one-selenium-timeoutexception-i-get-always-sessionnotcreatedexception

I noticed that when I try to start chromedriver manually it fails with "port already in use". Checking my process list, there are indeed leftovers:

[Screenshot from 2022-03-10 20-30-17: process list showing leftover chromedriver processes]

A workaround is to kill all chromedriver instances in your start script:

killall chromedriver 

@step21
Author

step21 commented Mar 10, 2022

I also noticed this: when the script crashes or is shut down, for example, it never cleans up chromedriver properly. However, I don't use it anymore right now and also don't have the time to find out where this would need to be fixed. So my easy solution was to just restart daily, and then regularly either restart the machine or clean up chromedriver.

@iwasherefirst2

The problem is the script starts chromedriver here https://github.com/flathunters/flathunter/blob/main/flathunter/abstract_crawler.py#L48-L56

One could put a try-catch around the infinite loop and quit the driver. But this would still cause issues if you kill the process manually. So maybe closing all instances of webdriver before starting a new one is the best solution here?
But killing all processes each time before I start the script does the job for me at the moment.
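The try-catch idea above could look roughly like this in Python (a `try`/`finally`). This is a sketch, not flathunter's actual loop; `hunt_once` and `interval_seconds` are placeholder names:

```python
import time

def run_forever(driver, hunt_once, interval_seconds=600):
    """Run the crawl loop, guaranteeing driver.quit() on the way out."""
    try:
        while True:
            hunt_once()
            time.sleep(interval_seconds)
    finally:
        # Runs on ordinary exceptions, KeyboardInterrupt, and SystemExit,
        # so the chromedriver child process is torn down with the session.
        driver.quit()
```

As noted, this still leaves orphans behind if the process is killed with SIGKILL, so it complements rather than replaces the startup cleanup.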

How come you are not using it anymore? Did you find a flat yet? Thanks for all your help with supervisor etc.

@step21
Author

step21 commented Mar 10, 2022

Yeah, I would have to dig into what would be the best way to do it properly. But for now I would probably also do it like you.
Yes, for the time being I found a flat, even though it is at first only for one year (possibly unlimited after). Still, flathunter helped.

@mordax7

mordax7 commented May 26, 2022

Maybe a quick explanation for why we do not close the driver sessions while crawling.
We purposely do not close the browser session every time it's done crawling the page to keep the session-id alive. Otherwise, you would trigger the captcha every time you try to crawl the page, making it more expensive to crawl.

Probably the accumulation happens when the process is not stopped properly. Killing the processes with a script before starting flathunter is a good approach, but users should do this themselves because it can vary from use case to use case. Another solution would be to run it in a Docker container.

We could implement a way for flathunter to close all of its own chromedriver instances before starting the crawler. However, this would be a bit more work. I found driver.service.process.pid with a quick search. If we stored this somewhere persistent, we could check whether the process with that PID still exists the next time flathunter starts, and kill it.

I do not see any other way.
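The PID-file approach described above could be sketched like this. The file location and both helper functions are hypothetical (flathunter does not currently write a PID file); only `driver.service.process.pid` comes from selenium:

```python
import os
import signal
from pathlib import Path

# Assumed location; any persistent path would do.
PID_FILE = Path.home() / ".local" / "flathunter" / "chromedriver.pid"

def kill_stale_driver(pid_file=PID_FILE):
    """Kill the chromedriver left behind by a previous run, if any."""
    try:
        pid = int(pid_file.read_text())
    except (FileNotFoundError, ValueError):
        return  # no previous run recorded, or the file is corrupt
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        pass  # the old process is already gone
    pid_file.unlink(missing_ok=True)

def record_driver_pid(driver, pid_file=PID_FILE):
    """Persist driver.service.process.pid for the next start-up."""
    pid_file.parent.mkdir(parents=True, exist_ok=True)
    pid_file.write_text(str(driver.service.process.pid))
```

One caveat: by the next start the OS may have reused the PID for an unrelated process, so verifying the process name before killing (e.g. with psutil) would make this safer.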

@step21
Author

step21 commented May 26, 2022

What is the way to stop it properly? In my experience, especially when it crashed due to a malformed 2captcha response, it never cleaned up on killing/restarting (as far as I could tell).
I also read that, without restrictions, the Chromium driver will take as much memory as it can get. This can be a problem because your whole machine's memory fills up with Chromium, and on smaller machines (maybe on larger ones too, but I'm not sure) this means it becomes unresponsive regularly. The solutions I found online suggested using cgroups or Docker to limit memory consumption.

So

  • store the PID in the directory of config.yml (or in config.yml itself), or in ~/.local/flathunter/pid
  • use cgroups or something else to limit memory?
    (I know that cleaning up old instances and memory consumption are technically two different things, but for me they are related, so I'm addressing both here.)

@alexanderroidl alexanderroidl added the enhancement New feature or request label Jul 25, 2022