GPU out of memory issues #1793

Open

chschroeder opened this issue Dec 22, 2022 · 1 comment

@chschroeder
Hi,

Thank you for this library, especially for the clear documentation. I have been enjoying using it, but the GPU memory management issue has somewhat dampened my experience in recent days.

This seems to be a long-standing issue (#487, #522, #1573, #1712; likely non-exhaustive) that is still unresolved (#1717). I can confirm that it persists with the latest release.

Use case:
The loop aggravates the existing problem in this scenario: I am encountering it in an active learning setup where a model is trained and predictions are generated repeatedly within a loop (sketched below). Of course, this can be worked around by simply using a larger GPU, which is what I previously did, but I want my examples to run on Colab as well.
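For concreteness, here is a hypothetical sketch of the loop pattern; the model name, data, and query-strategy details are placeholders, not taken from my actual setup:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")
pool = [f"unlabeled sentence {i}" for i in range(1000)]

for iteration in range(20):
    # Query step: embed the pool so a query strategy can pick new samples.
    embeddings = model.encode(pool, convert_to_tensor=True)
    # ... select samples and obtain labels (omitted) ...

    # Training step: fine-tune on the labeled data gathered so far.
    examples = [InputExample(texts=[s, s], label=1.0) for s in pool[:32]]
    loader = DataLoader(examples, shuffle=True, batch_size=16)
    loss = losses.CosineSimilarityLoss(model)
    model.fit(train_objectives=[(loader, loss)], epochs=1)
    # Memory allocated inside encode()/fit() is not always released here,
    # so GPU usage can grow from iteration to iteration until it OOMs.
```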

Description of the problem:
The problem (at least on the fit() and encode() paths, which were the relevant ones for me) leads to situations where GPU memory cannot be released or is released too late. This eventually results in OutOfMemoryError: CUDA out of memory. Tried to allocate [....]

What are your thoughts on this problem? Is someone working on it? If not: I encountered similar behaviour in my own library and managed to get it under control. See, for example, this line of code in my fit() method. Since then, this has rarely been a problem over many active learning runs. I could provide a PR applying this fix at the appropriate locations.
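For illustration, a minimal sketch of the cleanup pattern I mean, assuming the referenced line boils down to dropping stale references and then emptying the CUDA cache (the helper name here is made up):

```python
import gc

import torch


def release_gpu_memory():
    # Collect unreachable Python objects first, so that any CUDA tensors
    # they still hold become eligible for release ...
    gc.collect()
    # ... then hand the cached, now-unused CUDA blocks back to the driver.
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


# Inside the active learning loop, after each train/predict step:
#     del embeddings, loader, loss
#     release_gpu_memory()
```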

@chschroeder (Author)

Update:
While I did not succeed in building a minimal failing notebook in huggingface/setfit#242 (with limited time, at least), I keep running into this problem.

I am currently running an active learning setup which encompasses 16 single runs (8x vanilla BERT / 8x sentence transformer) with varying query strategies. The BERT model (336M parameters) is much larger than the SetFit model, yet not a single BERT run fails. The SetFit runs, however, regularly fail with RuntimeError: CUDA out of memory.

In #1795, @Dobiasd managed to provide a working sample of a leaking encode(); maybe we need to take a closer look at that part first.

Although I think all of my recent errors were raised by fit(), maybe that is just the symptom and encode() is what fills up the GPU memory.
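To help tell the two apart, a simple probe along these lines (hypothetical, not the sample from #1795) could log allocated GPU memory across repeated encode() calls:

```python
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")
sentences = ["some text to embed"] * 512

for i in range(10):
    # The returned embeddings are discarded immediately, so allocated
    # memory should stay flat; steady growth would indicate a leak.
    model.encode(sentences, batch_size=64, convert_to_tensor=True)
    mib = torch.cuda.memory_allocated() / 2**20
    print(f"iteration {i}: {mib:.1f} MiB allocated")
```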
