Question/Feature Request: Reducing spaCy package size #2851
Comments
Thanks for bringing this up and yes, I definitely agree! I think the biggest bloat at the moment comes from the lemmatization lookup tables and other similar resources. The language data itself should be pretty light and only really include smaller dictionaries of rules for tokenization, norms, lemmatization and so on. Going forward, we'd love to transition the lemmatizers to rule-based solutions that rely on the tagger or to entirely statistical components (shipped via the models). These would also perform much better, so it's a win-win overall. Developing the components isn't trivial, but we do have someone working on lemmatization for spaCy now. You can find more details in #2668. We've also been getting awesome community contributions (most recently for Greek and French).
@ines Thanks very much for your thoughts and quick response! Good to hear that it's on your radar. Do you think my interim approach of simply removing those folders for now is okay? Many thanks
Yes, from spaCy's perspective, this should be okay – the language data is lazy-loaded, so |
Merging this with #3258!
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Summary
Hi, I was recently investigating the causes of our large Docker images and I noticed that the spaCy installation (v2.0.12) takes up approximately 346 MB in the site-packages directory.
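For anyone who wants to reproduce this kind of measurement, here is a minimal sketch using only the Python standard library. It is not part of spaCy: the helper names `dir_size_bytes` and `size_by_subdir` are made up for illustration, and it assumes the per-language data sits in a `lang` subdirectory of the installed package.

```python
import importlib.util
import os
from pathlib import Path


def dir_size_bytes(root):
    """Sum the sizes of all regular files under root, recursively."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):  # skip symlinks to avoid double counting
                total += os.path.getsize(path)
    return total


def size_by_subdir(root):
    """Return {subdirectory name: size in bytes} for the direct children of root."""
    root = Path(root)
    return {
        child.name: dir_size_bytes(child)
        for child in sorted(root.iterdir())
        if child.is_dir()
    }


if __name__ == "__main__":
    # Locate the installed package without importing it.
    spec = importlib.util.find_spec("spacy")
    if spec is None or not spec.submodule_search_locations:
        print("spacy does not appear to be installed in this environment")
    else:
        pkg_dir = Path(list(spec.submodule_search_locations)[0])
        print(f"total: {dir_size_bytes(pkg_dir) / 1e6:.1f} MB")
        lang_dir = pkg_dir / "lang"  # assumption: language data lives here
        if lang_dir.is_dir():
            sizes = size_by_subdir(lang_dir)
            # Largest language directories first.
            for name, size in sorted(sizes.items(), key=lambda kv: -kv[1]):
                print(f"{name}: {size / 1e6:.1f} MB")
```

Run inside the same environment (or Docker image) as the installation being measured. The numbers will differ somewhat from `du`, since this sums file sizes rather than disk blocks.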
The vast majority of this size comes from the language packages. As I only use English, my plan of attack was to fork spaCy and remove all languages other than `en`. After removing these languages, the folder size goes down to 37.5 MB. This approach seems to work fine, although I am wary about doing this.
My feature request is
What would be required to implement functionality to split out other languages so they are either optionally included in an installed wheel or optionally installed? Is this something that other people would find beneficial?
My question is
Is this a safe approach, or is there a better one?
Many thanks
Dom