Question/Feature Request: Reducing spaCy package size #2851

DomHudson · 2018-10-15T17:42:51Z

Summary

Hi, I was recently investigating why our Docker images are so large, and I noticed that the spaCy installation (v2.0.12) takes up approximately 346 MB in the site-packages directory.

The vast majority of this size comes from the language packages. As I only use English, my plan of attack was to fork spaCy and remove all languages other than en. After removing them, the folder size drops to 37.5 MB. This approach seems to work fine, although I am wary of doing it.
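For anyone wanting to reproduce the measurement and the clean-up without forking, here is a minimal sketch. It assumes each language lives in its own subdirectory of the installed package's lang folder; the helper names (dir_size, lang_sizes, prune_languages) are made up for illustration and are not part of spaCy. Pruning defaults to a dry run so nothing is deleted by accident.

```python
import shutil
from pathlib import Path


def dir_size(path):
    """Return the total size in bytes of all regular files under path."""
    return sum(f.stat().st_size for f in Path(path).rglob("*") if f.is_file())


def lang_sizes(lang_dir):
    """Map each language subdirectory name to its size in bytes, largest first."""
    sizes = {
        child.name: dir_size(child)
        for child in Path(lang_dir).iterdir()
        # Skip shared modules (plain files) and folders such as __pycache__.
        if child.is_dir() and not child.name.startswith("_")
    }
    return dict(sorted(sizes.items(), key=lambda item: item[1], reverse=True))


def prune_languages(lang_dir, keep=("en",), dry_run=True):
    """Remove language subdirectories that are not listed in keep.

    Returns the sorted names that were (or, with dry_run, would be) removed.
    """
    removed = []
    for child in sorted(Path(lang_dir).iterdir()):
        if not child.is_dir() or child.name.startswith("_") or child.name in keep:
            continue
        if not dry_run:
            shutil.rmtree(child)
        removed.append(child.name)
    return removed
```

With spaCy installed, the directory to pass in would presumably be Path(spacy.__file__).parent / "lang". Calling prune_languages on it with the default dry_run=True only lists what would go; pass dry_run=False to actually delete.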

My feature request is

What would be required to split out the other languages so that they are either optionally included in an installed wheel or optionally installed? Is this something that other people would find beneficial?

My question is

Is this a safe approach, or is there a better one?

Many thanks
Dom

ines · 2018-10-15T17:57:32Z

Thanks for bringing this up and yes, I definitely agree!

I think the biggest bloat at the moment comes from the lemmatization lookup tables and other similar resources. The language data itself should be pretty light and only really include smaller dictionaries of rules for tokenization, norms, lemmatization and so on. Going forward, we'd love to transition the lemmatizers to rule-based solutions that rely on the tagger or to entirely statistical components (shipped via the models). These would also perform much better, so it's a win-win overall.

Developing the components isn't trivial, but we do have someone working on lemmatization for spaCy now. You can find more details in #2668. We've also been getting awesome community contributions (most recently for Greek and French).

DomHudson · 2018-10-15T18:42:04Z

@ines Thanks very much for your thoughts and quick response! Good to hear that it's on your radar. Do you think my interim approach of simply removing those folders for now is okay?

Many thanks
Dom

ines · 2018-10-15T23:34:55Z

> Do you think my interim approach of simply removing those folders for now is okay?

Yes, from spaCy's perspective, this should be okay: the language data is lazy-loaded, so spacy.lang.tr, for instance, is only required if you load a model that specifies it, if you run util.get_lang_class('tr'), or if you actually import it from spacy.lang. (And if you run the tests, I guess.)
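To illustrate why a missing language folder only matters once that language is requested, here is a rough sketch of the lazy-loading idea. This is not spaCy's actual implementation; load_lang_module is a hypothetical helper, and the "spacy.lang" default is only an assumption about the package layout.

```python
import importlib


def load_lang_module(lang, package="spacy.lang"):
    """Import a language subpackage only at the moment it is requested.

    Nothing under `package` is imported until this function is called, so a
    deleted language folder only causes an error when that language is used.
    """
    try:
        return importlib.import_module(f"{package}.{lang}")
    except ImportError as err:
        raise ImportError(
            f"Language data for '{lang}' is not available in '{package}'"
        ) from err
```

So with the tr folder removed, a lookup for 'tr' through a loader like this would fail, while a lookup for 'en' would be unaffected.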

ines · 2019-03-10T22:48:15Z

Merging this with #3258!

lock · 2019-04-09T23:21:48Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

ines added the "enhancement" (Feature requests and improvements) label on Oct 15, 2018

ines closed this as completed Mar 10, 2019

jdukatz mentioned this issue on Mar 20, 2019, in "Spacy import in v2.1 appears to depend on Japanese language class" #3446 (closed)

lock bot locked as resolved and limited conversation to collaborators Apr 9, 2019
