
train.py memory problem #1

Open
transfluxus opened this issue Apr 25, 2017 · 7 comments

Comments

@transfluxus

Is there a way to use a word embedding generated with something else (gensim, for example)?
This implementation dies after a while on my relatively large data set (with 32 GB of memory).

@kefirski (Owner) commented Apr 25, 2017

What do you mean by "dies after a while"?
There are no restrictions on the nature of the word embeddings; you just have to save them in the appropriate file and the Embedding module will pick them up.
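
For anyone trying this with gensim: a minimal sketch of building that file, assuming words_vocab.pkl holds the index-to-word list and that word_embeddings.npy is a (vocab_size, embedding_dim) float32 matrix; the paths and the vectors.bin name are illustrative, not the repo's.

```python
# Sketch: export pretrained gensim vectors as word_embeddings.npy.
# Assumes words_vocab.pkl stores the index-to-word list and the model
# loads a (vocab_size, embedding_dim) float32 matrix -- both assumptions.
import pickle
import numpy as np
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format('vectors.bin', binary=True)  # illustrative path

with open('data/words_vocab.pkl', 'rb') as f:
    idx_to_word = pickle.load(f)

matrix = np.zeros((len(idx_to_word), kv.vector_size), dtype=np.float32)
for i, word in enumerate(idx_to_word):
    if word in kv:
        matrix[i] = kv[word]
    else:
        # small random init for words missing from the pretrained vectors
        matrix[i] = np.random.normal(0.0, 0.1, kv.vector_size)

np.save('data/word_embeddings.npy', matrix)
```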

@transfluxus (Author) commented Apr 25, 2017

It says 'Killed' after 20 minutes at most.

@transfluxus (Author)

Training outputs several files:
characters_vocab.pkl, train_character_tensor.npy, train_word_tensor.npy, valid_word_tensor.npy, words_vocab.pkl, valid_character_tensor.npy and word_embeddings.npy.
Which ones do I need for the next steps?

@xushenkun

I think "dies after a while" is because the seq_len is too long.
I have encountered this sometimes and it's alright after I reduced the length of each corpus sentence.

@transfluxus (Author)

Interesting. It was a while ago, so I don't remember whether I used single sentences or whole documents as a "sentence", but I guess I used sentences. How would I chop them?

@xushenkun

@transfluxus I used a Chinese corpus, and each sentence had to be under 300 words or it crashed; I'd guess the limit is under 1000 words for an English corpus. I just split a sentence wherever it had commas or full stops.
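
A minimal sketch of that chopping, assuming one sentence per line of the corpus; the 300-word cap and the comma/full-stop set are just the values mentioned above:

```python
# Split a line at commas/full stops and regroup the clauses so that no
# chunk exceeds max_len words. A single clause longer than max_len is
# kept intact rather than split mid-clause.
import re

def chop(line, max_len=300):
    chunks, current = [], []
    for clause in re.split(r'(?<=[,.])\s+', line.strip()):
        words = clause.split()
        if current and len(current) + len(words) > max_len:
            chunks.append(' '.join(current))
            current = []
        current.extend(words)
    if current:
        chunks.append(' '.join(current))
    return chunks
```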

@transfluxus (Author)

I limited the sentence length to 100 and it still doesn't run through; actually, train_word_embedding already fails.
Loading the whole corpus and then creating multiple representations of it is not practical once the corpus has real size (4.2 million sentences in my case). It has to be streamed.
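
For what that streaming could look like, a rough sketch using plain generators; none of this is the repo's actual preprocessing, just the standard pattern:

```python
# Iterate over the corpus one line at a time instead of loading all
# 4.2M sentences into memory; batches are assembled on the fly.
def stream_sentences(path):
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:
                yield line.split()

def stream_batches(path, batch_size=32):
    batch = []
    for sentence in stream_sentences(path):
        batch.append(sentence)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch
```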
