
Corpus reader can create batch size of 0 even when training prefixes are present #188

Open
shuttle1987 opened this issue Aug 19, 2018 · 4 comments


@shuttle1987
Member

There's a situation I ran into with test_fast in test_na.py where, if the batch size is not specified, the CorpusReader creates a batch size of 0. This seems to be a bug.

@shuttle1987
Member Author

def test_fast():
    """
    A fast integration test that runs 1 training epoch over a tiny
    dataset. Note that this does not run ffmpeg to normalize the WAVs since
    Travis doesn't have that installed. So the normalized wavs are included in
    the feat/ directory so that the normalization isn't run.
    """
    # 4 utterance toy set
    TINY_EXAMPLE_LINK = "https://cloudstor.aarnet.edu.au/plus/s/g2GreDNlDKUq9rz/download"
    tiny_example_dir = join(DATA_BASE_DIR, "tiny_example/")
    rm_dir(Path(tiny_example_dir))
    download_example_data(TINY_EXAMPLE_LINK)
    labels = corpus.determine_labels(Path(tiny_example_dir), "phonemes")
    corp = corpus.Corpus("fbank", "phonemes", Path(tiny_example_dir), labels)
    exp_dir = experiment.prep_exp_dir(directory=EXP_BASE_DIR)
    model = experiment.get_simple_model(exp_dir, corp)
    model.train(min_epochs=2, max_epochs=5)
    # Assert the convergence of the model at the end by reading the test scores
    ler = get_test_ler(exp_dir)
    # Can't expect a decent test score but just check that there's something.
    assert ler < 2.0

Specifically, on line 126 get_simple_model determines the batch size and passes it in as a parameter. A simple test case covering this could provide a nice regression test here.
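For concreteness, a sketch of such a regression test might look like the following. The import path for CorpusReader, its constructor signature, and the test name are assumptions on my part; DATA_BASE_DIR is reused from the existing test module. The point is just to assert that an unspecified batch size never ends up as 0.

from pathlib import Path

from persephone import corpus
from persephone.corpus_reader import CorpusReader  # assumed import path

def test_default_batch_size_is_positive():
    """Regression test: leaving batch_size unspecified must not yield 0."""
    # Reuses the tiny 4-utterance example set downloaded by test_fast above.
    tiny_example_dir = Path(DATA_BASE_DIR) / "tiny_example"
    labels = corpus.determine_labels(tiny_example_dir, "phonemes")
    corp = corpus.Corpus("fbank", "phonemes", tiny_example_dir, labels)

    reader = CorpusReader(corp)  # rely entirely on the default batch_size logic
    assert reader.batch_size > 0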

@shuttle1987
Member Author

Added some tests for utils.make_batch in #189 and it appears the issue isn't from there. Will work on this more tomorrow.

@shuttle1987
Member Author

I see in the constructor there's the following:

        if not num_train:
            if not batch_size:
                batch_size = 64
            num_train = len(corpus.get_train_fns()[0])

This determines num_train from the number of feature files found for training.

Do we want to warn if no feature files are found? It seems as though this class has a precondition that the feature files have already been preprocessed, so a warning would make sense here, for example along the lines sketched below.
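As a sketch only (the logger setup and message wording are illustrative, not what the code currently does), the constructor could warn like this:

import logging

logger = logging.getLogger(__name__)

# ...inside the CorpusReader constructor...
if not num_train:
    if not batch_size:
        batch_size = 64
    train_fns = corpus.get_train_fns()[0]
    if not train_fns:
        logger.warning(
            "No training feature files found; expected preprocessed "
            "features to already exist for this corpus.")
    num_train = len(train_fns)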

@oadams oadams removed the bug label Aug 20, 2018
@oadams
Collaborator

oadams commented Aug 20, 2018

I just pushed a check in train_batch_gen() that throws an exception if the Corpus has no training utterances.
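Roughly, the check amounts to something like the sketch below; the exception class, message, and surrounding structure are assumptions rather than the exact code that was pushed:

def train_batch_gen(self):
    """Yield training batches, failing fast if the corpus has no training data."""
    train_fns = self.corpus.get_train_fns()[0]
    if len(train_fns) == 0:
        # Exception type is an assumption; the real code may use a project-specific class.
        raise RuntimeError("Corpus has no training utterances; cannot generate batches.")
    # ...batch generation proceeds as before...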

I'm removing the bug label now, but I'll keep the issue open since there's an open question of whether to move some of that batch_size logic from the constructor to .train_batch_gen(), since it's only relevant there. The reason I don't want to do this immediately is that it's probably a good idea to load the dev and test data in batches too, since conceivably they won't all fit in memory.

Perhaps the simplest option is to give the batch_size argument a reasonable kwarg default (32 or so) and use it when generating train, dev and test batches. I can then get rid of most of the logic, including the exception I added a while back about num_train being divisible by batch_size. That divisibility doesn't really matter; the model should just feed the remainder in as a batch smaller than the others.
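For illustration, the remainder-friendly batching could be as simple as the sketch below; the helper name is hypothetical and not necessarily the existing utils.make_batch:

def make_batches(fns, batch_size=32):
    """Split a list of utterance filenames into batches, keeping any remainder."""
    for start in range(0, len(fns), batch_size):
        yield fns[start:start + batch_size]

# e.g. 10 utterances with batch_size=4 -> batches of sizes 4, 4 and 2
assert [len(b) for b in make_batches(list(range(10)), batch_size=4)] == [4, 4, 2]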
