
Corpus reader can create batch size of 0 even when training prefixes are present #188

Open
shuttle1987 opened this issue Aug 19, 2018 · 4 comments


@shuttle1987
Member

There's a situation I ran into with test_fast in test_na.py where, if the batch size is not specified, the CorpusReader creates a batch size of 0. This seems to be a bug.

@shuttle1987
Member Author

def test_fast():
    """
    A fast integration test that runs 1 training epoch over a tiny
    dataset. Note that this does not run ffmpeg to normalize the WAVs since
    Travis doesn't have that installed. So the normalized wavs are included in
    the feat/ directory so that the normalization isn't run.
    """
    # 4 utterance toy set
    TINY_EXAMPLE_LINK = "https://cloudstor.aarnet.edu.au/plus/s/g2GreDNlDKUq9rz/download"
    tiny_example_dir = join(DATA_BASE_DIR, "tiny_example/")
    rm_dir(Path(tiny_example_dir))
    download_example_data(TINY_EXAMPLE_LINK)
    labels = corpus.determine_labels(Path(tiny_example_dir), "phonemes")
    corp = corpus.Corpus("fbank", "phonemes", Path(tiny_example_dir), labels)
    exp_dir = experiment.prep_exp_dir(directory=EXP_BASE_DIR)
    model = experiment.get_simple_model(exp_dir, corp)
    model.train(min_epochs=2, max_epochs=5)
    # Assert the convergence of the model at the end by reading the test scores
    ler = get_test_ler(exp_dir)
    # Can't expect a decent test score but just check that there's something.
    assert ler < 2.0

Specifically, on line 126 get_simple_model determines the batch size and passes it in as a parameter. A simple test case covering this could provide a nice regression test here.
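For concreteness, a sketch of such a regression test might look like the following. The import path for CorpusReader, its constructor signature, and the test name are assumptions on my part; DATA_BASE_DIR is reused from the existing test module. The point is just to assert that an unspecified batch size never ends up as 0.

from pathlib import Path

from persephone import corpus
from persephone.corpus_reader import CorpusReader  # assumed import path

def test_default_batch_size_is_positive():
    """Regression test: leaving batch_size unspecified must not yield 0."""
    # Reuses the tiny 4-utterance example set downloaded by test_fast above.
    tiny_example_dir = Path(DATA_BASE_DIR) / "tiny_example"
    labels = corpus.determine_labels(tiny_example_dir, "phonemes")
    corp = corpus.Corpus("fbank", "phonemes", tiny_example_dir, labels)

    reader = CorpusReader(corp)  # rely entirely on the default batch_size logic
    assert reader.batch_size > 0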

@shuttle1987
Member Author

Added some tests for utils.make_batch in #189 and it appears the issue isn't from there. Will work on this more tomorrow.

@shuttle1987
Member Author

I see in the constructor there's the following:

        if not num_train:
            if not batch_size:
                batch_size = 64
            num_train = len(corpus.get_train_fns()[0])

This determines num_train from the number of feature files found for training.

Do we want to warn if no feature files are found? It seems as though this class has a precondition that the feature files have already been preprocessed, so a warning would make sense here, for example along the lines sketched below.
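As a sketch only (the logger setup and message wording are illustrative, not what the code currently does), the constructor could warn like this:

import logging

logger = logging.getLogger(__name__)

# ...inside the CorpusReader constructor...
if not num_train:
    if not batch_size:
        batch_size = 64
    train_fns = corpus.get_train_fns()[0]
    if not train_fns:
        logger.warning(
            "No training feature files found; expected preprocessed "
            "features to already exist for this corpus.")
    num_train = len(train_fns)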

@oadams oadams removed the bug label Aug 20, 2018
@oadams
Collaborator

oadams commented Aug 20, 2018

I just pushed a check in train_batch_gen() that throws an exception if the Corpus has no training utterances.
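Roughly, the check amounts to something like the sketch below; the exception class, message, and surrounding structure are assumptions rather than the exact code that was pushed:

def train_batch_gen(self):
    """Yield training batches, failing fast if the corpus has no training data."""
    train_fns = self.corpus.get_train_fns()[0]
    if len(train_fns) == 0:
        # Exception type is an assumption; the real code may use a project-specific class.
        raise RuntimeError("Corpus has no training utterances; cannot generate batches.")
    # ...batch generation proceeds as before...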

I'm removing the bug label now, but I'll keep the issue open since there's an open question of whether to move some of that batch_size logic from the constructor to .train_batch_gen(), since it's only relevant there. The reason I don't want to do this immediately is that it's probably a good idea to load the dev and test data in batches too, since conceivably they won't all fit in memory.

Perhaps the simplest option is to give the batch_size argument a reasonable kwarg default (32 or so) and use it when generating train, dev and test batches. I can then get rid of most of the logic, including the exception I added a while back about num_train being divisible by batch_size. That divisibility doesn't really matter; the model should just feed the remainder in as a batch smaller than the others.
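For illustration, the remainder-friendly batching could be as simple as the sketch below; the helper name is hypothetical and not necessarily the existing utils.make_batch:

def make_batches(fns, batch_size=32):
    """Split a list of utterance filenames into batches, keeping any remainder."""
    for start in range(0, len(fns), batch_size):
        yield fns[start:start + batch_size]

# e.g. 10 utterances with batch_size=4 -> batches of sizes 4, 4 and 2
assert [len(b) for b in make_batches(list(range(10)), batch_size=4)] == [4, 4, 2]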
