not able to create negative dataset #86

Open
srewai opened this issue Jul 19, 2021 · 7 comments

Comments

srewai commented Jul 19, 2021

Hi,
Thanks for the great work. When I create the positive dataset for the keyword 'fire' by following the README, it works fine, but when I try to create the negative dataset it hangs forever. Any idea where the problem might be?

srewai commented Jul 20, 2021

Is there a specific version of the Common Voice dataset that should be downloaded? I am using version en_2181h_2020-12-11.

@ColonelThirtyTwo

Having the same issue - I left generate_dataset.sh running overnight and it didn't finish. Using cv-corpus-6.1-2020-12-11, commit f8f5ac1

@ColonelThirtyTwo

Looks like it tries to read 28303 clips for the negative dataset for my desired input of hey_computer. Combined with #56, it's probably what's taking forever.

@ColonelThirtyTwo

Workaround: Comment out the line print_stats('Dataset', ctx, train_ds, dev_ds, test_ds, compute_length=True) in create_raw_dataset.py

It loads all the files just to print some stats to you, during which there's no feedback. The file data is theoretically cached in an LRU cache, but it's unlikely much will be reused.

The writing files step is still very slow, but at least it will give you a progress bar instead of just sitting there.
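To illustrate the LRU cache point above, here is a minimal sketch, not the project's actual code. `load_clip` is a hypothetical stand-in for an expensive audio read, and the cache size of 1024 is an assumption; only the clip count of 28303 comes from this thread. Under those assumptions, a sequential pass over more distinct files than the cache can hold never hits the cache, and neither does a second pass in the same order.

```python
from functools import lru_cache

load_count = 0  # how many times a clip was actually "read from disk"


@lru_cache(maxsize=1024)  # assumed cache size, for illustration only
def load_clip(path: str) -> bytes:
    """Hypothetical stand-in for an expensive audio read."""
    global load_count
    load_count += 1
    return path.encode()  # placeholder for decoded audio data


# 28303 distinct clips, the count reported earlier in this thread.
paths = [f"clip_{i}.mp3" for i in range(28303)]

# First pass: e.g. computing stats over every clip.
for p in paths:
    load_clip(p)

# Second pass in the same order: e.g. writing the files out.
for p in paths:
    load_clip(p)

info = load_clip.cache_info()
print(f"hits={info.hits} misses={info.misses} actual loads={load_count}")
```

Because each pass scans far more files than fit in the cache, the entries kept from the first pass are evicted before the second pass reaches them, so every clip is loaded twice.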

srewai commented Jul 29, 2021

So does it work for you?

@ColonelThirtyTwo

@srewai I'm making several code changes to hopefully resolve some of the inefficiencies in the training dataset creator, but I'm running into other issues atm. I'll publish a fork when I am done.

daemon commented Aug 27, 2021

I'll take a look at this issue soon. IIRC the print_stats function loads a lot of the files if compute_length is True.
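As a rough sketch of that behaviour (this is not the real print_stats; `dataset_stats`, `duration_of`, and the progress interval are all hypothetical names and values chosen for illustration), a stats helper can count clips cheaply and only touch the files when compute_length is set, printing progress while it does so:

```python
from typing import Callable, Dict, Sequence


def dataset_stats(
    name: str,
    clips: Sequence[str],
    compute_length: bool = False,
    duration_of: Callable[[str], float] = lambda path: 0.0,
) -> Dict[str, object]:
    """Hypothetical stats helper; only reads clips when compute_length is True."""
    stats: Dict[str, object] = {"name": name, "num_clips": len(clips)}
    if compute_length:
        total = 0.0
        for i, path in enumerate(clips, start=1):
            total += duration_of(path)  # the expensive per-file step
            # Periodic feedback so a long run does not look like a hang.
            if i % 1000 == 0 or i == len(clips):
                print(f"{name}: measured {i}/{len(clips)} clips", flush=True)
        stats["total_seconds"] = total
    return stats


reads = []  # records which clips were actually opened


def fake_duration(path: str) -> float:
    """Stand-in for loading a clip and measuring its length."""
    reads.append(path)
    return 1.5


clips = [f"clip_{i}.mp3" for i in range(2500)]
cheap = dataset_stats("train", clips)  # no file access at all
full = dataset_stats("train", clips, compute_length=True, duration_of=fake_duration)
```

In this sketch the default call never opens a clip, while the compute_length=True call opens every one of them, which matches the slowdown described in this thread.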
