
Where is the Visual Wake Word test set? #135

Open
LucaUrbinati44 opened this issue Jan 12, 2023 · 5 comments

LucaUrbinati44 commented Jan 12, 2023

I would like to evaluate the pretrained MobileNet model on the preprocessed COCO2014 test set, but I cannot find this preprocessed test set anywhere in the repo. Where can I find it? For the other three datasets (AD, IC, KS) the test sets are already provided in the repo.

I suspect I have to generate it myself using this script with dataType='test2014', because this should be the same script that was used to create the training+validation dataset used for training, which can be downloaded here.
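For reference, my understanding of what that script does is roughly the sketch below (a minimal sketch assuming a pycocotools-style workflow and the 0.5% person-area threshold from the Visual Wake Words paper; dataType and the annotation path are just placeholders):

```python
from pycocotools.coco import COCO

# Sketch: derive person / non-person labels from COCO annotations.
# dataType and the annotation path are placeholders.
dataType = 'val2014'
coco = COCO(f'annotations/instances_{dataType}.json')
person_id = coco.getCatIds(catNms=['person'])[0]

labels = {}
for img_id in coco.getImgIds():
    info = coco.loadImgs(img_id)[0]
    img_area = info['height'] * info['width']
    anns = coco.loadAnns(coco.getAnnIds(imgIds=img_id,
                                        catIds=[person_id], iscrowd=None))
    # "person" if any person box covers more than 0.5% of the image area
    labels[info['file_name']] = int(any(a['area'] / img_area > 0.005
                                        for a in anns))
```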

Moreover, the paper entitled "MLPerf Tiny Benchmark" mentions this test set for the VWW problem in Section 4.1.

Finally, why is there no test.py (or evaluated.py) script to run the model on the test set, while such scripts exist for the other three datasets (AD, IC, KS)?

Thank you,
Regards,
Luca Urbinati

@colbybanbury
Contributor

Good question!

MS-COCO does not publish the labels (aka annotations) for the test set and holds competitions oriented around the test set. This means that Visual Wake Words does not contain an explicit test set.

It's traditionally best practice to use the val set as the test set and, if needed, to hold out a small percentage of the training set for validation. MLPerf Tiny should potentially move to adopt this practice, including an update to the paper.
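As a rough sketch of what that split could look like with a Keras ImageDataGenerator pipeline (directory names, image size, and the 10% fraction here are illustrative assumptions, not a final recipe):

```python
import tensorflow as tf

# Hold out 10% of the training images for validation and keep the
# COCO-val-derived images as the test set (directory names are assumptions).
datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1.0 / 255, validation_split=0.1)

train_gen = datagen.flow_from_directory(
    'vww_train', target_size=(96, 96), batch_size=32,
    class_mode='categorical', subset='training')
val_gen = datagen.flow_from_directory(
    'vww_train', target_size=(96, 96), batch_size=32,
    class_mode='categorical', subset='validation')

test_gen = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1.0 / 255).flow_from_directory(
        'vww_test_from_coco_val', target_size=(96, 96), batch_size=32,
        class_mode='categorical', shuffle=False)
```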

@cskiraly and @jeremy-syn, who currently owns the VWW benchmark? I'm happy to help make the change if needed.

@LucasFischer123

Hi @colbybanbury @LucaUrbinati44

Any news on this issue?

Thanks

Lucas

@LucaUrbinati44
Author

LucaUrbinati44 commented Mar 20, 2023

Hi @LucasFischer123,

Short answer
We "solved" it by using 10% of the whole dataset as "validation set" during training (according to the train_vww.py script) and then using these 1000 images for testing.

Long answer
We discovered that these 1000 images are also contained in the provided dataset.
So, as a first experiment, we removed those 1000 images from the dataset, used the remaining images to train a floating-point model from scratch with train_vww.py (without changing anything in the training script), and then ran inference on the 1000 held-out images. The resulting accuracy was around 83%, lower than the 86% reported in the paper.

Then, as a second experiment, we trained the model from scratch again, this time on the whole dataset, i.e. without removing the 1000 images. This time the test accuracy on the 1000 images was 86%, matching the paper.

Since the second experiment reproduced the result of the paper, we decided to go with this second “solution” (see “Short answer”).

However, we know that this procedure is not 100% correct, since the model sees the 1000 images both during training and during testing.
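For anyone who wants to reproduce this, the evaluation boils down to something like the following sketch (assuming the Keras workflow of train_vww.py; the directory and model file names are placeholders, not our exact setup):

```python
import tensorflow as tf

# Evaluate a trained model on a directory of held-out person / non-person
# images (directory and model file names below are placeholders).
test_gen = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1.0 / 255).flow_from_directory(
        'vww_test_1000', target_size=(96, 96), batch_size=50,
        class_mode='categorical', shuffle=False)

model = tf.keras.models.load_model('trained_models/vww_96.h5')
loss, acc = model.evaluate(test_gen)
print(f'accuracy on the held-out images: {acc:.4f}')
```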

Thus, we hope the organizers can fix this issue soon, both in the repo instructions and in the paper.

Thank you all,
Luca Urbinati and Marco Terlizzi

@NilsGraf

NilsGraf commented Aug 23, 2023

Hi @LucaUrbinati44 @colbybanbury @LucasFischer123 @cskiraly and @jeremy-syn

I had a similar question on how to evaluate accuracy. I created this Jupyter notebook, which you can run in your browser (or use this script if you prefer running locally).

This script downloads the dataset from Silabs and runs both TFLite reference models (int8 model and float model) on the 1000 images listed in y_labels.csv to measure their accuracy. I get the results below:

float accuracy: 85.2   
int8 accuracy : 85.9  
image count   : 1000  

Does this look correct?

BTW, I get 86.0% for the int8 accuracy (instead of 85.9%) when I run on an M1 MacBook instead of Colab.
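The core of the accuracy loop is something like the sketch below, not a copy of the notebook (the model filename and the filename,label layout of y_labels.csv are assumptions here, and the resize-to-96x96, scale-to-[0,1] preprocessing should be matched to whatever the notebook actually does):

```python
import csv
import numpy as np
import tensorflow as tf
from PIL import Image

# Minimal accuracy loop for the int8 reference model. Assumptions: the model
# filename, a y_labels.csv laid out as "filename,label", and 96x96 RGB inputs
# rescaled to [0, 1] before quantization.
interpreter = tf.lite.Interpreter(model_path='vww_96_int8.tflite')
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]
scale, zero_point = inp['quantization']

correct = total = 0
with open('y_labels.csv') as f:
    for fname, label in csv.reader(f):
        img = np.asarray(Image.open(fname).convert('RGB').resize((96, 96)),
                         dtype=np.float32) / 255.0
        q = np.round(img / scale + zero_point).astype(inp['dtype'])
        interpreter.set_tensor(inp['index'], q[np.newaxis, ...])
        interpreter.invoke()
        pred = int(np.argmax(interpreter.get_tensor(out['index'])[0]))
        correct += int(pred == int(label))
        total += 1

print(f'int8 accuracy: {100.0 * correct / total:.1f}%  ({total} images)')
```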

@NilsGraf

One more note: for the int8 accuracy, a few of the test cases in y_labels.csv produce a probability of exactly 0.5 (i.e. a signed int8 value of 0, or an unsigned int8 value of 128). In my script I assume that a person probability of exactly 0.5 indicates a person. Changing this to non-person reduces the int8 accuracy by 0.3%.
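To make the tie-break concrete, a tiny sketch (assuming the usual int8 softmax output quantization of scale = 1/256 and zero_point = -128, which is an assumption about these particular models):

```python
# A signed int8 output of 0 (128 as unsigned) dequantizes to exactly 0.5,
# assuming output quantization scale = 1/256 and zero_point = -128.
scale, zero_point = 1.0 / 256, -128
q_person = 0
prob_person = scale * (q_person - zero_point)  # = 0.5

# The tie-break for these borderline images:
person_if_inclusive = prob_person >= 0.5   # 0.5 counted as "person"
person_if_exclusive = prob_person > 0.5    # 0.5 counted as "non-person"
```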
