Discrepancy between training logs and manually evaluating metrics #679

Open
JulienVig opened this issue May 29, 2024 · 0 comments
Labels: bug (Something isn't working), discojs (Related to Disco.js)


JulienVig (Collaborator) commented May 29, 2024

While training a model in the webapp, the training accuracy reported by the TensorFlow.js `fit` method can differ widely from the accuracy obtained by evaluating the model manually, especially when the model contains batch norm layers (as MobileNet does).

For example, calling `model.evaluateDataset` on the training dataset after each epoch can show diverging trends: the tfjs accuracy logs rise to 1 while the manual evaluation stays constant around random chance, or even drops to 0.
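For reference, this is roughly how the two numbers can be compared; a minimal sketch, assuming a model already compiled with an accuracy metric and a pre-batched `{xs, ys}` dataset (the helper name `trainAndCompare` and the logging are mine, not Disco.js code):

```ts
import * as tf from '@tensorflow/tfjs';

// Sketch: log the accuracy aggregated by fit() next to a manual
// evaluation of the end-of-epoch model on the same training dataset.
async function trainAndCompare(
  model: tf.LayersModel,
  trainingSet: tf.data.Dataset<{ xs: tf.Tensor; ys: tf.Tensor }>,
  epochs: number
): Promise<void> {
  await model.fitDataset(trainingSet, {
    epochs,
    callbacks: {
      onEpochEnd: async (epoch, logs) => {
        // Accuracy reported by fit(): aggregated batch by batch
        console.log(`epoch ${epoch}: fit acc = ${logs?.acc}`);
        // Accuracy of the model as it stands at the end of the epoch
        const [, acc] = (await model.evaluateDataset(
          trainingSet
        )) as tf.Scalar[];
        console.log(`epoch ${epoch}: eval acc = ${(await acc.data())[0]}`);
      },
    },
  });
}
```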

Similarly, using the webapp to test a model we just trained, with the training set selected as the test set, yields different results from what is reported in the training board (which shows the tfjs training logs).

The difference stays small for small networks, but the accuracies completely diverge when doing transfer learning with a pre-trained model such as MobileNet.

A small difference is expected from how the tfjs `fit` method computes accuracy: since the model is updated after each batch, the reported accuracy is an aggregate over many model versions rather than a single model evaluated on the whole training set.
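As a toy illustration (the numbers are made up): if the per-batch accuracies rise within an epoch as the model improves, the value `fit` reports is their average, not the accuracy of the final weights:

```ts
// Hypothetical per-batch accuracies recorded by fit() within one epoch,
// rising as the weights improve batch after batch.
const perBatchAcc = [0.2, 0.5, 0.8, 1.0];
const fitReported = perBatchAcc.reduce((a, b) => a + b) / perBatchAcc.length;
console.log(fitReported); // 0.625 — an average over four model versions
// Evaluating the final model on the full training set could instead give
// ~1.0, so a modest gap between the two numbers is expected even without
// batch norm in the picture.
```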

This is related to keras-team/keras#6977, which seems to blame dropout and batch normalization layers. I did not manage to mitigate the issue by following the fixes mentioned there (sometimes because TensorFlow.js does not allow certain operations).

This Stack Overflow post reports a similar issue during transfer learning, solved by retraining all batch normalization layers so that their statistics fit the new dataset.
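If we wanted to try that workaround in tfjs, something along these lines should select the batch norm layers of an otherwise frozen base model; a sketch under the assumption that the tfjs `trainable` flag behaves like Keras's here (`unfreezeBatchNorm` is a hypothetical helper, not existing Disco.js code):

```ts
import * as tf from '@tensorflow/tfjs';

// Sketch: keep the pre-trained base frozen except its batch norm layers,
// so their statistics can adapt to the new dataset during fine-tuning.
function unfreezeBatchNorm(baseModel: tf.LayersModel): void {
  for (const layer of baseModel.layers) {
    // tfjs reports 'BatchNormalization' as the class name of BN layers
    layer.trainable = layer.getClassName() === 'BatchNormalization';
  }
  // Note: the model must be (re)compiled for the change to take effect.
}
```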

Main points:

  • If the webapp training board (i.e., the tfjs `fit` method training logs) shows a certain accuracy, it doesn't mean that evaluating the model on the same training set will yield the same accuracy.
  • Empirically, large discrepancies seem to occur only when models contain batch norm layers; models without batch norm show a small and expected difference. I didn't manage to mitigate the issue (mostly because the known fixes are in Python and tfjs limits our options), but it is worth investigating further.
JulienVig added the bug and discojs labels on May 29, 2024