Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Prediction Output #131

Open
wants to merge 5 commits into
base: master
Choose a base branch
from
Open

Conversation

chadlagore
Copy link
Collaborator

@chadlagore chadlagore commented May 29, 2018

👷 Changes

  • Added prediction output capability.
  • Refactored classes a bit.

🔦 Testing Instructions

pytest -vvv

Follow README instructions.

minutes/audio.py Outdated
@@ -8,6 +8,9 @@


class Audio:
"""Internal audio maninpulation class. I reserve the right to change this
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: s/maninpulation/manipulation/g

minutes/base.py Outdated
return {
i: getattr(self, i) for i in self.intialization_params
if i in {
'ms_per_observation',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be good to declare this set as a constant somewhere and refer to it by name.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course!

@iKevinY
Copy link
Member

iKevinY commented Jun 11, 2018

Hmm, for some reason the PR build is passing but the actual commit build is failing? 🤔

@chadlagore
Copy link
Collaborator Author

Yeah. The failure means that the model build is not deterministic, which may not be a problem, but affects they way the tests may be written. I can run that test 100 times and view the output to be sure.

@chadlagore
Copy link
Collaborator Author

Posting a fix for this today!

@chadlagore
Copy link
Collaborator Author

chadlagore commented Jun 16, 2018

~/git/minutes chad/126-prediction-output*
minutes-c2ZPuskd ❯ pytest test/test_minutes.py
=============================================================================================== test session starts ===============================================================================================
platform darwin -- Python 3.6.4, pytest-3.5.1, py-1.5.3, pluggy-0.6.0
rootdir: /Users/chadlagore/git/minutes, inifile:
plugins: cov-2.5.1
collected 3 items

test/test_minutes.py ..F                                                                                                                                                                                    [100%]

==================================================================================================== FAILURES =====================================================================================================
__________________________________________________________________________________________________ test_phrases ___________________________________________________________________________________________________

    def test_phrases():
        for model_name in Minutes.parents:
            minutes = Minutes(parent=model_name)
            minutes.add_speaker(c.SPEAKER1)
            minutes.add_speaker(c.SPEAKER2)
            minutes.fit()

            # Predict new phrases (make sure we ony predict once per obs)
            conversation = Conversation(c.CONVERSATION_AUDIO, minutes)
            raw, _ = conversation.get_observations(**minutes.preprocessing_params)
            assert len(conversation.phrases) == len(raw)
            print(conversation.phrases)

            # Make sure we ony predicted on speaker 1 and 2.
            names = [p.speaker.name for p in conversation.phrases]
>           assert sorted(list(np.unique(names))) == ['speaker1', 'speaker2']
E           AssertionError: assert ['speaker2'] == ['speaker1', 'speaker2']
E             At index 0 diff: 'speaker2' != 'speaker1'
E             Right contains more items, first extra item: 'speaker2'
E             Use -v to get the full diff

test/test_minutes.py:37: AssertionError
---------------------------------------------------------------------------------------------- Captured stdout call -----------------------------------------------------------------------------------------------
[<minutes.conversation.Phrase object at 0x1201b7278>, <minutes.conversation.Phrase object at 0x1201b71d0>, <minutes.conversation.Phrase object at 0x1201b7978>, <minutes.conversation.Phrase object at 0x1201b7a58>, <minutes.conversation.Phrase object at 0x1201b7240>, <minutes.conversation.Phrase object at 0x1201b7940>, <minutes.conversation.Phrase object at 0x1201b7780>, <minutes.conversation.Phrase object at 0x1201b79e8>, <minutes.conversation.Phrase object at 0x1201b7a20>, <minutes.conversation.Phrase object at 0x1201b7898>]
======================================================================================= 1 failed, 2 passed in 7.35 seconds ========================================================================================

~/git/minutes chad/126-prediction-output* 10s
minutes-c2ZPuskd ❯ pytest test/test_minutes.py
=============================================================================================== test session starts ===============================================================================================
platform darwin -- Python 3.6.4, pytest-3.5.1, py-1.5.3, pluggy-0.6.0
rootdir: /Users/chadlagore/git/minutes, inifile:
plugins: cov-2.5.1
collected 3 items

test/test_minutes.py ...                                                                                                                                                                                    [100%]

============================================================================================ 3 passed in 5.44 seconds =============================================================================================

I've replicated this locally. It is as we expected, the test is non-deterministic because we train the model each time!

Problem

We provide our model with a random_state parameter—this gets used by _generate_training_data to do a test-train split. Ideally, it would lock the Keras model as well, but its not that simple. Keras uses the Numpy seed and the Tensorflow seed, which is set globally.

from numpy.random import seed
from tensorflow import set_random_seed
seed(random_state)
set_random_seed(random_state)

Options

  1. Admit that minutes cannot guarantee reproducibility, use the random_state to generate training data, then let the user set the global Numpy state if they want. This will cause confusion if someone sets the random state of the Minutes model, and expects some sort of stability in the answer.
  2. Save the Numpy state, call the Keras functions, restore the Numpy state with a decorator. This makes fitting Minutes models non-thread safe 🤔.

@chadlagore
Copy link
Collaborator Author

chadlagore commented Jun 16, 2018

  1. Inform the user that they should set the np.random.seed and the tf.set_random_seed if they want reproducibility.

This seems like the simplest solution for now!

Copy link
Member

@iKevinY iKevinY left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me! Agreed that seeding the state is probably the easiest way to achieve reproducibility (and we mostly just care about it for consistent CI testing anyways). :shipit:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Ready For Review 👋 PR is looking for a +1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants