
New task for adding scalar values (0 or 1) #4

Open
Zeta36 wants to merge 22 commits into master

Conversation

@Zeta36 commented Jan 7, 2017

Common Settings

The model is trained with a 2-layer feedforward controller (with hidden sizes of 128 and 256, respectively) and the following set of hyperparameters:

  • RMSProp Optimizer with learning rate of 10⁻⁴, momentum of 0.9.
  • Memory word size of 10, with a single read head.
  • A batch size of 1.
  • input_size = 3.
  • output_size = 1.
  • sequence_max_length = 100.
  • words_count = 15.
  • word_size = 10.
  • read_heads = 1.

A squared loss function of the form (y - y_)**2 is used, where both 'y' and 'y_' are scalars.

The input is a (1, random_length, 3) tensor, where the last dimension is a one-hot encoding vector of size 3:

010 is a '0'
100 is a '1'
001 is the end mark

So, an example of an input of length 10 would be the following 3D tensor:

[[[ 0. 1. 0.]
[ 0. 1. 0.]
[ 0. 1. 0.]
[ 1. 0. 0.]
[ 0. 1. 0.]
[ 0. 1. 0.]
[ 1. 0. 0.]
[ 0. 1. 0.]
[ 0. 1. 0.]
[ 0. 0. 1.]]]

This input is a representation of a sequence of 0 or 1 values to be added, in the form:

0 + 0 + 1 + 0 + 0 + 1 + 0 + 0 + (end_mark)

The target output is a 3D tensor with the result of this adding task. In the example above:

[[[2.0]]]
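(As a rough sketch of how such an input/target pair could be generated, assuming NumPy and an illustrative helper name; this is not the code of this PR:)

import numpy as np

def generate_example(length):
    # Random sequence of 0s and 1s to be added
    bits = np.random.randint(0, 2, size=length)

    # One-hot encoding: '1' -> [1, 0, 0], '0' -> [0, 1, 0], end mark -> [0, 0, 1]
    seq = np.zeros((1, length + 1, 3), dtype=np.float32)
    seq[0, np.arange(length), 1 - bits] = 1.0   # column 0 for a '1', column 1 for a '0'
    seq[0, length, 2] = 1.0                     # end mark at the last step

    # Target: the sum of the bits, as a (1, 1, 1) tensor like [[[2.0]]]
    target = np.array(bits.sum(), dtype=np.float32).reshape(1, 1, 1)
    return seq, target

x, y = generate_example(9)   # 9 values plus the end mark, as in the example above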

The DNC output is a 3D-tensor of shape (1, random_length, 1). For example:

[[[ 0.45]
[ -0.11]
[ 1.3]
[ 5.0]
[ 0.5]
[ 0.1]
[ 1.0]
[ -0.5]
[ 0.33]
[ 0.12]]]

The target output and the DNC output are both then reduced with tf.reduce_sum() so we end up with two scalar values. For example:

Target_output: 2.0
DNC_output: 5.89

We then apply the squared loss function:

loss = (Target_o - DNC_o)**2

and finally the gradient update.
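(As an illustrative TensorFlow 1.x sketch of this loss pipeline, with placeholders standing in for the DNC output node and the target tensor; this is not the actual training script:)

import tensorflow as tf

output = tf.placeholder(tf.float32, shape=(1, None, 1))       # DNC output, (1, random_length, 1)
target_output = tf.placeholder(tf.float32, shape=(1, 1, 1))   # target, e.g. [[[2.0]]]

# Reduce both to scalars and apply the squared loss described above
loss = tf.square(tf.reduce_sum(target_output) - tf.reduce_sum(output))

# In the real graph, the gradient update follows, e.g. with the RMSProp
# settings listed in the Common Settings section:
# train_op = tf.train.RMSPropOptimizer(1e-4, momentum=0.9).minimize(loss)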

Results

The model receives as input a random-length sequence of 0 or 1 values like:

Input: 1 + 0 + 1 + 1 + 0 + 0 + 0 + 0 + 1

Then it returns a scalar value for this adding task. For example, the DNC will output something like 3.98824.
This value is the predicted result for the input adding sequence (we truncate it to its integer part):

DNC prediction: 1 + 0 + 1 + 1 + 0 + 0 + 0 + 0 + 1 = 3 [3.98824]

Once we train the model with:

$python tasks/copy/train.py --iterations=50000

we can see that the model learns to compute this adding function in less than 1000 iterations, and the loss drops from:

Iteration 0/1000
Avg. Logistic Loss: 24.9968

to:

Iteration 1000/1000
Avg. Logistic Loss: 0.0076

It seems like the DNC model is able to learn this pseudo-code:

function(x):
    if (x == [ 1. 0. 0.])
        return (near) 1.0 (float values)
    else
        return (near) 0.0 (float values)

Generalization test

We use sequence_max_length = 100 for the model, but during training we use only random-length sequences of up to 10 (sequence_max_length/10). Once training is finished, we let the trained model generalize to random-length sequences of up to 100 (sequence_max_length).

Results show that the model successfully generalizes the adding task even with sequences 10 times larger than the training ones.

These are real data outputs:

Building Computational Graph ... Done!
Initializing Variables ... Done!

Iteration 0/1000
Avg. Logistic Loss: 24.9968
Real value: 0 + 1 + 0 + 0 + 0 + 1 + 0 + 1 + 1 + 1 = 5
Predicted: 0 + 1 + 0 + 0 + 0 + 1 + 0 + 1 + 1 + 1 = 0 [0.000319847]

Iteration 100/1000
Avg. Logistic Loss: 5.8042
Real value: 0 + 1 + 0 + 0 + 1 + 0 + 1 + 0 + 1 + 1 = 5
Predicted: 0 + 1 + 0 + 0 + 1 + 0 + 1 + 0 + 1 + 1 = 6 [6.1732]

Iteration 200/1000
Avg. Logistic Loss: 0.7492
Real value: 1 + 1 + 1 + 1 + 1 + 1 + 1 + 0 + 1 + 1 = 9
Predicted: 1 + 1 + 1 + 1 + 1 + 1 + 1 + 0 + 1 + 1 = 8 [8.91952]

Iteration 300/1000
Avg. Logistic Loss: 0.0253
Real value: 0 + 1 + 1 = 2
Predicted: 0 + 1 + 1 = 2 [2.0231]

Iteration 400/1000
Avg. Logistic Loss: 0.0089
Real value: 0 + 1 + 0 + 0 + 0 + 1 + 1 = 3
Predicted: 0 + 1 + 0 + 0 + 0 + 1 + 1 = 2 [2.83419]

Iteration 500/1000
Avg. Logistic Loss: 0.0444
Real value: 1 + 0 + 1 + 1 = 3
Predicted: 1 + 0 + 1 + 1 = 2 [2.95937]

Iteration 600/1000
Avg. Logistic Loss: 0.0093
Real value: 1 + 0 + 1 + 1 + 0 + 0 + 0 + 0 + 1 = 4
Predicted: 1 + 0 + 1 + 1 + 0 + 0 + 0 + 0 + 1 = 3 [3.98824]

Iteration 700/1000
Avg. Logistic Loss: 0.0224
Real value: 0 + 1 + 1 + 0 + 1 + 1 + 1 + 1 + 0 + 0 = 6
Predicted: 0 + 1 + 1 + 0 + 1 + 1 + 1 + 1 + 0 + 0 = 5 [5.93554]

Iteration 800/1000
Avg. Logistic Loss: 0.0115
Real value: 0 + 0 = 0
Predicted: 0 + 0 = -1 [-0.0118587]

Iteration 900/1000
Avg. Logistic Loss: 0.0023
Real value: 1 + 1 + 0 + 0 + 1 + 1 + 1 + 0 + 0 = 5
Predicted: 1 + 1 + 0 + 0 + 1 + 1 + 1 + 0 + 0 = 4 [4.97147]

Iteration 1000/1000
Avg. Logistic Loss: 0.0076
Real value: 1 + 0 + 0 + 1 + 1 + 0 + 0 + 1 = 4
Predicted: 1 + 0 + 0 + 1 + 1 + 0 + 0 + 1 = 4 [4.123]

Saving Checkpoint ... Done!

Testing generalization...

Iteration 0/1000
Real value: 1 + 1 + 0 + 0 + 0 + 1 + 0 + 0 + 1 + 1 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 1 + 0 + 0 = 6
Predicted: 1 + 1 + 0 + 0 + 0 + 1 + 0 + 0 + 1 + 1 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 1 + 0 + 0 = 6 [6.24339]

Iteration 1/1000
Real value: 1 + 0 + 1 + 0 + 0 + 1 + 1 + 0 + 1 + 1 + 1 + 0 + 1 + 0 + 0 + 0 + 0 + 1 + 0 + 0 + 1 + 1 = 11
Predicted: 1 + 0 + 1 + 0 + 0 + 1 + 1 + 0 + 1 + 1 + 1 + 0 + 1 + 0 + 0 + 0 + 0 + 1 + 0 + 0 + 1 + 1 = 11 [11.1931]

Iteration 2/1000
Real value: 0 + 0 + 0 + 1 + 0 + 0 + 0 + 1 + 1 + 1 + 0 + 1 + 1 + 0 + 1 + 0 + 1 + 0 + 0 + 1 + 0 + 1 + 1 + 0 + 0 + 0 + 1 + 0 + 1 + 0 + 0 + 1 + 0 + 1 + 1 + 0 + 1 + 0 + 1 + 1 + 0 + 1 + 1 + 1 + 0 + 1 + 0 + 0 + 1 + 1 + 0 + 0 + 1 + 0 + 0 + 0 + 1 + 0 + 1 + 0 + 0 + 0 + 1 + 0 + 0 + 0 + 1 + 1 + 1 + 1 = 33
Predicted: 0 + 0 + 0 + 1 + 0 + 0 + 0 + 1 + 1 + 1 + 0 + 1 + 1 + 0 + 1 + 0 + 1 + 0 + 0 + 1 + 0 + 1 + 1 + 0 + 0 + 0 + 1 + 0 + 1 + 0 + 0 + 1 + 0 + 1 + 1 + 0 + 1 + 0 + 1 + 1 + 0 + 1 + 1 + 1 + 0 + 1 + 0 + 0 + 1 + 1 + 0 + 0 + 1 + 0 + 0 + 0 + 1 + 0 + 1 + 0 + 0 + 0 + 1 + 0 + 0 + 0 + 1 + 1 + 1 + 1 = 32 [32.9866]

Iteration 3/1000
Real value: 1 + 0 + 1 + 1 + 0 + 1 + 0 + 0 + 1 + 1 + 0 + 0 + 0 + 1 + 0 + 0 + 1 + 1 + 0 + 1 + 1 + 1 + 0 + 1 + 0 + 0 + 1 + 0 + 0 + 0 + 1 + 1 = 16
Predicted: 1 + 0 + 1 + 1 + 0 + 1 + 0 + 0 + 1 + 1 + 0 + 0 + 0 + 1 + 0 + 0 + 1 + 1 + 0 + 1 + 1 + 1 + 0 + 1 + 0 + 0 + 1 + 0 + 0 + 0 + 1 + 1 = 16 [16.1541]

Iteration 4/1000
Real value: 1 + 0 + 0 + 1 + 1 + 1 + 0 + 0 + 1 + 0 + 0 + 0 + 0 + 1 + 1 + 1 + 0 + 1 + 1 + 1 + 1 + 0 + 0 + 0 + 0 + 0 + 1 + 0 + 1 + 0 + 1 + 1 + 0 + 0 + 0 + 1 + 0 + 1 + 1 + 1 + 1 + 0 + 0 + 1 + 0 + 1 + 1 + 1 + 1 + 0 + 1 + 0 + 0 + 1 + 0 + 1 + 1 + 0 + 1 + 0 + 0 + 0 + 0 + 1 + 1 + 1 + 1 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 1 + 1 + 0 + 1 + 0 + 1 + 0 + 1 + 1 + 1 + 1 + 0 + 1 + 0 + 0 = 44
Predicted: 1 + 0 + 0 + 1 + 1 + 1 + 0 + 0 + 1 + 0 + 0 + 0 + 0 + 1 + 1 + 1 + 0 + 1 + 1 + 1 + 1 + 0 + 0 + 0 + 0 + 0 + 1 + 0 + 1 + 0 + 1 + 1 + 0 + 0 + 0 + 1 + 0 + 1 + 1 + 1 + 1 + 0 + 0 + 1 + 0 + 1 + 1 + 1 + 1 + 0 + 1 + 0 + 0 + 1 + 0 + 1 + 1 + 0 + 1 + 0 + 0 + 0 + 0 + 1 + 1 + 1 + 1 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 1 + 1 + 0 + 1 + 0 + 1 + 0 + 1 + 1 + 1 + 1 + 0 + 1 + 0 + 0 = 43 [43.5211]

@Mostafa-Samir (Owner) commented Jan 14, 2017

Impressive work!
I'm certainly curious about how it was able to generalize with the same number of memory locations!

What do you think about taking it up a notch?
Let's remove that reduce_sum and see if it can learn to add on its own. Here's how I think it could go: your input sequence would go something like this:
1 + 0 + 0 + 1 + 1 + 1 + 0 + 1 + 0 = -, and your target output would be the scalar 5. Instead of attempting to copy the sequence via adding, we make the task such that at the step containing '-' the model should output the value of the summation! Your loss would be the squared difference between the output at that step and your target output; the loss at all previous steps is omitted (you can find the technique of omitting the loss on specific steps in the recently pushed bAbI task).
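Roughly, the masked loss could look something like this (just a sketch with illustrative placeholder nodes, not the bAbI code itself):

import tensorflow as tf

output = tf.placeholder(tf.float32, shape=(1, None, 1))         # per-step DNC output
target_output = tf.placeholder(tf.float32, shape=(1, None, 1))  # zeros except at the '-' step
loss_weights = tf.placeholder(tf.float32, shape=(1, None, 1))   # 0 everywhere, 1 at the '-' step

# Squared error at every step, with the loss at the steps before '-'
# zeroed out by the weights
loss = tf.reduce_sum(loss_weights * tf.square(output - target_output))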

I've just pushed new updates to the code that include optimizations in both memory usage and execution time, so you'll be able to leave it training for more iterations while it runs more quickly!

I'm looking forward to seeing your results with this!

@Zeta36 (Author) commented Jan 15, 2017

Hello, @Mostafa-Samir.

You can get the code for the adding task without the tf.reduce_sum() here: https://github.com/Zeta36/DNC-tensorflow/blob/master/tasks/adding/train_v2.py.

But I'm afraid that removing the tf.reduce_sum() makes the model unable to generalize successfully with a fixed memory size as before. In this new version of the code, the model is still able to learn to solve any sequence of 0 and 1 sums, but it fails when we apply the learned model to sequences larger than those used in the training process.

I think that's because the original version I pulled here makes use of the tf.reduce_sum() as a kind of accumulator. I think the model learns an algorithm like this:

function(X):
    for each x in X:
        if (x == [ 1. 0. 0.])
            output (near) 1.0 (float value)
        else
            output (near) 0.0 (float value)

Later, the tf.reduce_sum() computes the correct sum over the whole sequence output. The output will be nearly 1 for each [ 1. 0. 0.] input vector and nearly 0 otherwise, so the tf.reduce_sum() will give the correct answer no matter how long the input is. And I think it is because this little "if else" f(x) algorithm is so easy to learn that the model is able to generalize to arbitrarily long input sequences X with a fixed memory size.

As soon as we remove the tf.reduce_sum(), as in the version I made following your instructions, this trick doesn't work and the model has to learn a more complex and less generalizable algorithm than the f(x) I described above.

What do you think, @Mostafa-Samir?

Regards,
Samu.

@Zeta36 (Author) commented Jan 15, 2017

Here is a little excerpt of a real training result from the new version (https://github.com/Zeta36/DNC-tensorflow/blob/master/tasks/adding/train_v2.py):

Iteration 800/1001
Avg. Cross-Entropy: 0.0231753
Avg. 100 iterations time: 0.03 minutes
Approx. time to completion: 0.00 hours
DNC input
[[[ 0. 1. 0.]
[ 1. 0. 0.]
[ 0. 1. 0.]
[ 1. 0. 0.]
[ 0. 1. 0.]
[ 0. 1. 0.]
[ 0. 0. 1.]]]
Text input: 1 + 0 + 1 + 0 + 1 + 1 = -
Target_output
[[[ 0.]
[ 0.]
[ 0.]
[ 0.]
[ 0.]
[ 0.]
[ 4.]]]
DNC output
[[[ 0. ]
[ 0. ]
[ 0. ]
[ 0. ]
[ 0. ]
[ 0. ]
[ 3.52538943]]]
Real operation: 1 + 0 + 1 + 0 + 1 + 1 = 4
Predicted result: 1 + 0 + 1 + 0 + 1 + 1 = 4 [3.52539]
...
...
Iteration 1000/1001
Avg. Cross-Entropy: 0.0046492
Avg. 100 iterations time: 0.03 minutes
Approx. time to completion: 0.00 hours
DNC input
[[[ 0. 1. 0.]
[ 1. 0. 0.]
[ 1. 0. 0.]
[ 1. 0. 0.]
[ 1. 0. 0.]
[ 1. 0. 0.]
[ 1. 0. 0.]
[ 0. 0. 1.]]]
Text input: 1 + 0 + 0 + 0 + 0 + 0 + 0 = -
Target_output
[[[ 0.]
[ 0.]
[ 0.]
[ 0.]
[ 0.]
[ 0.]
[ 0.]
[ 1.]]]
DNC output
[[[ 0. ]
[ 0. ]
[ 0. ]
[ 0. ]
[ 0. ]
[ 0. ]
[ 0. ]
[ 0.86268544]]]
Real operation: 1 + 0 + 0 + 0 + 0 + 0 + 0 = 1
Predicted result: 1 + 0 + 0 + 0 + 0 + 0 + 0 = 1 [0.862685]

Iteration 1001/1001
Saving Checkpoint ... Done!

Testing generalization...

Iteration 0/1000
Real operation: 1 + 0 + 1 + 0 + 0 + 1 + 0 + 0 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 0 + 1 + 0 + 0 + 0 + 0 + 1 + 1 + 0 + 0 + 1 + 1 + 0 + 0 + 1 + 0 + 1 + 1 + 1 + 1 + 0 + 0 + 1 + 1 + 0 + 1 + 0 + 1 + 1 + 0 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 0 + 0 + 1 + 0 + 1 + 1 + 1 + 1 + 1 + 0 + 1 + 1 + 0 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 0 + 1 + 0 + 1 + 1 + 1 + 0 + 0 + 0 = 56
Predicted result: 1 + 0 + 1 + 0 + 0 + 1 + 0 + 0 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 0 + 1 + 0 + 0 + 0 + 0 + 1 + 1 + 0 + 0 + 1 + 1 + 0 + 0 + 1 + 0 + 1 + 1 + 1 + 1 + 0 + 0 + 1 + 1 + 0 + 1 + 0 + 1 + 1 + 0 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 0 + 0 + 1 + 0 + 1 + 1 + 1 + 1 + 1 + 0 + 1 + 1 + 0 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 0 + 1 + 0 + 1 + 1 + 1 + 0 + 0 + 0 = 9316 [[ 9316.20117188]]

Iteration 1/1000
Real operation: 1 + 0 = 1
Predicted result: 1 + 0 = 1 [[ 0.853342]]

Iteration 2/1000
Real operation: 1 + 0 + 1 + 1 + 1 + 1 + 0 + 0 + 1 + 0 + 1 + 0 + 0 + 0 + 0 + 0 + 0 + 1 + 1 + 0 + 0 + 0 + 0 + 1 + 1 + 1 + 1 + 0 + 1 + 1 + 0 + 0 + 0 + 0 + 1 + 0 + 1 = 17
Predicted result: 1 + 0 + 1 + 1 + 1 + 1 + 0 + 0 + 1 + 0 + 1 + 0 + 0 + 0 + 0 + 0 + 0 + 1 + 1 + 0 + 0 + 0 + 0 + 1 + 1 + 1 + 1 + 0 + 1 + 1 + 0 + 0 + 0 + 0 + 1 + 0 + 1 = 74 [[ 73.88546753]]

@Zeta36 (Author) commented Jan 15, 2017

@Mostafa-Samir, thanks to the great improvements in the core of your DNC implementation, I've developed another task for testing the project. I've made a model that is able to successfully learn an argmax function over an input.

The model is fed a vector of one-hot encoded integer values, and the target output is the index within the vector holding the maximum value. I'm glad to tell you that your DNC is able to learn this function using just a feedforward controller and, even better, is able to generalize to larger vectors than those used in the training process!

You can see my code here: https://github.com/Zeta36/DNC-tensorflow/blob/master/tasks/argmax/train_v2.py.
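(Judging from the printed examples below, and only as an inference about the data layout rather than the exact generation code in train_v2.py: each step carries a one-hot encoded integer, the last step is an end mark in the final column, and the target places the position of the maximum value at that last step. A sketch:)

import numpy as np

def generate_argmax_example(length, num_values=9, input_size=10):
    # Random integers, one-hot encoded over the first `num_values` columns
    values = np.random.randint(0, num_values, size=length)
    seq = np.zeros((1, length + 1, input_size), dtype=np.float32)
    seq[0, np.arange(length), values] = 1.0
    seq[0, length, input_size - 1] = 1.0        # end mark in the last column

    # Target: zeros everywhere, except the index of the maximum value,
    # placed at the end-mark step
    target = np.zeros((1, length + 1, 1), dtype=np.float32)
    target[0, length, 0] = float(np.argmax(values))
    return seq, target

x, y = generate_argmax_example(12)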

And here you can see some results:
...
...
Iteration 9900/10001
Avg. Cross-Entropy: 0.1064857
Avg. 100 iterations time: 0.16 minutes
Approx. time to completion: 0.00 hours
DNC input [[[ 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
[ 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
[ 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
[ 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
[ 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]]]
Target_output [[[ 0.]
[ 0.]
[ 0.]
[ 0.]
[ 0.]
[ 0.]
[ 0.]
[ 0.]
[ 0.]
[ 0.]
[ 0.]
[ 0.]
[ 1.]]]
DNC output [[[ 0. ]
[-0. ]
[ 0. ]
[ 0. ]
[ 0. ]
[ 0. ]
[ 0. ]
[ 0. ]
[ 0. ]
[ 0. ]
[ 0. ]
[ 0. ]
[ 1.44688594]]]
Real argmax(X): 1
Predicted f(X): 1

Iteration 10000/10001
Avg. Cross-Entropy: 0.0603415
Avg. 100 iterations time: 0.16 minutes
Approx. time to completion: 0.00 hours
DNC input [[[ 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
[ 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
[ 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
[ 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]]]
Target_output [[[ 0.]
[ 0.]
[ 0.]
[ 0.]
[ 0.]
[ 0.]
[ 0.]
[ 0.]
[ 0.]
[ 0.]
[ 5.]]]
DNC output [[[ 0. ]
[ 0. ]
[ 0. ]
[ 0. ]
[ 0. ]
[ 0. ]
[ 0. ]
[ 0. ]
[ 0. ]
[ 0. ]
[ 4.93556786]]]
Real argmax(X): 5
Predicted f(X): 5

Saving Checkpoint ... Done!

Testing generalization...

Iteration 0/10000
Real argmax(X): 3
Predicted f(X): 3

Iteration 1/10000
Real argmax(X): 2
Predicted f(X): 2

Iteration 2/10000
Real argmax(X): 4
Predicted f(X): 3

Iteration 3/10000
Real argmax(X): 0
Predicted f(X): 0

Iteration 4/10000
Real argmax(X): 1
Predicted f(X): 1

Iteration 5/10000
Real argmax(X): 3
Predicted f(X): 3

Iteration 6/10000
Real argmax(X): 1
Predicted f(X): 2

Iteration 7/10000
Real argmax(X): 3
Predicted f(X): 2

Iteration 8/10000
Real argmax(X): 6
Predicted f(X): 6

Iteration 9/10000
Real argmax(X): 5
Predicted f(X): 4

Iteration 10/10000
Real argmax(X): 2
Predicted f(X): 2

Iteration 11/10000
Real argmax(X): 5
Predicted f(X): 4

Iteration 12/10000
Real argmax(X): 2
Predicted f(X): 2

Iteration 13/10000
Real argmax(X): 0
Predicted f(X): 2

Iteration 14/10000
Real argmax(X): 2
Predicted f(X): 5

Iteration 15/10000
Real argmax(X): 0
Predicted f(X): 0

Iteration 16/10000
Real argmax(X): 1
Predicted f(X): 1

Iteration 17/10000
Real argmax(X): 2
Predicted f(X): 2

Iteration 18/10000
Real argmax(X): 2
Predicted f(X): 2

Iteration 19/10000
Real argmax(X): 2
Predicted f(X): 2

Iteration 20/10000
Real argmax(X): 1
Predicted f(X): 1

Iteration 21/10000
Real argmax(X): 4
Predicted f(X): 4

Iteration 22/10000
Real argmax(X): 10
Predicted f(X): 10

Iteration 23/10000
Real argmax(X): 6
Predicted f(X): 5

Iteration 24/10000
Real argmax(X): 1
Predicted f(X): 2

Iteration 25/10000
Real argmax(X): 4
Predicted f(X): 3

Iteration 26/10000
Real argmax(X): 1
Predicted f(X): 3

Iteration 27/10000
Real argmax(X): 2
Predicted f(X): 2

Iteration 28/10000
Real argmax(X): 0
Predicted f(X): 0

Iteration 29/10000
Real argmax(X): 3
Predicted f(X): 3

Iteration 30/10000
Real argmax(X): 0
Predicted f(X): 0

Iteration 31/10000
Real argmax(X): 0
Predicted f(X): 0

Iteration 32/10000
Real argmax(X): 0
Predicted f(X): 0

Iteration 33/10000
Real argmax(X): 0
Predicted f(X): 0

Iteration 34/10000
Real argmax(X): 6
Predicted f(X): 6

Iteration 35/10000
Real argmax(X): 0
Predicted f(X): 0

Iteration 36/10000
Real argmax(X): 5
Predicted f(X): 4

Iteration 37/10000
Real argmax(X): 0
Predicted f(X): 0

Iteration 38/10000
Real argmax(X): 2
Predicted f(X): 2

Iteration 39/10000
Real argmax(X): 3
Predicted f(X): 3

Iteration 40/10000
Real argmax(X): 4
Predicted f(X): 4

Iteration 41/10000
Real argmax(X): 6
Predicted f(X): 6

Iteration 42/10000
Real argmax(X): 15
Predicted f(X): 14

Iteration 43/10000
Real argmax(X): 1
Predicted f(X): 1

Iteration 44/10000
Real argmax(X): 2
Predicted f(X): 2

Iteration 45/10000
Real argmax(X): 11
Predicted f(X): 10

Iteration 46/10000
Real argmax(X): 3
Predicted f(X): 3

Iteration 47/10000
Real argmax(X): 1
Predicted f(X): 1

Iteration 48/10000
Real argmax(X): 1
Predicted f(X): 1

Iteration 49/10000
Real argmax(X): 13
Predicted f(X): 13
...
...

I don't know how the model is able to figure out where the highest value appeared in the sequence of one-hot encoded input values, but it does, and it is even able to generalize this learned method to sequences twice the size of those used in the training process without using more memory. DeepMind has found something big with this DNC, and they are improving it with a sparse version that uses fewer resources: https://arxiv.org/pdf/1610.09027v1.pdf

Regards,
Samu.

@Mostafa-Samir (Owner)

Great work Samu @Zeta36 !

Regarding the adding task
I have a comment about how you apply the weights to the loss. You use the following:

loss = tf.reduce_mean(tf.square((loss_weights * output) - ncomputer.target_output))

while you should be using:

loss = tf.reduce_mean(loss_weights * tf.square(output - ncomputer.target_output))

Remember, you're weighting the contribution of the loss of each step, not the significance of each step on its own. Mathematically it's written as

loss = (1/T) * Σ_t w_t * (output_t - target_t)²

not

loss = (1/T) * Σ_t (w_t * output_t - target_t)²

I don't really know how you generate the output vector, but the first formulation (with the weights inside the square) can easily overestimate your loss value.

Try to adopt this change and see if it has any effect on the model. You should also try to test the generalization of the adding task by using the same trained model but with a larger memory matrix (more locations), just as you can find in the visualization notebook of the copy task. It'd also be a good idea to separate the generalization tests into scripts different from the training one, and to use a single descriptive statistic (like the percentage of correct answers, or the percentage of error, or whatever you decide) to describe your results; that way, instead of dumping the entire log in the README, you can just add one or two examples from the log and describe your results with that statistic!
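For instance, such a statistic could be as simple as this (an illustrative snippet using a few real/predicted integer answers from the generalization log earlier in this thread):

import numpy as np

real = np.array([6, 11, 33, 16, 44])         # real sums from the generalization log
predicted = np.array([6, 11, 32, 16, 43])    # truncated DNC predictions for the same runs

accuracy = 100.0 * np.mean(real == predicted)
print("Correct answers: %.1f%%" % accuracy)  # 60.0% on this small excerpt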

I'll then be happy to merge your contributions into the repo!

@cornagli

Hi @Zeta36 and @Mostafa-Samir ,
I am really excited about the results of your tasks and about the DNC's potential.

For this reason, I am trying to implement a further task by myself. I am interested in understanding if a DNC can solve it. I would really appreciate any feedback from you, thanks.

Task description

The task is to count the total number of repeated numbers in a list.

For example:

Input: [ 1, 2, 3] 
Output: [0]

Input: [ 1, 2, 3, 2, 4, 1, 5]
                  X     X     : Repetitions
Output: [2]

The pseudo code the DNC should learn is:

function(x, seenNumbers):
   if x in seenNumbers:
       return 1
   else:
       return 0

I am wondering if the DNC can manage the seenNumbers list by itself.

Settings

Assuming that the DNC can solve the task (I suppose a simple LSTM net can), I would structure the data as follows:

  • Input: (1, random length, 1) tensor
  • Output: Either (1, random length, 1) tensor or scalar containing the sum of the repetitions
  • Loss: depending on the output structure, a square loss function element by element or between two scalars
  • DNC parameters: Currently it is unclear to me how to set the memory parameters (word size, number of words)
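A rough sketch of how such input/target pairs could be generated under these assumptions (illustrative only; the raw integers would probably still need an encoding the DNC can consume, and the memory parameters are left open):

import numpy as np

def generate_repetition_example(length, max_value=10):
    # Random list of integers; every occurrence of a value after its first
    # one counts as a repetition
    values = np.random.randint(1, max_value + 1, size=length)
    repetitions = length - len(np.unique(values))

    seq = values.reshape(1, length, 1).astype(np.float32)              # (1, random length, 1) input
    target = np.array(repetitions, dtype=np.float32).reshape(1, 1, 1)  # scalar sum of repetitions
    return seq, target

x, y = generate_repetition_example(7)   # e.g. [1, 2, 3, 2, 4, 1, 5] -> 2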

What do you think? Do you think that it would be feasible for the DNC to solve the task?

Thanks,
Alessandro
