
Remove train! from quickstart example #2110

Merged · 15 commits merged from quickstart2 into master on Nov 27, 2022
Conversation

mcabbott (Member)

As suggested in #2104, this replaces train! with an explicit loop in the quickstart example. Also steals the idea from there about showing logging code.

It does not completely scrap the example, which otherwise tries to make quite a few points carefully & concisely. Not sure it fits on one screen anymore, though.

github-actions bot (Contributor) commented Nov 11, 2022

Once the build has completed, you can preview any updated documentation at this URL: https://fluxml.ai/Flux.jl/previews/PR2110/ in ~20 minutes

Edit: especially https://fluxml.ai/Flux.jl/previews/PR2110/models/quickstart/

MilesCranmer

LGTM! Although I would suggest you just remove the README example as it is intimidating and unclear. If the goal is to get the user to the docs, I would just leave the API a surprise, so they see it in the best light.

MilesCranmer commented Nov 11, 2022

For the start example, I would suggest some syntax changes to make things more approachable for newcomers to Flux.jl/Julia who might have previous DL experience (probably in Python). In due time they will learn map, ..., eachcol, but I think they don't need it all at once. I also removed the BatchNorm; not sure why it was used there, but I found it a bit confusing. If people need it they can probably find it, right? Concretely, I'd change this:

# Generate some data for the XOR problem: vectors of length 2, as columns of a matrix:
noisy = rand(Float32, 2, 1000)                                    # 2×1000 Matrix{Float32}
truth = map(col -> xor(col...), eachcol(noisy .> 0.5))            # 1000-element Vector{Bool}

# Define our model, a multi-layer perceptron with one hidden layer of size 3:
model = Chain(Dense(2 => 3, tanh), BatchNorm(3), Dense(3 => 2), softmax)

# The model encapsulates parameters, randomly initialised. Its initial output is:
out1 = model(noisy)                                               # 2×1000 Matrix{Float32}

into this:

# Generate some data for the XOR problem: vectors of length 2, as columns of a matrix:
noisy = rand(Float32, 2, 1000)                                    # 2×1000 Matrix{Float32}
truth = [xor(noisy[1, i] > 0.5, noisy[2, i] > 0.5) for i=1:1000]  # 1000-element Vector{Bool}

# Define our model, a multi-layer perceptron with one hidden layer of size 16:
model = Chain(
    Dense(2 => 16), tanh,
    Dense(16 => 16), tanh,
    Dense(16 => 2), softmax
)

# The model encapsulates parameters, randomly initialised. Its initial output is:
out1 = model(noisy)                                               # 2×1000 Matrix{Float32}

(Keeping the activation outside the layer seems more idiomatic to me; showing it inside makes me confused about what I can put inside Chain. They'll learn later on that they can do this, though.)

ToucheSir (Member) commented Nov 11, 2022

I think it best not to put the activation functions outside of their associated layers, because Julia assigns special meanings to some of them on arrays, so fn(::Array) is not the same as fn.(::Array):

julia> x = rand(2, 2)
2×2 Matrix{Float64}:
 0.479567  0.200181
 0.719861  0.0244788

julia> tanh(x) == tanh.(x)
false

julia> tanh(x)
2×2 Matrix{Float64}:
 0.409203  0.178483
 0.641832  0.00344357

julia> tanh.(x)
2×2 Matrix{Float64}:
 0.445897  0.197549
 0.616823  0.0244739

softmax however should stay outside, because it only comes as a "vectorized" function (makes no sense to apply element-wise). This also clues users into the difference between activation functions and functional layers.
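A minimal sketch of that "vectorized" behaviour (assuming the softmax that Flux re-exports from NNlib, which normalises over dims=1 by default):

```julia
using Flux                 # softmax is re-exported from NNlib

x = rand(Float32, 2, 5)
y = softmax(x)             # acts on whole columns, not element-wise
sum(y; dims=1)             # each column of y sums to (approximately) 1
```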

MilesCranmer

> softmax however should stay outside, because it only comes as a "vectorized" function (makes no sense to apply element-wise). This also clues users into the difference between activation functions and functional layers.

Good point! Okay fine with me to keep them inside.

MilesCranmer

I would implement this change though:

- truth = map(col -> xor(col...), eachcol(noisy .> 0.5))            # 1000-element Vector{Bool}
+ truth = [xor(noisy[1, i] > 0.5, noisy[2, i] > 0.5) for i=1:1000]  # 1000-element Vector{Bool}

map, ..., eachcol, .> - just too many things to learn in a single line. (In all honesty, I didn't even know eachcol myself until now, despite developing packages in Julia for a couple of years...!)
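A quick sketch, purely for illustration, checking that the two formulations agree:

```julia
noisy = rand(Float32, 2, 1000)

truth_a = map(col -> xor(col...), eachcol(noisy .> 0.5))               # map/eachcol version
truth_b = [xor(noisy[1, i] > 0.5, noisy[2, i] > 0.5) for i in 1:1000]  # comprehension version

truth_a == truth_b   # true: both build the same Vector{Bool}
```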

mcabbott (Member Author) commented Nov 11, 2022

Worse, the model proposed doesn't run:

julia> # The model encapsulates parameters, randomly initialised. Its initial output is:
       out1 = model(noisy)                                               
ERROR: DimensionMismatch: matrix is not square: dimensions are (16, 1000)
Stacktrace:
 [1] checksquare
   @ ~/.julia/dev/julia/usr/share/julia/stdlib/v1.9/LinearAlgebra/src/LinearAlgebra.jl:239 [inlined]

tanh acting on a matrix already means something. (This is why I was unhappy about making the others auto-broadcast.)

Some care has gone into this thing. The reason BatchNorm is there is partly that it really does improve performance on this small model. (The one you propose is much much larger.) It is arranged that training is good but not perfect, since with a perfect classifier the first and last images look identical.

And also because it's a common layer which has non-parameter state, which is likewise encapsulated in the layer. If you don't encapsulate it, handling that state separately is a bit of a pain.
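A rough sketch of the non-parameter state being referred to (illustrative only; field and function names as in Flux's BatchNorm at the time):

```julia
using Flux

bn = BatchNorm(3)           # the affine scale/shift are trainable parameters ...
bn.μ, bn.σ²                 # ... while the running mean/variance are plain state

Flux.trainmode!(bn)         # in train mode, each forward pass updates the statistics
bn(randn(Float32, 3, 10))
Flux.testmode!(bn)          # in test mode, the stored statistics are used as-is
```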

MilesCranmer

> tanh acting on a matrix already means something. (This is why I was unhappy about making the others auto-broadcast.)

Scary indeed! Maybe a warning should be printed if tanh or exp is passed nakedly to Chain? (since it would be really weird for a user to do this and actually mean it?) If I had set the batchsize equal to the number of features, it wouldn't have errored and I might never have realized this.

README.md Outdated
```julia
using Flux # should install everything for you, including CUDA

x = hcat(digits.(0:3, base=2, pad=2)...) |> gpu # let's solve the XOR problem!
- y = Flux.onehotbatch(xor.(eachrow(x)...), 0:1) |> gpu
+ y = Flux.onehotbatch(xor.(eachrow(cpu(x))...), 0:1) |> gpu
```


Why cpu(x) |> gpu? Are some of these functions not implemented on GPU?

mcabbott (Member Author)

This gives scalar indexing warnings if run on the GPU.


Maybe just move them both to the GPU when creating the dataloader?

mcabbott (Member Author)

I tried a few things, but they are all ugly, or don't work. Maybe simpler to give up on squeezing |> gpu into this model.

mcabbott (Member Author)

ba975db gives up on train! here too. Every attempt seemed pretty convoluted. I guess this is a sign I don't believe in the interface.

It gains a few lines, but at least it's pretty clear what each one does. Explicit parameters will remove the Flux.params line.
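For contrast, a rough sketch of the implicit-parameter style that the Flux.params line implies (older API; hyperparameters are illustrative):

```julia
pars = Flux.params(model)          # implicit collection of trainable arrays
opt = Adam(0.1)

for _ in 1:100
    grad = gradient(() -> Flux.logitcrossentropy(model(x), y), pars)
    Flux.update!(opt, pars, grad)  # mutates the parameters in place
end
```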

mcabbott (Member Author)

Something like this, no more params:

model = Chain(Dense(2 => 3, sigmoid), BatchNorm(3), Dense(3 => 2))
optim = Flux.setup(Adam(0.1, (0.7, 0.95)), model)

for _ in 1:100
    grad = gradient(m -> Flux.logitcrossentropy(m(x), y), model)[1]
    Flux.update!(optim, model, grad)  # this changes model & optim
end


Very clean! Multiple dispatch rules 🚀

mcabbott (Member Author)

I couldn't find a version of this model which was both compact and clear. So I gave up & replaced it with a curve-fitting problem, 918fc0b.

But IDK, maybe we should give up & go back to having no example in the readme.


I like the

    grad = gradient(m -> Flux.logitcrossentropy(m(x), y), model)[1]
    Flux.update!(optim, model, grad)  # this changes model & optim

example a lot. Why not just show this one?


(or the equivalent, but with the MSE loss)

codecov-commenter commented Nov 11, 2022

Codecov Report

Base: 86.86% // Head: 86.86% // No change to project coverage 👍

Coverage data is based on head (c9cde50) compared to base (065c191).
Patch has no changes to coverable lines.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #2110   +/-   ##
=======================================
  Coverage   86.86%   86.86%           
=======================================
  Files          19       19           
  Lines        1469     1469           
=======================================
  Hits         1276     1276           
  Misses        193      193           


README.md Outdated
```julia
- using Flux # should install everything for you, including CUDA
+ using Flux, Plots
+ data = [([x], x-cbrt(x)) for x in range(-2, 2, 100)]
```
MilesCranmer commented Nov 25, 2022

Suggested change:

- data = [([x], x-cbrt(x)) for x in range(-2, 2, 100)]
+ data = [([x], x - x^3) for x in range(-2, 2, 100)]

Something simpler (In all honesty I have never seen cbrt before... wasn't sure what it was)

mcabbott (Member Author)

I guess the point is to make a pretty curve which isn't obviously trivial to fit:

[screenshot of the resulting plot, 2022-11-24]

Does it matter if the function isn't immediately familiar?


Ah, sorry, I edited my comment to be x - x^3. I agree x - x^2 looks way too simple.


(When I read cbrt I wasn't sure if the output was scalar or not in the example - whereas ^ is more clear to me)

mcabbott (Member Author)

Now with 2x - x^3:

[screenshot of the updated plot, 2022-11-24]

mcabbott (Member Author)

> wasn't sure if the output was scalar or not in the example

This example is a little odd in that it takes a trivial vector input and returns a scalar, as that makes the loss nice and simple. Maybe that introduces confusion, though? E.g.

https://discourse.julialang.org/t/input-to-neural-network/90770
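To make the shapes concrete, a small illustrative sketch (not the final README code; the model here is arbitrary):

```julia
using Flux

model = Chain(Dense(1 => 8, tanh), Dense(8 => 1), only)

x = [0.5f0]                  # input: a 1-element vector (one feature)
y = 2 * 0.5f0 - 0.5f0^3      # target: a plain scalar
loss = abs2(model(x) - y)    # scalar output keeps the loss this simple
```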


I was going to suggest that it take two features as input instead, like f(x, y) = 2x - y^3 over $(-2:2) \times (-2:2)$. However, the downside is that it would be harder to visualize (I think a heatmap is good too, though).

mcabbott (Member Author)

Yes. I think if this readme thing exists at all, it needs to be super-simple. It must confirm that Flux is running on your system, and solve a vaguely NN-like problem (not linear regression).

I think the present XOR one invites too much time spent figuring it out, when people should rather go into the docs.

mcabbott (Member Author)

(The rule around here, BTW, is that PRs need one approval to merge, not necessarily from someone with commit rights.)


# The model encapsulates parameters, randomly initialised. Its initial output is:
- out1 = model(noisy) # 2×1000 Matrix{Float32}
+ out1 = model(noisy |> gpu) |> cpu # 2×1000 Matrix{Float32}
mcabbott (Member Author)

Recent commit runs things on the GPU, as that seems worth showing off (even though this is actually slower). One quirk is that model(noisy |> gpu) |> cpu is a bit noisy, but maybe not so confusing to figure out.


Yeah this looks fine to me.
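A rough sketch of the round trip being described (assuming CUDA.jl is installed; Flux's gpu simply returns its input unchanged when no GPU is available):

```julia
using Flux, CUDA

model = Chain(Dense(2 => 3, tanh), BatchNorm(3), Dense(3 => 2), softmax) |> gpu
noisy = rand(Float32, 2, 1000)        # the data itself stays on the CPU

out1 = model(noisy |> gpu) |> cpu     # move the input up, bring the result back down
```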

MilesCranmer left a comment

LGTM!

Couple minor comments which I'll leave up to you:

  • I'd either (1) upload images to a separate docs-specific repo, or (2) upload the image file directly to GitHub rather than to the git repo (you can do this by drag-and-dropping an image into the markdown), so it doesn't weigh down the git history with binary files.
  • Use a slightly larger neural net in the README, so that the fit is better:
model = Chain(Dense(1 => 23, tanh), Dense(23 => 23, tanh), Dense(23 => 1, bias=false), only)

does the trick.
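For context, a rough sketch of how that suggested model might slot into the explicit loop discussed above (loss, optimiser settings and epoch count are illustrative, not what the README ended up with):

```julia
using Flux

data = [([x], 2x - x^3) for x in range(-2f0, 2f0, 100)]   # Float32 to match the layer defaults
model = Chain(Dense(1 => 23, tanh), Dense(23 => 23, tanh), Dense(23 => 1, bias=false), only)
optim = Flux.setup(Adam(0.01), model)

for epoch in 1:1000, (x, y) in data
    grad = gradient(m -> abs2(m(x) - y), model)[1]
    Flux.update!(optim, model, grad)
end
```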

mcabbott (Member Author)

> either (1) upload images to a separate docs-specific repo, or (2) upload the image file directly to GitHub rather than the git repo (you can do this by drag-and-dropping an image into the markdown), so it doesn't weigh down the git history

This is a good point. I have unfortunately committed sins to the tune of 1.5MB in #2125 , and if we want more tutorials in here we probably need a plan. This PR is 4% of that (the larger image is already here) so my inclination is to kick the can down the road for now.

mcabbott merged commit b015b7a into master on Nov 27, 2022
mcabbott deleted the quickstart2 branch on November 27, 2022, 05:22
MilesCranmer

Sounds good to me! (I'm a sinner as well; I recently realized some of my repositories were getting large because of it and have started slowly changing this habit)

mcabbott (Member Author)

Cool, many thanks for taking a look.

If you have a minute, your take on #2114 would also be welcomed.
