Remove train! from quickstart example #2110

Merged (15 commits) on Nov 27, 2022
24 changes: 12 additions & 12 deletions README.md
@@ -18,23 +18,23 @@

 Flux is an elegant approach to machine learning. It's a 100% pure-Julia stack, and provides lightweight abstractions on top of Julia's native GPU and AD support. Flux makes the easy things easy while remaining fully hackable.

-Works best with [Julia 1.8](https://julialang.org/downloads/) or later. Here's a simple example to try it out:
+Works best with [Julia 1.8](https://julialang.org/downloads/) or later. Here's a very short example to try it out:
 ```julia
-using Flux # should install everything for you, including CUDA
+using Flux, Plots
+data = [([x], 2x-x^3) for x in -2:0.1f0:2]

-x = hcat(digits.(0:3, base=2, pad=2)...) |> gpu # let's solve the XOR problem!
-y = Flux.onehotbatch(xor.(eachrow(x)...), 0:1) |> gpu
-data = ((Float32.(x), y) for _ in 1:100) # an iterator making Tuples
+model = Chain(Dense(1 => 23, tanh), Dense(23 => 1, bias=false), only)

-model = Chain(Dense(2 => 3, sigmoid), BatchNorm(3), Dense(3 => 2)) |> gpu
-optim = Adam(0.1, (0.7, 0.95))
-mloss(x, y) = Flux.logitcrossentropy(model(x), y) # closes over model
+mloss(x,y) = (model(x) - y)^2
+optim = Flux.Adam()
+for epoch in 1:1000
+    Flux.train!(mloss, Flux.params(model), data, optim)
+end

-Flux.train!(mloss, Flux.params(model), data, optim) # updates model & optim
-
-all((softmax(model(x)) .> 0.5) .== y) # usually 100% accuracy.
+plot(x -> 2x-x^3, -2, 2, legend=false)
+scatter!(-2:0.1:2, [model([x]) for x in -2:0.1:2])
 ```

-See the [documentation](https://fluxml.github.io/Flux.jl/) for details, or the [model zoo](https://github.com/FluxML/model-zoo/) for examples. Ask questions on the [Julia discourse](https://discourse.julialang.org/) or [slack](https://discourse.julialang.org/t/announcing-a-julia-slack/4866).
+The [quickstart page](https://fluxml.ai/Flux.jl/stable/models/quickstart/) has a longer example. See the [documentation](https://fluxml.github.io/Flux.jl/) for details, or the [model zoo](https://github.com/FluxML/model-zoo/) for examples. Ask questions on the [Julia discourse](https://discourse.julialang.org/) or [slack](https://discourse.julialang.org/t/announcing-a-julia-slack/4866).

 If you use Flux in your research, please [cite](CITATION.bib) our work.
Binary file added docs/src/assets/quickstart/loss.png
82 changes: 54 additions & 28 deletions docs/src/models/quickstart.md
@@ -6,45 +6,54 @@ If you haven't, then you might prefer the [Fitting a Straight Line](overview.md)

 ```julia
 # With Julia 1.7+, this will prompt if neccessary to install everything, including CUDA:
-using Flux, Statistics
+using Flux, Statistics, ProgressMeter

 # Generate some data for the XOR problem: vectors of length 2, as columns of a matrix:
 noisy = rand(Float32, 2, 1000) # 2×1000 Matrix{Float32}
-truth = map(col -> xor(col...), eachcol(noisy .> 0.5)) # 1000-element Vector{Bool}
+truth = [xor(col[1]>0.5, col[2]>0.5) for col in eachcol(noisy)] # 1000-element Vector{Bool}

 # Define our model, a multi-layer perceptron with one hidden layer of size 3:
-model = Chain(Dense(2 => 3, tanh), BatchNorm(3), Dense(3 => 2), softmax)
+model = Chain(
+    Dense(2 => 3, tanh), # activation function inside layer
+    BatchNorm(3),
+    Dense(3 => 2),
+    softmax) |> gpu # move model to GPU, if available

 # The model encapsulates parameters, randomly initialised. Its initial output is:
-out1 = model(noisy) # 2×1000 Matrix{Float32}
+out1 = model(noisy |> gpu) |> cpu # 2×1000 Matrix{Float32}
> **Review comment (Member, PR author):** Recent commit runs things on the GPU, as that seems worth showing off (even though this is actually slower). One quirk is that `model(noisy |> gpu) |> cpu` is a bit noisy, but maybe not so confusing to figure out.
>
> **Reply:** Yeah, this looks fine to me.

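For reference, the pattern being discussed looks roughly like this. This is a standalone sketch, not taken from the diff; `gpu` and `cpu` simply return their input unchanged when no working GPU is found:

```julia
using Flux

m = Dense(2 => 3) |> gpu # move the layer's parameter arrays to the GPU, if one is available
x = rand(Float32, 2, 5)  # data generated on the CPU
y = m(x |> gpu) |> cpu   # send the input over, run the layer, bring the result back to the CPU
```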
-# To train the model, we use batches of 64 samples:
-mat = Flux.onehotbatch(truth, [true, false]) # 2×1000 OneHotMatrix
-data = Flux.DataLoader((noisy, mat), batchsize=64, shuffle=true);
-first(data) .|> summary # ("2×64 Matrix{Float32}", "2×64 Matrix{Bool}")
+# To train the model, we use batches of 64 samples, and one-hot encoding:
+target = Flux.onehotbatch(truth, [true, false]) # 2×1000 OneHotMatrix
+loader = Flux.DataLoader((noisy, target) |> gpu, batchsize=64, shuffle=true);
+# 16-element DataLoader with first element: (2×64 Matrix{Float32}, 2×64 OneHotMatrix)

 pars = Flux.params(model) # contains references to arrays in model
 opt = Flux.Adam(0.01) # will store optimiser momentum, etc.

 # Training loop, using the whole data set 1000 times:
-for epoch in 1:1_000
-    Flux.train!(pars, data, opt) do x, y
-        # First argument of train! is a loss function, here defined by a `do` block.
-        # This gets x and y, each a 2×64 Matrix, from data, and compares:
-        Flux.crossentropy(model(x), y)
+losses = []
+@showprogress for epoch in 1:1_000
+    for (x, y) in loader
+        loss, grad = Flux.withgradient(pars) do
+            # Evaluate model and loss inside gradient context:
+            y_hat = model(x)
+            Flux.crossentropy(y_hat, y)
+        end
+        Flux.update!(opt, pars, grad)
+        push!(losses, loss) # logging, outside gradient context
     end
 end

-pars # has changed!
+pars # parameters, momenta and output have all changed
 opt
-out2 = model(noisy)
+out2 = model(noisy |> gpu) |> cpu # first row is prob. of true, second row p(false)

 mean((out2[1,:] .> 0.5) .== truth) # accuracy 94% so far!
 ```

-![](../assets/oneminute.png)
+![](../assets/quickstart/oneminute.png)

-```
+```julia
 using Plots # to draw the above figure

 p_true = scatter(noisy[1,:], noisy[2,:], zcolor=truth, title="True classification", legend=false)
@@ -54,26 +63,43 @@ p_done = scatter(noisy[1,:], noisy[2,:], zcolor=out2[1,:], title="Trained networ
 plot(p_true, p_raw, p_done, layout=(1,3), size=(1000,330))
 ```

+```@raw html
+<img align="right" width="300px" src="../../assets/quickstart/loss.png">
+```
+
+Here's the loss during training:
+
+```julia
+plot(losses; xaxis=(:log10, "iteration"),
+    yaxis="loss", label="per batch")
+n = length(loader)
+plot!(n:n:length(losses), mean.(Iterators.partition(losses, n)),
+    label="epoch mean", dpi=200)
+```

This XOR ("exclusive or") problem is a variant of the famous one which drove Minsky and Papert to invent deep neural networks in 1969. For small values of "deep" -- this has one hidden layer, while earlier perceptrons had none. (What they call a hidden layer, Flux calls the output of the first layer, `model[1](noisy)`.)

Since then things have developed a little.

-## Features of Note
+## Features to Note

Some things to notice in this example are:

-* The batch dimension of data is always the last one. Thus a `2×1000 Matrix` is a thousand observations, each a column of length 2.
+* The batch dimension of data is always the last one. Thus a `2×1000 Matrix` is a thousand observations, each a column of length 2. Flux defaults to `Float32`, but most of Julia to `Float64`.

-* The `model` can be called like a function, `y = model(x)`. It encapsulates the parameters (and state).
+* The `model` can be called like a function, `y = model(x)`. Each layer like [`Dense`](@ref Flux.Dense) is an ordinary `struct`, which encapsulates some arrays of parameters (and possibly other state, as for [`BatchNorm`](@ref Flux.BatchNorm)).

-* But the model does not contain the loss function, nor the optimisation rule. Instead the [`Adam()`](@ref Flux.Adam) object stores between iterations the momenta it needs.
+* But the model does not contain the loss function, nor the optimisation rule. The [`Adam`](@ref Flux.Adam) object stores between iterations the momenta it needs. And [`Flux.crossentropy`](@ref Flux.Losses.crossentropy) is an ordinary function.

-* The function [`train!`](@ref Flux.train!) likes data as an iterator generating `Tuple`s, here produced by [`DataLoader`](@ref). This mutates both the `model` and the optimiser state inside `opt`.
+* The `do` block creates an anonymous function, as the first argument of `gradient`. Anything executed within this is differentiated.

-There are other ways to train Flux models, for more control than `train!` provides:
-
-* Within Flux, you can easily write a training loop, calling [`gradient`](@ref) and [`update!`](@ref Flux.update!).
-
-* For a lower-level way, see the package [Optimisers.jl](https://github.com/FluxML/Optimisers.jl).
-
-* For higher-level ways, see [FluxTraining.jl](https://github.com/FluxML/FluxTraining.jl) and [FastAI.jl](https://github.com/FluxML/FastAI.jl).
+Instead of calling [`gradient`](@ref Zygote.gradient) and [`update!`](@ref Flux.update!) separately, there is a convenience function [`train!`](@ref Flux.train!). If we didn't want anything extra (like logging the loss), we could replace the training loop with the following:
+```julia
+for epoch in 1:1_000
+    train!(pars, loader, opt) do x, y
+        y_hat = model(x)
+        Flux.crossentropy(y_hat, y)
+    end
+end
+```
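
Taken together, the points in the "Features to Note" list above can be illustrated by a short standalone sketch. It is not taken from the PR; the names and sizes here are illustrative, and it uses the same implicit `params` style as the new training loop:

```julia
using Flux

x = rand(Float32, 2, 100)                            # 100 observations, each a column of length 2
y = Flux.onehotbatch(rand(Bool, 100), [true, false]) # 2×100 one-hot targets

model = Chain(Dense(2 => 3, tanh), Dense(3 => 2), softmax)
size(model(x))    # (2, 100): the model is an ordinary callable struct
size(model[1](x)) # (3, 100): the hidden layer's output on its own

pars = Flux.params(model) # collects references to the parameter arrays
opt = Flux.Adam(0.01)     # optimiser state (momenta) lives here, not in the model

loss, grad = Flux.withgradient(pars) do # the `do` block is an anonymous zero-argument closure
    Flux.crossentropy(model(x), y)      # everything executed in here is differentiated
end
Flux.update!(opt, pars, grad) # one step: mutates the model's parameter arrays and `opt`
```

Here `x` is `Float32`, matching the layers' default element type, and the batch of 100 runs along the last dimension.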