From 287dbd27042c564a5ebc978e41f821e7925fe1b2 Mon Sep 17 00:00:00 2001 From: "Documenter.jl" Date: Sat, 17 Aug 2024 20:33:54 +0000 Subject: [PATCH] build based on 060bed3 --- previews/PR2464/.documenter-siteinfo.json | 2 +- previews/PR2464/ecosystem/index.html | 2 +- previews/PR2464/guide/gpu/index.html | 45 ++++++---- .../PR2464/guide/models/basics/index.html | 2 +- .../guide/models/custom_layers/index.html | 2 +- .../PR2464/guide/models/overview/index.html | 2 +- .../PR2464/guide/models/quickstart/index.html | 2 +- .../PR2464/guide/models/recurrence/index.html | 2 +- previews/PR2464/guide/performance/index.html | 2 +- previews/PR2464/guide/saving/index.html | 2 +- .../PR2464/guide/training/training/index.html | 2 +- previews/PR2464/index.html | 2 +- previews/PR2464/objects.inv | Bin 6123 -> 6123 bytes .../PR2464/reference/data/mlutils/index.html | 2 +- .../PR2464/reference/data/onehot/index.html | 2 +- .../PR2464/reference/destructure/index.html | 8 +- .../reference/models/activation/index.html | 50 +++++------ .../reference/models/functors/index.html | 8 +- .../PR2464/reference/models/layers/index.html | 84 +++++++++--------- .../PR2464/reference/models/losses/index.html | 34 +++---- .../PR2464/reference/models/nnlib/index.html | 52 +++++------ .../PR2464/reference/outputsize/index.html | 4 +- .../reference/training/callbacks/index.html | 8 +- .../reference/training/optimisers/index.html | 2 +- .../reference/training/reference/index.html | 6 +- .../reference/training/zygote/index.html | 2 +- .../PR2464/reference/utilities/index.html | 20 ++--- previews/PR2464/search_index.js | 2 +- .../tutorials/linear_regression/index.html | 2 +- .../tutorials/logistic_regression/index.html | 2 +- .../PR2464/tutorials/model_zoo/index.html | 2 +- 31 files changed, 185 insertions(+), 172 deletions(-) diff --git a/previews/PR2464/.documenter-siteinfo.json b/previews/PR2464/.documenter-siteinfo.json index 2c9965b71a..77aed0f431 100644 --- a/previews/PR2464/.documenter-siteinfo.json 
+++ b/previews/PR2464/.documenter-siteinfo.json @@ -1 +1 @@ -{"documenter":{"julia_version":"1.10.4","generation_timestamp":"2024-08-03T20:41:35","documenter_version":"1.5.0"}} \ No newline at end of file +{"documenter":{"julia_version":"1.10.4","generation_timestamp":"2024-08-17T20:30:34","documenter_version":"1.5.0"}} \ No newline at end of file diff --git a/previews/PR2464/ecosystem/index.html b/previews/PR2464/ecosystem/index.html index 155368bad4..1b9ad212a1 100644 --- a/previews/PR2464/ecosystem/index.html +++ b/previews/PR2464/ecosystem/index.html @@ -3,4 +3,4 @@ function gtag(){dataLayer.push(arguments);} gtag('js', new Date()); gtag('config', 'UA-36890222-9', {'page_path': location.pathname + location.search + location.hash}); -

The Julia Ecosystem around Flux

One of the main strengths of Julia lies in an ecosystem of packages globally providing a rich and consistent user experience.

This is a non-exhaustive list of Julia packages, nicely complementing Flux in typical machine learning and deep learning workflows. To add your project please send a PR. See also academic work citing Flux or citing Zygote.

Flux models

  • Flux's model-zoo contains examples from many domains.

Computer vision

  • ObjectDetector.jl provides ready-to-go image detection via YOLO.
  • Metalhead.jl includes many state-of-the-art computer vision models which can easily be used for transfer learning.
  • UNet.jl is a generic UNet implementation.

Natural language processing

  • Transformers.jl provides components for Transformer models for NLP, as well as providing several trained models out of the box.
  • TextAnalysis.jl provides several NLP algorithms that use Flux models under the hood.

Reinforcement learning

  • AlphaZero.jl provides a generic, simple and fast implementation of Deepmind's AlphaZero algorithm.
  • ReinforcementLearning.jl offers a collection of tools for doing reinforcement learning research in Julia.

Graph learning

  • GraphNeuralNetworks.jl is a fresh, performant and flexible graph neural network library based on Flux.jl.
  • GeometricFlux.jl is the first graph neural network library for Julia.
  • NeuralOperators.jl enables training infinite dimensional PDEs by learning a continuous function instead of using the finite element method.
  • SeaPearl.jl is a Constraint Programming solver that uses Reinforcement Learning based on graphs as input.

Time series

Robust networks

  • RobustNeuralNetworks.jl includes classes of neural networks that are constructed to naturally satisfy robustness constraints.

Tools closely associated with Flux

Utility tools you're unlikely to have met if you never used Flux!

High-level training flows

  • FastAI.jl is a Julia port of Python's fast.ai library.
  • FluxTraining.jl is a package for using and writing powerful, extensible training loops for deep learning models. It supports callbacks for many common use cases like hyperparameter scheduling, metrics tracking and logging, checkpointing, early stopping, and more. It powers training in FastAI.jl.
  • Ignite.jl is a Julia port of the Python library ignite for simplifying neural network training and validation loops, using events and handlers.
  • Tsunami.jl adds high-level ways to control training, parameter schedules & logging, heavily inspired by pytorch-lightning.

Datasets

Commonly used machine learning datasets are provided by the following packages in the Julia ecosystem:

Plumbing

Tools to put data into the right order for creating a model.

  • Augmentor.jl is a real-time image augmentation library for increasing the number of training images.
  • DataAugmentation.jl aims to make it easy to build stochastic, label-preserving augmentation pipelines for vision use cases involving images, keypoints and segmentation masks.
  • MLUtils.jl (replaces MLDataUtils.jl and MLLabelUtils.jl) is a library for processing Machine Learning datasets.

Parameters


Differentiable programming

Packages based on differentiable programming but not necessarily related to Machine Learning.

  • The SciML ecosystem uses Flux and Zygote to mix neural nets with differential equations, to get the best of black box and mechanistic modelling.
  • DiffEqFlux.jl provides tools for creating Neural Differential Equations.
  • Flux3D.jl shows off machine learning on 3D data.
  • RayTracer.jl combines ML with computer vision via a differentiable renderer.
  • Duckietown.jl Differentiable Duckietown simulator.
  • The Yao.jl project uses Flux and Zygote for Quantum Differentiable Programming.
  • AtomicGraphNets.jl enables learning graph based models on atomic systems used in chemistry.
  • DiffImages.jl differentiable computer vision modeling in Julia with the Images.jl ecosystem.

Probabilistic programming

  • Turing.jl extends Flux's differentiable programming capabilities to probabilistic programming.
  • Omega.jl is a research project aimed at causal, higher-order probabilistic programming.
  • Stheno.jl provides flexible Gaussian processes.

Statistics


Useful miscellaneous packages

Some useful and random packages!

  • AdversarialPrediction.jl provides a way to easily optimise generic performance metrics in supervised learning settings using the Adversarial Prediction framework.
  • Mill.jl helps to prototype flexible multi-instance learning models.
  • MLMetrics.jl is a utility for scoring models in data science and machine learning.
  • Torch.jl exposes torch in Julia.
  • ValueHistories.jl is a utility for efficient tracking of optimization histories, training curves or other information of arbitrary types and at arbitrarily spaced sampling times.
  • InvertibleNetworks.jl Building blocks for invertible neural networks in the Julia programming language.
  • ProgressMeter.jl progress meters for long-running computations.
  • TensorBoardLogger.jl easy peasy logging to tensorboard in Julia
  • ArgParse.jl is a package for parsing command-line arguments to Julia programs.
  • Parameters.jl types with default field values, keyword constructors and (un-)pack macros.
  • BSON.jl is a package for working with the Binary JSON serialisation format.
  • DataFrames.jl in-memory tabular data in Julia.
  • DrWatson.jl is a scientific project assistant software.

This tight integration among Julia packages is shown in some of the examples in the model-zoo repository.


Alternatives to Flux

Julia has several other libraries for making neural networks.

  • SimpleChains.jl is focused on making small, simple, CPU-based, neural networks fast. Uses LoopVectorization.jl. (Was FastChain in DiffEqFlux.jl)

  • Knet.jl is a neural network library built around AutoGrad.jl.

  • Lux.jl (earlier ExplicitFluxLayers.jl) shares much of the design, use-case, and NNlib.jl / Optimisers.jl back-end of Flux. But instead of encapsulating all parameters within the model structure, it separates this into 3 components: a model, a tree of parameters, and a tree of model states.

Explicit or explicit?

Flux's training docs talk about changes from Zygote's implicit to explicit gradients, dictionary-like to tree-like structures. (See also Zygote's description of these.) Lux also uses Zygote, but uses the word "explicit" to mean something unrelated, namely storing the tree of parameters (and of state) separately from the model.

+

The Julia Ecosystem around Flux

One of the main strengths of Julia lies in an ecosystem of packages globally providing a rich and consistent user experience.

This is a non-exhaustive list of Julia packages, nicely complementing Flux in typical machine learning and deep learning workflows. To add your project please send a PR. See also academic work citing Flux or citing Zygote.

Flux models

  • Flux's model-zoo contains examples from many domains.

Computer vision

  • ObjectDetector.jl provides ready-to-go image detection via YOLO.
  • Metalhead.jl includes many state-of-the-art computer vision models which can easily be used for transfer learning.
  • UNet.jl is a generic UNet implementation.

Natural language processing

  • Transformers.jl provides components for Transformer models for NLP, as well as providing several trained models out of the box.
  • TextAnalysis.jl provides several NLP algorithms that use Flux models under the hood.

Reinforcement learning

  • AlphaZero.jl provides a generic, simple and fast implementation of Deepmind's AlphaZero algorithm.
  • ReinforcementLearning.jl offers a collection of tools for doing reinforcement learning research in Julia.

Graph learning

  • GraphNeuralNetworks.jl is a fresh, performant and flexible graph neural network library based on Flux.jl.
  • GeometricFlux.jl is the first graph neural network library for Julia.
  • NeuralOperators.jl enables training infinite dimensional PDEs by learning a continuous function instead of using the finite element method.
  • SeaPearl.jl is a Constraint Programming solver that uses Reinforcement Learning based on graphs as input.

Time series

Robust networks

  • RobustNeuralNetworks.jl includes classes of neural networks that are constructed to naturally satisfy robustness constraints.

Tools closely associated with Flux

Utility tools you're unlikely to have met if you never used Flux!

High-level training flows

  • FastAI.jl is a Julia port of Python's fast.ai library.
  • FluxTraining.jl is a package for using and writing powerful, extensible training loops for deep learning models. It supports callbacks for many common use cases like hyperparameter scheduling, metrics tracking and logging, checkpointing, early stopping, and more. It powers training in FastAI.jl.
  • Ignite.jl is a Julia port of the Python library ignite for simplifying neural network training and validation loops, using events and handlers.
  • Tsunami.jl adds high-level ways to control training, parameter schedules & logging, heavily inspired by pytorch-lightning.

Datasets

Commonly used machine learning datasets are provided by the following packages in the Julia ecosystem:

Plumbing

Tools to put data into the right order for creating a model.

  • Augmentor.jl is a real-time image augmentation library for increasing the number of training images.
  • DataAugmentation.jl aims to make it easy to build stochastic, label-preserving augmentation pipelines for vision use cases involving images, keypoints and segmentation masks.
  • MLUtils.jl (replaces MLDataUtils.jl and MLLabelUtils.jl) is a library for processing Machine Learning datasets.

Parameters


Differentiable programming

Packages based on differentiable programming but not necessarily related to Machine Learning.

  • The SciML ecosystem uses Flux and Zygote to mix neural nets with differential equations, to get the best of black box and mechanistic modelling.
  • DiffEqFlux.jl provides tools for creating Neural Differential Equations.
  • Flux3D.jl shows off machine learning on 3D data.
  • RayTracer.jl combines ML with computer vision via a differentiable renderer.
  • Duckietown.jl Differentiable Duckietown simulator.
  • The Yao.jl project uses Flux and Zygote for Quantum Differentiable Programming.
  • AtomicGraphNets.jl enables learning graph based models on atomic systems used in chemistry.
  • DiffImages.jl differentiable computer vision modeling in Julia with the Images.jl ecosystem.

Probabilistic programming

  • Turing.jl extends Flux's differentiable programming capabilities to probabilistic programming.
  • Omega.jl is a research project aimed at causal, higher-order probabilistic programming.
  • Stheno.jl provides flexible Gaussian processes.

Statistics


Useful miscellaneous packages

Some useful and random packages!

  • AdversarialPrediction.jl provides a way to easily optimise generic performance metrics in supervised learning settings using the Adversarial Prediction framework.
  • Mill.jl helps to prototype flexible multi-instance learning models.
  • MLMetrics.jl is a utility for scoring models in data science and machine learning.
  • Torch.jl exposes torch in Julia.
  • ValueHistories.jl is a utility for efficient tracking of optimization histories, training curves or other information of arbitrary types and at arbitrarily spaced sampling times.
  • InvertibleNetworks.jl Building blocks for invertible neural networks in the Julia programming language.
  • ProgressMeter.jl progress meters for long-running computations.
  • TensorBoardLogger.jl easy peasy logging to tensorboard in Julia
  • ArgParse.jl is a package for parsing command-line arguments to Julia programs.
  • Parameters.jl types with default field values, keyword constructors and (un-)pack macros.
  • BSON.jl is a package for working with the Binary JSON serialisation format.
  • DataFrames.jl in-memory tabular data in Julia.
  • DrWatson.jl is a scientific project assistant software.

This tight integration among Julia packages is shown in some of the examples in the model-zoo repository.


Alternatives to Flux

Julia has several other libraries for making neural networks.

  • SimpleChains.jl is focused on making small, simple, CPU-based, neural networks fast. Uses LoopVectorization.jl. (Was FastChain in DiffEqFlux.jl)

  • Knet.jl is a neural network library built around AutoGrad.jl.

  • Lux.jl (earlier ExplicitFluxLayers.jl) shares much of the design, use-case, and NNlib.jl / Optimisers.jl back-end of Flux. But instead of encapsulating all parameters within the model structure, it separates this into 3 components: a model, a tree of parameters, and a tree of model states.

Explicit or explicit?

Flux's training docs talk about changes from Zygote's implicit to explicit gradients, dictionary-like to tree-like structures. (See also Zygote's description of these.) Lux also uses Zygote, but uses the word "explicit" to mean something unrelated, namely storing the tree of parameters (and of state) separately from the model.
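The distinction can be illustrated in plain Julia. The sketch below is only an illustration of the two styles: every name in it (OwningAffine, AffineSpec, init_params, apply) is hypothetical, and none of it is the Flux or Lux API. The first type keeps its arrays inside the model struct, as Flux layers do; the second keeps only a description of the layer and receives a separate tree of parameters on each call. The third component mentioned above, the tree of model states, is left out for brevity.

```julia
# Hypothetical names throughout; this is not the Lux.jl or Flux API.

# Flux-style: the layer struct owns its parameter arrays.
struct OwningAffine
    weight::Matrix{Float64}
    bias::Vector{Float64}
end
(l::OwningAffine)(x) = l.weight * x .+ l.bias

# Lux-style separation: the layer only describes its shape.
struct AffineSpec
    nin::Int
    nout::Int
end

# The parameters live in a separate tree (here a NamedTuple).
init_params(l::AffineSpec) = (weight = zeros(l.nout, l.nin), bias = ones(l.nout))

# The parameter tree is passed in explicitly on every call.
apply(::AffineSpec, ps, x) = ps.weight * x .+ ps.bias

x = [1.0, 2.0]
owning = OwningAffine(zeros(3, 2), ones(3))
spec = AffineSpec(2, 3)
ps = init_params(spec)

# Both styles compute the same affine map for the same numbers.
println(owning(x) == apply(spec, ps, x))
```

Keeping the parameters outside the model means the model value itself never changes during training; only the parameter tree is replaced.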

diff --git a/previews/PR2464/guide/gpu/index.html b/previews/PR2464/guide/gpu/index.html index 4636d8d59a..a153d9daf2 100644 --- a/previews/PR2464/guide/gpu/index.html +++ b/previews/PR2464/guide/gpu/index.html @@ -169,10 +169,10 @@ julia> CUDA.device(dense_model.weight) CuDevice(1): GeForce RTX 2080 Ti -

Due to a limitation in Metal.jl, currently this kind of data movement across devices is only supported for CUDA and AMDGPU backends.

Printing models after moving to a different device

Due to a limitation in how GPU packages currently work, printing models on the REPL after moving them to a GPU device which is different from the current device will lead to an error.

Flux.AbstractDeviceType
Flux.AbstractDevice <: Function

An abstract type representing device objects for different GPU backends. The currently supported backends are "CUDA", "AMDGPU", "Metal" and "CPU"; the "CPU" backend is the fallback case when no GPU is available. GPU extensions of Flux define subtypes of this type.

source
Flux.FluxCPUDeviceType
Flux.FluxCPUDevice <: Flux.AbstractDevice

A type representing device objects for the "CPU" backend for Flux. This is the fallback case when no GPU is available to Flux.

source
Flux.FluxCUDADeviceType
FluxCUDADevice <: AbstractDevice

A type representing device objects for the "CUDA" backend for Flux.

source
Flux.FluxAMDGPUDeviceType
FluxAMDGPUDevice <: AbstractDevice

A type representing device objects for the "AMDGPU" backend for Flux.

source
Flux.FluxMetalDeviceType
FluxMetalDevice <: AbstractDevice

A type representing device objects for the "Metal" backend for Flux.

source
Flux.supported_devicesFunction
Flux.supported_devices()

Get all supported backends for Flux, in order of preference.

Example

julia> using Flux;
+

Due to a limitation in Metal.jl, currently this kind of data movement across devices is only supported for CUDA and AMDGPU backends.

Printing models after moving to a different device

Due to a limitation in how GPU packages currently work, printing models on the REPL after moving them to a GPU device which is different from the current device will lead to an error.

Flux.AbstractDeviceType
Flux.AbstractDevice <: Function

An abstract type representing device objects for different GPU backends. The currently supported backends are "CUDA", "AMDGPU", "Metal" and "CPU"; the "CPU" backend is the fallback case when no GPU is available. GPU extensions of Flux define subtypes of this type.

source
Flux.FluxCPUDeviceType
Flux.FluxCPUDevice <: Flux.AbstractDevice

A type representing device objects for the "CPU" backend for Flux. This is the fallback case when no GPU is available to Flux.

source
Flux.FluxCUDADeviceType
FluxCUDADevice <: AbstractDevice

A type representing device objects for the "CUDA" backend for Flux.

source
Flux.FluxAMDGPUDeviceType
FluxAMDGPUDevice <: AbstractDevice

A type representing device objects for the "AMDGPU" backend for Flux.

source
Flux.FluxMetalDeviceType
FluxMetalDevice <: AbstractDevice

A type representing device objects for the "Metal" backend for Flux.

source
Flux.supported_devicesFunction
Flux.supported_devices()

Get all supported backends for Flux, in order of preference.

Example

julia> using Flux;
 
 julia> Flux.supported_devices()
-("CUDA", "AMDGPU", "Metal", "CPU")
source
Flux.get_deviceFunction
Flux.get_device(; verbose=false)::Flux.AbstractDevice

Returns a device object for the most appropriate backend for the current Julia session.

First, the function checks whether a backend preference has been set via the Flux.gpu_backend! function. If so, an attempt is made to load this backend. If the corresponding trigger package has been loaded and the backend is functional, a device corresponding to the given backend is loaded. Otherwise, the backend is chosen automatically. To update the backend preference, use Flux.gpu_backend!.

If there is no preference, then for each of the "CUDA", "AMDGPU", "Metal" and "CPU" backends in the given order, this function checks whether the given backend has been loaded via the corresponding trigger package, and whether the backend is functional. If so, the device corresponding to the backend is returned. If no GPU backend is available, a Flux.FluxCPUDevice is returned.

If verbose is set to true, then the function prints informative log messages.

Examples

For the example given below, the backend preference was set to "AMDGPU" via the gpu_backend! function.

julia> using Flux;
+("CUDA", "AMDGPU", "Metal", "CPU")
source
Flux.get_deviceFunction
Flux.get_device(; verbose=false)::Flux.AbstractDevice

Returns a device object for the most appropriate backend for the current Julia session.

First, the function checks whether a backend preference has been set via the Flux.gpu_backend! function. If so, an attempt is made to load this backend. If the corresponding trigger package has been loaded and the backend is functional, a device corresponding to the given backend is loaded. Otherwise, the backend is chosen automatically. To update the backend preference, use Flux.gpu_backend!.

If there is no preference, then for each of the "CUDA", "AMDGPU", "Metal" and "CPU" backends in the given order, this function checks whether the given backend has been loaded via the corresponding trigger package, and whether the backend is functional. If so, the device corresponding to the backend is returned. If no GPU backend is available, a Flux.FluxCPUDevice is returned.

If verbose is set to true, then the function prints informative log messages.
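The selection rule described above can be sketched in plain Julia. This is a simplified model of the documented behaviour, not Flux's implementation: choose_backend and its two arguments are hypothetical, and "loaded and functional" is reduced to membership in a list.

```julia
# GPU preference order as documented above; "CPU" is the fallback.
const GPU_ORDER = ("CUDA", "AMDGPU", "Metal")

# Hypothetical helper, not part of Flux.
# `preference` is a backend name or `nothing`.
# `functional` lists the GPU backends that are loaded and working.
function choose_backend(preference, functional)
    # A stored preference wins if that backend is usable ("CPU" always is).
    if preference == "CPU" || (preference !== nothing && preference in functional)
        return preference
    end
    # Otherwise take the first usable GPU backend in preference order.
    for name in GPU_ORDER
        name in functional && return name
    end
    # No GPU backend is available.
    return "CPU"
end

println(choose_backend("AMDGPU", ["CUDA", "AMDGPU"]))
println(choose_backend("Metal", ["CUDA"]))
println(choose_backend(nothing, String[]))
```

The second call shows the fallback path: the preferred backend is not usable, so the choice is made automatically from what is available.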

Examples

For the example given below, the backend preference was set to "AMDGPU" via the gpu_backend! function.

julia> using Flux;
 
 julia> model = Dense(2 => 3)
 Dense(2 => 3)       # 9 parameters
@@ -212,7 +212,7 @@
 3×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
   0.820013   0.527131
  -0.915589   0.549048
-  0.290744  -0.0592499
source
Flux.get_device(backend::String, idx::Int = 0)::Flux.AbstractDevice

Get a device object for a backend specified by the string backend and idx. The currently supported values of backend are "CUDA", "AMDGPU" and "CPU". idx must be an integer value between 0 and the number of available devices.

Examples

julia> using Flux, CUDA;
+  0.290744  -0.0592499
source
Flux.get_device(backend::String, idx::Int = 0)::Flux.AbstractDevice

Get a device object for a backend specified by the string backend and idx. The currently supported values of backend are "CUDA", "AMDGPU" and "CPU". idx must be an integer value between 0 and the number of available devices.

Examples

julia> using Flux, CUDA;
 
 julia> CUDA.devices()
 CUDA.DeviceIterator() for 3 devices:
@@ -234,7 +234,7 @@
 
 julia> cpu_device = Flux.get_device("CPU")
 (::Flux.FluxCPUDevice) (generic function with 1 method)
-
source
Flux.gpu_backend!Function
gpu_backend!(backend::String)

Set the GPU backend to backend in the LocalPreferences.toml file in your project directory. After restarting Julia, the new backend will affect all subsequent calls to gpu and get_device.

The supported backends are "CUDA", "AMDGPU", "Metal" and "CPU".

source

Distributed data parallel training

Flux now supports distributed data parallel training with the DistributedUtils module. If you want to run your code on multiple GPUs, you have to install MPI.jl (see docs for more info).

julia> using MPI
+
source
Flux.gpu_backend!Function
gpu_backend!(backend::String)

Set the GPU backend to backend in the LocalPreferences.toml file in your project directory. After restarting Julia, the new backend will affect all subsequent calls to gpu and get_device.

The supported backends are "CUDA", "AMDGPU", "Metal" and "CPU".

source

Distributed data parallel training

Experimental

Distributed support is experimental and could change in the future.

Flux now supports distributed data parallel training with the DistributedUtils module. If you want to run your code on multiple GPUs, you have to install MPI.jl (see docs for more info).

julia> using MPI
 
 julia> MPI.install_mpiexecjl()

Now you can run your code with mpiexecjl --project=. -n <np> julia <filename>.jl from the command line.

You can use either the MPIBackend or the NCCLBackend, the latter only if NCCL.jl is also loaded. First, initialize a backend with DistributedUtils.initialize, e.g.

julia> using Flux, MPI, NCCL
 
@@ -253,20 +253,33 @@
 
 julia> y = x .^ 3
 1×16 CUDA.CuArray{Float32, 2, CUDA.DeviceMemory}:
- 0.0137076  0.0362744  0.791443  0.171815  0.620854  0.668804  0.53197  0.819654  0.108651  0.179971  0.312918  0.388508  0.907292  0.00155418  0.29  0.435899

You can also use DistributedUtils.DistributedDataContainer to split the data uniformly across processes.

julia> data = DistributedUtils.DistributedDataContainer(backend, x)
+ 0.0137076  0.0362744  0.791443  0.171815  0.620854  0.668804  0.53197  0.819654  0.108651  0.179971  0.312918  0.388508  0.907292  0.00155418  0.29  0.435899

In this case, each process holds 16 samples, so we are training on a total of 16 × (number of processes) samples. You can also use DistributedUtils.DistributedDataContainer to split the data uniformly across processes (or split the data manually).

julia> data = DistributedUtils.DistributedDataContainer(backend, x)
 Flux.DistributedUtils.DistributedDataContainer(Float32[0.23932439 0.33102947 … 0.66191036 0.75822026], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16])
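To make "split the data uniformly across processes" concrete, here is a minimal plain-Julia sketch of one way to assign sample indices to processes. The helper local_indices is hypothetical and is not claimed to be what DistributedDataContainer does internally; it only shows the kind of bookkeeping involved when splitting manually.

```julia
# Hypothetical helper, not part of Flux: indices owned by one process.
# `rank` is zero-based, as MPI ranks are.
function local_indices(n_samples::Int, n_workers::Int, rank::Int)
    per_worker = cld(n_samples, n_workers)  # ceiling division
    chunks = collect(Iterators.partition(1:n_samples, per_worker))
    return collect(chunks[rank + 1])
end

x = collect(1:16)
for rank in 0:3
    # Each of the 4 processes sees its own contiguous quarter of the data.
    println("rank $rank -> ", x[local_indices(length(x), 4, rank)])
end
```

With a single process the helper returns all 16 indices. The sketch assumes there are at least as many chunks as processes; with very few samples and many processes, the last ranks would have no chunk to index.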

You have to wrap your model in DistributedUtils.FluxDistributedModel and synchronize it (broadcast across all processes):

julia> model = DistributedUtils.synchronize!!(backend, DistributedUtils.FluxDistributedModel(model); root=0)
 Chain(
   Dense(1 => 256, tanh),                # 512 parameters
 
   Dense(256 => 1),                      # 257 parameters
-)                   # Total: 4 arrays, 769 parameters, 744 bytes.

Time to set up an optimizer by using DistributedUtils.DistributedOptimizer and synchronize it as well.

using Optimisers
-opt = DistributedUtils.DistributedOptimizer(backend, Optimisers.Adam(0.001f0))
-st_opt = Optimisers.setup(opt, model)
-st_opt = DistributedUtils.synchronize!!(backend, st_opt; root=0) 

Now you can define loss and train the model.

for epoch in 1:100
-  global model, st_opt
-  l, grad = Zygote.withgradient(loss, model)
-  println("Epoch $epoch: Loss $l")
-  st_opt, model = Optimisers.update(st_opt, model, grad[1])
-end

Remember that to run on multiple GPUs you have to launch the script from the command line with mpiexecjl --project=. -n <np> julia <filename>.jl, where <np> is the number of processes that you want to use. The number of processes usually corresponds to the number of GPUs.

By default, the MPI installation that MPI.jl uses is CUDA-unaware, so if you want to run in CUDA-aware mode, read more here on custom installation and rebuilding MPI.jl. Then test whether your MPI is CUDA-aware with

import Pkg
-Pkg.test("MPI"; test_args=["--backend=CUDA"])

If it is, set your local preference as below

using Preferences
-set_preferences!("Flux", "FluxDistributedMPICUDAAware" => true)
Known shortcomings

We don't run CUDA-aware tests, so you use this mode at your own risk.

+) # Total: 4 arrays, 769 parameters, 744 bytes.

Time to set up an optimizer by using DistributedUtils.DistributedOptimizer and synchronize it as well.

julia> using Optimisers
+
+julia> opt = DistributedUtils.DistributedOptimizer(backend, Optimisers.Adam(0.001f0))
+DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8))
+
+julia> st_opt = Optimisers.setup(opt, model)
+(layers = ((weight = Leaf(DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8)), (Float32[0.0; 0.0; … ; 0.0; 0.0;;], Float32[0.0; 0.0; … ; 0.0; 0.0;;], (0.9, 0.999))), bias = Leaf(DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8)), (Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], (0.9, 0.999))), σ = ()), (weight = Leaf(DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8)), (Float32[0.0 0.0 … 0.0 0.0], Float32[0.0 0.0 … 0.0 0.0], (0.9, 0.999))), bias = Leaf(DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8)), (Float32[0.0], Float32[0.0], (0.9, 0.999))), σ = ())),)
+
+julia> st_opt = DistributedUtils.synchronize!!(backend, st_opt; root=0) 
+(layers = ((weight = Leaf(DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8)), (Float32[0.0; 0.0; … ; 0.0; 0.0;;], Float32[0.0; 0.0; … ; 0.0; 0.0;;], (0.9, 0.999))), bias = Leaf(DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8)), (Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], (0.9, 0.999))), σ = ()), (weight = Leaf(DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8)), (Float32[0.0 0.0 … 0.0 0.0], Float32[0.0 0.0 … 0.0 0.0], (0.9, 0.999))), bias = Leaf(DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8)), (Float32[0.0], Float32[0.0], (0.9, 0.999))), σ = ())),)

Now you can define loss and train the model.

julia> loss(model) = mean((model(x) .- y).^2)
+loss (generic function with 1 method)
+
+julia> for epoch in 1:100
+           global model, st_opt
+           l, grad = Zygote.withgradient(loss, model)
+           println("Epoch $epoch: Loss $l")
+           st_opt, model = Optimisers.update(st_opt, model, grad[1])
+         end
+Epoch 1: Loss 0.011638729
+Epoch 2: Loss 0.0116432225
+Epoch 3: Loss 0.012763695
+...

Remember that to run on multiple GPUs you have to launch the script from the command line with mpiexecjl --project=. -n <np> julia <filename>.jl, where <np> is the number of processes that you want to use. The number of processes usually corresponds to the number of GPUs.

By default, the MPI installation that MPI.jl uses is CUDA-unaware, so if you want to run in CUDA-aware mode, read more here on custom installation and rebuilding MPI.jl. Then test whether your MPI is CUDA-aware with

julia> import Pkg
+julia> Pkg.test("MPI"; test_args=["--backend=CUDA"])

If it is, set your local preference as below

julia> using Preferences
+julia> set_preferences!("Flux", "FluxDistributedMPICUDAAware" => true)
Known shortcomings

We don't run CUDA-aware tests, so you use this mode at your own risk.

diff --git a/previews/PR2464/guide/models/basics/index.html b/previews/PR2464/guide/models/basics/index.html index 6d43677259..549db9a6d2 100644 --- a/previews/PR2464/guide/models/basics/index.html +++ b/previews/PR2464/guide/models/basics/index.html @@ -109,4 +109,4 @@ return Affine(W, b) end -Affine(3 => 1, bias=false) |> gpu +Affine(3 => 1, bias=false) |> gpu diff --git a/previews/PR2464/guide/models/custom_layers/index.html b/previews/PR2464/guide/models/custom_layers/index.html index 98163baa70..7f444c274a 100644 --- a/previews/PR2464/guide/models/custom_layers/index.html +++ b/previews/PR2464/guide/models/custom_layers/index.html @@ -104,4 +104,4 @@ # rms over all the mse ŷs = model(x) return sqrt(mean(Flux.mse(y, ŷ) for (y, ŷ) in zip(ys, ŷs))) -end
Note

This Split layer is available from the Fluxperimental.jl package.

+end
Note

This Split layer is available from the Fluxperimental.jl package.

diff --git a/previews/PR2464/guide/models/overview/index.html b/previews/PR2464/guide/models/overview/index.html index 4ed88429cd..4763e98031 100644 --- a/previews/PR2464/guide/models/overview/index.html +++ b/previews/PR2464/guide/models/overview/index.html @@ -56,4 +56,4 @@ julia> y_test 1×5 Matrix{Int64}: - 26 30 34 38 42

The predictions are good. Here's how we got there.

First, we gathered real-world data into the variables x_train, y_train, x_test, and y_test. The x_* data defines inputs, and the y_* data defines outputs. The *_train data is for training the model, and the *_test data is for verifying the model. Our data was based on the function 4x + 2.

Then, we built a single input, single output predictive model, predict = Dense(1 => 1). The initial predictions weren't accurate, because we had not trained the model yet.

After building the model, we trained it with train!(loss, predict, data, opt). The loss function is first, followed by the model itself, the training data, and the Descent optimiser provided by Flux. We ran the training step once, and observed that the parameters changed and the loss went down. Then, we ran the train! many times to finish the training process.

After we trained the model, we checked it against the test data to verify the results.

This overall flow represents how Flux works. Let's drill down a bit to understand what's going on inside the individual layers of Flux.

+ 26 30 34 38 42

The predictions are good. Here's how we got there.

First, we gathered real-world data into the variables x_train, y_train, x_test, and y_test. The x_* data defines inputs, and the y_* data defines outputs. The *_train data is for training the model, and the *_test data is for verifying the model. Our data was based on the function 4x + 2.

Then, we built a single input, single output predictive model, predict = Dense(1 => 1). The initial predictions weren't accurate, because we had not trained the model yet.

After building the model, we trained it with train!(loss, predict, data, opt). The loss function is first, followed by the model itself, the training data, and the Descent optimiser provided by Flux. We ran the training step once, and observed that the parameters changed and the loss went down. Then, we ran the train! many times to finish the training process.

After we trained the model, we checked it against the test data to verify the results.

This overall flow represents how Flux works. Let's drill down a bit to understand what's going on inside the individual layers of Flux.

diff --git a/previews/PR2464/guide/models/quickstart/index.html b/previews/PR2464/guide/models/quickstart/index.html
index 3815659a78..3b91cfbd35 100644
--- a/previews/PR2464/guide/models/quickstart/index.html
+++ b/previews/PR2464/guide/models/quickstart/index.html
@@ -59,4 +59,4 @@
 y_hat = m(x)
 Flux.logitcrossentropy(y_hat, y)
 end
-end
+end
diff --git a/previews/PR2464/guide/models/recurrence/index.html b/previews/PR2464/guide/models/recurrence/index.html
index f10ea14f38..d67f5f1f40 100644
--- a/previews/PR2464/guide/models/recurrence/index.html
+++ b/previews/PR2464/guide/models/recurrence/index.html
@@ -99,4 +99,4 @@
 true

In many situations, such as when dealing with a language model, the sentences in each batch are independent (i.e. the last item of the first sentence of the first batch is independent from the first item of the first sentence of the second batch), so we cannot handle the model as if each batch was the direct continuation of the previous one. To handle such situations, we need to reset the state of the model between each batch, which can be conveniently performed within the loss function:

function loss(x, y)
   Flux.reset!(m)
   sum(mse(m(xi), yi) for (xi, yi) in zip(x, y))
-end

A potential source of ambiguity with RNNs in Flux is the data layout, which differs from some common frameworks where data is typically a 3-dimensional array: (features, seq length, samples). In Flux, those 3 dimensions are provided through a vector of length seq length, where each element is a matrix of size (features, samples).

+end

A potential source of ambiguity with RNNs in Flux is the data layout, which differs from some common frameworks where data is typically a 3-dimensional array: (features, seq length, samples). In Flux, those 3 dimensions are provided through a vector of length seq length, where each element is a matrix of size (features, samples).
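The layout difference described above can be sketched with Base Julia alone. This is only an illustration: the array `x3d`, the sizes, and the variable names are assumptions made for the example, not part of Flux's API.

```julia
# Hypothetical sizes, chosen only for illustration.
features, seqlen, samples = 2, 5, 3

# Layout common in other frameworks: one 3-dimensional array
# indexed as (features, seq length, samples).
x3d = rand(Float32, features, seqlen, samples)

# Layout described above: a vector with one entry per time step,
# each entry a (features, samples) matrix.
xs = [x3d[:, t, :] for t in 1:seqlen]

length(xs)    # number of time steps
size(xs[1])   # (features, samples)
```

Each element of `xs` is then what a recurrent model would receive for one time step.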

diff --git a/previews/PR2464/guide/performance/index.html b/previews/PR2464/guide/performance/index.html
index 77a2142d8e..241a6a0337 100644
--- a/previews/PR2464/guide/performance/index.html
+++ b/previews/PR2464/guide/performance/index.html
@@ -14,4 +14,4 @@
 function loss_total(x_batch::Matrix, y_batch::Matrix)
 y_preds = model(x_batch)
 sum(loss.(y_preds, y_batch))
-end

When doing this kind of concatenation use reduce(hcat, xs) rather than hcat(xs...). This will avoid the splatting penalty, and will hit the optimised reduce method.

Be aware of GPU memory inefficiencies

Currently, GPU memory is not handled as well as system memory. If your training loop is allocating significantly on the GPU, you can quickly fill your GPU memory and the piecemeal reclamation and shuffling of data between GPU and system memory can become extremely slow. If profiling shows that a significant portion of time is spent in the gpu function and your data sizes are not large, this may be the cause. Running an incremental garbage collection manually (GC.gc(false)) at regular intervals can keep your GPU memory free and responsive. See other tips for CUDA memory management here.

+end

When doing this kind of concatenation use reduce(hcat, xs) rather than hcat(xs...). This will avoid the splatting penalty, and will hit the optimised reduce method.
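A minimal sketch of the two spellings, using only Base Julia; the vector `xs` and its sizes are hypothetical and chosen just to show that both forms build the same matrix.

```julia
# Four hypothetical feature vectors of length 3.
xs = [rand(Float32, 3) for _ in 1:4]

a = hcat(xs...)        # splats every element into a separate argument
b = reduce(hcat, xs)   # avoids splatting and uses the specialised reduce method

a == b                 # both build the same 3×4 matrix
```

The difference only matters for performance, and it grows with the length of `xs`.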

Be aware of GPU memory inefficiencies

Currently, GPU memory is not handled as well as system memory. If your training loop is allocating significantly on the GPU, you can quickly fill your GPU memory and the piecemeal reclamation and shuffling of data between GPU and system memory can become extremely slow. If profiling shows that a significant portion of time is spent in the gpu function and your data sizes are not large, this may be the cause. Running an incremental garbage collection manually (GC.gc(false)) at regular intervals can keep your GPU memory free and responsive. See other tips for CUDA memory management here.
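One way to run the incremental collection at regular intervals is sketched below. It is an assumption-laden outline rather than a recommended recipe: `train_step`, `batches`, and the interval `gc_every` are hypothetical placeholders, and the data here lives on the CPU so the snippet runs without a GPU.

```julia
# Placeholder for one training step; purely illustrative.
train_step(batch) = sum(abs2, batch)

batches = [rand(Float32, 4, 8) for _ in 1:20]   # hypothetical data
gc_every = 5                                    # assumed interval, tune as needed

for (i, batch) in enumerate(batches)
    train_step(batch)
    # Every `gc_every` steps, run an incremental (non-full) collection.
    i % gc_every == 0 && GC.gc(false)
end
```

Whether this helps, and which interval works best, depends on the allocation pattern of the real training loop, so it is worth profiling before and after.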

diff --git a/previews/PR2464/guide/saving/index.html b/previews/PR2464/guide/saving/index.html
index aa432af5f0..c9c75b18d1 100644
--- a/previews/PR2464/guide/saving/index.html
+++ b/previews/PR2464/guide/saving/index.html
@@ -59,4 +59,4 @@
 Chain(
 Dense(10 => 5, relu), # 55 parameters
 Dense(5 => 2), # 12 parameters
-) # Total: 4 arrays, 67 parameters, 524 bytes.
Warning

Saving models this way could lead to compatibility issues across Julia versions and across Flux versions if some of the Flux layers' internals are changed. It is therefore not recommended for long-term storage; use Flux.state instead.

+) # Total: 4 arrays, 67 parameters, 524 bytes.
Warning

Saving models this way could lead to compatibility issues across Julia versions and across Flux versions if some of the Flux layers' internals are changed. It is therefore not recommended for long-term storage; use Flux.state instead.

diff --git a/previews/PR2464/guide/training/training/index.html b/previews/PR2464/guide/training/training/index.html
index e426cf05bd..a788f8ada4 100644
--- a/previews/PR2464/guide/training/training/index.html
+++ b/previews/PR2464/guide/training/training/index.html
@@ -118,4 +118,4 @@
 train!(loss, bimodel, data, opt_state)
 # Un-freeze the entire model:
-Flux.thaw!(opt_state)

While adjust! and freeze!/thaw! make temporary modifications to the optimiser state, permanently removing some fields of a new layer type from training is usually done when defining the layer, by calling for example @layer NewLayer trainable=(weight,).

+Flux.thaw!(opt_state)

While adjust! and freeze!/thaw! make temporary modifications to the optimiser state, permanently removing some fields of a new layer type from training is usually done when defining the layer, by calling for example @layer NewLayer trainable=(weight,).

diff --git a/previews/PR2464/index.html b/previews/PR2464/index.html
index 2ad88b197a..82e2adf9dd 100644
--- a/previews/PR2464/index.html
+++ b/previews/PR2464/index.html
@@ -3,4 +3,4 @@
 function gtag(){dataLayer.push(arguments);}
 gtag('js', new Date());
 gtag('config', 'UA-36890222-9', {'page_path': location.pathname + location.search + location.hash});
-

Flux: The Julia Machine Learning Library

Flux is a library for machine learning. It comes "batteries-included" with many useful tools built in, but also lets you use the full power of the Julia language where you need it. We follow a few key principles:

  • Doing the obvious thing. Flux has relatively few explicit APIs. Instead, writing down the mathematical form will work – and be fast.
  • Extensible by default. Flux is written to be highly flexible while being performant. Extending Flux is as simple as using your own code as part of the model you want - it is all high-level Julia code.
  • Play nicely with others. Flux works well with unrelated Julia libraries from images to differential equation solvers, rather than duplicating them.

Installation

Download Julia 1.9 or later, preferably the current stable release. You can add Flux using Julia's package manager, by typing ] add Flux in the Julia prompt. For Nvidia GPU support, you will also need to install the CUDA and the cuDNN packages. For AMD GPU support, install the AMDGPU package. For acceleration on Apple Silicon, install the Metal package.

Learning Flux

The quick start page trains a simple neural network.

The rest of the guide provides a from-scratch introduction to Flux's take on models and how they work, starting with fitting a line. Once you understand these docs, congratulations, you also understand Flux's source code, which is intended to be concise, legible and a good reference for more advanced concepts.

There are some tutorials about building particular models. The model zoo has starting points for many other common ones. And finally, the ecosystem page lists packages which define Flux models.

The reference section includes, beside Flux's own functions, those of some companion packages: Zygote.jl (automatic differentiation), Optimisers.jl (training) and others.

Community

Everyone is welcome to join our community on the Julia Discourse forum, or the Slack chat (channel #machine-learning). If you have questions or issues, we'll try to help you out.

If you're interested in hacking on Flux, the source code is open and easy to understand – it's all just the same Julia code you work with normally. You might be interested in our intro issues to get started, or our contributing guide.

+

Flux: The Julia Machine Learning Library

Flux is a library for machine learning. It comes "batteries-included" with many useful tools built in, but also lets you use the full power of the Julia language where you need it. We follow a few key principles:

  • Doing the obvious thing. Flux has relatively few explicit APIs. Instead, writing down the mathematical form will work – and be fast.
  • Extensible by default. Flux is written to be highly flexible while being performant. Extending Flux is as simple as using your own code as part of the model you want - it is all high-level Julia code.
  • Play nicely with others. Flux works well with unrelated Julia libraries from images to differential equation solvers, rather than duplicating them.

Installation

Download Julia 1.9 or later, preferably the current stable release. You can add Flux using Julia's package manager, by typing ] add Flux in the Julia prompt. For Nvidia GPU support, you will also need to install the CUDA and the cuDNN packages. For AMD GPU support, install the AMDGPU package. For acceleration on Apple Silicon, install the Metal package.

Learning Flux

The quick start page trains a simple neural network.

The rest of the guide provides a from-scratch introduction to Flux's take on models and how they work, starting with fitting a line. Once you understand these docs, congratulations, you also understand Flux's source code, which is intended to be concise, legible and a good reference for more advanced concepts.

There are some tutorials about building particular models. The model zoo has starting points for many other common ones. And finally, the ecosystem page lists packages which define Flux models.

The reference section includes, beside Flux's own functions, those of some companion packages: Zygote.jl (automatic differentiation), Optimisers.jl (training) and others.

Community

Everyone is welcome to join our community on the Julia Discourse forum, or the Slack chat (channel #machine-learning). If you have questions or issues, we'll try to help you out.

If you're interested in hacking on Flux, the source code is open and easy to understand – it's all just the same Julia code you work with normally. You might be interested in our intro issues to get started, or our contributing guide.

diff --git a/previews/PR2464/objects.inv b/previews/PR2464/objects.inv
index 6beb0fd889749870e75100b723b4bb6c9b688093..e4d1e7e1bab223a6f12a26a24e5f759cd7e3feb2 100644
GIT binary patch
delta 12
TcmaE@|5|^79i!z&`!(VKBrgQ;

delta 12
TcmaE@|5|^79i#b1`!(VKBqapy

diff --git a/previews/PR2464/reference/data/mlutils/index.html b/previews/PR2464/reference/data/mlutils/index.html
index 39916e947d..869f930b78 100644
--- a/previews/PR2464/reference/data/mlutils/index.html
+++ b/previews/PR2464/reference/data/mlutils/index.html
@@ -570,4 +570,4 @@
 julia> zeros_like(x, Float64)
 2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
 0.0 0.0
- 0.0 0.0
source
+ 0.0 0.0
source
diff --git a/previews/PR2464/reference/data/onehot/index.html b/previews/PR2464/reference/data/onehot/index.html
index 2bb3844d60..255f43a2b1 100644
--- a/previews/PR2464/reference/data/onehot/index.html
+++ b/previews/PR2464/reference/data/onehot/index.html
@@ -73,4 +73,4 @@
 3 6 15 3 9 3 12 3 6 15 3
source
OneHotArrays.OneHotArrayType
OneHotArray{T, N, M, I} <: AbstractArray{Bool, M}
 OneHotArray(indices, L)

A one-hot M-dimensional array with L labels (i.e. size(A, 1) == L and sum(A, dims=1) == 1) stored as a compact N == M-1-dimensional array of indices.

Typically constructed by onehot and onehotbatch. Parameter I is the type of the underlying storage, and T its eltype.

source
OneHotArrays.OneHotVectorType
OneHotVector{T} = OneHotArray{T, 0, 1, T}
 OneHotVector(indices, L)

A one-hot vector with L labels (i.e. length(A) == L and count(A) == 1) typically constructed by onehot. Stored efficiently as a single index of type T, usually UInt32.

source
OneHotArrays.OneHotMatrixType
OneHotMatrix{T, I} = OneHotArray{T, 1, 2, I}
-OneHotMatrix(indices, L)

A one-hot matrix (with L labels) typically constructed using onehotbatch. Stored efficiently as a vector of indices with type I and eltype T.

source
+OneHotMatrix(indices, L)

A one-hot matrix (with L labels) typically constructed using onehotbatch. Stored efficiently as a vector of indices with type I and eltype T.

source
diff --git a/previews/PR2464/reference/destructure/index.html b/previews/PR2464/reference/destructure/index.html
index 4e5c99fbfb..98a69d0bb4 100644
--- a/previews/PR2464/reference/destructure/index.html
+++ b/previews/PR2464/reference/destructure/index.html
@@ -106,7 +106,7 @@
 L2 (generic function with 1 method)
 julia> L2(m2) isa Float32
-true
source

Save and Load

Flux.stateFunction
state(x)

Return an object with the same nested structure as x according to Functors.children, but made only of basic containers (e.g. named tuples, tuples, arrays, and dictionaries).

Besides trainable and non-trainable arrays, the state will contain leaf nodes that are not arrays, such as numbers, symbols, strings, and nothing values. The leaf types that end up in the state could increase in the future.

This method is particularly useful for saving and loading models, since the state contains only simple data types that can be easily serialized.

The state can be passed to loadmodel! to restore the model.

Examples

Copy the state into another model

julia> m1 = Chain(Dense(1, 2, tanh; init=ones), Dense(2, 1; init=ones));
+true
source

Save and Load

Flux.stateFunction
state(x)

Return an object with the same nested structure as x according to Functors.children, but made only of basic containers (e.g. named tuples, tuples, arrays, and dictionaries).

Besides trainable and non-trainable arrays, the state will contain leaf nodes that are not arrays, such as numbers, symbols, strings, and nothing values. The leaf types that end up in the state could increase in the future.

This method is particularly useful for saving and loading models, since the state contains only simple data types that can be easily serialized.

The state can be passed to loadmodel! to restore the model.

Examples

Copy the state into another model

julia> m1 = Chain(Dense(1, 2, tanh; init=ones), Dense(2, 1; init=ones));
 
 julia> s = Flux.state(m1)
 (layers = ((weight = [1.0; 1.0;;], bias = [0.0, 0.0], σ = ()), (weight = [1.0 1.0], bias = [0.0], σ = ())),)
@@ -132,7 +132,7 @@
 
 julia> JLD2.jldsave("checkpoint.jld2", model_state = s)
 
-julia> Flux.loadmodel!(m2, JLD2.load("checkpoint.jld2", "model_state"))
source
Flux.loadmodel!Function
loadmodel!(dst, src)

Copy all the parameters (trainable and non-trainable) from src into dst.

Recursively walks dst and src together using Functors.children, and calling copyto! on parameter arrays or throwing an error when there is a mismatch. Non-array elements (such as activation functions) are not copied and need not match. Zero bias vectors and bias=false are considered equivalent (see extended help for more details).

See also Flux.state.

Examples

julia> dst = Chain(Dense(Flux.ones32(2, 5), Flux.ones32(2), tanh), Dense(2 => 1; bias = [1f0]))
+julia> Flux.loadmodel!(m2, JLD2.load("checkpoint.jld2", "model_state"))
source
Flux.loadmodel!Function
loadmodel!(dst, src)

Copy all the parameters (trainable and non-trainable) from src into dst.

Recursively walks dst and src together using Functors.children, and calling copyto! on parameter arrays or throwing an error when there is a mismatch. Non-array elements (such as activation functions) are not copied and need not match. Zero bias vectors and bias=false are considered equivalent (see extended help for more details).

See also Flux.state.

Examples

julia> dst = Chain(Dense(Flux.ones32(2, 5), Flux.ones32(2), tanh), Dense(2 => 1; bias = [1f0]))
 Chain(
   Dense(5 => 2, tanh),                  # 12 parameters
   Dense(2 => 1),                        # 3 parameters
@@ -149,7 +149,7 @@
 false
 
 julia> iszero(dst[2].bias)
-true

Extended help

Throws an error when:

  • dst and src do not share the same fields (at any level)
  • the sizes of leaf nodes are mismatched between dst and src
  • copying non-array values to/from an array parameter (except inactive parameters described below)
  • dst is a "tied" parameter (i.e. refers to another parameter) and loaded into multiple times with mismatched source values

Inactive parameters can be encoded by using the boolean value false instead of an array. If dst == false and src is an all-zero array, no error will be raised (and no values copied); however, attempting to copy a non-zero array to an inactive parameter will throw an error. Likewise, copying a src value of false to any dst array is valid, but copying a src value of true will error.

source

KeyPath

Functors.KeyPathType
KeyPath(keys...)

A type for representing a path of keys to a value in a nested structure. Can be constructed with a sequence of keys, or by concatenating other KeyPaths. Keys can be of type Symbol, String, or Int.

For custom types, access through symbol keys is assumed to be done with getproperty. For consistency, the method Base.propertynames is used to get the viable property names.

For string and integer keys instead, the access is done with getindex.

See also getkeypath, haskeypath.

Examples

julia> kp = KeyPath(:b, 3)
+true

Extended help

Throws an error when:

  • dst and src do not share the same fields (at any level)
  • the sizes of leaf nodes are mismatched between dst and src
  • copying non-array values to/from an array parameter (except inactive parameters described below)
  • dst is a "tied" parameter (i.e. refers to another parameter) and loaded into multiple times with mismatched source values

Inactive parameters can be encoded by using the boolean value false instead of an array. If dst == false and src is an all-zero array, no error will be raised (and no values copied); however, attempting to copy a non-zero array to an inactive parameter will throw an error. Likewise, copying a src value of false to any dst array is valid, but copying a src value of true will error.

source

KeyPath

Functors.KeyPathType
KeyPath(keys...)

A type for representing a path of keys to a value in a nested structure. Can be constructed with a sequence of keys, or by concatenating other KeyPaths. Keys can be of type Symbol, String, or Int.

For custom types, access through symbol keys is assumed to be done with getproperty. For consistency, the method Base.propertynames is used to get the viable property names.

For string and integer keys instead, the access is done with getindex.

See also getkeypath, haskeypath.

Examples

julia> kp = KeyPath(:b, 3)
 KeyPath(:b, 3)
 
 julia> KeyPath(:a, kp, :c, 4) # construct mixing keys and keypaths
@@ -196,4 +196,4 @@
 true
 
 julia> haskeypath(x, KeyPath(:b, "d", 4))
-false
source
+false
source
diff --git a/previews/PR2464/reference/models/activation/index.html b/previews/PR2464/reference/models/activation/index.html
index d91a31ab68..7bb849d2b8 100644
--- a/previews/PR2464/reference/models/activation/index.html
+++ b/previews/PR2464/reference/models/activation/index.html
@@ -17,7 +17,7 @@
 ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
 julia> celu(-10f0)
--0.9999546f0
source
NNlib.eluFunction
elu(x, α=1) = x > 0 ? x : α * (exp(x) - 1)

Exponential Linear Unit activation function. See "Fast and Accurate Deep Network Learning by Exponential Linear Units". You can also specify the coefficient explicitly, e.g. elu(x, 1).

julia> lineplot(elu, -2, 2, height=7)
+-0.9999546f0
source
NNlib.eluFunction
elu(x, α=1) = x > 0 ? x : α * (exp(x) - 1)

Exponential Linear Unit activation function. See "Fast and Accurate Deep Network Learning by Exponential Linear Units". You can also specify the coefficient explicitly, e.g. elu(x, 1).

julia> lineplot(elu, -2, 2, height=7)
            ┌────────────────────────────────────────┐       
          2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠤⠒⠉│ elu(x)
            │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠔⠊⠉⠀⠀⠀⠀│       
@@ -34,7 +34,7 @@
 -0.9999546f0
 
 julia> elu(-10f0, 2)
--1.9999092f0
source
NNlib.geluFunction
gelu(x) = 0.5x * (1 + tanh(√(2/π) * (x + 0.044715x^3)))

Activation function from "Gaussian Error Linear Units".

julia> lineplot(gelu, -2, 2, height=7)
+-1.9999092f0
source
NNlib.geluFunction
gelu(x) = 0.5x * (1 + tanh(√(2/π) * (x + 0.044715x^3)))

Activation function from "Gaussian Error Linear Units".

julia> lineplot(gelu, -2, 2, height=7)
            ┌────────────────────────────────────────┐        
          2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠔⠊│ gelu(x)
            │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔⠊⠁⠀⠀⠀│        
@@ -60,7 +60,7 @@
         -0.2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠓⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢠⠇⠀⠀⠀│         
              └────────────────────────────────────────┘         
              ⠀-5⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀0⠀         
-             ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀         
source
NNlib.hardsigmoidFunction
hardσ(x) = max(0, min(1, (x + 3) / 6))

Piecewise linear approximation of sigmoid.

julia> lineplot(hardsigmoid, -5, 5, height=7)
+             ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀         
source
NNlib.hardsigmoidFunction
hardσ(x) = max(0, min(1, (x + 3) / 6))

Piecewise linear approximation of sigmoid.

julia> lineplot(hardsigmoid, -5, 5, height=7)
           ┌────────────────────────────────────────┐         
         1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⢀⡠⠖⠋⠉⠉⠉⠉⠉⠉⠉⠉│ hardσ(x)
           │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⣀⡤⠒⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│         
@@ -84,7 +84,7 @@
         0 │⣀⣀⣀⣀⣀⣀⣀⠤⠤⠤⠒⠊⠉⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│     
           └────────────────────────────────────────┘     
           ⠀-5⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀5⠀     
-          ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀     
source
NNlib.hardswishFunction
hardswish(x) = x * hardσ(x)

Hard-Swish activation function. See "Searching for MobileNetV3".

julia> lineplot(hardswish, -2, 5, height = 7)
+          ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀     
source
NNlib.hardswishFunction
hardswish(x) = x * hardσ(x)

Hard-Swish activation function. See "Searching for MobileNetV3".

julia> lineplot(hardswish, -2, 5, height = 7)
            ┌────────────────────────────────────────┐             
          5 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠔⠒⠉│ hardswish(x)
            │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡠⠔⠒⠉⠁⠀⠀⠀⠀│             
@@ -114,7 +114,7 @@
 
 julia> hardswish.(-5:5)'
 1×11 adjoint(::Vector{Float64}) with eltype Float64:
- -0.0  -0.0  -0.0  -0.333333  -0.333333  0.0  0.666667  1.66667  3.0  4.0  5.0
source
NNlib.hardtanhFunction
hardtanh(x) = max(-1, min(1, x))

Segment-wise linear approximation of tanh, much cheaper to compute. See "Large Scale Machine Learning".

See also tanh_fast.

julia> lineplot(hardtanh, -2, 2, height=7)
+ -0.0  -0.0  -0.0  -0.333333  -0.333333  0.0  0.666667  1.66667  3.0  4.0  5.0
source
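The hardσ and hardswish formulas above are simple enough to restate in plain Julia and check by hand. The sketch below does not use NNlib; the `my_`-prefixed names are hypothetical and only mirror the formulas in the docstrings, not NNlib's actual implementation.

```julia
# Plain-Julia restatements of the two docstring formulas.
# The `my_` names are hypothetical, introduced only for this sketch.
my_hardσ(x) = max(0, min(1, (x + 3) / 6))
my_hardswish(x) = x * my_hardσ(x)

my_hardswish(3)     # 3.0, since my_hardσ(3) == 1
my_hardswish(-3)    # -0.0, since my_hardσ(-3) == 0
my_hardswish(1)     # 1 * 4/6 ≈ 0.666667
```

These values agree with the corresponding entries of the `hardswish.(-5:5)'` output shown in the docstring.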
NNlib.hardtanhFunction
hardtanh(x) = max(-1, min(1, x))

Segment-wise linear approximation of tanh, much cheaper to compute. See "Large Scale Machine Learning".

See also tanh_fast.

julia> lineplot(hardtanh, -2, 2, height=7)
            ┌────────────────────────────────────────┐            
          1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⣀⠔⠋⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉│ hardtanh(x)
            │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⣀⡤⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│            
@@ -138,7 +138,7 @@
         -1 │⣀⣀⣀⡠⠤⠤⠤⠖⠒⠊⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│        
            └────────────────────────────────────────┘        
            ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀        
-           ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀        
source
NNlib.leakyreluFunction
leakyrelu(x, a=0.01) = max(a*x, x)

Leaky Rectified Linear Unit activation function. You can also specify the coefficient explicitly, e.g. leakyrelu(x, 0.01).

julia> lineplot(x -> leakyrelu(x, 0.5), -2, 2, height=7)
+           ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀        
source
NNlib.leakyreluFunction
leakyrelu(x, a=0.01) = max(a*x, x)

Leaky Rectified Linear Unit activation function. You can also specify the coefficient explicitly, e.g. leakyrelu(x, 0.01).

julia> lineplot(x -> leakyrelu(x, 0.5), -2, 2, height=7)
            ┌────────────────────────────────────────┐       
          2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠤⠒⠉│ #42(x)
            │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠔⠊⠉⠀⠀⠀⠀│       
@@ -155,7 +155,7 @@
 -2.0f0
 
 julia> leakyrelu(-10f0, 0.02)
--0.5f0
source
NNlib.lishtFunction
lisht(x) = x * tanh(x)

Activation function from "LiSHT: Non-Parametric Linearly Scaled Hyperbolic Tangent ..."

julia> lineplot(lisht, -2, 2, height=7)
+-0.2f0
source
NNlib.lishtFunction
lisht(x) = x * tanh(x)

Activation function from "LiSHT: Non-Parametric Linearly Scaled Hyperbolic Tangent ..."

julia> lineplot(lisht, -2, 2, height=7)
           ┌────────────────────────────────────────┐         
         2 │⠢⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔│ lisht(x)
           │⠀⠈⠑⢦⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡤⠊⠁⠀│         
@@ -179,7 +179,7 @@
         0 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠓⠪⠷⣦⣄⣀⣀⣇⣀⣀⣤⠶⠕⠒⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│           
           └────────────────────────────────────────┘           
           ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀           
-          ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀           
source
NNlib.logcoshFunction
logcosh(x)

Return log(cosh(x)) which is computed in a numerically stable way.

julia> lineplot(logcosh, -5, 5, height=7)
+          ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀           
source
NNlib.logcoshFunction
logcosh(x)

Return log(cosh(x)) which is computed in a numerically stable way.

julia> lineplot(logcosh, -5, 5, height=7)
           ┌────────────────────────────────────────┐           
         5 │⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ logcosh(x)
           │⠉⠢⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔⠋│           
@@ -190,7 +190,7 @@
         0 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠑⠢⢄⣀⣀⣇⣀⡠⠔⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│           
           └────────────────────────────────────────┘           
           ⠀-5⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀5⠀           
-          ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀           
source
NNlib.logsigmoidFunction
logσ(x)

Return log(σ(x)) which is computed in a numerically stable way.

julia> lineplot(logsigmoid, -5, 5, height=7)
+          ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀           
source
NNlib.logsigmoidFunction
logσ(x)

Return log(σ(x)) which is computed in a numerically stable way.

julia> lineplot(logsigmoid, -5, 5, height=7)
            ┌────────────────────────────────────────┐        
          0 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡧⠤⠔⠒⠒⠒⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉│ logσ(x)
            │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠖⠊⠉⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│        
@@ -201,7 +201,7 @@
         -6 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│        
            └────────────────────────────────────────┘        
            ⠀-5⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀5⠀        
-           ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀        
source
NNlib.mishFunction
mish(x) = x * tanh(softplus(x))

Activation function from "Mish: A Self Regularized Non-Monotonic Neural Activation Function".

julia> lineplot(mish, -5, 5, height=7)
+           ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀        
source
NNlib.mishFunction
mish(x) = x * tanh(softplus(x))

Activation function from "Mish: A Self Regularized Non-Monotonic Neural Activation Function".

julia> lineplot(mish, -5, 5, height=7)
            ┌────────────────────────────────────────┐        
          5 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠖⠋│ mish(x)
            │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠒⠁⠀⠀⠀│        
@@ -212,7 +212,7 @@
         -1 │⠀⠀⠀⠀⠀⠀⠀⠀⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│        
            └────────────────────────────────────────┘        
            ⠀-5⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀5⠀        
-           ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀        
source
NNlib.reluFunction
relu(x) = max(0, x)

Rectified Linear Unit activation function.

julia> lineplot(relu, -2, 2, height=7)
+           ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀        
source
NNlib.reluFunction
relu(x) = max(0, x)

Rectified Linear Unit activation function.

julia> lineplot(relu, -2, 2, height=7)
           ┌────────────────────────────────────────┐        
         2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔⠋│ relu(x)
           │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠊⠁⠀⠀│        
@@ -223,7 +223,7 @@
         0 │⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣇⠔⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│        
           └────────────────────────────────────────┘        
           ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀        
-          ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀        
source
NNlib.relu6Function
relu6(x) = min(max(0, x), 6)

Rectified Linear Unit activation function capped at 6. See "Convolutional Deep Belief Networks on CIFAR-10".

julia> lineplot(relu6, -10, 10, height=7)
+          ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀        
source
NNlib.relu6Function
relu6(x) = min(max(0, x), 6)

Rectified Linear Unit activation function capped at 6. See "Convolutional Deep Belief Networks on CIFAR-10".

julia> lineplot(relu6, -10, 10, height=7)
           ┌────────────────────────────────────────┐         
         6 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠎⠉⠉⠉⠉⠉⠉⠉⠉│ relu6(x)
           │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⢀⡔⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀│         
@@ -234,7 +234,7 @@
         0 │⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⡧⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│         
           └────────────────────────────────────────┘         
           ⠀-10⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀10⠀         
-          ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀         
source
NNlib.rreluFunction
rrelu(x, lo=1/8, hi=1/3) = max(a*x, x)
+          ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀         
source
NNlib.rreluFunction
rrelu(x, lo=1/8, hi=1/3) = max(a*x, x)
 # where `a` is randomly sampled from uniform distribution `U(lo, hi)`

Randomized Leaky Rectified Linear Unit activation function. See "Empirical Evaluation of Rectified Activations". You can also specify the bounds explicitly, e.g. rrelu(x, 0.0, 1.0).

julia> lineplot(rrelu, -20, 10, height=7)
             ┌────────────────────────────────────────┐         
          10 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠖⠋│ rrelu(x)
@@ -249,7 +249,7 @@
             ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀         
 
 julia> extrema(rrelu.(fill(-10f0, 1000)))
-(-3.3316886f0, -1.2548422f0)
source
NNlib.seluFunction
selu(x) = λ * (x ≥ 0 ? x : α * (exp(x) - 1))
+(-3.3316886f0, -1.2548422f0)
source
NNlib.seluFunction
selu(x) = λ * (x ≥ 0 ? x : α * (exp(x) - 1))
 
 λ ≈ 1.05070...
 α ≈ 1.67326...

Scaled exponential linear units. See "Self-Normalizing Neural Networks".

julia> lineplot(selu, -3, 2, height=7)
@@ -266,7 +266,7 @@
            ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀        
 
 julia> selu(-10f0)
--1.7580194f0
source
NNlib.sigmoidFunction
σ(x) = 1 / (1 + exp(-x))

Classic sigmoid activation function. Unicode σ can be entered as \sigma then tab, in many editors. The ascii name sigmoid is also exported.

See also sigmoid_fast.

julia> using UnicodePlots
+-1.7580194f0
source
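For very negative inputs the selu formula above saturates at -λ * α, which is where the value printed for selu(-10f0) comes from. The sketch below spells this out in plain Julia; selu_sketch is a hypothetical name, and the full-precision constants are taken from the "Self-Normalizing Neural Networks" paper rather than from NNlib's source.

```julia
# Hypothetical sketch of the formula above; not NNlib's implementation.
function selu_sketch(x::Real)
    T = typeof(float(x))
    λ = T(1.0507009873554805)   # scale constant from the SELU paper
    α = T(1.6732632423543772)   # negative-branch constant from the SELU paper
    return λ * (x ≥ 0 ? T(x) : α * (exp(T(x)) - 1))
end

selu_sketch(-10f0)   # close to -λ*α ≈ -1.758 for very negative inputs
selu_sketch(2f0)     # positive inputs are simply scaled by λ
```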
NNlib.sigmoidFunction
σ(x) = 1 / (1 + exp(-x))

Classic sigmoid activation function. Unicode σ can be entered as \sigma then tab, in many editors. The ascii name sigmoid is also exported.

See also sigmoid_fast.

julia> using UnicodePlots
 
 julia> lineplot(sigmoid, -5, 5, height=7)
           ┌────────────────────────────────────────┐     
@@ -282,14 +282,14 @@
           ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀     
 
 julia> sigmoid === σ
-true
source
NNlib.sigmoid_fastFunction
sigmoid_fast(x)

This is a faster, and very slightly less accurate, version of sigmoid. For x::Float32, it is perhaps 3 times faster, with a maximum error of 2 eps instead of 1.

See also tanh_fast.

julia> sigmoid(0.2f0)
+true
source
NNlib.sigmoid_fastFunction
sigmoid_fast(x)

This is a faster, and very slightly less accurate, version of sigmoid. For x::Float32, it is perhaps 3 times faster, with a maximum error of 2 eps instead of 1.

See also tanh_fast.

julia> sigmoid(0.2f0)
 0.54983395f0
 
 julia> sigmoid_fast(0.2f0)
 0.54983395f0
 
 julia> hardσ(0.2f0)
-0.53333336f0
source
NNlib.softplusFunction
softplus(x) = log(exp(x) + 1)

See "Deep Sparse Rectifier Neural Networks", JMLR 2011.

julia> lineplot(softplus, -3, 3, height=7)
+0.53333336f0
source
NNlib.softplusFunction
softplus(x) = log(exp(x) + 1)

See "Deep Sparse Rectifier Neural Networks", JMLR 2011.

julia> lineplot(softplus, -3, 3, height=7)
           ┌────────────────────────────────────────┐            
         4 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ softplus(x)
           │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠│            
@@ -316,7 +316,7 @@
           ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀            
 
 julia> softplus(16f0)
-16.0f0
source
NNlib.softshrinkFunction
softshrink(x, λ=0.5) =
+16.0f0
source
NNlib.softshrinkFunction
softshrink(x, λ=0.5) =
     (x ≥ λ ? x - λ : (-λ ≥ x ? x + λ : 0))

See "Softshrink Activation Function".

julia> lineplot(softshrink, -2, 2, height=7)
            ┌────────────────────────────────────────┐              
          2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀│ softshrink(x)
@@ -344,7 +344,7 @@
            ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀
 
 julia> softshrink.((-10f0, 10f0))
-(-9.5f0, 9.5f0)
source
NNlib.softsignFunction
softsign(x) = x / (1 + |x|)

See "Quadratic Polynomials Learn Better Image Features" (2009).

julia> lineplot(softsign, -5, 5, height=7)
+(-9.5f0, 9.5f0)
source
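The piecewise softshrink definition above can be checked with a few lines of plain Julia. This is only a sketch of the formula under the documented default λ = 0.5; softshrink_sketch is a hypothetical name, not the NNlib function.

```julia
# Hypothetical sketch of the piecewise formula above; not NNlib's implementation.
function softshrink_sketch(x::Real, λ = oftype(float(x), 0.5))
    x ≥ λ  && return x - λ      # shift large positive inputs towards zero
    x ≤ -λ && return x + λ      # shift large negative inputs towards zero
    return zero(x)              # everything strictly inside (-λ, λ) becomes zero
end

softshrink_sketch.((-10f0, 0.2f0, 10f0))
```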
NNlib.softsignFunction
softsign(x) = x / (1 + |x|)

See "Quadratic Polynomials Learn Better Image Features" (2009).

julia> lineplot(softsign, -5, 5, height=7)
            ┌────────────────────────────────────────┐            
          1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⣀⣀⣀⣀⠤⠤⠤⠤⠤│ softsign(x)
            │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⣀⡤⠖⠒⠋⠉⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀│            
@@ -374,7 +374,7 @@
 0.5f0
 
 julia> softsign(100f0)
-0.990099f0
source
NNlib.swishFunction
swish(x) = x * σ(x)

Self-gated activation function. See "Swish: a Self-Gated Activation Function".

julia> lineplot(swish, -2, 2, height=7)
+0.990099f0
source
NNlib.swishFunction
swish(x) = x * σ(x)

Self-gated activation function. See "Swish: a Self-Gated Activation Function".

julia> lineplot(swish, -2, 2, height=7)
            ┌────────────────────────────────────────┐         
          2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤│ swish(x)
            │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠖⠋⠁⠀│         
@@ -385,7 +385,7 @@
         -1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│         
            └────────────────────────────────────────┘         
            ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀         
-           ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀         
source
NNlib.tanhshrinkFunction
tanhshrink(x) = x - tanh(x)

See "Tanhshrink Activation Function".

julia> lineplot(tanhshrink, -3, 3, height=7)
+           ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀         
source
NNlib.tanhshrinkFunction
tanhshrink(x) = x - tanh(x)

See "Tanhshrink Activation Function".

julia> lineplot(tanhshrink, -3, 3, height=7)
            ┌────────────────────────────────────────┐              
          3 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ tanhshrink(x)
            │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡠⠤⠖⠊│              
@@ -399,14 +399,14 @@
            ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀              
 
 julia> tanhshrink.((-10f0, 10f0))
-(-9.0f0, 9.0f0)
source
NNlib.tanh_fastFunction
tanh_fast(x)

This is a faster but slightly less accurate version of tanh.

Where Julia's tanh function has an error under 2 eps, this may be wrong by 5 eps, a reduction by less than one decimal digit.

For x::Float32 this is usually about 10 times faster, with a smaller speedup for x::Float64. For any other number types, it just calls tanh.

See also sigmoid_fast.

julia> tanh(0.5f0)
+(-9.0f0, 9.0f0)
source
NNlib.tanh_fastFunction
tanh_fast(x)

This is a faster but slightly less accurate version of tanh.

Where Julia's tanh function has an error under 2 eps, this may be wrong by 5 eps, a reduction by less than one decimal digit.

For x::Float32 this is usually about 10 times faster, with a smaller speedup for x::Float64. For any other number types, it just calls tanh.

See also sigmoid_fast.

julia> tanh(0.5f0)
 0.46211717f0
 
 julia> tanh_fast(0.5f0)
 0.46211714f0
 
 julia> hard_tanh(0.5f0)
-0.5f0
source
NNlib.treluFunction
trelu(x, theta=1) = x > theta ? x : 0

Threshold gated rectified linear activation function. See "Zero-bias autoencoders and the benefits of co-adapting features".

julia> lineplot(trelu, -2, 4, height=7)
+0.5f0
source
NNlib.treluFunction
trelu(x, theta=1) = x > theta ? x : 0

Threshold gated rectified linear activation function. See "Zero-bias autoencoders and the benefits of co-adapting features".

julia> lineplot(trelu, -2, 4, height=7)
           ┌────────────────────────────────────────┐         
         4 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠖⠋│ trelu(x)
           │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠖⠋⠁⠀⠀⠀│         
@@ -417,7 +417,7 @@
         0 │⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣇⣀⣀⣀⣀⣀⣀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│         
           └────────────────────────────────────────┘         
           ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀4⠀         
-          ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀         
source

One More

Julia's Base.Math also provides tanh, which can be used as an activation function.

Note that many Flux layers will automatically replace this with NNlib.tanh_fast when called, as Base's tanh is slow enough to sometimes be a bottleneck.

julia> using UnicodePlots
+          ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀         
source
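One detail of the trelu formula above is easy to miss: the comparison is strict, so an input exactly equal to theta maps to zero. A minimal plain-Julia sketch (trelu_sketch is a hypothetical name, not NNlib's implementation):

```julia
# Hypothetical sketch of trelu(x, theta=1) = x > theta ? x : 0.
trelu_sketch(x::Real, theta = one(x)) = x > theta ? x : zero(x)

# Inputs at or below the threshold are zeroed; inputs above it pass through unchanged.
trelu_sketch.(Float32[-2, 0.5, 1, 1.5, 4])
```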

One More

Julia's Base.Math also provides tanh, which can be used as an activation function.

Note that many Flux layers will automatically replace this with NNlib.tanh_fast when called, as Base's tanh is slow enough to sometimes be a bottleneck.

julia> using UnicodePlots
 
 julia> lineplot(tanh, -3, 3, height=7)
            ┌────────────────────────────────────────┐        
@@ -430,4 +430,4 @@
         -1 │⣀⣀⣀⣀⣀⣀⣀⣀⣀⡤⠤⠔⠒⠉⠁⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│        
            └────────────────────────────────────────┘        
            ⠀-3⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀3⠀        
-           ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀        
+ ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ diff --git a/previews/PR2464/reference/models/functors/index.html b/previews/PR2464/reference/models/functors/index.html index 04eebcab4f..3543b77b0d 100644 --- a/previews/PR2464/reference/models/functors/index.html +++ b/previews/PR2464/reference/models/functors/index.html @@ -23,7 +23,7 @@ Dense(2 => 1, tanh), # 3 parameters Dense(1 => 1; bias=false), # 1 parameters Dropout(0.4), -) # Total: 3 arrays, 4 parameters, 224 bytes.source
Functors.@functorMacro
@functor T
+)                   # Total: 3 arrays, 4 parameters, 224 bytes.
source
Functors.@functorMacro
@functor T
 @functor T (x,)

Adds methods to functor allowing recursion into objects of type T, and reconstruction. Assumes that T has a constructor accepting all of its fields, which is true unless you have provided an inner constructor which does not.

By default all fields of T are considered children; this can be restricted by providing a tuple of field names.

Examples

julia> struct Foo; x; y; end
 
 julia> @functor Foo
@@ -193,7 +193,7 @@
 julia> m.bias
 2-element Vector{Float32}:
  0.0
- 0.0
source
Flux.gpuMethod
gpu(m)

Copies m to the current GPU device (using current GPU backend), if one is available. If no GPU is available, it does nothing (but prints a warning the first time).

On arrays, this calls CUDA's cu, which also changes arrays with Float64 elements to Float32 while copying them to the device (same for AMDGPU). To act on arrays within a struct, the struct type must be marked with @functor.

Use cpu to copy back to ordinary Arrays. See also f32 and f16 to change element type only.

See the CUDA.jl docs to help identify the current device.

Example

julia> m = Dense(rand(2, 3))  # constructed with Float64 weight matrix
+ 0.0
source
Flux.gpuMethod
gpu(m)

Copies m to the current GPU device (using current GPU backend), if one is available. If no GPU is available, it does nothing (but prints a warning the first time).

On arrays, this calls CUDA's cu, which also changes arrays with Float64 elements to Float32 while copying them to the device (same for AMDGPU). To act on arrays within a struct, the struct type must be marked with @functor.

Use cpu to copy back to ordinary Arrays. See also f32 and f16 to change element type only.

See the CUDA.jl docs to help identify the current device.

Example

julia> m = Dense(rand(2, 3))  # constructed with Float64 weight matrix
 Dense(3 => 2)       # 8 parameters
 
 julia> typeof(m.weight)
@@ -203,7 +203,7 @@
 Dense(3 => 2)       # 8 parameters
 
 julia> typeof(m_gpu.weight)
-CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}
source
Flux.gpuMethod
gpu(data::DataLoader)
+CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}
source
Flux.gpuMethod
gpu(data::DataLoader)
 cpu(data::DataLoader)

Transforms a given DataLoader to apply gpu or cpu to each batch of data, when iterated over. (If no GPU is available, this does nothing.)

Example

julia> dl = Flux.DataLoader((x = ones(2,10), y='a':'j'), batchsize=3)
 4-element DataLoader(::NamedTuple{(:x, :y), Tuple{Matrix{Float64}, StepRange{Char, Int64}}}, batchsize=3)
   with first element:
@@ -223,4 +223,4 @@
  1.0  1.0  1.0

For large datasets, this is preferred over moving all the data to the GPU before creating the DataLoader, like this:

julia> Flux.DataLoader((x = ones(2,10), y=2:11) |> gpu, batchsize=3)
 4-element DataLoader(::NamedTuple{(:x, :y), Tuple{CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, UnitRange{Int64}}}, batchsize=3)
   with first element:
-  (; x = 2×3 CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, y = 3-element UnitRange{Int64})
Warning

This only works if gpu is applied directly to the DataLoader. While gpu acts recursively on Flux models and many basic Julia structs, it will not work on (say) a tuple of DataLoaders.

source
+ (; x = 2×3 CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, y = 3-element UnitRange{Int64})
Warning

This only works if gpu is applied directly to the DataLoader. While gpu acts recursively on Flux models and many basic Julia structs, it will not work on (say) a tuple of DataLoaders.

source
diff --git a/previews/PR2464/reference/models/layers/index.html b/previews/PR2464/reference/models/layers/index.html index 3356427e59..02bf2516c2 100644 --- a/previews/PR2464/reference/models/layers/index.html +++ b/previews/PR2464/reference/models/layers/index.html @@ -23,7 +23,7 @@ julia> Flux.trainables(model2) # no trainable bias 1-element Vector{AbstractArray}: - [1.0 1.0 … 1.0 1.0; 1.0 1.0 … 1.0 1.0]source
Flux.BilinearType
Bilinear((in1, in2) => out, σ=identity; bias=true, init=glorot_uniform)
+ [1.0 1.0 … 1.0 1.0; 1.0 1.0 … 1.0 1.0]
source
Flux.BilinearType
Bilinear((in1, in2) => out, σ=identity; bias=true, init=glorot_uniform)
 Bilinear(W::AbstractArray, [bias, σ])

Creates a layer which is fully connected between two inputs and the output, and otherwise similar to Dense. Its output, given vectors x & y, is another vector z with, for all i ∈ 1:out:

z[i] = σ(x' * W[i,:,:] * y + bias[i])

If x and y are matrices, then each column of the output z = B(x, y) is of this form, with B the Bilinear layer.

If the second input y is not given, it is taken to be equal to x, i.e. B(x) == B(x, x)

The two inputs may also be provided as a tuple, B((x, y)) == B(x, y), which is accepted as the input to a Chain.

If the two input sizes are the same, in1 == in2, then you may write Bilinear(in => out, σ).

The initialisation works as for the Dense layer, with W = init(out, in1, in2). By default the bias vector is zeros(Float32, out); the option bias=false will switch off the trainable bias. Either of these may be provided explicitly.
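The per-output formula z[i] = σ(x' * W[i,:,:] * y + bias[i]) can be written as a short reference loop. The sketch below uses only the LinearAlgebra standard library and handles single vectors, not batches; bilinear_sketch is a hypothetical helper, not how Flux implements the layer.

```julia
using LinearAlgebra

# Hypothetical reference implementation of z[i] = σ(x' * W[i,:,:] * y + bias[i]).
function bilinear_sketch(W::AbstractArray{<:Real,3}, bias, x::AbstractVector,
                         y::AbstractVector = x; σ = identity)
    out = size(W, 1)                      # first dimension of W indexes the outputs
    return [σ(x' * W[i, :, :] * y + bias[i]) for i in 1:out]
end

W = ones(Float32, 2, 3, 4)    # (out, in1, in2)
b = zeros(Float32, 2)
x, y = ones(Float32, 3), ones(Float32, 4)
z = bilinear_sketch(W, b, x, y)           # each entry sums all 3*4 products
```

With all-ones inputs and weights, each output is simply the number of (in1, in2) pairs, which makes the shape convention easy to verify.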

Examples

julia> x, y = randn(Float32, 5, 32), randn(Float32, 5, 32);
 
 julia> B = Flux.Bilinear((5, 5) => 7)
@@ -44,7 +44,7 @@
 (3, 32)
 
 julia> Flux.Bilinear(rand(4,8,16), false, tanh)  # first dim of weight is the output
-Bilinear((8, 16) => 4, tanh; bias=false)  # 512 parameters
source
Flux.ScaleType
Scale(size::Integer..., σ=identity; bias=true, init=ones32)
+Bilinear((8, 16) => 4, tanh; bias=false)  # 512 parameters
source
Flux.ScaleType
Scale(size::Integer..., σ=identity; bias=true, init=ones32)
 Scale(scale::AbstractArray, [bias, σ])

Create an element-wise layer, whose forward pass is given by:

y = σ.(scale .* x .+ bias)

This uses .* instead of matrix multiplication * of Dense.

The learnable scale & bias are initialised by init(size...) and zeros32(size...) respectively, with init=ones32 by default. You may specify the function init, turn off trainable bias with bias=false, or provide the array(s) explicitly.

Used by LayerNorm with affine=true.

Examples

julia> a = Flux.Scale(2)
 Scale(2)            # 4 parameters
 
@@ -68,7 +68,7 @@
 
 julia> Flux.trainables(b)
 1-element Vector{AbstractArray}:
- Float32[1.0 2.0 3.0 4.0]
source

Perhaps Scale isn't quite fully connected, but it may be thought of as Dense(Diagonal(s.weights), s.bias), and LinearAlgebra's Diagonal is a matrix which just happens to contain many zeros.

Convolution Models

These layers are used to build convolutional neural networks (CNNs).

They all expect images in what is called WHCN order: a batch of 32 colour images, each 50 x 50 pixels, will have size(x) == (50, 50, 3, 32). A single grayscale image might instead have size(x) == (28, 28, 1, 1).

Besides 2D image data, they also work with 1D data, where for instance a stereo sound recording with 1000 samples might have size(x) == (1000, 2, 1). They will also work with 3D data, ndims(x) == 5, where again the last two dimensions are channel and batch.

To understand how strides and padding work, the article by Dumoulin & Visin has great illustrations.

Flux.ConvType
Conv(filter, in => out, σ = identity;
+ Float32[1.0 2.0 3.0 4.0]
source

Perhaps Scale isn't quite fully connected, but it may be thought of as Dense(Diagonal(s.weights), s.bias), and LinearAlgebra's Diagonal is a matrix which just happens to contain many zeros.
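The Diagonal remark can be checked numerically: broadcasting scale .* x gives the same result as multiplying by a diagonal matrix. The sketch below uses only the LinearAlgebra standard library and made-up example values, with σ = identity.

```julia
using LinearAlgebra

scale = Float32[1, 2, 3]
bias  = Float32[0.5, 0.5, 0.5]
x     = Float32[10, 20, 30]

elementwise = scale .* x .+ bias           # the Scale forward pass with σ = identity
as_matrix   = Diagonal(scale) * x .+ bias  # the same computation written as a matrix product

elementwise == as_matrix
```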

Convolution Models

These layers are used to build convolutional neural networks (CNNs).

They all expect images in what is called WHCN order: a batch of 32 colour images, each 50 x 50 pixels, will have size(x) == (50, 50, 3, 32). A single grayscale image might instead have size(x) == (28, 28, 1, 1).

Besides 2D image data, they also work with 1D data, where for instance a stereo sound recording with 1000 samples might have size(x) == (1000, 2, 1). They will also work with 3D data, ndims(x) == 5, where again the last two dimensions are channel and batch.

To understand how strides and padding work, the article by Dumoulin & Visin has great illustrations.

Flux.ConvType
Conv(filter, in => out, σ = identity;
      stride = 1, pad = 0, dilation = 1, groups = 1, [bias, init])

Standard convolutional layer. filter is a tuple of integers specifying the size of the convolutional kernel; in and out specify the number of input and output channels.

Image data should be stored in WHCN order (width, height, channels, batch). In other words, a 100×100 RGB image would be a 100×100×3×1 array, and a batch of 50 would be a 100×100×3×50 array. This has N = 2 spatial dimensions, and needs a kernel size like (5,5), a 2-tuple of integers.

To take convolutions along N feature dimensions, this layer expects as input an array with ndims(x) == N+2, where size(x, N+1) == in is the number of input channels, and size(x, ndims(x)) is (as always) the number of observations in a batch. Then:

  • filter should be a tuple of N integers.
  • Keywords stride and dilation should each be either a single integer, or a tuple of N integers.
  • Keyword pad specifies the number of elements added to the borders of the data array. It can be
    • a single integer for equal padding all around,
    • a tuple of N integers, to apply the same padding at begin/end of each spatial dimension,
    • a tuple of 2*N integers, for asymmetric padding, or
    • the singleton SamePad(), to calculate padding such that size(output,d) == size(x,d) / stride (possibly rounded) for each spatial dimension.
  • Keyword groups is expected to be an Int. It specifies the number of groups to divide a convolution into.

Keywords to control initialization of the layer:

  • init - Function used to generate initial weights. Defaults to glorot_uniform.
  • bias - The initial bias vector is all zero by default. Trainable bias can be disabled entirely by setting this to false, or another vector can be provided such as bias = randn(Float32, out).

See also ConvTranspose, DepthwiseConv, CrossCor.
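How stride, pad and dilation combine to determine the output size can be sketched with the usual convolution arithmetic for one spatial dimension. This is an assumption based on the standard formula, not code taken from Flux; conv_out_size is a hypothetical helper, and pad_total means the padding at the start plus the padding at the end of that dimension.

```julia
# Standard convolution arithmetic for one spatial dimension (an assumption, not Flux's code).
function conv_out_size(in_size::Integer, k::Integer;
                       stride::Integer = 1, pad_total::Integer = 0, dilation::Integer = 1)
    effective_k = dilation * (k - 1) + 1          # kernel extent once dilated
    return fld(in_size + pad_total - effective_k, stride) + 1
end

conv_out_size(100, 5)                             # plain 5-wide kernel, no padding
conv_out_size(100, 5; stride = 2, dilation = 4)   # the stride = 2, dilation = 4 case
conv_out_size(100, 5; stride = 3, pad_total = 4)  # pad = 2 on both sides, stride = 3
```

The second call agrees with the size reported for Conv((5,5), 3 => 7; stride = 2, dilation = 4) in the examples of this docstring.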

Examples

julia> xs = rand32(100, 100, 3, 50); # a batch of 50 RGB images
 
 julia> layer = Conv((5,5), 3 => 7, relu; bias = false)
@@ -87,7 +87,7 @@
 (130, 100, 7, 50)
 
 julia> Conv((5,5), 3 => 7; stride = 2, dilation = 4)(xs) |> size
-(42, 42, 7, 50)
source
Flux.ConvMethod
Conv(weight::AbstractArray, [bias, activation; stride, pad, dilation])

Constructs a convolutional layer with the given weight and bias. Accepts the same keywords and has the same defaults as Conv(k::NTuple{N,Integer}, ch::Pair{<:Integer,<:Integer}, σ; ...).

julia> weight = rand(3, 4, 5);
+(42, 42, 7, 50)
source
Flux.ConvMethod
Conv(weight::AbstractArray, [bias, activation; stride, pad, dilation])

Constructs a convolutional layer with the given weight and bias. Accepts the same keywords and has the same defaults as Conv(k::NTuple{N,Integer}, ch::Pair{<:Integer,<:Integer}, σ; ...).

julia> weight = rand(3, 4, 5);
 
 julia> bias = zeros(5);
 
@@ -98,7 +98,7 @@
 (98, 5, 64)
 
 julia> Flux.params(layer) |> length
-2
source
Flux.ConvTransposeType
ConvTranspose(filter, in => out, σ=identity; stride=1, pad=0, outpad=0, dilation=1, [bias, init])

Standard convolutional transpose layer. filter is a tuple of integers specifying the size of the convolutional kernel, while in and out specify the number of input and output channels.

Note that pad=SamePad() here tries to ensure size(output,d) == size(x,d) * stride.

To conserve Conv invertibility when stride > 1, outpad can be used to increase the size of the output in the desired dimensions. Whereas pad is used to zero-pad the input, outpad only affects the output shape.

Parameters are controlled by additional keywords, with defaults init=glorot_uniform and bias=true.

See also Conv for a more detailed description of keywords.

Examples

julia> xs = rand32(100, 100, 3, 50);  # a batch of 50 RGB images
+2
source
Flux.ConvTransposeType
ConvTranspose(filter, in => out, σ=identity; stride=1, pad=0, outpad=0, dilation=1, [bias, init])

Standard convolutional transpose layer. filter is a tuple of integers specifying the size of the convolutional kernel, while in and out specify the number of input and output channels.

Note that pad=SamePad() here tries to ensure size(output,d) == size(x,d) * stride.

To conserve Conv invertibility when stride > 1, outpad can be used to increase the size of the output in the desired dimensions. Whereas pad is used to zero-pad the input, outpad only affects the output shape.

Parameters are controlled by additional keywords, with defaults init=glorot_uniform and bias=true.

See also Conv for a more detailed description of keywords.

Examples

julia> xs = rand32(100, 100, 3, 50);  # a batch of 50 RGB images
 
 julia> layer = ConvTranspose((5,5), 3 => 7, relu)
 ConvTranspose((5, 5), 3 => 7, relu)  # 532 parameters
@@ -113,7 +113,7 @@
 (204, 204, 7, 50)
 
 julia> ConvTranspose((5,5), 3 => 7, stride=3, pad=SamePad())(xs) |> size
-(300, 300, 7, 50)
source
Flux.ConvTransposeMethod
ConvTranspose(weight::AbstractArray, [bias, activation; stride, pad, outpad, dilation, groups])

Constructs a ConvTranspose layer with the given weight and bias. Accepts the same keywords and has the same defaults as ConvTranspose(k::NTuple{N,Integer}, ch::Pair{<:Integer,<:Integer}, σ; ...).

Examples

julia> weight = rand(3, 4, 5);
+(300, 300, 7, 50)
source
Flux.ConvTransposeMethod
ConvTranspose(weight::AbstractArray, [bias, activation; stride, pad, outpad, dilation, groups])

Constructs a ConvTranspose layer with the given weight and bias. Accepts the same keywords and has the same defaults as ConvTranspose(k::NTuple{N,Integer}, ch::Pair{<:Integer,<:Integer}, σ; ...).

Examples

julia> weight = rand(3, 4, 5);
 
 julia> bias = zeros(4);
 
@@ -124,7 +124,7 @@
 (102, 4, 64)
 
 julia> Flux.params(layer) |> length
-2
source
Flux.CrossCorType
CrossCor(filter, in => out, σ=identity; stride=1, pad=0, dilation=1, [bias, init])

Standard cross correlation layer. filter is a tuple of integers specifying the size of the convolutional kernel; in and out specify the number of input and output channels.

Parameters are controlled by additional keywords, with defaults init=glorot_uniform and bias=true.

See also Conv for a more detailed description of keywords.

Examples

julia> xs = rand(Float32, 100, 100, 3, 50);  # a batch of 50 RGB images
+2
source
Flux.CrossCorType
CrossCor(filter, in => out, σ=identity; stride=1, pad=0, dilation=1, [bias, init])

Standard cross correlation layer. filter is a tuple of integers specifying the size of the convolutional kernel; in and out specify the number of input and output channels.

Parameters are controlled by additional keywords, with defaults init=glorot_uniform and bias=true.

See also Conv for a more detailed description of keywords.

Examples

julia> xs = rand(Float32, 100, 100, 3, 50);  # a batch of 50 RGB images
 
 julia> layer = CrossCor((5,5), 3 => 6, relu; bias=false)
 CrossCor((5, 5), 3 => 6, relu, bias=false)  # 450 parameters
@@ -133,7 +133,7 @@
 (96, 96, 6, 50)
 
 julia> CrossCor((5,5), 3 => 7, stride=3, pad=(2,0))(xs) |> size
-(34, 32, 7, 50)
source
Flux.CrossCorMethod
CrossCor(weight::AbstractArray, [bias, activation; stride, pad, dilation])

Constructs a CrossCor layer with the given weight and bias. Accepts the same keywords and has the same defaults as CrossCor(k::NTuple{N,Integer}, ch::Pair{<:Integer,<:Integer}, σ; ...).

Examples

julia> weight = rand(3, 4, 5);
+(34, 32, 7, 50)
source
Flux.CrossCorMethod
CrossCor(weight::AbstractArray, [bias, activation; stride, pad, dilation])

Constructs a CrossCor layer with the given weight and bias. Accepts the same keywords and has the same defaults as CrossCor(k::NTuple{N,Integer}, ch::Pair{<:Integer,<:Integer}, σ; ...).

Examples

julia> weight = rand(3, 4, 5);
 
 julia> bias = zeros(5);
 
@@ -141,7 +141,7 @@
 CrossCor((3,), 4 => 5, relu)  # 65 parameters
 
 julia> layer(randn(100, 4, 64)) |> size
-(98, 5, 64)
source
Flux.DepthwiseConvFunction
DepthwiseConv(filter, in => out, σ=identity; stride=1, pad=0, dilation=1, [bias, init])
+(98, 5, 64)
source
Flux.DepthwiseConvFunction
DepthwiseConv(filter, in => out, σ=identity; stride=1, pad=0, dilation=1, [bias, init])
 DepthwiseConv(weight::AbstractArray, [bias, activation; stride, pad, dilation])

Return a depthwise convolutional layer, that is, a Conv layer with the number of groups equal to the number of input channels.

See Conv for a description of the arguments.

Examples

julia> xs = rand(Float32, 100, 100, 3, 50);  # a batch of 50 RGB images
 
 julia> layer = DepthwiseConv((5,5), 3 => 6, relu; bias=false)
@@ -151,7 +151,7 @@
 (96, 96, 6, 50)
 
 julia> DepthwiseConv((5, 5), 3 => 9, stride=2, pad=2)(xs) |> size
-(50, 50, 9, 50)
source
Flux.SamePadType
SamePad()

Passed as an option to convolutional layers (and friends), this causes the padding to be chosen such that the input and output sizes agree (on the first N dimensions, the kernel or window) when stride==1. When stride≠1, the output size equals ceil(input_size/stride).

See also Conv, MaxPool.

Examples

julia> xs = rand32(100, 100, 3, 50);  # a batch of images
+(50, 50, 9, 50)
source
Flux.SamePadType
SamePad()

Passed as an option to convolutional layers (and friends), this causes the padding to be chosen such that the input and output sizes agree (on the first N dimensions, the kernel or window) when stride==1. When stride≠1, the output size equals ceil(input_size/stride).

See also Conv, MaxPool.
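The padding that SamePad() resolves to can be approximated per spatial dimension as follows. This sketch assumes the total padding is dilation * (k - 1), with any odd element placed at the beginning; that assumption is consistent with the pad values printed in the examples of this docstring, but same_pad_sketch is a hypothetical helper and not Flux's internal routine.

```julia
# Sketch of "same" padding for one spatial dimension, under the assumption stated above.
function same_pad_sketch(k::Integer; dilation::Integer = 1)
    total = dilation * (k - 1)
    return (cld(total, 2), fld(total, 2))   # (pad at start, pad at end)
end

same_pad_sketch(2)   # uneven split for an even kernel size
same_pad_sketch(5)   # symmetric split for an odd kernel size
cld(100, 2)          # ceil(input_size/stride) for input 100 and stride 2
```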

Examples

julia> xs = rand32(100, 100, 3, 50);  # a batch of images
 
 julia> layer = Conv((2,2), 3 => 7, pad=SamePad())
 Conv((2, 2), 3 => 7, pad=(1, 0, 1, 0))  # 91 parameters
@@ -169,7 +169,7 @@
 Conv((5, 5), 3 => 7, pad=2, stride=2)  # 532 parameters
 
 julia> layer3(xs) |> size  # output size = `ceil(input_size/stride)` = 50
-(50, 50, 7, 50)
source
Flux.flattenFunction

flatten(x)

Same as MLUtils.flatten, which should be preferred; this method exists only for backward compatibility.

source

MultiHeadAttention

The basic blocks needed to implement Transformer architectures. See also the functional counterparts documented in NNlib's Attention section.

Flux.MultiHeadAttentionType
MultiHeadAttention(dims; [nheads, bias, init, dropout_prob])

The multi-head dot-product attention layer used in Transformer architectures [1].

Returns the transformed input sequence and the attention scores.

[1] Vaswani et al. "Attention is all you need." Advances in Neural Information Processing Systems. 2017.

Arguments

  • dims: The embedding dimensions of inputs, intermediate tensors and outputs. In the most general case, it is given as a) (q_in_dim, k_in_dim, v_in_dim) => (qk_dim, v_dim) => out_dim. It can also take the simpler forms b) dims::Int; c) in_dim::Int => (qk_dim, v_dim) => out_dim; d) in_dim::Int => qkv_dim => out_dim.
  • nheads: number of heads. Default 8.
  • init: weight initializer for the Dense layers. Default glorot_uniform.
  • bias : whether pointwise QKVO dense transforms use bias. Default false.
  • dropout_prob: dropout probability for the attention scores. Default 0.0.

Forward

(mha::MultiHeadAttention)(q_in, k_in, v_in, [bias]; [mask])

The arguments of the forward pass are:

  • q_in: Input query array of size (q_in_dim, q_len, batch_size).
  • k_in: Input key array of size (k_in_dim, kv_len, batch_size).
  • v_in: Input value array of size (v_in_dim, kv_len, batch_size).
  • bias: Bias array broadcastable to size (kv_len, q_len, nheads, batch_size). It will be added to the attention scores before the softmax. Default nothing.
  • mask: Input array broadcastable to size (kv_len, q_len, nheads, batch_size). The mask is applied to the attention scores just before the softmax. See NNlib.make_causal_mask for creating causal masks. Default nothing.

Alternative calling signatures are mha(q_in), equivalent to mha(q_in, q_in, q_in) (self-attention), and mha(q_in, k_in), equivalent to mha(q_in, k_in, k_in) (key and value are the same).

See also NNlib.dot_product_attention.

Examples

mha = MultiHeadAttention(64, nheads = 8)
+(50, 50, 7, 50)
source
Flux.flattenFunction

flatten(x)

Same as MLUtils.flatten, which should be preferred; this method exists only for backward compatibility.

source

MultiHeadAttention

The basic blocks needed to implement Transformer architectures. See also the functional counterparts documented in NNlib's Attention section.

Flux.MultiHeadAttentionType
MultiHeadAttention(dims; [nheads, bias, init, dropout_prob])

The multi-head dot-product attention layer used in Transformer architectures [1].

Returns the transformed input sequence and the attention scores.

[1] Vaswani et al. "Attention is all you need." Advances in Neural Information Processing Systems. 2017.

Arguments

  • dims: The embedding dimensions of inputs, intermediate tensors and outputs. In the most general case, it is given as a) (q_in_dim, k_in_dim, v_in_dim) => (qk_dim, v_dim) => out_dim. It can also take the simpler forms b) dims::Int; c) in_dim::Int => (qk_dim, v_dim) => out_dim; d) in_dim::Int => qkv_dim => out_dim.
  • nheads: number of heads. Default 8.
  • init: weight initializer for the Dense layers. Default glorot_uniform.
  • bias : whether pointwise QKVO dense transforms use bias. Default false.
  • dropout_prob: dropout probability for the attention scores. Default 0.0.

Forward

(mha::MultiHeadAttention)(q_in, k_in, v_in, [bias]; [mask])

The arguments of the forward pass are:

  • q_in: Input query array of size (q_in_dim, q_len, batch_size).
  • k_in: Input key array of size (k_in_dim, kv_len, batch_size).
  • v_in: Input value array of size (v_in_dim, kv_len, batch_size).
  • bias: Bias array broadcastable to size (kv_len, q_len, nheads, batch_size). It will be added to the attention scores before the softmax. Default nothing.
  • mask: Input array broadcastable to size (kv_len, q_len, nheads, batch_size). The mask is applied to the attention scores just before the softmax. See NNlib.make_causal_mask for creating causal masks. Default nothing.

Alternative calling signatures are mha(q_in), equivalent to mha(q_in, q_in, q_in) (self-attention), and mha(q_in, k_in), equivalent to mha(q_in, k_in, k_in) (key and value are the same).

See also NNlib.dot_product_attention.
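To make the array sizes in the forward pass concrete, here is a stripped-down sketch of dot-product attention for a single head and a single sample, written in plain Julia. It omits the learned Q/K/V/O projections, the bias, the mask, dropout and the head and batch dimensions, so it illustrates only the shape convention (scores of size (kv_len, q_len)); attention_sketch is a hypothetical name and not the NNlib or Flux implementation.

```julia
using LinearAlgebra

# Single-head, single-sample dot-product attention (illustrative only).
function attention_sketch(q::AbstractMatrix, k::AbstractMatrix, v::AbstractMatrix)
    scale = convert(float(eltype(q)), sqrt(size(q, 1)))
    scores = (k' * q) ./ scale                           # (kv_len, q_len)
    scores = scores .- maximum(scores; dims = 1)         # stabilise the softmax
    α = exp.(scores) ./ sum(exp.(scores); dims = 1)      # softmax over the key axis
    return v * α, α                                      # (v_dim, q_len), (kv_len, q_len)
end

q = rand(Float32, 8, 3)    # (qk_dim, q_len)
k = rand(Float32, 8, 5)    # (qk_dim, kv_len)
v = rand(Float32, 4, 5)    # (v_dim, kv_len)
y, α = attention_sketch(q, k, v)
size(y), size(α)
```

Each column of α sums to one, since the softmax runs over the key axis for every query position.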

Examples

mha = MultiHeadAttention(64, nheads = 8)
 q = rand(Float32, (64, 10, 32))
 k = rand(Float32, (64, 20, 32))
 v = rand(Float32, (64, 20, 32))
@@ -180,13 +180,13 @@
 mha = MultiHeadAttention(64 => 1024 => 1024, nheads = 8)
 y, α = mha(q) # self-attention
 # [y] = [1024, 10, 32]
-# [α] = [10, 10, 8, 32]
source

Pooling

These layers are commonly used after a convolution layer, and reduce the size of its output. They have no trainable parameters.

Flux.AdaptiveMaxPoolType
AdaptiveMaxPool(out::NTuple)

Adaptive max pooling layer. Calculates the necessary window size such that its output has size(y)[1:N] == out.

Expects as input an array with ndims(x) == N+2, i.e. channel and batch dimensions, after the N feature dimensions, where N = length(out).

See also MaxPool, AdaptiveMeanPool.

Examples

julia> xs = rand(Float32, 100, 100, 3, 50);  # batch of 50 RGB images
+# [α] = [10, 10, 8, 32]
source

Pooling

These layers are commonly used after a convolution layer, and reduce the size of its output. They have no trainable parameters.

Flux.AdaptiveMaxPoolType
AdaptiveMaxPool(out::NTuple)

Adaptive max pooling layer. Calculates the necessary window size such that its output has size(y)[1:N] == out.

Expects as input an array with ndims(x) == N+2, i.e. channel and batch dimensions, after the N feature dimensions, where N = length(out).

See also MaxPool, AdaptiveMeanPool.
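
The window-size calculation mentioned above can be sketched in plain Julia. The helper name adaptive_window is hypothetical, and the arithmetic (integer-divide for the stride, then size the window to cover what remains) is an assumption about one common scheme rather than a copy of Flux's source.

```julia
# Hypothetical helper: derive a pooling stride and window from input/output sizes.
# Assumes stride = insize ÷ outsize and a window that covers whatever is left.
function adaptive_window(insize::NTuple{N,Int}, outsize::NTuple{N,Int}) where {N}
    stride = insize .÷ outsize
    window = insize .- (outsize .- 1) .* stride
    return (window = window, stride = stride)
end

w = adaptive_window((100, 100), (25, 25))
w.window   # (4, 4)
w.stride   # (4, 4)
```

Under these assumptions a 100×100 input with a 25×25 target gives a 4×4 window with stride 4, which is consistent with the comparison against MaxPool((4,4)) in the example.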

Examples

julia> xs = rand(Float32, 100, 100, 3, 50);  # batch of 50 RGB images
 
 julia> AdaptiveMaxPool((25, 25))(xs) |> size
 (25, 25, 3, 50)
 
 julia> MaxPool((4,4))(xs) ≈ AdaptiveMaxPool((25, 25))(xs)
-true
source
Flux.MaxPoolType
MaxPool(window::NTuple; pad=0, stride=window)

Max pooling layer, which replaces all pixels in a block of size window with a single pixel holding their maximum.

Expects as input an array with ndims(x) == N+2, i.e. channel and batch dimensions, after the N feature dimensions, where N = length(window).

By default the window size is also the stride in each dimension. The keyword pad accepts the same options as for the Conv layer, including SamePad().

See also Conv, MeanPool, AdaptiveMaxPool, GlobalMaxPool.

Examples

julia> xs = rand(Float32, 100, 100, 3, 50);  # batch of 50 RGB images
+true
source
Flux.MaxPoolType
MaxPool(window::NTuple; pad=0, stride=window)

Max pooling layer, which replaces all pixels in a block of size window with a single pixel holding their maximum.

Expects as input an array with ndims(x) == N+2, i.e. channel and batch dimensions, after the N feature dimensions, where N = length(window).

By default the window size is also the stride in each dimension. The keyword pad accepts the same options as for the Conv layer, including SamePad().

See also Conv, MeanPool, AdaptiveMaxPool, GlobalMaxPool.

Examples

julia> xs = rand(Float32, 100, 100, 3, 50);  # batch of 50 RGB images
 
 julia> m = Chain(Conv((5, 5), 3 => 7, pad=SamePad()), MaxPool((5, 5), pad=SamePad()))
 Chain(
@@ -204,7 +204,7 @@
 MaxPool((5,), pad=2, stride=3)
 
 julia> layer(rand(Float32, 100, 7, 50)) |> size
-(34, 7, 50)
source
Flux.GlobalMaxPoolType
GlobalMaxPool()

Global max pooling layer.

Transforms (w,h,c,b)-shaped input into (1,1,c,b)-shaped output, by performing max pooling on the complete (w,h)-shaped feature maps.

See also MaxPool, GlobalMeanPool.

julia> xs = rand(Float32, 100, 100, 3, 50);
+(34, 7, 50)
source
Flux.GlobalMaxPoolType
GlobalMaxPool()

Global max pooling layer.

Transforms (w,h,c,b)-shaped input into (1,1,c,b)-shaped output, by performing max pooling on the complete (w,h)-shaped feature maps.
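
As a rough illustration of the shape behaviour described above, the reduction can be written with Base functions alone. The name global_max_sketch is hypothetical; the sketch assumes the layer reduces over every dimension except the last two (channel and batch).

```julia
# Plain-Julia sketch: take the maximum over every dimension except the last two
# (channel and batch). The name global_max_sketch is hypothetical.
global_max_sketch(x::AbstractArray) = maximum(x; dims = Tuple(1:ndims(x)-2))

xs = rand(Float32, 100, 100, 3, 50)
size(global_max_sketch(xs))              # (1, 1, 3, 50)
size(global_max_sketch(rand(3, 5, 7)))   # (1, 5, 7)
```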

See also MaxPool, GlobalMeanPool.

julia> xs = rand(Float32, 100, 100, 3, 50);
 
 julia> m = Chain(Conv((3,3), 3 => 7), GlobalMaxPool());
 
@@ -212,13 +212,13 @@
 (1, 1, 7, 50)
 
 julia> GlobalMaxPool()(rand(3,5,7)) |> size  # preserves 2 dimensions
-(1, 5, 7)
source
Flux.AdaptiveMeanPoolType
AdaptiveMeanPool(out::NTuple)

Adaptive mean pooling layer. Calculates the necessary window size such that its output has size(y)[1:N] == out.

Expects as input an array with ndims(x) == N+2, i.e. channel and batch dimensions, after the N feature dimensions, where N = length(out).

See also MaxPool, AdaptiveMaxPool.

Examples

julia> xs = rand(Float32, 100, 100, 3, 50);  # batch of 50 RGB images
+(1, 5, 7)
source
Flux.AdaptiveMeanPoolType
AdaptiveMeanPool(out::NTuple)

Adaptive mean pooling layer. Calculates the necessary window size such that its output has size(y)[1:N] == out.

Expects as input an array with ndims(x) == N+2, i.e. channel and batch dimensions, after the N feature dimensions, where N = length(out).

See also MaxPool, AdaptiveMaxPool.

Examples

julia> xs = rand(Float32, 100, 100, 3, 50);  # batch of 50 RGB images
 
 julia> AdaptiveMeanPool((25, 25))(xs) |> size
 (25, 25, 3, 50)
 
 julia> MeanPool((4,4))(xs) ≈ AdaptiveMeanPool((25, 25))(xs)
-true
source
Flux.MeanPoolType
MeanPool(window::NTuple; pad=0, stride=window)

Mean pooling layer, averaging all pixels in a block of size window.

Expects as input an array with ndims(x) == N+2, i.e. channel and batch dimensions, after the N feature dimensions, where N = length(window).

By default the window size is also the stride in each dimension. The keyword pad accepts the same options as for the Conv layer, including SamePad().

See also Conv, MaxPool, AdaptiveMeanPool.

Examples

julia> xs = rand(Float32, 100, 100, 3, 50);
+true
source
Flux.MeanPoolType
MeanPool(window::NTuple; pad=0, stride=window)

Mean pooling layer, averaging all pixels in a block of size window.

Expects as input an array with ndims(x) == N+2, i.e. channel and batch dimensions, after the N feature dimensions, where N = length(window).

By default the window size is also the stride in each dimension. The keyword pad accepts the same options as for the Conv layer, including SamePad().

See also Conv, MaxPool, AdaptiveMeanPool.

Examples

julia> xs = rand(Float32, 100, 100, 3, 50);
 
 julia> m = Chain(Conv((5,5), 3 => 7), MeanPool((5,5), pad=SamePad()))
 Chain(
@@ -230,12 +230,12 @@
 (96, 96, 7, 50)
 
 julia> m(xs) |> size
-(20, 20, 7, 50)
source
Flux.GlobalMeanPoolType
GlobalMeanPool()

Global mean pooling layer.

Transforms (w,h,c,b)-shaped input into (1,1,c,b)-shaped output, by performing mean pooling on the complete (w,h)-shaped feature maps.

julia> xs = rand(Float32, 100, 100, 3, 50);
+(20, 20, 7, 50)
source
Flux.GlobalMeanPoolType
GlobalMeanPool()

Global mean pooling layer.

Transforms (w,h,c,b)-shaped input into (1,1,c,b)-shaped output, by performing mean pooling on the complete (w,h)-shaped feature maps.

julia> xs = rand(Float32, 100, 100, 3, 50);
 
 julia> m = Chain(Conv((3,3), 3 => 7), GlobalMeanPool());
 
 julia> m(xs) |> size
-(1, 1, 7, 50)
source

Upsampling

The opposite of pooling, these layers increase the size of an array. They have no trainable parameters.

Flux.UpsampleType
Upsample(mode = :nearest; [scale, size]) 
+(1, 1, 7, 50)
source

Upsampling

The opposite of pooling, these layers increase the size of an array. They have no trainable parameters.

Flux.UpsampleType
Upsample(mode = :nearest; [scale, size]) 
 Upsample(scale, mode = :nearest)

An upsampling layer. One of two keywords must be given:

If scale is a number, this applies to all but the last two dimensions (channel and batch) of the input. It may also be a tuple, to control dimensions individually. Alternatively, keyword size accepts a tuple, to directly specify the leading dimensions of the output.
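
To make the effect of a tuple scale concrete, here is a minimal sketch of nearest-neighbour upsampling built on Base.repeat. The function name upsample_nearest_sketch is hypothetical and this is not the NNlib implementation; it only handles integer scales.

```julia
# Sketch of nearest-neighbour upsampling with an integer scale, using Base.repeat.
# upsample_nearest_sketch is a hypothetical name, not the NNlib function.
function upsample_nearest_sketch(x::AbstractArray{T,N}, scale::NTuple{M,Int}) where {T,N,M}
    @assert M == N - 2 "scale must cover all but the channel and batch dimensions"
    return repeat(x; inner = (scale..., 1, 1))
end

x = reshape(Float32[1, 2, 3, 4], 2, 2, 1, 1)
y = upsample_nearest_sketch(x, (2, 3))
size(y)   # (4, 6, 1, 1)
```

Each input pixel becomes a 2×3 block of identical values, while the channel and batch dimensions are left untouched.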

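Currently supported upsampling modes and the corresponding NNlib methods are:

  • :nearest -> NNlib.upsample_nearest
  • :bilinear -> NNlib.upsample_bilinear
  • :trilinear -> NNlib.upsample_trilinear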

Examples

julia> m = Upsample(scale = (2, 3))
 Upsample(:nearest, scale = (2, 3))
 
@@ -246,7 +246,7 @@
 Upsample(:bilinear, size = (4, 5))
 
 julia> m(ones(2, 2, 1, 1)) |> size
-(4, 5, 1, 1)
source
Flux.PixelShuffleType
PixelShuffle(r::Int)

Pixel shuffling layer with upscale factor r. Usually used for generating higher resolution images while upscaling them.

See NNlib.pixel_shuffle.

Examples

julia> p = PixelShuffle(2);
+(4, 5, 1, 1)
source
Flux.PixelShuffleType
PixelShuffle(r::Int)

Pixel shuffling layer with upscale factor r. Usually used for generating higher resolution images while upscaling them.

See NNlib.pixel_shuffle.

Examples

julia> p = PixelShuffle(2);
 
 julia> xs = [2row + col + channel/10 for row in 1:2, col in 1:2, channel in 1:4, n in 1:1]
 2×2×4×1 Array{Float64, 4}:
@@ -298,7 +298,7 @@
  4.1  4.3  5.1  5.3  6.1  6.3
  4.2  4.4  5.2  5.4  6.2  6.4
  7.1  7.3  8.1  8.3  9.1  9.3
- 7.2  7.4  8.2  8.4  9.2  9.4
source

Embedding Vectors

These layers accept an index, and return a vector (or several indices, and several vectors). The possible embedding vectors are learned parameters.

Flux.EmbeddingType
Embedding(in => out; init=randn32)

A lookup table that stores embeddings of dimension out for a vocabulary of size in, as a trainable matrix.

This layer is often used to store word embeddings and retrieve them using indices. The input to the layer can be a vocabulary index in 1:in, an array of indices, or the corresponding onehot encoding.

For indices x, the result is of size (out, size(x)...), allowing several batch dimensions. For one-hot ohx, the result is of size (out, size(ohx)[2:end]...).

Examples

julia> emb = Embedding(26 => 4, init=Flux.identity_init(gain=22))
+ 7.2  7.4  8.2  8.4  9.2  9.4
source

Embedding Vectors

These layers accept an index, and return a vector (or several indices, and several vectors). The possible embedding vectors are learned parameters.

Flux.EmbeddingType
Embedding(in => out; init=randn32)

A lookup table that stores embeddings of dimension out for a vocabulary of size in, as a trainable matrix.

This layer is often used to store word embeddings and retrieve them using indices. The input to the layer can be a vocabulary index in 1:in, an array of indices, or the corresponding onehot encoding.

For indices x, the result is of size (out, size(x)...), allowing several batch dimensions. For one-hot ohx, the result is of size (out, size(ohx)[2:end]...).
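
The size rules above follow from treating the lookup as column indexing into an out × in matrix. The sketch below uses a hypothetical embed_sketch function and a dense stand-in for a one-hot vector; it illustrates the idea and is not how the layer is implemented.

```julia
# Sketch of an embedding lookup as column indexing into an out × in weight matrix.
# embed_sketch is a hypothetical name; W stands in for the layer's trainable weights.
embed_sketch(W::AbstractMatrix, i::Integer) = W[:, i]
embed_sketch(W::AbstractMatrix, x::AbstractArray{<:Integer}) =
    reshape(W[:, vec(x)], :, size(x)...)

W = Float32[1 2 3 4; 10 20 30 40]      # out = 2, in = 4
onehot(i, n) = Float32.(1:n .== i)     # dense stand-in for a one-hot vector

embed_sketch(W, 3)                     # Float32[3, 30]
size(embed_sketch(W, [1 2 3; 4 1 2]))  # (2, 2, 3)
W * onehot(3, 4) == embed_sketch(W, 3) # true
```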

Examples

julia> emb = Embedding(26 => 4, init=Flux.identity_init(gain=22))
 Embedding(26 => 4)  # 104 parameters
 
 julia> emb(2)  # one column of e.weight (here not random!)
@@ -319,7 +319,7 @@
 true
 
 julia> emb(rand(1:26, (10, 1, 12))) |> size  # three batch dimensions
-(4, 10, 1, 12)
source
Flux.EmbeddingBagType
EmbeddingBag(in => out, reduction=mean; init=Flux.randn32)

A lookup table that stores embeddings of dimension out for a vocabulary of size in. Differs from Embedding in that, instead of acting on a single vocabulary index, it always acts on a vector of indices which it calls a "bag". Their individual embedding vectors are reduced to one, using mean or some other function.

Instead of acting on one "bag", such as x::Vector{Int}, the layer can also act on several:

  • Acting on a vector of "bags", it produces a matrix whose columns are the reduced vectors. More generally on x::Array{Vector{Int}}, its output is of size (out, size(x)...).

  • Any higher-rank array of integers is interpreted as a collection of "bags" each along the first dimension. Thus the output is mapslices(e, x; dims=1) when e::EmbeddingBag and x::Array{Int,N}. This method is more efficient, but requires that all "bags" have the same length.

  • A vector of "bags" may also be produced by splitting a vector of indices at specified points. For this case the layer takes two inputs, both vectors of integers. See details below.

The "bag" may equivalently be represented as a OneHotMatrix. A collection of these, or one higher-rank OneHotArray, again produce a stack of embeddings. See details below.

Examples

julia> vocab_size = 26;  # embed into 3 dimensions, with non-random vectors:
+(4, 10, 1, 12)
source
Flux.EmbeddingBagType
EmbeddingBag(in => out, reduction=mean; init=Flux.randn32)

A lookup table that stores embeddings of dimension out for a vocabulary of size in. Differs from Embedding in that, instead of acting on a single vocabulary index, it always acts on a vector of indices which it calls a "bag". Their individual embedding vectors are reduced to one, using mean or some other function.

Instead of acting on one "bag", such as x::Vector{Int}, the layer can also act on several:

  • Acting on a vector of "bags", it produces a matrix whose columns are the reduced vectors. More generally on x::Array{Vector{Int}}, its output is of size (out, size(x)...).

  • Any higher-rank array of integers is interpreted as a collection of "bags" each along the first dimension. Thus the output is mapslices(e, x; dims=1) when e::EmbeddingBag and x::Array{Int,N}. This method is more efficient, but requires that all "bags" have the same length.

  • A vector of "bags" may also be produced by splitting a vector of indices at specified points. For this case the layer takes two inputs, both vectors of integers. See details below.

The "bag" may equivalently be represented as a OneHotMatrix. A collection of these, or one higher-rank OneHotArray, again produce a stack of embeddings. See details below.

Examples

julia> vocab_size = 26;  # embed into 3 dimensions, with non-random vectors:
 
 julia> eb = EmbeddingBag(vocab_size => 3, init=Flux.identity_init(gain=100))
 EmbeddingBag(26 => 3)  # 78 parameters
@@ -368,7 +368,7 @@
 3×2 Matrix{Float32}:
  33.3333    0.0
  66.6667    0.0
-  0.0     100.0
source

Dataflow Layers, or Containers

The basic Chain(F, G, H) applies the layers it contains in sequence, equivalent to H ∘ G ∘ F. Flux has some other layers which contain layers, but connect them up in a more complicated way: SkipConnection allows ResNet's residual connection.

Flux.ChainType
Chain(layers...)
+  0.0     100.0
source

Dataflow Layers, or Containers

The basic Chain(F, G, H) applies the layers it contains in sequence, equivalent to H ∘ G ∘ F. Flux has some other layers which contain layers, but connect them up in a more complicated way: SkipConnection allows ResNet's residual connection.
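
The equivalence with H ∘ G ∘ F can be checked with ordinary functions. In this sketch chain_sketch is a hypothetical stand-in for calling a Chain, and F, G, H are arbitrary illustrative functions.

```julia
# Applying functions left to right is the same as composing them right to left.
# chain_sketch is a hypothetical stand-in for calling a Chain.
chain_sketch(layers...) = x -> foldl((acc, f) -> f(acc), layers; init = x)

F(x) = x^2
G(x) = x + 1
H(x) = 2x

chain_sketch(F, G, H)(5) == (H ∘ G ∘ F)(5) == 52   # true
```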

Flux.ChainType
Chain(layers...)
 Chain(name = layer, ...)

Collects multiple layers / functions to be called in sequence on a given input. Supports indexing and slicing, m[2] or m[1:end-1], and if names are given, m[:name] == m[1] etc.

Examples

julia> m = Chain(x -> x^2, x -> x+1);
 
 julia> m(5) == 26
@@ -385,12 +385,12 @@
                   dec = Dense(5 => 2));
 
 julia> m2(x) == (m2[:dec] ∘ m2[:enc])(x)
-true

For large models, there is a special type-unstable path which can reduce compilation times. This can be used by supplying a vector of layers Chain([layer1, layer2, ...]). This feature is somewhat experimental, beware!

source
Flux.activationsFunction
activations(c::Chain, input)

Like calling a Chain, but saves the result of each layer as an output.

Examples

julia> using Flux: activations
+true

For large models, there is a special type-unstable path which can reduce compilation times. This can be used by supplying a vector of layers Chain([layer1, layer2, ...]). This feature is somewhat experimental, beware!

source
Flux.activationsFunction
activations(c::Chain, input)

Like calling a Chain, but saves the result of each layer as an output.

Examples

julia> using Flux: activations
 
 julia> c = Chain(x -> x + 1, x -> x * 2, x -> x ^ 3);
 
 julia> activations(c, 1)
-(2, 4, 64)
source
Flux.MaxoutType
Maxout(layers...)
+(2, 4, 64)
source
Flux.MaxoutType
Maxout(layers...)
 Maxout(f, n_alts)

This contains a number of internal layers, each of which receives the same input. Its output is the elementwise maximum of the internal layers' outputs.

Instead of defining layers individually, you can provide a zero-argument function which constructs them, and the number to construct.

Maxout over linear dense layers satisfies the universal approximation theorem. See Goodfellow, Warde-Farley, Mirza, Courville & Bengio "Maxout Networks" https://arxiv.org/abs/1302.4389.

See also Parallel to reduce with other operators.

Examples

julia> m = Maxout(x -> abs2.(x), x -> x .* 3);
 
 julia> m([-2 -1 0 1 2])
@@ -405,7 +405,7 @@
 )                   # Total: 6 arrays, 126 parameters, 888 bytes.
 
 julia> Flux.outputsize(m3, (5, 11))
-(7, 11)
source
Flux.SkipConnectionType
SkipConnection(layer, connection)

Create a skip connection which consists of a layer or Chain of consecutive layers and a shortcut connection linking the block's input to the output through a user-supplied 2-argument callable. The first argument to the callable will be propagated through the given layer while the second is the unchanged, "skipped" input.

The simplest "ResNet"-type connection is just SkipConnection(layer, +). Here is a more complicated example:

julia> m = Conv((3,3), 4 => 7, pad=(1,1));
+(7, 11)
source
Flux.SkipConnectionType
SkipConnection(layer, connection)

Create a skip connection which consists of a layer or Chain of consecutive layers and a shortcut connection linking the block's input to the output through a user-supplied 2-argument callable. The first argument to the callable will be propagated through the given layer while the second is the unchanged, "skipped" input.
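
The argument order described above (layer output first, untouched input second) can be sketched with a closure. The name skip_sketch and the toy double layer are hypothetical and only illustrate the calling convention.

```julia
# Sketch: a skip connection calls connection(layer(x), x).
# skip_sketch is a hypothetical name for illustration.
skip_sketch(layer, connection) = x -> connection(layer(x), x)

double(x) = 2 .* x

residual = skip_sketch(double, +)                       # layer(x) + x
residual([1, 2, 3])                                     # [3, 6, 9]

stacked = skip_sketch(double, (mx, x) -> vcat(mx, x))   # concatenate instead of add
stacked([1, 2, 3])                                      # [2, 4, 6, 1, 2, 3]
```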

The simplest "ResNet"-type connection is just SkipConnection(layer, +). Here is a more complicated example:

julia> m = Conv((3,3), 4 => 7, pad=(1,1));
 
 julia> x = ones(Float32, 5, 5, 4, 10);
 
@@ -415,7 +415,7 @@
 julia> sm = SkipConnection(m, (mx, x) -> cat(mx, x, dims=3));
 
 julia> size(sm(x)) == (5, 5, 11, 10)
-true

See also Parallel, Maxout.

source
Flux.ParallelType
Parallel(connection, layers...)
+true

See also Parallel, Maxout.

source
Flux.ParallelType
Parallel(connection, layers...)
 Parallel(connection; name = layer, ...)

Create a layer which passes an input array to each path in layers, before reducing the output with connection.

Called with one input x, this is equivalent to connection([l(x) for l in layers]...). If called with multiple inputs, one is passed to each layer, thus Parallel(+, f, g)(x, y) = f(x) + g(y).

Like Chain, its sub-layers may be given names using the keyword constructor. These can be accessed by indexing: m[1] == m[:name] is the first layer.

See also SkipConnection which is Parallel with one identity, and Maxout which reduces by broadcasting max.

Examples

julia> model = Chain(Dense(3 => 5),
                      Parallel(vcat, Dense(5 => 4), Chain(Dense(5 => 7), Dense(7 => 4))),
                      Dense(8 => 17));
@@ -437,7 +437,7 @@
 (2,)
 
 julia> model2[:β] == model2[2]
-true
source
Flux.PairwiseFusionType
PairwiseFusion(connection, layers...)

Arguments

  • connection: A function taking 2 inputs and combining them into a single output
  • layers: The layers whose outputs are combined

Inputs

This layer behaves differently based on input type:

  1. If input x is a tuple of length N (or the input is xs with N x's), matching the number of layers, then each layer receives a new input x[i] combined with the previous output y[i-1] using connection. Thus (y1, y2, y3) = PairwiseFusion(connection, layer1, layer2, layer3)((x1, x2, x3)) may be drawn as:

x1 → layer1 → y1 ↘
+true
source
Flux.PairwiseFusionType
PairwiseFusion(connection, layers...)

Arguments

  • connection: A function taking 2 inputs and combining them into a single output
  • layers: The layers whose outputs are combined

Inputs

This layer behaves differently based on input type:

  1. If input x is a tuple of length N (or the input is xs with N x's), matching the number of layers, then each layer receives a new input x[i] combined with the previous output y[i-1] using connection. Thus (y1, y2, y3) = PairwiseFusion(connection, layer1, layer2, layer3)((x1, x2, x3)) may be drawn as:

x1 → layer1 → y1 ↘
                   connection → layer2 → y2 ↘
               x2 ↗                          connection → layer3 → y3
                                         x3 ↗

... or written as:

y1 = layer1(x1)
@@ -445,7 +445,7 @@
 y3 = layer3(connection(y2, x3))
  2. With just one input, each layer receives the same x combined with the previous output. Thus y = PairwiseFusion(connection, layers...)(x) obeys:
y[1] == layers[1](x)
 for i in 2:length(layers)
     y[i] == layers[i](connection(y[i-1], x))
-end

Returns

A tuple of length N with the output of each fusion ((y1, y2, ..., yN) in the example above).

source

Recurrent Models

Much like the core layers above, but can be used to process sequence data (as well as other kinds of structured data).

Flux.RNNFunction
RNN(in => out, σ = tanh)

The most basic recurrent layer; essentially acts as a Dense layer, but with the output fed back into the input each time step.

The arguments in and out describe the size of the feature vectors passed as input and as output. That is, it accepts a vector of length in or a batch of vectors represented as a in x B matrix and outputs a vector of length out or a batch of vectors of size out x B.

This constructor is syntactic sugar for Recur(RNNCell(a...)), and so RNNs are stateful. Note that the state shape can change depending on the inputs, and so it is good to reset! the model between inference calls if the batch size changes. See the examples below.

Examples

julia> r = RNN(3 => 5)
+end

Returns

A tuple of length N with the output of each fusion ((y1, y2, ..., yN) in the example above).
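
The tuple-input case can be traced with a small plain-Julia sketch. The function pairwise_fusion_sketch is hypothetical and is not Flux's implementation; it assumes, as in the written-out equations, that each layer is applied to connection(previous output, next input).

```julia
# Sketch of the tuple-input case: each layer sees connection(previous output, next input).
# pairwise_fusion_sketch is a hypothetical name, not Flux's implementation.
function pairwise_fusion_sketch(connection, layers::Tuple, xs::Tuple)
    length(layers) == length(xs) || throw(ArgumentError("need one input per layer"))
    y = layers[1](xs[1])
    ys = Any[y]
    for i in 2:length(layers)
        y = layers[i](connection(y, xs[i]))
        push!(ys, y)
    end
    return Tuple(ys)
end

layers = (x -> x + 1, x -> 2x, x -> x^2)
pairwise_fusion_sketch(+, layers, (1, 2, 3))   # (2, 8, 121)
```

Here y1 = 1 + 1 = 2, y2 = 2 * (2 + 2) = 8, and y3 = (8 + 3)^2 = 121.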

source

Recurrent Models

Much like the core layers above, but can be used to process sequence data (as well as other kinds of structured data).

Flux.RNNFunction
RNN(in => out, σ = tanh)

The most basic recurrent layer; essentially acts as a Dense layer, but with the output fed back into the input each time step.

The arguments in and out describe the size of the feature vectors passed as input and as output. That is, it accepts a vector of length in or a batch of vectors represented as a in x B matrix and outputs a vector of length out or a batch of vectors of size out x B.

This constructor is syntactic sugar for Recur(RNNCell(a...)), and so RNNs are stateful. Note that the state shape can change depending on the inputs, and so it is good to reset! the model between inference calls if the batch size changes. See the examples below.

Examples

julia> r = RNN(3 => 5)
 Recur(
   RNNCell(3 => 5, tanh),                # 50 parameters
 )         # Total: 4 trainable arrays, 50 parameters,
@@ -484,7 +484,7 @@
 julia> r = Flux.Recur(Flux.RNNCell(tanh, rand(5, 4), Tridiagonal(rand(5, 5)), rand(5), rand(5, 1)))
 
 julia> r(rand(4, 10)) |> size # batch size of 10
-(5, 10)
source
Flux.LSTMFunction
LSTM(in => out)

Long Short Term Memory recurrent layer. Behaves like an RNN but generally exhibits a longer memory span over sequences.

The arguments in and out describe the size of the feature vectors passed as input and as output. That is, it accepts a vector of length in or a batch of vectors represented as a in x B matrix and outputs a vector of length out or a batch of vectors of size out x B.

This constructor is syntactic sugar for Recur(LSTMCell(a...)), and so LSTMs are stateful. Note that the state shape can change depending on the inputs, and so it is good to reset! the model between inference calls if the batch size changes. See the examples below.

See this article for a good overview of the internals.

Examples

julia> l = LSTM(3 => 5)
+(5, 10)
source
Flux.LSTMFunction
LSTM(in => out)

Long Short Term Memory recurrent layer. Behaves like an RNN but generally exhibits a longer memory span over sequences.

The arguments in and out describe the size of the feature vectors passed as input and as output. That is, it accepts a vector of length in or a batch of vectors represented as a in x B matrix and outputs a vector of length out or a batch of vectors of size out x B.

This constructor is syntactic sugar for Recur(LSTMCell(a...)), and so LSTMs are stateful. Note that the state shape can change depending on the inputs, and so it is good to reset! the model between inference calls if the batch size changes. See the examples below.

See this article for a good overview of the internals.

Examples

julia> l = LSTM(3 => 5)
 Recur(
   LSTMCell(3 => 5),                     # 190 parameters
 )         # Total: 5 trainable arrays, 190 parameters,
@@ -496,7 +496,7 @@
 julia> Flux.reset!(l);
 
 julia> l(rand(Float32, 3, 10)) |> size # batch size of 10
-(5, 10)
Batch size changes

Failing to call reset! when the input batch size changes can lead to unexpected behavior. See the example in RNN.

Note:

LSTMCells can be constructed directly by specifying the non-linear function, the Wi and Wh internal matrices, a bias vector b, and a learnable initial state state0. The Wi and Wh matrices do not need to be the same type. See the example in RNN.

source
Flux.GRUFunction
GRU(in => out)

Gated Recurrent Unit layer. Behaves like an RNN but generally exhibits a longer memory span over sequences. This implements the variant proposed in v1 of the referenced paper.

The integer arguments in and out describe the size of the feature vectors passed as input and as output. That is, it accepts a vector of length in or a batch of vectors represented as a in x B matrix and outputs a vector of length out or a batch of vectors of size out x B.

This constructor is syntactic sugar for Recur(GRUCell(a...)), and so GRUs are stateful. Note that the state shape can change depending on the inputs, and so it is good to reset! the model between inference calls if the batch size changes. See the examples below.

See this article for a good overview of the internals.

Examples

julia> g = GRU(3 => 5)
+(5, 10)
Batch size changes

Failing to call reset! when the input batch size changes can lead to unexpected behavior. See the example in RNN.

Note:

LSTMCells can be constructed directly by specifying the non-linear function, the Wi and Wh internal matrices, a bias vector b, and a learnable initial state state0. The Wi and Wh matrices do not need to be the same type. See the example in RNN.

source
Flux.GRUFunction
GRU(in => out)

Gated Recurrent Unit layer. Behaves like an RNN but generally exhibits a longer memory span over sequences. This implements the variant proposed in v1 of the referenced paper.

The integer arguments in and out describe the size of the feature vectors passed as input and as output. That is, it accepts a vector of length in or a batch of vectors represented as a in x B matrix and outputs a vector of length out or a batch of vectors of size out x B.

This constructor is syntactic sugar for Recur(GRUCell(a...)), and so GRUs are stateful. Note that the state shape can change depending on the inputs, and so it is good to reset! the model between inference calls if the batch size changes. See the examples below.

See this article for a good overview of the internals.

Examples

julia> g = GRU(3 => 5)
 Recur(
   GRUCell(3 => 5),                      # 140 parameters
 )         # Total: 4 trainable arrays, 140 parameters,
@@ -508,7 +508,7 @@
 julia> Flux.reset!(g);
 
 julia> g(rand(Float32, 3, 10)) |> size # batch size of 10
-(5, 10)
Batch size changes

Failing to call reset! when the input batch size changes can lead to unexpected behavior. See the example in RNN.

Note:

GRUCells can be constructed directly by specifying the non-linear function, the Wi and Wh internal matrices, a bias vector b, and a learnable initial state state0. The Wi and Wh matrices do not need to be the same type. See the example in RNN.

source
Flux.GRUv3Function
GRUv3(in => out)

Gated Recurrent Unit layer. Behaves like an RNN but generally exhibits a longer memory span over sequences. This implements the variant proposed in v3 of the referenced paper.

The arguments in and out describe the size of the feature vectors passed as input and as output. That is, it accepts a vector of length in or a batch of vectors represented as a in x B matrix and outputs a vector of length out or a batch of vectors of size out x B.

This constructor is syntactic sugar for Recur(GRUv3Cell(a...)), and so GRUv3s are stateful. Note that the state shape can change depending on the inputs, and so it is good to reset! the model between inference calls if the batch size changes. See the examples below.

See this article for a good overview of the internals.

Examples

julia> g = GRUv3(3 => 5)
+(5, 10)
Batch size changes

Failing to call reset! when the input batch size changes can lead to unexpected behavior. See the example in RNN.

Note:

GRUCells can be constructed directly by specifying the non-linear function, the Wi and Wh internal matrices, a bias vector b, and a learnable initial state state0. The Wi and Wh matrices do not need to be the same type. See the example in RNN.

source
Flux.GRUv3Function
GRUv3(in => out)

Gated Recurrent Unit layer. Behaves like an RNN but generally exhibits a longer memory span over sequences. This implements the variant proposed in v3 of the referenced paper.

The arguments in and out describe the size of the feature vectors passed as input and as output. That is, it accepts a vector of length in or a batch of vectors represented as a in x B matrix and outputs a vector of length out or a batch of vectors of size out x B.

This constructor is syntactic sugar for Recur(GRUv3Cell(a...)), and so GRUv3s are stateful. Note that the state shape can change depending on the inputs, and so it is good to reset! the model between inference calls if the batch size changes. See the examples below.

See this article for a good overview of the internals.

Examples

julia> g = GRUv3(3 => 5)
 Recur(
   GRUv3Cell(3 => 5),                    # 140 parameters
 )         # Total: 5 trainable arrays, 140 parameters,
@@ -520,7 +520,7 @@
 julia> Flux.reset!(g);
 
 julia> g(rand(Float32, 3, 10)) |> size # batch size of 10
-(5, 10)
Batch size changes

Failing to call reset! when the input batch size changes can lead to unexpected behavior. See the example in RNN.

Note:

GRUv3Cells can be constructed directly by specifying the non-linear function, the Wi, Wh, and Wh_h internal matrices, a bias vector b, and a learnable initial state state0. The Wi, Wh, and Wh_h matrices do not need to be the same type. See the example in RNN.

source
Flux.RecurType
Recur(cell)

Recur takes a recurrent cell and makes it stateful, managing the hidden state in the background. cell should be a model of the form:

h, y = cell(h, x...)

For example, here's a recurrent network that keeps a running total of its inputs:

Examples

julia> accum(h, x) = (h + x, x)
+(5, 10)
Batch size changes

Failing to call reset! when the input batch size changes can lead to unexpected behavior. See the example in RNN.

Note:

GRUv3Cells can be constructed directly by specifying the non-linear function, the Wi, Wh, and Wh_h internal matrices, a bias vector b, and a learnable initial state state0. The Wi, Wh, and Wh_h matrices do not need to be the same type. See the example in RNN.

source
Flux.RecurType
Recur(cell)

Recur takes a recurrent cell and makes it stateful, managing the hidden state in the background. cell should be a model of the form:

h, y = cell(h, x...)
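
The idea of wrapping such a cell so that the state is carried between calls can be sketched as follows. RecurSketch and running_max are hypothetical names used only for illustration; this is not the Flux.Recur type.

```julia
# Minimal stateful wrapper around a cell of the form (h, x...) -> (h_new, y).
# RecurSketch is a hypothetical type for illustration, not Flux.Recur.
mutable struct RecurSketch{C,S}
    cell::C
    state::S
end

function (r::RecurSketch)(x...)
    r.state, y = r.cell(r.state, x...)   # store the new state, return the output
    return y
end

# A cell that remembers the largest input seen so far and outputs it.
running_max(h, x) = (max(h, x), max(h, x))

r = RecurSketch(running_max, 0)
[r(x) for x in (3, 1, 7, 5)]   # [3, 3, 7, 7]
r.state                        # 7
```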

For example, here's a recurrent network that keeps a running total of its inputs:

Examples

julia> accum(h, x) = (h + x, x)
 accum (generic function with 1 method)
 
 julia> rnn = Flux.Recur(accum, 0)
@@ -571,7 +571,7 @@
 
 julia> rnn.state
 1×1 Matrix{Int64}:
- 60
source
Flux.reset!Function
reset!(rnn)

Reset the hidden state of a recurrent layer back to its original value.

Assuming you have a Recur layer rnn, this is roughly equivalent to:

rnn.state = hidden(rnn.cell)

Examples

julia> r = Flux.RNNCell(relu, ones(1,1), zeros(1,1), ones(1,1), zeros(1,1));  # users should use the RNN wrapper struct instead
+ 60
source
Flux.reset!Function
reset!(rnn)

Reset the hidden state of a recurrent layer back to its original value.

Assuming you have a Recur layer rnn, this is roughly equivalent to:

rnn.state = hidden(rnn.cell)

Examples

julia> r = Flux.RNNCell(relu, ones(1,1), zeros(1,1), ones(1,1), zeros(1,1));  # users should use the RNN wrapper struct instead
 
 julia> y = Flux.Recur(r, ones(1,1));
 
@@ -593,7 +593,7 @@
 
 julia> y.state
 1×1 Matrix{Float64}:
- 0.0
source

Normalisation & Regularisation

These layers don't affect the structure of the network but may improve training times or reduce overfitting. Some of them contain trainable parameters, while others do not.

Flux.BatchNormType
BatchNorm(channels::Integer, λ=identity;
+ 0.0
source

Normalisation & Regularisation

These layers don't affect the structure of the network but may improve training times or reduce overfitting. Some of them contain trainable parameters, while others do not.

Flux.BatchNormType
BatchNorm(channels::Integer, λ=identity;
           initβ=zeros32, initγ=ones32,
           affine=true, track_stats=true, active=nothing,
           eps=1f-5, momentum= 0.1f0)

Batch Normalization layer. channels should be the size of the channel dimension in your data (see below).

Given an array with N dimensions, call the N-1th the channel dimension. For a batch of feature vectors this is just the data dimension, for WHCN images it's the usual channel dimension.

BatchNorm computes the mean and variance for each D_1×...×D_{N-2}×1×D_N input slice and normalises the input accordingly.

If affine=true, it also applies a shift and a rescale to the input through learnable per-channel bias β and scale γ parameters.

After normalisation, elementwise activation λ is applied.

If track_stats=true, accumulates mean and var statistics in training phase that will be used to renormalize the input in test phase.

Use testmode! during inference.
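
The per-channel slice normalisation described above can be sketched in plain Julia for a WHCN array. This is only an illustration under stated assumptions: `channel_normalise` is a hypothetical helper, it covers training mode with affine=false and no tracked statistics, and the use of the uncorrected variance is an assumption rather than something this page states.

```julia
using Statistics

# Hypothetical helper, not Flux's implementation.
# Normalises each channel of a WHCN array using statistics taken over
# the width, height and batch dimensions.
function channel_normalise(x::AbstractArray{T,4}; eps = 1f-5) where {T}
    dims = (1, 2, 4)                         # every dimension except the channel one
    μ = mean(x; dims = dims)
    σ² = var(x; dims = dims, corrected = false, mean = μ)   # assumption: uncorrected variance
    return (x .- μ) ./ sqrt.(σ² .+ eps)
end

xs = rand(Float32, 3, 3, 3, 2)   # a batch of 2 images with 3 channels each
y = channel_normalise(xs)
```

Each channel of `y` then has a mean close to 0 and a standard deviation close to 1, which is the property the Examples below check for the real layer.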

Examples

julia> using Statistics
@@ -605,7 +605,7 @@
 julia> Flux.trainmode!(m);
 
 julia> isapprox(std(m(xs)), 1, atol=0.1) && std(xs) != std(m(xs))
-true
source
Flux.DropoutType
Dropout(p; [dims, rng, active])

Layer implementing dropout with the given probability. This is used as a regularisation, i.e. to reduce overfitting.

While training, it sets each input to 0 (with probability p) or else scales it by 1 / (1 - p), using the NNlib.dropout function. While testing, it has no effect.

By default the mode will switch automatically, but it can also be controlled manually via Flux.testmode!, or by passing keyword active=true for training mode.

By default every input is treated independently. With the dims keyword, instead it takes a random choice only along that dimension. For example Dropout(p; dims = 3) will randomly zero out entire channels on WHCN input (also called 2D dropout).

Keyword rng lets you specify a custom random number generator. (Only supported on the CPU.)

Examples

julia> m = Chain(Dense(ones(3,2)), Dropout(0.4))
+true
source
Flux.DropoutType
Dropout(p; [dims, rng, active])

Layer implementing dropout with the given probability. This is used as a regularisation, i.e. to reduce overfitting.

While training, it sets each input to 0 (with probability p) or else scales it by 1 / (1 - p), using the NNlib.dropout function. While testing, it has no effect.

By default the mode will switch automatically, but it can also be controlled manually via Flux.testmode!, or by passing keyword active=true for training mode.

By default every input is treated independently. With the dims keyword, instead it takes a random choice only along that dimension. For example Dropout(p; dims = 3) will randomly zero out entire channels on WHCN input (also called 2D dropout).

Keyword rng lets you specify a custom random number generator. (Only supported on the CPU.)
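
The training-time rule (zero with probability p, otherwise scale by 1 / (1 - p)) can be sketched without Flux as follows. `dropout_sketch` is a hypothetical helper written only for illustration; it is not the NNlib.dropout implementation and ignores the dims keyword.

```julia
using Random

# Hypothetical helper, not NNlib.dropout.
# Each element is zeroed with probability p; survivors are scaled by 1 / (1 - p)
# so that the expected value of the output matches the input.
function dropout_sketch(rng::AbstractRNG, x::AbstractArray, p::Real)
    keep = rand(rng, Float64, size(x)) .>= p    # true where the element survives
    return x .* keep ./ (1 - p)
end

rng = MersenneTwister(0)
x = ones(10_000)
y = dropout_sketch(rng, x, 0.4)

frac_zero = count(iszero, y) / length(y)   # fraction of dropped elements, near p
avg = sum(y) / length(y)                   # stays near 1, the mean of x
```

The rescaling is the reason the layer can simply be switched off at test time: the expected activation is the same in both modes.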

Examples

julia> m = Chain(Dense(ones(3,2)), Dropout(0.4))
 Chain(
   Dense(2 => 3),                        # 9 parameters
   Dropout(0.4),
@@ -637,7 +637,7 @@
 1.9989999999999961
 
 julia> mean(iszero, y)  # is about 0.4
-0.4003
source
Flux.AlphaDropoutType
AlphaDropout(p; [rng, active])

A dropout layer. Used in Self-Normalizing Neural Networks. The AlphaDropout layer ensures that mean and variance of activations remain the same as before.

Does nothing to the input once testmode! is true.

Examples

julia> using Statistics
+0.4003
source
Flux.AlphaDropoutType
AlphaDropout(p; [rng, active])

A dropout layer. Used in Self-Normalizing Neural Networks. The AlphaDropout layer ensures that mean and variance of activations remain the same as before.

Does nothing to the input once testmode! is true.

Examples

julia> using Statistics
 
 julia> x = randn32(1000,1);
 
@@ -648,7 +648,7 @@
 julia> y = m(x);
 
 julia> isapprox(std(x), std(y), atol=0.2)
-true
source
Flux.LayerNormType
LayerNorm(size..., λ=identity; affine=true, eps=1f-5)

A normalisation layer designed to be used with recurrent hidden states. The argument size should be an integer or a tuple of integers.

In the forward pass, the layer normalises the mean and standard deviation of the input, then applies the elementwise activation λ. The input is normalised along the first length(size) dimensions for tuple size, and along the first dimension for integer size. The input is expected to have first dimensions' size equal to size.

If affine=true, it also applies a learnable shift and rescaling using the Scale layer.

See also BatchNorm, InstanceNorm, GroupNorm, and normalise.

Examples

julia> using Statistics
+true
source
Flux.LayerNormType
LayerNorm(size..., λ=identity; affine=true, eps=1f-5)

A normalisation layer designed to be used with recurrent hidden states. The argument size should be an integer or a tuple of integers.

In the forward pass, the layer normalises the mean and standard deviation of the input, then applies the elementwise activation λ. The input is normalised along the first length(size) dimensions for tuple size, and along the first dimension for integer size. The input is expected to have first dimensions' size equal to size.

If affine=true, it also applies a learnable shift and rescaling using the Scale layer.

See also BatchNorm, InstanceNorm, GroupNorm, and normalise.

Examples

julia> using Statistics
 
 julia> xs = rand(3, 3, 3, 2);  # a batch of 2 images, each having 3 channels
 
@@ -657,7 +657,7 @@
 julia> y = m(xs);
 
 julia> isapprox(std(y, dims=1:3), ones(1, 1, 1, 2), atol=0.1) && std(y, dims=1:3) != std(xs, dims=1:3)
-true
source
Flux.InstanceNormType
InstanceNorm(channels::Integer, λ=identity;
+true
source
Flux.InstanceNormType
InstanceNorm(channels::Integer, λ=identity;
              initβ=zeros32, initγ=ones32,
              affine=false, track_stats=false,
              eps=1f-5, momentum=0.1f0)

Instance Normalization layer. channels should be the size of the channel dimension in your data (see below).

Given an array with N > 2 dimensions, call the N-1th the channel dimension. For WHCN images it's the usual channel dimension.

InstanceNorm computes the mean and variance for each D_1×...×D_{N-2}×1×1 input slice and normalises the input accordingly.

If affine=true, it also applies a shift and a rescale to the input through learnable per-channel bias β and scale γ parameters.

If track_stats=true, accumulates mean and var statistics in training phase that will be used to renormalize the input in test phase.

Warning: the defaults for affine and track_stats used to be true in previous Flux versions (< v0.12).

Examples

julia> using Statistics
@@ -669,7 +669,7 @@
 julia> y = m(xs);
 
 julia> isapprox(std(y, dims=1:2), ones(1, 1, 3, 2), atol=0.2) && std(y, dims=1:2) != std(xs, dims=1:2)
-true
source
Flux.GroupNormType
GroupNorm(channels::Int, G::Int, λ = identity;
+true
source
Flux.GroupNormType
GroupNorm(channels::Int, G::Int, λ = identity;
           initβ = zeros32,
           initγ = ones32,
           affine = true,
@@ -686,7 +686,7 @@
 true
 
 julia> isapprox(std(y[:, :, 3:4, 2]), 1, atol=0.1) && std(xs[:, :, 3:4, 2]) != std(y[:, :, 3:4, 2])
-true
source
Flux.normaliseFunction
normalise(x; dims=ndims(x), eps=1e-5)

Normalise x to mean 0 and standard deviation 1 across the dimension(s) given by dims. Per default, dims is the last dimension. eps is a small term added to the denominator for numerical stability.

Examples

julia> using Statistics
+true
source
Flux.normaliseFunction
normalise(x; dims=ndims(x), eps=1e-5)

Normalise x to mean 0 and standard deviation 1 across the dimension(s) given by dims. Per default, dims is the last dimension. eps is a small term added to the denominator for numerical stability.
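
A minimal sketch of this computation in plain Julia, assuming the uncorrected standard deviation and eps added directly to it (Flux's own code may differ in detail; `normalise_sketch` is a hypothetical name):

```julia
using Statistics

# Hypothetical re-implementation for illustration only.
function normalise_sketch(x::AbstractArray; dims = ndims(x), eps = 1e-5)
    μ = mean(x; dims = dims)
    σ = std(x; dims = dims, mean = μ, corrected = false)   # assumption: uncorrected std
    return (x .- μ) ./ (σ .+ eps)                          # eps guards against σ == 0
end

x = [90.0, 100.0, 110.0, 130.0, 70.0]
y = normalise_sketch(x)
```

Here `y` has mean approximately 0 and uncorrected standard deviation approximately 1.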

Examples

julia> using Statistics
 
 julia> x = [90, 100, 110, 130, 70];
 
@@ -709,7 +709,7 @@
 julia> y = Flux.normalise(x, dims=1);
 
 julia> isapprox(std(y; dims=1, corrected=false), ones(1, 10), atol=1e-5)
-true
source

Test vs. Train

Several normalisation layers behave differently under training and inference (testing). By default, Flux will automatically determine when a layer evaluation is part of training or inference.

Warning

This automatic train/test detection works best with Zygote, the default automatic differentiation package. It may not work with other packages such as Tracker, Yota, or ForwardDiff.

The functions Flux.trainmode! and Flux.testmode! let you manually specify which behaviour you want. When called on a model, they will place all layers within the model into the specified mode.

Flux.testmode!Method
testmode!(model, [mode]) -> model

Set a layer, or all layers in a model, to test mode. This disables the effect of Dropout and some other regularisation layers.

If you manually set a model into test mode, you need to manually place it back into train mode during the training phase, using trainmode!.

There is an optional second argument, which takes a symbol :auto to reset all layers back to the default automatic mode.

Example

julia> d = Dropout(0.3)
+true
source

Test vs. Train

Several normalisation layers behave differently under training and inference (testing). By default, Flux will automatically determine when a layer evaluation is part of training or inference.

Warning

This automatic train/test detection works best with Zygote, the default automatic differentiation package. It may not work with other packages such as Tracker, Yota, or ForwardDiff.

The functions Flux.trainmode! and Flux.testmode! let you manually specify which behaviour you want. When called on a model, they will place all layers within the model into the specified mode.

Flux.testmode!Method
testmode!(model, [mode]) -> model

Set a layer, or all layers in a model, to test mode. This disables the effect of Dropout and some other regularisation layers.

If you manually set a model into test mode, you need to manually place it back into train mode during the training phase, using trainmode!.

There is an optional second argument, which takes a symbol :auto to reset all layers back to the default automatic mode.

Example

julia> d = Dropout(0.3)
 Dropout(0.3)
 
 julia> testmode!(d)   # dropout is now always disabled
@@ -719,4 +719,4 @@
 Dropout(0.3, active=true)
 
 julia> testmode!(d, :auto)  # back to default
-Dropout(0.3)
source
Flux.testmode!Method
testmode!(model, inactive)

This two-argument method is largely internal. It recurses into the model until a method like testmode!(d::Dropout, inactive) alters the activity of a layer. Custom layers can support manual testmode! / trainmode! switching by defining such a method.

Possible values of inactive are:

  • true for testing, i.e. active=false
  • false for training, same as trainmode!(m)
  • :auto or nothing for Flux to detect training automatically.
Compat

This method may be removed in a future breaking change, to separate the user-facing testmode! from the internal recursion.

source
Flux.trainmode!Function
trainmode!(model) -> model

Set a layer, or all layers in a model, to training mode. Opposite to testmode!, see further details there.

source
trainmode!(m, active)
Warning

This two-argument method is deprecated.

Possible values of active are:

  • true for training, or
  • false for testing, same as testmode!(m)
  • :auto or nothing for Flux to detect training automatically.
source
+Dropout(0.3)
source
Flux.testmode!Method
testmode!(model, inactive)

This two-argument method is largely internal. It recurses into the model until a method like testmode!(d::Dropout, inactive) alters the activity of a layer. Custom layers can support manual testmode! / trainmode! switching by defining such a method.

Possible values of inactive are:

  • true for testing, i.e. active=false
  • false for training, same as trainmode!(m)
  • :auto or nothing for Flux to detect training automatically.
Compat

This method may be removed in a future breaking change, to separate the user-facing testmode! from the internal recursion.

source
Flux.trainmode!Function
trainmode!(model) -> model

Set a layer, or all layers in a model, to training mode. Opposite to testmode!, see further details there.

source
trainmode!(m, active)
Warning

This two-argument method is deprecated.

Possible values of active are:

  • true for training, or
  • false for testing, same as testmode!(m)
  • :auto or nothing for Flux to detect training automatically.
source
diff --git a/previews/PR2464/reference/models/losses/index.html b/previews/PR2464/reference/models/losses/index.html index c264a7e8c5..c70c3c1034 100644 --- a/previews/PR2464/reference/models/losses/index.html +++ b/previews/PR2464/reference/models/losses/index.html @@ -10,16 +10,16 @@ loss(ŷ, y, agg=identity) # no aggregation.

Function listing

Flux.Losses.maeFunction
mae(ŷ, y; agg = mean)

Return the loss corresponding to mean absolute error:

agg(abs.(ŷ .- y))

Example

julia> y_model = [1.1, 1.9, 3.1];
 
 julia> Flux.mae(y_model, 1:3)
-0.10000000000000009
source
Flux.Losses.mseFunction
mse(ŷ, y; agg = mean)

Return the loss corresponding to mean square error:

agg((ŷ .- y) .^ 2)

See also: mae, msle, crossentropy.

Example

julia> y_model = [1.1, 1.9, 3.1];
+0.10000000000000009
source
Flux.Losses.mseFunction
mse(ŷ, y; agg = mean)

Return the loss corresponding to mean square error:

agg((ŷ .- y) .^ 2)

See also: mae, msle, crossentropy.

Example

julia> y_model = [1.1, 1.9, 3.1];
 
 julia> y_true = 1:3;
 
 julia> Flux.mse(y_model, y_true)
-0.010000000000000018
source
Flux.Losses.msleFunction
msle(ŷ, y; agg = mean, eps = eps(eltype(ŷ)))

The loss corresponding to mean squared logarithmic errors, calculated as

agg((log.(ŷ .+ ϵ) .- log.(y .+ ϵ)) .^ 2)

The ϵ == eps term provides numerical stability. Penalizes an under-estimation more than an over-estimation.

Example

julia> Flux.msle(Float32[1.1, 2.2, 3.3], 1:3)
+0.010000000000000018
source
Flux.Losses.msleFunction
msle(ŷ, y; agg = mean, eps = eps(eltype(ŷ)))

The loss corresponding to mean squared logarithmic errors, calculated as

agg((log.(ŷ .+ ϵ) .- log.(y .+ ϵ)) .^ 2)

The ϵ == eps term provides numerical stability. Penalizes an under-estimation more than an over-estimation.

Example

julia> Flux.msle(Float32[1.1, 2.2, 3.3], 1:3)
 0.009084041f0
 
 julia> Flux.msle(Float32[0.9, 1.8, 2.7], 1:3)
-0.011100831f0
source
Flux.Losses.huber_lossFunction
huber_loss(ŷ, y; delta = 1, agg = mean)

Return the mean of the Huber loss given the prediction and true values y.

             | 0.5 * |ŷ - y|^2,            for |ŷ - y| <= δ
+0.011100831f0
source
Flux.Losses.huber_lossFunction
huber_loss(ŷ, y; delta = 1, agg = mean)

Return the mean of the Huber loss given the prediction and true values y.

             | 0.5 * |ŷ - y|^2,            for |ŷ - y| <= δ
 Huber loss = |
              |  δ * (|ŷ - y| - 0.5 * δ), otherwise
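
The piecewise definition above can be written out directly in plain Julia. This is a sketch for illustration, not Flux's implementation, and `huber_sketch` is a hypothetical name.

```julia
using Statistics

# Hypothetical helper, written straight from the piecewise definition.
function huber_sketch(ŷ, y; delta = 1, agg = mean)
    abs_error = abs.(ŷ .- y)
    quadratic = 0.5 .* abs_error .^ 2                # branch for |ŷ - y| <= δ
    linear = delta .* (abs_error .- 0.5 * delta)     # branch for |ŷ - y| > δ
    return agg(ifelse.(abs_error .<= delta, quadratic, linear))
end

ŷ = [1.1, 2.1, 3.1]
small_errors = huber_sketch(ŷ, 1:3)                 # all errors fall in the quadratic branch
large_errors = huber_sketch(ŷ, 1:3; delta = 0.05)   # all errors fall in the linear branch
```

With every error equal to about 0.1, the first call gives roughly 0.005 and the second roughly 0.00375, matching the values in the Example below.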

Example

julia> ŷ = [1.1, 2.1, 3.1];
 
@@ -27,7 +27,7 @@
 0.005000000000000009
 
 julia> Flux.huber_loss(ŷ, 1:3, delta=0.05)  # changes behaviour as |ŷ - y| > δ
-0.003750000000000005
source
Flux.Losses.label_smoothingFunction
label_smoothing(y::Union{Number, AbstractArray}, α; dims::Int=1)

Returns smoothed labels, meaning the confidence in the label values is relaxed.

When y is given as a one-hot vector or a batch of one-hot vectors, it is calculated as

y .* (1 - α) .+ α / size(y, dims)

When y is given as a number or a batch of numbers for binary classification, it is calculated as

y .* (1 - α) .+ α / 2

in which case the labels are squeezed towards 0.5.

α is a number in the interval (0, 1) called the smoothing factor. The higher the value of α, the larger the smoothing of y.

dims denotes the one-hot dimension, unless dims=0 which denotes the application of label smoothing to binary distributions encoded in a single number.

Example

julia> y = Flux.onehotbatch([1, 1, 1, 0, 1, 0], 0:1)
+0.003750000000000005
source
Flux.Losses.label_smoothingFunction
label_smoothing(y::Union{Number, AbstractArray}, α; dims::Int=1)

Returns smoothed labels, meaning the confidence in the label values is relaxed.

When y is given as a one-hot vector or a batch of one-hot vectors, it is calculated as

y .* (1 - α) .+ α / size(y, dims)

When y is given as a number or a batch of numbers for binary classification, it is calculated as

y .* (1 - α) .+ α / 2

in which case the labels are squeezed towards 0.5.

α is a number in the interval (0, 1) called the smoothing factor. The higher the value of α, the larger the smoothing of y.

dims denotes the one-hot dimension, unless dims=0 which denotes the application of label smoothing to binary distributions encoded in a single number.
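
As a quick illustration of the one-hot formula, the following plain-Julia sketch applies it along the first dimension. `smooth_onehot` is a hypothetical helper that covers only the dims = 1 case.

```julia
# Hypothetical helper: one-hot case with the class dimension first (dims = 1).
smooth_onehot(y::AbstractMatrix, α::Real) = y .* (1 - α) .+ α / size(y, 1)

y = [0 0 0 1 0 1
     1 1 1 0 1 0]        # six one-hot columns over two classes
y_smoothed = smooth_onehot(y, 0.2)
```

With α = 0.2 and two classes, every 1 becomes about 0.9 and every 0 becomes about 0.1, so each column still sums to 1.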

Example

julia> y = Flux.onehotbatch([1, 1, 1, 0, 1, 0], 0:1)
 2×6 OneHotMatrix(::Vector{UInt32}) with eltype Bool:
  ⋅  ⋅  ⋅  1  ⋅  1
  1  1  1  ⋅  1  ⋅
@@ -51,7 +51,7 @@
 true
 
 julia> Flux.crossentropy(y_dis, y) > Flux.crossentropy(y_dis, y_smoothed)
-true
source
Flux.Losses.crossentropyFunction
crossentropy(ŷ, y; dims = 1, eps = eps(eltype(ŷ)), agg = mean)

Return the cross entropy between the given probability distributions; calculated as

agg(-sum(y .* log.(ŷ .+ ϵ); dims))

Cross entropy is typically used as a loss in multi-class classification, in which case the labels y are given in a one-hot format. dims specifies the dimension (or the dimensions) containing the class probabilities. The prediction is supposed to sum to one across dims, as would be the case with the output of a softmax operation.

For numerical stability, it is recommended to use logitcrossentropy rather than softmax followed by crossentropy.

Use label_smoothing to smooth the true labels as preprocessing before computing the loss.

See also: logitcrossentropy, binarycrossentropy, logitbinarycrossentropy.

Example

julia> y_label = Flux.onehotbatch([0, 1, 2, 1, 0], 0:2)
+true
source
Flux.Losses.crossentropyFunction
crossentropy(ŷ, y; dims = 1, eps = eps(eltype(ŷ)), agg = mean)

Return the cross entropy between the given probability distributions; calculated as

agg(-sum(y .* log.(ŷ .+ ϵ); dims))

Cross entropy is typically used as a loss in multi-class classification, in which case the labels y are given in a one-hot format. dims specifies the dimension (or the dimensions) containing the class probabilities. The prediction is supposed to sum to one across dims, as would be the case with the output of a softmax operation.

For numerical stability, it is recommended to use logitcrossentropy rather than softmax followed by crossentropy.

Use label_smoothing to smooth the true labels as preprocessing before computing the loss.

See also: logitcrossentropy, binarycrossentropy, logitbinarycrossentropy.

Example

julia> y_label = Flux.onehotbatch([0, 1, 2, 1, 0], 0:2)
 3×5 OneHotMatrix(::Vector{UInt32}) with eltype Bool:
  1  ⋅  ⋅  ⋅  1
  ⋅  1  ⋅  1  ⋅
@@ -80,7 +80,7 @@
  0.05  0.05  0.9   0.05  0.05
 
 julia> Flux.crossentropy(y_model, y_smooth)
-1.5776052f0
source
Flux.Losses.logitcrossentropyFunction
logitcrossentropy(ŷ, y; dims = 1, agg = mean)

Return the cross entropy calculated by

agg(-sum(y .* logsoftmax(ŷ; dims); dims))

This is mathematically equivalent to crossentropy(softmax(ŷ), y), but is more numerically stable than using functions crossentropy and softmax separately.

See also: binarycrossentropy, logitbinarycrossentropy, label_smoothing.

Example

julia> y_label = Flux.onehotbatch(collect("abcabaa"), 'a':'c')
+1.5776052f0
source
Flux.Losses.logitcrossentropyFunction
logitcrossentropy(ŷ, y; dims = 1, agg = mean)

Return the cross entropy calculated by

agg(-sum(y .* logsoftmax(ŷ; dims); dims))

This is mathematically equivalent to crossentropy(softmax(ŷ), y), but is more numerically stable than using functions crossentropy and softmax separately.
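
To see why working on logits is more stable, here is a plain-Julia sketch of the formula above with a hand-written log-softmax. The names `logsoftmax_sketch` and `logitce_sketch` are hypothetical, and subtracting the maximum is a common stabilisation trick assumed here rather than taken from this page.

```julia
using Statistics

# Hypothetical helper: log-softmax computed without ever forming exp of a large number.
function logsoftmax_sketch(x; dims = 1)
    shifted = x .- maximum(x; dims = dims)
    return shifted .- log.(sum(exp.(shifted); dims = dims))
end

# Hypothetical helper following agg(-sum(y .* logsoftmax(ŷ; dims); dims)).
logitce_sketch(ŷ, y; dims = 1, agg = mean) =
    agg(-sum(y .* logsoftmax_sketch(ŷ; dims = dims); dims = dims))

logits = [1000.0 0.0
          0.0 1000.0]
wrong_labels = [0 1
                1 0]                        # each column labels the low-scoring class
loss = logitce_sketch(logits, wrong_labels)
naive = exp.(logits) ./ sum(exp.(logits); dims = 1)   # overflows: contains NaN
```

The sketch returns a finite loss of about 1000, while the naive softmax of the same logits already contains NaN before any logarithm is taken.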

See also: binarycrossentropy, logitbinarycrossentropy, label_smoothing.

Example

julia> y_label = Flux.onehotbatch(collect("abcabaa"), 'a':'c')
 3×7 OneHotMatrix(::Vector{UInt32}) with eltype Bool:
  1  ⋅  ⋅  1  ⋅  1  1
  ⋅  1  ⋅  ⋅  1  ⋅  ⋅
@@ -96,7 +96,7 @@
 1.5791205f0
 
 julia> Flux.crossentropy(softmax(y_model), y_label)
-1.5791197f0
source
Flux.Losses.binarycrossentropyFunction
binarycrossentropy(ŷ, y; agg = mean, eps = eps(eltype(ŷ)))

Return the binary cross-entropy loss, computed as

agg(@.(-y * log(ŷ + ϵ) - (1 - y) * log(1 - ŷ + ϵ)))

Typically, the prediction is given by the output of a sigmoid activation. The ϵ == eps term is included to avoid infinity. Using logitbinarycrossentropy is recommended over binarycrossentropy for numerical stability.

Use label_smoothing to smooth the y value as preprocessing before computing the loss.

See also: crossentropy, logitcrossentropy.

Examples

julia> y_bin = Bool[1,0,1]
+1.5791197f0
source
Flux.Losses.binarycrossentropyFunction
binarycrossentropy(ŷ, y; agg = mean, eps = eps(eltype(ŷ)))

Return the binary cross-entropy loss, computed as

agg(@.(-y * log(ŷ + ϵ) - (1 - y) * log(1 - ŷ + ϵ)))

Typically, the prediction is given by the output of a sigmoid activation. The ϵ == eps term is included to avoid infinity. Using logitbinarycrossentropy is recommended over binarycrossentropy for numerical stability.

Use label_smoothing to smooth the y value as preprocessing before computing the loss.

See also: crossentropy, logitcrossentropy.

Examples

julia> y_bin = Bool[1,0,1]
 3-element Vector{Bool}:
  1
  0
@@ -119,7 +119,7 @@
  1  ⋅  1
 
 julia> Flux.crossentropy(y_prob, y_hot)
-0.43989f0
source
Flux.Losses.logitbinarycrossentropyFunction
logitbinarycrossentropy(ŷ, y; agg = mean)

Mathematically equivalent to binarycrossentropy(σ(ŷ), y) but is more numerically stable.

See also: crossentropy, logitcrossentropy.

Examples

julia> y_bin = Bool[1,0,1];
+0.43989f0
source
Flux.Losses.logitbinarycrossentropyFunction
logitbinarycrossentropy(ŷ, y; agg = mean)

Mathematically equivalent to binarycrossentropy(σ(ŷ), y) but is more numerically stable.

See also: crossentropy, logitcrossentropy.

Examples

julia> y_bin = Bool[1,0,1];
 
 julia> y_model = Float32[2, -1, pi]
 3-element Vector{Float32}:
@@ -131,7 +131,7 @@
 0.160832f0
 
 julia> Flux.binarycrossentropy(sigmoid.(y_model), y_bin)
-0.16083185f0
source
Flux.Losses.kldivergenceFunction
kldivergence(ŷ, y; agg = mean, eps = eps(eltype(ŷ)))

Return the Kullback-Leibler divergence between the given probability distributions.

The KL divergence is a measure of how much one probability distribution is different from the other. It is always non-negative, and zero only when both the distributions are equal.

Example

julia> p1 = [1 0; 0 1]
+0.16083185f0
source
Flux.Losses.kldivergenceFunction
kldivergence(ŷ, y; agg = mean, eps = eps(eltype(ŷ)))

Return the Kullback-Leibler divergence between the given probability distributions.

The KL divergence is a measure of how much one probability distribution is different from the other. It is always non-negative, and zero only when both the distributions are equal.

Example

julia> p1 = [1 0; 0 1]
 2×2 Matrix{Int64}:
  1  0
  0  1
@@ -151,10 +151,10 @@
 0.0
 
 julia> Flux.kldivergence(p1, p2; eps = 0)  # about 17.3 with the regulator
-Inf
source
Flux.Losses.poisson_lossFunction
poisson_loss(ŷ, y; agg = mean)

Return how much the predicted distribution diverges from the expected Poisson distribution y; calculated as

sum(ŷ .- y .* log.(ŷ)) / size(y, 2)

More information..

Example

julia> y_model = [1, 3, 3];  # data should only take integral values
+Inf
source
Flux.Losses.poisson_lossFunction
poisson_loss(ŷ, y; agg = mean)

Return how much the predicted distribution diverges from the expected Poisson distribution y; calculated as

sum(ŷ .- y .* log.(ŷ)) / size(y, 2)

More information..

Example

julia> y_model = [1, 3, 3];  # data should only take integral values
 
 julia> Flux.poisson_loss(y_model, 1:3)
-0.5023128522198171
source
Flux.Losses.hinge_lossFunction
hinge_loss(ŷ, y; agg = mean)

Return the hinge_loss given the prediction and true labels y (containing 1 or -1); calculated as

sum(max.(0, 1 .- ŷ .* y)) / size(y, 2)

Usually used with classifiers like Support Vector Machines. See also: squared_hinge_loss

Example

julia> y_true = [1, -1, 1, 1];
+0.5023128522198171
source
Flux.Losses.hinge_lossFunction
hinge_loss(ŷ, y; agg = mean)

Return the hinge_loss given the prediction and true labels y (containing 1 or -1); calculated as

sum(max.(0, 1 .- ŷ .* y)) / size(y, 2)

Usually used with classifiers like Support Vector Machines. See also: squared_hinge_loss
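
The hinge computation can be sketched in plain Julia as below. `hinge_sketch` is a hypothetical helper; it aggregates with agg = mean as in the signature, which is an assumption about how the sum in the formula above is normalised.

```julia
using Statistics

# Hypothetical helper: zero loss once ŷ .* y reaches the margin of 1.
hinge_sketch(ŷ, y; agg = mean) = agg(max.(0, 1 .- ŷ .* y))

y_true = [1, -1, 1, 1]
y_pred = [0.1, 0.3, 1, 1.5]
loss = hinge_sketch(y_pred, y_true)
per_sample = max.(0, 1 .- y_pred .* y_true)
```

The per-sample terms are about 0.9, 1.3, 0 and 0: predictions with the correct sign and a magnitude of at least 1 contribute nothing, while the second prediction is penalised for having the wrong sign.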

Example

julia> y_true = [1, -1, 1, 1];
 
 julia> y_pred = [0.1, 0.3, 1, 1.5];
 
@@ -168,7 +168,7 @@
 true
 
 julia> Flux.hinge_loss(y_pred[2], y_true[2]) != 0 # opposite signs
-true
source
Flux.Losses.squared_hinge_lossFunction
squared_hinge_loss(ŷ, y)

Return the squared hinge loss given the prediction and true labels y (containing 1 or -1); calculated as

sum((max.(0, 1 .- ŷ .* y)).^2) / size(y, 2)

Usually used with classifiers like Support Vector Machines. See also: hinge_loss

Example

julia> y_true = [1, -1, 1, 1];
+true
source
Flux.Losses.squared_hinge_lossFunction
squared_hinge_loss(ŷ, y)

Return the squared hinge loss given the prediction and true labels y (containing 1 or -1); calculated as

sum((max.(0, 1 .- ŷ .* y)).^2) / size(y, 2)

Usually used with classifiers like Support Vector Machines. See also: hinge_loss

Example

julia> y_true = [1, -1, 1, 1];
 
 julia> y_pred = [0.1, 0.3, 1, 1.5];
 
@@ -182,13 +182,13 @@
 true
 
 julia> Flux.squared_hinge_loss(y_pred[2], y_true[2]) != 0
-true
source
Flux.Losses.dice_coeff_lossFunction
dice_coeff_loss(ŷ, y; smooth = 1)

Return a loss based on the dice coefficient. Used in the V-Net image segmentation architecture. The dice coefficient is similar to the F1_score. Loss calculated as:

1 - 2*sum(|ŷ .* y| + smooth) / (sum(ŷ.^2) + sum(y.^2) + smooth)

Example

julia> y_pred = [1.1, 2.1, 3.1];
+true
source
Flux.Losses.dice_coeff_lossFunction
dice_coeff_loss(ŷ, y; smooth = 1)

Return a loss based on the dice coefficient. Used in the V-Net image segmentation architecture. The dice coefficient is similar to the F1_score. Loss calculated as:

1 - 2*sum(|ŷ .* y| + smooth) / (sum(ŷ.^2) + sum(y.^2) + smooth)

Example

julia> y_pred = [1.1, 2.1, 3.1];
 
 julia> Flux.dice_coeff_loss(y_pred, 1:3)
 0.000992391663909964
 
 julia> 1 - Flux.dice_coeff_loss(y_pred, 1:3)  # ~ F1 score for image segmentation
-0.99900760833609
source
Flux.Losses.tversky_lossFunction
tversky_loss(ŷ, y; beta = 0.7)

Return the Tversky loss. Used with imbalanced data to give more weight to false negatives. A larger β == beta weighs recall more than precision (by placing more emphasis on false negatives). Calculated as:

1 - sum(|y .* ŷ| + 1) / (sum(y .* ŷ + (1 - β)*(1 .- y) .* ŷ + β*y .* (1 .- ŷ)) + 1)
source
Flux.Losses.binary_focal_lossFunction
binary_focal_loss(ŷ, y; agg=mean, gamma=2, eps=eps(eltype(ŷ)))

Return the binary_focal_loss. The input, 'ŷ', is expected to be normalized (i.e. softmax output).

For gamma = 0, the loss is mathematically equivalent to Losses.binarycrossentropy.

See also: Losses.focal_loss for multi-class setting

Example

julia> y = [0  1  0
+0.99900760833609
source
Flux.Losses.tversky_lossFunction
tversky_loss(ŷ, y; beta = 0.7)

Return the Tversky loss. Used with imbalanced data to give more weight to false negatives. A larger β == beta weighs recall more than precision (by placing more emphasis on false negatives). Calculated as:

1 - sum(|y .* ŷ| + 1) / (sum(y .* ŷ + (1 - β)*(1 .- y) .* ŷ + β*y .* (1 .- ŷ)) + 1)
source
Flux.Losses.binary_focal_lossFunction
binary_focal_loss(ŷ, y; agg=mean, gamma=2, eps=eps(eltype(ŷ)))

Return the binary_focal_loss. The input, 'ŷ', is expected to be normalized (i.e. softmax output).

For gamma = 0, the loss is mathematically equivalent to Losses.binarycrossentropy.

See also: Losses.focal_loss for multi-class setting

Example

julia> y = [0  1  0
             1  0  1]
 2×3 Matrix{Int64}:
  0  1  0
@@ -201,7 +201,7 @@
  0.731059  0.5  0.731059
 
 julia> Flux.binary_focal_loss(ŷ, y) ≈ 0.0728675615927385
-true
source
Flux.Losses.focal_lossFunction
focal_loss(ŷ, y; dims=1, agg=mean, gamma=2, eps=eps(eltype(ŷ)))

Return the focal_loss which can be used in classification tasks with highly imbalanced classes. It down-weights well-classified examples and focuses on hard examples. The input, 'ŷ', is expected to be normalized (i.e. softmax output).

The modulating factor, γ == gamma, controls the down-weighting strength. For γ == 0, the loss is mathematically equivalent to Losses.crossentropy.

Example

julia> y = [1  0  0  0  1
+true
source
Flux.Losses.focal_lossFunction
focal_loss(ŷ, y; dims=1, agg=mean, gamma=2, eps=eps(eltype(ŷ)))

Return the focal_loss which can be used in classification tasks with highly imbalanced classes. It down-weights well-classified examples and focuses on hard examples. The input, 'ŷ', is expected to be normalized (i.e. softmax output).

The modulating factor, γ == gamma, controls the down-weighting strength. For γ == 0, the loss is mathematically equivalent to Losses.crossentropy.

Example

julia> y = [1  0  0  0  1
             0  1  0  1  0
             0  0  1  0  0]
 3×5 Matrix{Int64}:
@@ -216,10 +216,10 @@
  0.665241   0.665241   0.665241   0.665241   0.665241
 
 julia> Flux.focal_loss(ŷ, y) ≈ 1.1277571935622628
-true

See also: Losses.binary_focal_loss for binary (not one-hot) labels

source
Flux.Losses.siamese_contrastive_lossFunction
siamese_contrastive_loss(ŷ, y; margin = 1, agg = mean)

Return the contrastive loss which can be useful for training Siamese Networks. It is given by

agg(@. (1 - y) * ŷ^2 + y * max(0, margin - ŷ)^2)

Specify margin to set the baseline for distance at which pairs are dissimilar.

Example

julia> ŷ = [0.5, 1.5, 2.5];
+true

See also: Losses.binary_focal_loss for binary (not one-hot) labels

source
Flux.Losses.siamese_contrastive_lossFunction
siamese_contrastive_loss(ŷ, y; margin = 1, agg = mean)

Return the contrastive loss which can be useful for training Siamese Networks. It is given by

agg(@. (1 - y) * ŷ^2 + y * max(0, margin - ŷ)^2)

Specify margin to set the baseline for distance at which pairs are dissimilar.

Example

julia> ŷ = [0.5, 1.5, 2.5];
 
 julia> Flux.siamese_contrastive_loss(ŷ, 1:3)
 -4.833333333333333
 
 julia> Flux.siamese_contrastive_loss(ŷ, 1:3, margin = 2)
--4.0
source
+-4.0source diff --git a/previews/PR2464/reference/models/nnlib/index.html b/previews/PR2464/reference/models/nnlib/index.html index 2d30956f3f..03ce74c2a7 100644 --- a/previews/PR2464/reference/models/nnlib/index.html +++ b/previews/PR2464/reference/models/nnlib/index.html @@ -4,7 +4,7 @@ gtag('js', new Date()); gtag('config', 'UA-36890222-9', {'page_path': location.pathname + location.search + location.hash});

Neural Network primitives from NNlib.jl

Flux re-exports all of the functions exported by the NNlib package. This includes activation functions, described on their own page. Many of the functions on this page exist primarily as the internal implementation of Flux layers, but can also be used independently.

Attention

Primitives for the MultiHeadAttention layer.

NNlib.dot_product_attentionFunction
dot_product_attention(query, key, value, [bias]; [fdrop, mask, nheads])

Multihead dot product attention used in transformer architectures.

The input arrays must have the first two dimensions given by the number of features and the sequence length, then an arbitrary number of batch dimensions or none.

Returns the attention output array of size (v_dim, q_len, batch_size...) and the attention scores of size (kv_len, q_len, nheads, batch_size...).

See also dot_product_attention_scores if you only need the attention scores.

Arguments

  • query: Query array of size (qk_dim, q_len, batch_size...).
  • key: Key array of size (qk_dim, kv_len, batch_size...).
  • value: Value array of size (v_dim, kv_len, batch_size...).
  • bias: Either nothing or an array broadcastable to size (kv_len, q_len, nheads, batch_size). It will be added to the attention scores before applying the softmax. Default nothing.
  • fdrop: A dropout function or layer to be applied on the attention scores right after the softmax. Default identity (no dropout).
  • mask: Either nothing or a boolean array broadcastable to size (kv_len, q_len, nheads, batch_size). The mask is applied to the attention scores just before the softmax. See make_causal_mask for creating causal masks. Default nothing.
  • nheads: Number of heads to split the input arrays into. Default 1.

Examples

q, k, v = rand(10, 20, 2), rand(10, 30, 2), rand(20, 30, 2)
-y, α = dot_product_attention(q, k, v)
source
NNlib.make_causal_maskFunction
make_causal_mask(x, dims=2)

Return a boolean square matrix m of the same type as x and of side size(x, dims). Its elements are set such that m[i, j] == i ≤ j.

Can be used to mask the attention scores in dot_product_attention.

source

Softmax

Flux's Flux.logitcrossentropy uses NNlib.logsoftmax internally.

NNlib.softmaxFunction
softmax(x; dims = 1)

Softmax turns input array x into probability distributions that sum to 1 along the dimensions specified by dims. It is semantically equivalent to the following:

softmax(x; dims = 1) = exp.(x) ./ sum(exp.(x), dims = dims)

with additional manipulations enhancing numerical stability.

For a matrix input x it will by default (dims = 1) treat it as a batch of vectors, with each column independent. Keyword dims = 2 will instead treat rows independently, and so on.

See also logsoftmax.

Examples

julia> softmax([1, 2, 3])
+y, α = dot_product_attention(q, k, v)
source
NNlib.make_causal_maskFunction
make_causal_mask(x, dims=2)

Return a boolean square matrix m of the same type as x and of side size(x, dims). Its elements are set such that m[i, j] == i ≤ j.

Can be used to mask the attention scores in dot_product_attention.
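
The rule m[i, j] == i ≤ j can be written out directly with a comprehension. `causal_mask` below is a hypothetical helper that only mirrors that rule for a given side length; it is not the NNlib function.

```julia
# Hypothetical helper mirroring the rule m[i, j] == (i ≤ j).
causal_mask(n::Integer) = [i <= j for i in 1:n, j in 1:n]

m = causal_mask(4)
# With scores of size (kv_len, q_len), column j (a query position) is true
# only for rows i ≤ j, i.e. for key positions up to and including j.
display(m)
```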

source

Softmax

Flux's Flux.logitcrossentropy uses NNlib.logsoftmax internally.

NNlib.softmaxFunction
softmax(x; dims = 1)

Softmax turns input array x into probability distributions that sum to 1 along the dimensions specified by dims. It is semantically equivalent to the following:

softmax(x; dims = 1) = exp.(x) ./ sum(exp.(x), dims = dims)

with additional manipulations enhancing numerical stability.

For a matrix input x it will by default (dims = 1) treat it as a batch of vectors, with each column independent. Keyword dims = 2 will instead treat rows independently, and so on.
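
One common way to obtain the numerical stability mentioned above is to subtract the maximum along dims before exponentiating, which leaves the result unchanged mathematically. The sketch below assumes that technique; `stable_softmax` is a hypothetical name and the actual NNlib code may differ.

```julia
# Hypothetical reference implementation; assumes max-subtraction for stability.
function stable_softmax(x::AbstractArray; dims = 1)
    shifted = x .- maximum(x; dims = dims)   # largest entry becomes 0
    e = exp.(shifted)                        # so exp never overflows
    return e ./ sum(e; dims = dims)
end

p = stable_softmax([1, 2, 3])
# Shifting every input by the same constant gives the same probabilities,
# and large inputs no longer produce Inf or NaN.
p_large = stable_softmax([1000.0, 1001.0, 1002.0])
```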

See also logsoftmax.

Examples

julia> softmax([1, 2, 3])
 3-element Vector{Float64}:
  0.09003057317038046
  0.24472847105479764
@@ -28,8 +28,8 @@
 (7, 13)
 
 julia> Dense(4 => 7, softmax)(x)
-ERROR: `softmax(x)` called with a number, but it expects an array. 
source
NNlib.logsoftmaxFunction
logsoftmax(x; dims = 1)

Computes the log of softmax in a more numerically stable way than directly taking log.(softmax(x)). Commonly used in computing cross entropy loss.

It is semantically equivalent to the following:

logsoftmax(x; dims = 1) = x .- log.(sum(exp.(x), dims = dims))

See also softmax.

source

Pooling

Flux's AdaptiveMaxPool, AdaptiveMeanPool, GlobalMaxPool, GlobalMeanPool, MaxPool, and MeanPool use NNlib.PoolDims, NNlib.maxpool, and NNlib.meanpool as their backend.

NNlib.PoolDimsType
PoolDims(x_size::NTuple{M}, k::Union{NTuple{L, Int}, Int};
-        stride=k, padding=0, dilation=1)  where {M, L}

Dimensions for a "pooling" operation that can have an arbitrary input size, kernel size, stride, dilation, and channel count. Used to dispatch onto efficient implementations at compile-time.

source
NNlib.lpnormpoolFunction
lpnormpool(x, p::Real, k::NTuple{N, Integer}; pad=0, stride=k)

Perform Lp pool operation with value of the Lp norm p and window size k on input tensor x, also known as LPPool in PyTorch. This pooling operator is from Learned-Norm Pooling for Deep Feedforward and Recurrent Neural Networks.

Arguments:

  • x and k: Expects ndim(x) ∈ 3:5, and always length(k) == ndim(x) - 2
  • p is restricted to 0 < p < Inf.
  • pad: See pad_zeros for details.
  • stride: Either a tuple with the same length as k, or one integer for all directions. Default is k.

For all elements x in a size k window, lpnormpool computes (∑ᵢ xᵢ^p)^(1 / p) as an element of the output.

Thus lpnormpool(x, 1, k) ./ prod(k) ≈ meanpool(x, k) and lpnormpool(x, 2, k).^2 ./ prod(k) ≈ meanpool(x.^2, k).

source
NNlib.maxpoolFunction
maxpool(x, k::NTuple{N, Integer}; pad=0, stride=k)

Perform max pool operation with window size k on input tensor x.

Arguments:

  • x and k: Expects ndim(x) ∈ 3:5, and always length(k) == ndim(x) - 2
  • pad: See pad_zeros for details.
  • stride: Either a tuple with the same length as k, or one integer for all directions. Default is k.
source
NNlib.meanpoolFunction
meanpool(x, k::NTuple{N, Integer}; pad=0, stride=k)

Perform mean pool operation with window size k on input tensor x.

Arguments:

  • x and k: Expects ndim(x) ∈ 3:5, and always length(k) == ndim(x) - 2
  • pad: See pad_zeros for details.
  • stride: Either a tuple with the same length as k, or one integer for all directions. Default is k.
source

Padding

NNlib.pad_circularFunction
pad_circular(x, pad::Tuple; [dims])
+ERROR: `softmax(x)` called with a number, but it expects an array. 
source
NNlib.logsoftmaxFunction
logsoftmax(x; dims = 1)

Computes the log of softmax in a more numerically stable way than directly taking log.(softmax(x)). Commonly used in computing cross entropy loss.

It is semantically equivalent to the following:

logsoftmax(x; dims = 1) = x .- log.(sum(exp.(x), dims = dims))
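
To see why the direct formula is fragile, the sketch below compares it with a shifted version on large inputs. `stable_logsoftmax` is a hypothetical name, and subtracting the maximum is an assumed (though standard) stabilisation, not necessarily what NNlib does internally.

```julia
# Hypothetical reference implementation; assumes max-subtraction for stability.
function stable_logsoftmax(x::AbstractArray; dims = 1)
    shifted = x .- maximum(x; dims = dims)
    return shifted .- log.(sum(exp.(shifted); dims = dims))
end

x = [1000.0, 1001.0, 1002.0]
naive  = x .- log.(sum(exp.(x)))   # exp(1000.0) overflows to Inf, so every entry is -Inf
stable = stable_logsoftmax(x)      # finite, and exp.(stable) sums to 1
```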

See also softmax.

source

Pooling

Flux's AdaptiveMaxPool, AdaptiveMeanPool, GlobalMaxPool, GlobalMeanPool, MaxPool, and MeanPool use NNlib.PoolDims, NNlib.maxpool, and NNlib.meanpool as their backend.

NNlib.PoolDimsType
PoolDims(x_size::NTuple{M}, k::Union{NTuple{L, Int}, Int};
+        stride=k, padding=0, dilation=1)  where {M, L}

Dimensions for a "pooling" operation that can have an arbitrary input size, kernel size, stride, dilation, and channel count. Used to dispatch onto efficient implementations at compile-time.

source
NNlib.lpnormpoolFunction
lpnormpool(x, p::Real, k::NTuple{N, Integer}; pad=0, stride=k)

Perform Lp pool operation with value of the Lp norm p and window size k on input tensor x, also known as LPPool in PyTorch. This pooling operator is from Learned-Norm Pooling for Deep Feedforward and Recurrent Neural Networks.

Arguments:

  • x and k: Expects ndim(x) ∈ 3:5, and always length(k) == ndim(x) - 2
  • p is restricted to 0 < p < Inf.
  • pad: See pad_zeros for details.
  • stride: Either a tuple with the same length as k, or one integer for all directions. Default is k.

For all elements x in a size k window, lpnormpool computes (∑ᵢ xᵢ^p)^(1 / p) as an element of the output.

Thus lpnormpool(x, 1, k) ./ prod(k) ≈ meanpool(x, k) and lpnormpool(x, 2, k).^2 ./ prod(k) ≈ meanpool(x.^2, k).
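
The two identities above can be checked on a toy case. The sketch below pools a plain vector over non-overlapping windows (stride equal to the window length, no padding); `windows`, `lp_pool`, and `mean_pool` are hypothetical helpers working on vectors, not the NNlib functions, which expect arrays with channel and batch dimensions.

```julia
# Hypothetical 1-d helpers: non-overlapping windows of length k over a vector.
windows(x::AbstractVector, k::Integer) = [x[i:i+k-1] for i in 1:k:length(x)-k+1]

lp_pool(x, p, k) = [sum(w .^ p)^(1 / p) for w in windows(x, k)]
mean_pool(x, k)  = [sum(w) / k for w in windows(x, k)]

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # non-negative inputs
k = 2

lhs1 = lp_pool(x, 1, k) ./ k          # p = 1: sum over the window, divided by its size
lhs2 = lp_pool(x, 2, k) .^ 2 ./ k     # p = 2: sum of squares, divided by its size
```

Here `lhs1` agrees with `mean_pool(x, k)` and `lhs2` agrees with `mean_pool(x .^ 2, k)`, up to floating-point rounding.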

source
NNlib.maxpoolFunction
maxpool(x, k::NTuple{N, Integer}; pad=0, stride=k)

Perform max pool operation with window size k on input tensor x.

Arguments:

  • x and k: Expects ndim(x) ∈ 3:5, and always length(k) == ndim(x) - 2
  • pad: See pad_zeros for details.
  • stride: Either a tuple with the same length as k, or one integer for all directions. Default is k.
source
NNlib.meanpoolFunction
meanpool(x, k::NTuple{N, Integer}; pad=0, stride=k)

Perform mean pool operation with window size k on input tensor x.

Arguments:

  • x and k: Expects ndim(x) ∈ 3:5, and always length(k) == ndim(x) - 2
  • pad: See pad_zeros for details.
  • stride: Either a tuple with the same length as k, or one integer for all directions. Default is k.
source

Padding

NNlib.pad_circularFunction
pad_circular(x, pad::Tuple; [dims])
 pad_circular(x, pad::Int; [dims])

Pad the array x "circularly" across the border by wrapping around values from the opposite side of x.

pad can be a tuple of integers (l1, r1, ..., ln, rn) of some length 2n that specifies the left and right padding size for each of the dimensions in dims. If dims is not given, it defaults to the first n dimensions.

If pad is an integer, it is applied on both sides on every dimension in dims. In this case, dims defaults to the first ndims(x)-2 dimensions (i.e. excludes the channel and batch dimension).

The pad length on either side in any dimension must not exceed the size of x in that dimension, i.e. pad_circular is not able to create arbitrarily sized tilings of x.

See also pad_repeat, pad_reflect, pad_symmetric, and pad_constant.

julia> r = reshape(1:9, 3, 3)
 3×3 reshape(::UnitRange{Int64}, 3, 3) with eltype Int64:
  1  4  7
@@ -43,7 +43,7 @@
  8  2  5  8  2  5
  9  3  6  9  3  6
  7  1  4  7  1  4
- 8  2  5  8  2  5
source
NNlib.pad_constantFunction
pad_constant(x, pad::Tuple, val = 0; [dims = :])
 pad_constant(x, pad::Int, val = 0; [dims = :])

Pad the array x with the constant value val.

pad can be a tuple of integers. If it has length 2 * length(dims), it specifies the left and right padding size for each of the dimensions in dims as (l1, r1, ..., ln, rn). If supplied with a tuple of length length(dims) instead, it applies symmetric padding. If dims is not given, it defaults to all dimensions.

For integer pad input, it is applied on both sides on every dimension in dims.

See also pad_zeros, pad_repeat, pad_reflect, pad_symmetric, and pad_circular.

julia> r = reshape(1:4, 2, 2)
 2×2 reshape(::UnitRange{Int64}, 2, 2) with eltype Int64:
  1  3
@@ -110,7 +110,7 @@
 julia> pad_constant(r, (2,1, 3), dims = (1,2)) # padding must always be either the same length as dims, or double it
 ERROR: ArgumentError: Could not parse padding (2, 1, 3) and dims (1, 2)
 Stacktrace:
-[...]
source
NNlib.pad_reflectFunction
pad_reflect(x, pad::Tuple; [dims])
 pad_reflect(x, pad::Int; [dims])

Pad the array x reflecting its values across the border.

pad can be a tuple of integers (l1, r1, ..., ln, rn) of some length 2n that specifies the left and right padding size for each of the dimensions in dims. If dims is not given, it defaults to the first n dimensions.

If pad is an integer, it is applied on both sides on every dimension in dims. In this case, dims defaults to the first ndims(x)-2 dimensions (i.e. excludes the channel and batch dimension).

See also pad_repeat, pad_symmetric, pad_circular, and pad_constant.

julia> r = reshape(1:9, 3, 3)
 3×3 reshape(::UnitRange{Int64}, 3, 3) with eltype Int64:
  1  4  7
@@ -124,7 +124,7 @@
  5  2  5  8  5  2
  6  3  6  9  6  3
  5  2  5  8  5  2
- 4  1  4  7  4  1
source
NNlib.pad_repeatFunction
pad_repeat(x, pad::Tuple; [dims])
 pad_repeat(x, pad::Int; [dims])

Pad the array x repeating the values on the border.

pad can be a tuple of integers (l1, r1, ..., ln, rn) of some length 2n that specifies the left and right padding size for each of the dimensions in dims. If dims is not given, it defaults to the first n dimensions.

If pad is an integer, it is applied on both sides on every dimension in dims. In this case, dims defaults to the first ndims(x)-2 dimensions (i.e. excludes the channel and batch dimension).

See also pad_reflect, pad_symmetric, pad_circular, and pad_constant.

julia> r = reshape(1:9, 3, 3)
 3×3 reshape(::UnitRange{Int64}, 3, 3) with eltype Int64:
  1  4  7
@@ -138,7 +138,7 @@
  2  2  2  2  5  8  8  8  8  8
  3  3  3  3  6  9  9  9  9  9
  3  3  3  3  6  9  9  9  9  9
- 3  3  3  3  6  9  9  9  9  9
source
NNlib.pad_symmetricFunction
pad_symmetric(x, pad::Tuple; [dims])
 pad_symmetric(x, pad::Int; [dims])

Pad the array x reflecting its values symmetrically across the border, i.e. the border values of x are present in the padding values, in contrast to pad_reflect.

pad can be a tuple of integers (l1, r1, ..., ln, rn) of some length 2n that specifies the left and right padding size for each of the dimensions in dims. If dims is not given, it defaults to the first n dimensions.

If pad is an integer, it is applied on both sides on every dimension in dims. In this case, dims defaults to the first ndims(x)-2 dimensions (i.e. excludes the channel and batch dimension).

See also pad_repeat, pad_reflect, pad_circular, and pad_constant.

julia> r = reshape(1:9, 3, 3)
 3×3 reshape(::UnitRange{Int64}, 3, 3) with eltype Int64:
  1  4  7
@@ -152,8 +152,8 @@
  2  2  5  8  8  5
  3  3  6  9  9  6
  3  3  6  9  9  6
- 2  2  5  8  8  5
source
NNlib.pad_zerosFunction
pad_zeros(x, pad::Tuple; [dims])
-pad_zeros(x, pad::Int; [dims])

Pad the array x with zeros. Equivalent to pad_constant with the constant equal to 0.

source

Convolution

Flux's Conv and CrossCor layers use NNlib.DenseConvDims and NNlib.conv internally.

NNlib.convFunction
conv(x, w; stride = 1, pad = 0, dilation = 1, flipped = false, groups = 1)

Apply convolution filter w to input x. x and w are 3d/4d/5d tensors in 1d/2d/3d convolutions respectively. x and w may have real or complex element types.

source
NNlib.ConvDimsType
ConvDims

Type system-level information about convolution dimensions. Critical for things like im2col!() to generate efficient code, and helpful to reduce the number of kwargs getting passed around.

source
NNlib.depthwiseconvFunction
depthwiseconv(x, w; stride=1, pad=0, dilation=1, flipped=false)

Depthwise convolution operation with filter w on input x. x and w are 3d/4d/5d tensors in 1d/2d/3d convolutions respectively.

source
NNlib.DepthwiseConvDimsType
DepthwiseConvDims

Concrete subclass of ConvDims for a depthwise convolution. Differs primarily due to characterization by Cin, Cmult, rather than Cin, Cout. Useful to be separate from DenseConvDims primarily for channel calculation differences.

source

Dropout

NNlib.dropoutFunction
dropout([rng], A, p; [dims])

Returns an array in which each element of A is either replaced with zero, with probability p, or else multiplied by 1/(1-p).

By default every element is treated independently. With keyword dims=1, a choice is made for every value of the 1st index i.e. each row of a matrix is either zero or not.

Optional first argument is the random number generator used.

Examples

julia> dropout(ones(2, 10), 0.2)
+ 2  2  5  8  8  5
source
NNlib.pad_zerosFunction
pad_zeros(x, pad::Tuple; [dims])
+pad_zeros(x, pad::Int; [dims])

Pad the array x with zeros. Equivalent to pad_constant with the constant equal to 0.

source
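
The difference between the padding modes in this section is easiest to see on a single vector. The sketch below expresses each mode as an index rule; all names in it are hypothetical helpers written for illustration, and it only covers the 1-d case with the same pad on both sides.

```julia
# Hypothetical 1-d sketch: each padding mode maps an out-of-range index back into 1:n.
repeat_idx(i, n)    = clamp(i, 1, n)                                  # pad_repeat
circular_idx(i, n)  = mod1(i, n)                                      # pad_circular
reflect_idx(i, n)   = i < 1 ? 2 - i : (i > n ? 2n - i : i)            # pad_reflect (border not repeated)
symmetric_idx(i, n) = i < 1 ? 1 - i : (i > n ? 2n + 1 - i : i)        # pad_symmetric (border repeated)

# Pad the vector x by p entries on both sides using the index rule f.
pad1d(f, x::AbstractVector, p::Integer) = x[[f(i, length(x)) for i in (1 - p):(length(x) + p)]]

x = [1, 2, 3, 4]
pad1d(repeat_idx, x, 2)     # [1, 1, 1, 2, 3, 4, 4, 4]
pad1d(circular_idx, x, 2)   # [3, 4, 1, 2, 3, 4, 1, 2]
pad1d(reflect_idx, x, 2)    # [3, 2, 1, 2, 3, 4, 3, 2]
pad1d(symmetric_idx, x, 2)  # [2, 1, 1, 2, 3, 4, 4, 3]
```

The index rules are only valid while the pad length stays within the size of x (and within size minus one for the reflect rule).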

Convolution

Flux's Conv and CrossCor layers use NNlib.DenseConvDims and NNlib.conv internally.

NNlib.convFunction
conv(x, w; stride = 1, pad = 0, dilation = 1, flipped = false, groups = 1)

Apply convolution filter w to input x. x and w are 3d/4d/5d tensors in 1d/2d/3d convolutions respectively. x and w may have real or complex element types.

source
NNlib.ConvDimsType
ConvDims

Type system-level information about convolution dimensions. Critical for things like im2col!() to generate efficient code, and helpful to reduce the number of kwargs getting passed around.

source
NNlib.depthwiseconvFunction
depthwiseconv(x, w; stride=1, pad=0, dilation=1, flipped=false)

Depthwise convolution operation with filter w on input x. x and w are 3d/4d/5d tensors in 1d/2d/3d convolutions respectively.

source
NNlib.DepthwiseConvDimsType
DepthwiseConvDims

Concrete subclass of ConvDims for a depthwise convolution. Differs primarily due to characterization by Cin, Cmult, rather than Cin, Cout. Useful to be separate from DenseConvDims primarily for channel calculation differences.

source

Dropout

NNlib.dropoutFunction
dropout([rng], A, p; [dims])

Returns an array in which each element of A is either replaced with zero, with probability p, or else multiplied by 1/(1-p).

By default every element is treated independently. With keyword dims=1, a choice is made for every value of the 1st index i.e. each row of a matrix is either zero or not.

Optional first argument is the random number generator used.
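
The element-wise behaviour described above (zero with probability p, otherwise scaled by 1/(1-p)) can be sketched in a few lines. `naive_dropout` is a hypothetical helper that ignores the dims keyword; it is not how NNlib implements dropout.

```julia
using Random

# Hypothetical element-wise dropout: every element is treated independently.
function naive_dropout(rng::AbstractRNG, A::AbstractArray, p::Real)
    keep = rand(rng, size(A)...) .>= p   # true with probability 1 - p
    return A .* keep ./ (1 - p)          # survivors are rescaled by 1/(1-p)
end

rng = MersenneTwister(0)
B = naive_dropout(rng, ones(2, 10), 0.2)   # entries are either 0.0 or 1.25

# The rescaling keeps the expected value unchanged, so the mean stays close to 1.
big = naive_dropout(MersenneTwister(1), ones(10^5), 0.3)
avg = sum(big) / length(big)
```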

Examples

julia> dropout(ones(2, 10), 0.2)
 2×10 Matrix{Float64}:
  1.25  1.25  0.0   1.25  1.25  1.25  1.25  1.25  1.25  1.25
  1.25  1.25  1.25  0.0   1.25  1.25  0.0   1.25  1.25  1.25
@@ -172,7 +172,7 @@
 
 julia> mean(dropout(ones(10^4, 5), 0.3, dims=1), dims=1)
 1×5 Matrix{Float64}:
- 1.00571  1.00571  1.00571  1.00571  1.00571
source
NNlib.dropout!Function
dropout!(B, A, p; [dims])

This does exactly B .= dropout(A, p; dims), or rather, it's the implementation of out-of-place dropout.

source

Upsampling

Flux's Upsample layer uses NNlib.upsample_nearest, NNlib.upsample_bilinear, and NNlib.upsample_trilinear as its backend. Additionally, Flux's PixelShuffle layer uses NNlib.pixel_shuffle as its backend.

NNlib.dropout!Function
dropout!(B, A, p; [dims])

This does exactly B .= dropout(A, p; dims), or rather, it's the implementation of out-of-place dropout.

source

Upsampling

Flux's Upsample layer uses NNlib.upsample_nearest, NNlib.upsample_bilinear, and NNlib.upsample_trilinear as its backend. Additionally, Flux's PixelShuffle layer uses NNlib.pixel_shuffle as its backend.

NNlib.upsample_nearestFunction
upsample_nearest(x, scale::NTuple{S,Int})
 upsample_nearest(x; size::NTuple{S,Int})

Upsamples the array x by integer multiples along the first S dimensions. Subsequent dimensions of x are not altered.

Either the scale factors or the final output size can be specified.
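
For integer scale factors, nearest-neighbour upsampling amounts to repeating every element along each upsampled dimension. A minimal sketch with Base's `repeat` and its `inner` keyword shows the idea for a matrix; it is an illustration of the behaviour, not the NNlib code path.

```julia
x = [1 2 3; 4 5 6]

# Repeat every element 2 times along dimension 1 and 3 times along dimension 2.
y = repeat(x, inner = (2, 3))

@show size(y)   # (4, 9)
@show y[1, :]   # [1, 1, 1, 2, 2, 2, 3, 3, 3]
```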

See also upsample_bilinear, for two dimensions of an N=4 array.

Example

julia> upsample_nearest([1 2 3; 4 5 6], (2, 3))
 4×9 Matrix{Int64}:
  1  1  1  2  2  2  3  3  3
@@ -191,8 +191,8 @@
  4  5  6
 
 julia> ans == upsample_nearest([1 2 3; 4 5 6], size=(4,))
-true
source
NNlib.upsample_linearFunction
upsample_linear(x::AbstractArray{T,3}, scale::Real; align_corners::Bool = true)
-upsample_linear(x::AbstractArray{T,3}; size::Integer, align_corners::Bool = true)

Upsamples the first dimension of the array x by the provided scale factor, using linear interpolation. As an alternative to using scale, the resulting array size can be directly specified with a keyword argument.

The size of the output is equal to (scale*S1, S2, S3), where S1, S2, S3 = size(x).

source
NNlib.∇upsample_linearFunction
∇upsample_linear(Δ::AbstractArray{T,3}; size::Integer, align_corners::Bool = true) where T

Arguments

  • Δ: Incoming gradient array, backpropagated from downstream layers
  • size: Size of the image upsampled in the first place

Outputs

  • dx: Downsampled version of Δ
source
NNlib.upsample_linearFunction
upsample_linear(x::AbstractArray{T,3}, scale::Real; align_corners::Bool = true)
+upsample_linear(x::AbstractArray{T,3}; size::Integer, align_corners::Bool = true)

Upsamples the first dimension of the array x by the provided scale factor, using linear interpolation. As an alternative to using scale, the resulting array size can be directly specified with a keyword argument.

The size of the output is equal to (scale*S1, S2, S3), where S1, S2, S3 = size(x).

source
NNlib.∇upsample_linearFunction
∇upsample_linear(Δ::AbstractArray{T,3}; size::Integer, align_corners::Bool = true) where T

Arguments

  • Δ: Incoming gradient array, backpropagated from downstream layers
  • size: Size of the image upsampled in the first place

Outputs

  • dx: Downsampled version of Δ
source
NNlib.upsample_bilinearFunction
upsample_bilinear(x::AbstractArray{T,4}, scale::NTuple{2,Real}; align_corners::Bool = true)
 upsample_bilinear(x::AbstractArray{T,4}; size::NTuple{2,Integer}, align_corners::Bool = true)

Upsamples the first 2 dimensions of the array x by the upsample factors stored in scale, using bilinear interpolation. As an alternative to using scale, the resulting image size can be directly specified with a keyword argument.

The size of the output is equal to (scale[1]*S1, scale[2]*S2, S3, S4), where S1, S2, S3, S4 = size(x).

Examples

julia> x = reshape(Float32[1 2 3; 4 5 6], (2,3,1,1))
 2×3×1×1 Array{Float32, 4}:
 [:, :, 1, 1] =
@@ -217,10 +217,10 @@
  1.75  1.97222  2.19444  2.41667  2.63889     3.08333  3.30556  3.52778  3.75
  2.5   2.72222  2.94444  3.16667  3.38889     3.83333  4.05556  4.27778  4.5
  3.25  3.47222  3.69444  3.91667  4.13889     4.58333  4.80556  5.02778  5.25
- 4.0   4.22222  4.44444  4.66667  4.88889     5.33333  5.55556  5.77778  6.0
source
NNlib.∇upsample_bilinearFunction
∇upsample_bilinear(Δ::AbstractArray{T,4}; size::NTuple{2,Integer}, align_corners::Bool = true) where T

Arguments

  • Δ: Incoming gradient array, backpropagated from downstream layers
  • size: Lateral (W,H) size of the image upsampled in the first place

Outputs

  • dx: Downsampled version of Δ
source
NNlib.upsample_trilinearFunction
upsample_trilinear(x::AbstractArray{T,5}, scale::NTuple{3,Real}; align_corners::Bool = true)
+ 4.0   4.22222  4.44444  4.66667  4.88889     5.33333  5.55556  5.77778  6.0
source
NNlib.∇upsample_bilinearFunction
∇upsample_bilinear(Δ::AbstractArray{T,4}; size::NTuple{2,Integer}, align_corners::Bool = true) where T

Arguments

  • Δ: Incoming gradient array, backpropagated from downstream layers
  • size: Lateral (W,H) size of the image upsampled in the first place

Outputs

  • dx: Downsampled version of Δ
source
NNlib.upsample_trilinearFunction
upsample_trilinear(x::AbstractArray{T,5}, scale::NTuple{3,Real}; align_corners::Bool = true)
 upsample_trilinear(x::AbstractArray{T,5}; size::NTuple{3,Integer}, align_corners::Bool = true)

Upsamples the first 3 dimensions of the array x by the upsample factors stored in scale, using trilinear interpolation. As an alternative to using scale, the resulting image size can be directly specified with a keyword argument.

The size of the output is equal to (scale[1]*S1, scale[2]*S2, scale[3]*S3, S4, S5), where S1, S2, S3, S4, S5 = size(x).

Examples

upsample_trilinear(x, (2, 3, 4))
 upsample_trilinear(x; size=(4, 9, 11))  # specify output size instead
-upsample_trilinear(x, (2.5, 3.5, pi))  # non-integer scaling factors are allowed
source
NNlib.∇upsample_trilinearFunction
∇upsample_trilinear(Δ::AbstractArray{T,5}; size::NTuple{3,Integer}, align_corners::Bool = true) where T

Arguments

  • Δ: Incoming gradient array, backpropagated from downstream layers
  • size: Lateral size & depth (W,H,D) of the image upsampled in the first place

Outputs

  • dx: Downsampled version of Δ
source
NNlib.pixel_shuffleFunction
pixel_shuffle(x, r::Integer)

Pixel shuffling operation, upscaling by a factor r.

For 4-arrays representing N images, the operation converts input size(x) == (W, H, r^2*C, N) to output of size (r*W, r*H, C, N). For D-dimensional data, it expects ndims(x) == D+2 with channel and batch dimensions, and divides the number of channels by r^D.

Used in super-resolution networks to upsample towards high resolution features. Reference: Shi et al., "Real-Time Single Image and Video Super-Resolution ...", CVPR 2016, https://arxiv.org/abs/1609.05158

Examples

julia> x = [10i + j + channel/10 for i in 1:2, j in 1:3, channel in 1:4, batch in 1:1]
+upsample_trilinear(x, (2.5, 3.5, pi))  # non-integer scaling factors are allowed
source
NNlib.∇upsample_trilinearFunction
∇upsample_trilinear(Δ::AbstractArray{T,5}; size::NTuple{3,Integer}, align_corners::Bool = true) where T

Arguments

  • Δ: Incoming gradient array, backpropagated from downstream layers
  • size: Lateral size & depth (W,H,D) of the image upsampled in the first place

Outputs

  • dx: Downsampled version of Δ
source
NNlib.pixel_shuffleFunction
pixel_shuffle(x, r::Integer)

Pixel shuffling operation, upscaling by a factor r.

For 4-arrays representing N images, the operation converts input size(x) == (W, H, r^2*C, N) to output of size (r*W, r*H, C, N). For D-dimensional data, it expects ndims(x) == D+2 with channel and batch dimensions, and divides the number of channels by r^D.

Used in super-resolution networks to upsample towards high resolution features. Reference: Shi et al., "Real-Time Single Image and Video Super-Resolution ...", CVPR 2016, https://arxiv.org/abs/1609.05158

Examples

julia> x = [10i + j + channel/10 for i in 1:2, j in 1:3, channel in 1:4, batch in 1:1]
 2×3×4×1 Array{Float64, 4}:
 [:, :, 1, 1] =
  11.1  12.1  13.1
@@ -261,7 +261,7 @@
  2.1  2.3  2.5
  2.2  2.4  2.6
  3.1  3.3  3.5
- 3.2  3.4  3.6
source

Batched Operations

Flux's Flux.Bilinear layer uses NNlib.batched_mul internally.

Batched Operations

Flux's Flux.Bilinear layer uses NNlib.batched_mul internally.

NNlib.batched_mulFunction
batched_mul(A, B) -> C
 A ⊠ B  # \boxtimes

Batched matrix multiplication. Result has C[:,:,k...] == A[:,:,k...] * B[:,:,k...] where k... represent any indices in the last dimensions.

If ndims(A) == ndims(B) == 3 and size(B,3) == 1 then instead C[:,:,k] == A[:,:,k] * B[:,:,1], and similarly for A.
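
The slice-wise definition above, including the singleton-batch rule, corresponds to the loop below. `naive_batched_mul` is a hypothetical reference for 3-dimensional arrays only; it makes no attempt to use BLAS the way the real function does.

```julia
# Hypothetical loop-based reference for 3-d arrays.
function naive_batched_mul(A::AbstractArray{<:Any,3}, B::AbstractArray{<:Any,3})
    nbatch = max(size(A, 3), size(B, 3))
    T = promote_type(eltype(A), eltype(B))
    C = similar(A, T, (size(A, 1), size(B, 2), nbatch))
    for k in 1:nbatch
        ka = size(A, 3) == 1 ? 1 : k   # a singleton batch is reused for every k
        kb = size(B, 3) == 1 ? 1 : k
        C[:, :, k] = A[:, :, ka] * B[:, :, kb]
    end
    return C
end

A, B = randn(2, 5, 17), randn(5, 9, 17)
C = naive_batched_mul(A, B)        # size (2, 9, 17)

B1 = randn(5, 9, 1)                # size(B1, 3) == 1: shared by all batches
C1 = naive_batched_mul(A, B1)
```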

To transpose each matrix, apply batched_transpose to the array, or batched_adjoint for conjugate-transpose:

julia> A, B = randn(2,5,17), randn(5,9,17);
 
 julia> A ⊠ B |> size
@@ -277,7 +277,7 @@
 (2, 9, 17)
 
 julia> batched_transpose(A) == PermutedDimsArray(A, (2,1,3))
-true

The equivalent PermutedDimsArray may be used in place of batched_transpose. Other permutations are also handled by BLAS, provided that the batch index k is not the first dimension of the underlying array. Thus PermutedDimsArray(::Array, (1,3,2)) and PermutedDimsArray(::Array, (3,1,2)) are fine.

However, A = PermutedDimsArray(::Array, (3,2,1)) is not acceptable to BLAS, since the batch dimension is the contiguous one: stride(A,3) == 1. This will be copied, as doing so is faster than batched_mul_generic!.

Both this copy and batched_mul_generic! produce @debug messages, and setting for instance ENV["JULIA_DEBUG"] = NNlib will display them.

source
batched_mul(A::Array{T,3}, B::Matrix)
+true

The equivalent PermutedDimsArray may be used in place of batched_transpose. Other permutations are also handled by BLAS, provided that the batch index k is not the first dimension of the underlying array. Thus PermutedDimsArray(::Array, (1,3,2)) and PermutedDimsArray(::Array, (3,1,2)) are fine.

However, A = PermutedDimsArray(::Array, (3,2,1)) is not acceptable to BLAS, since the batch dimension is the contiguous one: stride(A,3) == 1. This will be copied, as doing so is faster than batched_mul_generic!.

Both this copy and batched_mul_generic! produce @debug messages, and setting for instance ENV["JULIA_DEBUG"] = NNlib will display them.

source
batched_mul(A::Array{T,3}, B::Matrix)
 batched_mul(A::Matrix, B::Array{T,3})
 A ⊠ B

This is always matrix-matrix multiplication, but either A or B may lack a batch index.

  • When B is a matrix, result has C[:,:,k] == A[:,:,k] * B[:,:] for all k.

  • When A is a matrix, then C[:,:,k] == A[:,:] * B[:,:,k]. This can also be done by reshaping and calling *, for instance A ⊡ B using TensorCore.jl, but is implemented here using batched_gemm instead of gemm.

julia> randn(16,8,32) ⊠ randn(8,4) |> size
 (16, 4, 32)
@@ -286,19 +286,19 @@
 (16, 4, 32)
 
 julia> randn(16,8) ⊠ randn(8,4,32) |> size
-(16, 4, 32)

See also batched_vec to regard B as a batch of vectors, A[:,:,k] * B[:,k].

source
NNlib.batched_mul!Function
batched_mul!(C, A, B) -> C
-batched_mul!(C, A, B, α=1, β=0)

In-place batched matrix multiplication, equivalent to mul!(C[:,:,k], A[:,:,k], B[:,:,k], α, β) for all k. If size(B,3) == 1 then every batch uses B[:,:,1] instead.

This will call batched_gemm! whenever possible. For real arrays this means that, for X ∈ [A,B,C], either stride(X,1)==1 or stride(X,2)==1; the latter may be caused by batched_transpose or, for instance, by PermutedDimsArray(::Array, (3,1,2)). Unlike batched_mul, this will never make a copy.

For complex arrays, the wrapper made by batched_adjoint must be outermost to be seen. In this case the strides accepted by BLAS are more restricted: if stride(C,1)==1 then only stride(AorB::BatchedAdjoint,2) == 1 is accepted.

source
NNlib.batched_adjointFunction
batched_transpose(A::AbstractArray{T,3})
+(16, 4, 32)

See also batched_vec to regard B as a batch of vectors, A[:,:,k] * B[:,k].

source
NNlib.batched_mul!Function
batched_mul!(C, A, B) -> C
+batched_mul!(C, A, B, α=1, β=0)

In-place batched matrix multiplication, equivalent to mul!(C[:,:,k], A[:,:,k], B[:,:,k], α, β) for all k. If size(B,3) == 1 then every batch uses B[:,:,1] instead.

This will call batched_gemm! whenever possible. For real arrays this means that, for X ∈ [A,B,C], either stride(X,1)==1 or stride(X,2)==1; the latter may be caused by batched_transpose or, for instance, by PermutedDimsArray(::Array, (3,1,2)). Unlike batched_mul, this will never make a copy.

For complex arrays, the wrapper made by batched_adjoint must be outermost to be seen. In this case the strides accepted by BLAS are more restricted: if stride(C,1)==1 then only stride(AorB::BatchedAdjoint,2) == 1 is accepted.

source
NNlib.batched_adjointFunction
batched_transpose(A::AbstractArray{T,3})
 batched_adjoint(A)

Equivalent to applying transpose or adjoint to each matrix A[:,:,k].

These exist to control how batched_mul behaves, as it operates on such matrix slices of an array with ndims(A)==3.

PermutedDimsArray(A, (2,1,3)) is equivalent to batched_transpose(A), and is also understood by batched_mul (and more widely supported elsewhere).

BatchedTranspose{T, S} <: AbstractBatchedMatrix{T, 3}
-BatchedAdjoint{T, S}

Lazy wrappers analogous to Transpose and Adjoint, returned by batched_transpose etc.

source
NNlib.batched_transposeFunction
batched_transpose(A::AbstractArray{T,3})
+BatchedAdjoint{T, S}

Lazy wrappers analogous to Transpose and Adjoint, returned by batched_transpose etc.

source
NNlib.batched_transposeFunction
batched_transpose(A::AbstractArray{T,3})
 batched_adjoint(A)

Equivalent to applying transpose or adjoint to each matrix A[:,:,k].

These exist to control how batched_mul behaves, as it operates on such matrix slices of an array with ndims(A)==3.

PermutedDimsArray(A, (2,1,3)) is equivalent to batched_transpose(A), and is also understood by batched_mul (and more widely supported elsewhere).

BatchedTranspose{T, S} <: AbstractBatchedMatrix{T, 3}
-BatchedAdjoint{T, S}

Lazy wrappers analogous to Transpose and Adjoint, returned by batched_transpose etc.

source
NNlib.batched_vecFunction
batched_vec(A::Array{T,3}, B::Matrix)
+BatchedAdjoint{T, S}

Lazy wrappers analogous to Transpose and Adjoint, returned by batched_transpose etc.

source
NNlib.batched_vecFunction
batched_vec(A::Array{T,3}, B::Matrix)
 batched_vec(A::Array{T,3}, b::Vector)

Batched matrix-vector multiplication: the result is a matrix with C[:,k] == A[:,:,k] * B[:,k] for all k, or else C[:,k] == A[:,:,k] * b for b::Vector.

With the same argument types, batched_mul(A, B) would regard B as a fixed matrix, not a batch of vectors. Both reshape and then call batched_mul(::Array{T,3}, ::Array{T,3}).
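
The batch-of-vectors reading can be spelled out as one matrix-vector product per slice, with the results collected as columns. `naive_batched_vec` is a hypothetical helper for illustration; it does not reshape and call batched_mul as described above.

```julia
# Hypothetical reference: one matrix-vector product per batch index k.
naive_batched_vec(A::AbstractArray{<:Any,3}, B::AbstractMatrix) =
    reduce(hcat, [A[:, :, k] * B[:, k] for k in axes(A, 3)])

# A single vector b is shared by every batch.
naive_batched_vec(A::AbstractArray{<:Any,3}, b::AbstractVector) =
    reduce(hcat, [A[:, :, k] * b for k in axes(A, 3)])

A, B, b = randn(16, 8, 32), randn(8, 32), randn(8)
CB = naive_batched_vec(A, B)   # size (16, 32)
Cb = naive_batched_vec(A, b)   # size (16, 32)
```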

julia> A, B, b = randn(16,8,32), randn(8,32), randn(8);
 
 julia> batched_vec(A,B) |> size
 (16, 32)
 
 julia> batched_vec(A,b) |> size
-(16, 32)
source

Gather and Scatter

Flux's Embedding layer uses NNlib.gather as its backend.

NNlib.gatherFunction
NNlib.gather(src, idx) -> dst

Reverse operation of scatter. Gathers data from source src and writes it in a destination dst according to the index array idx. For each k in CartesianIndices(idx), assign values to dst according to

dst[:, ... , k] .= src[:, ... , idx[k]...]

Notice that if idx is a vector containing integers and src is a matrix, previous expression simplifies to

dst[:, k] .= src[:, idx[k]]

and k will run over 1:length(idx).

The elements of idx can be integers or integer tuples and may be repeated. A single src column can end up being copied into zero, one, or multiple dst columns.

See gather! for an in-place version.

Examples

julia> NNlib.gather([1,20,300,4000], [2,4,2])
+(16, 32)
source

Gather and Scatter

Flux's Embedding layer uses NNlib.gather as its backend.

NNlib.gatherFunction
NNlib.gather(src, idx) -> dst

Reverse operation of scatter. Gathers data from source src and writes it in a destination dst according to the index array idx. For each k in CartesianIndices(idx), assign values to dst according to

dst[:, ... , k] .= src[:, ... , idx[k]...]

Notice that if idx is a vector containing integers and src is a matrix, previous expression simplifies to

dst[:, k] .= src[:, idx[k]]

and k will run over 1:length(idx).

The elements of idx can be integers or integer tuples and may be repeated. A single src column can end up being copied into zero, one, or multiple dst columns.

See gather! for an in-place version.
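
For the matrix-and-integer-vector case, the simplified rule dst[:, k] .= src[:, idx[k]] translates into a short loop. `naive_gather` is a hypothetical helper covering only that case; it is equivalent to the plain indexing expression src[:, idx].

```julia
# Hypothetical reference for a matrix src and a vector of integer indices.
function naive_gather(src::AbstractMatrix, idx::AbstractVector{<:Integer})
    dst = similar(src, (size(src, 1), length(idx)))
    for k in eachindex(idx)
        dst[:, k] .= src[:, idx[k]]   # column idx[k] of src becomes column k of dst
    end
    return dst
end

src = [1 2 3; 4 5 6]
dst = naive_gather(src, [1, 3, 1, 3, 1])
# Column 2 of src is never selected; columns 1 and 3 are copied several times.
```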

Examples

julia> NNlib.gather([1,20,300,4000], [2,4,2])
 3-element Vector{Int64}:
    20
  4000
@@ -307,7 +307,7 @@
 julia> NNlib.gather([1 2 3; 4 5 6], [1,3,1,3,1])
 2×5 Matrix{Int64}:
  1  3  1  3  1
- 4  6  4  6  4
source
gather(src, IJK...)

Convert the tuple of integer vectors IJK to a tuple of CartesianIndex and call gather on it: gather(src, CartesianIndex.(IJK...)).

Examples

julia> src = reshape([1:15;], 3, 5)
+ 4  6  4  6  4
source
gather(src, IJK...)

Convert the tuple of integer vectors IJK to a tuple of CartesianIndex and call gather on it: gather(src, CartesianIndex.(IJK...)).

Examples

julia> src = reshape([1:15;], 3, 5)
 3×5 Matrix{Int64}:
  1  4  7  10  13
  2  5  8  11  14
@@ -316,7 +316,7 @@
 julia> NNlib.gather(src, [1, 2], [2, 4])
 2-element Vector{Int64}:
   4
- 11
source
NNlib.gather!Function
NNlib.gather!(dst, src, idx)

Reverse operation of scatter!. Gathers data from source src and writes it in destination dst according to the index array idx. For each k in CartesianIndices(idx), assign values to dst according to

dst[:, ... , k] .= src[:, ... , idx[k]...]

Notice that if idx is a vector containing integers, and both dst and src are matrices, previous expression simplifies to

dst[:, k] .= src[:, idx[k]]

and k will run over 1:length(idx).

The elements of idx can be integers or integer tuples and may be repeated. A single src column can end up being copied into zero, one, or multiple dst columns.

See gather for an allocating version.

source
NNlib.scatterFunction
NNlib.scatter(op, src, idx; [init, dstsize])

Scatter operation allocating a destination array dst and calling scatter!(op, dst, src, idx) on it.

  • If keyword init is provided, it is used to initialize the content of dst. Otherwise, the init value is inferred from the reduction operator op for some common operators (e.g. init = 0 for op = +).

  • If dstsize is provided, it will be used to define the size of destination array, otherwise it will be inferred by src and idx.

See scatter! for full details on how idx works.

Examples

julia> NNlib.scatter(+, [10,100,1000], [3,1,2])
+ 11
source
NNlib.gather!Function
NNlib.gather!(dst, src, idx)

Reverse operation of scatter!. Gathers data from source src and writes it in destination dst according to the index array idx. For each k in CartesianIndices(idx), assign values to dst according to

dst[:, ... , k] .= src[:, ... , idx[k]...]

Notice that if idx is a vector containing integers, and both dst and src are matrices, previous expression simplifies to

dst[:, k] .= src[:, idx[k]]

and k will run over 1:length(idx).

The elements of idx can be integers or integer tuples and may be repeated. A single src column can end up being copied into zero, one, or multiple dst columns.

See gather for an allocating version.

source
NNlib.scatterFunction
NNlib.scatter(op, src, idx; [init, dstsize])

Scatter operation allocating a destination array dst and calling scatter!(op, dst, src, idx) on it.

  • If keyword init is provided, it is used to initialize the content of dst. Otherwise, the init value is inferred from the reduction operator op for some common operators (e.g. init = 0 for op = +).

  • If dstsize is provided, it will be used to define the size of destination array, otherwise it will be inferred by src and idx.

See scatter! for full details on how idx works.

Examples

julia> NNlib.scatter(+, [10,100,1000], [3,1,2])
 3-element Vector{Int64}:
   100
  1000
@@ -334,7 +334,7 @@
     10
   2000
     10
-    10
source
NNlib.scatter!Function
NNlib.scatter!(op, dst, src, idx)

Scatter operation, which writes data in src into dst at locations idx. A binary reduction operator op is applied during the scatter. For each index k in idx, accumulates values in dst according to

dst[:, ..., idx[k]...] = (op).(dst[:, ..., idx[k]...], src[:, ..., k...])

See also scatter, gather.

Arguments

  • op: Operations to be applied on dst and src, e.g. +, -, *, /, max, min and mean.
  • dst: The destination for src to aggregate to. This argument will be mutated.
  • src: The source data for aggregating.
  • idx: The mapping for aggregation from source (index) to destination (value). The idx array can contain either integers or tuples.

Examples

julia> NNlib.scatter!(+, ones(3), [10,100], [1,3])
+    10
source
NNlib.scatter!Function
NNlib.scatter!(op, dst, src, idx)

Scatter operation, which writes data in src into dst at locations idx. A binary reduction operator op is applied during the scatter. For each index k in idx, accumulates values in dst according to

dst[:, ..., idx[k]...] = (op).(dst[:, ..., idx[k]...], src[:, ..., k...])

See also scatter, gather.

Arguments

  • op: Operations to be applied on dst and src, e.g. +, -, *, /, max, min and mean.
  • dst: The destination for src to aggregate to. This argument will be mutated.
  • src: The source data for aggregating.
  • idx: The mapping for aggregation from source (index) to destination (value). The idx array can contain either integers or tuples.
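For the simplest case (vectors for dst and src, integer indices), the accumulation rule can be sketched as a loop. `scatter_vec!` is a hypothetical name used only for this illustration; the real scatter! also handles higher-dimensional arrays and tuple indices.

```julia
# Hypothetical helper (not part of NNlib): scatter with reduction for vectors.
function scatter_vec!(op, dst::AbstractVector, src::AbstractVector,
                      idx::AbstractVector{<:Integer})
    for k in eachindex(idx)
        # combine the value already stored at the target slot with src[k]
        dst[idx[k]] = op(dst[idx[k]], src[k])
    end
    return dst
end

out = scatter_vec!(+, ones(3), [10, 100], [1, 3])

# repeated indices accumulate into the same slot
acc = scatter_vec!(+, zeros(2), [1, 2, 3], [1, 1, 2])
```

Slots of dst that no index points to keep their initial value, which is why the choice of init matters for the allocating scatter.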

Examples

julia> NNlib.scatter!(+, ones(3), [10,100], [1,3])
 3-element Vector{Float64}:
   11.0
    1.0
@@ -343,7 +343,7 @@
 julia> NNlib.scatter!(*, fill(0.5, 2, 4), [1 10; 100 1000], [3,2])
 2×4 Matrix{Float64}:
  0.5    5.0   0.5  0.5
- 0.5  500.0  50.0  0.5
source

Sampling

NNlib.grid_sampleFunction
grid_sample(input::AbstractArray{T, 4}, grid::AbstractArray{T, 4}; padding_mode = :zeros)

Given input, compute output by sampling input values at pixel locations from grid. Uses bilinear interpolation to calculate output values.

This implementation assumes the extrema (-1 and 1) are considered as referring to the center points of the input’s corner pixels (i.e. align corners is true).

Arguments

  • input: Input array in (W_in, H_in, C, N) shape.

  • grid: Input grid in (2, W_out, H_out, N) shape. Where for each (W_out, H_out, N) grid contains (x, y) coordinates that specify sampling locations normalized by the input shape.

    Therefore, x and y should have values in [-1, 1] range. For example, (x = -1, y = -1) is the left-top pixel of input, and (x = 1, y = 1) is the right-bottom pixel of input.

    Out-of-bound values are handled according to the padding_mode.

  • padding_mode: Out-of-bound padding. :zeros to use 0 for out-of-bound grid locations. :border to use border values for out-of-bound grid locations. Default is :zeros.

Returns

(W_out, H_out, C, N) sampled grid from input.

Examples

In the example below, grid contains two out-of-bound sampling locations, which are handled differently, depending on the padding_mode.

julia> x = reshape(collect(1.0:4.0), (2, 2, 1, 1))
+ 0.5  500.0  50.0  0.5
source

Sampling

NNlib.grid_sampleFunction
grid_sample(input::AbstractArray{T, 4}, grid::AbstractArray{T, 4}; padding_mode = :zeros)

Given input, compute output by sampling input values at pixel locations from grid. Uses bilinear interpolation to calculate output values.

This implementation assumes the extrema (-1 and 1) are considered as referring to the center points of the input’s corner pixels (i.e. align corners is true).

Arguments

  • input: Input array in (W_in, H_in, C, N) shape.

  • grid: Input grid in (2, W_out, H_out, N) shape. Where for each (W_out, H_out, N) grid contains (x, y) coordinates that specify sampling locations normalized by the input shape.

    Therefore, x and y should have values in [-1, 1] range. For example, (x = -1, y = -1) is the left-top pixel of input, and (x = 1, y = 1) is the right-bottom pixel of input.

    Out-of-bound values are handled according to the padding_mode.

  • padding_mode: Out-of-bound padding. :zeros to use 0 for out-of-bound grid locations. :border to use border values for out-of-bound grid locations. Default is :zeros.
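Under the corner-aligned convention stated above, a normalized coordinate maps linearly onto 1-based pixel positions. The helper below is a sketch of that mapping only; `unnormalize_sketch` is a name assumed for this example, and the interpolation and padding steps are not shown.

```julia
# Hypothetical helper: map c in [-1, 1] to a (possibly fractional) 1-based
# pixel position along an axis with n pixels, with -1 and 1 hitting the
# centres of the first and last pixel.
unnormalize_sketch(c, n) = (c + 1) / 2 * (n - 1) + 1

first_px = unnormalize_sketch(-1.0, 5)   # 1.0
mid_px   = unnormalize_sketch(0.0, 5)    # 3.0
last_px  = unnormalize_sketch(1.0, 5)    # 5.0
```

Coordinates outside [-1, 1] map to positions outside 1:n, which is where padding_mode comes into play.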

Returns

(W_out, H_out, C, N) sampled grid from input.

Examples

In the example below, grid contains two out-of-bound sampling locations, which are handled differently, depending on the padding_mode.

julia> x = reshape(collect(1.0:4.0), (2, 2, 1, 1))
 2×2×1×1 Array{Float64, 4}:
 [:, :, 1, 1] =
  1.0  3.0
@@ -375,4 +375,4 @@
 [:, :, 1, 1] =
  1.0  3.0
  1.5  3.5
- 2.0  4.0
source
NNlib.∇grid_sampleFunction
∇grid_sample(Δ::AbstractArray{T, 4}, input::AbstractArray{T, 4}, grid::AbstractArray{T, 4}; padding_mode = :zeros) where T

Arguments

  • Δ: Input gradient in (W_out, H_out, C, N) shape (same as output of the primal computation).
  • input: Input from primal computation in (W_in, H_in, C, N) shape.
  • grid: Grid from primal computation in (2, W_out, H_out, N) shape.
  • padding_mode: Out-of-bound padding. :zeros to use 0 for out-of-bound grid locations. :border to use border values for out-of-bound grid locations. Should be the same as in primal computation. Default is :zeros.

Returns

dinput (same shape as input) and dgrid (same shape as grid) gradients.

source

Losses

NNlib.ctc_lossFunction
ctc_loss(ŷ, y)

Computes the connectionist temporal classification loss between ŷ and y. ŷ must be a classes-by-time matrix, i.e., each row represents a class and each column represents a time step. Additionally, the logsoftmax function will be applied to ŷ, so ŷ must be the raw activation values from the neural network and not, for example, the activations after being passed through a softmax activation function. y must be a 1D array of the labels associated with ŷ. The blank label is assumed to be the last label category in ŷ, so it is equivalent to size(ŷ, 1). Used for sequence-to-sequence classification problems such as speech recognition and handwriting recognition where the exact time-alignment of the output (e.g., letters) is not needed to solve the problem. See Graves et al. (2006) or Graves (2012) for mathematical details.

source

Miscellaneous

NNlib.logsumexpFunction
logsumexp(x; dims = :)

Computes log.(sum(exp.(x); dims)) in a numerically stable way. Without dims keyword this returns a scalar.

See also logsoftmax.

source
+ 2.0  4.0
source
NNlib.∇grid_sampleFunction
∇grid_sample(Δ::AbstractArray{T, 4}, input::AbstractArray{T, 4}, grid::AbstractArray{T, 4}; padding_mode = :zeros) where T

Arguments

  • Δ: Input gradient in (W_out, H_out, C, N) shape (same as output of the primal computation).
  • input: Input from primal computation in (W_in, H_in, C, N) shape.
  • grid: Grid from primal computation in (2, W_out, H_out, N) shape.
  • padding_mode: Out-of-bound padding. :zeros to use 0 for out-of-bound grid locations. :border to use border values for out-of-bound grid locations. Should be the same as in primal computation. Default is :zeros.

Returns

dinput (same shape as input) and dgrid (same shape as grid) gradients.

source

Losses

NNlib.ctc_lossFunction
ctc_loss(ŷ, y)

Computes the connectionist temporal classification loss between ŷ and y. ŷ must be a classes-by-time matrix, i.e., each row represents a class and each column represents a time step. Additionally, the logsoftmax function will be applied to ŷ, so ŷ must be the raw activation values from the neural network and not, for example, the activations after being passed through a softmax activation function. y must be a 1D array of the labels associated with ŷ. The blank label is assumed to be the last label category in ŷ, so it is equivalent to size(ŷ, 1). Used for sequence-to-sequence classification problems such as speech recognition and handwriting recognition where the exact time-alignment of the output (e.g., letters) is not needed to solve the problem. See Graves et al. (2006) or Graves (2012) for mathematical details.

source

Miscellaneous

NNlib.logsumexpFunction
logsumexp(x; dims = :)

Computes log.(sum(exp.(x); dims)) in a numerically stable way. Without dims keyword this returns a scalar.

See also logsoftmax.
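The usual way to make this computation stable is to subtract the maximum before exponentiating. The sketch below shows that trick for the scalar (no dims) case; `logsumexp_sketch` is a hypothetical name, and the sketch does not claim to match NNlib's implementation or its handling of infinite inputs.

```julia
# Hypothetical helper: log(sum(exp.(x))) computed with the max-shift trick.
function logsumexp_sketch(x)
    m = maximum(x)
    return m + log(sum(exp.(x .- m)))   # every exponent is <= 0, so no overflow
end

x = [1000.0, 1000.0]
naive  = log(sum(exp.(x)))      # exp(1000.0) overflows to Inf
stable = logsumexp_sketch(x)    # 1000 + log(2)
```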

source
NNlib.gluFunction
glu(x, dim = 1)

The gated linear unit from the "Language Modeling with Gated Convolutional Networks" paper.

Calculates a .* sigmoid(b), where x is split in half along given dimension dim to form a and b.
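For a vector input the split-and-gate step looks as follows. Both `sigmoid_sketch` and `glu_sketch` are names made up for this illustration, which covers only the one-dimensional case.

```julia
sigmoid_sketch(z) = 1 / (1 + exp(-z))

# Hypothetical helper: gated linear unit for a vector of even length.
function glu_sketch(x::AbstractVector)
    n = length(x)
    iseven(n) || throw(ArgumentError("length of x must be even"))
    half = n ÷ 2
    a, b = x[1:half], x[half+1:end]   # first half is the value, second half the gate
    return a .* sigmoid_sketch.(b)
end

y = glu_sketch([2.0, 4.0, 0.0, 0.0])   # gates are sigmoid(0) = 0.5
```

The output has half as many entries along the split dimension as the input.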

source
diff --git a/previews/PR2464/reference/outputsize/index.html b/previews/PR2464/reference/outputsize/index.html
index 2e306a4ef1..fe8cc1582f 100644
--- a/previews/PR2464/reference/outputsize/index.html
+++ b/previews/PR2464/reference/outputsize/index.html
@@ -68,7 +68,7 @@
 # plus 2 non-trainable, 10 parameters, summarysize 10.469 KiB.
 julia> outputsize(ans, (28, 28, 1, 32))
-(10, 32)

Limitations:

source
Flux.outputsizeFunction
outputsize(m, x_size, y_size, ...; padbatch=false)

For model or layer m accepting multiple arrays as input, this returns size(m((x, y, ...))) given size_x = size(x), etc.

Examples

julia> x, y = rand(Float32, 5, 64), rand(Float32, 7, 64);
+(10, 32)

Limitations:

  • While @autosize (5, 32) Flux.Bilinear(_ => 7) is OK, something like Bilinear((_, _) => 7) will fail.
  • While Scale(_) and LayerNorm(_) are fine (and use the first dimension), Scale(_,_) and LayerNorm(_,_) will fail if size(x,1) != size(x,2).
source
Flux.outputsizeFunction
outputsize(m, x_size, y_size, ...; padbatch=false)

For model or layer m accepting multiple arrays as input, this returns size(m((x, y, ...))) given size_x = size(x), etc.

Examples

julia> x, y = rand(Float32, 5, 64), rand(Float32, 7, 64);
 
 julia> par = Parallel(vcat, Dense(5 => 9), Dense(7 => 11));
 
@@ -81,4 +81,4 @@
 (13, 1)
 
 julia> par(x, y) == par((x, y)) == Chain(par, identity)((x, y))
-true

Notice that Chain only accepts multiple arrays as a tuple, while Parallel also accepts them as multiple arguments; outputsize always supplies the tuple.

source
+true

Notice that Chain only accepts multiple arrays as a tuple, while Parallel also accepts them as multiple arguments; outputsize always supplies the tuple.

source
diff --git a/previews/PR2464/reference/training/callbacks/index.html b/previews/PR2464/reference/training/callbacks/index.html
index 83870d3b74..a96ee5a6b5 100644
--- a/previews/PR2464/reference/training/callbacks/index.html
+++ b/previews/PR2464/reference/training/callbacks/index.html
@@ -10,7 +10,7 @@
 sleep(1)
 end
 Flux
-Flux
source

Patience Helpers

Flux provides utilities for controlling your training procedure according to some monitored condition and a maximum patience. For example, you can use early_stopping to stop training when the model is converging or deteriorating, or you can use plateau to check if the model is stagnating.

For example, below we create a pseudo-loss function that decreases, bottoms out, and then increases. The early stopping trigger will break the loop before the loss increases too much.

# create a pseudo-loss that decreases for 4 calls, then starts increasing
+Flux
source

Patience Helpers

Flux provides utilities for controlling your training procedure according to some monitored condition and a maximum patience. For example, you can use early_stopping to stop training when the model is converging or deteriorating, or you can use plateau to check if the model is stagnating.

For example, below we create a pseudo-loss function that decreases, bottoms out, and then increases. The early stopping trigger will break the loop before the loss increases too much.

# create a pseudo-loss that decreases for 4 calls, then starts increasing
 # we call this like loss()
 loss = let t = 0
   () -> begin
@@ -61,7 +61,7 @@
        end
 [ Info: Epoch 1
 [ Info: Epoch 2
-[ Info: Epoch 3
source
Flux.early_stoppingFunction
early_stopping(f, delay; distance = -, init_score = 0, min_dist = 0)

Return a function that increments an internal counter by one when distance(best_score, f(...)) <= min_dist, where best_score is the best value of f(...) seen so far. If the count is greater than or equal to delay, the function returns true; otherwise it returns false. The count is reset when distance(best_score, f(...)) > min_dist.

Examples

julia> loss = let l = 0
+[ Info: Epoch 3
source
Flux.early_stoppingFunction
early_stopping(f, delay; distance = -, init_score = 0, min_dist = 0)

Return a function that increments an internal counter by one when distance(best_score, f(...)) <= min_dist, where best_score is the best value of f(...) seen so far. If the count is greater than or equal to delay, the function returns true; otherwise it returns false. The count is reset when distance(best_score, f(...)) > min_dist.
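The counting behaviour can be pictured as a small closure. The sketch below follows the wording of the description only; `early_stopping_sketch` is a hypothetical name, and the rule used to update best_score (replace it whenever the count is reset) is an assumption, so Flux's actual implementation may differ in detail.

```julia
# Hypothetical re-implementation for illustration; not Flux's code.
function early_stopping_sketch(f, delay; distance = -, init_score = 0, min_dist = 0)
    best_score = init_score
    count = 0
    return function (args...)
        score = f(args...)
        if distance(best_score, score) <= min_dist
            count += 1             # no sufficient improvement on this call
        else
            count = 0              # improvement: reset the patience counter
            best_score = score     # assumption: remember the improved score
        end
        return count >= delay
    end
end

# pseudo loss that returns increasing values 1, 2, 3, ...
loss = let l = 0
    () -> l += 1
end

es = early_stopping_sketch(loss, 3)
results = [es() for _ in 1:3]
```

With a steadily increasing loss the trigger fires on the third call, which matches the three epochs logged in the example below.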

Examples

julia> loss = let l = 0
          () -> l += 1
        end; # pseudo loss function that returns increasing values
 
@@ -74,7 +74,7 @@
        end
 [ Info: Epoch 1
 [ Info: Epoch 2
-[ Info: Epoch 3
source
Flux.plateauFunction
plateau(f, width; distance = -, init_score = 0, min_dist = 1f-6)

Return a function that increments an internal counter by one when abs(distance(last_score, f(...))) <= min_dist, where last_score holds the previous value of f(...). If the count is greater than or equal to width, the function returns true; otherwise it returns false. The count is reset when abs(distance(last_score, f(...))) > min_dist.

Examples

julia> f = let v = 10
+[ Info: Epoch 3
source
Flux.plateauFunction
plateau(f, width; distance = -, init_score = 0, min_dist = 1f-6)

Return a function that increments an internal counter by one when abs(distance(last_score, f(...))) <= min_dist, where last_score holds the previous value of f(...). If the count is greater than or equal to width, the function returns true; otherwise it returns false. The count is reset when abs(distance(last_score, f(...))) > min_dist.

Examples

julia> f = let v = 10
          () -> v = v / abs(v) - v
        end; # -9, 8, -7, 6, ...
 
@@ -88,4 +88,4 @@
 [ Info: Epoch 1
 [ Info: Epoch 2
 [ Info: Epoch 3
-[ Info: Epoch 4
source
+[ Info: Epoch 4
source
diff --git a/previews/PR2464/reference/training/optimisers/index.html b/previews/PR2464/reference/training/optimisers/index.html
index 0376043d40..db36a55972 100644
--- a/previews/PR2464/reference/training/optimisers/index.html
+++ b/previews/PR2464/reference/training/optimisers/index.html
@@ -40,4 +40,4 @@
 for epoch in 1:100
   opt.eta = next!(schedule)
   # your training code here
-end

ParameterSchedulers.jl allows for many more scheduling policies including arbitrary functions, looping any function with a given period, or sequences of many schedules. See the ParameterSchedulers.jl documentation for more info.

Decays

Similar to optimisers, Flux also defines some simple decays that can be used in conjunction with other optimisers, or standalone.

Optimisers.SignDecayType
SignDecay(λ = 1e-3)

Implements $L_1$ regularisation, also known as LASSO regression, when composed with other rules as the first transformation in an OptimiserChain.

It does this by adding λ .* sign(x) to the gradient. This is equivalent to adding λ * sum(abs, x) == λ * norm(x, 1) to the loss.

See also WeightDecay for $L_2$ regularisation. The two can be used together: OptimiserChain(SignDecay(0.012), WeightDecay(0.034), Adam()) is equivalent to adding 0.012 * norm(x, 1) + 0.017 * norm(x, 2)^2 to the loss function.

Parameters

  • Penalty (λ ≥ 0): Controls the strength of the regularisation.
source
Optimisers.WeightDecayType
WeightDecay(λ = 5e-4)

Implements $L_2$ regularisation, also known as ridge regression, when composed with other rules as the first transformation in an OptimiserChain.

It does this by adding λ .* x to the gradient. This is equivalent to adding λ/2 * sum(abs2, x) == λ/2 * norm(x)^2 to the loss.

See also SignDecay for $L_1$ regularisation.

Parameters

  • Penalty (λ ≥ 0): Controls the strength of the regularisation.
source

Gradient Clipping

Gradient clipping is useful for training recurrent neural networks, which have a tendency to suffer from the exploding gradient problem. An example usage is

opt = OptimiserChain(ClipGrad(1e-3), Adam(1e-3))
Optimisers.ClipGradType
ClipGrad(δ = 10)

Restricts every gradient component to obey -δ ≤ dx[i] ≤ δ.

Typically composed with other rules using OptimiserChain.

See also ClipNorm.

source
Optimisers.ClipNormType
ClipNorm(ω = 10, p = 2; throw = true)

Scales any gradient array for which norm(dx, p) > ω to stay at this threshold (unless p==0).

Throws an error if the norm is infinite or NaN, which you can turn off with throw = false.

Typically composed with other rules using OptimiserChain.

See also ClipGrad.

source
+end

ParameterSchedulers.jl allows for many more scheduling policies including arbitrary functions, looping any function with a given period, or sequences of many schedules. See the ParameterSchedulers.jl documentation for more info.

Decays

Similar to optimisers, Flux also defines some simple decays that can be used in conjunction with other optimisers, or standalone.

Optimisers.SignDecayType
SignDecay(λ = 1e-3)

Implements $L_1$ regularisation, also known as LASSO regression, when composed with other rules as the first transformation in an OptimiserChain.

It does this by adding λ .* sign(x) to the gradient. This is equivalent to adding λ * sum(abs, x) == λ * norm(x, 1) to the loss.

See also WeightDecay for $L_2$ regularisation. The two can be used together: OptimiserChain(SignDecay(0.012), WeightDecay(0.034), Adam()) is equivalent to adding 0.012 * norm(x, 1) + 0.017 * norm(x, 2)^2 to the loss function.

Parameters

  • Penalty (λ ≥ 0): Controls the strength of the regularisation.
source
Optimisers.WeightDecayType
WeightDecay(λ = 5e-4)

Implements $L_2$ regularisation, also known as ridge regression, when composed with other rules as the first transformation in an OptimiserChain.

It does this by adding λ .* x to the gradient. This is equivalent to adding λ/2 * sum(abs2, x) == λ/2 * norm(x)^2 to the loss.
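The stated equivalence can be checked numerically: the gradient of λ/2 * sum(abs2, x) should equal λ .* x. The snippet below is a plain-Julia check using central finite differences; the values of λ and x are arbitrary and chosen only for illustration.

```julia
λ = 0.5
x = [1.0, -2.0, 3.0]
penalty(v) = λ / 2 * sum(abs2, v)

# Central finite differences approximate each component of the gradient.
h = 1e-6
fd_grad = map(eachindex(x)) do i
    e = zeros(length(x))
    e[i] = h
    (penalty(x .+ e) - penalty(x .- e)) / (2h)
end

analytic = λ .* x   # what WeightDecay is described as adding to the gradient
```

The factor of 1/2 in the penalty is what makes the added gradient term exactly λ .* x rather than 2λ .* x.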

See also SignDecay for $L_1$ regularisation.

Parameters

  • Penalty (λ ≥ 0): Controls the strength of the regularisation.
source

Gradient Clipping

Gradient clipping is useful for training recurrent neural networks, which have a tendency to suffer from the exploding gradient problem. An example usage is

opt = OptimiserChain(ClipGrad(1e-3), Adam(1e-3))
Optimisers.ClipGradType
ClipGrad(δ = 10)

Restricts every gradient component to obey -δ ≤ dx[i] ≤ δ.
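On a plain array this constraint amounts to an element-wise clamp. The snippet is a sketch of the effect on one gradient array, using Base's `clamp`; it is not the rule's actual implementation.

```julia
δ = 10
dx = [-25.0, 3.0, 12.5]

# every component is forced into the interval [-δ, δ]
clipped = clamp.(dx, -δ, δ)
```

Components already inside the interval, such as 3.0 here, are left unchanged.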

Typically composed with other rules using OptimiserChain.

See also ClipNorm.

source
Optimisers.ClipNormType
ClipNorm(ω = 10, p = 2; throw = true)

Scales any gradient array for which norm(dx, p) > ω to stay at this threshold (unless p==0).
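The rescaling can be sketched as follows. `clipnorm_sketch` is a hypothetical helper that ignores the throw keyword and the p == 0 special case mentioned above; it only illustrates the scaling step.

```julia
using LinearAlgebra: norm

# Hypothetical helper: shrink dx so that its p-norm does not exceed ω.
function clipnorm_sketch(dx, ω = 10, p = 2)
    n = norm(dx, p)
    return n > ω ? dx .* (ω / n) : dx
end

g_big   = clipnorm_sketch([30.0, 40.0])   # norm 50, rescaled to norm 10
g_small = clipnorm_sketch([3.0, 4.0])     # norm 5, returned unchanged
```

Unlike component-wise clipping, this keeps the direction of the gradient and changes only its length.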

Throws an error if the norm is infinite or NaN, which you can turn off with throw = false.

Typically composed with other rules using OptimiserChain.

See also ClipGrad.

source
diff --git a/previews/PR2464/reference/training/reference/index.html b/previews/PR2464/reference/training/reference/index.html
index 64e81c8dea..e4e39ffa2f 100644
--- a/previews/PR2464/reference/training/reference/index.html
+++ b/previews/PR2464/reference/training/reference/index.html
@@ -19,14 +19,14 @@
 10.19
 julia> opt_state # mutated by Flux.train!
-(weight = Leaf(Momentum(0.1, 0.9), [-2.018 3.027]), bias = Leaf(Momentum(0.1, 0.9), [-10.09]), σ = ())
source
Flux.Optimise.train!Method
train!(loss, model, data, opt_state)

Uses a loss function and training data to improve the model's parameters according to a particular optimisation rule encoded in opt_state. Iterates through data once, evaluating for each d in data either loss(model, d...) if d isa Tuple, or else loss(model, d) for other d.

If model is an Enzyme.Duplicated and Enzyme.jl is loaded, gradients will be computed with Enzyme, otherwise they will be computed with Zygote.

For example, with these definitions...

data = [(x1, y1), (x2, y2), (x3, y3)]
+(weight = Leaf(Momentum(0.1, 0.9), [-2.018 3.027]), bias = Leaf(Momentum(0.1, 0.9), [-10.09]), σ = ())
source
Flux.Optimise.train!Method
train!(loss, model, data, opt_state)

Uses a loss function and training data to improve the model's parameters according to a particular optimisation rule encoded in opt_state. Iterates through data once, evaluating for each d in data either loss(model, d...) if d isa Tuple, or else loss(model, d) for other d.

If model is an Enzyme.Duplicated and Enzyme.jl is loaded, gradients will be computed with Enzyme, otherwise they will be computed with Zygote.

For example, with these definitions...

data = [(x1, y1), (x2, y2), (x3, y3)]
 
 loss3(m, x, y) = norm(m(x) .- y)        # the model is the first argument
 
 opt_state = Flux.setup(Adam(), model)   # explicit setup of optimiser momenta

...calling Flux.train!(loss3, model, data, opt_state) runs a loop much like this:

for d in data
     ∂L∂m = gradient(loss3, model, d...)[1]
     update!(opt_state, model, ∂L∂m)
-end

You can also write this loop yourself, if you need more flexibility. For this reason train! is not highly extensible. It adds only a few features to the loop above:

  • Stop with a DomainError if the loss is infinite or NaN at any point.

  • Show a progress bar using @withprogress.

New

This method was added in Flux 0.13.9. It has significant changes from the one used by Flux ≤ 0.13:

  • It now takes the model itself, not the result of Flux.params. (This is to move away from Zygote's "implicit" parameter handling, with Grads.)
  • Instead of loss being a function which accepts only the data, now it must also accept the model itself, as the first argument.
  • opt_state should be the result of Flux.setup. Using an optimiser such as Adam() without this step should give you a warning.
  • Callback functions are not supported. (But any code can be included in the above for loop.)
source
Optimisers.updateFunction
Optimisers.update(tree, model, gradient) -> (tree, model)

Uses the optimiser and the gradient to change the trainable parameters in the model. Returns the improved model, and the optimiser states needed for the next update. The initial tree of states comes from setup.

See also update!, which will be faster for models of ordinary Arrays or CuArrays.

Example

julia> m = (x = Float32[1,2,3], y = tanh);
+end

You can also write this loop yourself, if you need more flexibility. For this reason train! is not highly extensible. It adds only a few features to the loop above:

  • Stop with a DomainError if the loss is infinite or NaN at any point.

  • Show a progress bar using @withprogress.

New

This method was added in Flux 0.13.9. It has significant changes from the one used by Flux ≤ 0.13:

  • It now takes the model itself, not the result of Flux.params. (This is to move away from Zygote's "implicit" parameter handling, with Grads.)
  • Instead of loss being a function which accepts only the data, now it must also accept the model itself, as the first argument.
  • opt_state should be the result of Flux.setup. Using an optimiser such as Adam() without this step should give you a warning.
  • Callback functions are not supported. (But any code can be included in the above for loop.)
source
Optimisers.updateFunction
Optimisers.update(tree, model, gradient) -> (tree, model)

Uses the optimiser and the gradient to change the trainable parameters in the model. Returns the improved model, and the optimiser states needed for the next update. The initial tree of states comes from setup.

See also update!, which will be faster for models of ordinary Arrays or CuArrays.

Example

julia> m = (x = Float32[1,2,3], y = tanh);
 
 julia> t = Optimisers.setup(Descent(0.1), m)
 (x = Leaf(Descent(0.1), nothing), y = ())
@@ -115,4 +115,4 @@
 julia> Optimisers.thaw!(s)
 
 julia> s.x
-(Leaf(Momentum(0.01, 0.9), [0.0]), ())
source
Optimisers.thaw!Function
Optimisers.thaw!(tree)

The reverse of freeze!. Applies to all parameters, mutating every Leaf(rule, state, frozen = true) to Leaf(rule, state, frozen = false).

source
+(Leaf(Momentum(0.01, 0.9), [0.0]), ())
source
Optimisers.thaw!Function
Optimisers.thaw!(tree)

The reverse of freeze!. Applies to all parameters, mutating every Leaf(rule, state, frozen = true) to Leaf(rule, state, frozen = false).

source
diff --git a/previews/PR2464/reference/training/zygote/index.html b/previews/PR2464/reference/training/zygote/index.html
index 4349629848..bd02005cc0 100644
--- a/previews/PR2464/reference/training/zygote/index.html
+++ b/previews/PR2464/reference/training/zygote/index.html
@@ -203,4 +203,4 @@
 # this definition of map is for any AD that only defines a reverse mode.
 # It is not as good as the rrule that can be used if the AD defines a forward-mode as well.
-rrule(conf::RuleConfig{>:Union{NoForwardsMode, HasReverseMode}}, typeof(map), ::Vector) = ...

For more details see rule configurations and calling back into AD.

source
ChainRulesCore.TangentType
Tangent{P, T} <: StructuralTangent{P} <: AbstractTangent

This type represents the tangent for a struct/NamedTuple, or Tuple. P is the corresponding primal type that this is a tangent for.

Tangent{P} should have fields (technically properties) that match a subset of the fields of the primal type, and each should be a tangent type matching the primal type of that field. Fields of P that are not present in the Tangent are treated as Zero.

T is an implementation detail representing the backing data structure. For Tuple it will be a Tuple, and for everything else it will be a NamedTuple. It should not be passed in by the user.

For Tangents of Tuples, iterate and getindex are overloaded to behave as they do for a tuple. For Tangents of structs, getproperty is overloaded to allow for accessing values via tangent.fieldname. Any fields not explicitly present in the Tangent are treated as being set to ZeroTangent(). To make a Tangent have all the fields of the primal, the canonicalize function is provided.

source
ChainRulesCore.canonicalizeFunction
canonicalize(tangent::Tangent{P}) -> Tangent{P}

Return the canonical Tangent for the primal type P. The property names of the returned Tangent match the field names of the primal, and all fields of P not present in the input tangent are explicitly set to ZeroTangent().

source
+rrule(conf::RuleConfig{>:Union{NoForwardsMode, HasReverseMode}}, typeof(map), ::Vector) = ...

For more details see rule configurations and calling back into AD.

source
ChainRulesCore.TangentType
Tangent{P, T} <: StructuralTangent{P} <: AbstractTangent

This type represents the tangent for a struct/NamedTuple, or Tuple. P is the corresponding primal type that this is a tangent for.

Tangent{P} should have fields (technically properties) that match a subset of the fields of the primal type, and each should be a tangent type matching the primal type of that field. Fields of P that are not present in the Tangent are treated as Zero.

T is an implementation detail representing the backing data structure. For Tuple it will be a Tuple, and for everything else it will be a NamedTuple. It should not be passed in by the user.

For Tangents of Tuples, iterate and getindex are overloaded to behave as they do for a tuple. For Tangents of structs, getproperty is overloaded to allow for accessing values via tangent.fieldname. Any fields not explicitly present in the Tangent are treated as being set to ZeroTangent(). To make a Tangent have all the fields of the primal, the canonicalize function is provided.

source
ChainRulesCore.canonicalizeFunction
canonicalize(tangent::Tangent{P}) -> Tangent{P}

Return the canonical Tangent for the primal type P. The property names of the returned Tangent match the field names of the primal, and all fields of P not present in the input tangent are explicitly set to ZeroTangent().

source
diff --git a/previews/PR2464/reference/utilities/index.html b/previews/PR2464/reference/utilities/index.html
index bbf1d0cb79..a8fe8c3663 100644
--- a/previews/PR2464/reference/utilities/index.html
+++ b/previews/PR2464/reference/utilities/index.html
@@ -32,7 +32,7 @@
 julia> ans.bias
 2-element Vector{Float32}:
  0.0
- 0.0

References

[1] Glorot, Xavier, and Yoshua Bengio. "Understanding the difficulty of training deep feedforward neural networks." Proceedings of the thirteenth international conference on artificial intelligence and statistics. 2010.

source
Flux.glorot_normalFunction
glorot_normal([rng], size...; gain = 1) -> Array
+ 0.0

References

[1] Glorot, Xavier, and Yoshua Bengio. "Understanding the difficulty of training deep feedforward neural networks." Proceedings of the thirteenth international conference on artificial intelligence and statistics. 2010.

source
Flux.glorot_normalFunction
glorot_normal([rng], size...; gain = 1) -> Array
 glorot_normal([rng]; kw...) -> Function

Return an Array{Float32} of the given size containing random numbers drawn from a normal distribution with standard deviation gain * sqrt(2 / (fan_in + fan_out)), using nfan.
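For a two-dimensional weight of size (n_out, n_in), the formula can be written out directly. `glorot_normal_sketch` is a hypothetical helper that assumes fan_in = n_in and fan_out = n_out for this shape; it does not reproduce how nfan treats convolutional kernels or other shapes.

```julia
using Statistics: std

# Hypothetical helper: Glorot/Xavier normal initialisation for a dense weight.
function glorot_normal_sketch(n_out, n_in; gain = 1)
    σ = gain * sqrt(2 / (n_in + n_out))
    return randn(Float32, n_out, n_in) .* Float32(σ)
end

W = glorot_normal_sketch(10, 1000)
expected = sqrt(2 / (1000 + 10))   # about 0.0445
observed = std(W)                  # random, but should be close to `expected`
```

Because the standard deviation depends only on fan_in + fan_out, a 10×1000 and a 1000×10 weight get the same scale.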

This method is described in [1] and also known as Xavier initialization.

Examples

julia> using Statistics
 
 julia> round(std(Flux.glorot_normal(10, 1000)), digits=3)
@@ -48,7 +48,7 @@
 Dense(10 => 1000, tanh)  # 11_000 parameters
 
 julia> round(std(ans.weight), sigdigits=3)
-4.45f0

References

[1] Glorot, Xavier, and Yoshua Bengio. "Understanding the difficulty of training deep feedforward neural networks." Proceedings of the thirteenth international conference on artificial intelligence and statistics. 2010.

source
Flux.kaiming_uniformFunction
kaiming_uniform([rng], size...; gain = √2) -> Array
+4.45f0

References

[1] Glorot, Xavier, and Yoshua Bengio. "Understanding the difficulty of training deep feedforward neural networks." Proceedings of the thirteenth international conference on artificial intelligence and statistics. 2010.

source
Flux.kaiming_uniformFunction
kaiming_uniform([rng], size...; gain = √2) -> Array
 kaiming_uniform([rng]; kw...) -> Function

Return an Array{Float32} of the given size containing random numbers drawn from a uniform distribution on the interval [-x, x], where x = gain * sqrt(3/fan_in) using nfan.

This method is described in [1] and also known as He initialization.
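As a worked instance of the bound, the sketch below computes x for size (100, 10), where n_in = 10 gives fan_in = 10, and draws uniform samples in plain Julia. It is an illustration under those assumptions and not Flux's implementation; x and W are illustrative names.

```julia
using Random

# Assumed size (100, 10): n_out = 100, n_in = 10, so fan_in = 10.
fan_in = 10
gain = sqrt(2)

# Half-width of the sampling interval [-x, x].
x = gain * sqrt(3 / fan_in)   # ≈ 0.7746

# Plain-Julia stand-in: map rand() from [0, 1) onto [-x, x).
rng = MersenneTwister(1)
W = Float32.((rand(rng, 100, 10) .- 0.5) .* 2x)

println("bound x: ", round(x, digits=4))
println("extrema: ", extrema(W))
```

The computed bound of about 0.7746 agrees with the extrema of roughly ±0.774 shown in the first example below.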

Examples

julia> round.(extrema(Flux.kaiming_uniform(100, 10)), digits=3)
 (-0.774f0, 0.773f0)
 
@@ -56,7 +56,7 @@
 (-0.243f0, 0.245f0)
 
 julia> round.(extrema(Flux.kaiming_uniform(100, 100)), digits=3)
-(-0.245f0, 0.245f0)

References

[1] He, Kaiming, et al. "Delving deep into rectifiers: Surpassing human-level performance on imagenet classification." Proceedings of the IEEE international conference on computer vision. 2015.

source
Flux.kaiming_normalFunction
kaiming_normal([rng], size...; gain = √2) -> Array
+(-0.245f0, 0.245f0)

References

[1] He, Kaiming, et al. "Delving deep into rectifiers: Surpassing human-level performance on imagenet classification." Proceedings of the IEEE international conference on computer vision. 2015.

source
Flux.kaiming_normalFunction
kaiming_normal([rng], size...; gain = √2) -> Array
 kaiming_normal([rng]; kw...) -> Function

Return an Array{Float32} of the given size containing random numbers drawn from a normal distribution with standard deviation gain / sqrt(fan_in), using nfan.

This method is described in [1] and also known as He initialization.

Examples

julia> using Statistics
 
 julia> round(std(Flux.kaiming_normal(10, 1000)), digits=3)
@@ -66,7 +66,7 @@
 0.449f0
 
 julia> round(std(Flux.kaiming_normal(1000, 1000)), digits=3)
-0.045f0

References

[1] He, Kaiming, et al. "Delving deep into rectifiers: Surpassing human-level performance on imagenet classification." Proceedings of the IEEE international conference on computer vision. 2015.

source
Flux.truncated_normalFunction
truncated_normal([rng], size...; mean = 0, std = 1, lo = -2, hi = 2) -> Array
+0.045f0

References

[1] He, Kaiming, et al. "Delving deep into rectifiers: Surpassing human-level performance on imagenet classification." Proceedings of the IEEE international conference on computer vision. 2015.

source
Flux.truncated_normalFunction
truncated_normal([rng], size...; mean = 0, std = 1, lo = -2, hi = 2) -> Array
 truncated_normal([rng]; kw...) -> Function

Return an Array{Float32} of the given size where each element is drawn from a truncated normal distribution. The numbers are distributed like filter(x -> lo<=x<=hi, mean .+ std .* randn(100)).

The values are generated by sampling a Uniform(0, 1) (rand()) and then applying the inverse CDF of the truncated normal distribution. This method works best when lo ≤ mean ≤ hi.

Examples

julia> using Statistics
 
 julia> Flux.truncated_normal(3, 4) |> summary
@@ -76,7 +76,7 @@
 (-2.0f0, 2.0f0)
 
 julia> round(std(Flux.truncated_normal(10^6; lo = -100, hi = 100)))
-1.0f0
source
Flux.orthogonalFunction
orthogonal([rng], size...; gain = 1) -> Array
+1.0f0
source
Flux.orthogonalFunction
orthogonal([rng], size...; gain = 1) -> Array
 orthogonal([rng]; kw...) -> Function

Return an Array{Float32} of the given size which is a (semi) orthogonal matrix, as described in [1].

Cannot construct a vector, i.e. length(size) == 1 is forbidden. For length(size) > 2, a prod(size[1:(end - 1)]) by size[end] orthogonal matrix is computed before reshaping it to the original dimensions.
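One way to picture the construction is through a QR factorisation. The sketch below is an assumption-laden illustration using only the LinearAlgebra standard library; semi_orthogonal and orthogonal_nd are hypothetical helper names, and the code is not claimed to match how Flux.orthogonal is implemented.

```julia
using LinearAlgebra, Random

# Hypothetical helper: a rows×cols matrix with orthonormal columns
# (or orthonormal rows when rows < cols), taken from a thin QR factor.
function semi_orthogonal(rng::AbstractRNG, rows::Integer, cols::Integer)
    rows < cols && return permutedims(semi_orthogonal(rng, cols, rows))
    A = randn(rng, Float32, rows, cols)
    return Matrix(qr(A).Q)   # thin Q factor, size rows×cols
end

# Hypothetical helper for more than two dimensions: build a
# prod(dims[1:end-1]) × dims[end] matrix, then reshape to dims.
function orthogonal_nd(rng::AbstractRNG, dims::Integer...)
    length(dims) >= 2 || throw(ArgumentError("at least 2 dimensions are required"))
    rows = prod(dims[1:end-1])
    return reshape(semi_orthogonal(rng, rows, dims[end]), dims)
end

rng = MersenneTwister(0)
W = semi_orthogonal(rng, 5, 7)       # orthonormal rows
W3 = orthogonal_nd(rng, 3, 3, 2, 4)  # 18×4 matrix reshaped to 3×3×2×4
M = reshape(W3, :, 4)

println(W * W' ≈ I(5))
println(M' * M ≈ I(4))
```

The reshape step mirrors the check in the examples, where the 4-dimensional array is flattened to a matrix before testing orthogonality.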

Examples

julia> W = Flux.orthogonal(5, 7);
 
 julia> summary(W)
@@ -96,7 +96,7 @@
 julia> W3 = Flux.orthogonal(3, 3, 2, 4);
 
 julia> transpose(reshape(W3, :, 4)) * reshape(W3, :, 4) ≈ I(4)
-true

References

[1] Saxe, McClelland, Ganguli. "Exact solutions to the nonlinear dynamics of learning in deep linear neural networks", ICLR 2014, https://arxiv.org/abs/1312.6120

source
Flux.sparse_initFunction
sparse_init([rng], rows, cols; sparsity, std = 0.01) -> Array
+true

References

[1] Saxe, McClelland, Ganguli. "Exact solutions to the nonlinear dynamics of learning in deep linear neural networks", ICLR 2014, https://arxiv.org/abs/1312.6120

source
Flux.sparse_initFunction
sparse_init([rng], rows, cols; sparsity, std = 0.01) -> Array
 sparse_init([rng]; kw...) -> Function

Return a Matrix{Float32} of size rows, cols where each column contains a fixed fraction of zero elements given by sparsity. Non-zero elements are normally distributed with a mean of zero and standard deviation std.

This method is described in [1].

Examples

julia> count(iszero, Flux.sparse_init(10, 10, sparsity=1/5))
 20
 
@@ -109,7 +109,7 @@
 
 julia> count(iszero, ans.weight, dims=1)
 1×3 Matrix{Int64}:
- 5  5  5

References

[1] Martens, J, "Deep learning via Hessian-free optimization" Proceedings of the 27th International Conference on International Conference on Machine Learning. 2010.

source
Flux.identity_initFunction
identity_init(size...; gain=1, shift=0) -> Array
+ 5  5  5

References

[1] Martens, J, "Deep learning via Hessian-free optimization" Proceedings of the 27th International Conference on International Conference on Machine Learning. 2010.

source
Flux.identity_initFunction
identity_init(size...; gain=1, shift=0) -> Array
 identity_init(; kw...) -> Function

Return an Array{Float32} of the given size which yields an identity mapping when used as parameters in most Flux layers. Use gain to scale the identity by a constant.

Often useful in the context of transfer learning, i.e. when one wants to add more capacity to a model but start from the same mapping.

Has the following behaviour:

  • 1D: A Vector of zeros (useful for an identity bias)
  • 2D: An identity matrix (useful for an identity matrix multiplication)
  • More than 2D: A dense block array of center tap spatial filters (useful for an identity convolution)

Some caveats:

  • Not all layers will be identity mapping when used with this init. Exceptions include recurrent layers and normalization layers.

  • Layers must have input_size == output_size for identity mapping to be possible. When this is not the case, extra dimensions of the array are padded with zeros.

  • For convolutional layers, in addition to the above, the kernel sizes must also be odd and padding must be applied so that output feature maps have the same size as input feature maps, e.g. by using SamePad.

Use keyword shift (integer or tuple) to apply circular shift to the output, equivalent to Base.circshift(identity_init(size...), shift).

For consistency with other initialisers, it accepts rng::AbstractRNG as an optional first argument. But this is ignored, since the result is not random.

Examples

julia> Flux.identity_init(3,5)
 3×5 Matrix{Float32}:
  1.0  0.0  0.0  0.0  0.0
@@ -141,7 +141,7 @@
 [:, :, 1, 1] =
  10.0  20.0  30.0
  40.0  50.0  60.0
- 70.0  80.0  90.0
source
Flux.ones32Function
ones32(size...) = ones(Float32, size...)

Return an Array{Float32} of the given size filled with 1s.

source
Flux.zeros32Function
zeros32(size...) = zeros(Float32, size...)

Return an Array{Float32} of the given size filled with 0s.

source
Flux.rand32Function
rand32([rng], size...)

Return an Array{Float32} of the given size, filled like rand. When the size is not provided, rand32(rng::AbstractRNG) returns a function.

source
Flux.randn32Function
randn32([rng], size...)

Return an Array{Float32} of the given size, filled like randn. When the size is not provided, randn32(rng::AbstractRNG) returns a function.

source
Flux.create_biasFunction
create_bias(weights, bias, size...)

Return a bias parameter for a layer, based on the value given to the constructor's keyword bias=bias.

  • bias == true creates a trainable array of the given size, of the same type as weights, initialised to zero.
  • bias == false returns false, which is understood by AD to be non-differentiable.
  • bias::AbstractArray uses the given array, provided it has the correct size. It will also correct the eltype to match that of weights.
source

These functions call:

Flux.rng_from_arrayFunction
rng_from_array(x)

Create an instance of the RNG most appropriate for x. The current defaults are:

  • x isa CuArray: CUDA.default_rng()
  • x isa AbstractArray: Random.default_rng()
source
Flux.nfanFunction
nfan(n_out, n_in=1) -> Tuple
+ 70.0  80.0  90.0
source
Flux.ones32Function
ones32(size...) = ones(Float32, size...)

Return an Array{Float32} of the given size filled with 1s.

source
Flux.zeros32Function
zeros32(size...) = zeros(Float32, size...)

Return an Array{Float32} of the given size filled with 0s.

source
Flux.rand32Function
rand32([rng], size...)

Return an Array{Float32} of the given size, filled like rand. When the size is not provided, rand32(rng::AbstractRNG) returns a function.

source
Flux.randn32Function
randn32([rng], size...)

Return an Array{Float32} of the given size, filled like randn. When the size is not provided, randn32(rng::AbstractRNG) returns a function.

source
Flux.create_biasFunction
create_bias(weights, bias, size...)

Return a bias parameter for a layer, based on the value given to the constructor's keyword bias=bias.

  • bias == true creates a trainable array of the given size, of the same type as weights, initialised to zero.
  • bias == false returns false, which is understood by AD to be non-differentiable.
  • bias::AbstractArray uses the given array, provided it has the correct size. It will also correct the eltype to match that of weights.
source
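The three cases listed for create_bias can be sketched with ordinary method dispatch. The helper my_bias below is hypothetical and written in plain Julia purely for illustration; it follows the documented behaviour but is not presented as Flux's source.

```julia
# Hypothetical stand-in for the documented behaviour of create_bias.

# bias == true: a zero array with the element type of weights;
# bias == false: return false unchanged.
function my_bias(weights::AbstractArray, bias::Bool, dims::Integer...)
    return bias ? fill!(similar(weights, dims...), 0) : false
end

# bias::AbstractArray: check the size, then match the eltype of weights.
function my_bias(weights::AbstractArray, bias::AbstractArray, dims::Integer...)
    size(bias) == dims || throw(DimensionMismatch("expected bias of size $(dims), got $(size(bias))"))
    return convert(AbstractArray{eltype(weights)}, bias)
end

W = rand(Float32, 3, 2)
b_zero = my_bias(W, true, 3)             # Float32 zeros of length 3
b_none = my_bias(W, false, 3)            # false
b_given = my_bias(W, [1.0, 2.0, 3.0], 3) # Float64 input converted to Float32
```

Returning false rather than an array of zeros lets a layer compute weights * x .+ bias without allocating or tracking a bias parameter.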

These functions call:

Flux.rng_from_arrayFunction
rng_from_array(x)

Create an instance of the RNG most appropriate for x. The current defaults are:

  • x isa CuArray: CUDA.default_rng()
  • x isa AbstractArray: Random.default_rng()
source
Flux.nfanFunction
nfan(n_out, n_in=1) -> Tuple
 nfan(dims...)
 nfan(dims::Tuple)

For a layer characterized by dimensions dims, return a tuple (fan_in, fan_out), where fan_in is the number of input neurons connected to an output one, and fan_out is the number of output neurons connected to an input one.

This function is mainly used by weight initializers, e.g., kaiming_normal.
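The fan computation can be reproduced by hand for the two common layouts. The sketch below defines a hypothetical helper fans in plain Julia, assuming the size conventions used on this page (a Dense weight is n_out × n_in, a Conv weight is kernel dims..., channels_in, channels_out); it is an illustration, not Flux's definition of nfan.

```julia
# Hypothetical helper mirroring the documented meaning of (fan_in, fan_out).
function fans(dims::Tuple{Vararg{Integer}})
    length(dims) >= 2 || throw(ArgumentError("expected at least 2 dimensions"))
    # Product of the kernel dimensions; equals 1 for a plain matrix.
    receptive = prod(dims[1:end-2])
    if length(dims) == 2
        n_out, n_in = dims
        return (n_in, n_out)
    end
    return (receptive * dims[end-1], receptive * dims[end])
end

dense_fans = fans((20, 10))       # weight of Dense(10, 20): (10, 20)
conv_fans = fans((3, 3, 2, 10))   # weight of Conv((3, 3), 2=>10): (18, 90)
```

For the convolution, each output value sees 3 * 3 * 2 = 18 inputs and each input value feeds 3 * 3 * 10 = 90 outputs, which agrees with the (18, 90) result in the examples.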

Examples

julia> layer = Dense(10, 20);
 
@@ -151,7 +151,7 @@
 julia> layer = Conv((3, 3), 2=>10);
 
 julia> Flux.nfan(size(layer.weight))
-(18, 90)
source

Changing the type of all parameters

The default eltype for models is Float32 since models are often trained/run on GPUs. The eltype of model m can be changed to Float64 by f64(m):

Flux.f64Function
f64(m)

Converts the eltype of model's floating point parameters to Float64. Recurses into structs marked with @layer.

See also f32 and f16.

source
Flux.f32Function
f32(m)

Converts the eltype of model's floating point parameters to Float32 (which is Flux's default). Recurses into structs marked with @layer.

See also f64 and f16.

source
Flux.f16Function
f16(m)

Converts the eltype of model's floating point parameters to Float16. Recurses into structs marked with @layer.

Support for Float16 is limited on many CPUs. Julia may convert to Float32 for each operation, which is slow.

See also f32 and f64.

Example

julia> m = Chain(Dense(784, 2048, relu), Dense(2048, 10))  # all Float32
+(18, 90)
source

Changing the type of all parameters

The default eltype for models is Float32 since models are often trained/run on GPUs. The eltype of model m can be changed to Float64 by f64(m):

Flux.f64Function
f64(m)

Converts the eltype of model's floating point parameters to Float64. Recurses into structs marked with @layer.

See also f32 and f16.

source
Flux.f32Function
f32(m)

Converts the eltype of model's floating point parameters to Float32 (which is Flux's default). Recurses into structs marked with @layer.

See also f64 and f16.

source
Flux.f16Function
f16(m)

Converts the eltype of model's floating point parameters to Float16. Recurses into structs marked with @layer.

Support for Float16 is limited on many CPUs. Julia may convert to Float32 for each operation, which is slow.

See also f32 and f64.

Example

julia> m = Chain(Dense(784, 2048, relu), Dense(2048, 10))  # all Float32
 Chain(
   Dense(784 => 2048, relu),             # 1_607_680 parameters
   Dense(2048 => 10),                    # 20_490 parameters
@@ -161,4 +161,4 @@
 Chain(
   Dense(784 => 2048, relu),             # 1_607_680 parameters
   Dense(2048 => 10),                    # 20_490 parameters
-)                   # Total: 4 arrays, 1_628_170 parameters, 3.106 MiB.
source
+) # Total: 4 arrays, 1_628_170 parameters, 3.106 MiB.source diff --git a/previews/PR2464/search_index.js b/previews/PR2464/search_index.js index 2ac258f864..1d4b5df72c 100644 --- a/previews/PR2464/search_index.js +++ b/previews/PR2464/search_index.js @@ -1,3 +1,3 @@ var documenterSearchIndex = {"docs": -[{"location":"guide/models/quickstart/#man-quickstart","page":"Quick Start","title":"A Neural Network in One Minute","text":"","category":"section"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"If you have used neural networks before, then this simple example might be helpful for seeing how the major parts of Flux work together. Try pasting the code into the REPL prompt.","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"If you haven't, then you might prefer the Fitting a Straight Line page.","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"# This will prompt if neccessary to install everything, including CUDA:\nusing Flux, CUDA, Statistics, ProgressMeter\n\n# Generate some data for the XOR problem: vectors of length 2, as columns of a matrix:\nnoisy = rand(Float32, 2, 1000) # 2×1000 Matrix{Float32}\ntruth = [xor(col[1]>0.5, col[2]>0.5) for col in eachcol(noisy)] # 1000-element Vector{Bool}\n\n# Define our model, a multi-layer perceptron with one hidden layer of size 3:\nmodel = Chain(\n Dense(2 => 3, tanh), # activation function inside layer\n BatchNorm(3),\n Dense(3 => 2)) |> gpu # move model to GPU, if available\n\n# The model encapsulates parameters, randomly initialised. 
Its initial output is:\nout1 = model(noisy |> gpu) |> cpu # 2×1000 Matrix{Float32}\nprobs1 = softmax(out1) # normalise to get probabilities\n\n# To train the model, we use batches of 64 samples, and one-hot encoding:\ntarget = Flux.onehotbatch(truth, [true, false]) # 2×1000 OneHotMatrix\nloader = Flux.DataLoader((noisy, target) |> gpu, batchsize=64, shuffle=true);\n# 16-element DataLoader with first element: (2×64 Matrix{Float32}, 2×64 OneHotMatrix)\n\noptim = Flux.setup(Flux.Adam(0.01), model) # will store optimiser momentum, etc.\n\n# Training loop, using the whole data set 1000 times:\nlosses = []\n@showprogress for epoch in 1:1_000\n for (x, y) in loader\n loss, grads = Flux.withgradient(model) do m\n # Evaluate model and loss inside gradient context:\n y_hat = m(x)\n Flux.logitcrossentropy(y_hat, y)\n end\n Flux.update!(optim, model, grads[1])\n push!(losses, loss) # logging, outside gradient context\n end\nend\n\noptim # parameters, momenta and output have all changed\nout2 = model(noisy |> gpu) |> cpu # first row is prob. 
of true, second row p(false)\nprobs2 = softmax(out2) # normalise to get probabilities\nmean((probs2[1,:] .> 0.5) .== truth) # accuracy 94% so far!","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"(Image: )","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"using Plots # to draw the above figure\n\np_true = scatter(noisy[1,:], noisy[2,:], zcolor=truth, title=\"True classification\", legend=false)\np_raw = scatter(noisy[1,:], noisy[2,:], zcolor=probs1[1,:], title=\"Untrained network\", label=\"\", clims=(0,1))\np_done = scatter(noisy[1,:], noisy[2,:], zcolor=probs2[1,:], title=\"Trained network\", legend=false)\n\nplot(p_true, p_raw, p_done, layout=(1,3), size=(1000,330))","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"Here's the loss during training:","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"plot(losses; xaxis=(:log10, \"iteration\"),\n yaxis=\"loss\", label=\"per batch\")\nn = length(loader)\nplot!(n:n:length(losses), mean.(Iterators.partition(losses, n)),\n label=\"epoch mean\", dpi=200)","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"This XOR (\"exclusive or\") problem is a variant of the famous one which drove Minsky and Papert to invent deep neural networks in 1969. For small values of \"deep\" – this has one hidden layer, while earlier perceptrons had none. (What they call a hidden layer, Flux calls the output of the first layer, model[1](noisy).)","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"Since then things have developed a little. 
","category":"page"},{"location":"guide/models/quickstart/#Features-to-Note","page":"Quick Start","title":"Features to Note","text":"","category":"section"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"Some things to notice in this example are:","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"The batch dimension of data is always the last one. Thus a 2×1000 Matrix is a thousand observations, each a column of length 2. Flux defaults to Float32, but most of Julia to Float64.\nThe model can be called like a function, y = model(x). Each layer like Dense is an ordinary struct, which encapsulates some arrays of parameters (and possibly other state, as for BatchNorm).\nBut the model does not contain the loss function, nor the optimisation rule. The momenta needed by Adam are stored in the object returned by setup. And Flux.logitcrossentropy is an ordinary function that combines the softmax and crossentropy functions.\nThe do block creates an anonymous function, as the first argument of gradient. Anything executed within this is differentiated.","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"Instead of calling gradient and update! separately, there is a convenience function train!. 
If we didn't want anything extra (like logging the loss), we could replace the training loop with the following:","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"for epoch in 1:1_000\n Flux.train!(model, loader, optim) do m, x, y\n y_hat = m(x)\n Flux.logitcrossentropy(y_hat, y)\n end\nend","category":"page"},{"location":"reference/training/reference/#Training-API-Reference","page":"Training API","title":"Training API Reference","text":"","category":"section"},{"location":"reference/training/reference/","page":"Training API","title":"Training API","text":"The new version of Flux's training code was written as an independent package, Optimisers.jl. Only the function train! belongs to Flux itself.","category":"page"},{"location":"reference/training/reference/","page":"Training API","title":"Training API","text":"The Optimisers package is designed to allow for immutable objects. But at present all Flux models contain parameter arrays (such as Arrays and CuArrays) which can be updated in-place. Because of this:","category":"page"},{"location":"reference/training/reference/","page":"Training API","title":"Training API","text":"The objects returned by Optimisers.update! can be ignored.\nFlux defines its own version of setup which checks this assumption. (Using instead Optimisers.setup will also work, they return the same thing.)","category":"page"},{"location":"reference/training/reference/","page":"Training API","title":"Training API","text":"The available optimization rules are listed the optimisation rules page here. 
See the Optimisers documentation for details on how the rules work.","category":"page"},{"location":"reference/training/reference/","page":"Training API","title":"Training API","text":"Flux.Train.setup\nFlux.Train.train!(loss, model, data, state)\nOptimisers.update\nOptimisers.update!\nOptimisers.setup","category":"page"},{"location":"reference/training/reference/#Flux.Train.setup","page":"Training API","title":"Flux.Train.setup","text":"opt_state = setup(rule, model)\n\nThis is a version of Optimisers.setup, and is the first step before using train!. It differs from Optimisers.setup in that it:\n\nhas one extra check for mutability (since Flux expects to mutate the model in-place, while Optimisers.jl is designed to return an updated model)\nhas methods which accept Flux's old optimisers, and convert them. (The old Flux.Optimise.Adam and new Optimisers.Adam are distinct types.)\n\nExample\n\njulia> model = Dense(2 => 1, leakyrelu; init=ones);\n\njulia> opt_state = Flux.setup(Momentum(0.1), model) # this encodes the optimiser and its state\n(weight = Leaf(Momentum(0.1, 0.9), [0.0 0.0]), bias = Leaf(Momentum(0.1, 0.9), [0.0]), σ = ())\n\njulia> x1, y1 = [0.2, -0.3], [0.4]; # use the same data for two steps:\n\njulia> Flux.train!(model, [(x1, y1), (x1, y1)], opt_state) do m, x, y\n sum(abs.(m(x) .- y)) * 100\n end\n\njulia> model.bias # was zero, mutated by Flux.train!\n1-element Vector{Float64}:\n 10.19\n\njulia> opt_state # mutated by Flux.train!\n(weight = Leaf(Momentum(0.1, 0.9), [-2.018 3.027]), bias = Leaf(Momentum(0.1, 0.9), [-10.09]), σ = ())\n\n\n\n\n\n","category":"function"},{"location":"reference/training/reference/#Flux.Optimise.train!-NTuple{4, Any}","page":"Training API","title":"Flux.Optimise.train!","text":"train!(loss, model, data, opt_state)\n\nUses a loss function and training data to improve the model's parameters according to a particular optimisation rule encoded in opt_state. 
Iterates through data once, evaluating for each d in data either loss(model, d...) if d isa Tuple, or else loss(model, d) for other d.\n\nIf model is an Enzyme.Duplicated and Enzyme.jl is loaded, gradients will be computed with Enzyme, otherwise they will be computed with Zygote.\n\nFor example, with these definitions...\n\ndata = [(x1, y1), (x2, y2), (x3, y3)]\n\nloss3(m, x, y) = norm(m(x) .- y) # the model is the first argument\n\nopt_state = Flux.setup(Adam(), model) # explicit setup of optimiser momenta\n\n...calling Flux.train!(loss3, model, data, opt_state) runs a loop much like this:\n\nfor d in data\n ∂L∂m = gradient(loss3, model, d...)[1]\n update!(opt_state, model, ∂L∂m)\nend\n\nYou can also write this loop yourself, if you need more flexibility. For this reason train! is not highly extensible. It adds only a few features to the loop above:\n\nStop with a DomainError if the loss is infinite or NaN at any point.\nShow a progress bar using @withprogress.\n\ncompat: New\nThis method was added in Flux 0.13.9. It has significant changes from the one used by Flux ≤ 0.13:It now takes the model itself, not the result of Flux.params. (This is to move away from Zygote's \"implicit\" parameter handling, with Grads.)\nInstead of loss being a function which accepts only the data, now it must also accept the model itself, as the first argument.\nopt_state should be the result of Flux.setup. Using an optimiser such as Adam() without this step should give you a warning.\nCallback functions are not supported. (But any code can be included in the above for loop.)\n\n\n\n\n\n","category":"method"},{"location":"reference/training/reference/#Optimisers.update","page":"Training API","title":"Optimisers.update","text":"Optimisers.update(tree, model, gradient) -> (tree, model)\n\nUses the optimiser and the gradient to change the trainable parameters in the model. Returns the improved model, and the optimiser states needed for the next update. 
The initial tree of states comes from setup.\n\nSee also update!, which will be faster for models of ordinary Arrays or CuArrays.\n\nExample\n\njulia> m = (x = Float32[1,2,3], y = tanh);\n\njulia> t = Optimisers.setup(Descent(0.1), m)\n(x = Leaf(Descent(0.1), nothing), y = ())\n\njulia> g = (x = [1,1,1], y = nothing); # fake gradient\n\njulia> Optimisers.update(t, m, g)\n((x = Leaf(Descent(0.1), nothing), y = ()), (x = Float32[0.9, 1.9, 2.9], y = tanh))\n\n\n\n\n\n","category":"function"},{"location":"reference/training/reference/#Optimisers.update!","page":"Training API","title":"Optimisers.update!","text":"Optimisers.update!(tree, model, gradient) -> (tree, model)\n\nUses the optimiser and the gradient to change the trainable parameters in the model. Returns the improved model, and the optimiser states needed for the next update. The initial tree of states comes from setup.\n\nThis is used in exactly the same manner as update, but because it may mutate arrays within the old model (and the old state), it will be faster for models of ordinary Arrays or CuArrays. However, you should not rely on the old model being fully updated but rather use the returned model. (The original state tree is always mutated, as each Leaf is mutable.)\n\nExample\n\njulia> using StaticArrays, Zygote, Optimisers\n\njulia> m = (x = [1f0, 2f0], y = SA[4f0, 5f0]); # partly mutable model\n\njulia> t = Optimisers.setup(Momentum(1/30, 0.9), m) # tree of states\n(x = Leaf(Momentum(0.0333333, 0.9), Float32[0.0, 0.0]), y = Leaf(Momentum(0.0333333, 0.9), Float32[0.0, 0.0]))\n\njulia> g = gradient(m -> sum(abs2.(m.x .+ m.y)), m)[1] # structural gradient\n(x = Float32[10.0, 14.0], y = Float32[10.0, 14.0])\n\njulia> t2, m2 = Optimisers.update!(t, m, g);\n\njulia> m2 # after update or update!, this is the new model\n(x = Float32[0.6666666, 1.5333333], y = Float32[3.6666667, 4.5333333])\n\njulia> m2.x === m.x # update! 
has re-used this array, for efficiency\ntrue\n\njulia> m # original should be discarded, may be mutated but no guarantee\n(x = Float32[0.6666666, 1.5333333], y = Float32[4.0, 5.0])\n\njulia> t == t2 # original state tree is guaranteed to be mutated\ntrue\n\n\n\n\n\n","category":"function"},{"location":"reference/training/reference/#Optimisers.setup","page":"Training API","title":"Optimisers.setup","text":"Optimisers.setup(rule, model) -> state_tree\n\nInitialises the given optimiser for every trainable parameter within the model. Returns a tree of the relevant states, which must be passed to update or update!.\n\nExample\n\njulia> m = (x = rand(3), y = (true, false), z = tanh);\n\njulia> Optimisers.setup(Momentum(), m) # same field names as m\n(x = Leaf(Momentum(0.01, 0.9), [0.0, 0.0, 0.0]), y = ((), ()), z = ())\n\nThe recursion into structures uses Functors.jl, and any new structs containing parameters need to be marked with Functors.@functor before use. See the Flux docs for more about this.\n\njulia> struct Layer; mat; fun; end\n\njulia> model = (lay = Layer([1 2; 3 4f0], sin), vec = [5, 6f0]);\n\njulia> Optimisers.setup(Momentum(), model) # new struct is by default ignored\n(lay = (), vec = Leaf(Momentum(0.01, 0.9), Float32[0.0, 0.0]))\n\njulia> destructure(model)\n(Float32[5.0, 6.0], Restructure(NamedTuple, ..., 2))\n\njulia> using Functors; @functor Layer # annotate this type as containing parameters\n\njulia> Optimisers.setup(Momentum(), model)\n(lay = (mat = Leaf(Momentum(0.01, 0.9), Float32[0.0 0.0; 0.0 0.0]), fun = ()), vec = Leaf(Momentum(0.01, 0.9), Float32[0.0, 0.0]))\n\njulia> destructure(model)\n(Float32[1.0, 3.0, 2.0, 4.0, 5.0, 6.0], Restructure(NamedTuple, ..., 6))\n\n\n\n\n\n","category":"function"},{"location":"reference/training/reference/","page":"Training API","title":"Training API","text":"train! uses @progress which should show a progress bar in VSCode automatically. 
To see one in a terminal, you will need to install TerminalLoggers.jl and follow its setup instructions.","category":"page"},{"location":"reference/training/reference/#Optimisation-Modifiers","page":"Training API","title":"Optimisation Modifiers","text":"","category":"section"},{"location":"reference/training/reference/","page":"Training API","title":"Training API","text":"The state returned by setup can be modified to temporarily prevent training of some parts of the model, or to change the learning rate or other hyperparameter. The functions for doing so may be accessed as Flux.freeze!, Flux.thaw!, and Flux.adjust!. All mutate the state (or part of it) and return nothing.","category":"page"},{"location":"reference/training/reference/","page":"Training API","title":"Training API","text":"Optimisers.adjust!\nOptimisers.freeze!\nOptimisers.thaw!","category":"page"},{"location":"reference/training/reference/#Optimisers.adjust!","page":"Training API","title":"Optimisers.adjust!","text":"Optimisers.adjust!(tree, η)\n\nAlters the state tree = setup(rule, model) to change the parameters of the optimisation rule, without destroying its stored state. Typically used mid-way through training.\n\nCan be applied to part of a model, by acting only on the corresponding part of the state tree.\n\nTo change just the learning rate, provide a number η::Real.\n\nExample\n\njulia> m = (vec = rand(Float32, 2), fun = sin);\n\njulia> st = Optimisers.setup(Nesterov(), m) # stored momentum is initialised to zero\n(vec = Leaf(Nesterov(0.001, 0.9), Float32[0.0, 0.0]), fun = ())\n\njulia> st, m = Optimisers.update(st, m, (vec = [16, 88], fun = nothing)); # with fake gradient\n\njulia> st\n(vec = Leaf(Nesterov(0.001, 0.9), Float32[-0.016, -0.088]), fun = ())\n\njulia> Optimisers.adjust!(st, 0.123) # change learning rate, stored momentum untouched\n\njulia> st\n(vec = Leaf(Nesterov(0.123, 0.9), Float32[-0.016, -0.088]), fun = ())\n\nTo change other parameters, adjust! 
also accepts keyword arguments matching the field names of the optimisation rule's type.\n\njulia> fieldnames(Adam)\n(:eta, :beta, :epsilon)\n\njulia> st2 = Optimisers.setup(OptimiserChain(ClipGrad(), Adam()), m)\n(vec = Leaf(OptimiserChain(ClipGrad(10.0), Adam(0.001, (0.9, 0.999), 1.0e-8)), (nothing, (Float32[0.0, 0.0], Float32[0.0, 0.0], (0.9, 0.999)))), fun = ())\n\njulia> Optimisers.adjust(st2; beta = (0.777, 0.909), delta = 11.1) # delta acts on ClipGrad\n(vec = Leaf(OptimiserChain(ClipGrad(11.1), Adam(0.001, (0.777, 0.909), 1.0e-8)), (nothing, (Float32[0.0, 0.0], Float32[0.0, 0.0], (0.9, 0.999)))), fun = ())\n\njulia> Optimisers.adjust(st; beta = \"no such field\") # silently ignored!\n(vec = Leaf(Nesterov(0.123, 0.9), Float32[-0.016, -0.088]), fun = ())\n\n\n\n\n\n","category":"function"},{"location":"reference/training/reference/#Optimisers.freeze!","page":"Training API","title":"Optimisers.freeze!","text":"Optimisers.freeze!(tree)\n\nTemporarily alters the state tree = setup(rule, model) so that parameters will not be updated. Un-done by thaw!.\n\nCan be applied to the state corresponding to only part of a model, for instance with model::Chain, to freeze model.layers[1] you should call freeze!(tree.layers[1]).\n\nExample\n\njulia> m = (x = ([1.0], 2.0), y = [3.0]);\n\njulia> s = Optimisers.setup(Momentum(), m);\n\njulia> Optimisers.freeze!(s.x)\n\njulia> Optimisers.update!(s, m, (x = ([pi], 10pi), y = [100pi])); # with fake gradient\n\njulia> m\n(x = ([1.0], 2.0), y = [-0.14159265358979312])\n\njulia> s\n(x = (Leaf(Momentum(0.01, 0.9), [0.0], frozen = true), ()), y = Leaf(Momentum(0.01, 0.9), [3.14159]))\n\njulia> Optimisers.thaw!(s)\n\njulia> s.x\n(Leaf(Momentum(0.01, 0.9), [0.0]), ())\n\n\n\n\n\n","category":"function"},{"location":"reference/training/reference/#Optimisers.thaw!","page":"Training API","title":"Optimisers.thaw!","text":"Optimisers.thaw!(tree)\n\nThe reverse of freeze!. 
Applies to all parameters, mutating every Leaf(rule, state, frozen = true) to Leaf(rule, state, frozen = false).\n\n\n\n\n\n","category":"function"},{"location":"tutorials/logistic_regression/#Logistic-Regression","page":"Logistic Regression","title":"Logistic Regression","text":"","category":"section"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"The following page contains a step-by-step walkthrough of the logistic regression algorithm in Julia using Flux. We will then create a simple logistic regression model without any usage of Flux and compare the different working parts with Flux's implementation. ","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Let's start by importing the required Julia packages.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> using Flux, Statistics, MLDatasets, DataFrames, OneHotArrays","category":"page"},{"location":"tutorials/logistic_regression/#Dataset","page":"Logistic Regression","title":"Dataset","text":"","category":"section"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Let's start by importing a dataset from MLDatasets.jl. We will use the Iris dataset that contains the data of three different Iris species. The data consists of 150 data points (xs), each having four features. Each of these x is mapped to y, the name of a particular Iris species. 
The following code will download the Iris dataset when run for the first time.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> Iris()\ndataset Iris:\n metadata => Dict{String, Any} with 4 entries\n features => 150×4 DataFrame\n targets => 150×1 DataFrame\n dataframe => 150×5 DataFrame\n\njulia> x, y = Iris(as_df=false)[:];","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Let's have a look at our dataset -","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> y\n1×150 Matrix{InlineStrings.String15}:\n \"Iris-setosa\" \"Iris-setosa\" … \"Iris-virginica\" \"Iris-virginica\"\n\njulia> x |> summary\n\"4×150 Matrix{Float64}\"","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Each y value here corresponds to a type of iris plant, with a total of 150 data points. The x values depict the sepal length, sepal width, petal length, and petal width (all in cm) of 150 iris plants (hence the matrix size 4×150). Different types of iris plants have different lengths and widths of sepals and petals associated with them, and there is a definitive pattern for this in nature. We can leverage this to train a simple classifier that outputs the type of iris plant using the length and width of sepals and petals as inputs.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Our next step would be to convert this data into a form that can be fed to a machine learning model. The x values are arranged in a matrix and should ideally be converted to Float32 type (see Performance tips), but the labels must be one hot encoded. 
Here is a great discourse thread on different techniques that can be used to one hot encode data with or without using any external Julia package.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> x = Float32.(x);\n\njulia> y = vec(y);\n\njulia> custom_y_onehot = unique(y) .== permutedims(y)\n3×150 BitMatrix:\n 1 1 1 1 1 1 1 1 1 1 1 1 1 … 0 0 0 0 0 0 0 0 0 0 0 0\n 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"This same operation can also be performed using OneHotArrays' onehotbatch function. We will use both of these outputs parallelly to show how intuitive FluxML is!","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> const classes = [\"Iris-setosa\", \"Iris-versicolor\", \"Iris-virginica\"];\n\njulia> flux_y_onehot = onehotbatch(y, classes)\n3×150 OneHotMatrix(::Vector{UInt32}) with eltype Bool:\n 1 1 1 1 1 1 1 1 1 1 1 1 1 … ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅\n ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅\n ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1 1 1 1 1 1 1 1 1 1 1 1","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Our data is ready. 
The next step would be to build a classifier for the same.","category":"page"},{"location":"tutorials/logistic_regression/#Building-a-model","page":"Logistic Regression","title":"Building a model","text":"","category":"section"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"A logistic regression model is defined mathematically as -","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"model(x) = σ(Wx + b)","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"where W is the weight matrix, b is the bias vector, and σ is any activation function. For our case, let's use the softmax activation function as we will be performing a multiclass classification task.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> m(W, b, x) = W*x .+ b\nm (generic function with 1 method)","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Note that this model lacks an activation function, but we will come back to that.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"We can now move ahead to initialize the parameters of our model. 
Given that our model has four inputs (4 features in every data point), and three outputs (3 different classes), the parameters can be initialized in the following way -","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> W = rand(Float32, 3, 4);\n\njulia> b = [0.0f0, 0.0f0, 0.0f0];","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Now our model can take in the complete dataset and predict the class of each x in one go. But, we need to ensure that our model outputs the probabilities of an input belonging to the respective classes. As our model has three outputs, each would denote the probability of the input belonging to a particular class.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"We will use an activation function to map our outputs to a probability value. It would make sense to use a softmax activation function here, which is defined mathematically as -","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"σ(\\vec{z})_i = \\frac{e^{z_i}}{\\sum_{j=1}^{k} e^{z_j}}","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"The softmax function scales down the outputs to probability values such that the sum of all the final outputs equals 1. Let's implement this in Julia.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> custom_softmax(x) = exp.(x) ./ sum(exp.(x), dims=1)\ncustom_softmax (generic function with 1 method)","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"The implementation looks straightforward enough! 
Note that we specify dims=1 in the sum function to calculate the sum of probabilities in each column. Remember, we will have a 3×150 matrix (predicted ys) as the output of our model, where each column would be an output of a corresponding input.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Let's combine this softmax function with our model to construct the complete custom_model.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> custom_model(W, b, x) = m(W, b, x) |> custom_softmax\ncustom_model (generic function with 1 method)","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Let's check if our model works.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> custom_model(W, b, x) |> size\n(3, 150)","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"It works! Let's check if the softmax function is working.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> all(0 .<= custom_model(W, b, x) .<= 1)\ntrue\n\njulia> sum(custom_model(W, b, x), dims=1)\n1×150 Matrix{Float32}:\n 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 … 1.0 1.0 1.0 1.0 1.0 1.0 1.0","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Every output value is between 0 and 1, and every column adds to 1!","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Let's convert our custom_model to a Flux model. 
Flux provides users with a very elegant API that almost feels like writing your own code!","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Note that all the flux_* variables in this tutorial are general, that is, they can be used as they are with another similar-looking dataset, but the custom_* variables will remain specific to this tutorial.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> flux_model = Chain(Dense(4 => 3), softmax)\nChain(\n Dense(4 => 3), # 15 parameters\n NNlib.softmax,\n)","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"A Dense(4 => 3) layer denotes a layer with four inputs (four features in every data point) and three outputs (three classes or labels). This layer is the same as the mathematical model defined by us above. Under the hood, Flux too calculates the output using the same expression, but we don't have to initialize the parameters ourselves this time; Flux does it for us.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"The softmax function provided by NNlib.jl is re-exported by Flux, which has been used here. 
Lastly, Flux provides users with a Chain struct which makes stacking layers seamless.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"A model's weights and biases can be accessed as follows -","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> flux_model[1].weight, flux_model[1].bias\n(Float32[0.78588694 -0.45968163 -0.77409476 0.2358028; -0.9049773 -0.58643705 0.466441 -0.79523873; 0.82426906 0.4143493 0.7630932 0.020588955], Float32[0.0, 0.0, 0.0])","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"We can now pass the complete data in one go, with each data point having four features (four inputs)!","category":"page"},{"location":"tutorials/logistic_regression/#Loss-and-accuracy","page":"Logistic Regression","title":"Loss and accuracy","text":"","category":"section"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Our next step should be to define some quantitative values for our model, which we will maximize or minimize during the complete training procedure. 
These values will be the loss function and the accuracy metric.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Let's start by defining a loss function, a logitcrossentropy function.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> custom_logitcrossentropy(ŷ, y) = mean(.-sum(y .* logsoftmax(ŷ; dims = 1); dims = 1));","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Now we can wrap the custom_logitcrossentropy inside a function that takes in the model parameters, xs, and ys, and returns the loss value.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> function custom_loss(W, b, x, y)\n ŷ = custom_model(W, b, x)\n custom_logitcrossentropy(ŷ, y)\n end;\n\njulia> custom_loss(W, b, x, custom_y_onehot)\n1.1714406827505623","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"The loss function works!","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Flux provides us with many minimal yet elegant loss functions. In fact, the custom_logitcrossentropy defined above has been taken directly from Flux. 
The functions present in Flux include sanity checks, ensure efficient performance, and behave well with the overall FluxML ecosystem.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> function flux_loss(flux_model, x, y)\n ŷ = flux_model(x)\n Flux.logitcrossentropy(ŷ, y)\n end;\n\njulia> flux_loss(flux_model, x, flux_y_onehot)\n1.2156688659673647","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Next, let's define an accuracy function, which we will try to maximize during our training procedure. Before jumping to accuracy, let's define a onecold function. The onecold function would convert our output, which, remember, consists of probability values, to the actual class names.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"We can divide this task into two parts -","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Identify the index of the maximum element of each column in the output matrix\nConvert this index to a class name","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"The maximum index should be calculated along the columns (remember, each column is the output of a single x data point). 
We can use Julia's argmax function to achieve this.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> argmax(custom_y_onehot, dims=1) # calculate the cartesian index of max element column-wise\n1×150 Matrix{CartesianIndex{2}}:\n CartesianIndex(1, 1) CartesianIndex(1, 2) … CartesianIndex(3, 150)\n\njulia> max_idx = [x[1] for x in argmax(custom_y_onehot; dims=1)]\n1×150 Matrix{Int64}:\n 1 1 1 1 1 1 1 1 1 1 1 1 1 … 3 3 3 3 3 3 3 3 3 3 3 3","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Now we can write a function that calculates the indices of the maximum element in each column, and maps them to a class name.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> function custom_onecold(custom_y_onehot)\n max_idx = [x[1] for x in argmax(custom_y_onehot; dims=1)]\n vec(classes[max_idx])\n end;\n\njulia> custom_onecold(custom_y_onehot)\n150-element Vector{String}:\n \"Iris-setosa\"\n \"Iris-setosa\"\n \"Iris-setosa\"\n \"Iris-setosa\"\n \"Iris-setosa\"\n \"Iris-setosa\"\n \"Iris-setosa\"\n \"Iris-setosa\"\n \"Iris-setosa\"\n \"Iris-setosa\"\n ⋮\n \"Iris-virginica\"\n \"Iris-virginica\"\n \"Iris-virginica\"\n \"Iris-virginica\"\n \"Iris-virginica\"\n \"Iris-virginica\"\n \"Iris-virginica\"\n \"Iris-virginica\"\n \"Iris-virginica\"","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"It works!","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Flux provides users with the onecold function so that we don't have to write it on our own. 
Let's see how our custom_onecold function compares to Flux.onecold.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> istrue = Flux.onecold(flux_y_onehot, classes) .== custom_onecold(custom_y_onehot);\n\njulia> all(istrue)\ntrue","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Both the functions act identically!","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"We now move to the accuracy metric and run it with the untrained custom_model.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> custom_accuracy(W, b, x, y) = mean(custom_onecold(custom_model(W, b, x)) .== y);\n\njulia> custom_accuracy(W, b, x, y)\n0.3333333333333333","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"We could also have used Flux's built-in functionality to define this accuracy function.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> flux_accuracy(x, y) = mean(Flux.onecold(flux_model(x), classes) .== y);\n\njulia> flux_accuracy(x, y)\n0.24","category":"page"},{"location":"tutorials/logistic_regression/#Training-the-model","page":"Logistic Regression","title":"Training the model","text":"","category":"section"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Let's train our model using the classic Gradient Descent algorithm. 
According to the gradient descent algorithm, the weights and biases should be iteratively updated using the following mathematical equations -","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"\\begin{aligned}\nW &= W - \\eta * \\frac{dL}{dW} \\\\\nb &= b - \\eta * \\frac{dL}{db}\n\\end{aligned}","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Here, W is the weight matrix, b is the bias vector, \\eta is the learning rate, \\frac{dL}{dW} is the derivative of the loss function with respect to the weight, and \\frac{dL}{db} is the derivative of the loss function with respect to the bias.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"The derivatives are calculated using an Automatic Differentiation tool, and Flux uses Zygote.jl for the same. Since Zygote.jl is an independent Julia package, it can be used outside of Flux as well! Refer to the documentation of Zygote.jl for more information on the same.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Our first step would be to obtain the gradient of the loss function with respect to the weights and the biases. Flux re-exports Zygote's gradient function; hence, we don't need to import Zygote explicitly to use the functionality. gradient takes in a function and its arguments, and returns a tuple containing ∂f/∂x for each argument x. Let's pass in custom_loss and the arguments required by custom_loss to gradient. 
We will require the derivatives of the loss function (custom_loss) with respect to the weights (∂f/∂w) and the bias (∂f/∂b) to carry out gradient descent, but we can ignore the partial derivatives of the loss function (custom_loss) with respect to x (∂f/∂x) and one hot encoded y (∂f/∂y).","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> dLdW, dLdb, _, _ = gradient(custom_loss, W, b, x, custom_y_onehot);","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"We can now update the parameters, following the gradient descent algorithm -","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> W .= W .- 0.1 .* dLdW;\n\njulia> b .= b .- 0.1 .* dLdb;","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"The parameters have been updated! We can now check the value of our custom loss function -","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> custom_loss(W, b, x, custom_y_onehot)\n1.164742997664842","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"The loss went down! 
Let's plug our super training logic inside a function.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> function train_custom_model()\n dLdW, dLdb, _, _ = gradient(custom_loss, W, b, x, custom_y_onehot)\n W .= W .- 0.1 .* dLdW\n b .= b .- 0.1 .* dLdb\n end;","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"We can plug the training function inside a loop and train the model for more epochs. The loop can be tailored to suit the user's needs, and the conditions can be specified in plain Julia. Here we will train the model for a maximum of 500 epochs, but to ensure that the model does not overfit, we will break as soon as our accuracy value crosses or becomes equal to 0.98.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> for i = 1:500\n train_custom_model();\n custom_accuracy(W, b, x, y) >= 0.98 && break\n end\n \njulia> @show custom_accuracy(W, b, x, y);\ncustom_accuracy(W, b, x, y) = 0.98","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Everything works! Our model achieved an accuracy of 0.98! Let's have a look at the loss.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> custom_loss(W, b, x, custom_y_onehot)\n0.6520349798243569","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"As expected, the loss went down too! 
Now, let's repeat the same steps with our flux_model.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"We can write a similar-looking training loop for our flux_model and train it similarly.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> flux_loss(flux_model, x, flux_y_onehot)\n1.215731131385928\n\njulia> function train_flux_model()\n dLdm, _, _ = gradient(flux_loss, flux_model, x, flux_y_onehot)\n @. flux_model[1].weight = flux_model[1].weight - 0.1 * dLdm[:layers][1][:weight]\n @. flux_model[1].bias = flux_model[1].bias - 0.1 * dLdm[:layers][1][:bias]\n end;\n\njulia> for i = 1:500\n train_flux_model();\n flux_accuracy(x, y) >= 0.98 && break\n end","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Looking at the accuracy and loss value -","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> @show flux_accuracy(x, y);\nflux_accuracy(x, y) = 0.98\n\njulia> flux_loss(flux_model, x, flux_y_onehot)\n0.6952386604624324","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"We see a very similar final loss and accuracy.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Summarising this tutorial, we saw how we can run a logistic regression algorithm in Julia with and without using Flux. We started by importing the classic Iris dataset, and one hot encoded the labels. 
Next, we defined our model, the loss function, and the accuracy, all by ourselves.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Finally, we trained the model by manually writing down the Gradient Descent algorithm and optimising the loss. Interestingly, we implemented most of the functions on our own, and then compared them side by side with the functionality provided by Flux!","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"info: Info\nOriginally published on 1st April 2023, by Saransh Chopra.","category":"page"},{"location":"tutorials/model_zoo/#Model-Zoo","page":"Model Zoo","title":"Model Zoo","text":"","category":"section"},{"location":"tutorials/model_zoo/","page":"Model Zoo","title":"Model Zoo","text":"The model zoo is a collection of examples that demonstrate how to build and train models using Flux. The examples are organised by domain and include vision, text, and audio. 
Each example includes a description of the model, the data used, and the training process.","category":"page"},{"location":"tutorials/model_zoo/","page":"Model Zoo","title":"Model Zoo","text":"Some of the examples are pedagogical, see for instance","category":"page"},{"location":"tutorials/model_zoo/","page":"Model Zoo","title":"Model Zoo","text":"Multilayer Perceptron\nSimple Convolutional Neural Network","category":"page"},{"location":"tutorials/model_zoo/","page":"Model Zoo","title":"Model Zoo","text":"Others are more advanced, see for instance","category":"page"},{"location":"tutorials/model_zoo/","page":"Model Zoo","title":"Model Zoo","text":"Variational Autoencoder","category":"page"},{"location":"guide/models/custom_layers/#man-advanced","page":"Custom Layers","title":"Defining Customised Layers","text":"","category":"section"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Here we will try and describe usage of some more advanced features that Flux provides to give more control over model building.","category":"page"},{"location":"guide/models/custom_layers/#Custom-Model-Example","page":"Custom Layers","title":"Custom Model Example","text":"","category":"section"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Here is a basic example of a custom model. It simply adds the input to the result from the neural network.","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"struct CustomModel{T <: Chain} # Parameter to avoid type instability\n chain::T\nend\n\nfunction (m::CustomModel)(x)\n # Arbitrary code can go here, but note that everything will be differentiated.\n # Zygote does not allow some operations, like mutating arrays.\n\n return m.chain(x) + x\nend\n\n# Call @layer to allow for training. 
Described below in more detail.\nFlux.@layer CustomModel","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Notice that we parameterized the type of the chain field. This is necessary for fast Julia code, so that the struct field can be given a concrete type. Chains have a type parameter fully specifying the types of the layers they contain. By using a type parameter, we are freeing Julia to determine the correct concrete type, so that we do not need to specify the full, possibly quite long, type ourselves.","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"You can then use the model like:","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"chain = Chain(Dense(10 => 10, relu), Dense(10 => 10))\nmodel = CustomModel(chain)\nmodel(rand(Float32, 10))","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"For an intro to Flux and automatic differentiation, see this tutorial.","category":"page"},{"location":"guide/models/custom_layers/#Customising-Parameter-Collection-for-a-Model","page":"Custom Layers","title":"Customising Parameter Collection for a Model","text":"","category":"section"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Consider the example Affine layer from the basics.","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"By default all the fields in the Affine type are collected as its parameters. However, in some cases it may be desired to hold other metadata in our \"layers\" that may not be needed for training, and are hence supposed to be ignored while the parameters are collected. 
With Flux, the way to mark some fields of our layer as trainable is through overloading the trainable function:","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"julia> struct Affine\n W\n b\n end\n\njulia> Affine(in::Int, out::Int) = Affine(randn(out, in), randn(out));\n\njulia> (m::Affine)(x) = m.W * x .+ m.b;\n\njulia> Flux.@layer Affine\n\njulia> a = Affine(Float32[1 2; 3 4; 5 6], Float32[7, 8, 9])\nAffine(Float32[1.0 2.0; 3.0 4.0; 5.0 6.0], Float32[7.0, 8.0, 9.0])\n\njulia> Flux.trainable(a) # default behavior\n(W = Float32[1.0 2.0; 3.0 4.0; 5.0 6.0], b = Float32[7.0, 8.0, 9.0])\n\njulia> Flux.trainable(a::Affine) = (; W = a.W) # returns a NamedTuple using the field's name\n\njulia> Flux.trainable(a)\n(W = Float32[1.0 2.0; 3.0 4.0; 5.0 6.0],)","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Only the fields returned by trainable will be seen by Flux.setup and Flux.update! for training. But all fields will be seen by gpu and similar functions, for example:","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"julia> a |> f16\nAffine(Float16[1.0 2.0; 3.0 4.0; 5.0 6.0], Float16[7.0, 8.0, 9.0])","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Note that there is no need to overload trainable to hide fields which do not contain numerical arrays (for example, activation functions, or Boolean flags). 
These are always ignored by training.","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"The exact same method of trainable can also be defined using the macro, for convenience:","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Flux.@layer Affine trainable=(W,)","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"There is a second, more severe, kind of restriction possible. This is not recommended, but is included here for completeness. Calling Functors.@functor Affine (W,) means that no exploration of the model will ever visit the other fields: they will not be moved to the GPU by gpu, and their precision will not be changed by f32. This requires the struct to have a corresponding constructor that accepts only W as an argument.","category":"page"},{"location":"guide/models/custom_layers/#Custom-multiple-input-or-output-layer","page":"Custom Layers","title":"Custom multiple input or output layer","text":"","category":"section"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Sometimes a model needs to receive several separate inputs at once or produce several separate outputs at once. In other words, there are multiple paths within this high-level layer, each processing a different input or producing a different output. A simple example of this in machine learning literature is the inception module.","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"We could have a struct that stores the weights along each path and implement the joining/splitting in the forward pass function. That would mean a new struct for each different block, e.g. 
one would have a TransformerBlock struct for a transformer block, and a ResNetBlock struct for a ResNet block, each block being composed of smaller sub-blocks. This is often the simplest and cleanest way to implement complex models.","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"This guide instead will show you how to construct a high-level layer (like Chain) that is made of multiple sub-layers for each path.","category":"page"},{"location":"guide/models/custom_layers/#Multiple-inputs:-a-custom-Join-layer","page":"Custom Layers","title":"Multiple inputs: a custom Join layer","text":"","category":"section"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Our custom Join layer will accept multiple inputs at once, pass each input through a separate path, then combine the results. Note that this layer can already be constructed using Parallel, but we will first walk through how to do this manually.","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"We start by defining a new struct, Join, that stores the different paths and a combine operation as its fields.","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"using Flux\nusing CUDA\n\n# custom join layer\nstruct Join{T, F}\n combine::F\n paths::T\nend\n\n# allow Join(op, m1, m2, ...) as a constructor\nJoin(combine, paths...) = Join(combine, paths)","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Notice again that we parameterized the type of the combine and paths fields. 
In addition to the performance considerations of concrete types, this allows either field to be Vectors, Tuples, or one of each - we don't need to pay attention to which.","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"The next step is to use Flux.@layer to make our struct behave like a Flux layer. This is important so that calling Flux.setup on a Join maps over the underlying trainable arrays on each path.","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Flux.@layer Join","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Finally, we define the forward pass. For Join, this means applying each path in paths to each input array, then using combine to merge the results.","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"(m::Join)(xs::Tuple) = m.combine(map((f, x) -> f(x), m.paths, xs)...)\n(m::Join)(xs...) = m(xs)","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Lastly, we can test our new layer. 
Thanks to the proper abstractions in Julia, our layer works on GPU arrays out of the box!","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"model = Chain(\n Join(vcat,\n Chain(Dense(1 => 5, relu), Dense(5 => 1)), # branch 1\n Dense(1 => 2), # branch 2\n Dense(1 => 1) # branch 3\n ),\n Dense(4 => 1)\n ) |> gpu\n\nxs = map(gpu, (rand(1), rand(1), rand(1)))\n\nmodel(xs)\n# returns a single float vector with one value","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"note: Note\nThis Join layer is available from the Fluxperimental.jl package.","category":"page"},{"location":"guide/models/custom_layers/#Using-Parallel","page":"Custom Layers","title":"Using Parallel","text":"","category":"section"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Flux already provides Parallel that can offer the same functionality. In this case, Join is going to just be syntactic sugar for Parallel.","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Join(combine, paths) = Parallel(combine, paths)\nJoin(combine, paths...) 
= Join(combine, paths)\n\n# use vararg/tuple version of Parallel forward pass\nmodel = Chain(\n Join(vcat,\n Chain(Dense(1 => 5, relu), Dense(5 => 1)),\n Dense(1 => 2),\n Dense(1 => 1)\n ),\n Dense(4 => 1)\n ) |> gpu\n\nxs = map(gpu, (rand(1), rand(1), rand(1)))\n\nmodel(xs)\n# returns a single float vector with one value","category":"page"},{"location":"guide/models/custom_layers/#Multiple-outputs:-a-custom-Split-layer","page":"Custom Layers","title":"Multiple outputs: a custom Split layer","text":"","category":"section"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Our custom Split layer will accept a single input, then pass the input through a separate path to produce multiple outputs.","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"We start by following the same steps as the Join layer: define a struct, use Flux.@layer, and define the forward pass.","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"using Flux\nusing CUDA\n\n# custom split layer\nstruct Split{T}\n paths::T\nend\n\nSplit(paths...) 
= Split(paths)\n\nFlux.@layer Split\n\n(m::Split)(x::AbstractArray) = map(f -> f(x), m.paths)","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Now we can test to see that our Split does indeed produce multiple outputs.","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"model = Chain(\n Dense(10 => 5),\n Split(Dense(5 => 1, tanh), Dense(5 => 3, tanh), Dense(5 => 2))\n ) |> gpu\n\nmodel(gpu(rand(10)))\n# returns a tuple with three float vectors","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"A custom loss function for the multiple outputs may look like this:","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"using Statistics\n\n# assuming model returns the output of a Split\n# x is a single input\n# ys is a tuple of outputs\nfunction loss(x, ys, model)\n # rms over all the mse\n ŷs = model(x)\n return sqrt(mean(Flux.mse(y, ŷ) for (y, ŷ) in zip(ys, ŷs)))\nend","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"note: Note\nThis Split layer is available from the Fluxperimental.jl package.","category":"page"},{"location":"reference/data/mlutils/#Working-with-Data,-using-MLUtils.jl","page":"Batching Data – MLUtils.jl","title":"Working with Data, using MLUtils.jl","text":"","category":"section"},{"location":"reference/data/mlutils/","page":"Batching Data – MLUtils.jl","title":"Batching Data – MLUtils.jl","text":"Flux re-exports the DataLoader type and utility functions for working with data from MLUtils.","category":"page"},{"location":"reference/data/mlutils/#DataLoader","page":"Batching Data – MLUtils.jl","title":"DataLoader","text":"","category":"section"},{"location":"reference/data/mlutils/","page":"Batching Data – 
MLUtils.jl","title":"Batching Data – MLUtils.jl","text":"The DataLoader can be used to create mini-batches of data, in the format train! expects.","category":"page"},{"location":"reference/data/mlutils/","page":"Batching Data – MLUtils.jl","title":"Batching Data – MLUtils.jl","text":"MLUtils.DataLoader","category":"page"},{"location":"reference/data/mlutils/#MLUtils.DataLoader","page":"Batching Data – MLUtils.jl","title":"MLUtils.DataLoader","text":"DataLoader(data; [batchsize, buffer, collate, parallel, partial, rng, shuffle])\n\nAn object that iterates over mini-batches of data, each mini-batch containing batchsize observations (except possibly the last one).\n\nTakes as input a single data array, a tuple (or a named tuple) of arrays, or in general any data object that implements the numobs and getobs methods.\n\nThe last dimension in each array is the observation dimension, i.e. the one divided into mini-batches.\n\nThe original data is preserved in the data field of the DataLoader.\n\nArguments\n\ndata: The data to be iterated over. The data type has to be supported by numobs and getobs.\nbatchsize: If less than 0, iterates over individual observations. Otherwise, each iteration (except possibly the last) yields a mini-batch containing batchsize observations. Default 1.\nbuffer: If buffer=true and supported by the type of data, a buffer will be allocated and reused for memory efficiency. You can also pass a preallocated object to buffer. Default false.\ncollate: Batching behavior. If nothing (default), a batch is getobs(data, indices). If false, each batch is [getobs(data, i) for i in indices]. When true, applies batch to the vector of observations in a batch, recursively collating arrays in the last dimensions. See batch for more information and examples.\nparallel: Whether to load data in parallel using worker threads. This can speed up data loading by up to a factor of the number of available threads. Requires starting Julia with multiple threads. 
Check Threads.nthreads() to see the number of available threads. Passing parallel = true breaks ordering guarantees. Default false.\npartial: This argument is used only when batchsize > 0. If partial=false and the number of observations is not divisible by the batchsize, then the last mini-batch is dropped. Default true.\nrng: A random number generator. Default Random.GLOBAL_RNG.\nshuffle: Whether to shuffle the observations before iterating. Unlike wrapping the data container with shuffleobs(data), shuffle=true ensures that the observations are shuffled anew every time you start iterating over eachobs. Default false.\n\nExamples\n\njulia> Xtrain = rand(10, 100);\n\njulia> array_loader = DataLoader(Xtrain, batchsize=2);\n\njulia> for x in array_loader\n @assert size(x) == (10, 2)\n # do something with x, 50 times\n end\n\njulia> array_loader.data === Xtrain\ntrue\n\njulia> tuple_loader = DataLoader((Xtrain,), batchsize=2); # similar, but yielding 1-element tuples\n\njulia> for x in tuple_loader\n @assert x isa Tuple{Matrix}\n @assert size(x[1]) == (10, 2)\n end\n\njulia> Ytrain = rand('a':'z', 100); # now make a DataLoader yielding 2-element named tuples\n\njulia> train_loader = DataLoader((data=Xtrain, label=Ytrain), batchsize=5, shuffle=true);\n\njulia> for epoch in 1:100\n for (x, y) in train_loader # access via tuple destructuring\n @assert size(x) == (10, 5)\n @assert size(y) == (5,)\n # loss += f(x, y) # etc, runs 100 * 20 times\n end\n end\n\njulia> first(train_loader).label isa Vector{Char} # access via property name\ntrue\n\njulia> first(train_loader).label == Ytrain[1:5] # because of shuffle=true\nfalse\n\njulia> foreach(println∘summary, DataLoader(rand(Int8, 10, 64), batchsize=30)) # partial=false would omit last\n10×30 Matrix{Int8}\n10×30 Matrix{Int8}\n10×4 Matrix{Int8}\n\n\n\n\n\n","category":"type"},{"location":"reference/data/mlutils/#Utility-Functions","page":"Batching Data – MLUtils.jl","title":"Utility 
Functions","text":"","category":"section"},{"location":"reference/data/mlutils/","page":"Batching Data – MLUtils.jl","title":"Batching Data – MLUtils.jl","text":"The utility functions are meant to be used while working with data; these functions help create inputs for your models or batch your dataset.","category":"page"},{"location":"reference/data/mlutils/","page":"Batching Data – MLUtils.jl","title":"Batching Data – MLUtils.jl","text":"MLUtils.batch\nMLUtils.batchsize\nMLUtils.batchseq\nMLUtils.BatchView\nMLUtils.chunk\nMLUtils.eachobs\nMLUtils.fill_like\nMLUtils.filterobs\nMLUtils.flatten\nMLUtils.getobs\nMLUtils.getobs!\nMLUtils.joinobs\nMLUtils.group_counts\nMLUtils.group_indices\nMLUtils.groupobs\nMLUtils.kfolds\nMLUtils.leavepout\nMLUtils.mapobs\nMLUtils.numobs\nMLUtils.normalise\nMLUtils.obsview\nMLUtils.ObsView\nMLUtils.ones_like\nMLUtils.oversample\nMLUtils.randobs\nMLUtils.rand_like\nMLUtils.randn_like\nMLUtils.rpad_constant\nMLUtils.shuffleobs\nMLUtils.splitobs\nMLUtils.unbatch\nMLUtils.undersample\nMLUtils.unsqueeze\nMLUtils.unstack\nMLUtils.zeros_like","category":"page"},{"location":"reference/data/mlutils/#MLUtils.batch","page":"Batching Data – MLUtils.jl","title":"MLUtils.batch","text":"batch(xs)\n\nBatch the arrays in xs into a single array with an extra dimension.\n\nIf the elements of xs are tuples, named tuples, or dicts, the output will be of the same type. 
\n\nSee also unbatch.\n\nExamples\n\njulia> batch([[1,2,3], \n [4,5,6]])\n3×2 Matrix{Int64}:\n 1 4\n 2 5\n 3 6\n\njulia> batch([(a=[1,2], b=[3,4])\n (a=[5,6], b=[7,8])]) \n(a = [1 5; 2 6], b = [3 7; 4 8])\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.batchsize","page":"Batching Data – MLUtils.jl","title":"MLUtils.batchsize","text":"batchsize(data::BatchView) -> Int\n\nReturn the fixed size of each batch in data.\n\nExamples\n\nusing MLUtils\nX, Y = MLUtils.load_iris()\n\nA = BatchView(X, batchsize=30)\n@assert batchsize(A) == 30\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.batchseq","page":"Batching Data – MLUtils.jl","title":"MLUtils.batchseq","text":"batchseq(seqs, val = 0)\n\nTake a list of N sequences, and turn them into a single sequence where each item is a batch of N. Short sequences will be padded by val.\n\nExamples\n\njulia> batchseq([[1, 2, 3], [4, 5]], 0)\n3-element Vector{Vector{Int64}}:\n [1, 4]\n [2, 5]\n [3, 0]\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.BatchView","page":"Batching Data – MLUtils.jl","title":"MLUtils.BatchView","text":"BatchView(data, batchsize; partial=true, collate=nothing)\nBatchView(data; batchsize=1, partial=true, collate=nothing)\n\nCreate a view of the given data that represents it as a vector of batches. Each batch will contain an equal amount of observations in them. The batch-size can be specified using the parameter batchsize. In the case that the size of the dataset is not dividable by the specified batchsize, the remaining observations will be ignored if partial=false. 
If partial=true instead the last batch-size can be slightly smaller.\n\nNote that any data access is delayed until getindex is called.\n\nIf used as an iterator, the object will iterate over the dataset once, effectively denoting an epoch.\n\nFor BatchView to work on some data structure, the type of the given variable data must implement the data container interface. See ObsView for more info.\n\nArguments\n\ndata : The object describing the dataset. Can be of any type as long as it implements getobs and numobs (see Details for more information).\nbatchsize : The batch-size of each batch. It is the number of observations that each batch must contain (except possibly for the last one).\npartial : If partial=false and the number of observations is not divisible by the batch-size, then the last mini-batch is dropped.\ncollate: Batching behavior. If nothing (default), a batch is getobs(data, indices). If false, each batch is [getobs(data, i) for i in indices]. When true, applies batch to the vector of observations in a batch, recursively collating arrays in the last dimensions. 
See batch for more information and examples.\n\nExamples\n\nusing MLUtils\nX, Y = MLUtils.load_iris()\n\nA = BatchView(X, batchsize=30)\n@assert typeof(A) <: BatchView <: AbstractVector\n@assert eltype(A) <: SubArray{Float64,2}\n@assert length(A) == 5 # Iris has 150 observations\n@assert size(A[1]) == (4,30) # Iris has 4 features\n\n# 5 batches of size 30 observations\nfor x in BatchView(X, batchsize=30)\n @assert typeof(x) <: SubArray{Float64,2}\n @assert numobs(x) === 30\nend\n\n# 7 batches of size 20 observations\n# Note that the iris dataset has 150 observations,\n# which means that with a batchsize of 20, the last\n# 10 observations will be ignored\nfor (x, y) in BatchView((X, Y), batchsize=20, partial=false)\n @assert typeof(x) <: SubArray{Float64,2}\n @assert typeof(y) <: SubArray{String,1}\n @assert numobs(x) == numobs(y) == 20\nend\n\n# collate tuple observations\nfor (x, y) in BatchView((rand(10, 3), [\"a\", \"b\", \"c\"]), batchsize=2, collate=true, partial=false)\n @assert size(x) == (10, 2)\n @assert size(y) == (2,)\nend\n\n\n# randomly assign observations to one and only one batch.\nfor (x, y) in BatchView(shuffleobs((X, Y)), batchsize=20)\n @assert typeof(x) <: SubArray{Float64,2}\n @assert typeof(y) <: SubArray{String,1}\nend\n\n\n\n\n\n","category":"type"},{"location":"reference/data/mlutils/#MLUtils.chunk","page":"Batching Data – MLUtils.jl","title":"MLUtils.chunk","text":"chunk(x, n; [dims])\nchunk(x; [size, dims])\n\nSplit x into n parts or alternatively, if size is an integer, into equal chunks of size size. 
The parts contain the same number of elements except possibly for the last one that can be smaller.\n\nIn case size is a collection of integers instead, the elements of x are split into chunks of the given sizes.\n\nIf x is an array, dims can be used to specify along which dimension to split (defaults to the last dimension).\n\nExamples\n\njulia> chunk(1:10, 3)\n3-element Vector{UnitRange{Int64}}:\n 1:4\n 5:8\n 9:10\n\njulia> chunk(1:10; size = 2)\n5-element Vector{UnitRange{Int64}}:\n 1:2\n 3:4\n 5:6\n 7:8\n 9:10\n\njulia> x = reshape(collect(1:20), (5, 4))\n5×4 Matrix{Int64}:\n 1 6 11 16\n 2 7 12 17\n 3 8 13 18\n 4 9 14 19\n 5 10 15 20\n\njulia> xs = chunk(x, 2, dims=1)\n2-element Vector{SubArray{Int64, 2, Matrix{Int64}, Tuple{UnitRange{Int64}, Base.Slice{Base.OneTo{Int64}}}, false}}:\n [1 6 11 16; 2 7 12 17; 3 8 13 18]\n [4 9 14 19; 5 10 15 20]\n\njulia> xs[1]\n3×4 view(::Matrix{Int64}, 1:3, :) with eltype Int64:\n 1 6 11 16\n 2 7 12 17\n 3 8 13 18\n\njulia> xes = chunk(x; size = 2, dims = 2)\n2-element Vector{SubArray{Int64, 2, Matrix{Int64}, Tuple{Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}}, true}}:\n [1 6; 2 7; … ; 4 9; 5 10]\n [11 16; 12 17; … ; 14 19; 15 20]\n\njulia> xes[2]\n5×2 view(::Matrix{Int64}, :, 3:4) with eltype Int64:\n 11 16\n 12 17\n 13 18\n 14 19\n 15 20\n\njulia> chunk(1:6; size = [2, 4])\n2-element Vector{UnitRange{Int64}}:\n 1:2\n 3:6\n\n\n\n\n\nchunk(x, partition_idxs; [npartitions, dims])\n\nPartition the array x along the dimension dims according to the indexes in partition_idxs.\n\npartition_idxs must be sorted and contain only positive integers between 1 and the number of partitions. 
\n\nIf the number of partition npartitions is not provided, it is inferred from partition_idxs.\n\nIf dims is not provided, it defaults to the last dimension.\n\nSee also unbatch.\n\nExamples\n\njulia> x = reshape([1:10;], 2, 5)\n2×5 Matrix{Int64}:\n 1 3 5 7 9\n 2 4 6 8 10\n\njulia> chunk(x, [1, 2, 2, 3, 3])\n3-element Vector{SubArray{Int64, 2, Matrix{Int64}, Tuple{Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}}, true}}:\n [1; 2;;]\n [3 5; 4 6]\n [7 9; 8 10]\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.eachobs","page":"Batching Data – MLUtils.jl","title":"MLUtils.eachobs","text":"eachobs(data; kws...)\n\nReturn an iterator over data.\n\nSupports the same arguments as DataLoader. The batchsize default is -1 here while it is 1 for DataLoader.\n\nExamples\n\nX = rand(4,100)\n\nfor x in eachobs(X)\n # loop entered 100 times\n @assert typeof(x) <: Vector{Float64}\n @assert size(x) == (4,)\nend\n\n# mini-batch iterations\nfor x in eachobs(X, batchsize=10)\n # loop entered 10 times\n @assert typeof(x) <: Matrix{Float64}\n @assert size(x) == (4,10)\nend\n\n# support for tuples, named tuples, dicts\nfor (x, y) in eachobs((X, Y))\n # ...\nend\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.fill_like","page":"Batching Data – MLUtils.jl","title":"MLUtils.fill_like","text":"fill_like(x, val, [element_type=eltype(x)], [dims=size(x)]))\n\nCreate an array with the given element type and size, based upon the given source array x. All element of the new array will be set to val. The third and fourth arguments are both optional, defaulting to the given array's eltype and size. 
The dimensions may be specified as an integer or as a tuple argument.\n\nSee also zeros_like and ones_like.\n\nExamples\n\njulia> x = rand(Float32, 2)\n2-element Vector{Float32}:\n 0.16087806\n 0.89916044\n\njulia> fill_like(x, 1.7, (3, 3))\n3×3 Matrix{Float32}:\n 1.7 1.7 1.7\n 1.7 1.7 1.7\n 1.7 1.7 1.7\n\njulia> using CUDA\n\njulia> x = CUDA.rand(2, 2)\n2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:\n 0.803167 0.476101\n 0.303041 0.317581\n\njulia> fill_like(x, 1.7, Float64)\n2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:\n 1.7 1.7\n 1.7 1.7\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.filterobs","page":"Batching Data – MLUtils.jl","title":"MLUtils.filterobs","text":"filterobs(f, data)\n\nReturn a subset of data container data including all indices i for which f(getobs(data, i)) === true.\n\ndata = 1:10\nnumobs(data) == 10\nfdata = filterobs(>(5), data)\nnumobs(fdata) == 5\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.flatten","page":"Batching Data – MLUtils.jl","title":"MLUtils.flatten","text":"flatten(x::AbstractArray)\n\nReshape arbitrarily-shaped input into a matrix-shaped output, preserving the size of the last dimension.\n\nSee also unsqueeze.\n\nExamples\n\njulia> rand(3,4,5) |> flatten |> size\n(12, 5)\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.getobs","page":"Batching Data – MLUtils.jl","title":"MLUtils.getobs","text":"getobs(data, [idx])\n\nReturn the observations corresponding to the observation index idx. Note that idx can be any type as long as data has defined getobs for that type. If idx is not provided, then materialize all observations in data.\n\nIf data does not have getobs defined, then in the case of Tables.table(data) == true returns the row(s) in position idx, otherwise returns data[idx].\n\nAuthors of custom data containers should implement Base.getindex for their type instead of getobs. 
getobs should only be implemented for types where there is a difference between getobs and Base.getindex (such as multi-dimensional arrays).\n\nThe returned observation(s) should be in the form intended to be passed as-is to some learning algorithm. There is no strict interface requirement on what this \"actual data\" must look like. Every author behind some custom data container can make this decision themselves. The output should be consistent when idx is a scalar vs vector.\n\nBy default, getobs supports nested combinations of arrays, tuples, named tuples, and dictionaries. \n\nSee also getobs! and numobs.\n\nExamples\n\n# named tuples \nx = (a = [1, 2, 3], b = rand(6, 3))\n\ngetobs(x, 2) == (a = 2, b = x.b[:, 2])\ngetobs(x, [1, 3]) == (a = [1, 3], b = x.b[:, [1, 3]])\n\n\n# dictionaries\nx = Dict(:a => [1, 2, 3], :b => rand(6, 3))\n\ngetobs(x, 2) == Dict(:a => 2, :b => x[:b][:, 2])\ngetobs(x, [1, 3]) == Dict(:a => [1, 3], :b => x[:b][:, [1, 3]])\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.getobs!","page":"Batching Data – MLUtils.jl","title":"MLUtils.getobs!","text":"getobs!(buffer, data, idx)\n\nIn-place version of getobs(data, idx). If this method is defined for the type of data, then buffer should be used to store the result, instead of allocating a dedicated object.\n\nImplementing this function is optional. If no such method is provided for the type of data, then buffer will be ignored and the result of getobs returned. This could be because the type of data may not lend itself to the concept of copy!. Thus, supporting a custom getobs! is optional.\n\nSee also getobs and numobs. 
\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.joinobs","page":"Batching Data – MLUtils.jl","title":"MLUtils.joinobs","text":"joinobs(datas...)\n\nConcatenate data containers datas.\n\ndata1, data2 = 1:10, 11:20\njdata = joinobs(data1, data2)\ngetobs(jdata, 15) == 15\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.group_counts","page":"Batching Data – MLUtils.jl","title":"MLUtils.group_counts","text":"group_counts(x)\n\nCount the number of times that each element of x appears.\n\nSee also group_indices.\n\nExamples\n\njulia> group_counts(['a', 'b', 'b'])\nDict{Char, Int64} with 2 entries:\n 'a' => 1\n 'b' => 2\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.group_indices","page":"Batching Data – MLUtils.jl","title":"MLUtils.group_indices","text":"group_indices(x) -> Dict\n\nComputes the indices of elements in the vector x for each distinct value contained. This information is useful for resampling strategies, such as stratified sampling.\n\nSee also group_counts.\n\nExamples\n\njulia> x = [:yes, :no, :maybe, :yes];\n\njulia> group_indices(x)\nDict{Symbol, Vector{Int64}} with 3 entries:\n :yes => [1, 4]\n :maybe => [3]\n :no => [2]\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.groupobs","page":"Batching Data – MLUtils.jl","title":"MLUtils.groupobs","text":"groupobs(f, data)\n\nSplit data container data into different data containers, grouping observations by f(obs).\n\ndata = -10:10\ndatas = groupobs(>(0), data)\nlength(datas) == 2\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.kfolds","page":"Batching Data – MLUtils.jl","title":"MLUtils.kfolds","text":"kfolds(n::Integer, k = 5) -> Tuple\n\nCompute the train/validation assignments for k repartitions of n observations, and return them in the form of two vectors. 
The first vector contains the index-vectors for the training subsets, and the second vector the index-vectors for the validation subsets respectively. A general rule of thumb is to use either k = 5 or k = 10. The following code snippet generates the indices assignments for k = 5\n\njulia> train_idx, val_idx = kfolds(10, 5);\n\nEach observation is assigned to the validation subset once (and only once). Thus, a union over all validation index-vectors reproduces the full range 1:n. Note that there is no random assignment of observations to subsets, which means that adjacent observations are likely to be part of the same validation subset.\n\njulia> train_idx\n5-element Array{Array{Int64,1},1}:\n [3,4,5,6,7,8,9,10]\n [1,2,5,6,7,8,9,10]\n [1,2,3,4,7,8,9,10]\n [1,2,3,4,5,6,9,10]\n [1,2,3,4,5,6,7,8]\n\njulia> val_idx\n5-element Array{UnitRange{Int64},1}:\n 1:2\n 3:4\n 5:6\n 7:8\n 9:10\n\n\n\n\n\nkfolds(data, [k = 5])\n\nRepartition a data container k times using a k folds strategy and return the sequence of folds as a lazy iterator. Only data subsets are created, which means that no actual data is copied until getobs is invoked.\n\nConceptually, a k-folds repartitioning strategy divides the given data into k roughly equal-sized parts. Each part will serve as validation set once, while the remaining parts are used for training. This results in k different partitions of data.\n\nIn the case that the size of the dataset is not dividable by the specified k, the remaining observations will be evenly distributed among the parts.\n\nfor (x_train, x_val) in kfolds(X, k=10)\n # code called 10 times\n # nobs(x_val) may differ up to ±1 over iterations\nend\n\nMultiple variables are supported (e.g. for labeled data)\n\nfor ((x_train, y_train), val) in kfolds((X, Y), k=10)\n # ...\nend\n\nBy default the folds are created using static splits. 
Use shuffleobs to randomly assign observations to the folds.\n\nfor (x_train, x_val) in kfolds(shuffleobs(X), k = 10)\n # ...\nend\n\nSee leavepout for a related function.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.leavepout","page":"Batching Data – MLUtils.jl","title":"MLUtils.leavepout","text":"leavepout(n::Integer, [size = 1]) -> Tuple\n\nCompute the train/validation assignments for k ≈ n/size repartitions of n observations, and return them in the form of two vectors. The first vector contains the index-vectors for the training subsets, and the second vector the index-vectors for the validation subsets respectively. Each validation subset will have either size or size+1 observations assigned to it. The following code snippet generates the index-vectors for size = 2.\n\njulia> train_idx, val_idx = leavepout(10, 2);\n\nEach observation is assigned to the validation subset once (and only once). Thus, a union over all validation index-vectors reproduces the full range 1:n. Note that there is no random assignment of observations to subsets, which means that adjacent observations are likely to be part of the same validation subset.\n\njulia> train_idx\n5-element Array{Array{Int64,1},1}:\n [3,4,5,6,7,8,9,10]\n [1,2,5,6,7,8,9,10]\n [1,2,3,4,7,8,9,10]\n [1,2,3,4,5,6,9,10]\n [1,2,3,4,5,6,7,8]\n\njulia> val_idx\n5-element Array{UnitRange{Int64},1}:\n 1:2\n 3:4\n 5:6\n 7:8\n 9:10\n\n\n\n\n\nleavepout(data, p = 1)\n\nRepartition a data container using a k-fold strategy, where k is chosen in such a way, that each validation subset of the resulting folds contains roughly p observations. Defaults to p = 1, which is also known as \"leave-one-out\" partitioning.\n\nThe resulting sequence of folds is returned as a lazy iterator. Only data subsets are created. 
That means no actual data is copied until getobs is invoked.\n\nfor (train, val) in leavepout(X, p=2)\n # if numobs(X) is divisible by 2,\n # then numobs(val) will be 2 for each iteration,\n # otherwise it may be 3 for the first few iterations.\nend\n\nSee kfolds for a related function.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.mapobs","page":"Batching Data – MLUtils.jl","title":"MLUtils.mapobs","text":"mapobs(f, data; batched=:auto)\n\nLazily map f over the observations in a data container data. Returns a new data container mdata that can be indexed and has a length. Indexing triggers the transformation f.\n\nThe batched keyword argument controls the behavior of mdata[idx] and mdata[idxs] where idx is an integer and idxs is a vector of integers:\n\nbatched=:auto (default). Let f handle the two cases. Calls f(getobs(data, idx)) and f(getobs(data, idxs)).\nbatched=:never. The function f is always called on a single observation. Calls f(getobs(data, idx)) and [f(getobs(data, idx)) for idx in idxs].\nbatched=:always. The function f is always called on a batch of observations. Calls getobs(f(getobs(data, [idx])), 1) and f(getobs(data, idxs)).\n\nExamples\n\njulia> data = (a=[1,2,3], b=[1,2,3]);\n\njulia> mdata = mapobs(data) do x\n (c = x.a .+ x.b, d = x.a .- x.b)\n end\nmapobs(#25, (a = [1, 2, 3], b = [1, 2, 3]); batched=:auto))\n\njulia> mdata[1]\n(c = 2, d = 0)\n\njulia> mdata[1:2]\n(c = [2, 4], d = [0, 0])\n\n\n\n\n\nmapobs(fs, data)\n\nLazily map each function in tuple fs over the observations in data container data. Returns a tuple of transformed data containers.\n\n\n\n\n\nmapobs(namedfs::NamedTuple, data)\n\nMap a NamedTuple of functions over data, turning it into a data container of NamedTuples. 
Field syntax can be used to select a column of the resulting data container.\n\ndata = 1:10\nnameddata = mapobs((x = sqrt, y = log), data)\ngetobs(nameddata, 10) == (x = sqrt(10), y = log(10))\ngetobs(nameddata.x, 10) == sqrt(10)\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.numobs","page":"Batching Data – MLUtils.jl","title":"MLUtils.numobs","text":"numobs(data)\n\nReturn the total number of observations contained in data.\n\nIf data does not have numobs defined, then in the case of Tables.table(data) == true returns the number of rows, otherwise returns length(data).\n\nAuthors of custom data containers should implement Base.length for their type instead of numobs. numobs should only be implemented for types where there is a difference between numobs and Base.length (such as multi-dimensional arrays).\n\ngetobs supports by default nested combinations of array, tuple, named tuples, and dictionaries. \n\nSee also getobs.\n\nExamples\n\n\n# named tuples \nx = (a = [1, 2, 3], b = rand(6, 3))\nnumobs(x) == 3\n\n# dictionaries\nx = Dict(:a => [1, 2, 3], :b => rand(6, 3))\nnumobs(x) == 3\n\nAll internal containers must have the same number of observations:\n\njulia> x = (a = [1, 2, 3, 4], b = rand(6, 3));\n\njulia> numobs(x)\nERROR: DimensionMismatch: All data containers must have the same number of observations.\nStacktrace:\n [1] _check_numobs_error()\n @ MLUtils ~/.julia/dev/MLUtils/src/observation.jl:163\n [2] _check_numobs\n @ ~/.julia/dev/MLUtils/src/observation.jl:130 [inlined]\n [3] numobs(data::NamedTuple{(:a, :b), Tuple{Vector{Int64}, Matrix{Float64}}})\n @ MLUtils ~/.julia/dev/MLUtils/src/observation.jl:177\n [4] top-level scope\n @ REPL[35]:1\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.normalise","page":"Batching Data – MLUtils.jl","title":"MLUtils.normalise","text":"normalise(x; dims=ndims(x), ϵ=1e-5)\n\nNormalise the array x to mean 0 and standard deviation 1 across the dimension(s) 
given by dims. By default, dims is the last dimension. \n\nϵ is a small additive factor added to the denominator for numerical stability.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.obsview","page":"Batching Data – MLUtils.jl","title":"MLUtils.obsview","text":"obsview(data, [indices])\n\nReturns a lazy view of the observations in data that correspond to the given indices. No data will be copied except for the indices. It is similar to constructing an ObsView, but returns a SubArray if the type of data is Array or SubArray. Furthermore, this function may be extended for custom types of data that also want to provide their own subset-type.\n\nIn case data is a tuple, the constructor will be mapped over its elements. That means that the constructor returns a tuple of ObsView instead of an ObsView of tuples.\n\nIf instead you want to get the subset of observations corresponding to the given indices in their native type, use getobs.\n\nSee ObsView for more information.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.ObsView","page":"Batching Data – MLUtils.jl","title":"MLUtils.ObsView","text":"ObsView(data, [indices])\n\nUsed to represent a subset of some data of arbitrary type by storing which observation-indices the subset spans. Furthermore, subsequent subsettings are accumulated without needing to access actual data.\n\nThe main purpose for the existence of ObsView is to delay data access and movement until an actual batch of data (or single observation) is needed for some computation. This is particularly useful when the data is not located in memory, but on the hard drive or some remote location. In such a scenario one wants to load the required data only when needed.\n\nAny data access is delayed until getindex is called, and even getindex returns the result of obsview which in general avoids data movement until getobs is called. 
If used as an iterator, the view will iterate over the dataset once, effectively denoting an epoch. Each iteration will return a lazy subset to the current observation.\n\nArguments\n\ndata : The object describing the dataset. Can be of any type as long as it implements getobs and numobs (see Details for more information).\nindices : Optional. The index or indices of the observation(s) in data that the subset should represent. Can be of type Int or some subtype of AbstractVector.\n\nMethods\n\ngetindex : Returns the observation(s) of the given index/indices. No data is copied aside from the required indices.\nnumobs : Returns the total number of observations in the subset.\ngetobs : Returns the underlying data that the ObsView represents at the given relative indices. Note that these indices are in \"subset space\", and in general will not directly correspond to the same indices in the underlying data set.\n\nDetails\n\nFor ObsView to work on some data structure, the desired type MyType must implement the following interface:\n\ngetobs(data::MyType, idx) : Should return the observation(s) indexed by idx. In what form is up to the user. Note that idx can be of type Int or AbstractVector.\nnumobs(data::MyType) : Should return the total number of observations in data\n\nThe following methods can also be provided and are optional:\n\ngetobs(data::MyType) : By default this function is the identity function. If that is not the behaviour that you want for your type, you need to provide this method as well.\nobsview(data::MyType, idx) : If your custom type has its own kind of subset type, you can return it here. An example of such a case is SubArray, which represents a subset of some AbstractArray.\ngetobs!(buffer, data::MyType, [idx]) : Inplace version of getobs(data, idx). If this method is provided for MyType, then eachobs can preallocate a buffer that is then reused every iteration. 
Note: buffer should be equivalent to the return value of getobs(::MyType, ...), since this is how buffer is preallocated by default.\n\nExamples\n\nX, Y = MLUtils.load_iris()\n\n# The iris set has 150 observations and 4 features\n@assert size(X) == (4,150)\n\n# Represents the 80 observations as an ObsView\nv = ObsView(X, 21:100)\n@assert numobs(v) == 80\n@assert typeof(v) <: ObsView\n# getobs indexes into v\n@assert getobs(v, 1:10) == X[:, 21:30]\n\n# Use `obsview` to avoid boxing into ObsView\n# for types that provide a custom \"subset\", such as arrays.\n# Here it instead creates a native SubArray.\nv = obsview(X, 1:100)\n@assert numobs(v) == 100\n@assert typeof(v) <: SubArray\n\n# Also works for tuples of arbitrary length\nsubset = obsview((X, Y), 1:100)\n@assert numobs(subset) == 100\n@assert typeof(subset) <: Tuple # tuple of SubArray\n\n# Use as iterator\nfor x in ObsView(X)\n @assert typeof(x) <: SubArray{Float64,1}\nend\n\n# iterate over each individual labeled observation\nfor (x, y) in ObsView((X, Y))\n @assert typeof(x) <: SubArray{Float64,1}\n @assert typeof(y) <: String\nend\n\n# same but in random order\nfor (x, y) in ObsView(shuffleobs((X, Y)))\n @assert typeof(x) <: SubArray{Float64,1}\n @assert typeof(y) <: String\nend\n\n# Indexing: take first 10 observations\nx, y = ObsView((X, Y))[1:10]\n\nSee also\n\nobsview, getobs, numobs, splitobs, shuffleobs, kfolds.\n\n\n\n\n\n","category":"type"},{"location":"reference/data/mlutils/#MLUtils.ones_like","page":"Batching Data – MLUtils.jl","title":"MLUtils.ones_like","text":"ones_like(x, [element_type=eltype(x)], [dims=size(x)])\n\nCreate an array with the given element type and size, based upon the given source array x. All elements of the new array will be set to 1. The second and third arguments are both optional, defaulting to the given array's eltype and size. 
The dimensions may be specified as an integer or as a tuple argument.\n\nSee also zeros_like and fill_like.\n\nExamples\n\njulia> x = rand(Float32, 2)\n2-element Vector{Float32}:\n 0.8621633\n 0.5158395\n\njulia> ones_like(x, (3, 3))\n3×3 Matrix{Float32}:\n 1.0 1.0 1.0\n 1.0 1.0 1.0\n 1.0 1.0 1.0\n\njulia> using CUDA\n\njulia> x = CUDA.rand(2, 2)\n2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:\n 0.82297 0.656143\n 0.701828 0.391335\n\njulia> ones_like(x, Float64)\n2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:\n 1.0 1.0\n 1.0 1.0\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.oversample","page":"Batching Data – MLUtils.jl","title":"MLUtils.oversample","text":"oversample(data, classes; fraction=1, shuffle=true)\noversample(data::Tuple; fraction=1, shuffle=true)\n\nGenerate a re-balanced version of data by repeatedly sampling existing observations in such a way that every class will have at least fraction times the number of observations of the largest class in classes. This way, all classes will have a minimum number of observations in the resulting data set relative to what the largest class has in the given (original) data.\n\nAs an example, by default (i.e. with fraction = 1) the resulting dataset will be near perfectly balanced. On the other hand, with fraction = 0.5 every class in the resulting data will have at least 50% as many observations as the largest class.\n\nThe classes input is an array with the same length as numobs(data). \n\nThe convenience parameter shuffle determines if the resulting data will be shuffled after its creation; if it is not shuffled then all the repeated samples will be together at the end, sorted by class. 
Defaults to true.\n\nThe output will contain both the resampled data and classes.\n\n# 6 observations with 3 features each\nX = rand(3, 6)\n# 2 classes, severely imbalanced\nY = [\"a\", \"b\", \"b\", \"b\", \"b\", \"a\"]\n\n# oversample the class \"a\" to match \"b\"\nX_bal, Y_bal = oversample(X, Y)\n\n# this results in a bigger dataset with repeated data\n@assert size(X_bal) == (3,8)\n@assert length(Y_bal) == 8\n\n# now both \"a\", and \"b\" have 4 observations each\n@assert sum(Y_bal .== \"a\") == 4\n@assert sum(Y_bal .== \"b\") == 4\n\nFor this function to work, the type of data must implement numobs and getobs. \n\nNote that if data is a tuple and classes is not given, then it will be assumed that the last element of the tuple contains the classes.\n\njulia> data = DataFrame(X1=rand(6), X2=rand(6), Y=[:a,:b,:b,:b,:b,:a])\n6×3 DataFrames.DataFrame\n│ Row │ X1 │ X2 │ Y │\n├─────┼───────────┼─────────────┼───┤\n│ 1 │ 0.226582 │ 0.0443222 │ a │\n│ 2 │ 0.504629 │ 0.722906 │ b │\n│ 3 │ 0.933372 │ 0.812814 │ b │\n│ 4 │ 0.522172 │ 0.245457 │ b │\n│ 5 │ 0.505208 │ 0.11202 │ b │\n│ 6 │ 0.0997825 │ 0.000341996 │ a │\n\njulia> getobs(oversample(data, data.Y))\n8×3 DataFrame\n Row │ X1 X2 Y \n │ Float64 Float64 Symbol \n─────┼─────────────────────────────\n 1 │ 0.376304 0.100022 a\n 2 │ 0.467095 0.185437 b\n 3 │ 0.481957 0.319906 b\n 4 │ 0.336762 0.390811 b\n 5 │ 0.376304 0.100022 a\n 6 │ 0.427064 0.0648339 a\n 7 │ 0.427064 0.0648339 a\n 8 │ 0.457043 0.490688 b\n\nSee ObsView for more information on data subsets. See also undersample.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.randobs","page":"Batching Data – MLUtils.jl","title":"MLUtils.randobs","text":"randobs(data, [n])\n\nPick a random observation or a batch of n random observations from data. 
For this function to work, the type of data must implement numobs and getobs.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.rand_like","page":"Batching Data – MLUtils.jl","title":"MLUtils.rand_like","text":"rand_like([rng=default_rng()], x, [element_type=eltype(x)], [dims=size(x)])\n\nCreate an array with the given element type and size, based upon the given source array x. All elements of the new array will be set to a random value. The last two arguments are both optional, defaulting to the given array's eltype and size. The dimensions may be specified as an integer or as a tuple argument.\n\nThe default random number generator is used, unless a custom one is passed in explicitly as the first argument.\n\nSee also Base.rand and randn_like.\n\nExamples\n\njulia> x = ones(Float32, 2)\n2-element Vector{Float32}:\n 1.0\n 1.0\n\njulia> rand_like(x, (3, 3))\n3×3 Matrix{Float32}:\n 0.780032 0.920552 0.53689\n 0.121451 0.741334 0.5449\n 0.55348 0.138136 0.556404\n\njulia> using CUDA\n\njulia> x = CUDA.ones(2, 2)\n2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:\n 1.0 1.0\n 1.0 1.0\n\njulia> rand_like(x, Float64)\n2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:\n 0.429274 0.135379\n 0.718895 0.0098756\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.randn_like","page":"Batching Data – MLUtils.jl","title":"MLUtils.randn_like","text":"randn_like([rng=default_rng()], x, [element_type=eltype(x)], [dims=size(x)])\n\nCreate an array with the given element type and size, based upon the given source array x. All elements of the new array will be set to a random value drawn from a normal distribution. The last two arguments are both optional, defaulting to the given array's eltype and size. 
The dimensions may be specified as an integer or as a tuple argument.\n\nThe default random number generator is used, unless a custom one is passed in explicitly as the first argument.\n\nSee also Base.randn and rand_like.\n\nExamples\n\njulia> x = ones(Float32, 2)\n2-element Vector{Float32}:\n 1.0\n 1.0\n\njulia> randn_like(x, (3, 3))\n3×3 Matrix{Float32}:\n -0.385331 0.956231 0.0745102\n 1.43756 -0.967328 2.06311\n 0.0482372 1.78728 -0.902547\n\njulia> using CUDA\n\njulia> x = CUDA.ones(2, 2)\n2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:\n 1.0 1.0\n 1.0 1.0\n\njulia> randn_like(x, Float64)\n2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:\n -0.578527 0.823445\n -1.01338 -0.612053\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.rpad_constant","page":"Batching Data – MLUtils.jl","title":"MLUtils.rpad_constant","text":"rpad_constant(v::AbstractArray, n::Union{Integer, Tuple}, val = 0; dims=:)\n\nReturn the given sequence padded with val along the dimensions dims up to a maximum length in each direction specified by n.\n\nExamples\n\njulia> rpad_constant([1, 2], 4, -1) # padding with -1 up to size 4\n4-element Vector{Int64}:\n 1\n 2\n -1\n -1\n\njulia> rpad_constant([1, 2, 3], 2) # no padding if length is already greater than n\n3-element Vector{Int64}:\n 1\n 2\n 3\n\njulia> rpad_constant([1 2; 3 4], 4; dims=1) # padding along the first dimension\n4×2 Matrix{Int64}:\n 1 2\n 3 4\n 0 0\n 0 0 \n\njulia> rpad_constant([1 2; 3 4], 4) # padding along all dimensions by default\n4×2 Matrix{Int64}:\n 1 2\n 3 4\n 0 0\n 0 0 \n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.shuffleobs","page":"Batching Data – MLUtils.jl","title":"MLUtils.shuffleobs","text":"shuffleobs([rng], data)\n\nReturn a \"subset\" of data that spans all observations, but has the order of the observations shuffled.\n\nThe values of data itself are not copied. Instead only the indices are shuffled. 
This function calls obsview to accomplish that, which means that the return value is likely of a different type than data.\n\n# For Arrays the subset will be of type SubArray\n@assert typeof(shuffleobs(rand(4,10))) <: SubArray\n\n# Iterate through all observations in random order\nfor x in eachobs(shuffleobs(X))\n ...\nend\n\nThe optional parameter rng allows one to specify the random number generator used for shuffling. This is useful when reproducible results are desired. By default, uses the global RNG. See Random in Julia's standard library for more info.\n\nFor this function to work, the type of data must implement numobs and getobs. See ObsView for more information.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.splitobs","page":"Batching Data – MLUtils.jl","title":"MLUtils.splitobs","text":"splitobs(n::Int; at) -> Tuple\n\nCompute the indices for two or more disjoint subsets of the range 1:n with splits given by at.\n\nExamples\n\njulia> splitobs(100, at=0.7)\n(1:70, 71:100)\n\njulia> splitobs(100, at=(0.1, 0.4))\n(1:10, 11:50, 51:100)\n\n\n\n\n\nsplitobs(data; at, shuffle=false) -> Tuple\n\nPartition the data into two or more subsets. When at is a number (between 0 and 1) this specifies the proportion in the first subset. When at is a tuple, each entry specifies the proportion in a subset, with the last having 1-sum(at). In all there are length(at)+1 subsets returned.\n\nIf shuffle=true, randomly permute the observations before splitting.\n\nSupports any datatype implementing the numobs and getobs interfaces – including arrays, tuples & NamedTuples of arrays.\n\nExamples\n\njulia> splitobs(permutedims(1:100); at=0.7) # simple 70%-30% split, of a matrix\n([1 2 … 69 70], [71 72 … 99 100])\n\njulia> data = (x=ones(2,10), n=1:10) # a NamedTuple, consistent last dimension\n(x = [1.0 1.0 … 1.0 1.0; 1.0 1.0 … 1.0 1.0], n = 1:10)\n\njulia> splitobs(data, at=(0.5, 0.3)) # a 50%-30%-20% split, e.g. 
train/test/validation\n((x = [1.0 1.0 … 1.0 1.0; 1.0 1.0 … 1.0 1.0], n = 1:5), (x = [1.0 1.0 1.0; 1.0 1.0 1.0], n = 6:8), (x = [1.0 1.0; 1.0 1.0], n = 9:10))\n\njulia> train, test = splitobs((permutedims(1.0:100.0), 101:200), at=0.7, shuffle=true); # split a Tuple\n\njulia> vec(test[1]) .+ 100 == test[2]\ntrue\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.unbatch","page":"Batching Data – MLUtils.jl","title":"MLUtils.unbatch","text":"unbatch(x)\n\nReverse of the batch operation, unstacking the last dimension of the array x.\n\nSee also unstack and chunk.\n\nExamples\n\njulia> unbatch([1 3 5 7;\n 2 4 6 8])\n4-element Vector{Vector{Int64}}:\n [1, 2]\n [3, 4]\n [5, 6]\n [7, 8]\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.undersample","page":"Batching Data – MLUtils.jl","title":"MLUtils.undersample","text":"undersample(data, classes; shuffle=true)\n\nGenerate a class-balanced version of data by subsampling its observations in such a way that the resulting number of observations will be the same number for every class. This way, all classes will have as many observations in the resulting data set as the smallest class has in the given (original) data.\n\nThe convenience parameter shuffle determines if the resulting data will be shuffled after its creation; if it is not shuffled then all the observations will be in their original order. 
Defaults to true.\n\nThe output will contain both the resampled data and classes.\n\n# 6 observations with 3 features each\nX = rand(3, 6)\n# 2 classes, severely imbalanced\nY = [\"a\", \"b\", \"b\", \"b\", \"b\", \"a\"]\n\n# subsample the class \"b\" to match \"a\"\nX_bal, Y_bal = undersample(X, Y)\n\n# this results in a smaller dataset\n@assert size(X_bal) == (3,4)\n@assert length(Y_bal) == 4\n\n# now both \"a\", and \"b\" have 2 observations each\n@assert sum(Y_bal .== \"a\") == 2\n@assert sum(Y_bal .== \"b\") == 2\n\nFor this function to work, the type of data must implement numobs and getobs. \n\nNote that if data is a tuple, then it will be assumed that the last element of the tuple contains the targets.\n\njulia> data = DataFrame(X1=rand(6), X2=rand(6), Y=[:a,:b,:b,:b,:b,:a])\n6×3 DataFrames.DataFrame\n│ Row │ X1 │ X2 │ Y │\n├─────┼───────────┼─────────────┼───┤\n│ 1 │ 0.226582 │ 0.0443222 │ a │\n│ 2 │ 0.504629 │ 0.722906 │ b │\n│ 3 │ 0.933372 │ 0.812814 │ b │\n│ 4 │ 0.522172 │ 0.245457 │ b │\n│ 5 │ 0.505208 │ 0.11202 │ b │\n│ 6 │ 0.0997825 │ 0.000341996 │ a │\n\njulia> getobs(undersample(data, data.Y))\n4×3 DataFrame\n Row │ X1 X2 Y \n │ Float64 Float64 Symbol \n─────┼─────────────────────────────\n 1 │ 0.427064 0.0648339 a\n 2 │ 0.376304 0.100022 a\n 3 │ 0.467095 0.185437 b\n 4 │ 0.457043 0.490688 b\n\nSee ObsView for more information on data subsets. See also oversample.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.unsqueeze","page":"Batching Data – MLUtils.jl","title":"MLUtils.unsqueeze","text":"unsqueeze(x; dims)\n\nReturn x reshaped into an array one dimensionality higher than x, where dims indicates in which dimension x is extended. 
dims can be an integer between 1 and ndims(x)+1.\n\nSee also flatten, stack.\n\nExamples\n\njulia> unsqueeze([1 2; 3 4], dims=2)\n2×1×2 Array{Int64, 3}:\n[:, :, 1] =\n 1\n 3\n\n[:, :, 2] =\n 2\n 4\n\n\njulia> xs = [[1, 2], [3, 4], [5, 6]]\n3-element Vector{Vector{Int64}}:\n [1, 2]\n [3, 4]\n [5, 6]\n\njulia> unsqueeze(xs, dims=1)\n1×3 Matrix{Vector{Int64}}:\n [1, 2] [3, 4] [5, 6]\n\n\n\n\n\nunsqueeze(; dims)\n\nReturns a function which, acting on an array, inserts a dimension of size 1 at dims.\n\nExamples\n\njulia> rand(21, 22, 23) |> unsqueeze(dims=2) |> size\n(21, 1, 22, 23)\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.unstack","page":"Batching Data – MLUtils.jl","title":"MLUtils.unstack","text":"unstack(xs; dims)\n\nUnroll the given xs into an array of arrays along the given dimension dims.\n\nSee also stack, unbatch, and chunk.\n\nExamples\n\njulia> unstack([1 3 5 7; 2 4 6 8], dims=2)\n4-element Vector{Vector{Int64}}:\n [1, 2]\n [3, 4]\n [5, 6]\n [7, 8]\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.zeros_like","page":"Batching Data – MLUtils.jl","title":"MLUtils.zeros_like","text":"zeros_like(x, [element_type=eltype(x)], [dims=size(x)])\n\nCreate an array with the given element type and size, based upon the given source array x. All elements of the new array will be set to 0. The second and third arguments are both optional, defaulting to the given array's eltype and size. 
The dimensions may be specified as an integer or as a tuple argument.\n\nSee also ones_like and fill_like.\n\nExamples\n\njulia> x = rand(Float32, 2)\n2-element Vector{Float32}:\n 0.4005432\n 0.36934233\n\njulia> zeros_like(x, (3, 3))\n3×3 Matrix{Float32}:\n 0.0 0.0 0.0\n 0.0 0.0 0.0\n 0.0 0.0 0.0\n\njulia> using CUDA\n\njulia> x = CUDA.rand(2, 2)\n2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:\n 0.0695155 0.667979\n 0.558468 0.59903\n\njulia> zeros_like(x, Float64)\n2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:\n 0.0 0.0\n 0.0 0.0\n\n\n\n\n\n","category":"function"},{"location":"reference/training/callbacks/#man-callback-helpers","page":"Callback Helpers","title":"Callback Helpers","text":"","category":"section"},{"location":"reference/training/callbacks/","page":"Callback Helpers","title":"Callback Helpers","text":"Flux.throttle","category":"page"},{"location":"reference/training/callbacks/#Flux.throttle","page":"Callback Helpers","title":"Flux.throttle","text":"throttle(f, timeout; leading=true, trailing=false)\n\nReturn a function that when invoked, will only be triggered at most once during timeout seconds.\n\nNormally, the throttled function will run as much as it can, without ever going more than once per wait duration; but if you'd like to disable the execution on the leading edge, pass leading=false. To enable execution on the trailing edge, pass trailing=true.\n\nExamples\n\njulia> a = Flux.throttle(() -> println(\"Flux\"), 2);\n\njulia> for i = 1:4 # a called in alternate iterations\n a()\n sleep(1)\n end\nFlux\nFlux\n\n\n\n\n\n","category":"function"},{"location":"reference/training/callbacks/#Patience-Helpers","page":"Callback Helpers","title":"Patience Helpers","text":"","category":"section"},{"location":"reference/training/callbacks/","page":"Callback Helpers","title":"Callback Helpers","text":"Flux provides utilities for controlling your training procedure according to some monitored condition and a maximum patience. 
For example, you can use early_stopping to stop training when the model is converging or deteriorating, or you can use plateau to check if the model is stagnating.","category":"page"},{"location":"reference/training/callbacks/","page":"Callback Helpers","title":"Callback Helpers","text":"For example, below we create a pseudo-loss function that decreases, bottoms out, and then increases. The early stopping trigger will break the loop before the loss increases too much.","category":"page"},{"location":"reference/training/callbacks/","page":"Callback Helpers","title":"Callback Helpers","text":"# create a pseudo-loss that decreases for 4 calls, then starts increasing\n# we call this like loss()\nloss = let t = 0\n () -> begin\n t += 1\n (t - 4) ^ 2\n end\nend\n\n# create an early stopping trigger\n# returns true when the loss increases for two consecutive steps\nes = early_stopping(loss, 2; init_score = 9)\n\n# this will stop at the 6th (4 decreasing + 2 increasing calls) epoch\nfor epoch in 1:10\n es() && break\nend","category":"page"},{"location":"reference/training/callbacks/","page":"Callback Helpers","title":"Callback Helpers","text":"The keyword argument distance of early_stopping is a function of the form distance(best_score, score). By default distance is -, which implies that the monitored metric f is expected to be decreasing and minimized. If you use some increasing metric (e.g. 
accuracy), you can customize the distance function: (best_score, score) -> score - best_score.","category":"page"},{"location":"reference/training/callbacks/","page":"Callback Helpers","title":"Callback Helpers","text":"# create a pseudo-accuracy that increases by 0.01 each time from 0 to 1\n# we call this like acc()\nacc = let v = 0\n () -> v = min(1, v + 0.01)\nend\n\n# create an early stopping trigger for accuracy\nes = early_stopping(acc, 3; distance = (best_score, score) -> score - best_score)\n\n# this will iterate until the 10th epoch\nfor epoch in 1:10\n es() && break\nend","category":"page"},{"location":"reference/training/callbacks/","page":"Callback Helpers","title":"Callback Helpers","text":"early_stopping and plateau are both built on top of patience. You can use patience to build your own triggers that use a patient counter. For example, if you want to trigger when the loss is below a threshold for several consecutive iterations:","category":"page"},{"location":"reference/training/callbacks/","page":"Callback Helpers","title":"Callback Helpers","text":"threshold(f, thresh, delay) = patience(delay) do\n f() < thresh\nend","category":"page"},{"location":"reference/training/callbacks/","page":"Callback Helpers","title":"Callback Helpers","text":"Both predicate in patience and f in early_stopping / plateau can accept extra arguments. 
You can pass such extra arguments to predicate or f through the returned function:","category":"page"},{"location":"reference/training/callbacks/","page":"Callback Helpers","title":"Callback Helpers","text":"trigger = patience((a; b) -> a > b, 3)\n\n# this will iterate until the 10th epoch\nfor epoch in 1:10\n trigger(1; b = 2) && break\nend\n\n# this will stop at the 3rd epoch\nfor epoch in 1:10\n trigger(3; b = 2) && break\nend","category":"page"},{"location":"reference/training/callbacks/","page":"Callback Helpers","title":"Callback Helpers","text":"Flux.patience\nFlux.early_stopping\nFlux.plateau","category":"page"},{"location":"reference/training/callbacks/#Flux.patience","page":"Callback Helpers","title":"Flux.patience","text":"patience(predicate, wait)\n\nReturn a function that internally counts by one when predicate(...) == true, otherwise the count is reset to zero. If the count is greater than or equal to wait, the function returns true, otherwise it returns false.\n\nExamples\n\njulia> loss() = rand();\n\njulia> trigger = Flux.patience(() -> loss() < 1, 3);\n\n\njulia> for i in 1:10\n @info \"Epoch $i\"\n trigger() && break\n end\n[ Info: Epoch 1\n[ Info: Epoch 2\n[ Info: Epoch 3\n\n\n\n\n\n","category":"function"},{"location":"reference/training/callbacks/#Flux.early_stopping","page":"Callback Helpers","title":"Flux.early_stopping","text":"early_stopping(f, delay; distance = -, init_score = 0, min_dist = 0)\n\nReturn a function that internally counts by one when distance(best_score, f(...)) <= min_dist, where best_score is the last seen best value of f(...). If the count is greater than or equal to delay, the function returns true, otherwise it returns false. 
The count is reset when distance(best_score, f(...)) > min_dist.\n\nExamples\n\njulia> loss = let l = 0\n () -> l += 1\n end; # pseudo loss function that returns increasing values\n\njulia> es = Flux.early_stopping(loss, 3);\n\n\njulia> for i in 1:10\n @info \"Epoch $i\"\n es() && break\n end\n[ Info: Epoch 1\n[ Info: Epoch 2\n[ Info: Epoch 3\n\n\n\n\n\n","category":"function"},{"location":"reference/training/callbacks/#Flux.plateau","page":"Callback Helpers","title":"Flux.plateau","text":"plateau(f, width; distance = -, init_score = 0, min_dist = 1f-6)\n\nReturn a function that internally counts by one when abs(distance(last_score, f(...))) <= min_dist, where last_score holds the last value of f(...). If the count is greater than or equal to width, the function returns true, otherwise it returns false. The count is reset when abs(distance(last_score, f(...))) > min_dist.\n\nExamples\n\njulia> f = let v = 10\n () -> v = v / abs(v) - v\n end; # -9, 8, -7, 6, ...\n\njulia> trigger = Flux.plateau(f, 3; init_score=10, min_dist=18);\n\n\njulia> for i in 1:10\n @info \"Epoch $i\"\n trigger() && break\n end\n[ Info: Epoch 1\n[ Info: Epoch 2\n[ Info: Epoch 3\n[ Info: Epoch 4\n\n\n\n\n\n","category":"function"},{"location":"guide/training/training/#man-training","page":"Training","title":"Training a Flux Model","text":"","category":"section"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Training refers to the process of slowly adjusting the parameters of a model to make it work better. 
Besides the model itself, we will need three things:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"An objective function that evaluates how well a model is doing on some input.\nAn optimisation rule which describes how the model's parameters should be adjusted.\nSome training data to use as the input during this process.","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Usually the training data is some collection of examples (or batches of examples) which are handled one-by-one. One epoch of training means that each example is used once, something like this:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"# Initialise the optimiser for this model:\nopt_state = Flux.setup(rule, model)\n\nfor data in train_set\n # Unpack this element (for supervised training):\n input, label = data\n\n # Calculate the gradient of the objective\n # with respect to the parameters within the model:\n grads = Flux.gradient(model) do m\n result = m(input)\n loss(result, label)\n end\n\n # Update the parameters so as to reduce the objective,\n # according to the chosen optimisation rule:\n Flux.update!(opt_state, model, grads[1])\nend","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"This loop can also be written using the function train!, but it's helpful to understand the pieces first:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"train!(model, train_set, opt_state) do m, x, y\n loss(m(x), y)\nend","category":"page"},{"location":"guide/training/training/#Model-Gradients","page":"Training","title":"Model Gradients","text":"","category":"section"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"First recall from the section on taking gradients that Flux.gradient(f, a, b) always 
calls f(a, b), and returns a tuple (∂f_∂a, ∂f_∂b). In the code above, the function f passed to gradient is an anonymous function with one argument, created by the do block, hence grads is a tuple with one element. Instead of a do block, we could have written:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"grads = Flux.gradient(m -> loss(m(input), label), model)","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Since the model is some nested set of layers, grads[1] is a similarly nested set of NamedTuples, ultimately containing gradient components. If (for example) θ = model.layers[1].weight[2,3] is one scalar parameter, an entry in a matrix of weights, then the derivative of the loss with respect to it is ∂f_∂θ = grads[1].layers[1].weight[2,3].","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"It is important that the execution of the model takes place inside the call to gradient, in order for the influence of the model's parameters to be observed by Zygote.","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"It is also important that every update! step receives a newly computed gradient, as it will change whenever the model's parameters are changed, and for each new data point.","category":"page"},{"location":"guide/training/training/#Loss-Functions","page":"Training","title":"Loss Functions","text":"","category":"section"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"The objective function must return a number representing how far the model is from the desired result. 
This is termed the loss of the model.","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"This number can be produced by any ordinary Julia code, but this must be executed within the call to gradient. For instance, we could define a function","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"loss(y_hat, y) = sum((y_hat .- y).^2)","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"or write this directly inside the do block above. Many commonly used functions, like mse for mean-squared error or crossentropy for cross-entropy loss, are available from the Flux.Losses module.","category":"page"},{"location":"guide/training/training/#Optimisation-Rules","page":"Training","title":"Optimisation Rules","text":"","category":"section"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"The simplest kind of optimisation using the gradient is termed gradient descent (or sometimes stochastic gradient descent when, as here, it is not applied to the entire dataset at once).","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Gradient descent needs a learning rate which is a small number describing how fast to walk downhill, usually written as the Greek letter \"eta\", η. This is often described as a hyperparameter, to distinguish it from the parameters which are being updated θ = θ - η * ∂loss_∂θ. 
We want to update all the parameters in the model, like this:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"η = 0.01 # learning rate\n\n# For each parameter array, update\n# according to the corresponding gradient:\nfmap(model, grads[1]) do p, g\n p .= p .- η .* g\nend","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"A slightly more refined version of this loop to update all the parameters is wrapped up as a function update!(opt_state, model, grads[1]). And the learning rate is the only thing stored in the Descent struct.","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"However, there are many other optimisation rules, which adjust the step size and direction in various clever ways. Most require some memory of the gradients from earlier steps, rather than always walking straight downhill – Momentum is the simplest. The function setup creates the necessary storage for this, for a particular model. It should be called once, before training, and returns a tree-like object which is the first argument of update!. Like this:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"# Initialise momentum \nopt_state = Flux.setup(Momentum(0.01, 0.9), model)\n\nfor data in train_set\n grads = [...]\n\n # Update both model parameters and optimiser state:\n Flux.update!(opt_state, model, grads[1])\nend","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Many commonly-used optimisation rules, such as Adam, are built-in. These are listed on the optimisers page.","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"compat: Implicit-style optimiser state\nThis setup makes another tree-like structure. 
Old versions of Flux did not do this, and instead stored a dictionary-like structure within the optimiser Adam(0.001). This was initialised on first use of the version of update! for \"implicit\" parameters.","category":"page"},{"location":"guide/training/training/#Datasets-and-Batches","page":"Training","title":"Datasets & Batches","text":"","category":"section"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"The loop above iterates through train_set, expecting at each step a tuple (input, label). The very simplest such object is a vector of tuples, such as this:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"x = randn(28, 28)\ny = rand(10)\ndata = [(x, y)]","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"or data = [(x, y), (x, y), (x, y)] for the same values three times.","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Very often, the initial data is large arrays which you need to slice into examples. To produce one iterator of pairs (x, y), you might want zip:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"X = rand(28, 28, 60_000); # many images, each 28 × 28\nY = rand(10, 60_000)\ndata = zip(eachslice(X; dims=3), eachcol(Y))\n\nfirst(data) isa Tuple{AbstractMatrix, AbstractVector} # true","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Here each iteration will use one matrix x (an image, perhaps) and one vector y. It is very common to instead train on batches of such inputs (or mini-batches, the two words mean the same thing) both for efficiency and for better results. 
This can be easily done using the DataLoader:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"data = Flux.DataLoader((X, Y), batchsize=32)\n\nx1, y1 = first(data)\nsize(x1) == (28, 28, 32)\nlength(data) == 1875 === 60_000 ÷ 32","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Flux's layers are set up to accept such a batch of input data, and the convolutional layers such as Conv require it. The batch index is always the last dimension.","category":"page"},{"location":"guide/training/training/#Training-Loops","page":"Training","title":"Training Loops","text":"","category":"section"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Simple training loops like the one above can be written compactly using the train! function. Including setup, this reads:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"opt_state = Flux.setup(Adam(), model)\n\nfor epoch in 1:100\n Flux.train!(model, train_set, opt_state) do m, x, y\n loss(m(x), y)\n end\nend","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Or explicitly writing the anonymous function which this do block creates, train!((m,x,y) -> loss(m(x),y), model, train_set, opt_state) is exactly equivalent.","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Real training loops often need more flexibility, and the best way to do this is just to write the loop. This is ordinary Julia code, without any need to work through some callback API. 
Here is an example, in which it may be helpful to note:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"The function withgradient is like gradient but also returns the value of the function, for logging or diagnostic use.\nLogging or printing is best done outside of the gradient call, as there is no need to differentiate these commands.\nTo use result for logging purposes, you could change the do block to end with return my_loss(result, label), result, i.e. make the function passed to withgradient return a tuple. The first element is always the loss.\nJulia's break and continue keywords let you exit from parts of the loop.","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"opt_state = Flux.setup(Adam(), model)\n\nmy_log = []\nfor epoch in 1:100\n losses = Float32[]\n for (i, data) in enumerate(train_set)\n input, label = data\n\n val, grads = Flux.withgradient(model) do m\n # Any code inside here is differentiated.\n # Evaluation of the model and loss must be inside!\n result = m(input)\n my_loss(result, label)\n end\n\n # Save the loss from the forward pass. (Done outside of gradient.)\n push!(losses, val)\n\n # Detect loss of Inf or NaN. 
Print a warning, and then skip update!\n if !isfinite(val)\n @warn \"loss is $val on item $i\" epoch\n continue\n end\n\n Flux.update!(opt_state, model, grads[1])\n end\n\n # Compute some accuracy, and save details as a NamedTuple\n acc = my_accuracy(model, train_set)\n push!(my_log, (; acc, losses))\n\n # Stop training when some criterion is reached\n if acc > 0.95\n println(\"stopping after $epoch epochs\")\n break\n end\nend","category":"page"},{"location":"guide/training/training/#Regularisation","page":"Training","title":"Regularisation","text":"","category":"section"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"The term regularisation covers a wide variety of techniques aiming to improve the result of training. This is often done to avoid overfitting.","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Some of these can be implemented by simply modifying the loss function. L₂ regularisation (sometimes called ridge regression) adds to the loss a penalty proportional to θ^2 for every scalar parameter. A very simple model could be implemented as follows:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"grads = Flux.gradient(densemodel) do m\n result = m(input)\n penalty = sum(abs2, m.weight)/2 + sum(abs2, m.bias)/2\n my_loss(result, label) + 0.42f0 * penalty\nend","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Accessing each individual parameter array by hand won't work well for large models. 
Instead, we can use Flux.trainables to collect all of them, and then apply a function to each one, and sum the result:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"pen_l2(x::AbstractArray) = sum(abs2, x)/2\n\ngrads = Flux.gradient(model) do m\n result = m(input)\n penalty = sum(pen_l2, Flux.trainables(m))\n my_loss(result, label) + 0.42f0 * penalty\nend","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"However, the gradient of this penalty term is very simple: It is proportional to the original weights. So there is a simpler way to implement exactly the same thing, by modifying the optimiser instead of the loss function. This is done by replacing this:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"opt_state = Flux.setup(Adam(0.1), model)","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"with this:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"decay_opt_state = Flux.setup(OptimiserChain(WeightDecay(0.42), Adam(0.1)), model)","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Flux's optimisers are really modifications applied to the gradient before using it to update the parameters, and OptimiserChain applies two such modifications. The first, WeightDecay adds 0.42 times the original parameter to the gradient, matching the gradient of the penalty above (with the same, unrealistically large, constant). After that, in either case, Adam computes the final update.","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"The same trick works for L₁ regularisation (also called Lasso), where the penalty is pen_l1(x::AbstractArray) = sum(abs, x) instead. 
This is implemented by SignDecay(0.42).","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"The same OptimiserChain mechanism can be used for other purposes, such as gradient clipping with ClipGrad or ClipNorm.","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Besides L1 / L2 / weight decay, another common and quite different kind of regularisation is provided by the Dropout layer. This turns off some outputs of the previous layer during training. It should switch automatically, but see trainmode! / testmode! to manually enable or disable this layer.","category":"page"},{"location":"guide/training/training/#Learning-Rate-Schedules","page":"Training","title":"Learning Rate Schedules","text":"","category":"section"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"For finer control of training, you may wish to alter the learning rate mid-way through training. This can be done with adjust!, like this:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"opt_state = Flux.setup(Adam(0.1), model) # initialise once\n\nfor epoch in 1:1000\n train!([...], opt_state) # Train with η = 0.1 for the first 100 epochs,\n if epoch == 100 # then change to use η = 0.01 for the rest.\n Flux.adjust!(opt_state, 0.01)\n end\nend","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Other hyper-parameters can also be adjusted, such as Flux.adjust!(opt_state, beta = (0.8, 0.99)). And such modifications can be applied to just one part of the model. 
For instance, this sets a different learning rate for the encoder and the decoder:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"# Consider some model with two parts:\nbimodel = Chain(enc = [...], dec = [...])\n\n# This returns a tree whose structure matches the model:\nopt_state = Flux.setup(Adam(0.02), bimodel)\n\n# Adjust the learning rate to be used for bimodel.layers.enc\nFlux.adjust!(opt_state.layers.enc, 0.03)","category":"page"},{"location":"guide/training/training/#Freezing-layer-parameters","page":"Training","title":"Freezing layer parameters","text":"","category":"section"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"To completely disable training of some part of the model, use freeze!. This is a temporary modification, reversed by thaw!:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Flux.freeze!(opt_state.layers.enc)\n\n# Now training won't update parameters in bimodel.layers.enc\ntrain!(loss, bimodel, data, opt_state)\n\n# Un-freeze the entire model:\nFlux.thaw!(opt_state)","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"While adjust! and freeze!/thaw! 
make temporary modifications to the optimiser state, permanently removing some fields of a new layer type from training is usually done when defining the layer, by calling for example @layer NewLayer trainable=(weight,).","category":"page"},{"location":"reference/models/activation/#man-activations","page":"Activation Functions","title":"Activation Functions from NNlib.jl","text":"","category":"section"},{"location":"reference/models/activation/","page":"Activation Functions","title":"Activation Functions","text":"These non-linearities used between layers of your model are exported by the NNlib package.","category":"page"},{"location":"reference/models/activation/","page":"Activation Functions","title":"Activation Functions","text":"Note that, unless otherwise stated, activation functions operate on scalars. To apply them to an array you can call σ.(xs), relu.(xs) and so on. Alternatively, they can be passed to a layer like Dense(784 => 1024, relu) which will handle this broadcasting.","category":"page"},{"location":"reference/models/activation/","page":"Activation Functions","title":"Activation Functions","text":"Functions like softmax are sometimes described as activation functions, but not by Flux. They must see all the outputs, and hence cannot be broadcasted. See the next page for details.","category":"page"},{"location":"reference/models/activation/#Alphabetical-Listing","page":"Activation Functions","title":"Alphabetical Listing","text":"","category":"section"},{"location":"reference/models/activation/","page":"Activation Functions","title":"Activation Functions","text":"celu\nelu\ngelu\nhardsigmoid\nhardswish\nhardtanh\nleakyrelu\nlisht\nlogcosh\nlogsigmoid\nmish\nrelu\nrelu6\nrrelu\nselu\nsigmoid\nsigmoid_fast\nsoftplus\nsoftshrink\nsoftsign\nswish\ntanhshrink\ntanh_fast\ntrelu","category":"page"},{"location":"reference/models/activation/#NNlib.celu","page":"Activation Functions","title":"NNlib.celu","text":"celu(x, α=1) = x ≥ 0 ? 
x : α * (exp(x/α) - 1)\n\nActivation function from \"Continuously Differentiable Exponential Linear Units\".\n\njulia> lineplot(celu, -2, 2, height=7)\n ┌────────────────────────────────────────┐ \n 2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠤⠒⠉│ celu(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠔⠊⠉⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⣀⡤⠖⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⣀⠤⠖⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⢤⡤⡧⠶⠭⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⣀⠤⠔⠒⠋⠁⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -1 │⠤⠤⠤⠤⠔⠒⠒⠒⠊⠉⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> celu(-10f0)\n-0.9999546f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.elu","page":"Activation Functions","title":"NNlib.elu","text":"elu(x, α=1) = x > 0 ? x : α * (exp(x) - 1)\n\nExponential Linear Unit activation function. See \"Fast and Accurate Deep Network Learning by Exponential Linear Units\". You can also specify the coefficient explicitly, e.g. 
elu(x, 1).\n\njulia> lineplot(elu, -2, 2, height=7)\n ┌────────────────────────────────────────┐ \n 2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠤⠒⠉│ elu(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠔⠊⠉⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⣀⡤⠖⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⣀⠤⠖⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⢤⡤⡧⠶⠭⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⣀⠤⠔⠒⠋⠁⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -1 │⠤⠤⠤⠤⠔⠒⠒⠒⠊⠉⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> elu(-10f0)\n-0.9999546f0\n\njulia> elu(-10f0, 2)\n-1.9999092f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.gelu","page":"Activation Functions","title":"NNlib.gelu","text":"gelu(x) = 0.5x * (1 + tanh(√(2/π) * (x + 0.044715x^3)))\n\nActivation function from \"Gaussian Error Linear Units\".\n\njulia> lineplot(gelu, -2, 2, height=7)\n ┌────────────────────────────────────────┐ \n 2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠔⠊│ gelu(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔⠊⠁⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⣀⠤⠒⠉⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⣀⡠⠤⠒⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⣤⣤⣤⣤⣤⣤⣤⣤⡤⠤⠤⠤⠤⠤⠤⠤⣤⣤⣤⡤⡧⠶⠶⠭⠥⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠈⠉⠉⠉⠉⠉⠉⠉⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> lineplot(gelu, -5, 0, height=7);\n\njulia> lineplot!(ans, swish)\n ┌────────────────────────────────────────┐ \n 0 │⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠒⠒⠤⣄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸│ gelu(x) \n │⠑⠒⠢⠤⣄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠓⢄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇│ swish(x)\n │⠀⠀⠀⠀⠀⠈⠉⠒⠤⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠑⢆⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣸⠁│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠒⢄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠑⢄⠀⠀⠀⠀⠀⠀⠀⠀⢠⡇⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠓⢄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠓⣄⠀⠀⠀⠀⠀⢠⡞⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠦⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠓⢄⣀⣀⡤⢣⠃⠀⠀│ \n -0.2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠓⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢠⠇⠀⠀⠀│ \n └────────────────────────────────────────┘ \n 
⠀-5⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀0⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.hardsigmoid","page":"Activation Functions","title":"NNlib.hardsigmoid","text":"hardσ(x) = max(0, min(1, (x + 3) / 6))\n\nPiecewise linear approximation of sigmoid.\n\njulia> lineplot(hardsigmoid, -5, 5, height=7)\n ┌────────────────────────────────────────┐ \n 1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⢀⡠⠖⠋⠉⠉⠉⠉⠉⠉⠉⠉│ hardσ(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⣀⡤⠒⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⡠⠔⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⡗⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔⠊⠁⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡤⠖⠋⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n 0 │⣀⣀⣀⣀⣀⣀⣀⣀⣀⠤⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-5⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀5⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> lineplot(sigmoid, -5, 5, height=7)\n ┌────────────────────────────────────────┐ \n 1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⣀⡠⠤⠖⠒⠒⠋⠉⠉⠉⠉⠉⠉│ σ(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⢀⡠⠖⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⣀⠔⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⡏⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡔⠋⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠤⠊⠁⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n 0 │⣀⣀⣀⣀⣀⣀⣀⠤⠤⠤⠒⠊⠉⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-5⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀5⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.hardswish","page":"Activation Functions","title":"NNlib.hardswish","text":"hardswish(x) = x * hardσ(x)\n\nHard-Swish activation function. 
See \"Searching for MobileNetV3\".\n\njulia> lineplot(hardswish, -2, 5, height = 7)\n ┌────────────────────────────────────────┐ \n 5 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠔⠒⠉│ hardswish(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡠⠔⠒⠉⠁⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠤⠖⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠔⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⢀⣀⠤⠖⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣇⣤⣤⣖⣚⣉⣁⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀│ \n -1 │⠉⠒⠒⠒⠒⠉⠉⠉⠉⠁⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀5⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> lineplot(hardswish, -4, 0, height = 7);\n\njulia> lineplot!(ans, swish)\n ┌────────────────────────────────────────┐ \n 0 │⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⢣⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡜│ hardswish(x)\n │⠒⠒⠢⠤⢄⣀⡀⠀⠀⠀⠀⠱⡄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⠎⠀│ swish(x) \n │⠀⠀⠀⠀⠀⠀⠈⠉⠑⠒⠦⢄⣘⢄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡴⠃⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠑⡖⠦⢄⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⢔⠏⠁⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠣⣄⠀⠉⠑⠒⠦⠤⢄⣀⣀⣀⣀⡠⠤⠖⣊⠕⠁⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠓⠤⡀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡤⠖⠁⠀⠀⠀⠀⠀⠀⠀│ \n -0.4 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠉⠒⠢⠤⠤⠔⠒⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-4⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀0⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> hardswish.(-5:5)'\n1×11 adjoint(::Vector{Float64}) with eltype Float64:\n -0.0 -0.0 -0.0 -0.333333 -0.333333 0.0 0.666667 1.66667 3.0 4.0 5.0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.hardtanh","page":"Activation Functions","title":"NNlib.hardtanh","text":"hardtanh(x) = max(-1, min(1, x))\n\nSegment-wise linear approximation of tanh, much cheaper to compute. 
See \"Large Scale Machine Learning\".\n\nSee also tanh_fast.\n\njulia> lineplot(hardtanh, -2, 2, height=7)\n ┌────────────────────────────────────────┐ \n 1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⣀⠔⠋⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉│ hardtanh(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⣀⡤⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⢀⡤⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⢤⡤⡷⠥⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⠖⠁⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⠖⠋⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -1 │⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⠔⠋⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x\n\njulia> lineplot(tanh, -2, 2, height=7)\n ┌────────────────────────────────────────┐ \n 1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⣀⡠⠤⠤⠒⠒⠒⠊⠉⠉⠉│ tanh(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⢀⡠⠔⠊⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⢀⡤⠒⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⢤⡤⡷⠥⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡤⠖⠁⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡠⠔⠊⠁⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -1 │⣀⣀⣀⡠⠤⠤⠤⠖⠒⠊⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.leakyrelu","page":"Activation Functions","title":"NNlib.leakyrelu","text":"leakyrelu(x, a=0.01) = max(a*x, x)\n\nLeaky Rectified Linear Unit activation function. You can also specify the coefficient explicitly, e.g. 
leakyrelu(x, 0.01).\n\njulia> lineplot(x -> leakyrelu(x, 0.5), -2, 2, height=7)\n ┌────────────────────────────────────────┐ \n 2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠤⠒⠉│ #42(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠔⠊⠉⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⣀⡤⠖⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⣀⠤⠖⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⣤⣤⡤⡧⠶⠭⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⣀⠤⠤⠒⠒⠋⠉⠁⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -1 │⣀⣀⠤⠤⠒⠒⠊⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> leakyrelu(-10f0, 0.2)\n-2.0f0\n\njulia> leakyrelu(-10f0, 0.02)\n-0.2f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.lisht","page":"Activation Functions","title":"NNlib.lisht","text":"lisht(x) = x * tanh(x)\n\nActivation function from \"LiSHT: Non-Parametric Linearly Scaled Hyperbolic Tangent ...\"\n\njulia> lineplot(lisht, -2, 2, height=7)\n ┌────────────────────────────────────────┐ \n 2 │⠢⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔│ lisht(x)\n │⠀⠈⠑⢦⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡤⠊⠁⠀│ \n │⠀⠀⠀⠀⠈⠣⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔⠁⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠑⢆⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⠊⠁⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠢⡄⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⢀⠔⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠓⢄⡀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⢀⡠⠖⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n 0 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠓⠦⣄⣀⣀⣇⣀⣀⠤⠒⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> lineplot!(ans, logcosh)\n ┌────────────────────────────────────────┐ \n 2 │⠢⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔│ lisht(x) \n │⠀⠈⠑⢦⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡤⠊⠁⠀│ logcosh(x)\n │⠢⣄⠀⠀⠈⠣⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔⠁⠀⠀⣀⠔│ \n f(x) │⠀⠈⠑⠢⣀⠀⠀⠑⢆⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⠊⠁⠀⣀⠔⠊⠁⠀│ \n │⠀⠀⠀⠀⠀⠉⠢⢄⡀⠉⠢⡄⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⢀⠔⠋⠀⡠⠔⠋⠁⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠉⠓⠦⣌⡓⢄⡀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⢀⡠⠖⣁⠤⠒⠉⠀⠀⠀⠀⠀⠀⠀⠀│ \n 0 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠓⠪⠷⣦⣄⣀⣀⣇⣀⣀⣤⠶⠕⠒⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n 
⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.logcosh","page":"Activation Functions","title":"NNlib.logcosh","text":"logcosh(x)\n\nReturn log(cosh(x)) which is computed in a numerically stable way.\n\njulia> lineplot(logcosh, -5, 5, height=7)\n ┌────────────────────────────────────────┐ \n 5 │⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ logcosh(x)\n │⠉⠢⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔⠋│ \n │⠀⠀⠀⠑⠢⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔⠊⠁⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠑⠦⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠊⠁⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠑⠦⡀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⢀⡤⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠓⠦⡀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⢀⡤⠒⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n 0 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠑⠢⢄⣀⣀⣇⣀⡠⠔⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-5⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀5⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.logsigmoid","page":"Activation Functions","title":"NNlib.logsigmoid","text":"logσ(x)\n\nReturn log(σ(x)) which is computed in a numerically stable way.\n\njulia> lineplot(logsigmoid, -5, 5, height=7)\n ┌────────────────────────────────────────┐ \n 0 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡧⠤⠔⠒⠒⠒⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉│ logσ(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠖⠊⠉⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠤⠒⠉⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⢀⡤⠖⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⣀⠔⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⡤⠖⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -6 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-5⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀5⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.mish","page":"Activation Functions","title":"NNlib.mish","text":"mish(x) = x * tanh(softplus(x))\n\nActivation function from \"Mish: A Self Regularized Non-Monotonic Neural Activation Function\".\n\njulia> 
lineplot(mish, -5, 5, height=7)\n ┌────────────────────────────────────────┐ \n 5 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠖⠋│ mish(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠒⠁⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⠔⠋⠁⠀⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⢀⡠⠖⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⢀⡤⠖⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣧⣔⣊⣁⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀│ \n -1 │⠀⠀⠀⠀⠀⠀⠀⠀⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-5⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀5⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.relu","page":"Activation Functions","title":"NNlib.relu","text":"relu(x) = max(0, x)\n\nRectified Linear Unit activation function.\n\njulia> lineplot(relu, -2, 2, height=7)\n ┌────────────────────────────────────────┐ \n 2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔⠋│ relu(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠊⠁⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡤⠊⠁⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⢀⡤⠖⠁⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⡠⠖⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⡠⠖⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n 0 │⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣇⠔⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.relu6","page":"Activation Functions","title":"NNlib.relu6","text":"relu6(x) = min(max(0, x), 6)\n\nRectified Linear Unit activation function capped at 6. 
See \"Convolutional Deep Belief Networks\" from CIFAR-10.\n\njulia> lineplot(relu6, -10, 10, height=7)\n ┌────────────────────────────────────────┐ \n 6 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠎⠉⠉⠉⠉⠉⠉⠉⠉│ relu6(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⢀⡔⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⡤⠃⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⡠⠎⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⢀⠖⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⡔⠃⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n 0 │⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⡧⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-10⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀10⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.rrelu","page":"Activation Functions","title":"NNlib.rrelu","text":"rrelu(x, lo=1/8, hi=1/3) = max(a*x, x)\n# where `a` is randomly sampled from uniform distribution `U(lo, hi)`\n\nRandomized Leaky Rectified Linear Unit activation function. See \"Empirical Evaluation of Rectified Activations\". You can also specify the bounds explicitly, e.g. rrelu(x, 0.0, 1.0).\n\njulia> lineplot(rrelu, -20, 10, height=7)\n ┌────────────────────────────────────────┐ \n 10 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠖⠋│ rrelu(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⢀⡠⠖⠋⠁⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⠀⢀⡠⠔⠊⠁⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⢤⡤⠤⣤⣤⢤⣤⣤⠤⠤⠤⢼⠮⠥⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⣰⢀⣆⡄⣄⡄⡠⡰⠦⠷⡜⢢⠷⠳⠢⠊⠉⠉⠀⠀⠁⠀⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠃⠉⠙⠘⠃⠈⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -10 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-20⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀10⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> extrema(rrelu.(fill(-10f0, 1000)))\n(-3.3316886f0, -1.2548422f0)\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.selu","page":"Activation Functions","title":"NNlib.selu","text":"selu(x) = λ * (x ≥ 0 ? x : α * (exp(x) - 1))\n\nλ ≈ 1.05070...\nα ≈ 1.67326...\n\nScaled exponential linear units. 
See \"Self-Normalizing Neural Networks\".\n\njulia> lineplot(selu, -3, 2, height=7)\n ┌────────────────────────────────────────┐ \n 3 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ selu(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠤⠒│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⢀⣀⠤⠖⠊⠉⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⣀⡠⠤⠒⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⣉⠭⠛⡏⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⣀⡤⠤⠒⠊⠉⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -2 │⠤⠤⠖⠒⠒⠒⠒⠒⠒⠒⠉⠉⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-3⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> selu(-10f0)\n-1.7580194f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.sigmoid","page":"Activation Functions","title":"NNlib.sigmoid","text":"σ(x) = 1 / (1 + exp(-x))\n\nClassic sigmoid activation function. Unicode σ can be entered as \\sigma then tab, in many editors. The ascii name sigmoid is also exported.\n\nSee also sigmoid_fast.\n\njulia> using UnicodePlots\n\njulia> lineplot(sigmoid, -5, 5, height=7)\n ┌────────────────────────────────────────┐ \n 1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⣀⡠⠤⠖⠒⠒⠋⠉⠉⠉⠉⠉⠉│ σ(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⢀⡠⠖⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⣀⠔⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⡏⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡔⠋⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠤⠊⠁⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n 0 │⣀⣀⣀⣀⣀⣀⣀⠤⠤⠤⠒⠊⠉⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-5⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀5⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> sigmoid === σ\ntrue\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.sigmoid_fast","page":"Activation Functions","title":"NNlib.sigmoid_fast","text":"sigmoid_fast(x)\n\nThis is a faster, and very slightly less accurate, version of sigmoid. 
For `x::Float32, perhaps 3 times faster, and maximum errors 2 eps instead of 1.\n\nSee also tanh_fast.\n\njulia> sigmoid(0.2f0)\n0.54983395f0\n\njulia> sigmoid_fast(0.2f0)\n0.54983395f0\n\njulia> hardσ(0.2f0)\n0.53333336f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.softplus","page":"Activation Functions","title":"NNlib.softplus","text":"softplus(x) = log(exp(x) + 1)\n\nSee \"Deep Sparse Rectifier Neural Networks\", JMLR 2011.\n\njulia> lineplot(softplus, -3, 3, height=7)\n ┌────────────────────────────────────────┐ \n 4 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ softplus(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠔⠊⠁⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠔⠊⠁⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⣀⡠⠤⠒⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⣀⡧⠤⠒⠊⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n 0 │⣀⣀⣀⣀⣀⣀⣀⡠⠤⠤⠤⠤⠔⠒⠒⠚⠉⠉⠁⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-3⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀3⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> lineplot!(ans, relu)\n ┌────────────────────────────────────────┐ \n 4 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ softplus(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣠│ relu(x) \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣠⡴⠞⠋⠁│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⣤⡴⠞⠋⠁⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⣀⡠⢤⡲⠝⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⣀⡧⠤⠒⠊⣉⠥⠚⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n 0 │⣀⣀⣀⣀⣀⣀⣀⣠⣤⣤⣤⣤⣔⣒⣒⣚⣉⣉⣁⣀⣇⠴⠒⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-3⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀3⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> softplus(16f0)\n16.0f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.softshrink","page":"Activation Functions","title":"NNlib.softshrink","text":"softshrink(x, λ=0.5) =\n (x ≥ λ ? x - λ : (-λ ≥ x ? 
x + λ : 0))\n\nSee \"Softshrink Activation Function\".\n\njulia> lineplot(softshrink, -2, 2, height=7)\n ┌────────────────────────────────────────┐ \n 2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀│ softshrink(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⡤⠔⠒⠉⠁│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⣀⡤⠤⠒⠋⠁⠀⠀⠀⠀⠀⠀│ \n f(x) │⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⣤⡤⠤⠤⠤⠤⠤⠤⡧⠤⠤⠤⠤⠶⠮⠭⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠀⠀⠀⠀⠀⠀⢀⣀⠤⠖⠒⠉⠁⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⣀⠤⠔⠒⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -2 │⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> lineplot!(ans, tanhshrink)\n ┌────────────────────────────────────────┐ \n 2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀│ softshrink(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⡤⠔⠒⣉⡡│ tanhshrink(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⣀⡤⠤⣒⣋⠥⠤⠒⠊⠉⠁⠀│ \n f(x) │⠤⠤⠤⠤⠤⠤⠤⠤⠤⣤⣤⣤⣤⡤⠤⠤⠤⠤⠤⠤⡷⠶⠶⠶⠶⠶⠾⠿⠯⠭⠭⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠀⢀⣀⡠⠤⠖⢒⣋⠭⠗⠒⠉⠁⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠊⣉⠤⠔⠒⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -2 │⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀\n\njulia> softshrink.((-10f0, 10f0))\n(-9.5f0, 9.5f0)\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.softsign","page":"Activation Functions","title":"NNlib.softsign","text":"softsign(x) = x / (1 + |x|)\n\nSee \"Quadratic Polynomials Learn Better Image Features\" (2009).\n\njulia> lineplot(softsign, -5, 5, height=7)\n ┌────────────────────────────────────────┐ \n 1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⣀⣀⣀⣀⠤⠤⠤⠤⠤│ softsign(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⣀⡤⠖⠒⠋⠉⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⡔⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⢤⡯⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔⠁⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⣀⣀⠤⠤⠒⠋⠁⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -1 │⠒⠒⠒⠒⠒⠊⠉⠉⠉⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n 
⠀-5⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀5⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> lineplot!(ans, tanh)\n ┌────────────────────────────────────────┐ \n 1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⢀⡤⠖⠊⠉⠉⠉⣉⣉⣉⣉⣉⠭⠭⠭⠭⠭│ softsign(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⡔⣃⡤⠖⠒⠋⠉⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀│ tanh(x) \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣧⡞⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⢤⡯⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡴⠃⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⣀⣀⠤⠤⠒⢋⠕⠁⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -1 │⣒⣒⣒⣒⣒⣊⣉⣉⣉⣉⣁⣀⣀⡠⠤⠒⠁⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-5⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀5⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> softsign(1f0)\n0.5f0\n\njulia> softsign(100f0)\n0.990099f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.swish","page":"Activation Functions","title":"NNlib.swish","text":"swish(x) = x * σ(x)\n\nSelf-gated activation function. See \"Swish: a Self-Gated Activation Function\".\n\njulia> lineplot(swish, -2, 2, height=7)\n ┌────────────────────────────────────────┐ \n 2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤│ swish(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠖⠋⠁⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠤⠖⠋⠁⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⢀⣀⡤⠔⠊⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⢤⣤⣤⡤⡧⠴⠶⠯⠥⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠉⠑⠒⠒⠒⠒⠒⠒⠒⠒⠒⠒⠉⠉⠉⠉⠁⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.tanhshrink","page":"Activation Functions","title":"NNlib.tanhshrink","text":"tanhshrink(x) = x - tanh(x)\n\nSee \"Tanhshrink Activation Function\".\n\njulia> lineplot(tanhshrink, -3, 3, height=7)\n ┌────────────────────────────────────────┐ \n 3 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ tanhshrink(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡠⠤⠖⠊│ \n 
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⢀⣀⡠⠤⠒⠊⠉⠁⠀⠀⠀⠀│ \n f(x) │⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⢤⣤⡤⠤⠤⠤⠤⠤⠤⡷⠶⠶⠶⠶⠶⠮⠭⠥⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠀⠀⠀⠀⠀⣀⡠⠴⠒⠊⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⡠⠴⠒⠊⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -3 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-3⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀3⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> tanhshrink.((-10f0, 10f0))\n(-9.0f0, 9.0f0)\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.tanh_fast","page":"Activation Functions","title":"NNlib.tanh_fast","text":"tanh_fast(x)\n\nThis is a faster but slighly less accurate version of tanh.\n\nWhere Julia's tanh function has an error under 2 eps, this may be wrong by 5 eps, a reduction by less than one decimal digit. \n\nFor x::Float32 this is usually about 10 times faster, with a smaller speedup for x::Float64. For any other number types, it just calls tanh.\n\nSee also sigmoid_fast.\n\njulia> tanh(0.5f0)\n0.46211717f0\n\njulia> tanh_fast(0.5f0)\n0.46211714f0\n\njulia> hard_tanh(0.5f0)\n0.5f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.trelu","page":"Activation Functions","title":"NNlib.trelu","text":"trelu(x, theta=1) = x > theta ? x : 0\n\nThreshold gated rectified linear activation function. 
See \"Zero-bias autoencoders and the benefits of co-adapting features\"\n\njulia> lineplot(trelu, -2, 4, height=7)\n ┌────────────────────────────────────────┐ \n 4 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠖⠋│ trelu(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠖⠋⠁⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠔⠊⠁⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠴⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⣠⠤⠒⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⡏⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n 0 │⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣇⣀⣀⣀⣀⣀⣀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀4⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#One-More","page":"Activation Functions","title":"One More","text":"","category":"section"},{"location":"reference/models/activation/","page":"Activation Functions","title":"Activation Functions","text":"Julia's Base.Math also provides tanh, which can be used as an activation function.","category":"page"},{"location":"reference/models/activation/","page":"Activation Functions","title":"Activation Functions","text":"Note that many Flux layers will automatically replace this with NNlib.tanh_fast when called, as Base's tanh is slow enough to sometimes be a bottleneck.","category":"page"},{"location":"reference/models/activation/","page":"Activation Functions","title":"Activation Functions","text":"julia> using UnicodePlots\n\njulia> lineplot(tanh, -3, 3, height=7)\n ┌────────────────────────────────────────┐ \n 1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⣀⠤⠔⠒⠒⠉⠉⠉⠉⠉⠉⠉⠉⠉│ tanh(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⡠⠖⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⡰⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⡤⡯⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⠎⠁⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠴⠊⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -1 │⣀⣀⣀⣀⣀⣀⣀⣀⣀⡤⠤⠔⠒⠉⠁⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-3⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀3⠀ \n 
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ ","category":"page"},{"location":"ecosystem/#The-Julia-Ecosystem-around-Flux","page":"Ecosystem","title":"The Julia Ecosystem around Flux","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"One of the main strengths of Julia lies in an ecosystem of packages globally providing a rich and consistent user experience.","category":"page"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"This is a non-exhaustive list of Julia packages, nicely complementing Flux in typical machine learning and deep learning workflows. To add your project please send a PR. See also academic work citing Flux or citing Zygote.","category":"page"},{"location":"ecosystem/#Flux-models","page":"Ecosystem","title":"Flux models","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"Flux's model-zoo contains examples from many domains.","category":"page"},{"location":"ecosystem/#Computer-vision","page":"Ecosystem","title":"Computer vision","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"ObjectDetector.jl provides ready-to-go image detection via YOLO.\nMetalhead.jl includes many state-of-the-art computer vision models which can easily be used for transfer learning.\nUNet.jl is a generic UNet implementation.","category":"page"},{"location":"ecosystem/#Natural-language-processing","page":"Ecosystem","title":"Natural language processing","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"Transformers.jl provides components for Transformer models for NLP, as well as providing several trained models out of the box.\nTextAnalysis.jl provides several NLP algorithms that use Flux models under the hood.","category":"page"},{"location":"ecosystem/#Reinforcement-learning","page":"Ecosystem","title":"Reinforcement 
learning","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"AlphaZero.jl provides a generic, simple and fast implementation of Deepmind's AlphaZero algorithm.\nReinforcementLearning.jl offers a collection of tools for doing reinforcement learning research in Julia.","category":"page"},{"location":"ecosystem/#Graph-learning","page":"Ecosystem","title":"Graph learning","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"GraphNeuralNetworks.jl is a fresh, performant and flexible graph neural network library based on Flux.jl.\nGeometricFlux.jl is the first graph neural network library for julia. \nNeuralOperators.jl enables training infinite dimensional PDEs by learning a continuous function instead of using the finite element method.\nSeaPearl.jl is a Constraint Programming solver that uses Reinforcement Learning based on graphs as input.","category":"page"},{"location":"ecosystem/#Time-series","page":"Ecosystem","title":"Time series","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"FluxArchitectures.jl is a collection of advanced network architectures for time series forecasting.","category":"page"},{"location":"ecosystem/#Robust-networks","page":"Ecosystem","title":"Robust networks","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"RobustNeuralNetworks.jl includes classes of neural networks that are constructed to naturally satisfy robustness constraints.","category":"page"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"","category":"page"},{"location":"ecosystem/#Tools-closely-associated-with-Flux","page":"Ecosystem","title":"Tools closely associated with Flux","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"Utility tools you're unlikely to have met if you never used 
Flux!","category":"page"},{"location":"ecosystem/#High-level-training-flows","page":"Ecosystem","title":"High-level training flows","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"FastAI.jl is a Julia port of Python's fast.ai library.\nFluxTraining.jl is a package for using and writing powerful, extensible training loops for deep learning models. It supports callbacks for many common use cases like hyperparameter scheduling, metrics tracking and logging, checkpointing, early stopping, and more. It powers training in FastAI.jl.\nIgnite.jl is a Julia port of the Python library ignite for simplifying neural network training and validation loops, using events and handlers.\nTsunami.jl adds high-level ways to control training, parameter schedules & logging, heavily inspired by pytorch-lightning.","category":"page"},{"location":"ecosystem/#Datasets","page":"Ecosystem","title":"Datasets","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"Commonly used machine learning datasets are provided by the following packages in the Julia ecosystem:","category":"page"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"MLDatasets.jl focuses on downloading, unpacking, and accessing benchmark datasets.\nGraphMLDatasets.jl is a library for machine learning datasets on graphs.","category":"page"},{"location":"ecosystem/#Plumbing","page":"Ecosystem","title":"Plumbing","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"Tools to put data into the right order for creating a model.","category":"page"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"Augmentor.jl is a real-time image augmentation library for increasing the number of training images.\nDataAugmentation.jl aims to make it easy to build stochastic, label-preserving augmentation pipelines for vision use cases involving images, 
keypoints and segmentation masks.\nMLUtils.jl (replaces MLDataUtils.jl and MLLabelUtils.jl) is a library for processing Machine Learning datasets.","category":"page"},{"location":"ecosystem/#Parameters","page":"Ecosystem","title":"Parameters","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"ParameterSchedulers.jl standard scheduling policies for machine learning.","category":"page"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"","category":"page"},{"location":"ecosystem/#Differentiable-programming","page":"Ecosystem","title":"Differentiable programming","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"Packages based on differentiable programming but not necessarily related to Machine Learning. ","category":"page"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"The SciML ecosystem uses Flux and Zygote to mix neural nets with differential equations, to get the best of black box and mechanistic modelling.\nDiffEqFlux.jl provides tools for creating Neural Differential Equations.\nFlux3D.jl shows off machine learning on 3D data.\nRayTracer.jl combines ML with computer vision via a differentiable renderer.\nDuckietown.jl Differentiable Duckietown simulator.\nThe Yao.jl project uses Flux and Zygote for Quantum Differentiable Programming.\nAtomicGraphNets.jl enables learning graph based models on atomic systems used in chemistry.\nDiffImages.jl differentiable computer vision modeling in Julia with the Images.jl ecosystem.","category":"page"},{"location":"ecosystem/#Probabilistic-programming","page":"Ecosystem","title":"Probabilistic programming","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"Turing.jl extends Flux's differentiable programming capabilities to probabilistic programming.\nOmega.jl is a research project aimed at causal, higher-order 
probabilistic programming.\nStheno.jl provides flexible Gaussian processes.","category":"page"},{"location":"ecosystem/#Statistics","page":"Ecosystem","title":"Statistics","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"OnlineStats.jl provides single-pass algorithms for statistics.","category":"page"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"","category":"page"},{"location":"ecosystem/#Useful-miscellaneous-packages","page":"Ecosystem","title":"Useful miscellaneous packages","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"Some useful and random packages!","category":"page"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"AdversarialPrediction.jl provides a way to easily optimise generic performance metrics in supervised learning settings using the Adversarial Prediction framework.\nMill.jl helps to prototype flexible multi-instance learning models.\nMLMetrics.jl is a utility for scoring models in data science and machine learning.\nTorch.jl exposes torch in Julia.\nValueHistories.jl is a utility for efficient tracking of optimization histories, training curves or other information of arbitrary types and at arbitrarily spaced sampling times.\nInvertibleNetworks.jl Building blocks for invertible neural networks in the Julia programming language.\nProgressMeter.jl progress meters for long-running computations.\nTensorBoardLogger.jl easy peasy logging to tensorboard in Julia\nArgParse.jl is a package for parsing command-line arguments to Julia programs.\nParameters.jl types with default field values, keyword constructors and (un-)pack macros.\nBSON.jl is a package for working with the Binary JSON serialisation format.\nDataFrames.jl in-memory tabular data in Julia.\nDrWatson.jl is a scientific project assistant 
software.","category":"page"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"This tight integration among Julia packages is shown in some of the examples in the model-zoo repository.","category":"page"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"","category":"page"},{"location":"ecosystem/#Alternatives-to-Flux","page":"Ecosystem","title":"Alternatives to Flux","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"Julia has several other libraries for making neural networks. ","category":"page"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"SimpleChains.jl is focused on making small, simple, CPU-based, neural networks fast. Uses LoopVectorization.jl. (Was FastChain in DiffEqFlux.jl) \nKnet.jl is a neural network library built around AutoGrad.jl.\nLux.jl (earlier ExplicitFluxLayers.jl) shares much of the design, use-case, and NNlib.jl / Optimisers.jl back-end of Flux. But instead of encapsulating all parameters within the model structure, it separates this into 3 components: a model, a tree of parameters, and a tree of model states.","category":"page"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"compat: Explicit or explicit?\nFlux's training docs talk about changes from Zygote's implicit to explicit gradients, dictionary-like to tree-like structures. (See also Zygote's description of these.) 
Lux also uses Zygote, but uses the word \"explicit\" to mean something unrelated, namely storing the tree of parameters (and of state) separately from the model.","category":"page"},{"location":"reference/models/functors/","page":"Nested Structures – Functors.jl","title":"Nested Structures – Functors.jl","text":"CollapsedDocStrings = true","category":"page"},{"location":"reference/models/functors/#Recursive-transformations-from-Functors.jl","page":"Nested Structures – Functors.jl","title":"Recursive transformations from Functors.jl","text":"","category":"section"},{"location":"reference/models/functors/","page":"Nested Structures – Functors.jl","title":"Nested Structures – Functors.jl","text":"Flux models are deeply nested structures, and Functors.jl provides tools needed to explore such objects, apply functions to the parameters they contain, and re-build them.","category":"page"},{"location":"reference/models/functors/","page":"Nested Structures – Functors.jl","title":"Nested Structures – Functors.jl","text":"compat: Flux ≤ 0.14\nAll layers were previously defined with the Functors.@functor macro. This still works, but it is recommended that you use the new Flux.@layer macro instead. Both allow Flux.setup to see the parameters inside, and gpu to move them to the GPU, but Flux.@layer also overloads printing, and offers a way to define trainable at the same time.","category":"page"},{"location":"reference/models/functors/","page":"Nested Structures – Functors.jl","title":"Nested Structures – Functors.jl","text":"Functors.jl has its own notes on basic usage for more details. 
Additionally, the Advanced Model Building and Customisation page covers the use cases of Functors in greater detail.","category":"page"},{"location":"reference/models/functors/","page":"Nested Structures – Functors.jl","title":"Nested Structures – Functors.jl","text":"Flux.@layer\nFunctors.@functor\nFunctors.fmap\nFunctors.fmap_with_path\nFunctors.isleaf\nFunctors.children\nFunctors.fcollect\nFunctors.functor\nFunctors.fmapstructure\nFunctors.fmapstructure_with_path\nFunctors.execute\nFunctors.AbstractWalk\nFunctors.ExcludeWalk\nFunctors.CachedWalk","category":"page"},{"location":"reference/models/functors/#Flux.@layer","page":"Nested Structures – Functors.jl","title":"Flux.@layer","text":"@layer Dense\n@layer :expand Chain\n@layer BatchNorm trainable=(β,γ)\n\nThis macro replaces most uses of @functor. Its basic purpose is the same: When you define a new layer, this tells Flux to explore inside it to see the parameters it trains, and also to move them to the GPU, change precision, etc.\n\nLike @functor, this assumes your struct has the default constructor, to enable re-building. If you define an inner constructor (i.e. a function within the struct block) things may break.\n\nThe keyword trainable allows you to limit this exploration, instead of visiting all fieldnames(T). 
Note that it is never necessary to tell Flux to ignore non-array objects such as functions or sizes.\n\nThe macro also handles overloads of show for pretty printing.\n\nBy default, it adds methods to 3-arg Base.show to treat your layer much like Dense or Conv.\nIf your layer is a container, more like Chain or Parallel, then :expand makes show unfold its contents.\nTo disable all show overloads, there is an :ignore option too.\n\n(You probably still want to define 2-arg show(io::IO, x::Layer), the macro does not touch this.)\n\nNote that re-running the macro with different options may not remove all methods, you will need to restart.\n\nExample\n\njulia> struct Trio; a; b; c end\n\njulia> tri = Trio(Dense([1.1 2.2], [0.0], tanh), Dense(hcat(3.3), false), Dropout(0.4))\nTrio(Dense(2 => 1, tanh), Dense(1 => 1; bias=false), Dropout(0.4))\n\njulia> Flux.destructure(tri) # parameters are not yet visible to Flux\n(Bool[], Restructure(Trio, ..., 0))\n\njulia> Flux.@layer :expand Trio\n\njulia> Flux.destructure(tri) # now gpu, params, train!, etc will see inside too\n([1.1, 2.2, 0.0, 3.3], Restructure(Trio, ..., 4))\n\njulia> tri # and layer is printed like Chain\nTrio(\n Dense(2 => 1, tanh), # 3 parameters\n Dense(1 => 1; bias=false), # 1 parameters\n Dropout(0.4),\n) # Total: 3 arrays, 4 parameters, 224 bytes.\n\n\n\n\n\n","category":"macro"},{"location":"reference/models/functors/#Functors.@functor","page":"Nested Structures – Functors.jl","title":"Functors.@functor","text":"@functor T\n@functor T (x,)\n\nAdds methods to functor allowing recursion into objects of type T, and reconstruction. 
Assumes that T has a constructor accepting all of its fields, which is true unless you have provided an inner constructor which does not.\n\nBy default all fields of T are considered children; this can be restricted by providing a tuple of field names.\n\nExamples\n\njulia> struct Foo; x; y; end\n\njulia> @functor Foo\n\njulia> Functors.children(Foo(1,2))\n(x = 1, y = 2)\n\njulia> _, re = Functors.functor(Foo(1,2));\n\njulia> re((10, 20))\nFoo(10, 20)\n\njulia> struct TwoThirds a; b; c; end\n\njulia> @functor TwoThirds (a, c)\n\njulia> ch2, re3 = Functors.functor(TwoThirds(10,20,30));\n\njulia> ch2\n(a = 10, c = 30)\n\njulia> re3((\"ten\", \"thirty\"))\nTwoThirds(\"ten\", 20, \"thirty\")\n\njulia> fmap(x -> 10x, TwoThirds(Foo(1,2), Foo(3,4), 56))\nTwoThirds(Foo(10, 20), Foo(3, 4), 560)\n\n\n\n\n\n","category":"macro"},{"location":"reference/models/functors/#Functors.fmap","page":"Nested Structures – Functors.jl","title":"Functors.fmap","text":"fmap(f, x, ys...; exclude = Functors.isleaf, walk = Functors.DefaultWalk(), [prune])\n\nA structure and type preserving map.\n\nBy default it transforms every leaf node (identified by exclude, default isleaf) by applying f, and otherwise traverses x recursively using functor. Optionally, it may also be associated with objects ys with the same tree structure. 
In that case, f is applied to the corresponding leaf nodes in x and ys.\n\nSee also fmap_with_path and fmapstructure.\n\nExamples\n\njulia> fmap(string, (x=1, y=(2, 3)))\n(x = \"1\", y = (\"2\", \"3\"))\n\njulia> nt = (a = [1,2], b = [23, (45,), (x=6//7, y=())], c = [8,9]);\n\njulia> fmap(println, nt)\n[1, 2]\n23\n45\n6//7\n()\n[8, 9]\n(a = nothing, b = Any[nothing, (nothing,), (x = nothing, y = nothing)], c = nothing)\n\njulia> fmap(println, nt; exclude = x -> x isa Array)\n[1, 2]\nAny[23, (45,), (x = 6//7, y = ())]\n[8, 9]\n(a = nothing, b = nothing, c = nothing)\n\njulia> twice = [1, 2]; # println only acts once on this\n\njulia> fmap(println, (i = twice, ii = 34, iii = [5, 6], iv = (twice, 34), v = 34.0))\n[1, 2]\n34\n[5, 6]\n34\n34.0\n(i = nothing, ii = nothing, iii = nothing, iv = (nothing, nothing), v = nothing)\n\njulia> d1 = Dict(\"x\" => [1,2], \"y\" => 3);\n\njulia> d2 = Dict(\"x\" => [4,5], \"y\" => 6, \"z\" => \"an_extra_value\");\n\njulia> fmap(+, d1, d2) == Dict(\"x\" => [5, 7], \"y\" => 9) # Note that \"z\" is ignored\ntrue\n\nMutable objects which appear more than once are only handled once (by caching f(x) in an IdDict). Thus the relationship x.i === x.iv[1] will be preserved. An immutable object which appears twice is not stored in the cache, thus f(34) will be called twice, and the results will agree only if f is pure.\n\nBy default, Tuples, NamedTuples, and some other container-like types in Base have children to recurse into. Arrays of numbers do not. 
To enable recursion into new types, you must provide a method of functor, which can be done using the macro @functor:\n\njulia> struct Foo; x; y; end\n\njulia> @functor Foo\n\njulia> struct Bar; x; end\n\njulia> @functor Bar\n\njulia> m = Foo(Bar([1,2,3]), (4, 5, Bar(Foo(6, 7))));\n\njulia> fmap(x -> 10x, m)\nFoo(Bar([10, 20, 30]), (40, 50, Bar(Foo(60, 70))))\n\njulia> fmap(string, m)\nFoo(Bar(\"[1, 2, 3]\"), (\"4\", \"5\", Bar(Foo(\"6\", \"7\"))))\n\njulia> fmap(string, m, exclude = v -> v isa Bar)\nFoo(\"Bar([1, 2, 3])\", (4, 5, \"Bar(Foo(6, 7))\"))\n\nTo recurse into custom types without reconstructing them afterwards, use fmapstructure.\n\nFor advanced customization of the traversal behaviour, pass a custom walk function that subtypes Functors.AbstractWalk. The call fmap(f, x, ys...; walk = mywalk) will wrap mywalk in ExcludeWalk then CachedWalk. Here, ExcludeWalk is responsible for applying f at excluded nodes. For a low-level interface for executing a user-constructed walk, see execute.\n\njulia> struct MyWalk <: Functors.AbstractWalk end\n\njulia> (::MyWalk)(recurse, x) = x isa Bar ? \"hello\" :\n Functors.DefaultWalk()(recurse, x)\n\njulia> fmap(x -> 10x, m; walk = MyWalk())\nFoo(\"hello\", (40, 50, \"hello\"))\n\nThe behaviour when the same node appears twice can be altered by giving a value to the prune keyword, which is then used in place of all but the first:\n\njulia> twice = [1, 2];\n\njulia> fmap(float, (x = twice, y = [1,2], z = twice); prune = missing)\n(x = [1.0, 2.0], y = [1.0, 2.0], z = missing)\n\n\n\n\n\n","category":"function"},{"location":"reference/models/functors/#Functors.fmap_with_path","page":"Nested Structures – Functors.jl","title":"Functors.fmap_with_path","text":"fmap_with_path(f, x, ys...; exclude = isleaf, walk = DefaultWalkWithPath(), [prune])\n\nLike fmap, but also passes a KeyPath to f for each node in the recursion. The KeyPath is a tuple of the indices used to reach the current node from the root of the recursion. 
The KeyPath is constructed by the walk function, and can be used to reconstruct the path to the current node from the root of the recursion.\n\nf has to accept two arguments: the associated KeyPath and the value of the current node.\n\nexclude also receives the KeyPath as its first argument and a node as its second. It should return true if the recursion should not continue on its children and f applied to it.\n\nprune is used to control the behaviour when the same node appears twice, see fmap for more information.\n\nExamples\n\njulia> x = ([1, 2, 3], 4, (a=5, b=Dict(\"A\"=>6, \"B\"=>7), c=Dict(\"C\"=>8, \"D\"=>9)));\n\njulia> exclude(kp, x) = kp == KeyPath(3, :c) || Functors.isleaf(x);\n\njulia> fmap_with_path((kp, x) -> x isa Dict ? nothing : x.^2, x; exclude = exclude)\n([1, 4, 9], 16, (a = 25, b = Dict(\"B\" => 49, \"A\" => 36), c = nothing))\n\n\n\n\n\n","category":"function"},{"location":"reference/models/functors/#Functors.isleaf","page":"Nested Structures – Functors.jl","title":"Functors.isleaf","text":"Functors.isleaf(x)\n\nReturn true if x has no children according to functor.\n\nExamples\n\njulia> Functors.isleaf(1)\ntrue\n\njulia> Functors.isleaf([2, 3, 4])\ntrue\n\njulia> Functors.isleaf([\"five\", [6, 7]])\nfalse\n\njulia> Functors.isleaf([])\nfalse\n\njulia> Functors.isleaf((8, 9))\nfalse\n\njulia> Functors.isleaf(())\ntrue\n\n\n\n\n\n","category":"function"},{"location":"reference/models/functors/#Functors.children","page":"Nested Structures – Functors.jl","title":"Functors.children","text":"Functors.children(x)\n\nReturn the children of x as defined by functor. 
Equivalent to functor(x)[1].\n\n\n\n\n\n","category":"function"},{"location":"reference/models/functors/#Functors.fcollect","page":"Nested Structures – Functors.jl","title":"Functors.fcollect","text":"fcollect(x; exclude = v -> false)\n\nTraverse x by recursing each child of x as defined by functor and collecting the results into a flat array, ordered by a breadth-first traversal of x, respecting the iteration order of children calls.\n\nDoesn't recurse inside branches rooted at nodes v for which exclude(v) == true. In such cases, the root v is also excluded from the result. By default, exclude always yields false.\n\nSee also children.\n\nExamples\n\njulia> struct Foo; x; y; end\n\njulia> @functor Foo\n\njulia> struct Bar; x; end\n\njulia> @functor Bar\n\njulia> struct TypeWithNoChildren; x; y; end\n\njulia> m = Foo(Bar([1,2,3]), TypeWithNoChildren(:a, :b))\nFoo(Bar([1, 2, 3]), TypeWithNoChildren(:a, :b))\n\njulia> fcollect(m)\n4-element Vector{Any}:\n Foo(Bar([1, 2, 3]), TypeWithNoChildren(:a, :b))\n Bar([1, 2, 3])\n [1, 2, 3]\n TypeWithNoChildren(:a, :b)\n\njulia> fcollect(m, exclude = v -> v isa Bar)\n2-element Vector{Any}:\n Foo(Bar([1, 2, 3]), TypeWithNoChildren(:a, :b))\n TypeWithNoChildren(:a, :b)\n\njulia> fcollect(m, exclude = v -> Functors.isleaf(v))\n2-element Vector{Any}:\n Foo(Bar([1, 2, 3]), TypeWithNoChildren(:a, :b))\n Bar([1, 2, 3])\n\n\n\n\n\n","category":"function"},{"location":"reference/models/functors/#Functors.functor","page":"Nested Structures – Functors.jl","title":"Functors.functor","text":"Functors.functor(x) = functor(typeof(x), x)\n\nReturns a tuple containing, first, a NamedTuple of the children of x (typically its fields), and second, a reconstruction function. 
This controls the behaviour of fmap.\n\nMethods should be added to functor(::Type{T}, x) for custom types, usually using the macro @functor.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/functors/#Functors.fmapstructure","page":"Nested Structures – Functors.jl","title":"Functors.fmapstructure","text":"fmapstructure(f, x, ys...; exclude = isleaf, [prune])\n\nLike fmap, but doesn't preserve the type of custom structs. Instead, it returns a NamedTuple (or a Tuple, or an array), or a nested set of these.\n\nUseful for when the output must not contain custom structs.\n\nSee also fmap and fmapstructure_with_path.\n\nExamples\n\njulia> struct Foo; x; y; end\n\njulia> @functor Foo\n\njulia> m = Foo([1,2,3], [4, (5, 6), Foo(7, 8)]);\n\njulia> fmapstructure(x -> 2x, m)\n(x = [2, 4, 6], y = Any[8, (10, 12), (x = 14, y = 16)])\n\njulia> fmapstructure(println, m)\n[1, 2, 3]\n4\n5\n6\n7\n8\n(x = nothing, y = Any[nothing, (nothing, nothing), (x = nothing, y = nothing)])\n\n\n\n\n\n","category":"function"},{"location":"reference/models/functors/#Functors.fmapstructure_with_path","page":"Nested Structures – Functors.jl","title":"Functors.fmapstructure_with_path","text":"fmapstructure_with_path(f, x, ys...; [exclude, prune])\n\nLike fmap_with_path, but doesn't preserve the type of custom structs. Instead, it returns a named tuple, a tuple, an array, a dict, or a nested set of these.\n\nSee also fmapstructure.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/functors/#Functors.execute","page":"Nested Structures – Functors.jl","title":"Functors.execute","text":"execute(walk, x, ys...)\n\nExecute a walk that recursively calls itself, starting at a node x in a Functors tree, as well as optional associated nodes ys... in other Functors trees. 
Any custom walk function that subtypes Functors.AbstractWalk is permitted.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/functors/#Functors.AbstractWalk","page":"Nested Structures – Functors.jl","title":"Functors.AbstractWalk","text":"AbstractWalk\n\nAny walk for use with fmap should inherit from this type. A walk subtyping AbstractWalk must satisfy the walk function interface:\n\nstruct MyWalk <: AbstractWalk end\n\nfunction (::MyWalk)(recurse, x, ys...)\n # implement this\nend\n\nThe walk function is called on a node x in a Functors tree. It may also be passed associated nodes ys... in other Functors trees. The walk function recurses further into (x, ys...) by calling recurse on the child nodes. The choice of which nodes to recurse and in what order is custom to the walk.\n\n\n\n\n\n","category":"type"},{"location":"reference/models/functors/#Functors.ExcludeWalk","page":"Nested Structures – Functors.jl","title":"Functors.ExcludeWalk","text":"ExcludeWalk(walk, fn, exclude)\n\nA walk that recurses nodes (x, ys...) according to walk, except when exclude(x) is true. Then, fn(x, ys...) is applied instead of recursing further.\n\nTypically wraps an existing walk for use with fmap.\n\n\n\n\n\n","category":"type"},{"location":"reference/models/functors/#Functors.CachedWalk","page":"Nested Structures – Functors.jl","title":"Functors.CachedWalk","text":"CachedWalk(walk[; prune])\n\nA walk that recurses nodes (x, ys...) according to walk and storing the output of the recursion in a cache indexed by x (based on object ID). Whenever the cache already contains x, either:\n\nprune is specified, then it is returned, or\nprune is unspecified, and the previously cached recursion of (x, ys...) 
is returned.\n\nTypically wraps an existing walk for use with fmap.\n\n\n\n\n\n","category":"type"},{"location":"reference/models/functors/#Moving-models,-or-data,-to-the-GPU","page":"Nested Structures – Functors.jl","title":"Moving models, or data, to the GPU","text":"","category":"section"},{"location":"reference/models/functors/","page":"Nested Structures – Functors.jl","title":"Nested Structures – Functors.jl","text":"Flux provides some convenience functions based on fmap. Some (f16, f32, f64) change the precision of all arrays in a model. Others are used for moving a model to or from GPU memory:","category":"page"},{"location":"reference/models/functors/","page":"Nested Structures – Functors.jl","title":"Nested Structures – Functors.jl","text":"cpu\ngpu(::Any)\ngpu(::Flux.DataLoader)","category":"page"},{"location":"reference/models/functors/#Flux.cpu","page":"Nested Structures – Functors.jl","title":"Flux.cpu","text":"cpu(m)\n\nCopies m onto the CPU, the opposite of gpu. Recurses into structs marked @functor.\n\nExample\n\njulia> m_gpu = Dense(CUDA.randn(2, 5))\nDense(5 => 2) # 12 parameters\n\njulia> m_gpu.bias # matches the given weight matrix\n2-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:\n 0.0\n 0.0\n\njulia> m = m_gpu |> cpu\nDense(5 => 2) # 12 parameters\n\njulia> m.bias\n2-element Vector{Float32}:\n 0.0\n 0.0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/functors/#Flux.gpu-Tuple{Any}","page":"Nested Structures – Functors.jl","title":"Flux.gpu","text":"gpu(m)\n\nCopies m to the current GPU device (using the current GPU backend), if one is available. If no GPU is available, it does nothing (but prints a warning the first time).\n\nOn arrays, this calls CUDA's cu, which also changes arrays with Float64 elements to Float32 while copying them to the device (same for AMDGPU). To act on arrays within a struct, the struct type must be marked with @functor.\n\nUse cpu to copy back to ordinary Arrays. 
See also f32 and f16 to change element type only.\n\nSee the CUDA.jl docs to help identify the current device.\n\nExample\n\njulia> m = Dense(rand(2, 3)) # constructed with Float64 weight matrix\nDense(3 => 2) # 8 parameters\n\njulia> typeof(m.weight)\nMatrix{Float64} (alias for Array{Float64, 2})\n\njulia> m_gpu = gpu(m) # can equivalently be written m_gpu = m |> gpu\nDense(3 => 2) # 8 parameters\n\njulia> typeof(m_gpu.weight)\nCUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}\n\n\n\n\n\n","category":"method"},{"location":"reference/models/functors/#Flux.gpu-Tuple{DataLoader}","page":"Nested Structures – Functors.jl","title":"Flux.gpu","text":"gpu(data::DataLoader)\ncpu(data::DataLoader)\n\nTransforms a given DataLoader to apply gpu or cpu to each batch of data, when iterated over. (If no GPU is available, this does nothing.)\n\nExample\n\njulia> dl = Flux.DataLoader((x = ones(2,10), y='a':'j'), batchsize=3)\n4-element DataLoader(::NamedTuple{(:x, :y), Tuple{Matrix{Float64}, StepRange{Char, Int64}}}, batchsize=3)\n with first element:\n (; x = 2×3 Matrix{Float64}, y = 3-element StepRange{Char, Int64})\n\njulia> first(dl)\n(x = [1.0 1.0 1.0; 1.0 1.0 1.0], y = 'a':1:'c')\n\njulia> c_dl = gpu(dl)\n4-element DataLoader(::MLUtils.MappedData{:auto, typeof(gpu), NamedTuple{(:x, :y), Tuple{Matrix{Float64}, StepRange{Char, Int64}}}}, batchsize=3)\n with first element:\n (; x = 2×3 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, y = 3-element StepRange{Char, Int64})\n\njulia> first(c_dl).x\n2×3 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:\n 1.0 1.0 1.0\n 1.0 1.0 1.0\n\nFor large datasets, this is preferred over moving all the data to the GPU before creating the DataLoader, like this:\n\njulia> Flux.DataLoader((x = ones(2,10), y=2:11) |> gpu, batchsize=3)\n4-element DataLoader(::NamedTuple{(:x, :y), Tuple{CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, UnitRange{Int64}}}, batchsize=3)\n with first element:\n (; x = 2×3 CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, y = 3-element 
UnitRange{Int64})\n\nwarning: Warning\nThis only works if gpu is applied directly to the DataLoader. While gpu acts recursively on Flux models and many basic Julia structs, it will not work on (say) a tuple of DataLoaders.\n\n\n\n\n\n","category":"method"},{"location":"reference/models/losses/#man-losses","page":"Loss Functions","title":"Loss Functions","text":"","category":"section"},{"location":"reference/models/losses/","page":"Loss Functions","title":"Loss Functions","text":"Flux provides a large number of common loss functions used for training machine learning models. They are grouped together in the Flux.Losses module.","category":"page"},{"location":"reference/models/losses/","page":"Loss Functions","title":"Loss Functions","text":"Loss functions for supervised learning typically expect as inputs a target y, and a prediction ŷ from your model. In Flux's convention, the order of the arguments is the following","category":"page"},{"location":"reference/models/losses/","page":"Loss Functions","title":"Loss Functions","text":"loss(ŷ, y)","category":"page"},{"location":"reference/models/losses/","page":"Loss Functions","title":"Loss Functions","text":"Most loss functions in Flux have an optional argument agg, denoting the type of aggregation performed over the batch:","category":"page"},{"location":"reference/models/losses/","page":"Loss Functions","title":"Loss Functions","text":"loss(ŷ, y) # defaults to `mean`\nloss(ŷ, y, agg=sum) # use `sum` for reduction\nloss(ŷ, y, agg=x->sum(x, dims=2)) # partial reduction\nloss(ŷ, y, agg=x->mean(w .* x)) # weighted mean\nloss(ŷ, y, agg=identity) # no aggregation.","category":"page"},{"location":"reference/models/losses/#Function-listing","page":"Loss Functions","title":"Function listing","text":"","category":"section"},{"location":"reference/models/losses/","page":"Loss Functions","title":"Loss 
Functions","text":"Flux.Losses.mae\nFlux.Losses.mse\nFlux.Losses.msle\nFlux.Losses.huber_loss\nFlux.Losses.label_smoothing\nFlux.Losses.crossentropy\nFlux.Losses.logitcrossentropy\nFlux.Losses.binarycrossentropy\nFlux.Losses.logitbinarycrossentropy\nFlux.Losses.kldivergence\nFlux.Losses.poisson_loss\nFlux.Losses.hinge_loss\nFlux.Losses.squared_hinge_loss\nFlux.Losses.dice_coeff_loss\nFlux.Losses.tversky_loss\nFlux.Losses.binary_focal_loss\nFlux.Losses.focal_loss\nFlux.Losses.siamese_contrastive_loss","category":"page"},{"location":"reference/models/losses/#Flux.Losses.mae","page":"Loss Functions","title":"Flux.Losses.mae","text":"mae(ŷ, y; agg = mean)\n\nReturn the loss corresponding to mean absolute error:\n\nagg(abs.(ŷ .- y))\n\nExample\n\njulia> y_model = [1.1, 1.9, 3.1];\n\njulia> Flux.mae(y_model, 1:3)\n0.10000000000000009\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.mse","page":"Loss Functions","title":"Flux.Losses.mse","text":"mse(ŷ, y; agg = mean)\n\nReturn the loss corresponding to mean square error:\n\nagg((ŷ .- y) .^ 2)\n\nSee also: mae, msle, crossentropy.\n\nExample\n\njulia> y_model = [1.1, 1.9, 3.1];\n\njulia> y_true = 1:3;\n\njulia> Flux.mse(y_model, y_true)\n0.010000000000000018\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.msle","page":"Loss Functions","title":"Flux.Losses.msle","text":"msle(ŷ, y; agg = mean, eps = eps(eltype(ŷ)))\n\nThe loss corresponding to mean squared logarithmic errors, calculated as\n\nagg((log.(ŷ .+ ϵ) .- log.(y .+ ϵ)) .^ 2)\n\nThe ϵ == eps term provides numerical stability. 
Penalizes an under-estimation more than an over-estimation.\n\nExample\n\njulia> Flux.msle(Float32[1.1, 2.2, 3.3], 1:3)\n0.009084041f0\n\njulia> Flux.msle(Float32[0.9, 1.8, 2.7], 1:3)\n0.011100831f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.huber_loss","page":"Loss Functions","title":"Flux.Losses.huber_loss","text":"huber_loss(ŷ, y; delta = 1, agg = mean)\n\nReturn the mean of the Huber loss given the prediction ŷ and true values y.\n\n | 0.5 * |ŷ - y|^2, for |ŷ - y| <= δ\nHuber loss = |\n | δ * (|ŷ - y| - 0.5 * δ), otherwise\n\nExample\n\njulia> ŷ = [1.1, 2.1, 3.1];\n\njulia> Flux.huber_loss(ŷ, 1:3) # default δ = 1 > |ŷ - y|\n0.005000000000000009\n\njulia> Flux.huber_loss(ŷ, 1:3, delta=0.05) # changes behaviour as |ŷ - y| > δ\n0.003750000000000005\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.label_smoothing","page":"Loss Functions","title":"Flux.Losses.label_smoothing","text":"label_smoothing(y::Union{Number, AbstractArray}, α; dims::Int=1)\n\nReturns smoothed labels, meaning the confidence on label values is relaxed.\n\nWhen y is given as a one-hot vector or a batch of one-hot vectors, it is calculated as\n\ny .* (1 - α) .+ α / size(y, dims)\n\nWhen y is given as a number or batch of numbers for binary classification, it is calculated as\n\ny .* (1 - α) .+ α / 2\n\nin which case the labels are squeezed towards 0.5.\n\nα is a number in interval (0, 1) called the smoothing factor. 
The higher the value of α, the larger the smoothing of y.\n\ndims denotes the one-hot dimension, unless dims=0 which denotes the application of label smoothing to binary distributions encoded in a single number.\n\nExample\n\njulia> y = Flux.onehotbatch([1, 1, 1, 0, 1, 0], 0:1)\n2×6 OneHotMatrix(::Vector{UInt32}) with eltype Bool:\n ⋅ ⋅ ⋅ 1 ⋅ 1\n 1 1 1 ⋅ 1 ⋅\n\njulia> y_smoothed = Flux.label_smoothing(y, 0.2f0)\n2×6 Matrix{Float32}:\n 0.1 0.1 0.1 0.9 0.1 0.9\n 0.9 0.9 0.9 0.1 0.9 0.1\n\njulia> y_sim = softmax(y .* log(2f0))\n2×6 Matrix{Float32}:\n 0.333333 0.333333 0.333333 0.666667 0.333333 0.666667\n 0.666667 0.666667 0.666667 0.333333 0.666667 0.333333\n\njulia> y_dis = vcat(y_sim[2,:]', y_sim[1,:]')\n2×6 Matrix{Float32}:\n 0.666667 0.666667 0.666667 0.333333 0.666667 0.333333\n 0.333333 0.333333 0.333333 0.666667 0.333333 0.666667\n\njulia> Flux.crossentropy(y_sim, y) < Flux.crossentropy(y_sim, y_smoothed)\ntrue\n\njulia> Flux.crossentropy(y_dis, y) > Flux.crossentropy(y_dis, y_smoothed)\ntrue\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.crossentropy","page":"Loss Functions","title":"Flux.Losses.crossentropy","text":"crossentropy(ŷ, y; dims = 1, eps = eps(eltype(ŷ)), agg = mean)\n\nReturn the cross entropy between the given probability distributions; calculated as\n\nagg(-sum(y .* log.(ŷ .+ ϵ); dims))\n\nCross entropy is typically used as a loss in multi-class classification, in which case the labels y are given in a one-hot format. dims specifies the dimension (or the dimensions) containing the class probabilities. 
The prediction ŷ is supposed to sum to one across dims, as would be the case with the output of a softmax operation.\n\nFor numerical stability, it is recommended to use logitcrossentropy rather than softmax followed by crossentropy.\n\nUse label_smoothing to smooth the true labels as preprocessing before computing the loss.\n\nSee also: logitcrossentropy, binarycrossentropy, logitbinarycrossentropy.\n\nExample\n\njulia> y_label = Flux.onehotbatch([0, 1, 2, 1, 0], 0:2)\n3×5 OneHotMatrix(::Vector{UInt32}) with eltype Bool:\n 1 ⋅ ⋅ ⋅ 1\n ⋅ 1 ⋅ 1 ⋅\n ⋅ ⋅ 1 ⋅ ⋅\n\njulia> y_model = softmax(reshape(-7:7, 3, 5) .* 1f0)\n3×5 Matrix{Float32}:\n 0.0900306 0.0900306 0.0900306 0.0900306 0.0900306\n 0.244728 0.244728 0.244728 0.244728 0.244728\n 0.665241 0.665241 0.665241 0.665241 0.665241\n\njulia> sum(y_model; dims=1)\n1×5 Matrix{Float32}:\n 1.0 1.0 1.0 1.0 1.0\n\njulia> Flux.crossentropy(y_model, y_label)\n1.6076053f0\n\njulia> 5 * ans ≈ Flux.crossentropy(y_model, y_label; agg=sum)\ntrue\n\njulia> y_smooth = Flux.label_smoothing(y_label, 0.15f0)\n3×5 Matrix{Float32}:\n 0.9 0.05 0.05 0.05 0.9\n 0.05 0.9 0.05 0.9 0.05\n 0.05 0.05 0.9 0.05 0.05\n\njulia> Flux.crossentropy(y_model, y_smooth)\n1.5776052f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.logitcrossentropy","page":"Loss Functions","title":"Flux.Losses.logitcrossentropy","text":"logitcrossentropy(ŷ, y; dims = 1, agg = mean)\n\nReturn the cross entropy calculated by\n\nagg(-sum(y .* logsoftmax(ŷ; dims); dims))\n\nThis is mathematically equivalent to crossentropy(softmax(ŷ), y), but is more numerically stable than using functions crossentropy and softmax separately.\n\nSee also: binarycrossentropy, logitbinarycrossentropy, label_smoothing.\n\nExample\n\njulia> y_label = Flux.onehotbatch(collect(\"abcabaa\"), 'a':'c')\n3×7 OneHotMatrix(::Vector{UInt32}) with eltype Bool:\n 1 ⋅ ⋅ 1 ⋅ 1 1\n ⋅ 1 ⋅ ⋅ 1 ⋅ ⋅\n ⋅ ⋅ 1 ⋅ ⋅ ⋅ ⋅\n\njulia> y_model = reshape(vcat(-9:0, 0:9, 7.5f0), 3, 7)\n3×7 
Matrix{Float32}:\n -9.0 -6.0 -3.0 0.0 2.0 5.0 8.0\n -8.0 -5.0 -2.0 0.0 3.0 6.0 9.0\n -7.0 -4.0 -1.0 1.0 4.0 7.0 7.5\n\njulia> Flux.logitcrossentropy(y_model, y_label)\n1.5791205f0\n\njulia> Flux.crossentropy(softmax(y_model), y_label)\n1.5791197f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.binarycrossentropy","page":"Loss Functions","title":"Flux.Losses.binarycrossentropy","text":"binarycrossentropy(ŷ, y; agg = mean, eps = eps(eltype(ŷ)))\n\nReturn the binary cross-entropy loss, computed as\n\nagg(@.(-y * log(ŷ + ϵ) - (1 - y) * log(1 - ŷ + ϵ)))\n\nWhere typically, the prediction ŷ is given by the output of a sigmoid activation. The ϵ == eps term is included to avoid infinity. Using logitbinarycrossentropy is recommended over binarycrossentropy for numerical stability.\n\nUse label_smoothing to smooth the y value as preprocessing before computing the loss.\n\nSee also: crossentropy, logitcrossentropy.\n\nExamples\n\njulia> y_bin = Bool[1,0,1]\n3-element Vector{Bool}:\n 1\n 0\n 1\n\njulia> y_prob = softmax(reshape(vcat(1:3, 3:5), 2, 3) .* 1f0)\n2×3 Matrix{Float32}:\n 0.268941 0.5 0.268941\n 0.731059 0.5 0.731059\n\njulia> Flux.binarycrossentropy(y_prob[2,:], y_bin)\n0.43989f0\n\njulia> all(p -> 0 < p < 1, y_prob[2,:]) # else DomainError\ntrue\n\njulia> y_hot = Flux.onehotbatch(y_bin, 0:1)\n2×3 OneHotMatrix(::Vector{UInt32}) with eltype Bool:\n ⋅ 1 ⋅\n 1 ⋅ 1\n\njulia> Flux.crossentropy(y_prob, y_hot)\n0.43989f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.logitbinarycrossentropy","page":"Loss Functions","title":"Flux.Losses.logitbinarycrossentropy","text":"logitbinarycrossentropy(ŷ, y; agg = mean)\n\nMathematically equivalent to binarycrossentropy(σ(ŷ), y) but is more numerically stable.\n\nSee also: crossentropy, logitcrossentropy.\n\nExamples\n\njulia> y_bin = Bool[1,0,1];\n\njulia> y_model = Float32[2, -1, pi]\n3-element Vector{Float32}:\n 2.0\n -1.0\n 3.1415927\n\njulia> 
Flux.logitbinarycrossentropy(y_model, y_bin)\n0.160832f0\n\njulia> Flux.binarycrossentropy(sigmoid.(y_model), y_bin)\n0.16083185f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.kldivergence","page":"Loss Functions","title":"Flux.Losses.kldivergence","text":"kldivergence(ŷ, y; agg = mean, eps = eps(eltype(ŷ)))\n\nReturn the Kullback-Leibler divergence between the given probability distributions.\n\nThe KL divergence is a measure of how much one probability distribution is different from the other. It is always non-negative, and zero only when both the distributions are equal.\n\nExample\n\njulia> p1 = [1 0; 0 1]\n2×2 Matrix{Int64}:\n 1 0\n 0 1\n\njulia> p2 = fill(0.5, 2, 2)\n2×2 Matrix{Float64}:\n 0.5 0.5\n 0.5 0.5\n\njulia> Flux.kldivergence(p2, p1) ≈ log(2)\ntrue\n\njulia> Flux.kldivergence(p2, p1; agg = sum) ≈ 2log(2)\ntrue\n\njulia> Flux.kldivergence(p2, p2; eps = 0) # about -2e-16 with the regulator\n0.0\n\njulia> Flux.kldivergence(p1, p2; eps = 0) # about 17.3 with the regulator\nInf\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.poisson_loss","page":"Loss Functions","title":"Flux.Losses.poisson_loss","text":"poisson_loss(ŷ, y; agg = mean)\n\nReturn how much the predicted distribution ŷ diverges from the expected Poisson distribution y; calculated as -\n\nsum(ŷ .- y .* log.(ŷ)) / size(y, 2)\n\nMore information..\n\nExample\n\njulia> y_model = [1, 3, 3]; # data should only take integral values\n\njulia> Flux.poisson_loss(y_model, 1:3)\n0.5023128522198171\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.hinge_loss","page":"Loss Functions","title":"Flux.Losses.hinge_loss","text":"hinge_loss(ŷ, y; agg = mean)\n\nReturn the hinge_loss given the prediction ŷ and true labels y (containing 1 or -1); calculated as\n\nsum(max.(0, 1 .- ŷ .* y)) / size(y, 2)\n\nUsually used with classifiers like Support Vector Machines. 
See also: squared_hinge_loss\n\nExample\n\njulia> y_true = [1, -1, 1, 1];\n\njulia> y_pred = [0.1, 0.3, 1, 1.5];\n\njulia> Flux.hinge_loss(y_pred, y_true)\n0.55\n\njulia> Flux.hinge_loss(y_pred[1], y_true[1]) != 0 # same sign but |ŷ| < 1\ntrue\n\njulia> Flux.hinge_loss(y_pred[end], y_true[end]) == 0 # same sign but |ŷ| >= 1\ntrue\n\njulia> Flux.hinge_loss(y_pred[2], y_true[2]) != 0 # opposite signs\ntrue\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.squared_hinge_loss","page":"Loss Functions","title":"Flux.Losses.squared_hinge_loss","text":"squared_hinge_loss(ŷ, y)\n\nReturn the squared hinge loss given the prediction ŷ and true labels y (containing 1 or -1); calculated as\n\nsum((max.(0, 1 .- ŷ .* y)).^2) / size(y, 2)\n\nUsually used with classifiers like Support Vector Machines. See also: hinge_loss\n\nExample\n\njulia> y_true = [1, -1, 1, 1];\n\njulia> y_pred = [0.1, 0.3, 1, 1.5];\n\njulia> Flux.squared_hinge_loss(y_pred, y_true)\n0.625\n\njulia> Flux.squared_hinge_loss(y_pred[1], y_true[1]) != 0\ntrue\n\njulia> Flux.squared_hinge_loss(y_pred[end], y_true[end]) == 0\ntrue\n\njulia> Flux.squared_hinge_loss(y_pred[2], y_true[2]) != 0\ntrue\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.dice_coeff_loss","page":"Loss Functions","title":"Flux.Losses.dice_coeff_loss","text":"dice_coeff_loss(ŷ, y; smooth = 1)\n\nReturn a loss based on the dice coefficient. Used in the V-Net image segmentation architecture. The dice coefficient is similar to the F1_score. 
Loss calculated as:\n\n1 - 2*sum(|ŷ .* y| + smooth) / (sum(ŷ.^2) + sum(y.^2) + smooth)\n\nExample\n\njulia> y_pred = [1.1, 2.1, 3.1];\n\njulia> Flux.dice_coeff_loss(y_pred, 1:3)\n0.000992391663909964\n\njulia> 1 - Flux.dice_coeff_loss(y_pred, 1:3) # ~ F1 score for image segmentation\n0.99900760833609\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.tversky_loss","page":"Loss Functions","title":"Flux.Losses.tversky_loss","text":"tversky_loss(ŷ, y; beta = 0.7)\n\nReturn the Tversky loss. Used with imbalanced data to give more weight to false negatives. A larger β == beta weighs recall more than precision (by placing more emphasis on false negatives). Calculated as:\n\n1 - sum(|y .* ŷ| + 1) / (sum(y .* ŷ + (1 - β)*(1 .- y) .* ŷ + β*y .* (1 .- ŷ)) + 1)\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.binary_focal_loss","page":"Loss Functions","title":"Flux.Losses.binary_focal_loss","text":"binary_focal_loss(ŷ, y; agg=mean, gamma=2, eps=eps(eltype(ŷ)))\n\nReturn the binary_focal_loss. The input, 'ŷ', is expected to be normalized (i.e. softmax output).\n\nFor gamma = 0, the loss is mathematically equivalent to Losses.binarycrossentropy.\n\nSee also: Losses.focal_loss for multi-class setting\n\nExample\n\njulia> y = [0 1 0\n 1 0 1]\n2×3 Matrix{Int64}:\n 0 1 0\n 1 0 1\n\njulia> ŷ = [0.268941 0.5 0.268941\n 0.731059 0.5 0.731059]\n2×3 Matrix{Float64}:\n 0.268941 0.5 0.268941\n 0.731059 0.5 0.731059\n\njulia> Flux.binary_focal_loss(ŷ, y) ≈ 0.0728675615927385\ntrue\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.focal_loss","page":"Loss Functions","title":"Flux.Losses.focal_loss","text":"focal_loss(ŷ, y; dims=1, agg=mean, gamma=2, eps=eps(eltype(ŷ)))\n\nReturn the focal_loss which can be used in classification tasks with highly imbalanced classes. It down-weights well-classified examples and focuses on hard examples. The input, 'ŷ', is expected to be normalized (i.e. 
softmax output).\n\nThe modulating factor, γ == gamma, controls the down-weighting strength. For γ == 0, the loss is mathematically equivalent to Losses.crossentropy.\n\nExample\n\njulia> y = [1 0 0 0 1\n 0 1 0 1 0\n 0 0 1 0 0]\n3×5 Matrix{Int64}:\n 1 0 0 0 1\n 0 1 0 1 0\n 0 0 1 0 0\n\njulia> ŷ = softmax(reshape(-7:7, 3, 5) .* 1f0)\n3×5 Matrix{Float32}:\n 0.0900306 0.0900306 0.0900306 0.0900306 0.0900306\n 0.244728 0.244728 0.244728 0.244728 0.244728\n 0.665241 0.665241 0.665241 0.665241 0.665241\n\njulia> Flux.focal_loss(ŷ, y) ≈ 1.1277571935622628\ntrue\n\nSee also: Losses.binary_focal_loss for binary (not one-hot) labels\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.siamese_contrastive_loss","page":"Loss Functions","title":"Flux.Losses.siamese_contrastive_loss","text":"siamese_contrastive_loss(ŷ, y; margin = 1, agg = mean)\n\nReturn the contrastive loss which can be useful for training Siamese Networks. It is given by\n\nagg(@. (1 - y) * ŷ^2 + y * max(0, margin - ŷ)^2)\n\nSpecify margin to set the baseline for distance at which pairs are dissimilar.\n\nExample\n\njulia> ŷ = [0.5, 1.5, 2.5];\n\njulia> Flux.siamese_contrastive_loss(ŷ, 1:3)\n-4.833333333333333\n\njulia> Flux.siamese_contrastive_loss(ŷ, 1:3, margin = 2)\n-4.0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/layers/#Built-in-Layer-Types","page":"Built-in Layers","title":"Built-in Layer Types","text":"","category":"section"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"If you started at the beginning of the guide, then you have already met the basic Dense layer, and seen Chain for combining layers. 
These core layers form the foundation of almost all neural networks.","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"The Dense layer exemplifies several features:","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"It contains an activation function, which is broadcasted over the output. Because this broadcast can be fused with other operations, doing so is more efficient than applying the activation function separately.\nIt takes an init keyword, which accepts a function acting like rand. That is, init(2,3,4) should create an array of this size. Flux has many such functions built-in. All make a CPU array, moved later with gpu if desired.\nThe bias vector is always initialised with Flux.zeros32. The keyword bias=false will turn this off, i.e. keeping the bias permanently zero.\nIt is annotated with @layer, which means that Flux.setup will see the contents, and gpu will move their arrays to the GPU.","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"By contrast, Chain itself contains no parameters, but connects other layers together. 
The section on dataflow layers introduces others like this.","category":"page"},{"location":"reference/models/layers/#Fully-Connected","page":"Built-in Layers","title":"Fully Connected","text":"","category":"section"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"Dense\nFlux.Bilinear\nFlux.Scale","category":"page"},{"location":"reference/models/layers/#Flux.Dense","page":"Built-in Layers","title":"Flux.Dense","text":"Dense(in => out, σ=identity; bias=true, init=glorot_uniform)\nDense(W::AbstractMatrix, [bias, σ])\n\nCreate a traditional fully connected layer, whose forward pass is given by:\n\ny = σ.(W * x .+ bias)\n\nThe input x should be a vector of length in, or batch of vectors represented as an in × N matrix, or any array with size(x,1) == in. The output y will be a vector of length out, or a batch with size(y) == (out, size(x)[2:end]...)\n\nKeyword bias=false will switch off trainable bias for the layer. The initialisation of the weight matrix is W = init(out, in), calling the function given to keyword init, with default glorot_uniform. 
The weight matrix and/or the bias vector (of length out) may also be provided explicitly.\n\nExamples\n\njulia> model = Dense(5 => 2)\nDense(5 => 2) # 12 parameters\n\njulia> model(rand32(5, 64)) |> size\n(2, 64)\n\njulia> model(rand32(5, 6, 4, 64)) |> size # treated as three batch dimensions\n(2, 6, 4, 64)\n\njulia> model2 = Dense(ones(2, 5), false, tanh) # using provided weight matrix\nDense(5 => 2, tanh; bias=false) # 10 parameters\n\njulia> model2(ones(5))\n2-element Vector{Float64}:\n 0.9999092042625951\n 0.9999092042625951\n\njulia> Flux.trainables(model2) # no trainable bias\n1-element Vector{AbstractArray}:\n [1.0 1.0 … 1.0 1.0; 1.0 1.0 … 1.0 1.0]\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.Bilinear","page":"Built-in Layers","title":"Flux.Bilinear","text":"Bilinear((in1, in2) => out, σ=identity; bias=true, init=glorot_uniform)\nBilinear(W::AbstractArray, [bias, σ])\n\nCreates a layer which is fully connected between two inputs and the output, and otherwise similar to Dense. Its output, given vectors x & y, is another vector z with, for all i ∈ 1:out:\n\nz[i] = σ(x' * W[i,:,:] * y + bias[i])\n\nIf x and y are matrices, then each column of the output z = B(x, y) is of this form, with B the Bilinear layer.\n\nIf the second input y is not given, it is taken to be equal to x, i.e. B(x) == B(x, x)\n\nThe two inputs may also be provided as a tuple, B((x, y)) == B(x, y), which is accepted as the input to a Chain.\n\nIf the two input sizes are the same, in1 == in2, then you may write Bilinear(in => out, σ).\n\nThe initialisation works as for Dense layer, with W = init(out, in1, in2). By default the bias vector is zeros(Float32, out), option bias=false will switch off trainable bias. 
Either of these may be provided explicitly.\n\nExamples\n\njulia> x, y = randn(Float32, 5, 32), randn(Float32, 5, 32);\n\njulia> B = Flux.Bilinear((5, 5) => 7)\nBilinear(5 => 7) # 182 parameters\n\njulia> B(x) |> size # interactions based on one input\n(7, 32)\n\njulia> B(x,y) == B((x,y)) # two inputs, may be given as a tuple\ntrue\n\njulia> sc = SkipConnection(\n Chain(Dense(5 => 20, tanh), Dense(20 => 9, tanh)),\n Flux.Bilinear((9, 5) => 3, bias=false),\n ); # used as the recombinator, with skip as the second input\n\njulia> sc(x) |> size\n(3, 32)\n\njulia> Flux.Bilinear(rand(4,8,16), false, tanh) # first dim of weight is the output\nBilinear((8, 16) => 4, tanh; bias=false) # 512 parameters\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.Scale","page":"Built-in Layers","title":"Flux.Scale","text":"Scale(size::Integer..., σ=identity; bias=true, init=ones32)\nScale(scale::AbstractArray, [bias, σ])\n\nCreate an element-wise layer, whose forward pass is given by:\n\ny = σ.(scale .* x .+ bias)\n\nThis uses .* instead of matrix multiplication * of Dense.\n\nThe learnable scale & bias are initialised init(size...) and zeros32(size...), with init=ones32 by default. 
You may specify the function init, turn off trainable bias with bias=false, or provide the array(s) explicitly.\n\nUsed by LayerNorm with affine=true.\n\nExamples\n\njulia> a = Flux.Scale(2)\nScale(2) # 4 parameters\n\njulia> Flux.trainables(a)\n2-element Vector{AbstractArray}:\n Float32[1.0, 1.0]\n Float32[0.0, 0.0]\n\njulia> a([1 2 3])\n2×3 Matrix{Float32}:\n 1.0 2.0 3.0\n 1.0 2.0 3.0\n\njulia> b = Flux.Scale(Float32[1 2 3 4], false, abs2)\nScale(1, 4, abs2; bias=false) # 4 parameters\n\njulia> b([1, 10])\n2×4 Matrix{Float32}:\n 1.0 4.0 9.0 16.0\n 100.0 400.0 900.0 1600.0\n\njulia> Flux.trainables(b)\n1-element Vector{AbstractArray}:\n Float32[1.0 2.0 3.0 4.0]\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"Perhaps Scale isn't quite fully connected, but it may be thought of as Dense(Diagonal(s.weights), s.bias), and LinearAlgebra's Diagonal is a matrix which just happens to contain many zeros.","category":"page"},{"location":"reference/models/layers/#Convolution-Models","page":"Built-in Layers","title":"Convolution Models","text":"","category":"section"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"These layers are used to build convolutional neural networks (CNNs).","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"They all expect images in what is called WHCN order: a batch of 32 colour images, each 50 x 50 pixels, will have size(x) == (50, 50, 3, 32). A single grayscale image might instead have size(x) == (28, 28, 1, 1).","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"Besides images, 2D data, they also work with 1D data, where for instance stereo sound recording with 1000 samples might have size(x) == (1000, 2, 1). 
They will also work with 3D data, ndims(x) == 5, where again the last two dimensions are channel and batch.","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"To understand how strides and padding work, the article by Dumoulin & Visin has great illustrations.","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"Conv\nConv(weight::AbstractArray)\nConvTranspose\nConvTranspose(weight::AbstractArray)\nCrossCor\nCrossCor(weight::AbstractArray)\nDepthwiseConv\nSamePad\nFlux.flatten","category":"page"},{"location":"reference/models/layers/#Flux.Conv","page":"Built-in Layers","title":"Flux.Conv","text":"Conv(filter, in => out, σ = identity;\n stride = 1, pad = 0, dilation = 1, groups = 1, [bias, init])\n\nStandard convolutional layer. filter is a tuple of integers specifying the size of the convolutional kernel; in and out specify the number of input and output channels.\n\nImage data should be stored in WHCN order (width, height, channels, batch). In other words, a 100×100 RGB image would be a 100×100×3×1 array, and a batch of 50 would be a 100×100×3×50 array. This has N = 2 spatial dimensions, and needs a kernel size like (5,5), a 2-tuple of integers.\n\nTo take convolutions along N feature dimensions, this layer expects as input an array with ndims(x) == N+2, where size(x, N+1) == in is the number of input channels, and size(x, ndims(x)) is (as always) the number of observations in a batch. Then:\n\nfilter should be a tuple of N integers.\nKeywords stride and dilation should each be either a single integer, or a tuple with N integers.\nKeyword pad specifies the number of elements added to the borders of the data array. 
It can be\na single integer for equal padding all around,\na tuple of N integers, to apply the same padding at begin/end of each spatial dimension,\na tuple of 2*N integers, for asymmetric padding, or\nthe singleton SamePad(), to calculate padding such that size(output,d) == size(x,d) / stride (possibly rounded) for each spatial dimension.\nKeyword groups is expected to be an Int. It specifies the number of groups to divide a convolution into.\n\nKeywords to control initialization of the layer:\n\ninit - Function used to generate initial weights. Defaults to glorot_uniform.\nbias - The initial bias vector is all zero by default. Trainable bias can be disabled entirely by setting this to false, or another vector can be provided such as bias = randn(Float32, out).\n\nSee also ConvTranspose, DepthwiseConv, CrossCor.\n\nExamples\n\njulia> xs = rand32(100, 100, 3, 50); # a batch of 50 RGB images\n\njulia> layer = Conv((5,5), 3 => 7, relu; bias = false)\nConv((5, 5), 3 => 7, relu, bias=false) # 525 parameters\n\njulia> layer(xs) |> size\n(96, 96, 7, 50)\n\njulia> Conv((5,5), 3 => 7; stride = 2)(xs) |> size\n(48, 48, 7, 50)\n\njulia> Conv((5,5), 3 => 7; stride = 2, pad = SamePad())(xs) |> size\n(50, 50, 7, 50)\n\njulia> Conv((1,1), 3 => 7; pad = (20,10,0,0))(xs) |> size\n(130, 100, 7, 50)\n\njulia> Conv((5,5), 3 => 7; stride = 2, dilation = 4)(xs) |> size\n(42, 42, 7, 50)\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.Conv-Tuple{AbstractArray}","page":"Built-in Layers","title":"Flux.Conv","text":"Conv(weight::AbstractArray, [bias, activation; stride, pad, dilation])\n\nConstructs a convolutional layer with the given weight and bias. 
Accepts the same keywords and has the same defaults as Conv(k::NTuple{N,Integer}, ch::Pair{<:Integer,<:Integer}, σ; ...).\n\njulia> weight = rand(3, 4, 5);\n\njulia> bias = zeros(5);\n\njulia> layer = Conv(weight, bias, sigmoid) # expects 1 spatial dimension\nConv((3,), 4 => 5, σ) # 65 parameters\n\njulia> layer(randn(100, 4, 64)) |> size\n(98, 5, 64)\n\njulia> Flux.params(layer) |> length\n2\n\n\n\n\n\n","category":"method"},{"location":"reference/models/layers/#Flux.ConvTranspose","page":"Built-in Layers","title":"Flux.ConvTranspose","text":"ConvTranspose(filter, in => out, σ=identity; stride=1, pad=0, outpad=0, dilation=1, [bias, init])\n\nStandard convolutional transpose layer. filter is a tuple of integers specifying the size of the convolutional kernel, while in and out specify the number of input and output channels.\n\nNote that pad=SamePad() here tries to ensure size(output,d) == size(x,d) * stride.\n\nTo conserve Conv invertibility when stride > 1, outpad can be used to increase the size of the output in the desired dimensions. 
Whereas pad is used to zero-pad the input, outpad only affects the output shape.\n\nParameters are controlled by additional keywords, with defaults init=glorot_uniform and bias=true.\n\nSee also Conv for more detailed description of keywords.\n\nExamples\n\njulia> xs = rand32(100, 100, 3, 50); # a batch of 50 RGB images\n\njulia> layer = ConvTranspose((5,5), 3 => 7, relu)\nConvTranspose((5, 5), 3 => 7, relu) # 532 parameters\n\njulia> layer(xs) |> size\n(104, 104, 7, 50)\n\njulia> ConvTranspose((5,5), 3 => 7, stride=2)(xs) |> size\n(203, 203, 7, 50)\n\njulia> ConvTranspose((5,5), 3 => 7, stride=2, outpad=1)(xs) |> size\n(204, 204, 7, 50)\n\njulia> ConvTranspose((5,5), 3 => 7, stride=3, pad=SamePad())(xs) |> size\n(300, 300, 7, 50)\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.ConvTranspose-Tuple{AbstractArray}","page":"Built-in Layers","title":"Flux.ConvTranspose","text":"ConvTranspose(weight::AbstractArray, [bias, activation; stride, pad, outpad, dilation, groups])\n\nConstructs a ConvTranspose layer with the given weight and bias. Accepts the same keywords and has the same defaults as ConvTranspose(k::NTuple{N,Integer}, ch::Pair{<:Integer,<:Integer}, σ; ...).\n\nExamples\n\njulia> weight = rand(3, 4, 5);\n\njulia> bias = zeros(4);\n\njulia> layer = ConvTranspose(weight, bias, sigmoid)\nConvTranspose((3,), 5 => 4, σ) # 64 parameters\n\njulia> layer(randn(100, 5, 64)) |> size # transposed convolution will increase the dimension size (upsampling)\n(102, 4, 64)\n\njulia> Flux.params(layer) |> length\n2\n\n\n\n\n\n","category":"method"},{"location":"reference/models/layers/#Flux.CrossCor","page":"Built-in Layers","title":"Flux.CrossCor","text":"CrossCor(filter, in => out, σ=identity; stride=1, pad=0, dilation=1, [bias, init])\n\nStandard cross correlation layer. 
filter is a tuple of integers specifying the size of the convolutional kernel; in and out specify the number of input and output channels.\n\nParameters are controlled by additional keywords, with defaults init=glorot_uniform and bias=true.\n\nSee also Conv for more detailed description of keywords.\n\nExamples\n\njulia> xs = rand(Float32, 100, 100, 3, 50); # a batch of 50 RGB images\n\njulia> layer = CrossCor((5,5), 3 => 6, relu; bias=false)\nCrossCor((5, 5), 3 => 6, relu, bias=false) # 450 parameters\n\njulia> layer(xs) |> size\n(96, 96, 6, 50)\n\njulia> CrossCor((5,5), 3 => 7, stride=3, pad=(2,0))(xs) |> size\n(34, 32, 7, 50)\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.CrossCor-Tuple{AbstractArray}","page":"Built-in Layers","title":"Flux.CrossCor","text":"CrossCor(weight::AbstractArray, [bias, activation; stride, pad, dilation])\n\nConstructs a CrossCor layer with the given weight and bias. Accepts the same keywords and has the same defaults as CrossCor(k::NTuple{N,Integer}, ch::Pair{<:Integer,<:Integer}, σ; ...).\n\nExamples\n\njulia> weight = rand(3, 4, 5);\n\njulia> bias = zeros(5);\n\njulia> layer = CrossCor(weight, bias, relu)\nCrossCor((3,), 4 => 5, relu) # 65 parameters\n\njulia> layer(randn(100, 4, 64)) |> size\n(98, 5, 64)\n\n\n\n\n\n","category":"method"},{"location":"reference/models/layers/#Flux.DepthwiseConv","page":"Built-in Layers","title":"Flux.DepthwiseConv","text":"DepthwiseConv(filter, in => out, σ=identity; stride=1, pad=0, dilation=1, [bias, init])\nDepthwiseConv(weight::AbstractArray, [bias, activation; stride, pad, dilation])\n\nReturn a depthwise convolutional layer, that is a Conv layer with number of groups equal to the number of input channels.\n\nSee Conv for a description of the arguments.\n\nExamples\n\njulia> xs = rand(Float32, 100, 100, 3, 50); # a batch of 50 RGB images\n\njulia> layer = DepthwiseConv((5,5), 3 => 6, relu; bias=false)\nConv((5, 5), 3 => 6, relu, groups=3, bias=false) # 150 parameters 
\n\njulia> layer(xs) |> size\n(96, 96, 6, 50)\n\njulia> DepthwiseConv((5, 5), 3 => 9, stride=2, pad=2)(xs) |> size\n(50, 50, 9, 50)\n\n\n\n\n\n","category":"function"},{"location":"reference/models/layers/#Flux.SamePad","page":"Built-in Layers","title":"Flux.SamePad","text":"SamePad()\n\nPassed as an option to convolutional layers (and friends), this causes the padding to be chosen such that the input and output sizes agree (on the first N dimensions, the kernel or window) when stride==1. When stride≠1, the output size equals ceil(input_size/stride).\n\nSee also Conv, MaxPool.\n\nExamples\n\njulia> xs = rand32(100, 100, 3, 50); # a batch of images\n\njulia> layer = Conv((2,2), 3 => 7, pad=SamePad())\nConv((2, 2), 3 => 7, pad=(1, 0, 1, 0)) # 91 parameters\n\njulia> layer(xs) |> size # notice how the dimensions stay the same with this padding\n(100, 100, 7, 50)\n\njulia> layer2 = Conv((2,2), 3 => 7)\nConv((2, 2), 3 => 7) # 91 parameters\n\njulia> layer2(xs) |> size # the output dimension changes as the padding was not \"same\"\n(99, 99, 7, 50)\n\njulia> layer3 = Conv((5, 5), 3 => 7, stride=2, pad=SamePad())\nConv((5, 5), 3 => 7, pad=2, stride=2) # 532 parameters\n\njulia> layer3(xs) |> size # output size = `ceil(input_size/stride)` = 50\n(50, 50, 7, 50)\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.flatten","page":"Built-in Layers","title":"Flux.flatten","text":"flatten(x)\n\nSame as MLUtils.flatten, which should be preferred to this method, which exists only for backward compatibility.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/layers/#MultiHeadAttention","page":"Built-in Layers","title":"MultiHeadAttention","text":"","category":"section"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"The basic blocks needed to implement Transformer architectures. 
See also the functional counterparts documented in NNlib's Attention section.","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"MultiHeadAttention","category":"page"},{"location":"reference/models/layers/#Flux.MultiHeadAttention","page":"Built-in Layers","title":"Flux.MultiHeadAttention","text":"MultiHeadAttention(dims; [nheads, bias, init, dropout_prob])\n\nThe multi-head dot-product attention layer used in Transformer architectures [1].\n\nReturns the transformed input sequence and the attention scores.\n\n[1] Vaswani et al. \"Attention is all you need.\" Advances in Neural Information Processing Systems. 2017.\n\nArguments\n\ndims: The embedding dimensions of inputs, intermediate tensors and outputs. In the most general case, it is given as a) (q_in_dim, k_in_dim, v_in_dim) => (qk_dim, v_dim) => out_dim. It can also take the simpler forms b) dims::Int; c) in_dim::Int => (qk_dim, v_dim) => out_dim; d) in_dim::Int => qkv_dim => out_dim.\nnheads: number of heads. Default 8.\ninit: weight initializer for the Dense layers. Default glorot_uniform.\nbias: whether pointwise QKVO dense transforms use bias. Default false.\ndropout_prob: dropout probability for the attention scores. Default 0.0.\n\nForward\n\n(mha::MultiHeadAttention)(q_in, k_in, v_in, [bias]; [mask])\n\nThe arguments of the forward pass are:\n\nq_in: Input query array of size (q_in_dim, q_len, batch_size).\nk_in: Input key array of size (k_in_dim, kv_len, batch_size).\nv_in: Input value array of size (v_in_dim, kv_len, batch_size).\nbias: Bias array broadcastable to size (kv_len, q_len, nheads, batch_size). It will be added to the attention scores before the softmax. Default nothing.\nmask: Input array broadcastable to size (kv_len, q_len, nheads, batch_size). The mask is applied to the attention scores just before the softmax. See NNlib.make_causal_mask for creating causal masks. 
Default nothing.\n\nAlternative calling signatures are mha(q_in), equivalent to mha(q_in, q_in, q_in) (self-attention), and mha(q_in, k_in), equivalent to mha(q_in, k_in, k_in) (key and value are the same).\n\nSee also NNlib.dot_product_attention.\n\nExamples\n\nmha = MultiHeadAttention(64, nheads = 8)\nq = rand(Float32, (64, 10, 32))\nk = rand(Float32, (64, 20, 32))\nv = rand(Float32, (64, 20, 32))\ny, α = mha(q, k, v) \n# [y] = [64, 10, 32]\n# [α] = [20, 10, 8, 32]\n\nmha = MultiHeadAttention(64 => 1024 => 1024, nheads = 8)\ny, α = mha(q) # self-attention\n# [y] = [1024, 10, 32]\n# [α] = [10, 10, 8, 32]\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Pooling","page":"Built-in Layers","title":"Pooling","text":"","category":"section"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"These layers are commonly used after a convolution layer, and reduce the size of its output. They have no trainable parameters.","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"AdaptiveMaxPool\nMaxPool\nGlobalMaxPool\nAdaptiveMeanPool\nMeanPool\nGlobalMeanPool","category":"page"},{"location":"reference/models/layers/#Flux.AdaptiveMaxPool","page":"Built-in Layers","title":"Flux.AdaptiveMaxPool","text":"AdaptiveMaxPool(out::NTuple)\n\nAdaptive max pooling layer. Calculates the necessary window size such that its output has size(y)[1:N] == out.\n\nExpects as input an array with ndims(x) == N+2, i.e. 
channel and batch dimensions, after the N feature dimensions, where N = length(out).\n\nSee also MaxPool, AdaptiveMeanPool.\n\nExamples\n\njulia> xs = rand(Float32, 100, 100, 3, 50); # batch of 50 RGB images\n\njulia> AdaptiveMaxPool((25, 25))(xs) |> size\n(25, 25, 3, 50)\n\njulia> MaxPool((4,4))(xs) ≈ AdaptiveMaxPool((25, 25))(xs)\ntrue\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.MaxPool","page":"Built-in Layers","title":"Flux.MaxPool","text":"MaxPool(window::NTuple; pad=0, stride=window)\n\nMax pooling layer, which replaces all pixels in a block of size window with one.\n\nExpects as input an array with ndims(x) == N+2, i.e. channel and batch dimensions, after the N feature dimensions, where N = length(window).\n\nBy default the window size is also the stride in each dimension. The keyword pad accepts the same options as for the Conv layer, including SamePad().\n\nSee also Conv, MeanPool, AdaptiveMaxPool, GlobalMaxPool.\n\nExamples\n\njulia> xs = rand(Float32, 100, 100, 3, 50); # batch of 50 RGB images\n\njulia> m = Chain(Conv((5, 5), 3 => 7, pad=SamePad()), MaxPool((5, 5), pad=SamePad()))\nChain(\n Conv((5, 5), 3 => 7, pad=2), # 532 parameters\n MaxPool((5, 5), pad=2),\n)\n\njulia> m[1](xs) |> size\n(100, 100, 7, 50)\n\njulia> m(xs) |> size\n(20, 20, 7, 50)\n\njulia> layer = MaxPool((5,), pad=2, stride=(3,)) # one-dimensional window\nMaxPool((5,), pad=2, stride=3)\n\njulia> layer(rand(Float32, 100, 7, 50)) |> size\n(34, 7, 50)\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.GlobalMaxPool","page":"Built-in Layers","title":"Flux.GlobalMaxPool","text":"GlobalMaxPool()\n\nGlobal max pooling layer.\n\nTransforms (w,h,c,b)-shaped input into (1,1,c,b)-shaped output, by performing max pooling on the complete (w,h)-shaped feature maps.\n\nSee also MaxPool, GlobalMeanPool.\n\njulia> xs = rand(Float32, 100, 100, 3, 50);\n\njulia> m = Chain(Conv((3,3), 3 => 7), GlobalMaxPool());\n\njulia> m(xs) |> size\n(1, 1, 7, 
50)\n\njulia> GlobalMaxPool()(rand(3,5,7)) |> size # preserves 2 dimensions\n(1, 5, 7)\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.AdaptiveMeanPool","page":"Built-in Layers","title":"Flux.AdaptiveMeanPool","text":"AdaptiveMeanPool(out::NTuple)\n\nAdaptive mean pooling layer. Calculates the necessary window size such that its output has size(y)[1:N] == out.\n\nExpects as input an array with ndims(x) == N+2, i.e. channel and batch dimensions, after the N feature dimensions, where N = length(out).\n\nSee also MaxPool, AdaptiveMaxPool.\n\nExamples\n\njulia> xs = rand(Float32, 100, 100, 3, 50); # batch of 50 RGB images\n\njulia> AdaptiveMeanPool((25, 25))(xs) |> size\n(25, 25, 3, 50)\n\njulia> MeanPool((4,4))(xs) ≈ AdaptiveMeanPool((25, 25))(xs)\ntrue\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.MeanPool","page":"Built-in Layers","title":"Flux.MeanPool","text":"MeanPool(window::NTuple; pad=0, stride=window)\n\nMean pooling layer, averaging all pixels in a block of size window.\n\nExpects as input an array with ndims(x) == N+2, i.e. channel and batch dimensions, after the N feature dimensions, where N = length(window).\n\nBy default the window size is also the stride in each dimension. 
The keyword pad accepts the same options as for the Conv layer, including SamePad().\n\nSee also Conv, MaxPool, AdaptiveMeanPool.\n\nExamples\n\njulia> xs = rand(Float32, 100, 100, 3, 50);\n\njulia> m = Chain(Conv((5,5), 3 => 7), MeanPool((5,5), pad=SamePad()))\nChain(\n Conv((5, 5), 3 => 7), # 532 parameters\n MeanPool((5, 5), pad=2),\n)\n\njulia> m[1](xs) |> size\n(96, 96, 7, 50)\n\njulia> m(xs) |> size\n(20, 20, 7, 50)\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.GlobalMeanPool","page":"Built-in Layers","title":"Flux.GlobalMeanPool","text":"GlobalMeanPool()\n\nGlobal mean pooling layer.\n\nTransforms (w,h,c,b)-shaped input into (1,1,c,b)-shaped output, by performing mean pooling on the complete (w,h)-shaped feature maps.\n\njulia> xs = rand(Float32, 100, 100, 3, 50);\n\njulia> m = Chain(Conv((3,3), 3 => 7), GlobalMeanPool());\n\njulia> m(xs) |> size\n(1, 1, 7, 50)\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Upsampling","page":"Built-in Layers","title":"Upsampling","text":"","category":"section"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"The opposite of pooling, these layers increase the size of an array. They have no trainable parameters. ","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"Upsample\nPixelShuffle","category":"page"},{"location":"reference/models/layers/#Flux.Upsample","page":"Built-in Layers","title":"Flux.Upsample","text":"Upsample(mode = :nearest; [scale, size]) \nUpsample(scale, mode = :nearest)\n\nAn upsampling layer. One of two keywords must be given:\n\nIf scale is a number, this applies to all but the last two dimensions (channel and batch) of the input. It may also be a tuple, to control dimensions individually. 
Alternatively, keyword size accepts a tuple, to directly specify the leading dimensions of the output.\n\nCurrently supported upsampling modes and corresponding NNlib's methods are:\n\n:nearest -> NNlib.upsample_nearest \n:bilinear -> NNlib.upsample_bilinear\n:trilinear -> NNlib.upsample_trilinear\n\nExamples\n\njulia> m = Upsample(scale = (2, 3))\nUpsample(:nearest, scale = (2, 3))\n\njulia> m(ones(2, 2, 1, 1)) |> size\n(4, 6, 1, 1)\n\njulia> m = Upsample(:bilinear, size = (4, 5))\nUpsample(:bilinear, size = (4, 5))\n\njulia> m(ones(2, 2, 1, 1)) |> size\n(4, 5, 1, 1)\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.PixelShuffle","page":"Built-in Layers","title":"Flux.PixelShuffle","text":"PixelShuffle(r::Int)\n\nPixel shuffling layer with upscale factor r. Usually used for generating higher resolution images while upscaling them.\n\nSee NNlib.pixel_shuffle.\n\nExamples\n\njulia> p = PixelShuffle(2);\n\njulia> xs = [2row + col + channel/10 for row in 1:2, col in 1:2, channel in 1:4, n in 1:1]\n2×2×4×1 Array{Float64, 4}:\n[:, :, 1, 1] =\n 3.1 4.1\n 5.1 6.1\n\n[:, :, 2, 1] =\n 3.2 4.2\n 5.2 6.2\n\n[:, :, 3, 1] =\n 3.3 4.3\n 5.3 6.3\n\n[:, :, 4, 1] =\n 3.4 4.4\n 5.4 6.4\n\njulia> p(xs)\n4×4×1×1 Array{Float64, 4}:\n[:, :, 1, 1] =\n 3.1 3.3 4.1 4.3\n 3.2 3.4 4.2 4.4\n 5.1 5.3 6.1 6.3\n 5.2 5.4 6.2 6.4\n\njulia> xs = [3row + col + channel/10 for row in 1:2, col in 1:3, channel in 1:4, n in 1:1]\n2×3×4×1 Array{Float64, 4}:\n[:, :, 1, 1] =\n 4.1 5.1 6.1\n 7.1 8.1 9.1\n\n[:, :, 2, 1] =\n 4.2 5.2 6.2\n 7.2 8.2 9.2\n\n[:, :, 3, 1] =\n 4.3 5.3 6.3\n 7.3 8.3 9.3\n\n[:, :, 4, 1] =\n 4.4 5.4 6.4\n 7.4 8.4 9.4\n\njulia> p(xs)\n4×6×1×1 Array{Float64, 4}:\n[:, :, 1, 1] =\n 4.1 4.3 5.1 5.3 6.1 6.3\n 4.2 4.4 5.2 5.4 6.2 6.4\n 7.1 7.3 8.1 8.3 9.1 9.3\n 7.2 7.4 8.2 8.4 9.2 9.4\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Embedding-Vectors","page":"Built-in Layers","title":"Embedding 
Vectors","text":"","category":"section"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"These layers accept an index, and return a vector (or several indices, and several vectors). The possible embedding vectors are learned parameters.","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"Flux.Embedding\nFlux.EmbeddingBag","category":"page"},{"location":"reference/models/layers/#Flux.Embedding","page":"Built-in Layers","title":"Flux.Embedding","text":"Embedding(in => out; init=randn32)\n\nA lookup table that stores embeddings of dimension out for a vocabulary of size in, as a trainable matrix.\n\nThis layer is often used to store word embeddings and retrieve them using indices. The input to the layer can be a vocabulary index in 1:in, an array of indices, or the corresponding onehot encoding.\n\nFor indices x, the result is of size (out, size(x)...), allowing several batch dimensions. For one-hot ohx, the result is of size (out, size(ohx)[2:end]...).\n\nExamples\n\njulia> emb = Embedding(26 => 4, init=Flux.identity_init(gain=22))\nEmbedding(26 => 4) # 104 parameters\n\njulia> emb(2) # one column of e.weight (here not random!)\n4-element Vector{Float32}:\n 0.0\n 22.0\n 0.0\n 0.0\n\njulia> emb([3, 1, 20, 14, 4, 15, 7]) # vocabulary indices, in 1:26\n4×7 Matrix{Float32}:\n 0.0 22.0 0.0 0.0 0.0 0.0 0.0\n 0.0 0.0 0.0 0.0 0.0 0.0 0.0\n 22.0 0.0 0.0 0.0 0.0 0.0 0.0\n 0.0 0.0 0.0 0.0 22.0 0.0 0.0\n\njulia> ans == emb(Flux.onehotbatch(\"cat&dog\", 'a':'z', 'n'))\ntrue\n\njulia> emb(rand(1:26, (10, 1, 12))) |> size # three batch dimensions\n(4, 10, 1, 12)\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.EmbeddingBag","page":"Built-in Layers","title":"Flux.EmbeddingBag","text":"EmbeddingBag(in => out, reduction=mean; init=Flux.randn32)\n\nA lookup table that stores embeddings of dimension out for a vocabulary of size in. 
Differs from Embedding in that, instead of acting on a single vocabulary index, it always acts on a vector of indices, which it calls a \"bag\". Their individual embedding vectors are reduced to one, using mean or some other function.\n\nInstead of acting on one \"bag\", such as x::Vector{Int}, the layer can also act on several:\n\nActing on a vector of \"bags\", it produces a matrix whose columns are the reduced vectors. More generally on x::Array{Vector{Int}}, its output is of size (out, size(x)...).\nAny higher-rank array of integers is interpreted as a collection of \"bags\" each along the first dimension. Thus the output is mapslices(e, x; dims=1) when e::EmbeddingBag and x::Array{Int,N}. This method is more efficient, but requires that all \"bags\" have the same length.\nA vector of \"bags\" may also be produced by splitting a vector of indices at specified points. For this case the layer takes two inputs, both vectors of integers. See details below.\n\nThe \"bag\" may equivalently be represented as a OneHotMatrix. A collection of these, or one higher-rank OneHotArray, again produce a stack of embeddings. 
See details below.\n\nExamples\n\njulia> vocab_size = 26; # embed into 3 dimensions, with non-random vectors:\n\njulia> eb = EmbeddingBag(vocab_size => 3, init=Flux.identity_init(gain=100))\nEmbeddingBag(26 => 3) # 78 parameters\n\njulia> eb([2]) # one bag of 1 item\n3-element Vector{Float32}:\n 0.0\n 100.0\n 0.0\n\njulia> eb([3,3,1]) # one bag of 3 items, one mean embedding\n3-element Vector{Float32}:\n 33.333332\n 0.0\n 66.666664\n\njulia> eb([[3,1,3], [2,1]]) # two bags\n3×2 Matrix{Float32}:\n 33.3333 50.0\n 0.0 50.0\n 66.6667 0.0\n\njulia> eb([1 1 1 1; 1 2 3 4]) # 4 bags each of 2 items, eachcol([1 1 1 1; 1 2 3 4])\n3×4 Matrix{Float32}:\n 100.0 50.0 50.0 50.0\n 0.0 50.0 0.0 0.0\n 0.0 0.0 50.0 0.0\n\njulia> eb(rand(1:26, 10, 5, 5)) |> size # 25 bags each of 10 items\n(3, 5, 5)\n\nAnother way to specify \"many bags of many items\" is to provide a vector data (each in 1:in) and a vector at stating where to split that up into \"bags\". The first bag starts with data[at[1]], the second at data[at[2]], and so on, with no overlaps and nothing left out (thus it requires at[1]==1).\n\njulia> data = [11, 1, 12, 2, 13, 3, 14];\n\njulia> data[1:3], data[4:end]\n([11, 1, 12], [2, 13, 3, 14])\n\njulia> eb(data, [1, 4]) # two bags, of 3 and 4 items\n3×2 Matrix{Float32}:\n 33.3333 0.0\n 0.0 25.0\n 0.0 25.0\n\nFinally, each bag may also be represented as a OneHotMatrix.\n\njulia> eb(Flux.onehotbatch(\"bba\", 'a':'z')) # same as [2,2,1], one bag of 3 items\n3-element Vector{Float32}:\n 33.333332\n 66.666664\n 0.0\n\njulia> eb([Flux.onehotbatch(\"bba\", 'a':'z'), Flux.onehotbatch(\"cc\", 'a':'z')]) # two bags\n3×2 Matrix{Float32}:\n 33.3333 0.0\n 66.6667 0.0\n 0.0 100.0\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#man-dataflow-layers","page":"Built-in Layers","title":"Dataflow Layers, or Containers","text":"","category":"section"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"The basic Chain(F, 
G, H) applies the layers it contains in sequence, equivalent to H ∘ G ∘ F. Flux has some other layers which contain layers, but connect them up in a more complicated way: SkipConnection allows ResNet's residual connection.","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"Chain\nFlux.activations\nMaxout\nSkipConnection\nParallel\nPairwiseFusion","category":"page"},{"location":"reference/models/layers/#Flux.Chain","page":"Built-in Layers","title":"Flux.Chain","text":"Chain(layers...)\nChain(name = layer, ...)\n\nCollects multiple layers / functions to be called in sequence on a given input. Supports indexing and slicing, m[2] or m[1:end-1], and if names are given, m[:name] == m[1] etc.\n\nExamples\n\njulia> m = Chain(x -> x^2, x -> x+1);\n\njulia> m(5) == 26\ntrue\n\njulia> m = Chain(Dense(10 => 5, tanh), Dense(5 => 2));\n\njulia> x = rand32(10, 32);\n\njulia> m(x) == m[2](m[1](x))\ntrue\n\njulia> m2 = Chain(enc = Chain(Flux.flatten, Dense(10 => 5, tanh)), \n dec = Dense(5 => 2));\n\njulia> m2(x) == (m2[:dec] ∘ m2[:enc])(x)\ntrue\n\nFor large models, there is a special type-unstable path which can reduce compilation times. This can be used by supplying a vector of layers Chain([layer1, layer2, ...]). This feature is somewhat experimental, beware!\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.activations","page":"Built-in Layers","title":"Flux.activations","text":"activations(c::Chain, input)\n\nLike calling a Chain, but saves the result of each layer as an output.\n\nExamples\n\njulia> using Flux: activations\n\njulia> c = Chain(x -> x + 1, x -> x * 2, x -> x ^ 3);\n\njulia> activations(c, 1)\n(2, 4, 64)\n\n\n\n\n\n","category":"function"},{"location":"reference/models/layers/#Flux.Maxout","page":"Built-in Layers","title":"Flux.Maxout","text":"Maxout(layers...)\nMaxout(f, n_alts)\n\nThis contains a number of internal layers, each of which receives the same input. 
Its output is the elementwise maximum of the internal layers' outputs.\n\nInstead of defining layers individually, you can provide a zero-argument function which constructs them, and the number to construct.\n\nMaxout over linear dense layers satisfies the universal approximation theorem. See Goodfellow, Warde-Farley, Mirza, Courville & Bengio \"Maxout Networks\" https://arxiv.org/abs/1302.4389.\n\nSee also Parallel to reduce with other operators.\n\nExamples\n\njulia> m = Maxout(x -> abs2.(x), x -> x .* 3);\n\njulia> m([-2 -1 0 1 2])\n1×5 Matrix{Int64}:\n 4 1 0 3 6\n\njulia> m3 = Maxout(() -> Dense(5 => 7, tanh), 3)\nMaxout(\n Dense(5 => 7, tanh), # 42 parameters\n Dense(5 => 7, tanh), # 42 parameters\n Dense(5 => 7, tanh), # 42 parameters\n) # Total: 6 arrays, 126 parameters, 888 bytes.\n\njulia> Flux.outputsize(m3, (5, 11))\n(7, 11)\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.SkipConnection","page":"Built-in Layers","title":"Flux.SkipConnection","text":"SkipConnection(layer, connection)\n\nCreate a skip connection which consists of a layer or Chain of consecutive layers and a shortcut connection linking the block's input to the output through a user-supplied 2-argument callable. The first argument to the callable will be propagated through the given layer while the second is the unchanged, \"skipped\" input.\n\nThe simplest \"ResNet\"-type connection is just SkipConnection(layer, +). 
Here is a more complicated example:\n\njulia> m = Conv((3,3), 4 => 7, pad=(1,1));\n\njulia> x = ones(Float32, 5, 5, 4, 10);\n\njulia> size(m(x)) == (5, 5, 7, 10)\ntrue\n\njulia> sm = SkipConnection(m, (mx, x) -> cat(mx, x, dims=3));\n\njulia> size(sm(x)) == (5, 5, 11, 10)\ntrue\n\nSee also Parallel, Maxout.\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.Parallel","page":"Built-in Layers","title":"Flux.Parallel","text":"Parallel(connection, layers...)\nParallel(connection; name = layer, ...)\n\nCreate a layer which passes an input array to each path in layers, before reducing the output with connection.\n\nCalled with one input x, this is equivalent to connection([l(x) for l in layers]...). If called with multiple inputs, one is passed to each layer, thus Parallel(+, f, g)(x, y) = f(x) + g(y).\n\nLike Chain, its sub-layers may be given names using the keyword constructor. These can be accessed by indexing: m[1] == m[:name] is the first layer.\n\nSee also SkipConnection which is Parallel with one identity, and Maxout which reduces by broadcasting max.\n\nExamples\n\njulia> model = Chain(Dense(3 => 5),\n Parallel(vcat, Dense(5 => 4), Chain(Dense(5 => 7), Dense(7 => 4))),\n Dense(8 => 17));\n\njulia> model(rand32(3)) |> size\n(17,)\n\njulia> model2 = Parallel(+; α = Dense(10 => 2, tanh), β = Dense(5 => 2))\nParallel(\n +,\n α = Dense(10 => 2, tanh), # 22 parameters\n β = Dense(5 => 2), # 12 parameters\n) # Total: 4 arrays, 34 parameters, 392 bytes.\n\njulia> model2(rand32(10), rand32(5)) |> size\n(2,)\n\njulia> model2[:α](rand32(10)) |> size\n(2,)\n\njulia> model2[:β] == model2[2]\ntrue\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.PairwiseFusion","page":"Built-in Layers","title":"Flux.PairwiseFusion","text":"PairwiseFusion(connection, layers...)\n\nArguments\n\nconnection: A function taking 2 inputs and combining them into a single output \nlayers: The layers whose outputs are combined\n\nInputs\n\nThis layer 
behaves differently based on input type:\n\nIf input x is a tuple of length N (or the input is xs with N x's), matching the number of layers, \n\nthen each layer receives a new input x[i] combined with the previous output y[i-1] using connection. Thus (y1, y2, y3) = PairwiseFusion(connection, layer1, layer2, layer3)((x1, x2, x3)) may be drawn as:\n\nx1 → layer1 → y1 ↘\n connection → layer2 → y2 ↘\n x2 ↗ connection → layer3 → y3\n x3 ↗\n\n... or written as:\n\ny1 = layer1(x1)\ny2 = layer2(connection(y1, x2))\ny3 = layer3(connection(y2, x3))\n\nWith just one input, each layer receives the same x combined with the previous output. Thus y = PairwiseFusion(connection, layers...)(x) obeys:\n\ny[1] == layers[1](x)\nfor i in 2:length(layers)\n y[i] == connection(layers[i](y[i-1]), x)\nend\n\nReturns\n\nA tuple of length N with the output of each fusion ((y1, y2, ..., yN) in the example above).\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Recurrent-Models","page":"Built-in Layers","title":"Recurrent Models","text":"","category":"section"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"Much like the core layers above, but can be used to process sequence data (as well as other kinds of structured data).","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"RNN\nLSTM\nGRU\nGRUv3\nFlux.Recur\nFlux.reset!","category":"page"},{"location":"reference/models/layers/#Flux.RNN","page":"Built-in Layers","title":"Flux.RNN","text":"RNN(in => out, σ = tanh)\n\nThe most basic recurrent layer; essentially acts as a Dense layer, but with the output fed back into the input each time step.\n\nThe arguments in and out describe the size of the feature vectors passed as input and as output. 
That is, it accepts a vector of length in or a batch of vectors represented as a in x B matrix and outputs a vector of length out or a batch of vectors of size out x B.\n\nThis constructor is syntactic sugar for Recur(RNNCell(a...)), and so RNNs are stateful. Note that the state shape can change depending on the inputs, and so it is good to reset! the model between inference calls if the batch size changes. See the examples below.\n\nExamples\n\njulia> r = RNN(3 => 5)\nRecur(\n RNNCell(3 => 5, tanh), # 50 parameters\n) # Total: 4 trainable arrays, 50 parameters,\n # plus 1 non-trainable, 5 parameters, summarysize 432 bytes.\n\njulia> r(rand(Float32, 3)) |> size\n(5,)\n\njulia> Flux.reset!(r);\n\njulia> r(rand(Float32, 3, 10)) |> size # batch size of 10\n(5, 10)\n\nwarning: Batch size changes\nFailing to call reset! when the input batch size changes can lead to unexpected behavior. See the following example:julia> r = RNN(3 => 5)\nRecur(\n RNNCell(3 => 5, tanh), # 50 parameters\n) # Total: 4 trainable arrays, 50 parameters,\n # plus 1 non-trainable, 5 parameters, summarysize 432 bytes.\n\njulia> r.state |> size\n(5, 1)\n\njulia> r(rand(Float32, 3)) |> size\n(5,)\n\njulia> r.state |> size\n(5, 1)\n\njulia> r(rand(Float32, 3, 10)) |> size # batch size of 10\n(5, 10)\n\njulia> r.state |> size # state shape has changed\n(5, 10)\n\njulia> r(rand(Float32, 3)) |> size # erroneously outputs a length 5*10 = 50 vector.\n(50,)\n\nNote:\n\nRNNCells can be constructed directly by specifying the non-linear function, the Wi and Wh internal matrices, a bias vector b, and a learnable initial state state0. 
The Wi and Wh matrices do not need to be the same type, but if Wh is dxd, then Wi should be of shape dxN.\n\njulia> using LinearAlgebra\n\njulia> r = Flux.Recur(Flux.RNNCell(tanh, rand(5, 4), Tridiagonal(rand(5, 5)), rand(5), rand(5, 1)))\n\njulia> r(rand(4, 10)) |> size # batch size of 10\n(5, 10)\n\n\n\n\n\n","category":"function"},{"location":"reference/models/layers/#Flux.LSTM","page":"Built-in Layers","title":"Flux.LSTM","text":"LSTM(in => out)\n\nLong Short Term Memory recurrent layer. Behaves like an RNN but generally exhibits a longer memory span over sequences.\n\nThe arguments in and out describe the size of the feature vectors passed as input and as output. That is, it accepts a vector of length in or a batch of vectors represented as a in x B matrix and outputs a vector of length out or a batch of vectors of size out x B.\n\nThis constructor is syntactic sugar for Recur(LSTMCell(a...)), and so LSTMs are stateful. Note that the state shape can change depending on the inputs, and so it is good to reset! the model between inference calls if the batch size changes. See the examples below.\n\nSee this article for a good overview of the internals.\n\nExamples\n\njulia> l = LSTM(3 => 5)\nRecur(\n LSTMCell(3 => 5), # 190 parameters\n) # Total: 5 trainable arrays, 190 parameters,\n # plus 2 non-trainable, 10 parameters, summarysize 1.062 KiB.\n\njulia> l(rand(Float32, 3)) |> size\n(5,)\n\njulia> Flux.reset!(l);\n\njulia> l(rand(Float32, 3, 10)) |> size # batch size of 10\n(5, 10)\n\nwarning: Batch size changes\nFailing to call reset! when the input batch size changes can lead to unexpected behavior. See the example in RNN.\n\nNote:\n\nLSTMCells can be constructed directly by specifying the non-linear function, the Wi and Wh internal matrices, a bias vector b, and a learnable initial state state0. The Wi and Wh matrices do not need to be the same type. 
See the example in RNN.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/layers/#Flux.GRU","page":"Built-in Layers","title":"Flux.GRU","text":"GRU(in => out)\n\nGated Recurrent Unit layer. Behaves like an RNN but generally exhibits a longer memory span over sequences. This implements the variant proposed in v1 of the referenced paper.\n\nThe integer arguments in and out describe the size of the feature vectors passed as input and as output. That is, it accepts a vector of length in or a batch of vectors represented as a in x B matrix and outputs a vector of length out or a batch of vectors of size out x B.\n\nThis constructor is syntactic sugar for Recur(GRUCell(a...)), and so GRUs are stateful. Note that the state shape can change depending on the inputs, and so it is good to reset! the model between inference calls if the batch size changes. See the examples below.\n\nSee this article for a good overview of the internals.\n\nExamples\n\njulia> g = GRU(3 => 5)\nRecur(\n GRUCell(3 => 5), # 140 parameters\n) # Total: 4 trainable arrays, 140 parameters,\n # plus 1 non-trainable, 5 parameters, summarysize 792 bytes.\n\njulia> g(rand(Float32, 3)) |> size\n(5,)\n\njulia> Flux.reset!(g);\n\njulia> g(rand(Float32, 3, 10)) |> size # batch size of 10\n(5, 10)\n\nwarning: Batch size changes\nFailing to call reset! when the input batch size changes can lead to unexpected behavior. See the example in RNN.\n\nNote:\n\nGRUCells can be constructed directly by specifying the non-linear function, the Wi and Wh internal matrices, a bias vector b, and a learnable initial state state0. The Wi and Wh matrices do not need to be the same type. See the example in RNN.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/layers/#Flux.GRUv3","page":"Built-in Layers","title":"Flux.GRUv3","text":"GRUv3(in => out)\n\nGated Recurrent Unit layer. Behaves like an RNN but generally exhibits a longer memory span over sequences. 
This implements the variant proposed in v3 of the referenced paper.\n\nThe arguments in and out describe the size of the feature vectors passed as input and as output. That is, it accepts a vector of length in or a batch of vectors represented as a in x B matrix and outputs a vector of length out or a batch of vectors of size out x B.\n\nThis constructor is syntactic sugar for Recur(GRUv3Cell(a...)), and so GRUv3s are stateful. Note that the state shape can change depending on the inputs, and so it is good to reset! the model between inference calls if the batch size changes. See the examples below.\n\nSee this article for a good overview of the internals.\n\nExamples\n\njulia> g = GRUv3(3 => 5)\nRecur(\n GRUv3Cell(3 => 5), # 140 parameters\n) # Total: 5 trainable arrays, 140 parameters,\n # plus 1 non-trainable, 5 parameters, summarysize 848 bytes.\n\njulia> g(rand(Float32, 3)) |> size\n(5,)\n\njulia> Flux.reset!(g);\n\njulia> g(rand(Float32, 3, 10)) |> size # batch size of 10\n(5, 10)\n\nwarning: Batch size changes\nFailing to call reset! when the input batch size changes can lead to unexpected behavior. See the example in RNN.\n\nNote:\n\nGRUv3Cells can be constructed directly by specifying the non-linear function, the Wi, Wh, and Wh_h internal matrices, a bias vector b, and a learnable initial state state0. The Wi, Wh, and Wh_h matrices do not need to be the same type. See the example in RNN.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/layers/#Flux.Recur","page":"Built-in Layers","title":"Flux.Recur","text":"Recur(cell)\n\nRecur takes a recurrent cell and makes it stateful, managing the hidden state in the background. 
cell should be a model of the form:\n\nh, y = cell(h, x...)\n\nFor example, here's a recurrent network that keeps a running total of its inputs:\n\nExamples\n\njulia> accum(h, x) = (h + x, x)\naccum (generic function with 1 method)\n\njulia> rnn = Flux.Recur(accum, 0)\nRecur(accum)\n\njulia> rnn(2) \n2\n\njulia> rnn(3)\n3\n\njulia> rnn.state\n5\n\nFolding over a 3d Array of dimensions (features, batch, time) is also supported:\n\njulia> accum(h, x) = (h .+ x, x)\naccum (generic function with 1 method)\n\njulia> rnn = Flux.Recur(accum, zeros(Int, 1, 1))\nRecur(accum)\n\njulia> rnn([2])\n1-element Vector{Int64}:\n 2\n\njulia> rnn([3])\n1-element Vector{Int64}:\n 3\n\njulia> rnn.state\n1×1 Matrix{Int64}:\n 5\n\njulia> out = rnn(reshape(1:10, 1, 1, :)); # apply to a sequence of (features, batch, time)\n\njulia> out |> size\n(1, 1, 10)\n\njulia> vec(out)\n10-element Vector{Int64}:\n 1\n 2\n 3\n 4\n 5\n 6\n 7\n 8\n 9\n 10\n\njulia> rnn.state\n1×1 Matrix{Int64}:\n 60\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.reset!","page":"Built-in Layers","title":"Flux.reset!","text":"reset!(rnn)\n\nReset the hidden state of a recurrent layer back to its original value.\n\nAssuming you have a Recur layer rnn, this is roughly equivalent to:\n\nrnn.state = hidden(rnn.cell)\n\nExamples\n\njulia> r = Flux.RNNCell(relu, ones(1,1), zeros(1,1), ones(1,1), zeros(1,1)); # users should use the RNN wrapper struct instead\n\njulia> y = Flux.Recur(r, ones(1,1));\n\njulia> y.state\n1×1 Matrix{Float64}:\n 1.0\n\njulia> y(ones(1,1)) # relu(1*1 + 1)\n1×1 Matrix{Float64}:\n 2.0\n\njulia> y.state\n1×1 Matrix{Float64}:\n 2.0\n\njulia> Flux.reset!(y)\n1×1 Matrix{Float64}:\n 0.0\n\njulia> y.state\n1×1 Matrix{Float64}:\n 0.0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/layers/#Normalisation-and-Regularisation","page":"Built-in Layers","title":"Normalisation & 
Regularisation","text":"","category":"section"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"These layers don't affect the structure of the network but may improve training times or reduce overfitting. Some of them contain trainable parameters, while others do not.","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"BatchNorm\nDropout\nAlphaDropout\nLayerNorm\nInstanceNorm\nGroupNorm\nFlux.normalise","category":"page"},{"location":"reference/models/layers/#Flux.BatchNorm","page":"Built-in Layers","title":"Flux.BatchNorm","text":"BatchNorm(channels::Integer, λ=identity;\n initβ=zeros32, initγ=ones32,\n affine=true, track_stats=true, active=nothing,\n eps=1f-5, momentum= 0.1f0)\n\nBatch Normalization layer. channels should be the size of the channel dimension in your data (see below).\n\nGiven an array with N dimensions, call the N-1th the channel dimension. For a batch of feature vectors this is just the data dimension, for WHCN images it's the usual channel dimension.\n\nBatchNorm computes the mean and variance for each D_1×...×D_{N-2}×1×D_N input slice and normalises the input accordingly.\n\nIf affine=true, it also applies a shift and a rescale to the input through to learnable per-channel bias β and scale γ parameters.\n\nAfter normalisation, elementwise activation λ is applied.\n\nIf track_stats=true, accumulates mean and var statistics in training phase that will be used to renormalize the input in test phase.\n\nUse testmode! 
during inference.\n\nExamples\n\njulia> using Statistics\n\njulia> xs = rand(3, 3, 3, 2); # a batch of 2 images, each having 3 channels\n\njulia> m = BatchNorm(3);\n\njulia> Flux.trainmode!(m);\n\njulia> isapprox(std(m(xs)), 1, atol=0.1) && std(xs) != std(m(xs))\ntrue\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.Dropout","page":"Built-in Layers","title":"Flux.Dropout","text":"Dropout(p; [dims, rng, active])\n\nLayer implementing dropout with the given probability. This is used as a regularisation, i.e. to reduce overfitting.\n\nWhile training, it sets each input to 0 (with probability p) or else scales it by 1 / (1 - p), using the NNlib.dropout function. While testing, it has no effect.\n\nBy default the mode will switch automatically, but it can also be controlled manually via Flux.testmode!, or by passing keyword active=true for training mode.\n\nBy default every input is treated independently. With the dims keyword, instead it takes a random choice only along that dimension. For example Dropout(p; dims = 3) will randomly zero out entire channels on WHCN input (also called 2D dropout).\n\nKeyword rng lets you specify a custom random number generator. 
(Only supported on the CPU.)\n\nExamples\n\njulia> m = Chain(Dense(ones(3,2)), Dropout(0.4))\nChain(\n Dense(2 => 3), # 9 parameters\n Dropout(0.4),\n)\n\njulia> m(ones(2, 7)) # test mode, no effect\n3×7 Matrix{Float64}:\n 2.0 2.0 2.0 2.0 2.0 2.0 2.0\n 2.0 2.0 2.0 2.0 2.0 2.0 2.0\n 2.0 2.0 2.0 2.0 2.0 2.0 2.0\n\njulia> Flux.trainmode!(m) # equivalent to use within gradient\nChain(\n Dense(2 => 3), # 9 parameters\n Dropout(0.4, active=true),\n)\n\njulia> m(ones(2, 7))\n3×7 Matrix{Float64}:\n 0.0 0.0 3.33333 0.0 0.0 0.0 0.0\n 3.33333 0.0 3.33333 0.0 3.33333 0.0 3.33333\n 3.33333 3.33333 0.0 3.33333 0.0 0.0 3.33333\n\njulia> y = m(ones(2, 10_000));\n\njulia> using Statistics\n\njulia> mean(y) # is about 2.0, same as in test mode\n1.9989999999999961\n\njulia> mean(iszero, y) # is about 0.4\n0.4003\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.AlphaDropout","page":"Built-in Layers","title":"Flux.AlphaDropout","text":"AlphaDropout(p; [rng, active])\n\nA dropout layer. Used in Self-Normalizing Neural Networks. The AlphaDropout layer ensures that mean and variance of activations remain the same as before.\n\nDoes nothing to the input once testmode! is true.\n\nExamples\n\njulia> using Statistics\n\njulia> x = randn32(1000,1);\n\njulia> m = Chain(Dense(1000 => 1000, selu), AlphaDropout(0.2));\n\njulia> Flux.trainmode!(m);\n\njulia> y = m(x);\n\njulia> isapprox(std(x), std(y), atol=0.2)\ntrue\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.LayerNorm","page":"Built-in Layers","title":"Flux.LayerNorm","text":"LayerNorm(size..., λ=identity; affine=true, eps=1f-5)\n\nA normalisation layer designed to be used with recurrent hidden states. The argument size should be an integer or a tuple of integers.\n\nIn the forward pass, the layer normalises the mean and standard deviation of the input, then applies the elementwise activation λ. 
The input is normalised along the first length(size) dimensions for tuple size, and along the first dimension for integer size. The input is expected to have first dimensions' size equal to size.\n\nIf affine=true, it also applies a learnable shift and rescaling using the Scale layer.\n\nSee also BatchNorm, InstanceNorm, GroupNorm, and normalise.\n\nExamples\n\njulia> using Statistics\n\njulia> xs = rand(3, 3, 3, 2); # a batch of 2 images, each having 3 channels\n\njulia> m = LayerNorm(3);\n\njulia> y = m(xs);\n\njulia> isapprox(std(y, dims=1:3), ones(1, 1, 1, 2), atol=0.1) && std(y, dims=1:3) != std(xs, dims=1:3)\ntrue\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.InstanceNorm","page":"Built-in Layers","title":"Flux.InstanceNorm","text":"InstanceNorm(channels::Integer, λ=identity;\n initβ=zeros32, initγ=ones32,\n affine=false, track_stats=false,\n eps=1f-5, momentum=0.1f0)\n\nInstance Normalization layer. channels should be the size of the channel dimension in your data (see below).\n\nGiven an array with N > 2 dimensions, call the N-1th the channel dimension. 
For WHCN images it's the usual channel dimension.\n\nInstanceNorm computes the mean and variance for each D_1×...×D_{N-2}×1×1 input slice and normalises the input accordingly.\n\nIf affine=true, it also applies a shift and a rescale to the input through to learnable per-channel bias β and scale γ parameters.\n\nIf track_stats=true, accumulates mean and var statistics in training phase that will be used to renormalize the input in test phase.\n\nWarning: the defaults for affine and track_stats used to be true in previous Flux versions (< v0.12).\n\nExamples\n\njulia> using Statistics\n\njulia> xs = rand(3, 3, 3, 2); # a batch of 2 images, each having 3 channels\n\njulia> m = InstanceNorm(3);\n\njulia> y = m(xs);\n\njulia> isapprox(std(y, dims=1:2), ones(1, 1, 3, 2), atol=0.2) && std(y, dims=1:2) != std(xs, dims=1:2)\ntrue\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.GroupNorm","page":"Built-in Layers","title":"Flux.GroupNorm","text":"GroupNorm(channels::Int, G::Int, λ = identity;\n initβ = zeros32,\n initγ = ones32,\n affine = true,\n eps = 1f-5,\n momentum = 0.1f0)\n\nGroup Normalization layer.\n\nchs is the number of channels, the channel dimension of your input. For an array of N dimensions, the N-1th index is the channel dimension.\n\nG is the number of groups along which the statistics are computed. The number of channels must be an integer multiple of the number of groups.\n\nchannels should be the size of the channel dimension in your data (see below).\n\nGiven an array with N > 2 dimensions, call the N-1th the channel dimension. 
For WHCN images it's the usual channel dimension.\n\nIf affine=true, it also applies a shift and a rescale to the input through to learnable per-channel bias β and scale γ parameters.\n\nExamples\n\njulia> using Statistics\n\njulia> xs = rand(3, 3, 4, 2); # a batch of 2 images, each having 4 channels\n\njulia> m = GroupNorm(4, 2);\n\njulia> y = m(xs);\n\njulia> isapprox(std(y[:, :, 1:2, 1]), 1, atol=0.1) && std(xs[:, :, 1:2, 1]) != std(y[:, :, 1:2, 1])\ntrue\n\njulia> isapprox(std(y[:, :, 3:4, 2]), 1, atol=0.1) && std(xs[:, :, 3:4, 2]) != std(y[:, :, 3:4, 2])\ntrue\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.normalise","page":"Built-in Layers","title":"Flux.normalise","text":"normalise(x; dims=ndims(x), eps=1e-5)\n\nNormalise x to mean 0 and standard deviation 1 across the dimension(s) given by dims. Per default, dims is the last dimension. eps is a small term added to the denominator for numerical stability.\n\nExamples\n\njulia> using Statistics\n\njulia> x = [90, 100, 110, 130, 70];\n\njulia> mean(x), std(x; corrected=false)\n(100.0, 20.0)\n\njulia> y = Flux.normalise(x)\n5-element Vector{Float64}:\n -0.49999975000012503\n 0.0\n 0.49999975000012503\n 1.499999250000375\n -1.499999250000375\n\njulia> isapprox(std(y; corrected=false), 1, atol=1e-5)\ntrue\n\njulia> x = rand(10:100, 10, 10);\n\njulia> y = Flux.normalise(x, dims=1);\n\njulia> isapprox(std(y; dims=1, corrected=false), ones(1, 10), atol=1e-5)\ntrue\n\n\n\n\n\n","category":"function"},{"location":"reference/models/layers/#Test-vs.-Train","page":"Built-in Layers","title":"Test vs. Train","text":"","category":"section"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"Several normalisation layers behave differently under training and inference (testing). By default, Flux will automatically determine when a layer evaluation is part of training or inference. 
","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"warning: Warning\nThis automatic train/test detection works best with Zygote, the default automatic differentiation package. It may not work with other packages such as Tracker, Yota, or ForwardDiff.","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"The functions Flux.trainmode! and Flux.testmode! let you manually specify which behaviour you want. When called on a model, they will place all layers within the model into the specified mode.","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"testmode!(::Any)\ntestmode!(::Any, ::Any)\ntrainmode!","category":"page"},{"location":"reference/models/layers/#Flux.testmode!-Tuple{Any}","page":"Built-in Layers","title":"Flux.testmode!","text":"testmode!(model, [mode]) -> model\n\nSet a layer, or all layers in a model, to test mode. This disables the effect of Dropout and some other regularisation layers.\n\nIf you manually set a model into test mode, you need to manually place it back into train mode during training phase, using trainmode!.\n\nThere is an optional second argument, which takes a symbol :auto to reset all layers back to the default automatic mode.\n\nExample\n\njulia> d = Dropout(0.3)\nDropout(0.3)\n\njulia> testmode!(d) # dropout is now always disabled\nDropout(0.3, active=false)\n\njulia> trainmode!(d) # dropout is now always enabled\nDropout(0.3, active=true)\n\njulia> testmode!(d, :auto) # back to default\nDropout(0.3)\n\n\n\n\n\n","category":"method"},{"location":"reference/models/layers/#Flux.testmode!-Tuple{Any, Any}","page":"Built-in Layers","title":"Flux.testmode!","text":"testmode!(model, inactive)\n\nThis two-argument method is largely internal. 
It recurses into the model until a method like testmode!(d::Dropout, inactive) alters the activity of a layer. Custom layers can support manual testmode! / trainmode! switching by defining such a method.\n\nPossible values of inactive are:\n\ntrue for testing, i.e. active=false\nfalse for training, same as trainmode!(m)\n:auto or nothing for Flux to detect training automatically.\n\ncompat: Compat\nThis method may be removed in a future breaking change, to separate the user-facing testmode! from the internal recursion.\n\n\n\n\n\n","category":"method"},{"location":"reference/models/layers/#Flux.trainmode!","page":"Built-in Layers","title":"Flux.trainmode!","text":"trainmode!(model) -> model\n\nSet a layer, or all layers in a model, to training mode. This is the opposite of testmode!; see further details there.\n\n\n\n\n\ntrainmode!(m, active)\n\nwarning: Warning\nThis two-argument method is deprecated.\n\nPossible values of active are:\n\ntrue for training, or \nfalse for testing, same as testmode!(m)\n:auto or nothing for Flux to detect training automatically.\n\n\n\n\n\n","category":"function"},{"location":"guide/models/overview/#man-overview","page":"Fitting a Line","title":"Flux Overview: Fitting a Straight Line","text":"","category":"section"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Flux is a pure Julia ML stack that allows you to build predictive models. 
Here are the steps for a typical Flux program:","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Provide training and test data\nBuild a model with configurable parameters to make predictions\nIteratively train the model by tweaking the parameters to improve predictions\nVerify your model","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Under the hood, Flux uses a technique called automatic differentiation to take gradients that help improve predictions. Flux is also fully written in Julia so you can easily replace any layer of Flux with your own code to improve your understanding or satisfy special requirements.","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Here's how you'd use Flux to build and train the most basic of models, step by step.","category":"page"},{"location":"guide/models/overview/#A-Trivial-Prediction","page":"Fitting a Line","title":"A Trivial Prediction","text":"","category":"section"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"This example will predict the output of the function 4x + 2. Making such predictions is called \"linear regression\", and is really too simple to need a neural network. 
But it's a nice toy example.","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"First, import Flux and define the function we want to simulate:","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> using Flux\n\njulia> actual(x) = 4x + 2\nactual (generic function with 1 method)","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"This example will build a model to approximate the actual function.","category":"page"},{"location":"guide/models/overview/#1.-Provide-Training-and-Test-Data","page":"Fitting a Line","title":"1. Provide Training and Test Data","text":"","category":"section"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Use the actual function to build sets of data for training and verification:","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> x_train, x_test = hcat(0:5...), hcat(6:10...)\n([0 1 … 4 5], [6 7 … 9 10])\n\njulia> y_train, y_test = actual.(x_train), actual.(x_test)\n([2 6 … 18 22], [26 30 … 38 42])","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Normally, your training and test data come from real world observations, but here we simulate them.","category":"page"},{"location":"guide/models/overview/#2.-Build-a-Model-to-Make-Predictions","page":"Fitting a Line","title":"2. 
Build a Model to Make Predictions","text":"","category":"section"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Now, build a model to make predictions with 1 input and 1 output:","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> model = Dense(1 => 1)\nDense(1 => 1) # 2 parameters\n\njulia> model.weight\n1×1 Matrix{Float32}:\n 0.95041317\n\njulia> model.bias\n1-element Vector{Float32}:\n 0.0","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Under the hood, a dense layer is a struct with fields weight and bias. weight holds the weight matrix and bias holds the bias vector. There's another way to think about a model. In Flux, models are conceptually predictive functions: ","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> predict = Dense(1 => 1)\nDense(1 => 1) # 2 parameters","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Dense(1 => 1) also implements the function σ(Wx+b) where W and b are the weights and biases. σ is an activation function (more on activations later). Our model has one weight and one bias, but typical models will have many more. Think of weights and biases as knobs and levers Flux can use to tune predictions. Activation functions are transformations that tailor models to your needs. 
","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"This model will already make predictions, though not accurate ones yet:","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> predict(x_train)\n1×6 Matrix{Float32}:\n 0.0 0.906654 1.81331 2.71996 3.62662 4.53327","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"In order to make better predictions, you'll need to provide a loss function to tell Flux how to objectively evaluate the quality of a prediction. Loss functions compute the cumulative distance between actual values and predictions. ","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> using Statistics\n\njulia> loss(model, x, y) = mean(abs2.(model(x) .- y));\n\njulia> loss(predict, x_train, y_train)\n122.64734f0","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"More accurate predictions will yield a lower loss. You can write your own loss functions or rely on those already provided by Flux. This loss function is called mean squared error (and built-in as mse). Flux works by iteratively reducing the loss through training.","category":"page"},{"location":"guide/models/overview/#3.-Improve-the-Prediction","page":"Fitting a Line","title":"3. Improve the Prediction","text":"","category":"section"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Under the hood, the Flux Flux.train! 
function uses a loss function and training data to improve the parameters of your model based on a pluggable optimiser:","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> using Flux: train!\n\njulia> opt = Descent()\nDescent(0.1)\n\njulia> data = [(x_train, y_train)]\n1-element Vector{Tuple{Matrix{Int64}, Matrix{Int64}}}:\n ([0 1 … 4 5], [2 6 … 18 22])","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Now, we have the optimiser and data we'll pass to train!. All that remains are the parameters of the model. Recall that each model is a Julia struct with a function and configurable parameters. The dense layer has weights and biases that depend on the dimensions of the inputs and outputs: ","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> predict.weight\n1×1 Matrix{Float32}:\n 0.9066542\n\njulia> predict.bias\n1-element Vector{Float32}:\n 0.0","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"The dimensions of these model parameters depend on the number of inputs and outputs.","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Flux will adjust predictions by iteratively changing these parameters according to the optimiser.","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"This optimiser implements the classic gradient descent strategy. Now improve the parameters of the model with a call to Flux.train! 
like this:","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> train!(loss, predict, data, opt)","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"And check the loss:","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> loss(predict, x_train, y_train)\n116.38745f0","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"It went down. Why? ","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> predict.weight, predict.bias\n(Float32[7.246838;;], Float32[1.748103])","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"The parameters have changed. This single step is the essence of machine learning.","category":"page"},{"location":"guide/models/overview/#3.-Iteratively-Train-the-Model","page":"Fitting a Line","title":"3+. Iteratively Train the Model","text":"","category":"section"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"In the previous section, we made a single call to train! which iterates over the data we passed in just once. An epoch refers to one pass over the dataset. Typically, we will run the training for multiple epochs to drive the loss down even further. 
Let's run it a few more times:","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> for epoch in 1:200\n train!(loss, predict, data, opt)\n end\n\njulia> loss(predict, x_train, y_train)\n0.00339581f0\n\njulia> predict.weight, predict.bias\n(Float32[4.0159144;;], Float32[2.004479])","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"After 200 training steps, the loss went down, and the parameters are getting close to those in the function the model is built to predict.","category":"page"},{"location":"guide/models/overview/#4.-Verify-the-Results","page":"Fitting a Line","title":"4. Verify the Results","text":"","category":"section"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Now, let's verify the predictions:","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> predict(x_test)\n1×5 Matrix{Float32}:\n 26.1121 30.13 34.1479 38.1657 42.1836\n\njulia> y_test\n1×5 Matrix{Int64}:\n 26 30 34 38 42","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"The predictions are good. Here's how we got there. ","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"First, we gathered real-world data into the variables x_train, y_train, x_test, and y_test. The x_* data defines inputs, and the y_* data defines outputs. The *_train data is for training the model, and the *_test data is for verifying the model. Our data was based on the function 4x + 2.","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Then, we built a single input, single output predictive model, predict = Dense(1 => 1). 
The initial predictions weren't accurate, because we had not trained the model yet.","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"After building the model, we trained it with train!(loss, predict, data, opt). The loss function is first, followed by the model itself, the training data, and the Descent optimiser provided by Flux. We ran the training step once, and observed that the parameters changed and the loss went down. Then, we ran train! many times to finish the training process.","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"After we trained the model, we checked its predictions against the test data.","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"This overall flow represents how Flux works. Let's drill down a bit to understand what's going on inside the individual layers of Flux.","category":"page"},{"location":"reference/destructure/#man-destructure","page":"Flat vs. Nested","title":"Flat vs. Nested Structures","text":"","category":"section"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. Nested","text":"A Flux model is a nested structure, with parameters stored within many layers. Sometimes you may want a flat representation of them, to interact with functions expecting just one vector. This is provided by destructure:","category":"page"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. 
Nested","text":"julia> model = Chain(Dense(2=>1, tanh), Dense(1=>1))\nChain(\n Dense(2 => 1, tanh), # 3 parameters\n Dense(1 => 1), # 2 parameters\n) # Total: 4 arrays, 5 parameters, 276 bytes.\n\njulia> flat, rebuild = Flux.destructure(model)\n(Float32[0.863101, 1.2454957, 0.0, -1.6345707, 0.0], Restructure(Chain, ..., 5))\n\njulia> rebuild(zeros(5)) # same structure, new parameters\nChain(\n Dense(2 => 1, tanh), # 3 parameters (all zero)\n Dense(1 => 1), # 2 parameters (all zero)\n) # Total: 4 arrays, 5 parameters, 276 bytes.","category":"page"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. Nested","text":"Both destructure and the Restructure function can be used within gradient computations. For instance, this computes the Hessian ∂²L/∂θᵢ∂θⱼ of some loss function, with respect to all parameters of the Flux model. The resulting matrix has off-diagonal entries, which cannot really be expressed in a nested structure:","category":"page"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. 
Nested","text":"julia> x = rand(Float32, 2, 16);\n\njulia> grad = gradient(m -> sum(abs2, m(x)), model) # nested gradient\n((layers = ((weight = Float32[10.339018 11.379145], bias = Float32[22.845667], σ = nothing), (weight = Float32[-29.565302;;], bias = Float32[-37.644184], σ = nothing)),),)\n\njulia> function loss(v::Vector)\n m = rebuild(v)\n y = m(x)\n sum(abs2, y)\n end;\n\njulia> gradient(loss, flat) # flat gradient, same numbers\n(Float32[10.339018, 11.379145, 22.845667, -29.565302, -37.644184],)\n\njulia> Zygote.hessian(loss, flat) # second derivative\n5×5 Matrix{Float32}:\n -7.13131 -5.54714 -11.1393 -12.6504 -8.13492\n -5.54714 -7.11092 -11.0208 -13.9231 -9.36316\n -11.1393 -11.0208 -13.7126 -27.9531 -22.741\n -12.6504 -13.9231 -27.9531 18.0875 23.03\n -8.13492 -9.36316 -22.741 23.03 32.0\n\njulia> Flux.destructure(grad) # acts on non-models, too\n(Float32[10.339018, 11.379145, 22.845667, -29.565302, -37.644184], Restructure(Tuple, ..., 5))","category":"page"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. Nested","text":"In order to collect all parameters of a model into a list instead, you can use the trainables function:","category":"page"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. Nested","text":"julia> Flux.trainables(model)\n5-element Vector{AbstractArray}:\n [0.863101 1.2454957]\n [0.0]\n [1.290355429422727;;]\n [0.0]","category":"page"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. Nested","text":"Any mutation of the elements of the resulting list will affect the model's parameters.","category":"page"},{"location":"reference/destructure/#All-Parameters","page":"Flat vs. Nested","title":"All Parameters","text":"","category":"section"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. 
Nested","text":"The functions destructure and trainables live in Optimisers.jl.","category":"page"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. Nested","text":"Optimisers.destructure\nOptimisers.trainable\nOptimisers.trainables\nOptimisers.isnumeric","category":"page"},{"location":"reference/destructure/#Optimisers.destructure","page":"Flat vs. Nested","title":"Optimisers.destructure","text":"destructure(model) -> vector, reconstructor\n\nCopies all trainable, isnumeric parameters in the model to a vector, and returns also a function which reverses this transformation. Differentiable.\n\nExample\n\njulia> v, re = destructure((x=[1.0, 2.0], y=(sin, [3.0 + 4.0im])))\n(ComplexF64[1.0 + 0.0im, 2.0 + 0.0im, 3.0 + 4.0im], Restructure(NamedTuple, ..., 3))\n\njulia> re([3, 5, 7+11im])\n(x = [3.0, 5.0], y = (sin, ComplexF64[7.0 + 11.0im]))\n\nIf model contains various number types, they are promoted to make vector, and are usually restored by Restructure. Such restoration follows the rules of ChainRulesCore.ProjectTo, and thus will restore floating point precision, but will permit more exotic numbers like ForwardDiff.Dual.\n\nIf model contains only GPU arrays, then vector will also live on the GPU. At present, a mixture of GPU and ordinary CPU arrays is undefined behaviour.\n\n\n\n\n\n","category":"function"},{"location":"reference/destructure/#Optimisers.trainable","page":"Flat vs. Nested","title":"Optimisers.trainable","text":"trainable(x::Layer) -> NamedTuple\n\nThis may be overloaded to make optimisers ignore some fields of every Layer, which would otherwise contain trainable parameters.\n\nwarning: Warning\nThis is very rarely required. Fields of struct Layer which contain functions, or integers like sizes, are always ignored anyway. 
Overloading trainable is only necessary when some arrays of numbers are to be optimised, and some arrays of numbers are not.\n\nThe default is Functors.children(x), usually a NamedTuple of all fields, and trainable(x) must contain a subset of these.\n\n\n\n\n\n","category":"function"},{"location":"reference/destructure/#Optimisers.trainables","page":"Flat vs. Nested","title":"Optimisers.trainables","text":"trainables(x, path = false)\n\nReturn an iterable over all the trainable parameters in x, that is all the numerical arrays (see isnumeric) which are reachable through trainable.\n\nParameters appearing multiple times in the model (tied weights) will be present only once in the output.\n\nIf path = false, the output is a list of numerical arrays.\n\nIf path = true, the output is a list of (KeyPath, AbstractArray) pairs, where KeyPath is a type representing the path to the array in the original structure.\n\nSee also destructure for a similar operation that returns a single flat vector instead.\n\nExamples\n\njulia> struct MyLayer\n w\n b\n end\n\njulia> Functors.@functor MyLayer\n\njulia> Optimisers.trainable(x::MyLayer) = (; w = x.w,) # only w is trainable in this example\n\njulia> x = MyLayer([1.0,2.0,3.0], [4.0,5.0,6.0]);\n\njulia> trainables(x)\n1-element Vector{AbstractArray}:\n [1.0, 2.0, 3.0]\n\n julia> x = MyLayer((a=[1.0,2.0], b=[3.0]), [4.0,5.0,6.0]);\n\n julia> trainables(x) # collects nested parameters\n 2-element Vector{AbstractArray}:\n [1.0, 2.0]\n [3.0]\n\njulia> x = (a = [1.0,2.0], b = (Dict(\"c\" => [3.0, 4.0], \"d\" => 5.0), [6.0,7.0]));\n\njulia> for (kp, y) in trainables(x, path = true)\n println(kp, \" => \", y)\n end\nKeyPath(:a,) => [1.0, 2.0]\nKeyPath(:b, 1, \"c\") => [3.0, 4.0]\nKeyPath(:b, 2) => [6.0, 7.0]\n\njulia> getkeypath(x, KeyPath(:b, 1, \"c\"))\n2-element Vector{Float64}:\n 3.0\n 4.0\n\n\n\n\n\n","category":"function"},{"location":"reference/destructure/#Optimisers.isnumeric","page":"Flat vs. 
Nested","title":"Optimisers.isnumeric","text":"isnumeric(x) -> Bool\n\nReturns true on any parameter to be adjusted by Optimisers.jl, namely arrays of non-integer numbers. Returns false on all other types.\n\nRequires also that Functors.isleaf(x) == true, to focus on e.g. the parent of a transposed matrix, not the wrapper.\n\n\n\n\n\n","category":"function"},{"location":"reference/destructure/#All-Layers","page":"Flat vs. Nested","title":"All Layers","text":"","category":"section"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. Nested","text":"Another kind of flat view of a nested model is provided by the modules command. This extracts a list of all layers:","category":"page"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. Nested","text":"Flux.modules","category":"page"},{"location":"reference/destructure/#Flux.modules","page":"Flat vs. Nested","title":"Flux.modules","text":"modules(m)\n\nReturn an iterator over non-leaf objects that can be reached by recursing m over the children given by functor.\n\nUseful for applying a function (e.g. a regularizer) over specific modules or subsets of the parameters (e.g. 
the weights but not the biases).\n\nExamples\n\njulia> m1 = Chain(Dense(28^2, 64), BatchNorm(64, relu));\n\njulia> m2 = Chain(m1, Dense(64, 10))\nChain(\n Chain(\n Dense(784 => 64), # 50_240 parameters\n BatchNorm(64, relu), # 128 parameters, plus 128\n ),\n Dense(64 => 10), # 650 parameters\n) # Total: 6 trainable arrays, 51_018 parameters,\n # plus 2 non-trainable, 128 parameters, summarysize 200.312 KiB.\n\njulia> Flux.modules(m2)\n7-element Vector{Any}:\n Chain(Chain(Dense(784 => 64), BatchNorm(64, relu)), Dense(64 => 10)) # 51_018 parameters, plus 128 non-trainable\n (Chain(Dense(784 => 64), BatchNorm(64, relu)), Dense(64 => 10))\n Chain(Dense(784 => 64), BatchNorm(64, relu)) # 50_368 parameters, plus 128 non-trainable\n (Dense(784 => 64), BatchNorm(64, relu))\n Dense(784 => 64) # 50_240 parameters\n BatchNorm(64, relu) # 128 parameters, plus 128 non-trainable\n Dense(64 => 10) # 650 parameters\n\njulia> L2(m) = sum(sum(abs2, l.weight) for l in Flux.modules(m) if l isa Dense)\nL2 (generic function with 1 method)\n\njulia> L2(m2) isa Float32\ntrue\n\n\n\n\n\n","category":"function"},{"location":"reference/destructure/#Save-and-Load","page":"Flat vs. Nested","title":"Save and Load","text":"","category":"section"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. Nested","text":"Flux.state\nFlux.loadmodel!","category":"page"},{"location":"reference/destructure/#Flux.state","page":"Flat vs. Nested","title":"Flux.state","text":"state(x)\n\nReturn an object with the same nested structure as x according to Functors.children, but made only of basic containers (e.g. named tuples, tuples, arrays, and dictionaries).\n\nBesides trainable and non-trainable arrays, the state will contain leaf nodes that are not arrays, such as numbers, symbols, strings, and nothing values. 
The leaf types that end up in the state could increase in the future.\n\nThis method is particularly useful for saving and loading models, since the state contains only simple data types that can be easily serialized.\n\nThe state can be passed to loadmodel! to restore the model.\n\nExamples\n\nCopy the state into another model\n\njulia> m1 = Chain(Dense(1, 2, tanh; init=ones), Dense(2, 1; init=ones));\n\njulia> s = Flux.state(m1)\n(layers = ((weight = [1.0; 1.0;;], bias = [0.0, 0.0], σ = ()), (weight = [1.0 1.0], bias = [0.0], σ = ())),)\n\njulia> m2 = Chain(Dense(1, 2, tanh), Dense(2, 1; bias=false)); # weights are random numbers\n\njulia> Flux.loadmodel!(m2, s);\n\njulia> m2[1].weight # now the weights of m2 are the same as m1\n2×1 Matrix{Float32}:\n 1.0\n 1.0\n\njulia> Flux.state(trainmode!(Dropout(0.2))) # contains p & activity, but not RNG state\n(p = 0.2, dims = (), active = true, rng = ())\n\njulia> Flux.state(BatchNorm(1)) # contains non-trainable arrays μ, σ²\n(λ = (), β = Float32[0.0], γ = Float32[1.0], μ = Float32[0.0], σ² = Float32[1.0], ϵ = 1.0f-5, momentum = 0.1f0, affine = true, track_stats = true, active = nothing, chs = 1)\n\nSave and load with BSON\n\njulia> using BSON\n\njulia> BSON.@save \"checkpoint.bson\" model_state = s\n\njulia> Flux.loadmodel!(m2, BSON.load(\"checkpoint.bson\")[:model_state])\n\nSave and load with JLD2\n\njulia> using JLD2\n\njulia> JLD2.jldsave(\"checkpoint.jld2\", model_state = s)\n\njulia> Flux.loadmodel!(m2, JLD2.load(\"checkpoint.jld2\", \"model_state\"))\n\n\n\n\n\n","category":"function"},{"location":"reference/destructure/#Flux.loadmodel!","page":"Flat vs. Nested","title":"Flux.loadmodel!","text":"loadmodel!(dst, src)\n\nCopy all the parameters (trainable and non-trainable) from src into dst.\n\nRecursively walks dst and src together using Functors.children, calling copyto! on parameter arrays or throwing an error when there is a mismatch. 
Non-array elements (such as activation functions) are not copied and need not match. Zero bias vectors and bias=false are considered equivalent (see extended help for more details).\n\nSee also Flux.state.\n\nExamples\n\njulia> dst = Chain(Dense(Flux.ones32(2, 5), Flux.ones32(2), tanh), Dense(2 => 1; bias = [1f0]))\nChain(\n Dense(5 => 2, tanh), # 12 parameters\n Dense(2 => 1), # 3 parameters\n) # Total: 4 arrays, 15 parameters, 316 bytes.\n\njulia> dst[1].weight ≈ ones(2, 5) # by construction\ntrue\n\njulia> src = Chain(Dense(5 => 2, relu), Dense(2 => 1, bias=false));\n\njulia> Flux.loadmodel!(dst, src);\n\njulia> dst[1].weight ≈ ones(2, 5) # values changed\nfalse\n\njulia> iszero(dst[2].bias)\ntrue\n\nExtended help\n\nThrows an error when:\n\ndst and src do not share the same fields (at any level)\nthe sizes of leaf nodes are mismatched between dst and src\ncopying non-array values to/from an array parameter (except inactive parameters described below)\ndst is a \"tied\" parameter (i.e. refers to another parameter) and loaded into multiple times with mismatched source values\n\nInactive parameters can be encoded by using the boolean value false instead of an array. If dst == false and src is an all-zero array, no error will be raised (and no values copied); however, attempting to copy a non-zero array to an inactive parameter will throw an error. Likewise, copying a src value of false to any dst array is valid, but copying a src value of true will error.\n\n\n\n\n\n","category":"function"},{"location":"reference/destructure/#KeyPath","page":"Flat vs. Nested","title":"KeyPath","text":"","category":"section"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. Nested","text":"Functors.KeyPath\nFunctors.getkeypath\nFunctors.haskeypath","category":"page"},{"location":"reference/destructure/#Functors.KeyPath","page":"Flat vs. 
Nested","title":"Functors.KeyPath","text":"KeyPath(keys...)\n\nA type for representing a path of keys to a value in a nested structure. Can be constructed with a sequence of keys, or by concatenating other KeyPaths. Keys can be of type Symbol, String, or Int.\n\nFor custom types, access through symbol keys is assumed to be done with getproperty. For consistency, the method Base.propertynames is used to get the viable property names.\n\nFor string and integer keys instead, the access is done with getindex.\n\nSee also getkeypath, haskeypath.\n\nExamples\n\njulia> kp = KeyPath(:b, 3)\nKeyPath(:b, 3)\n\njulia> KeyPath(:a, kp, :c, 4) # construct mixing keys and keypaths\nKeyPath(:a, :b, 3, :c, 4)\n\njulia> struct T\n a\n b\n end\n\njulia> function Base.getproperty(x::T, k::Symbol)\n if k in fieldnames(T)\n return getfield(x, k)\n elseif k === :ab\n return \"ab\"\n else \n error()\n end\n end;\n\njulia> Base.propertynames(::T) = (:a, :b, :ab);\n\njulia> x = T(3, Dict(:c => 4, :d => 5));\n\njulia> getkeypath(x, KeyPath(:ab)) # equivalent to x.ab\n\"ab\"\n\njulia> getkeypath(x, KeyPath(:b, :c)) # equivalent to (x.b)[:c]\n4\n\n\n\n\n\n","category":"type"},{"location":"reference/destructure/#Functors.getkeypath","page":"Flat vs. Nested","title":"Functors.getkeypath","text":"getkeypath(x, kp::KeyPath)\n\nReturn the value in x at the path kp.\n\nSee also KeyPath, haskeypath, and setkeypath!.\n\nExamples\n\njulia> x = Dict(:a => 3, :b => Dict(:c => 4, \"d\" => [5, 6, 7]))\nDict{Symbol, Any} with 2 entries:\n :a => 3\n :b => Dict{Any, Any}(:c=>4, \"d\"=>[5, 6, 7])\n\njulia> getkeypath(x, KeyPath(:b, \"d\", 2))\n6\n\n\n\n\n\n","category":"function"},{"location":"reference/destructure/#Functors.haskeypath","page":"Flat vs. 
Nested","title":"Functors.haskeypath","text":"haskeypath(x, kp::KeyPath)\n\nReturn true if x has a value at the path kp.\n\nSee also KeyPath, getkeypath, and setkeypath!.\n\nExamples\n\njulia> x = Dict(:a => 3, :b => Dict(:c => 4, \"d\" => [5, 6, 7]))\nDict{Symbol, Any} with 2 entries:\n :a => 3\n :b => Dict{Any, Any}(:c=>4, \"d\"=>[5, 6, 7])\n\njulia> haskeypath(x, KeyPath(:a))\ntrue\n\njulia> haskeypath(x, KeyPath(:b, \"d\", 1))\ntrue\n\njulia> haskeypath(x, KeyPath(:b, \"d\", 4))\nfalse\n\n\n\n\n\n","category":"function"},{"location":"reference/data/onehot/#One-Hot-Encoding-with-OneHotArrays.jl","page":"OneHotArrays.jl","title":"One-Hot Encoding with OneHotArrays.jl","text":"","category":"section"},{"location":"reference/data/onehot/","page":"OneHotArrays.jl","title":"OneHotArrays.jl","text":"It's common to encode categorical variables (like true, false or cat, dog) in \"one-of-k\" or \"one-hot\" form. OneHotArrays.jl provides the onehot function to make this easy.","category":"page"},{"location":"reference/data/onehot/","page":"OneHotArrays.jl","title":"OneHotArrays.jl","text":"julia> using OneHotArrays\n\njulia> onehot(:b, [:a, :b, :c])\n3-element OneHotVector(::UInt32) with eltype Bool:\n ⋅\n 1\n ⋅\n\njulia> onehot(:c, [:a, :b, :c])\n3-element OneHotVector(::UInt32) with eltype Bool:\n ⋅\n ⋅\n 1","category":"page"},{"location":"reference/data/onehot/","page":"OneHotArrays.jl","title":"OneHotArrays.jl","text":"There is also a onecold function, which is an inverse of onehot. 
It can also be given an array of numbers instead of booleans, in which case it performs an argmax-like operation, returning the label with the highest corresponding weight.","category":"page"},{"location":"reference/data/onehot/","page":"OneHotArrays.jl","title":"OneHotArrays.jl","text":"julia> onecold(ans, [:a, :b, :c])\n:c\n\njulia> onecold([true, false, false], [:a, :b, :c])\n:a\n\njulia> onecold([0.3, 0.2, 0.5], [:a, :b, :c])\n:c","category":"page"},{"location":"reference/data/onehot/","page":"OneHotArrays.jl","title":"OneHotArrays.jl","text":"For multiple samples at once, onehotbatch creates a batch (matrix) of one-hot vectors, and onecold treats matrices as batches.","category":"page"},{"location":"reference/data/onehot/","page":"OneHotArrays.jl","title":"OneHotArrays.jl","text":"julia> using OneHotArrays\n\njulia> onehotbatch([:b, :a, :b], [:a, :b, :c])\n3×3 OneHotMatrix(::Vector{UInt32}) with eltype Bool:\n ⋅ 1 ⋅\n 1 ⋅ 1\n ⋅ ⋅ ⋅\n\njulia> onecold(ans, [:a, :b, :c])\n3-element Vector{Symbol}:\n :b\n :a\n :b","category":"page"},{"location":"reference/data/onehot/","page":"OneHotArrays.jl","title":"OneHotArrays.jl","text":"Note that these operations returned OneHotVector and OneHotMatrix rather than Arrays. OneHotVectors behave like normal vectors but avoid any unnecessary cost compared to using an integer index directly. 
For example, multiplying a matrix with a one-hot vector simply slices out the relevant row of the matrix under the hood.","category":"page"},{"location":"reference/data/onehot/#Function-listing","page":"OneHotArrays.jl","title":"Function listing","text":"","category":"section"},{"location":"reference/data/onehot/","page":"OneHotArrays.jl","title":"OneHotArrays.jl","text":"OneHotArrays.onehot\nOneHotArrays.onecold\nOneHotArrays.onehotbatch\nOneHotArrays.OneHotArray\nOneHotArrays.OneHotVector\nOneHotArrays.OneHotMatrix","category":"page"},{"location":"reference/data/onehot/#OneHotArrays.onehot","page":"OneHotArrays.jl","title":"OneHotArrays.onehot","text":"onehot(x, labels, [default])\n\nReturns a OneHotVector which is roughly a sparse representation of x .== labels.\n\nInstead of storing say Vector{Bool}, it stores the index of the first occurrence of x in labels. If x is not found in labels, then it either returns onehot(default, labels), or gives an error if no default is given.\n\nSee also onehotbatch to apply this to many xs, and onecold to reverse either of these, as well as to generalise argmax.\n\nExamples\n\njulia> β = onehot(:b, (:a, :b, :c))\n3-element OneHotVector(::UInt32) with eltype Bool:\n ⋅\n 1\n ⋅\n\njulia> αβγ = (onehot(0, 0:2), β, onehot(:z, [:a, :b, :c], :c)) # uses default\n(Bool[1, 0, 0], Bool[0, 1, 0], Bool[0, 0, 1])\n\njulia> hcat(αβγ...) 
# preserves sparsity\n3×3 OneHotMatrix(::Vector{UInt32}) with eltype Bool:\n 1 ⋅ ⋅\n ⋅ 1 ⋅\n ⋅ ⋅ 1\n\n\n\n\n\n","category":"function"},{"location":"reference/data/onehot/#OneHotArrays.onecold","page":"OneHotArrays.jl","title":"OneHotArrays.onecold","text":"onecold(y::AbstractArray, labels = 1:size(y,1))\n\nRoughly the inverse operation of onehot or onehotbatch: This finds the index of the largest element of y, or each column of y, and looks them up in labels.\n\nIf labels are not specified, the default is integers 1:size(y,1) – the same operation as argmax(y, dims=1) but sometimes a different return type.\n\nExamples\n\njulia> onecold([false, true, false])\n2\n\njulia> onecold([0.3, 0.2, 0.5], (:a, :b, :c))\n:c\n\njulia> onecold([ 1 0 0 1 0 1 0 1 0 0 1\n 0 1 0 0 0 0 0 0 1 0 0\n 0 0 0 0 1 0 0 0 0 0 0\n 0 0 0 0 0 0 1 0 0 0 0\n 0 0 1 0 0 0 0 0 0 1 0 ], 'a':'e') |> String\n\"abeacadabea\"\n\n\n\n\n\n","category":"function"},{"location":"reference/data/onehot/#OneHotArrays.onehotbatch","page":"OneHotArrays.jl","title":"OneHotArrays.onehotbatch","text":"onehotbatch(xs, labels, [default])\n\nReturns a OneHotMatrix where kth column of the matrix is onehot(xs[k], labels). This is a sparse matrix, which stores just a Vector{UInt32} containing the indices of the nonzero elements.\n\nIf one of the inputs in xs is not found in labels, that column is onehot(default, labels) if default is given, else an error.\n\nIf xs has more dimensions, N = ndims(xs) > 1, then the result is an AbstractArray{Bool, N+1} which is one-hot along the first dimension, i.e. result[:, k...] == onehot(xs[k...], labels).\n\nNote that xs can be any iterable, such as a string. 
Using a tuple for labels will often speed up construction, certainly for fewer than 32 classes.\n\nExamples\n\njulia> oh = onehotbatch(\"abracadabra\", 'a':'e', 'e')\n5×11 OneHotMatrix(::Vector{UInt32}) with eltype Bool:\n 1 ⋅ ⋅ 1 ⋅ 1 ⋅ 1 ⋅ ⋅ 1\n ⋅ 1 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1 ⋅ ⋅\n ⋅ ⋅ ⋅ ⋅ 1 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅\n ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1 ⋅ ⋅ ⋅ ⋅\n ⋅ ⋅ 1 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1 ⋅\n\njulia> reshape(1:15, 3, 5) * oh # this matrix multiplication is done efficiently\n3×11 Matrix{Int64}:\n 1 4 13 1 7 1 10 1 4 13 1\n 2 5 14 2 8 2 11 2 5 14 2\n 3 6 15 3 9 3 12 3 6 15 3\n\n\n\n\n\n","category":"function"},{"location":"reference/data/onehot/#OneHotArrays.OneHotArray","page":"OneHotArrays.jl","title":"OneHotArrays.OneHotArray","text":"OneHotArray{T, N, M, I} <: AbstractArray{Bool, M}\nOneHotArray(indices, L)\n\nA one-hot M-dimensional array with L labels (i.e. size(A, 1) == L and sum(A, dims=1) == 1) stored as a compact N == M-1-dimensional array of indices.\n\nTypically constructed by onehot and onehotbatch. Parameter I is the type of the underlying storage, and T its eltype.\n\n\n\n\n\n","category":"type"},{"location":"reference/data/onehot/#OneHotArrays.OneHotVector","page":"OneHotArrays.jl","title":"OneHotArrays.OneHotVector","text":"OneHotVector{T} = OneHotArray{T, 0, 1, T}\nOneHotVector(indices, L)\n\nA one-hot vector with L labels (i.e. length(A) == L and count(A) == 1) typically constructed by onehot. Stored efficiently as a single index of type T, usually UInt32.\n\n\n\n\n\n","category":"type"},{"location":"reference/data/onehot/#OneHotArrays.OneHotMatrix","page":"OneHotArrays.jl","title":"OneHotArrays.OneHotMatrix","text":"OneHotMatrix{T, I} = OneHotArray{T, 1, 2, I}\nOneHotMatrix(indices, L)\n\nA one-hot matrix (with L labels) typically constructed using onehotbatch. 
Stored efficiently as a vector of indices with type I and eltype T.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/zygote/","page":"Gradients – Zygote.jl","title":"Gradients – Zygote.jl","text":"CollapsedDocStrings = true","category":"page"},{"location":"reference/training/zygote/#autodiff-zygote","page":"Gradients – Zygote.jl","title":"Automatic Differentiation using Zygote.jl","text":"","category":"section"},{"location":"reference/training/zygote/","page":"Gradients – Zygote.jl","title":"Gradients – Zygote.jl","text":"Flux re-exports the gradient from Zygote, and uses this function within train! to differentiate the model. Zygote has its own documentation, in particular listing some important limitations.","category":"page"},{"location":"reference/training/zygote/#Explicit-style","page":"Gradients – Zygote.jl","title":"Explicit style","text":"","category":"section"},{"location":"reference/training/zygote/","page":"Gradients – Zygote.jl","title":"Gradients – Zygote.jl","text":"The preferred way of using Zygote, and the only way of using most other AD packages, is to explicitly provide a function and its arguments.","category":"page"},{"location":"reference/training/zygote/","page":"Gradients – Zygote.jl","title":"Gradients – Zygote.jl","text":"Zygote.gradient(f, args...)\nZygote.withgradient(f, args...)\nZygote.jacobian(f, args...)\nZygote.withjacobian(f, args...)\nZygote.hessian\nZygote.hessian_reverse\nZygote.diaghessian\nZygote.pullback","category":"page"},{"location":"reference/training/zygote/#Zygote.gradient-Tuple{Any, Vararg{Any}}","page":"Gradients – Zygote.jl","title":"Zygote.gradient","text":"gradient(f, args...)\n\nReturns a tuple containing ∂f/∂x for each argument x, the derivative (for scalar x) or the gradient. If no gradient is defined, ∂f/∂x will be nothing.\n\nf(args...) 
must be a real number, see jacobian for array output.\n\nSee also withgradient to keep the value f(args...), and pullback for value and back-propagator.\n\njulia> gradient(*, 2.0, 3.0, 5.0)\n(15.0, 10.0, 6.0)\n\njulia> gradient(x -> sum(abs2,x), [7.0, 11.0, 13.0])\n([14.0, 22.0, 26.0],)\n\njulia> gradient([7, 11], 0, 1) do x, y, d\n p = size(x, d)\n sum(x.^p .+ y)\n end\n([14.0, 22.0], 2.0, nothing)\n\n\n\n\n\n","category":"method"},{"location":"reference/training/zygote/#Zygote.withgradient-Tuple{Any, Vararg{Any}}","page":"Gradients – Zygote.jl","title":"Zygote.withgradient","text":"withgradient(f, args...)\nwithgradient(f, ::Params)\n\nReturns both the value of the function and the gradient, as a named tuple.\n\njulia> y, ∇ = withgradient(/, 1, 2)\n(val = 0.5, grad = (0.5, -0.25))\n\njulia> ∇ == gradient(/, 1, 2)\ntrue\n\nAllows you to capture auxiliary outputs, in addition to the scalar used by gradient. To do this, f must return a Tuple or NamedTuple. Then it calculates grad = gradient(first∘f, args...) but returns the whole val = f(args...):\n\njulia> withgradient([1,2,4]) do x\n z = 1 ./ x\n sum(z), z # here z is an auxiliary output\n end\n(val = (1.75, [1.0, 0.5, 0.25]), grad = ([-1.0, -0.25, -0.0625],))\n\njulia> withgradient(3.0, 4.0) do x, y\n (div = x/y, mul = x*y)\n end\n(val = (div = 0.75, mul = 12.0), grad = (0.25, -0.1875))\n\nAlso supports implicit mode:\n\njulia> w = [3.0];\n\njulia> res = withgradient(() -> sum(abs2, w), Params([w]))\n(val = 9.0, grad = Grads(...))\n\njulia> res.grad[w]\n1-element Vector{Float64}:\n 6.0\n\n\n\n\n\n","category":"method"},{"location":"reference/training/zygote/#Zygote.jacobian-Tuple{Any, Vararg{Any}}","page":"Gradients – Zygote.jl","title":"Zygote.jacobian","text":"jacobian(f, args...) -> Tuple\n\nFor each array a ∈ args this returns a matrix with Ja[k,i] = ∂y[k]/∂a[i] where y = f(args...) is usually a vector. 
Arrays of higher dimension are treated like vec(a), or vec(y) for output.\n\nFor scalar x::Number ∈ args, the result is a vector Jx[k] = ∂y[k]/∂x, while for scalar y all results have just one row.\n\nWith any other argument type, no result is produced, even if gradient would work.\n\nThis reverse-mode Jacobian needs to evaluate the pullback once for each element of y. Doing so is usually only efficient when length(y) is small compared to length(a); otherwise forward mode is likely to be better.\n\nSee also withjacobian, hessian, hessian_reverse.\n\nExamples\n\njulia> jacobian(a -> 100*a[1:3].^2, 1:7)[1] # first index (rows) is output\n3×7 Matrix{Int64}:\n 200 0 0 0 0 0 0\n 0 400 0 0 0 0 0\n 0 0 600 0 0 0 0\n\njulia> jacobian((a,x) -> a.^2 .* x, [1,2,3], 1) # scalar argument has vector jacobian\n([2 0 0; 0 4 0; 0 0 6], [1, 4, 9])\n\njulia> jacobian((a,d) -> prod(a, dims=d), [1 2; 3 4; 5 6], 2)\n([2 0 … 0 0; 0 4 … 3 0; 0 0 … 0 5], [0, 0, 0])\n\nwarning: Warning\nFor arguments of any type except Number & AbstractArray, the result is nothing.\n\njulia> jacobian((a,s) -> a.^length(s), [1,2,3], \"str\")\n([3 0 0; 0 12 0; 0 0 27], nothing)\n\njulia> jacobian((a,t) -> sum(a .* t[1]) + t[2], [1,2,3], (4,5))\n([4 4 4], nothing)\n\njulia> gradient((a,t) -> sum(a .* t[1]) + t[2], [1,2,3], (4,5)) # gradient understands the tuple\n([4 4 4], (6, 1))\n\n\n\n\n\n","category":"method"},{"location":"reference/training/zygote/#Zygote.withjacobian-Tuple{Any, Vararg{Any}}","page":"Gradients – Zygote.jl","title":"Zygote.withjacobian","text":"withjacobian(f, args...)\n\nReturns both the value f(args...) and the jacobian as a named tuple.\n\njulia> withjacobian(cumsum, [1,2,3])\n(val = [1, 3, 6], grad = ([1 0 0; 1 1 0; 1 1 1],))\n\n\n\n\n\n","category":"method"},{"location":"reference/training/zygote/#Zygote.hessian","page":"Gradients – Zygote.jl","title":"Zygote.hessian","text":"hessian(f, x)\n\nConstruct the Hessian ∂²f/∂x², where x is a real number or an array, and f(x) is a real number. 
When x is an array, the result is a matrix H[i,j] = ∂²f/∂x[i]∂x[j], using linear indexing x[i] even if the argument is higher-dimensional.\n\nThis uses forward over reverse, ForwardDiff over Zygote, calling hessian_dual(f, x). See hessian_reverse for an all-Zygote alternative.\n\nSee also diaghessian to compute only the diagonal part.\n\nExamples\n\njulia> hessian(x -> x[1]*x[2], randn(2))\n2×2 Matrix{Float64}:\n 0.0 1.0\n 1.0 0.0\n\njulia> hessian(x -> sum(x.^3), [1 2; 3 4]) # uses linear indexing of x\n4×4 Matrix{Int64}:\n 6 0 0 0\n 0 18 0 0\n 0 0 12 0\n 0 0 0 24\n\njulia> hessian(sin, pi/2)\n-1.0\n\n\n\n\n\n","category":"function"},{"location":"reference/training/zygote/#Zygote.hessian_reverse","page":"Gradients – Zygote.jl","title":"Zygote.hessian_reverse","text":"hessian_reverse(f, x)\n\nThis should be equivalent to hessian(f, x), but implemented using reverse over reverse mode, all Zygote. (This is usually much slower, and more likely to find errors.)\n\n\n\n\n\n","category":"function"},{"location":"reference/training/zygote/#Zygote.diaghessian","page":"Gradients – Zygote.jl","title":"Zygote.diaghessian","text":"diaghessian(f, args...) -> Tuple\n\nDiagonal part of the Hessian. Returns a tuple containing, for each argument x, h of the same shape with h[i] = Hᵢᵢ = ∂²y/∂x[i]∂x[i]. The original evaluation y = f(args...) must give a real number y.\n\nFor one vector argument x, this is equivalent to (diag(hessian(f,x)),). Like hessian it uses ForwardDiff over Zygote. 
\n\nwarning: Warning\nFor arguments of any type except Number & AbstractArray, the result is nothing.\n\nExamples\n\njulia> diaghessian(x -> sum(x.^3), [1 2; 3 4])[1]\n2×2 Matrix{Int64}:\n 6 12\n 18 24\n\njulia> Diagonal(vec(ans)) == hessian(x -> sum(x.^3), [1 2; 3 4]) # full Hessian is diagonal\ntrue\n\njulia> diaghessian((x,y) -> sum(x .* y .* y'), [1 22; 333 4], [0.5, 0.666]) # two array arguments\n([0.0 0.0; 0.0 0.0], [2.0, 8.0])\n\njulia> diaghessian(atan, 1, 2) # two scalar arguments\n(-0.16, 0.16)\n\njulia> hessian(xy -> atan(xy[1], xy[2]), [1, 2]) # full Hessian is not diagonal\n2×2 Matrix{Float64}:\n -0.16 -0.12\n -0.12 0.16\n\n\n\n\n\n","category":"function"},{"location":"reference/training/zygote/#ZygoteRules.pullback","page":"Gradients – Zygote.jl","title":"ZygoteRules.pullback","text":"pullback(f, args...)\npullback(f, ::Params)\n\nReturns the value of the function f and a back-propagator function, which can be called to obtain a tuple containing ∂f/∂x for each argument x, the derivative (for scalar x) or gradient.\n\ny, back = pullback(f, args...)\n∇ = back(seed)\n\nback must be called with a start value seed matching the output of f(args...). If f(args...) returns a number, seed should be a number. If f(args...) 
returns an array, seed should be an equally-sized array.\n\nSee also withgradient to obtain the value and gradients in one call, and gradient for obtaining just the gradients.\n\njulia> y, back = pullback(*, 2.0, 3.0, 5.0);\n\njulia> y\n30.0\n\njulia> back(1.0)\n(15.0, 10.0, 6.0)\n\njulia> back(2.0)\n(30.0, 20.0, 12.0)\n\njulia> y, back = pullback(x -> [x, x], 1.0);\n\njulia> y\n2-element Vector{Float64}:\n 1.0\n 1.0\n\njulia> back([1.0, 1.0])\n(2.0,)\n\njulia> back([2.0, nothing])\n(2.0,)\n\n\n\n\n\n","category":"function"},{"location":"reference/training/zygote/#ChainRules","page":"Gradients – Zygote.jl","title":"ChainRules","text":"","category":"section"},{"location":"reference/training/zygote/","page":"Gradients – Zygote.jl","title":"Gradients – Zygote.jl","text":"Sometimes it is necessary to exclude some code, or a whole function, from automatic differentiation. This can be done using ChainRules:","category":"page"},{"location":"reference/training/zygote/","page":"Gradients – Zygote.jl","title":"Gradients – Zygote.jl","text":"ChainRulesCore.ignore_derivatives\nChainRulesCore.@non_differentiable","category":"page"},{"location":"reference/training/zygote/#ChainRulesCore.ignore_derivatives","page":"Gradients – Zygote.jl","title":"ChainRulesCore.ignore_derivatives","text":"ignore_derivatives(f::Function)\n\nTells the AD system to ignore the gradients of the wrapped closure. The primal computation (forward pass) is executed normally.\n\nignore_derivatives() do\n value = rand()\n push!(collection, value)\nend\n\nUsing this incorrectly could lead to incorrect gradients. For example, the following function will have zero gradients with respect to its argument:\n\nfunction wrong_grads(x)\n y = ones(3)\n ignore_derivatives() do\n push!(y, x)\n end\n return sum(y)\nend\n\n\n\n\n\nignore_derivatives(x)\n\nTells the AD system to ignore the gradients of the argument. 
Can be used to avoid unnecessary computation of gradients.\n\nignore_derivatives(x) * w\n\n\n\n\n\n","category":"function"},{"location":"reference/training/zygote/#ChainRulesCore.@non_differentiable","page":"Gradients – Zygote.jl","title":"ChainRulesCore.@non_differentiable","text":"@non_differentiable(signature_expression)\n\nA helper to make it easier to declare that a method is not differentiable. This is a short-hand for defining an frule and rrule that return NoTangent() for all partials (even for the function s̄elf-partial itself)\n\nKeyword arguments should not be included.\n\njulia> @non_differentiable Base.:(==)(a, b)\n\njulia> _, pullback = rrule(==, 2.0, 3.0);\n\njulia> pullback(1.0)\n(NoTangent(), NoTangent(), NoTangent())\n\nYou can place type-constraints in the signature:\n\njulia> @non_differentiable Base.length(xs::Union{Number, Array})\n\njulia> frule((ZeroTangent(), 1), length, [2.0, 3.0])\n(2, NoTangent())\n\nwarning: Warning\nThis helper macro covers only the simple common cases. It does not support where-clauses. For these you can declare the rrule and frule directly\n\n\n\n\n\n","category":"macro"},{"location":"reference/training/zygote/","page":"Gradients – Zygote.jl","title":"Gradients – Zygote.jl","text":"To manually supply the gradient for one function, you should define a method of rrule. ChainRules has detailed documentation on how this works.","category":"page"},{"location":"reference/training/zygote/","page":"Gradients – Zygote.jl","title":"Gradients – Zygote.jl","text":"ChainRulesCore.rrule\nChainRulesCore.frule\nChainRulesCore.@scalar_rule\nChainRulesCore.NoTangent\nChainRulesCore.ZeroTangent\nChainRulesCore.RuleConfig\nChainRulesCore.Tangent\nChainRulesCore.canonicalize","category":"page"},{"location":"reference/training/zygote/#ChainRulesCore.rrule","page":"Gradients – Zygote.jl","title":"ChainRulesCore.rrule","text":"rrule([::RuleConfig,] f, x...)\n\nExpressing x as the tuple (x₁, x₂, ...) and the output tuple of f(x...) 
as Ω, return the tuple:\n\n(Ω, (Ω̄₁, Ω̄₂, ...) -> (s̄elf, x̄₁, x̄₂, ...))\n\nWhere the second return value is the propagation rule or pullback. It takes in cotangents corresponding to the outputs (x̄₁, x̄₂, ...), and s̄elf, the internal values of the function itself (for closures)\n\nIf no method matching rrule(f, xs...) has been defined, then return nothing.\n\nExamples:\n\nunary input, unary output scalar function:\n\njulia> x = rand();\n\njulia> sinx, sin_pullback = rrule(sin, x);\n\njulia> sinx == sin(x)\ntrue\n\njulia> sin_pullback(1) == (NoTangent(), cos(x))\ntrue\n\nbinary input, unary output scalar function:\n\njulia> x, y = rand(2);\n\njulia> hypotxy, hypot_pullback = rrule(hypot, x, y);\n\njulia> hypotxy == hypot(x, y)\ntrue\n\njulia> hypot_pullback(1) == (NoTangent(), (x / hypot(x, y)), (y / hypot(x, y)))\ntrue\n\nThe optional RuleConfig option allows specifying rrules only for AD systems that support given features. If not needed, then it can be omitted and the rrule without it will be hit as a fallback. This is the case for most rules.\n\nSee also: frule, @scalar_rule, RuleConfig\n\n\n\n\n\n","category":"function"},{"location":"reference/training/zygote/#ChainRulesCore.frule","page":"Gradients – Zygote.jl","title":"ChainRulesCore.frule","text":"frule([::RuleConfig,] (Δf, Δx...), f, x...)\n\nExpressing the output of f(x...) as Ω, return the tuple:\n\n(Ω, ΔΩ)\n\nThe second return value is the tangent w.r.t. the output.\n\nIf no method matching frule((Δf, Δx...), f, x...) 
has been defined, then return nothing.\n\nExamples:\n\nunary input, unary output scalar function:\n\njulia> dself = NoTangent();\n\njulia> x = rand()\n0.8236475079774124\n\njulia> sinx, Δsinx = frule((dself, 1), sin, x)\n(0.7336293678134624, 0.6795498147167869)\n\njulia> sinx == sin(x)\ntrue\n\njulia> Δsinx == cos(x)\ntrue\n\nUnary input, binary output scalar function:\n\njulia> sincosx, Δsincosx = frule((dself, 1), sincos, x);\n\njulia> sincosx == sincos(x)\ntrue\n\njulia> Δsincosx[1] == cos(x)\ntrue\n\njulia> Δsincosx[2] == -sin(x)\ntrue\n\nNote that technically speaking Julia does not have multiple output functions, just functions that return a single output that is iterable, like a Tuple. So this is actually a Tangent:\n\njulia> Δsincosx\nTangent{Tuple{Float64, Float64}}(0.6795498147167869, -0.7336293678134624)\n\nThe optional RuleConfig option allows specifying frules only for AD systems that support given features. If not needed, then it can be omitted and the frule without it will be hit as a fallback. This is the case for most rules.\n\nSee also: rrule, @scalar_rule, RuleConfig\n\n\n\n\n\n","category":"function"},{"location":"reference/training/zygote/#ChainRulesCore.@scalar_rule","page":"Gradients – Zygote.jl","title":"ChainRulesCore.@scalar_rule","text":"@scalar_rule(f(x₁, x₂, ...),\n @setup(statement₁, statement₂, ...),\n (∂f₁_∂x₁, ∂f₁_∂x₂, ...),\n (∂f₂_∂x₁, ∂f₂_∂x₂, ...),\n ...)\n\nA convenience macro that generates simple scalar forward or reverse rules using the provided partial derivatives. 
Specifically, generates the corresponding methods for frule and rrule:\n\nfunction ChainRulesCore.frule((NoTangent(), Δx₁, Δx₂, ...), ::typeof(f), x₁::Number, x₂::Number, ...)\n Ω = f(x₁, x₂, ...)\n $(statement₁, statement₂, ...)\n return Ω, (\n (∂f₁_∂x₁ * Δx₁ + ∂f₁_∂x₂ * Δx₂ + ...),\n (∂f₂_∂x₁ * Δx₁ + ∂f₂_∂x₂ * Δx₂ + ...),\n ...\n )\nend\n\nfunction ChainRulesCore.rrule(::typeof(f), x₁::Number, x₂::Number, ...)\n Ω = f(x₁, x₂, ...)\n $(statement₁, statement₂, ...)\n return Ω, ((ΔΩ₁, ΔΩ₂, ...)) -> (\n NoTangent(),\n ∂f₁_∂x₁ * ΔΩ₁ + ∂f₂_∂x₁ * ΔΩ₂ + ...),\n ∂f₁_∂x₂ * ΔΩ₁ + ∂f₂_∂x₂ * ΔΩ₂ + ...),\n ...\n )\nend\n\nIf no type constraints in f(x₁, x₂, ...) within the call to @scalar_rule are provided, each parameter in the resulting frule/rrule definition is given a type constraint of Number. Constraints may also be explicitly provided to override the Number constraint, e.g. f(x₁::Complex, x₂), which will constrain x₁ to Complex and x₂ to Number.\n\nAt present this does not support defining for closures/functors. Thus in reverse-mode, the first returned partial, representing the derivative with respect to the function itself, is always NoTangent(). And in forward-mode, the first input to the returned propagator is always ignored.\n\nThe result of f(x₁, x₂, ...) is automatically bound to Ω. This allows the primal result to be conveniently referenced (as Ω) within the derivative/setup expressions.\n\nThis macro assumes complex functions are holomorphic. In general, for non-holomorphic functions, the frule and rrule must be defined manually.\n\nIf the derivative is one, (e.g. for identity functions) true can be used as the most general multiplicative identity.\n\nThe @setup argument can be elided if no setup code is needed. 
In other words:\n\n@scalar_rule(f(x₁, x₂, ...),\n (∂f₁_∂x₁, ∂f₁_∂x₂, ...),\n (∂f₂_∂x₁, ∂f₂_∂x₂, ...),\n ...)\n\nis equivalent to:\n\n@scalar_rule(f(x₁, x₂, ...),\n @setup(nothing),\n (∂f₁_∂x₁, ∂f₁_∂x₂, ...),\n (∂f₂_∂x₁, ∂f₂_∂x₂, ...),\n ...)\n\nFor examples, see ChainRules' rulesets directory.\n\nSee also: frule, rrule.\n\n\n\n\n\n","category":"macro"},{"location":"reference/training/zygote/#ChainRulesCore.NoTangent","page":"Gradients – Zygote.jl","title":"ChainRulesCore.NoTangent","text":"NoTangent() <: AbstractZero\n\nThis tangent indicates that the derivative does not exist. It is the tangent type for primal types that are not differentiable, such as integers or booleans (when they are not being used to represent floating-point values). The only valid way to perturb such values is to not change them at all. As a consequence, NoTangent is functionally identical to ZeroTangent(), but it provides additional semantic information.\n\nAdding NoTangent() to a primal is generally wrong: gradient-based methods cannot be used to optimize over discrete variables. An optimization package making use of this might want to check for such a case.\n\nnote: Note\nThis does not indicate that the derivative is not implemented, but rather that mathematically it is not defined.\n\nThis mostly shows up as the derivative with respect to dimension, index, or size arguments.\n\n function rrule(fill, x, len::Int)\n y = fill(x, len)\n fill_pullback(ȳ) = (NoTangent(), @thunk(sum(Ȳ)), NoTangent())\n return y, fill_pullback\n end\n\n\n\n\n\n","category":"type"},{"location":"reference/training/zygote/#ChainRulesCore.ZeroTangent","page":"Gradients – Zygote.jl","title":"ChainRulesCore.ZeroTangent","text":"ZeroTangent() <: AbstractZero\n\nThe additive identity for tangents. This is basically the same as 0. 
A derivative of ZeroTangent() does not propagate through the primal function.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/zygote/#ChainRulesCore.RuleConfig","page":"Gradients – Zygote.jl","title":"ChainRulesCore.RuleConfig","text":"RuleConfig{T}\n\nThe configuration for what rules to use. T: traits. This should be a Union of all special traits needed for rules to be allowed to be defined for your AD. If nothing special this should be set to Union{}.\n\nAD authors should define a subtype of RuleConfig to use when calling frule/rrule.\n\nRule authors can dispatch on this config when defining rules. For example:\n\n# only define rrule for `pop!` on AD systems where mutation is supported.\nrrule(::RuleConfig{>:SupportsMutation}, typeof(pop!), ::Vector) = ...\n\n# this definition of map is for any AD that defines a forwards mode\nrrule(conf::RuleConfig{>:HasForwardsMode}, typeof(map), ::Vector) = ...\n\n# this definition of map is for any AD that only defines a reverse mode.\n# It is not as good as the rrule that can be used if the AD defines a forward-mode as well.\nrrule(conf::RuleConfig{>:Union{NoForwardsMode, HasReverseMode}}, typeof(map), ::Vector) = ...\n\nFor more details see rule configurations and calling back into AD.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/zygote/#ChainRulesCore.Tangent","page":"Gradients – Zygote.jl","title":"ChainRulesCore.Tangent","text":"Tangent{P, T} <: StructuralTangent{P} <: AbstractTangent\n\nThis type represents the tangent for a struct/NamedTuple, or Tuple. P is the corresponding primal type that this is a tangent for.\n\nTangent{P} should have fields (technically properties), that match to a subset of the fields of the primal type; and each should be a tangent type matching to the primal type of that field. Fields of the P that are not present in the Tangent are treated as Zero.\n\nT is an implementation detail representing the backing data structure. 
For Tuple it will be a Tuple, and for everything else it will be a NamedTuple. It should not be passed in by the user.\n\nFor Tangents of Tuples, iterate and getindex are overloaded to behave similarly to a tuple. For Tangents of structs, getproperty is overloaded to allow for accessing values via tangent.fieldname. Any fields not explicitly present in the Tangent are treated as being set to ZeroTangent(). To make a Tangent have all the fields of the primal the canonicalize function is provided.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/zygote/#ChainRulesCore.canonicalize","page":"Gradients – Zygote.jl","title":"ChainRulesCore.canonicalize","text":"canonicalize(tangent::Tangent{P}) -> Tangent{P}\n\nReturn the canonical Tangent for the primal type P. The property names of the returned Tangent match the field names of the primal, and all fields of P not present in the input tangent are explicitly set to ZeroTangent().\n\n\n\n\n\n","category":"function"},{"location":"guide/models/basics/#man-basics","page":"Gradients and Layers","title":"How Flux Works: Gradients and Layers","text":"","category":"section"},{"location":"guide/models/basics/#man-taking-gradients","page":"Gradients and Layers","title":"Taking Gradients","text":"","category":"section"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Flux's core feature is taking gradients of Julia code. The gradient function takes another Julia function f and a set of arguments, and returns the gradient with respect to each argument. 
(It's a good idea to try pasting these examples in the Julia terminal.)","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"julia> using Flux\n\njulia> f(x) = 3x^2 + 2x + 1;\n\njulia> df(x) = gradient(f, x)[1]; # df/dx = 6x + 2\n\njulia> df(2)\n14.0\n\njulia> d2f(x) = gradient(df, x)[1]; # d²f/dx² = 6\n\njulia> d2f(2)\n6.0","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"When a function has many parameters, we can get gradients of each one at the same time:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"julia> f(x, y) = sum((x .- y).^2);\n\njulia> gradient(f, [2, 1], [2, 0])\n([0.0, 2.0], [-0.0, -2.0])","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"These gradients are based on x and y. Flux works by instead taking gradients based on the weights and biases that make up the parameters of a model.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Machine learning often can have hundreds of parameter arrays. Instead of passing them to gradient individually, we can store them together in a structure. The simplest example is a named tuple, created by the following syntax:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"julia> nt = (a = [2, 1], b = [2, 0], c = tanh);\n\njulia> g(x::NamedTuple) = sum(abs2, x.a .- x.b);\n\njulia> g(nt)\n1\n\njulia> dg_nt = gradient(g, nt)[1]\n(a = [0.0, 2.0], b = [-0.0, -2.0], c = nothing)","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Notice that gradient has returned a matching structure. 
The field dg_nt.a is the gradient for nt.a, and so on. Some fields have no gradient, indicated by nothing. ","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Rather than define a function like g every time (and think up a name for it), it is often useful to use anonymous functions: this one is x -> sum(abs2, x.a .- x.b). Anonymous functions can be defined either with -> or with do, and such do blocks are often useful if you have a few steps to perform:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"julia> gradient((x, y) -> sum(abs2, x.a ./ y .- x.b), nt, [1, 2])\n((a = [0.0, 0.5], b = [-0.0, -1.0], c = nothing), [-0.0, -0.25])\n\njulia> gradient(nt, [1, 2]) do x, y\n z = x.a ./ y\n sum(abs2, z .- x.b)\n end\n((a = [0.0, 0.5], b = [-0.0, -1.0], c = nothing), [-0.0, -0.25])","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Sometimes you may want to know the value of the function, as well as its gradient. 
Rather than calling the function a second time, you can call withgradient instead:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"julia> Flux.withgradient(g, nt)\n(val = 1, grad = ((a = [0.0, 2.0], b = [-0.0, -2.0], c = nothing),))","category":"page"},{"location":"guide/models/basics/#Building-Simple-Models","page":"Gradients and Layers","title":"Building Simple Models","text":"","category":"section"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Consider a simple linear regression, which tries to predict an output array y from an input x.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"\npredict(W, b, x) = W*x .+ b\n\nfunction loss(W, b, x, y)\n ŷ = predict(W, b, x)\n sum((y .- ŷ).^2)\nend\n\nx, y = rand(5), rand(2) # Dummy data\nW = rand(2, 5)\nb = rand(2)\n\nloss(W, b, x, y) # ~ 3","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"To improve the prediction we can take the gradients of the loss with respect to W and b and perform gradient descent.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"using Flux\n\ndW, db = gradient((W, b) -> loss(W, b, x, y), W, b)","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Now that we have gradients, we can pull them out and update W to train the model.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"W .-= 0.1 .* dW\n\nloss(W, b, x, y) # ~ 2.5","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"The loss has decreased a little, meaning that our prediction x 
is closer to the target y. If we have some data we can already try training the model.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"All deep learning in Flux, however complex, is a simple generalisation of this example. Of course, models can look very different – they might have millions of parameters or complex control flow. Let's see how Flux handles more complex models.","category":"page"},{"location":"guide/models/basics/#Building-Layers","page":"Gradients and Layers","title":"Building Layers","text":"","category":"section"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"It's common to create more complex models than the linear regression above. For example, we might want to have two linear layers with a nonlinearity like sigmoid in between them. We could write this as:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"using Flux\n\nW1 = rand(3, 5)\nb1 = rand(3)\nlayer1(x) = W1 * x .+ b1\n\nW2 = rand(2, 3)\nb2 = rand(2)\nlayer2(x) = W2 * x .+ b2\n\nmodel(x) = layer2(sigmoid.(layer1(x)))\n\nmodel(rand(5)) # => 2-element vector","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"This works but is fairly unwieldy, with a lot of repetition – especially as we add more layers. 
One way to factor this out is to create a function that returns linear layers.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"function linear(in, out)\n W = randn(out, in)\n b = randn(out)\n x -> W * x .+ b\nend\n\nlinear1 = linear(5, 3) # we can access linear1.W etc\nlinear2 = linear(3, 2)\n\nmodel(x) = linear2(sigmoid.(linear1(x)))\n\nmodel(rand(5)) # => 2-element vector","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Another (equivalent) way is to create a struct that explicitly represents the affine layer.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"struct Affine\n W\n b\nend\n\nAffine(in::Integer, out::Integer) =\n Affine(randn(out, in), zeros(out))\n\n# Overload call, so the object can be used as a function\n(m::Affine)(x) = m.W * x .+ m.b\n\na = Affine(10, 5)\n\na(rand(10)) # => 5-element vector","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Congratulations! You just built the Dense layer that comes with Flux. 
Flux has many interesting layers available, but they're all things you could have built yourself very easily.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"(There is one small difference with Dense – for convenience it also takes an activation function, like Dense(10 => 5, sigmoid).)","category":"page"},{"location":"guide/models/basics/#Stacking-It-Up","page":"Gradients and Layers","title":"Stacking It Up","text":"","category":"section"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"It's pretty common to write models that look something like:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"layer1 = Dense(10 => 5, relu)\n# ...\nmodel(x) = layer3(layer2(layer1(x)))","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"For long chains, it might be a bit more intuitive to have a list of layers, like this:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"using Flux\n\nlayers = [Dense(10 => 5, relu), Dense(5 => 2), softmax]\n\nmodel(x) = foldl((x, m) -> m(x), layers, init = x)\n\nmodel(rand(10)) # => 2-element vector","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Handily, this is also provided for in Flux:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"model2 = Chain(\n Dense(10 => 5, relu),\n Dense(5 => 2),\n softmax)\n\nmodel2(rand(10)) # => 2-element vector","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"This quickly starts to look like a high-level deep learning library; yet you can see how 
it falls out of simple abstractions, and we lose none of the power of Julia code.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"A nice property of this approach is that because \"models\" are just functions (possibly with trainable parameters), you can also see this as simple function composition.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"m = Dense(5 => 2) ∘ Dense(10 => 5, σ)\n\nm(rand(10))","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Likewise, Chain will happily work with any Julia function.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"m = Chain(x -> x^2, x -> x+1)\n\nm(5) # => 26","category":"page"},{"location":"guide/models/basics/#Layer-Helpers","page":"Gradients and Layers","title":"Layer Helpers","text":"","category":"section"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"There is still one problem with this Affine layer, that Flux does not know to look inside it. This means that Flux.train! won't see its parameters, nor will gpu be able to move them to your GPU. These features are enabled by the @layer macro:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Flux.@layer Affine","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Finally, most Flux layers make bias optional, and allow you to supply the function used for generating random weights. 
We can easily add these refinements to the Affine layer as follows, using the helper function create_bias:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"function Affine((in, out)::Pair; bias=true, init=glorot_uniform)\n W = init(out, in)\n b = Flux.create_bias(W, bias, out)\n return Affine(W, b)\nend\n\nAffine(3 => 1, bias=false) |> gpu","category":"page"},{"location":"guide/models/recurrence/#Recurrent-Models","page":"Recurrence","title":"Recurrent Models","text":"","category":"section"},{"location":"guide/models/recurrence/#Recurrent-cells","page":"Recurrence","title":"Recurrent cells","text":"","category":"section"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"To introduce Flux's recurrence functionalities, we will consider the following vanilla recurrent neural network structure:","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"(Image: )","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"In the above, we have a sequence of length 3, where x1 to x3 represent the input at each step (could be a timestamp or a word in a sentence), and y1 to y3 are their respective outputs.","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"An aspect to recognise is that in such a model, the recurrent cells A all refer to the same structure. 
What distinguishes it from a simple dense layer is that the cell A is fed, in addition to an input x, with information from the previous state of the model (hidden state denoted as h1 & h2 in the diagram).","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"In the most basic RNN case, cell A could be defined by the following: ","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"output_size = 5\ninput_size = 2\nWxh = randn(Float32, output_size, input_size)\nWhh = randn(Float32, output_size, output_size)\nb = randn(Float32, output_size)\n\nfunction rnn_cell(h, x)\n h = tanh.(Wxh * x .+ Whh * h .+ b)\n return h, h\nend\n\nx = rand(Float32, input_size) # dummy input data\nh = rand(Float32, output_size) # random initial hidden state\n\nh, y = rnn_cell(h, x)","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"Notice how the above is essentially a Dense layer that acts on two inputs, h and x.","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"If you run the last line a few times, you'll notice the output y changing slightly even though the input x is the same.","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"There are various recurrent cells available in Flux, notably RNNCell, LSTMCell and GRUCell, which are documented in the layer reference. 
The hand-written example above can be replaced with:","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"using Flux\n\nrnn = Flux.RNNCell(2, 5)\n\nx = rand(Float32, 2) # dummy data\nh = rand(Float32, 5) # initial hidden state\n\nh, y = rnn(h, x)","category":"page"},{"location":"guide/models/recurrence/#Stateful-Models","page":"Recurrence","title":"Stateful Models","text":"","category":"section"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"For the most part, we don't want to manage hidden states ourselves, but to treat our models as being stateful. Flux provides the Recur wrapper to do this.","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"x = rand(Float32, 2)\nh = rand(Float32, 5)\n\nm = Flux.Recur(rnn, h)\n\ny = m(x)","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"The Recur wrapper stores the state between runs in the m.state field.","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"If we use the RNN(2, 5) constructor – as opposed to RNNCell – you'll see that it's simply a wrapped cell.","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"julia> using Flux\n\njulia> RNN(2, 5) # or equivalently RNN(2 => 5)\nRecur(\n RNNCell(2 => 5, tanh), # 45 parameters\n) # Total: 4 trainable arrays, 45 parameters,\n # plus 1 non-trainable, 5 parameters, summarysize 412 bytes.","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"Equivalent to the RNN stateful constructor, LSTM and GRU are also available. 
","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"Using these tools, we can now build the model shown in the above diagram with: ","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"julia> m = Chain(RNN(2 => 5), Dense(5 => 1))\nChain(\n Recur(\n RNNCell(2 => 5, tanh), # 45 parameters\n ),\n Dense(5 => 1), # 6 parameters\n) # Total: 6 trainable arrays, 51 parameters,\n # plus 1 non-trainable, 5 parameters, summarysize 580 bytes. ","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"In this example, each output has only one component.","category":"page"},{"location":"guide/models/recurrence/#Working-with-sequences","page":"Recurrence","title":"Working with sequences","text":"","category":"section"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"Using the previously defined m recurrent model, we can now apply it to a single step from our sequence:","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"julia> x = rand(Float32, 2);\n\njulia> m(x)\n1-element Vector{Float32}:\n 0.45860028","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"The m(x) operation would be represented by x1 -> A -> y1 in our diagram. If we perform this operation a second time, it will be equivalent to x2 -> A -> y2 since the model m has stored the state resulting from the x1 step.","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"Now, instead of computing a single step at a time, we can get the full y1 to y3 sequence in a single pass by iterating the model on a sequence of data. 
","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"To do so, we'll need to structure the input data as a Vector of observations at each time step. This Vector will therefore be of length = seq_length and each of its elements will represent the input features for a given step. In our example, this translates into a Vector of length 3, where each element is a Matrix of size (features, batch_size), or just a Vector of length features if dealing with a single observation. ","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"julia> x = [rand(Float32, 2) for i = 1:3];\n\njulia> [m(xi) for xi in x]\n3-element Vector{Vector{Float32}}:\n [0.36080405]\n [-0.13914406]\n [0.9310162]","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"warning: Use of map and broadcast\nMapping and broadcasting operations with stateful layers such are discouraged, since the julia language doesn't guarantee a specific execution order. Therefore, avoid y = m.(x)\n# or \ny = map(m, x)and use explicit loops y = [m(x) for x in x]","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"If for some reason one wants to exclude the first step of the RNN chain for the computation of the loss, that can be handled with:","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"using Flux.Losses: mse\n\nfunction loss(x, y)\n m(x[1]) # ignores the output but updates the hidden states\n sum(mse(m(xi), yi) for (xi, yi) in zip(x[2:end], y))\nend\n\ny = [rand(Float32, 1) for i=1:2]\nloss(x, y)","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"In such a model, only the last two outputs are used to compute the loss, hence the target y being of length 2. 
This is a strategy that can be used to easily handle a seq-to-one kind of structure, compared to the seq-to-seq assumed so far. ","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"Alternatively, if one wants to perform some warmup of the sequence, it could be performed once, followed by regular training where all the steps of the sequence would be considered for the gradient update:","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"function loss(m, x, y)\n sum(mse(m(xi), yi) for (xi, yi) in zip(x, y))\nend\n\nseq_init = [rand(Float32, 2)]\nseq_1 = [rand(Float32, 2) for i = 1:3]\nseq_2 = [rand(Float32, 2) for i = 1:3]\n\ny1 = [rand(Float32, 1) for i = 1:3]\ny2 = [rand(Float32, 1) for i = 1:3]\n\nX = [seq_1, seq_2]\nY = [y1, y2]\ndata = zip(X,Y)\n\nFlux.reset!(m)\n[m(x) for x in seq_init]\n\nopt = Flux.setup(Adam(1e-3), m)\nFlux.train!(loss, m, data, opt)","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"In the previous example, the model's state is first reset with Flux.reset!. Then, there's a warmup that is performed over a sequence of length 1 by feeding it with seq_init, resulting in a warmup state. The model can then be trained for 1 epoch, where 2 batches are provided (seq_1 and seq_2) and the outputs of all timesteps are considered for the loss.","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"In this scenario, it is important to note that a single continuous sequence is considered. 
Since the model state is not reset between the 2 batches, the state of the model flows through the batches, which only makes sense in the context where seq_1 is the continuation of seq_init and so on.","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"Batch size would be 1 here as there's only a single sequence within each batch. If the model were to be trained on multiple independent sequences, then these sequences could be added to the input data as a second dimension. For example, in a language model, each batch would contain multiple independent sentences. In such a scenario, if we set the batch size to 4, a single batch would be of the shape:","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"x = [rand(Float32, 2, 4) for i = 1:3]\ny = [rand(Float32, 1, 4) for i = 1:3]","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"That would mean that we have 4 sentences (or samples), each with 2 features (let's say a very small embedding!) and each with a length of 3 (3 words per sentence). Computing m(batch[1]) would still represent x1 -> y1 in our diagram and return the first word output, but now for each of the 4 independent sentences (second dimension of the input matrix). We do not need to use Flux.reset!(m) here; each sentence in the batch will output in its own \"column\", and the outputs of the different sentences won't mix. ","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"To illustrate, we go through an example of batching with our implementation of rnn_cell. 
The implementation doesn't need to change; the batching comes for \"free\" from the way Julia does broadcasting and the rules of matrix multiplication.","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"output_size = 5\ninput_size = 2\nWxh = randn(Float32, output_size, input_size)\nWhh = randn(Float32, output_size, output_size)\nb = randn(Float32, output_size)\n\nfunction rnn_cell(h, x)\n h = tanh.(Wxh * x .+ Whh * h .+ b)\n return h, h\nend","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"Here, we use the last dimension of the input and the hidden state as the batch dimension. I.e., h[:, n] would be the hidden state of the nth sentence in the batch.","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"batch_size = 4\nx = rand(Float32, input_size, batch_size) # dummy input data\nh = rand(Float32, output_size, batch_size) # random initial hidden state\n\nh, y = rnn_cell(h, x)","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"julia> size(h) == size(y) == (output_size, batch_size)\ntrue","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"In many situations, such as when dealing with a language model, the sentences in each batch are independent (i.e. the last item of the first sentence of the first batch is independent from the first item of the first sentence of the second batch), so we cannot handle the model as if each batch was the direct continuation of the previous one. 
To handle such situations, we need to reset the state of the model between each batch, which can be conveniently performed within the loss function:","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"function loss(x, y)\n Flux.reset!(m)\n sum(mse(m(xi), yi) for (xi, yi) in zip(x, y))\nend","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"A potential source of ambiguity with RNN in Flux can come from the different data layout compared to some common frameworks where data is typically a 3 dimensional array: (features, seq length, samples). In Flux, those 3 dimensions are provided through a vector of seq length containing a matrix (features, samples).","category":"page"},{"location":"reference/models/nnlib/#Neural-Network-primitives-from-NNlib.jl","page":"Low-level Operations – NNlib.jl","title":"Neural Network primitives from NNlib.jl","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"Flux re-exports all of the functions exported by the NNlib package. This includes activation functions, described on their own page. 
Many of the functions on this page exist primarily as the internal implementation of Flux layers, but can also be used independently.","category":"page"},{"location":"reference/models/nnlib/#Attention","page":"Low-level Operations – NNlib.jl","title":"Attention","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"Primitives for the MultiHeadAttention layer.","category":"page"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"NNlib.dot_product_attention\nNNlib.dot_product_attention_scores\nNNlib.make_causal_mask","category":"page"},{"location":"reference/models/nnlib/#NNlib.dot_product_attention","page":"Low-level Operations – NNlib.jl","title":"NNlib.dot_product_attention","text":"dot_product_attention(query, key, value, [bias]; [fdrop, mask, nheads])\n\nMultihead dot product attention used in transformer architectures.\n\nThe input arrays must have the first two dimensions given by the number of features and the sequence length, then an arbitrary number of batch dimensions or none.\n\nReturns the attention output array of size (v_dim, q_len, batch_size...) and the attention scores of size (kv_len, q_len, nheads, batch_size...).\n\nSee also dot_product_attention_scores if you only need the attention scores.\n\nArguments\n\nquery: Query array of size (qk_dim, q_len, batch_size...).\nkey: Key array of size (qk_dim, kv_len, batch_size...).\nvalue: Value array of size (v_dim, kv_len, batch_size...).\nbias: Either nothing or an array broadcastable to size (kv_len, q_len, nheads, batch_size). It will be added to the attention scores before applying the softmax. Default nothing.\nfdrop: A dropout function or layer to be applied on the attention scores right after the softmax. 
Default identity (no dropout).\nmask: Either nothing or a boolean array broadcastable to size (kv_len, q_len, nheads, batch_size). The mask is applied to the attention scores just before the softmax. See make_causal_mask for creating causal masks. Default nothing.\nnheads: Number of heads to split the input arrays into. Default 1.\n\nExamples\n\nq, k, v = rand(10, 20, 2), rand(10, 30, 2), rand(20, 30, 2)\ny, α = dot_product_attention(q, k, v)\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.dot_product_attention_scores","page":"Low-level Operations – NNlib.jl","title":"NNlib.dot_product_attention_scores","text":"dot_product_attention_scores(query, key, [bias]; [fdrop, mask])\n\nReturn the attention scores for the dot_product_attention. Input arrays must have dimensions (num_features ÷ nheads, nheads, sequence_length, batch_size).\n\nSee dot_product_attention for more details.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.make_causal_mask","page":"Low-level Operations – NNlib.jl","title":"NNlib.make_causal_mask","text":"make_causal_mask(x, dims=2)\n\nReturn a boolean square matrix m of the same type as x and of side size(x, dims). 
Its elements are set such that m[i, j] == i ≤ j.\n\nCan be used to mask the attention scores in dot_product_attention.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#Softmax","page":"Low-level Operations – NNlib.jl","title":"Softmax","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"Flux's Flux.logitcrossentropy uses NNlib.logsoftmax internally.","category":"page"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"softmax\nlogsoftmax","category":"page"},{"location":"reference/models/nnlib/#NNlib.softmax","page":"Low-level Operations – NNlib.jl","title":"NNlib.softmax","text":"softmax(x; dims = 1)\n\nSoftmax turns input array x into probability distributions that sum to 1 along the dimensions specified by dims. It is semantically equivalent to the following:\n\nsoftmax(x; dims = 1) = exp.(x) ./ sum(exp.(x), dims = dims)\n\nwith additional manipulations enhancing numerical stability.\n\nFor a matrix input x it will by default (dims = 1) treat it as a batch of vectors, with each column independent. Keyword dims = 2 will instead treat rows independently, and so on.\n\nSee also logsoftmax.\n\nExamples\n\njulia> softmax([1, 2, 3])\n3-element Vector{Float64}:\n 0.09003057317038046\n 0.24472847105479764\n 0.6652409557748218\n\njulia> softmax([1 2 3; 2 2 2]) # dims=1\n2×3 Matrix{Float64}:\n 0.268941 0.5 0.731059\n 0.731059 0.5 0.268941\n\njulia> softmax([1 2 3; 2 2 2]; dims=2)\n2×3 Matrix{Float64}:\n 0.0900306 0.244728 0.665241\n 0.333333 0.333333 0.333333\n\nNote that, when used with Flux.jl, softmax must not be passed to layers like Dense which accept an activation function. The activation is broadcasted over the result, thus applies to individual numbers. 
But softmax always needs to see the whole column.\n\njulia> using Flux\n\njulia> x = randn(Float32, 4, 4, 3, 13);\n\njulia> model = Chain(Conv((4, 4), 3 => 8, tanh), Flux.flatten, Dense(8 => 7), softmax);\n\njulia> model(x) |> size\n(7, 13)\n\njulia> Dense(4 => 7, softmax)(x)\nERROR: `softmax(x)` called with a number, but it expects an array. \n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.logsoftmax","page":"Low-level Operations – NNlib.jl","title":"NNlib.logsoftmax","text":"logsoftmax(x; dims = 1)\n\nComputes the log of softmax in a more numerically stable way than directly taking log.(softmax(xs)). Commonly used in computing cross entropy loss.\n\nIt is semantically equivalent to the following:\n\nlogsoftmax(x; dims = 1) = x .- log.(sum(exp.(x), dims = dims))\n\nSee also softmax.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#Pooling","page":"Low-level Operations – NNlib.jl","title":"Pooling","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"Flux's AdaptiveMaxPool, AdaptiveMeanPool, GlobalMaxPool, GlobalMeanPool, MaxPool, and MeanPool use NNlib.PoolDims, NNlib.maxpool, and NNlib.meanpool as their backend.","category":"page"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"NNlib.PoolDims\nNNlib.lpnormpool\nNNlib.maxpool\nNNlib.meanpool","category":"page"},{"location":"reference/models/nnlib/#NNlib.PoolDims","page":"Low-level Operations – NNlib.jl","title":"NNlib.PoolDims","text":"PoolDims(x_size::NTuple{M}, k::Union{NTuple{L, Int}, Int};\n stride=k, padding=0, dilation=1) where {M, L}\n\nDimensions for a \"pooling\" operation that can have an arbitrary input size, kernel size, stride, dilation, and channel count. 
Used to dispatch onto efficient implementations at compile-time.\n\n\n\n\n\n","category":"type"},{"location":"reference/models/nnlib/#NNlib.lpnormpool","page":"Low-level Operations – NNlib.jl","title":"NNlib.lpnormpool","text":"lpnormpool(x, p::Real, k::NTuple{N, Integer}; pad=0, stride=k)\n\nPerform Lp pool operation with value of the Lp norm p and window size k on input tensor x, also known as LPPool in PyTorch. This pooling operator is from Learned-Norm Pooling for Deep Feedforward and Recurrent Neural Networks.\n\nArguments:\n\nx and k: Expects ndims(x) ∈ 3:5, and always length(k) == ndims(x) - 2\np is restricted to 0 < p < Inf.\npad: See pad_zeros for details.\nstride: Either a tuple with the same length as k, or one integer for all directions. Default is k.\n\nFor all elements x in a size k window, lpnormpool computes (∑ᵢ xᵢ^p)^(1 / p) as an element of the output.\n\nThus lpnormpool(x, 1, k) ./ prod(k) ≈ meanpool(x, k) and lpnormpool(x, 2, k).^2 ./ prod(k) ≈ meanpool(x.^2, k).\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.maxpool","page":"Low-level Operations – NNlib.jl","title":"NNlib.maxpool","text":"maxpool(x, k::NTuple{N, Integer}; pad=0, stride=k)\n\nPerform max pool operation with window size k on input tensor x.\n\nArguments:\n\nx and k: Expects ndims(x) ∈ 3:5, and always length(k) == ndims(x) - 2\npad: See pad_zeros for details.\nstride: Either a tuple with the same length as k, or one integer for all directions. Default is k.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.meanpool","page":"Low-level Operations – NNlib.jl","title":"NNlib.meanpool","text":"meanpool(x, k::NTuple{N, Integer}; pad=0, stride=k)\n\nPerform mean pool operation with window size k on input tensor x.\n\nArguments:\n\nx and k: Expects ndims(x) ∈ 3:5, and always length(k) == ndims(x) - 2\npad: See pad_zeros for details.\nstride: Either a tuple with the same length as k, or one integer for all directions. 
Default is k.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#Padding","page":"Low-level Operations – NNlib.jl","title":"Padding","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"NNlib.pad_circular\nNNlib.pad_constant\nNNlib.pad_reflect\nNNlib.pad_repeat\nNNlib.pad_symmetric\nNNlib.pad_zeros","category":"page"},{"location":"reference/models/nnlib/#NNlib.pad_circular","page":"Low-level Operations – NNlib.jl","title":"NNlib.pad_circular","text":"pad_circular(x, pad::Tuple; [dims])\npad_circular(x, pad::Int; [dims])\n\nPad the array x \"circularly\" across the border by wrapping around values from the opposite side of x. \n\npad can be a tuple of integers (l1, r1, ..., ln, rn) of some length 2n that specifies the left and right padding size for each of the dimensions in dims. If dims is not given, it defaults to the first n dimensions.\n\nIf pad is an integer, it is applied on both sides on every dimension in dims. In this case, dims defaults to the first ndims(x)-2 dimensions (i.e. excludes the channel and batch dimension). \n\nThe pad length on either side in any dimension must not exceed the size of x in that dimension, i.e. 
pad_circular is not able to create arbitrary sized tilings of x.\n\nSee also pad_repeat, pad_reflect, pad_symmetric, and pad_constant.\n\njulia> r = reshape(1:9, 3, 3)\n3×3 reshape(::UnitRange{Int64}, 3, 3) with eltype Int64:\n 1 4 7\n 2 5 8\n 3 6 9\n\njulia> pad_circular(r, (1,2,1,2))\n6×6 Matrix{Int64}:\n 9 3 6 9 3 6\n 7 1 4 7 1 4\n 8 2 5 8 2 5\n 9 3 6 9 3 6\n 7 1 4 7 1 4\n 8 2 5 8 2 5\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.pad_constant","page":"Low-level Operations – NNlib.jl","title":"NNlib.pad_constant","text":"pad_constant(x, pad::Tuple, val = 0; [dims = :])\npad_constant(x, pad::Int, val = 0; [dims = :])\n\nPad the array x with the constant value val.\n\npad can be a tuple of integers. If it is of length 2 * length(dims), it specifies the left and right padding size for each of the dimensions in dims as (l1, r1, ..., ln, rn). If supplied with a tuple of length length(dims) instead, it applies symmetric padding. If dims is not given, it defaults to all dimensions.\n\nFor integer pad input, it is applied on both sides on every dimension in dims.\n\nSee also pad_zeros, pad_repeat, pad_reflect, pad_symmetric, and pad_circular.\n\njulia> r = reshape(1:4, 2, 2)\n2×2 reshape(::UnitRange{Int64}, 2, 2) with eltype Int64:\n 1 3\n 2 4\n\njulia> pad_constant(r, (1, 2, 3, 4), 8)\n5×9 Matrix{Int64}:\n 8 8 8 8 8 8 8 8 8\n 8 8 8 1 3 8 8 8 8\n 8 8 8 2 4 8 8 8 8\n 8 8 8 8 8 8 8 8 8\n 8 8 8 8 8 8 8 8 8\n\njulia> pad_constant(r, 1, 8)\n4×4 Matrix{Int64}:\n 8 8 8 8\n 8 1 3 8\n 8 2 4 8\n 8 8 8 8\n\njulia> r = reshape(1:27, 3, 3, 3)\n3×3×3 reshape(::UnitRange{Int64}, 3, 3, 3) with eltype Int64:\n[:, :, 1] =\n 1 4 7\n 2 5 8\n 3 6 9\n\n[:, :, 2] =\n 10 13 16\n 11 14 17\n 12 15 18\n\n[:, :, 3] =\n 19 22 25\n 20 23 26\n 21 24 27\n\njulia> pad_constant(r, (2,1), dims = 1) # asymmetric padding\n6×3×3 Array{Int64, 3}:\n[:, :, 1] =\n 0 0 0\n 0 0 0\n 1 4 7\n 2 5 8\n 3 6 9\n 0 0 0\n\n[:, :, 2] =\n 0 0 0\n 0 0 0\n 10 13 16\n 11 14 17\n 12 15 18\n 
0 0 0\n\n[:, :, 3] =\n 0 0 0\n 0 0 0\n 19 22 25\n 20 23 26\n 21 24 27\n 0 0 0\n\njulia> pad_constant(r, (2,1, 3), dims = (1,2)) # padding must always be either the same length as dims, or double it\nERROR: ArgumentError: Could not parse padding (2, 1, 3) and dims (1, 2)\nStacktrace:\n[...]\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.pad_reflect","page":"Low-level Operations – NNlib.jl","title":"NNlib.pad_reflect","text":"pad_reflect(x, pad::Tuple; [dims])\npad_reflect(x, pad::Int; [dims])\n\nPad the array x reflecting its values across the border.\n\npad can be a tuple of integers (l1, r1, ..., ln, rn) of some length 2n that specifies the left and right padding size for each of the dimensions in dims. If dims is not given, it defaults to the first n dimensions.\n\nIf pad is an integer, it is applied on both sides on every dimension in dims. In this case, dims defaults to the first ndims(x)-2 dimensions (i.e. excludes the channel and batch dimension). \n\nSee also pad_repeat, pad_symmetric, pad_circular, and pad_constant.\n\njulia> r = reshape(1:9, 3, 3)\n3×3 reshape(::UnitRange{Int64}, 3, 3) with eltype Int64:\n 1 4 7\n 2 5 8\n 3 6 9\n\njulia> pad_reflect(r, (1,2,1,2))\n6×6 Matrix{Int64}:\n 5 2 5 8 5 2\n 4 1 4 7 4 1\n 5 2 5 8 5 2\n 6 3 6 9 6 3\n 5 2 5 8 5 2\n 4 1 4 7 4 1\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.pad_repeat","page":"Low-level Operations – NNlib.jl","title":"NNlib.pad_repeat","text":"pad_repeat(x, pad::Tuple; [dims])\npad_repeat(x, pad::Int; [dims])\n\nPad the array x repeating the values on the border.\n\npad can be a tuple of integers (l1, r1, ..., ln, rn) of some length 2n that specifies the left and right padding size for each of the dimensions in dims. If dims is not given, it defaults to the first n dimensions.\n\nIf pad is an integer, it is applied on both sides on every dimension in dims. In this case, dims defaults to the first ndims(x)-2 dimensions (i.e. 
excludes the channel and batch dimension). \n\nSee also pad_reflect, pad_symmetric, pad_circular, and pad_constant.\n\njulia> r = reshape(1:9, 3, 3)\n3×3 reshape(::UnitRange{Int64}, 3, 3) with eltype Int64:\n 1 4 7\n 2 5 8\n 3 6 9\n\njulia> pad_repeat(r, (1,2,3,4))\n6×10 Matrix{Int64}:\n 1 1 1 1 4 7 7 7 7 7\n 1 1 1 1 4 7 7 7 7 7\n 2 2 2 2 5 8 8 8 8 8\n 3 3 3 3 6 9 9 9 9 9\n 3 3 3 3 6 9 9 9 9 9\n 3 3 3 3 6 9 9 9 9 9\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.pad_symmetric","page":"Low-level Operations – NNlib.jl","title":"NNlib.pad_symmetric","text":"pad_symmetric(x, pad::Tuple; [dims])\npad_symmetric(x, pad::Int; [dims])\n\nPad the array x reflecting its values symmetrically across the border, i.e. the border values of x are present in the padding values, in contrast to pad_reflect.\n\npad can be a tuple of integers (l1, r1, ..., ln, rn) of some length 2n that specifies the left and right padding size for each of the dimensions in dims. If dims is not given, it defaults to the first n dimensions.\n\nIf pad is an integer, it is applied on both sides on every dimension in dims. In this case, dims defaults to the first ndims(x)-2 dimensions (i.e. excludes the channel and batch dimension). \n\nSee also pad_repeat, pad_reflect, pad_circular, and pad_constant.\n\njulia> r = reshape(1:9, 3, 3)\n3×3 reshape(::UnitRange{Int64}, 3, 3) with eltype Int64:\n 1 4 7\n 2 5 8\n 3 6 9\n\njulia> pad_symmetric(r, (1,2,1,2))\n6×6 Matrix{Int64}:\n 1 1 4 7 7 4\n 1 1 4 7 7 4\n 2 2 5 8 8 5\n 3 3 6 9 9 6\n 3 3 6 9 9 6\n 2 2 5 8 8 5\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.pad_zeros","page":"Low-level Operations – NNlib.jl","title":"NNlib.pad_zeros","text":"pad_zeros(x, pad::Tuple; [dims])\npad_zeros(x, pad::Int; [dims])\n\nPad the array x with zeros. Equivalent to pad_constant with the constant equal to 0. 
\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#Convolution","page":"Low-level Operations – NNlib.jl","title":"Convolution","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"Flux's Conv and CrossCor layers use NNlib.DenseConvDims and NNlib.conv internally. ","category":"page"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"conv\nConvDims\ndepthwiseconv\nDepthwiseConvDims\nDenseConvDims","category":"page"},{"location":"reference/models/nnlib/#NNlib.conv","page":"Low-level Operations – NNlib.jl","title":"NNlib.conv","text":"conv(x, w; stride = 1, pad = 0, dilation = 1, flipped = false, groups = 1)\n\nApply convolution filter w to input x. x and w are 3d/4d/5d tensors in 1d/2d/3d convolutions respectively. x and w may have real or complex element types.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.ConvDims","page":"Low-level Operations – NNlib.jl","title":"NNlib.ConvDims","text":"ConvDims\n\nType system-level information about convolution dimensions. Critical for things like im2col!() to generate efficient code, and helpful to reduce the number of kwargs getting passed around.\n\n\n\n\n\n","category":"type"},{"location":"reference/models/nnlib/#NNlib.depthwiseconv","page":"Low-level Operations – NNlib.jl","title":"NNlib.depthwiseconv","text":"depthwiseconv(x, w; stride=1, pad=0, dilation=1, flipped=false)\n\nDepthwise convolution operation with filter w on input x. x and w are 3d/4d/5d tensors in 1d/2d/3d convolutions respectively.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.DepthwiseConvDims","page":"Low-level Operations – NNlib.jl","title":"NNlib.DepthwiseConvDims","text":"DepthwiseConvDims\n\nConcrete subclass of ConvDims for a depthwise convolution. 
Differs primarily due to characterization by Cin, Cmult, rather than Cin, Cout. Useful to be separate from DenseConvDims primarily for channel calculation differences.\n\n\n\n\n\n","category":"type"},{"location":"reference/models/nnlib/#NNlib.DenseConvDims","page":"Low-level Operations – NNlib.jl","title":"NNlib.DenseConvDims","text":"DenseConvDims\n\nConcrete subclass of ConvDims for a normal, dense, conv2d/conv3d.\n\n\n\n\n\n","category":"type"},{"location":"reference/models/nnlib/#Dropout","page":"Low-level Operations – NNlib.jl","title":"Dropout","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"NNlib.dropout\nNNlib.dropout!","category":"page"},{"location":"reference/models/nnlib/#NNlib.dropout","page":"Low-level Operations – NNlib.jl","title":"NNlib.dropout","text":"dropout([rng], A, p; [dims])\n\nReturns an array in which each element of A is either replaced with zero, with probability p, or else multiplied by 1/(1-p).\n\nBy default every element is treated independently. With keyword dims=1, a choice is made for every value of the 1st index i.e. 
each row of a matrix is either zero or not.\n\nOptional first argument is the random number generator used.\n\nExamples\n\njulia> dropout(ones(2, 10), 0.2)\n2×10 Matrix{Float64}:\n 1.25 1.25 0.0 1.25 1.25 1.25 1.25 1.25 1.25 1.25\n 1.25 1.25 1.25 0.0 1.25 1.25 0.0 1.25 1.25 1.25\n\njulia> mean(dropout(ones(10^4, 5), 0.2), dims=1)\n1×5 Matrix{Float64}:\n 0.998 1.00075 0.99125 0.99575 1.00075\n\njulia> dropout(ones(5, 5), 0.7, dims=1) # whole row the same\n5×5 Matrix{Float64}:\n 3.33333 3.33333 3.33333 3.33333 3.33333\n 0.0 0.0 0.0 0.0 0.0\n 0.0 0.0 0.0 0.0 0.0\n 3.33333 3.33333 3.33333 3.33333 3.33333\n 0.0 0.0 0.0 0.0 0.0\n\njulia> mean(dropout(ones(10^4, 5), 0.3, dims=1), dims=1)\n1×5 Matrix{Float64}:\n 1.00571 1.00571 1.00571 1.00571 1.00571\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.dropout!","page":"Low-level Operations – NNlib.jl","title":"NNlib.dropout!","text":"dropout!(B, A, p; [dims])\n\nThis does exactly B .= dropout(A, p; dims), or rather, it's the implementation of out-of-place dropout.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#Upsampling","page":"Low-level Operations – NNlib.jl","title":"Upsampling","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"Flux's Upsample layer uses NNlib.upsample_nearest, NNlib.upsample_bilinear, and NNlib.upsample_trilinear as its backend. 
Additionally, Flux's PixelShuffle layer uses NNlib.pixel_shuffle as its backend.","category":"page"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"upsample_nearest\nupsample_linear\n∇upsample_linear\nupsample_bilinear\n∇upsample_bilinear\nupsample_trilinear\n∇upsample_trilinear\npixel_shuffle","category":"page"},{"location":"reference/models/nnlib/#NNlib.upsample_nearest","page":"Low-level Operations – NNlib.jl","title":"NNlib.upsample_nearest","text":"upsample_nearest(x, scale::NTuple{S,Int})\nupsample_nearest(x; size::NTuple{S,Int})\n\nUpsamples the array x by integer multiples along the first S dimensions. Subsequent dimensions of x are not altered.\n\nEither the scale factors or the final output size can be specified.\n\nSee also upsample_bilinear, for two dimensions of an N=4 array.\n\nExample\n\njulia> upsample_nearest([1 2 3; 4 5 6], (2, 3))\n4×9 Matrix{Int64}:\n 1 1 1 2 2 2 3 3 3\n 1 1 1 2 2 2 3 3 3\n 4 4 4 5 5 5 6 6 6\n 4 4 4 5 5 5 6 6 6\n\njulia> ans == upsample_nearest([1 2 3; 4 5 6]; size=(4, 9)) # equivalent\ntrue\n\njulia> upsample_nearest([1 2 3; 4 5 6], (2,))\n4×3 Matrix{Int64}:\n 1 2 3\n 1 2 3\n 4 5 6\n 4 5 6\n\njulia> ans == upsample_nearest([1 2 3; 4 5 6], size=(4,))\ntrue\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.upsample_linear","page":"Low-level Operations – NNlib.jl","title":"NNlib.upsample_linear","text":"upsample_linear(x::AbstractArray{T,3}, scale::Real; align_corners::Bool = true)\nupsample_linear(x::AbstractArray{T,3}; size::Integer, align_corners::Bool = true)\n\nUpsamples the first dimension of the array x by the provided scale, using linear interpolation. 
As an alternative to using scale, the resulting array size can be directly specified with a keyword argument.\n\nThe size of the output is equal to (scale*S1, S2, S3), where S1, S2, S3 = size(x).\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.∇upsample_linear","page":"Low-level Operations – NNlib.jl","title":"NNlib.∇upsample_linear","text":"∇upsample_linear(Δ::AbstractArray{T,3}; size::Integer, align_corners::Bool = true) where T\n\nArguments\n\nΔ: Incoming gradient array, backpropagated from downstream layers\nsize: Size of the image upsampled in the first place\n\nOutputs\n\ndx: Downsampled version of Δ\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.upsample_bilinear","page":"Low-level Operations – NNlib.jl","title":"NNlib.upsample_bilinear","text":"upsample_bilinear(x::AbstractArray{T,4}, scale::NTuple{2,Real}; align_corners::Bool = true)\nupsample_bilinear(x::AbstractArray{T,4}; size::NTuple{2,Integer}, align_corners::Bool = true)\n\nUpsamples the first 2 dimensions of the array x by the upsample factors stored in scale, using bilinear interpolation. 
As an alternative to using scale, the resulting image size can be directly specified with a keyword argument.\n\nThe size of the output is equal to (scale[1]*S1, scale[2]*S2, S3, S4), where S1, S2, S3, S4 = size(x).\n\nExamples\n\njulia> x = reshape(Float32[1 2 3; 4 5 6], (2,3,1,1))\n2×3×1×1 Array{Float32, 4}:\n[:, :, 1, 1] =\n 1.0 2.0 3.0\n 4.0 5.0 6.0\n\njulia> upsample_bilinear(x, (2, 3))\n4×9×1×1 Array{Float32, 4}:\n[:, :, 1, 1] =\n 1.0 1.25 1.5 1.75 2.0 2.25 2.5 2.75 3.0\n 2.0 2.25 2.5 2.75 3.0 3.25 3.5 3.75 4.0\n 3.0 3.25 3.5 3.75 4.0 4.25 4.5 4.75 5.0\n 4.0 4.25 4.5 4.75 5.0 5.25 5.5 5.75 6.0\n\njulia> ans == upsample_bilinear(x; size=(4, 9)) # specify ouput size instead\ntrue\n\njulia> upsample_bilinear(x, (2.5, 3.5)) # non-integer scaling factors are allowed\n5×10×1×1 Array{Float32, 4}:\n[:, :, 1, 1] =\n 1.0 1.22222 1.44444 1.66667 1.88889 … 2.33333 2.55556 2.77778 3.0\n 1.75 1.97222 2.19444 2.41667 2.63889 3.08333 3.30556 3.52778 3.75\n 2.5 2.72222 2.94444 3.16667 3.38889 3.83333 4.05556 4.27778 4.5\n 3.25 3.47222 3.69444 3.91667 4.13889 4.58333 4.80556 5.02778 5.25\n 4.0 4.22222 4.44444 4.66667 4.88889 5.33333 5.55556 5.77778 6.0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.∇upsample_bilinear","page":"Low-level Operations – NNlib.jl","title":"NNlib.∇upsample_bilinear","text":"∇upsample_bilinear(Δ::AbstractArray{T,4}; size::NTuple{2,Integer}, align_corners::Bool = true) where T\n\nArguments\n\nΔ: Incoming gradient array, backpropagated from downstream layers\nsize: Lateral (W,H) size of the image upsampled in the first place\n\nOutputs\n\ndx: Downsampled version of Δ\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.upsample_trilinear","page":"Low-level Operations – NNlib.jl","title":"NNlib.upsample_trilinear","text":"upsample_trilinear(x::AbstractArray{T,5}, scale::NTuple{3,Real}; align_corners::Bool = true)\nupsample_trilinear(x::AbstractArray{T,5}; size::NTuple{3,Integer}, 
align_corners::Bool = true)\n\nUpsamples the first 3 dimensions of the array x by the upsample factors stored in scale, using trilinear interpolation. As an alternative to using scale, the resulting image size can be directly specified with a keyword argument.\n\nThe size of the output is equal to (scale[1]*S1, scale[2]*S2, scale[3]*S3, S4, S5), where S1, S2, S3, S4, S5 = size(x).\n\nExamples\n\nupsample_trilinear(x, (2, 3, 4))\nupsample_trilinear(x; size=(4, 9, 11)) # specify ouput size instead\nupsample_trilinear(x, (2.5, 3.5, pi)) # non-integer scaling factors are allowed\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.∇upsample_trilinear","page":"Low-level Operations – NNlib.jl","title":"NNlib.∇upsample_trilinear","text":"∇upsample_trilinear(Δ::AbstractArray{T,5}; size::NTuple{3,Integer}, align_corners::Bool = true) where T\n\nArguments\n\nΔ: Incoming gradient array, backpropagated from downstream layers\nsize: Lateral size & depth (W,H,D) of the image upsampled in the first place\n\nOutputs\n\ndx: Downsampled version of Δ\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.pixel_shuffle","page":"Low-level Operations – NNlib.jl","title":"NNlib.pixel_shuffle","text":"pixel_shuffle(x, r::Integer)\n\nPixel shuffling operation, upscaling by a factor r.\n\nFor 4-arrays representing N images, the operation converts input size(x) == (W, H, r^2*C, N) to output of size (r*W, r*H, C, N). For D-dimensional data, it expects ndims(x) == D+2 with channel and batch dimensions, and divides the number of channels by r^D.\n\nUsed in super-resolution networks to upsample towards high resolution features. Reference: Shi et. 
al., \"Real-Time Single Image and Video Super-Resolution ...\", CVPR 2016, https://arxiv.org/abs/1609.05158\n\nExamples\n\njulia> x = [10i + j + channel/10 for i in 1:2, j in 1:3, channel in 1:4, batch in 1:1]\n2×3×4×1 Array{Float64, 4}:\n[:, :, 1, 1] =\n 11.1 12.1 13.1\n 21.1 22.1 23.1\n\n[:, :, 2, 1] =\n 11.2 12.2 13.2\n 21.2 22.2 23.2\n\n[:, :, 3, 1] =\n 11.3 12.3 13.3\n 21.3 22.3 23.3\n\n[:, :, 4, 1] =\n 11.4 12.4 13.4\n 21.4 22.4 23.4\n\njulia> pixel_shuffle(x, 2) # 4 channels used up as 2x upscaling of image dimensions\n4×6×1×1 Array{Float64, 4}:\n[:, :, 1, 1] =\n 11.1 11.3 12.1 12.3 13.1 13.3\n 11.2 11.4 12.2 12.4 13.2 13.4\n 21.1 21.3 22.1 22.3 23.1 23.3\n 21.2 21.4 22.2 22.4 23.2 23.4\n\njulia> y = [i + channel/10 for i in 1:3, channel in 1:6, batch in 1:1]\n3×6×1 Array{Float64, 3}:\n[:, :, 1] =\n 1.1 1.2 1.3 1.4 1.5 1.6\n 2.1 2.2 2.3 2.4 2.5 2.6\n 3.1 3.2 3.3 3.4 3.5 3.6\n\njulia> pixel_shuffle(y, 2) # 1D image, with 6 channels reduced to 3\n6×3×1 Array{Float64, 3}:\n[:, :, 1] =\n 1.1 1.3 1.5\n 1.2 1.4 1.6\n 2.1 2.3 2.5\n 2.2 2.4 2.6\n 3.1 3.3 3.5\n 3.2 3.4 3.6\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#Batched-Operations","page":"Low-level Operations – NNlib.jl","title":"Batched Operations","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"Flux's Flux.Bilinear layer uses NNlib.batched_mul internally.","category":"page"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"batched_mul\nbatched_mul!\nbatched_adjoint\nbatched_transpose\nbatched_vec","category":"page"},{"location":"reference/models/nnlib/#NNlib.batched_mul","page":"Low-level Operations – NNlib.jl","title":"NNlib.batched_mul","text":"batched_mul(A, B) -> C\nA ⊠ B # \\boxtimes\n\nBatched matrix multiplication. Result has C[:,:,k...] == A[:,:,k...] * B[:,:,k...] where k... 
represent any indices in the last dimensions.\n\nIf ndims(A) == ndims(B) == 3 and size(B,3) == 1 then instead C[:,:,k] == A[:,:,k] * B[:,:,1], and similarly for A.\n\nTo transpose each matrix, apply batched_transpose to the array, or batched_adjoint for conjugate-transpose:\n\njulia> A, B = randn(2,5,17), randn(5,9,17);\n\njulia> A ⊠ B |> size\n(2, 9, 17)\n\njulia> batched_adjoint(A) |> size\n(5, 2, 17)\n\njulia> batched_mul(A, batched_adjoint(randn(9,5,17))) |> size\n(2, 9, 17)\n\njulia> A ⊠ randn(5,9,1) |> size\n(2, 9, 17)\n\njulia> batched_transpose(A) == PermutedDimsArray(A, (2,1,3))\ntrue\n\nThe equivalent PermutedDimsArray may be used in place of batched_transpose. Other permutations are also handled by BLAS, provided that the batch index k is not the first dimension of the underlying array. Thus PermutedDimsArray(::Array, (1,3,2)) and PermutedDimsArray(::Array, (3,1,2)) are fine.\n\nHowever, A = PermutedDimsArray(::Array, (3,2,1)) is not acceptable to BLAS, since the batch dimension is the contiguous one: stride(A,3) == 1. This will be copied, as doing so is faster than batched_mul_generic!.\n\nBoth this copy and batched_mul_generic! produce @debug messages, and setting for instance ENV[\"JULIA_DEBUG\"] = NNlib will display them.\n\n\n\n\n\nbatched_mul(A::Array{T,3}, B::Matrix)\nbatched_mul(A::Matrix, B::Array{T,3})\nA ⊠ B\n\nThis is always matrix-matrix multiplication, but either A or B may lack a batch index.\n\nWhen B is a matrix, result has C[:,:,k] == A[:,:,k] * B[:,:] for all k.\nWhen A is a matrix, then C[:,:,k] == A[:,:] * B[:,:,k]. 
This can also be done by reshaping and calling *, for instance A ⊡ B using TensorCore.jl, but is implemented here using batched_gemm instead of gemm.\n\njulia> randn(16,8,32) ⊠ randn(8,4) |> size\n(16, 4, 32)\n\njulia> randn(16,8,32) ⊠ randn(8,4,1) |> size # equivalent\n(16, 4, 32)\n\njulia> randn(16,8) ⊠ randn(8,4,32) |> size\n(16, 4, 32)\n\nSee also batched_vec to regard B as a batch of vectors, A[:,:,k] * B[:,k].\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.batched_mul!","page":"Low-level Operations – NNlib.jl","title":"NNlib.batched_mul!","text":"batched_mul!(C, A, B) -> C\nbatched_mul!(C, A, B, α=1, β=0)\n\nIn-place batched matrix multiplication, equivalent to mul!(C[:,:,k], A[:,:,k], B[:,:,k], α, β) for all k. If size(B,3) == 1 then every batch uses B[:,:,1] instead.\n\nThis will call batched_gemm! whenever possible. For real arrays this means that, for X ∈ [A,B,C], either stride(X,1)==1 or stride(X,2)==1, the latter may be caused by batched_transpose or by for instance PermutedDimsArray(::Array, (3,1,2)). Unlike batched_mul this will never make a copy.\n\nFor complex arrays, the wrapper made by batched_adjoint must be outermost to be seen. 
In this case the strides accepted by BLAS are more restricted: if stride(C,1)==1 then only stride(AorB::BatchedAdjoint,2) == 1 is accepted.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.batched_adjoint","page":"Low-level Operations – NNlib.jl","title":"NNlib.batched_adjoint","text":"batched_transpose(A::AbstractArray{T,3})\nbatched_adjoint(A)\n\nEquivalent to applying transpose or adjoint to each matrix A[:,:,k].\n\nThese exist to control how batched_mul behaves, as it operates on such matrix slices of an array with ndims(A)==3.\n\nPermutedDimsArray(A, (2,1,3)) is equivalent to batched_transpose(A), and is also understood by batched_mul (and more widely supported elsewhere).\n\nBatchedTranspose{T, S} <: AbstractBatchedMatrix{T, 3}\nBatchedAdjoint{T, S}\n\nLazy wrappers analogous to Transpose and Adjoint, returned by batched_transpose etc.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.batched_transpose","page":"Low-level Operations – NNlib.jl","title":"NNlib.batched_transpose","text":"batched_transpose(A::AbstractArray{T,3})\nbatched_adjoint(A)\n\nEquivalent to applying transpose or adjoint to each matrix A[:,:,k].\n\nThese exist to control how batched_mul behaves, as it operates on such matrix slices of an array with ndims(A)==3.\n\nPermutedDimsArray(A, (2,1,3)) is equivalent to batched_transpose(A), and is also understood by batched_mul (and more widely supported elsewhere).\n\nBatchedTranspose{T, S} <: AbstractBatchedMatrix{T, 3}\nBatchedAdjoint{T, S}\n\nLazy wrappers analogous to Transpose and Adjoint, returned by batched_transpose etc.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.batched_vec","page":"Low-level Operations – NNlib.jl","title":"NNlib.batched_vec","text":"batched_vec(A::Array{T,3}, B::Matrix)\nbatched_vec(A::Array{T,3}, b::Vector)\n\nBatched matrix-vector multiplication: the result has C[:,k] == A[:,:,k] * B[:,k] for all k, or else C[:,k] == 
A[:,:,k] * b for b::Vector.\n\nWith the same argument types, batched_mul(A, B) would regard B as a fixed matrix, not a batch of vectors. Both reshape and then call batched_mul(::Array{T,3}, ::Array{T,3}).\n\njulia> A, B, b = randn(16,8,32), randn(8,32), randn(8);\n\njulia> batched_vec(A,B) |> size\n(16, 32)\n\njulia> batched_vec(A,b) |> size\n(16, 32)\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#Gather-and-Scatter","page":"Low-level Operations – NNlib.jl","title":"Gather and Scatter","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"Flux's Embedding layer uses NNlib.gather as its backend.","category":"page"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"NNlib.gather\nNNlib.gather!\nNNlib.scatter\nNNlib.scatter!","category":"page"},{"location":"reference/models/nnlib/#NNlib.gather","page":"Low-level Operations – NNlib.jl","title":"NNlib.gather","text":"NNlib.gather(src, idx) -> dst\n\nReverse operation of scatter. Gathers data from source src and writes it in a destination dst according to the index array idx. For each k in CartesianIndices(idx), assign values to dst according to\n\ndst[:, ... , k] .= src[:, ... , idx[k]...]\n\nNotice that if idx is a vector containing integers and src is a matrix, previous expression simplifies to\n\ndst[:, k] .= src[:, idx[k]]\n\nand k will run over 1:length(idx).\n\nThe elements of idx can be integers or integer tuples and may be repeated. A single src column can end up being copied into zero, one, or multiple dst columns.\n\nSee gather! 
for an in-place version.\n\nExamples\n\njulia> NNlib.gather([1,20,300,4000], [2,4,2])\n3-element Vector{Int64}:\n 20\n 4000\n 20\n\njulia> NNlib.gather([1 2 3; 4 5 6], [1,3,1,3,1])\n2×5 Matrix{Int64}:\n 1 3 1 3 1\n 4 6 4 6 4\n\n\n\n\n\ngather(src, IJK...)\n\nConvert the tuple of integer vectors IJK to a tuple of CartesianIndex and call gather on it: gather(src, CartesianIndex.(IJK...)).\n\nExamples\n\njulia> src = reshape([1:15;], 3, 5)\n3×5 Matrix{Int64}:\n 1 4 7 10 13\n 2 5 8 11 14\n 3 6 9 12 15\n\njulia> NNlib.gather(src, [1, 2], [2, 4])\n2-element Vector{Int64}:\n 4\n 11\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.gather!","page":"Low-level Operations – NNlib.jl","title":"NNlib.gather!","text":"NNlib.gather!(dst, src, idx)\n\nReverse operation of scatter!. Gathers data from source src and writes it in destination dst according to the index array idx. For each k in CartesianIndices(idx), assign values to dst according to\n\ndst[:, ... , k] .= src[:, ... , idx[k]...]\n\nNotice that if idx is a vector containing integers, and both dst and src are matrices, previous expression simplifies to\n\ndst[:, k] .= src[:, idx[k]]\n\nand k will run over 1:length(idx).\n\nThe elements of idx can be integers or integer tuples and may be repeated. A single src column can end up being copied into zero, one, or multiple dst columns.\n\nSee gather for an allocating version.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.scatter","page":"Low-level Operations – NNlib.jl","title":"NNlib.scatter","text":"NNlib.scatter(op, src, idx; [init, dstsize])\n\nScatter operation allocating a destination array dst and calling scatter!(op, dst, src, idx) on it.\n\nIf keyword init is provided, it is used to initialize the content of dst. Otherwise, the init values is inferred from the reduction operator op for some common operators (e.g. 
init = 0 for op = +).\nIf dstsize is provided, it will be used to define the size of destination array, otherwise it will be inferred by src and idx.\n\nSee scatter! for full details on how idx works.\n\nExamples\n\njulia> NNlib.scatter(+, [10,100,1000], [3,1,2])\n3-element Vector{Int64}:\n 100\n 1000\n 10\n\njulia> NNlib.scatter(+, [1 2 3 4; 5 6 7 8], [2,1,1,5])\n2×5 Matrix{Int64}:\n 5 1 0 0 4\n 13 5 0 0 8\n\njulia> NNlib.scatter(*, [10,200,3000], [1,4,2]; init = 10, dstsize = 6)\n6-element Vector{Int64}:\n 100\n 30000\n 10\n 2000\n 10\n 10\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.scatter!","page":"Low-level Operations – NNlib.jl","title":"NNlib.scatter!","text":"NNlib.scatter!(op, dst, src, idx)\n\nScatter operation, which writes data in src into dst at locations idx. A binary reduction operator op is applied during the scatter. For each index k in idx, accumulates values in dst according to\n\ndst[:, ..., idx[k]...] = (op).(dst[:, ..., idx[k]...], src[:, ..., k...])\n\nSee also scatter, gather.\n\nArguments\n\nop: Operations to be applied on dst and src, e.g. +, -, *, /, max, min and mean.\ndst: The destination for src to aggregate to. This argument will be mutated.\nsrc: The source data for aggregating.\nidx: The mapping for aggregation from source (index) to destination (value). 
The idx array can contain either integers or tuples.\n\nExamples\n\njulia> NNlib.scatter!(+, ones(3), [10,100], [1,3])\n3-element Vector{Float64}:\n 11.0\n 1.0\n 101.0\n\njulia> NNlib.scatter!(*, fill(0.5, 2, 4), [1 10; 100 1000], [3,2])\n2×4 Matrix{Float64}:\n 0.5 5.0 0.5 0.5\n 0.5 500.0 50.0 0.5\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#Sampling","page":"Low-level Operations – NNlib.jl","title":"Sampling","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"grid_sample\n∇grid_sample","category":"page"},{"location":"reference/models/nnlib/#NNlib.grid_sample","page":"Low-level Operations – NNlib.jl","title":"NNlib.grid_sample","text":"grid_sample(input::AbstractArray{T, 4}, grid::AbstractArray{T, 4}; padding_mode = :zeros)\n\nGiven input, compute output by sampling input values at pixel locations from grid. Uses bilinear interpolation to calculate output values.\n\nThis implementation assumes the extrema (-1 and 1) are considered as referring to the center points of the input’s corner pixels (i.e. align corners is true).\n\nArguments\n\ninput: Input array in (W_in, H_in, C, N) shape.\ngrid: Input grid in (2, W_out, H_out, N) shape. Where for each (W_out, H_out, N) grid contains (x, y) coordinates that specify sampling locations normalized by the input shape.\nTherefore, x and y should have values in [-1, 1] range. For example, (x = -1, y = -1) is the left-top pixel of input, and (x = 1, y = 1) is the right-bottom pixel of input.\nOut-of-bound values are handled according to the padding_mode.\npadding_mode: Out-of-bound padding. :zeros to use 0 for out-of-bound grid locations. :border to use border values for out-of-bound grid locations. 
Default is :zeros.\n\nReturns\n\n(W_out, H_out, C, N) sampled grid from input.\n\nExamples\n\nIn the example below, grid contains two out-of-bound sampling locations, which are handled differently, depending on the padding_mode.\n\njulia> x = reshape(collect(1.0:4.0), (2, 2, 1, 1))\n2×2×1×1 Array{Float64, 4}:\n[:, :, 1, 1] =\n 1.0 3.0\n 2.0 4.0\n\njulia> grid = Array{Float64}(undef, 2, 3, 2, 1);\n\njulia> grid[:, 1, 1, 1] .= (-3, -1);\n\njulia> grid[:, 2, 1, 1] .= (0, -1);\n\njulia> grid[:, 3, 1, 1] .= (1, -1);\n\njulia> grid[:, 1, 2, 1] .= (-1, 1);\n\njulia> grid[:, 2, 2, 1] .= (0, 1);\n\njulia> grid[:, 3, 2, 1] .= (3, 1);\n\njulia> grid_sample(x, grid; padding_mode=:zeros)\n3×2×1×1 Array{Float64, 4}:\n[:, :, 1, 1] =\n 0.0 3.0\n 1.5 3.5\n 2.0 0.0\n\njulia> grid_sample(x, grid; padding_mode=:border)\n3×2×1×1 Array{Float64, 4}:\n[:, :, 1, 1] =\n 1.0 3.0\n 1.5 3.5\n 2.0 4.0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.∇grid_sample","page":"Low-level Operations – NNlib.jl","title":"NNlib.∇grid_sample","text":"∇grid_sample(Δ::AbstractArray{T, 4}, input::AbstractArray{T, 4}, grid::AbstractArray{T, 4}; padding_mode = :zeros) where T\n\nArguments\n\nΔ: Input gradient in (W_out, H_out, C, N) shape (same as output of the primal computation).\ninput: Input from primal computation in (W_in, H_in, C, N) shape.\ngrid: Grid from primal computation in (2, W_out, H_out, N) shape.\npadding_mode: Out-of-bound padding. :zeros to use 0 for out-of-bound grid locations. :border to use border values for out-of-bound grid locations. Should be the same as in primal computation. 
Default is :zeros.\n\nReturns\n\ndinput (same shape as input) and dgrid (same shape as grid) gradients.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#Losses","page":"Low-level Operations – NNlib.jl","title":"Losses","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"ctc_loss","category":"page"},{"location":"reference/models/nnlib/#NNlib.ctc_loss","page":"Low-level Operations – NNlib.jl","title":"NNlib.ctc_loss","text":"ctc_loss(ŷ, y)\n\nComputes the connectionist temporal classification loss between ŷ and y. ŷ must be a classes-by-time matrix, i.e., each row represents a class and each column represents a time step. Additionally, the logsoftmax function will be applied to ŷ, so ŷ must be the raw activation values from the neural network and not, for example, the activations after being passed through a softmax activation function. y must be a 1D array of the labels associated with ŷ. The blank label is assumed to be the last label category in ŷ, so it is equivalent to size(ŷ, 1). Used for sequence-to-sequence classification problems such as speech recognition and handwriting recognition where the exact time-alignment of the output (e.g., letters) is not needed to solve the problem. See Graves et al. (2006) or Graves (2012) for mathematical details.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#Miscellaneous","page":"Low-level Operations – NNlib.jl","title":"Miscellaneous","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"logsumexp\nNNlib.glu","category":"page"},{"location":"reference/models/nnlib/#NNlib.logsumexp","page":"Low-level Operations – NNlib.jl","title":"NNlib.logsumexp","text":"logsumexp(x; dims = :)\n\nComputes log.(sum(exp.(x); dims)) in a numerically stable way. 
Without dims keyword this returns a scalar.\n\nSee also logsoftmax.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.glu","page":"Low-level Operations – NNlib.jl","title":"NNlib.glu","text":"glu(x, dim = 1)\n\nThe gated linear unit from the \"Language Modeling with Gated Convolutional Networks\" paper.\n\nCalculates a .* sigmoid(b), where x is split in half along given dimension dim to form a and b.\n\n\n\n\n\n","category":"function"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"CurrentModule = Flux\nCollapsedDocStrings = true","category":"page"},{"location":"reference/training/optimisers/#man-optimisers","page":"Optimisation Rules","title":"Optimisation Rules","text":"","category":"section"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"Any optimization rule from Optimisers.jl can be used with train! and other training functions.","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"For full details of how the new interface works, see the Optimisers.jl documentation.","category":"page"},{"location":"reference/training/optimisers/#Optimisers-Reference","page":"Optimisation Rules","title":"Optimisers Reference","text":"","category":"section"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"All optimisers return an object that, when passed to train!, will update the parameters passed to it.","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation 
Rules","text":"Optimisers.Descent\nOptimisers.Momentum\nOptimisers.Nesterov\nOptimisers.RMSProp\nOptimisers.Adam\nOptimisers.RAdam\nOptimisers.AdaMax\nOptimisers.AdaGrad\nOptimisers.AdaDelta\nOptimisers.AMSGrad\nOptimisers.NAdam\nOptimisers.AdamW\nOptimisers.OAdam\nOptimisers.AdaBelief\nOptimisers.Lion","category":"page"},{"location":"reference/training/optimisers/#Optimisers.Descent","page":"Optimisation Rules","title":"Optimisers.Descent","text":"Descent(η = 1f-1)\nDescent(; eta)\n\nClassic gradient descent optimiser with learning rate η. For each parameter p and its gradient dp, this runs p -= η*dp.\n\nParameters\n\nLearning rate (η == eta): Amount by which gradients are discounted before updating the weights.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.Momentum","page":"Optimisation Rules","title":"Optimisers.Momentum","text":"Momentum(η = 0.01, ρ = 0.9)\nMomentum(; [eta, rho])\n\nGradient descent optimizer with learning rate η and momentum ρ.\n\nParameters\n\nLearning rate (η == eta): Amount by which gradients are discounted before updating the weights.\nMomentum (ρ == rho): Controls the acceleration of gradient descent in the prominent direction, in effect dampening oscillations.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.Nesterov","page":"Optimisation Rules","title":"Optimisers.Nesterov","text":"Nesterov(η = 0.001, ρ = 0.9)\n\nGradient descent optimizer with learning rate η and Nesterov momentum ρ.\n\nParameters\n\nLearning rate (η): Amount by which gradients are discounted before updating the weights.\nNesterov momentum (ρ): Controls the acceleration of gradient descent in the prominent direction, in effect dampening oscillations.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.RMSProp","page":"Optimisation Rules","title":"Optimisers.RMSProp","text":"RMSProp(η = 0.001, ρ = 0.9, ϵ = 1e-8; centred = false)\nRMSProp(; [eta, rho, 
epsilon, centred])\n\nOptimizer using the RMSProp algorithm. Often a good choice for recurrent networks. Parameters other than learning rate generally don't need tuning.\n\nCentred RMSProp is a variant which normalises gradients by an estimate of their variance, instead of their second moment.\n\nParameters\n\nLearning rate (η == eta): Amount by which gradients are discounted before updating the weights.\nMomentum (ρ == rho): Controls the acceleration of gradient descent in the prominent direction, in effect dampening oscillations.\nMachine epsilon (ϵ == epsilon): Constant to prevent division by zero (no need to change default)\nKeyword centred (or centered): Indicates whether to use centred variant of the algorithm.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.Adam","page":"Optimisation Rules","title":"Optimisers.Adam","text":"Adam(η = 0.001, β = (0.9, 0.999), ϵ = 1e-8)\n\nAdam optimiser.\n\nParameters\n\nLearning rate (η): Amount by which gradients are discounted before updating the weights.\nDecay of momentums (β::Tuple): Exponential decay for the first (β1) and the second (β2) momentum estimate.\nMachine epsilon (ϵ): Constant to prevent division by zero (no need to change default)\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.RAdam","page":"Optimisation Rules","title":"Optimisers.RAdam","text":"RAdam(η = 0.001, β = (0.9, 0.999), ϵ = 1e-8)\n\nRectified Adam optimizer.\n\nParameters\n\nLearning rate (η): Amount by which gradients are discounted before updating the weights.\nDecay of momentums (β::Tuple): Exponential decay for the first (β1) and the second (β2) momentum estimate.\nMachine epsilon (ϵ): Constant to prevent division by zero (no need to change default)\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.AdaMax","page":"Optimisation Rules","title":"Optimisers.AdaMax","text":"AdaMax(η = 0.001, β = (0.9, 0.999), ϵ = 1e-8)\n\nAdaMax is a 
variant of Adam based on the ∞-norm.\n\nParameters\n\nLearning rate (η): Amount by which gradients are discounted before updating the weights.\nDecay of momentums (β::Tuple): Exponential decay for the first (β1) and the second (β2) momentum estimate.\nMachine epsilon (ϵ): Constant to prevent division by zero (no need to change default)\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.AdaGrad","page":"Optimisation Rules","title":"Optimisers.AdaGrad","text":"AdaGrad(η = 0.1, ϵ = 1e-8)\n\nAdaGrad optimizer. It has parameter specific learning rates based on how frequently it is updated. Parameters don't need tuning.\n\nParameters\n\nLearning rate (η): Amount by which gradients are discounted before updating the weights.\nMachine epsilon (ϵ): Constant to prevent division by zero (no need to change default)\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.AdaDelta","page":"Optimisation Rules","title":"Optimisers.AdaDelta","text":"AdaDelta(ρ = 0.9, ϵ = 1e-8)\n\nAdaDelta is a version of AdaGrad adapting its learning rate based on a window of past gradient updates. Parameters don't need tuning.\n\nParameters\n\nRho (ρ): Factor by which the gradient is decayed at each time step.\nMachine epsilon (ϵ): Constant to prevent division by zero (no need to change default)\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.AMSGrad","page":"Optimisation Rules","title":"Optimisers.AMSGrad","text":"AMSGrad(η = 0.001, β = (0.9, 0.999), ϵ = 1e-8)\n\nThe AMSGrad version of the Adam optimiser. 
Parameters don't need tuning.\n\nParameters\n\nLearning rate (η): Amount by which gradients are discounted before updating the weights.\nDecay of momentums (β::Tuple): Exponential decay for the first (β1) and the second (β2) momentum estimate.\nMachine epsilon (ϵ): Constant to prevent division by zero (no need to change default)\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.NAdam","page":"Optimisation Rules","title":"Optimisers.NAdam","text":"NAdam(η = 0.001, β = (0.9, 0.999), ϵ = 1e-8)\n\nNAdam is a Nesterov variant of Adam. Parameters don't need tuning.\n\nParameters\n\nLearning rate (η): Amount by which gradients are discounted before updating the weights.\nDecay of momentums (β::Tuple): Exponential decay for the first (β1) and the second (β2) momentum estimate.\nMachine epsilon (ϵ): Constant to prevent division by zero (no need to change default)\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.AdamW","page":"Optimisation Rules","title":"Optimisers.AdamW","text":"AdamW(η = 0.001, β = (0.9, 0.999), λ = 0, ϵ = 1e-8)\nAdamW(; [eta, beta, lambda, epsilon])\n\nAdamW is a variant of Adam fixing (as in repairing) its weight decay regularization. 
Implemented as an OptimiserChain of Adam and WeightDecay.\n\nParameters\n\nLearning rate (η == eta): Amount by which gradients are discounted before updating the weights.\nDecay of momentums (β::Tuple == beta): Exponential decay for the first (β1) and the second (β2) momentum estimate.\nWeight decay (λ == lambda): Controls the strength of L_2 regularisation.\nMachine epsilon (ϵ == epsilon): Constant to prevent division by zero (no need to change default)\n\n\n\n\n\n","category":"function"},{"location":"reference/training/optimisers/#Optimisers.OAdam","page":"Optimisation Rules","title":"Optimisers.OAdam","text":"OAdam(η = 0.001, β = (0.5, 0.9), ϵ = 1e-8)\n\nOAdam (Optimistic Adam) is a variant of Adam adding an \"optimistic\" term suitable for adversarial training.\n\nParameters\n\nLearning rate (η): Amount by which gradients are discounted before updating the weights.\nDecay of momentums (β::Tuple): Exponential decay for the first (β1) and the second (β2) momentum estimate.\nMachine epsilon (ϵ): Constant to prevent division by zero (no need to change default)\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.AdaBelief","page":"Optimisation Rules","title":"Optimisers.AdaBelief","text":"AdaBelief(η = 0.001, β = (0.9, 0.999), ϵ = 1e-16)\n\nThe AdaBelief optimiser is a variant of the well-known Adam optimiser.\n\nParameters\n\nLearning rate (η): Amount by which gradients are discounted before updating the weights.\nDecay of momentums (β::Tuple): Exponential decay for the first (β1) and the second (β2) momentum estimate.\nMachine epsilon (ϵ::Float32): Constant to prevent division by zero (no need to change default)\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.Lion","page":"Optimisation Rules","title":"Optimisers.Lion","text":"Lion(η = 0.001, β = (0.9, 0.999))\n\nLion optimiser.\n\nParameters\n\nLearning rate (η): Magnitude by which gradients are updating the weights.\nDecay of momentums 
(β::Tuple): Exponential decay for the first (β1) and the second (β2) momentum estimate.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Composing-Optimisers","page":"Optimisation Rules","title":"Composing Optimisers","text":"","category":"section"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"Flux (through Optimisers.jl) defines a special kind of optimiser called OptimiserChain which takes in arbitrary optimisers as input. Its behaviour is similar to the usual optimisers, but differs in that it acts by calling the optimisers listed in it sequentially. Each optimiser produces a modified gradient that will be fed into the next, and the resultant update will be applied to the parameter as usual. A classic use case is where adding decays is desirable. Optimisers.jl defines the basic decay corresponding to an L_2 regularization in the loss as WeightDecay.","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"opt = OptimiserChain(WeightDecay(1e-4), Descent())","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"Here we apply the weight decay to the Descent optimiser. 
The resulting optimiser opt can be used like any other optimiser.","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"w = [randn(10, 10), randn(10, 10)]\nopt_state = Flux.setup(opt, w)\n\nloss(w, x) = Flux.mse(w[1] * x, w[2] * x)\n\nloss(w, rand(10)) # around 0.9\n\nfor t = 1:10^5\n g = gradient(w -> loss(w, rand(10)), w)[1]\n Flux.update!(opt_state, w, g)\nend\n\nloss(w, rand(10)) # around 0.9","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"It is possible to compose optimisers for some added flexibility.","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"Optimisers.OptimiserChain","category":"page"},{"location":"reference/training/optimisers/#Optimisers.OptimiserChain","page":"Optimisation Rules","title":"Optimisers.OptimiserChain","text":"OptimiserChain(opts...)\n\nCompose a sequence of optimisers so that each opt in opts updates the gradient, in the order specified.\n\nWith an empty sequence, OptimiserChain() is the identity, so update! will subtract the full gradient from the parameters. This is equivalent to Descent(1).\n\nExample\n\njulia> o = OptimiserChain(ClipGrad(1.0), Descent(0.1));\n\njulia> m = (zeros(3),);\n\njulia> s = Optimisers.setup(o, m)\n(Leaf(OptimiserChain(ClipGrad(1.0), Descent(0.1)), (nothing, nothing)),)\n\njulia> Optimisers.update(s, m, ([0.3, 1, 7],))[2] # clips before discounting\n([-0.03, -0.1, -0.1],)\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Scheduling-Optimisers","page":"Optimisation Rules","title":"Scheduling Optimisers","text":"","category":"section"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"In practice, it is fairly common to schedule the learning rate of an optimiser to obtain faster convergence. 
There are a variety of popular scheduling policies, and you can find implementations of them in ParameterSchedulers.jl. The documentation for ParameterSchedulers.jl provides a more detailed overview of the different scheduling policies, and how to use them with Flux optimisers. Below, we provide a brief snippet illustrating a cosine annealing schedule with a momentum optimiser.","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"First, we import ParameterSchedulers.jl and initialize a cosine annealing schedule to vary the learning rate between 1e-4 and 1e-2 every 10 steps. We also create a new Momentum optimiser.","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"using ParameterSchedulers\n\nopt = Momentum()\nschedule = Cos(λ0 = 1e-4, λ1 = 1e-2, period = 10)\nfor (eta, epoch) in zip(schedule, 1:100)\n opt.eta = eta\n # your training code here\nend","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"schedule can also be indexed (e.g. schedule(100)) or iterated like any iterator in Julia.","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"ParameterSchedulers.jl schedules are stateless (they don't store their iteration state). 
If you want a stateful schedule, you can use ParameterSchedulers.Stateful:","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"using ParameterSchedulers: Stateful, next!\n\nschedule = Stateful(Cos(λ0 = 1e-4, λ1 = 1e-2, period = 10))\nfor epoch in 1:100\n opt.eta = next!(schedule)\n # your training code here\nend","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"ParameterSchedulers.jl allows for many more scheduling policies including arbitrary functions, looping any function with a given period, or sequences of many schedules. See the ParameterSchedulers.jl documentation for more info.","category":"page"},{"location":"reference/training/optimisers/#Decays","page":"Optimisation Rules","title":"Decays","text":"","category":"section"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"Similar to optimisers, Flux also defines some simple decays that can be used in conjunction with other optimisers, or standalone.","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"Optimisers.SignDecay\nOptimisers.WeightDecay","category":"page"},{"location":"reference/training/optimisers/#Optimisers.SignDecay","page":"Optimisation Rules","title":"Optimisers.SignDecay","text":"SignDecay(λ = 1e-3)\n\nImplements L_1 regularisation, also known as LASSO regression, when composed with other rules as the first transformation in an OptimiserChain.\n\nIt does this by adding λ .* sign(x) to the gradient. This is equivalent to adding λ * sum(abs, x) == λ * norm(x, 1) to the loss.\n\nSee also [WeightDecay] for L_2 normalisation. 
They can be used together: OptimiserChain(SignDecay(0.012), WeightDecay(0.034), Adam()) is equivalent to adding 0.012 * norm(x, 1) + 0.017 * norm(x, 2)^2 to the loss function.\n\nParameters\n\nPenalty (λ ≥ 0): Controls the strength of the regularisation.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.WeightDecay","page":"Optimisation Rules","title":"Optimisers.WeightDecay","text":"WeightDecay(λ = 5e-4)\n\nImplements L_2 regularisation, also known as ridge regression, when composed with other rules as the first transformation in an OptimiserChain.\n\nIt does this by adding λ .* x to the gradient. This is equivalent to adding λ/2 * sum(abs2, x) == λ/2 * norm(x)^2 to the loss.\n\nSee also [SignDecay] for L_1 normalisation.\n\nParameters\n\nPenalty (λ ≥ 0): Controls the strength of the regularisation.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Gradient-Clipping","page":"Optimisation Rules","title":"Gradient Clipping","text":"","category":"section"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"Gradient clipping is useful for training recurrent neural networks, which have a tendency to suffer from the exploding gradient problem. 
An example usage is","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"opt = OptimiserChain(ClipGrad(1e-3), Adam(1e-3))","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"Optimisers.ClipGrad\nOptimisers.ClipNorm","category":"page"},{"location":"reference/training/optimisers/#Optimisers.ClipGrad","page":"Optimisation Rules","title":"Optimisers.ClipGrad","text":"ClipGrad(δ = 10)\n\nRestricts every gradient component to obey -δ ≤ dx[i] ≤ δ.\n\nTypically composed with other rules using OptimiserChain.\n\nSee also ClipNorm.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.ClipNorm","page":"Optimisation Rules","title":"Optimisers.ClipNorm","text":"ClipNorm(ω = 10, p = 2; throw = true)\n\nScales any gradient array for which norm(dx, p) > ω to stay at this threshold (unless p==0).\n\nThrows an error if the norm is infinite or NaN, which you can turn off with throw = false.\n\nTypically composed with other rules using OptimiserChain.\n\nSee also ClipGrad.\n\n\n\n\n\n","category":"type"},{"location":"guide/gpu/#GPU-Support","page":"GPU Support","title":"GPU Support","text":"","category":"section"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Starting with v0.14, Flux doesn't force a specific GPU backend and the corresponding package dependencies on the users. Thanks to the package extension mechanism introduced in Julia v1.9, Flux conditionally loads GPU specific code once a GPU package is made available (e.g. through using CUDA).","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"NVIDIA GPU support requires the packages CUDA.jl and cuDNN.jl to be installed in the environment. In the Julia REPL, type ] add CUDA, cuDNN to install them. 
For more details see the CUDA.jl readme.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"AMD GPU support has been available since Julia 1.9 on systems with ROCm and MIOpen installed. For more details refer to the AMDGPU.jl repository.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Metal GPU acceleration is available on Apple Silicon hardware. For more details refer to the Metal.jl repository. Metal support in Flux is experimental and many features are not yet available.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"In order to trigger GPU support in Flux, you need to call using CUDA, using AMDGPU or using Metal in your code. Note that for CUDA, explicitly loading cuDNN as well is not required, but the package has to be installed in the environment. ","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"compat: Flux ≤ 0.13\nOld versions of Flux automatically installed CUDA.jl to provide GPU support. Starting from Flux v0.14, CUDA.jl is not a dependency anymore and has to be installed manually.","category":"page"},{"location":"guide/gpu/#Checking-GPU-Availability","page":"GPU Support","title":"Checking GPU Availability","text":"","category":"section"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"By default, Flux will run checks on your system to see if it can support GPU functionality. 
You can check if Flux identified a valid GPU setup by typing the following:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> using CUDA\n\njulia> CUDA.functional()\ntrue","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"For AMD GPU:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> using AMDGPU\n\njulia> AMDGPU.functional()\ntrue\n\njulia> AMDGPU.functional(:MIOpen)\ntrue","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"For Metal GPU:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> using Metal\n\njulia> Metal.functional()\ntrue","category":"page"},{"location":"guide/gpu/#Selecting-GPU-backend","page":"GPU Support","title":"Selecting GPU backend","text":"","category":"section"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Available GPU backends are: CUDA, AMDGPU and Metal.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Flux relies on Preferences.jl to select the default GPU backend.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"There are two ways you can specify it:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"From the REPL/code in your project, call Flux.gpu_backend!(\"AMDGPU\") and restart (if needed) the Julia session for the changes to take effect.\nIn the LocalPreferences.toml file in your project directory, specify:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"[Flux]\ngpu_backend = \"AMDGPU\"","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"The current GPU backend can be fetched from the Flux.GPU_BACKEND 
variable:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> Flux.GPU_BACKEND\n\"CUDA\"","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"The current backend affects the behaviour of functions such as gpu, described below.","category":"page"},{"location":"guide/gpu/#Basic-GPU-Usage","page":"GPU Support","title":"Basic GPU Usage","text":"","category":"section"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Support for array operations on other hardware backends, like GPUs, is provided by external packages like CUDA.jl, AMDGPU.jl, and Metal.jl. Flux is agnostic to array types, so we simply need to move model weights and data to the GPU and Flux will handle it.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"For example, we can use CUDA.CuArray (with the cu converter) to run our basic example on an NVIDIA GPU.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"(Note that you need to have CUDA available to use CUDA.CuArray – please see the CUDA.jl instructions for more details.)","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"using CUDA\n\nW = cu(rand(2, 5)) # a 2×5 CuArray\nb = cu(rand(2))\n\npredict(x) = W*x .+ b\nloss(x, y) = sum((predict(x) .- y).^2)\n\nx, y = cu(rand(5)), cu(rand(2)) # Dummy data\nloss(x, y) # ~ 3","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Note that we convert both the parameters (W, b) and the data set (x, y) to CUDA arrays. Taking derivatives and training work exactly as before.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"If you define a structured model, like a Dense layer or Chain, you just need to convert the internal parameters. 
Flux provides fmap, which allows you to alter all parameters of a model at once.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"d = Dense(10 => 5, σ)\nd = fmap(cu, d)\nd.weight # CuArray\nd(cu(rand(10))) # CuArray output\n\nm = Chain(Dense(10 => 5, σ), Dense(5 => 2), softmax)\nm = fmap(cu, m)\nm(cu(rand(10)))","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"As a convenience, Flux provides the gpu function to convert models and data to the GPU if one is available. By default, it'll do nothing. So, you can safely call gpu on some data or model (as shown below), and the code will not error, regardless of whether the GPU is available or not. If a GPU library (e.g. CUDA) loads successfully, gpu will move data from the CPU to the GPU. As is shown below, this will change the type of something like a regular array to a CuArray.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> using Flux, CUDA\n\njulia> m = Dense(10, 5) |> gpu\nDense(10 => 5) # 55 parameters\n\njulia> x = rand(10) |> gpu\n10-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:\n 0.066846445\n ⋮\n 0.76706964\n\njulia> m(x)\n5-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:\n -0.99992573\n ⋮\n -0.547261","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"The analogue cpu is also available for moving models and data back off of the GPU.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> x = rand(10) |> gpu\n10-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:\n 0.8019236\n ⋮\n 0.7766742\n\njulia> x |> cpu\n10-element Vector{Float32}:\n 0.8019236\n ⋮\n 0.7766742","category":"page"},{"location":"guide/gpu/#Transferring-Training-Data","page":"GPU Support","title":"Transferring Training Data","text":"","category":"section"},{"location":"guide/gpu/","page":"GPU 
Support","title":"GPU Support","text":"In order to train the model using the GPU both model and the training data have to be transferred to GPU memory. Moving the data can be done in two different ways:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Iterating over the batches in a DataLoader object transferring each one of the training batches at a time to the GPU. This is recommended for large datasets. Done by hand, it might look like this:\ntrain_loader = Flux.DataLoader((X, Y), batchsize=64, shuffle=true)\n# ... model definition, optimiser setup\nfor epoch in 1:epochs\n for (x_cpu, y_cpu) in train_loader\n x = gpu(x_cpu)\n y = gpu(y_cpu)\n grads = gradient(m -> loss(m, x, y), model)\n Flux.update!(opt_state, model, grads[1])\n end\nend\nRather than write this out every time, you can just call gpu(::DataLoader):\ngpu_train_loader = Flux.DataLoader((X, Y), batchsize=64, shuffle=true) |> gpu\n# ... model definition, optimiser setup\nfor epoch in 1:epochs\n for (x, y) in gpu_train_loader\n grads = gradient(m -> loss(m, x, y), model)\n Flux.update!(opt_state, model, grads[1])\n end\nend\nThis is equivalent to DataLoader(MLUtils.mapobs(gpu, (X, Y)); keywords...). Something similar can also be done with CUDA.CuIterator, gpu_train_loader = CUDA.CuIterator(train_loader). However, this only works with a limited number of data types: first(train_loader) should be a tuple (or NamedTuple) of arrays.\nTransferring all training data to the GPU at once before creating the DataLoader. 
This is usually performed for smaller datasets which are sure to fit in the available GPU memory.\ngpu_train_loader = Flux.DataLoader((X, Y) |> gpu, batchsize = 32)\n# ...\nfor epoch in 1:epochs\n for (x, y) in gpu_train_loader\n # ...\nHere (X, Y) |> gpu applies gpu to both arrays, as it recurses into structures.","category":"page"},{"location":"guide/gpu/#Saving-GPU-Trained-Models","page":"GPU Support","title":"Saving GPU-Trained Models","text":"","category":"section"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"After the training process is done, one must always transfer the trained model back to the cpu memory scope before serializing or saving to disk. This can be done, as described in the previous section, with:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"model = cpu(model) # or model = model |> cpu","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"and then","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"using BSON\n# ...\nBSON.@save \"./path/to/trained_model.bson\" model\n\n# in this approach the cpu-transferred model (referenced by the variable `model`)\n# only exists inside the `let` statement\nlet model = cpu(model)\n # ...\n BSON.@save \"./path/to/trained_model.bson\" model\nend\n\n# is equivalent to the above, but uses `key=value` storing directive from BSON.jl\nBSON.@save \"./path/to/trained_model.bson\" model = cpu(model)","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"The reason behind this is that models trained in the GPU but not transferred to the CPU memory scope will expect CuArrays as input. 
In other words, Flux models expect input data to come from the same kind of device on which they were trained.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"In controlled scenarios in which the data fed to the loaded models is guaranteed to be on the GPU, there is no need to transfer them back to CPU memory. However, in production environments, where artifacts are shared among different processes, machines or configurations, there is no guarantee that the CUDA.jl package will be available to the process performing inference on the model loaded from disk.","category":"page"},{"location":"guide/gpu/#Disabling-CUDA-or-choosing-which-GPUs-are-visible-to-Flux","page":"GPU Support","title":"Disabling CUDA or choosing which GPUs are visible to Flux","text":"","category":"section"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Sometimes it is necessary to control which GPUs are visible to Julia on a system with multiple GPUs, or to disable GPUs entirely. 
This can be achieved with an environment variable CUDA_VISIBLE_DEVICES.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"To disable all devices:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"$ export CUDA_VISIBLE_DEVICES='-1'","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"To select specific devices by device id:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"$ export CUDA_VISIBLE_DEVICES='0,1'","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"More information for conditional use of GPUs in CUDA.jl can be found in its documentation, and information about the specific use of the variable is described in the Nvidia CUDA blog post.","category":"page"},{"location":"guide/gpu/#Using-device-objects","page":"GPU Support","title":"Using device objects","text":"","category":"section"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"As a more convenient syntax, Flux allows the usage of GPU device objects which can be used to easily transfer models to GPUs (and defaulting to using the CPU if no GPU backend is available). This syntax has a few advantages including automatic selection of the GPU backend and type stability of data movement. To do this, the Flux.get_device function can be used.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Flux.get_device first checks for a GPU preference, and if possible returns a device for the preference backend. 
For instance, consider the following example, where we load the CUDA.jl package to use an NVIDIA GPU (\"CUDA\" is the default preference):","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> using Flux, CUDA;\n\njulia> device = Flux.get_device(; verbose=true) # returns handle to an NVIDIA GPU\n[ Info: Using backend set in preferences: CUDA.\n(::Flux.FluxCUDADevice) (generic function with 1 method)\n\njulia> device.deviceID # check the id of the GPU\nCuDevice(0): NVIDIA GeForce GTX 1650\n\njulia> model = Dense(2 => 3);\n\njulia> model.weight # the model initially lives in CPU memory\n3×2 Matrix{Float32}:\n -0.984794 -0.904345\n 0.720379 -0.486398\n 0.851011 -0.586942\n\njulia> model = model |> device # transfer model to the GPU\nDense(2 => 3) # 9 parameters\n\njulia> model.weight\n3×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:\n -0.984794 -0.904345\n 0.720379 -0.486398\n 0.851011 -0.586942\n","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"The device preference can also be set via the Flux.gpu_backend! function. 
For instance, below we first set our device preference to \"CPU\":","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> using Flux; Flux.gpu_backend!(\"CPU\")\n┌ Info: New GPU backend set: CPU.\n└ Restart your Julia session for this change to take effect!","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Then, after restarting the Julia session, Flux.get_device returns a handle to the \"CPU\":","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> using Flux, CUDA; # even if CUDA is loaded, we'll still get a CPU device\n\njulia> device = Flux.get_device(; verbose=true) # get a CPU device\n[ Info: Using backend set in preferences: CPU.\n(::Flux.FluxCPUDevice) (generic function with 1 method)\n\njulia> model = Dense(2 => 3);\n\njulia> model = model |> device\nDense(2 => 3) # 9 parameters\n\njulia> model.weight # no change; model still lives on CPU\n3×2 Matrix{Float32}:\n -0.942968 0.856258\n 0.440009 0.714106\n -0.419192 -0.471838","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Clearly, this means that the same code will work for any GPU backend and the CPU. ","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"If the preference backend isn't available or isn't functional, then Flux.get_device looks for a CUDA, AMDGPU or Metal backend, and returns a corresponding device (if the backend is available and functional). Otherwise, a CPU device is returned. 
In the below example, the GPU preference is \"CUDA\":","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> using Flux; # preference is CUDA, but CUDA.jl not loaded\n\njulia> device = Flux.get_device(; verbose=true) # this will resort to automatic device selection\n[ Info: Using backend set in preferences: CUDA.\n┌ Warning: Trying to use backend: CUDA but it's trigger package is not loaded.\n│ Please load the package and call this function again to respect the preferences backend.\n└ @ Flux ~/fluxml/Flux.jl/src/functor.jl:637\n[ Info: Using backend: CPU.\n(::Flux.FluxCPUDevice) (generic function with 1 method)","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"For detailed information about how the backend is selected, check the documentation for Flux.get_device.","category":"page"},{"location":"guide/gpu/#Data-movement-across-GPU-devices","page":"GPU Support","title":"Data movement across GPU devices","text":"","category":"section"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Flux also supports getting handles to specific GPU devices, and transferring models from one GPU device to another GPU device from the same backend. Let's try it out for NVIDIA GPUs. First, we list all the available devices:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> using Flux, CUDA;\n\njulia> CUDA.devices()\nCUDA.DeviceIterator() for 3 devices:\n0. GeForce RTX 2080 Ti\n1. GeForce RTX 2080 Ti\n2. 
TITAN X (Pascal)\n","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Then, let's select the device with id 0:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> device0 = Flux.get_device(\"CUDA\", 0) # the currently supported values for backend are \"CUDA\" and \"AMDGPU\"\n(::Flux.FluxCUDADevice) (generic function with 1 method)\n","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Then, let's move a simple dense layer to the GPU represented by device0:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> dense_model = Dense(2 => 3)\nDense(2 => 3) # 9 parameters\n\njulia> dense_model = dense_model |> device0;\n\njulia> dense_model.weight\n3×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:\n 0.695662 0.816299\n -0.204763 -0.10232\n -0.955829 0.538412\n\njulia> CUDA.device(dense_model.weight) # check the GPU to which dense_model is attached\nCuDevice(0): GeForce RTX 2080 Ti\n","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Next, we'll get a handle to the device with id 1, and move dense_model to that device:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> device1 = Flux.get_device(\"CUDA\", 1)\n(::Flux.FluxCUDADevice) (generic function with 1 method)\n\njulia> dense_model = dense_model |> device1; # don't directly print the model; see warning below\n\njulia> CUDA.device(dense_model.weight)\nCuDevice(1): GeForce RTX 2080 Ti\n","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Due to a limitation in Metal.jl, currently this kind of data movement across devices is only supported for CUDA and AMDGPU backends.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"warning: Printing models after moving 
to a different device\nDue to a limitation in how GPU packages currently work, printing models on the REPL after moving them to a GPU device which is different from the current device will lead to an error.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Flux.AbstractDevice\nFlux.FluxCPUDevice\nFlux.FluxCUDADevice\nFlux.FluxAMDGPUDevice\nFlux.FluxMetalDevice\nFlux.supported_devices\nFlux.get_device\nFlux.gpu_backend!","category":"page"},{"location":"guide/gpu/#Flux.AbstractDevice","page":"GPU Support","title":"Flux.AbstractDevice","text":"Flux.AbstractDevice <: Function\n\nAn abstract type representing device objects for different GPU backends. The currently supported backends are \"CUDA\", \"AMDGPU\", \"Metal\" and \"CPU\"; the \"CPU\" backend is the fallback case when no GPU is available. GPU extensions of Flux define subtypes of this type.\n\n\n\n\n\n","category":"type"},{"location":"guide/gpu/#Flux.FluxCPUDevice","page":"GPU Support","title":"Flux.FluxCPUDevice","text":"Flux.FluxCPUDevice <: Flux.AbstractDevice\n\nA type representing device objects for the \"CPU\" backend for Flux. 
This is the fallback case when no GPU is available to Flux.\n\n\n\n\n\n","category":"type"},{"location":"guide/gpu/#Flux.FluxCUDADevice","page":"GPU Support","title":"Flux.FluxCUDADevice","text":"FluxCUDADevice <: AbstractDevice\n\nA type representing device objects for the \"CUDA\" backend for Flux.\n\n\n\n\n\n","category":"type"},{"location":"guide/gpu/#Flux.FluxAMDGPUDevice","page":"GPU Support","title":"Flux.FluxAMDGPUDevice","text":"FluxAMDGPUDevice <: AbstractDevice\n\nA type representing device objects for the \"AMDGPU\" backend for Flux.\n\n\n\n\n\n","category":"type"},{"location":"guide/gpu/#Flux.FluxMetalDevice","page":"GPU Support","title":"Flux.FluxMetalDevice","text":"FluxMetalDevice <: AbstractDevice\n\nA type representing device objects for the \"Metal\" backend for Flux.\n\n\n\n\n\n","category":"type"},{"location":"guide/gpu/#Flux.supported_devices","page":"GPU Support","title":"Flux.supported_devices","text":"Flux.supported_devices()\n\nGet all supported backends for Flux, in order of preference.\n\nExample\n\njulia> using Flux;\n\njulia> Flux.supported_devices()\n(\"CUDA\", \"AMDGPU\", \"Metal\", \"CPU\")\n\n\n\n\n\n","category":"function"},{"location":"guide/gpu/#Flux.get_device","page":"GPU Support","title":"Flux.get_device","text":"Flux.get_device(; verbose=false)::Flux.AbstractDevice\n\nReturns a device object for the most appropriate backend for the current Julia session. \n\nFirst, the function checks whether a backend preference has been set via the Flux.gpu_backend! function. If so, an attempt is made to load this backend. If the corresponding trigger package has been loaded and the backend is functional, a device corresponding to the given backend is loaded. Otherwise, the backend is chosen automatically. 
To update the backend preference, use Flux.gpu_backend!.\n\nIf there is no preference, then for each of the \"CUDA\", \"AMDGPU\", \"Metal\" and \"CPU\" backends in the given order, this function checks whether the given backend has been loaded via the corresponding trigger package, and whether the backend is functional. If so, the device corresponding to the backend is returned. If no GPU backend is available, a Flux.FluxCPUDevice is returned.\n\nIf verbose is set to true, then the function prints informative log messages.\n\nExamples\n\nFor the example given below, the backend preference was set to \"AMDGPU\" via the gpu_backend! function.\n\njulia> using Flux;\n\njulia> model = Dense(2 => 3)\nDense(2 => 3) # 9 parameters\n\njulia> device = Flux.get_device(; verbose=true) # this will just load the CPU device\n[ Info: Using backend set in preferences: AMDGPU.\n┌ Warning: Trying to use backend: AMDGPU but it's trigger package is not loaded.\n│ Please load the package and call this function again to respect the preferences backend.\n└ @ Flux ~/fluxml/Flux.jl/src/functor.jl:638\n[ Info: Using backend: CPU.\n(::Flux.FluxCPUDevice) (generic function with 1 method)\n\njulia> model = model |> device\nDense(2 => 3) # 9 parameters\n\njulia> model.weight\n3×2 Matrix{Float32}:\n -0.304362 -0.700477\n -0.861201 0.67825\n -0.176017 0.234188\n\nHere is the same example, but using \"CUDA\":\n\njulia> using Flux, CUDA;\n\njulia> model = Dense(2 => 3)\nDense(2 => 3) # 9 parameters\n\njulia> device = Flux.get_device(; verbose=true)\n[ Info: Using backend set in preferences: AMDGPU.\n┌ Warning: Trying to use backend: AMDGPU but it's trigger package is not loaded.\n│ Please load the package and call this function again to respect the preferences backend.\n└ @ Flux ~/fluxml/Flux.jl/src/functor.jl:637\n[ Info: Using backend: CUDA.\n(::Flux.FluxCUDADevice) (generic function with 1 method)\n\njulia> model = model |> device\nDense(2 => 3) # 9 parameters\n\njulia> model.weight\n3×2 
CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:\n 0.820013 0.527131\n -0.915589 0.549048\n 0.290744 -0.0592499\n\n\n\n\n\nFlux.get_device(backend::String, idx::Int = 0)::Flux.AbstractDevice\n\nGet a device object for a backend specified by the string backend and idx. The currently supported values of backend are \"CUDA\", \"AMDGPU\" and \"CPU\". idx must be an integer value between 0 and the number of available devices.\n\nExamples\n\njulia> using Flux, CUDA;\n\njulia> CUDA.devices()\nCUDA.DeviceIterator() for 3 devices:\n0. GeForce RTX 2080 Ti\n1. GeForce RTX 2080 Ti\n2. TITAN X (Pascal)\n\njulia> device0 = Flux.get_device(\"CUDA\", 0)\n(::Flux.FluxCUDADevice) (generic function with 1 method)\n\njulia> device0.deviceID\nCuDevice(0): GeForce RTX 2080 Ti\n\njulia> device1 = Flux.get_device(\"CUDA\", 1)\n(::Flux.FluxCUDADevice) (generic function with 1 method)\n\njulia> device1.deviceID\nCuDevice(1): GeForce RTX 2080 Ti\n\njulia> cpu_device = Flux.get_device(\"CPU\")\n(::Flux.FluxCPUDevice) (generic function with 1 method)\n\n\n\n\n\n\n","category":"function"},{"location":"guide/gpu/#Flux.gpu_backend!","page":"GPU Support","title":"Flux.gpu_backend!","text":"gpu_backend!(backend::String)\n\nSet the GPU backend to backend in the LocalPreferences.toml file in your project directory. After restarting Julia, the new backend will affect all subsequent calls to gpu and get_device.\n\nThe supported backends are \"CUDA\", \"AMDGPU\", \"Metal\" and \"CPU\".\n\n\n\n\n\n","category":"function"},{"location":"guide/gpu/#Distributed-data-parallel-training","page":"GPU Support","title":"Distributed data parallel training","text":"","category":"section"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Flux now supports distributed data parallel training with the DistributedUtils module. 
If you want to run your code on multiple GPUs, you have to install MPI.jl (see docs for more info).","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> using MPI\n\njulia> MPI.install_mpiexecjl()","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Now you can run your code with mpiexecjl --project=. -n julia .jl from the CLI.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"You can use either the MPIBackend or the NCCLBackend, the latter only if NCCL.jl is also loaded. First, initialize a backend with DistributedUtils.initialize, e.g.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> using Flux, MPI, NCCL\n\njulia> DistributedUtils.initialize(NCCLBackend)\n\njulia> backend = DistributedUtils.get_distributed_backend(NCCLBackend)\nNCCLBackend{Communicator, MPIBackend{MPI.Comm}}(Communicator(Ptr{NCCL.LibNCCL.ncclComm} @0x000000000607a660), MPIBackend{MPI.Comm}(MPI.Comm(1140850688)))","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Move your model, as well as any data, to the GPU device.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> model = Chain(Dense(1 => 256, tanh), Dense(256 => 1)) |> gpu\nChain(\n Dense(1 => 256, tanh), # 512 parameters\n Dense(256 => 1), # 257 parameters\n) # Total: 4 arrays, 769 parameters, 744 bytes.\n\njulia> x = rand(Float32, 1, 16) |> gpu\n1×16 CUDA.CuArray{Float32, 2, CUDA.DeviceMemory}:\n 0.239324 0.331029 0.924996 0.55593 0.853093 0.874513 0.810269 0.935858 0.477176 0.564591 0.678907 0.729682 0.96809 0.115833 0.66191 0.75822\n\njulia> y = x .^ 3\n1×16 CUDA.CuArray{Float32, 2, CUDA.DeviceMemory}:\n 0.0137076 0.0362744 0.791443 0.171815 0.620854 0.668804 0.53197 0.819654 0.108651 0.179971 0.312918 0.388508 0.907292 0.00155418 0.29 
0.435899","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"You can also use DistributedUtils.DistributedDataContainer to split the data uniformly across processes.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> data = DistributedUtils.DistributedDataContainer(backend, x)\nFlux.DistributedUtils.DistributedDataContainer(Float32[0.23932439 0.33102947 … 0.66191036 0.75822026], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16])","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"You have to wrap your model in DistributedUtils.FluxDistributedModel and synchronize it (broadcast across all processes):","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> model = DistributedUtils.synchronize!!(backend, DistributedUtils.FluxDistributedModel(model); root=0)\nChain(\n Dense(1 => 256, tanh), # 512 parameters\n\n Dense(256 => 1), # 257 parameters\n) # Total: 4 arrays, 769 parameters, 744 bytes.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Time to set up an optimizer by using DistributedUtils.DistributedOptimizer and synchronize it as well.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"using Optimisers\nopt = DistributedUtils.DistributedOptimizer(backend, Optimisers.Adam(0.001f0))\nst_opt = Optimisers.setup(opt, model)\nst_opt = DistributedUtils.synchronize!!(backend, st_opt; root=0) ","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Now you can define loss and train the model.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"for epoch in 1:100\n global model, st_opt\n l, grad = Zygote.withgradient(loss, model)\n println(\"Epoch $epoch: Loss $l\")\n st_opt, model = Optimisers.update(st_opt, 
model, grad[1])\nend","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Remember that in order to run it on multiple GPUs you have to run from CLI mpiexecjl --project=. -n julia .jl, where is the number of processes that you want to use. The number of processes usually corresponds to the number of GPUs.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"By default, the MPI installation used by MPI.jl is CUDA-unaware, so if you want to run it in CUDA-aware mode, read more here on custom installation and rebuilding MPI.jl. Then test whether your MPI is CUDA-aware by running","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"import Pkg\nPkg.test(\"MPI\"; test_args=[\"--backend=CUDA\"])","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"If it is, set your local preference as below","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"using Preferences\nset_preferences!(\"Flux\", \"FluxDistributedMPICUDAAware\" => true)","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"warning: Known shortcomings\nWe don't run CUDA-aware tests, so you're running it at your own risk.","category":"page"},{"location":"reference/utilities/#man-init-funcs","page":"Weight Initialisation","title":"Random Weight Initialisation","text":"","category":"section"},{"location":"reference/utilities/","page":"Weight Initialisation","title":"Weight Initialisation","text":"Flux initialises convolutional layers and recurrent cells with glorot_uniform by default. Most layers accept a function as an init keyword, which replaces this default. 
For example:","category":"page"},{"location":"reference/utilities/","page":"Weight Initialisation","title":"Weight Initialisation","text":"julia> conv = Conv((3, 3), 3 => 2, relu; init=Flux.glorot_normal)\nConv((3, 3), 3 => 2, relu) # 56 parameters\n\njulia> conv.bias\n2-element Vector{Float32}:\n 0.0\n 0.0","category":"page"},{"location":"reference/utilities/","page":"Weight Initialisation","title":"Weight Initialisation","text":"Note that init creates the weight array, but not the bias vector.","category":"page"},{"location":"reference/utilities/","page":"Weight Initialisation","title":"Weight Initialisation","text":"Many of the initialisation functions accept keywords such as gain, and a random number generator. To make it easy to pass these to layers, there are methods which return a function:","category":"page"},{"location":"reference/utilities/","page":"Weight Initialisation","title":"Weight Initialisation","text":"julia> Dense(4 => 5, tanh; init=Flux.glorot_uniform(gain=2))\nDense(4 => 5, tanh) # 25 parameters\n\njulia> Dense(4 => 5, tanh; init=Flux.randn32(MersenneTwister(1)))\nDense(4 => 5, tanh) # 25 parameters","category":"page"},{"location":"reference/utilities/#Initialisation-functions","page":"Weight Initialisation","title":"Initialisation functions","text":"","category":"section"},{"location":"reference/utilities/","page":"Weight Initialisation","title":"Weight Initialisation","text":"Flux.glorot_uniform\nFlux.glorot_normal\nFlux.kaiming_uniform\nFlux.kaiming_normal\nFlux.truncated_normal\nFlux.orthogonal\nFlux.sparse_init\nFlux.identity_init\nFlux.ones32\nFlux.zeros32\nFlux.rand32\nFlux.randn32\nFlux.create_bias","category":"page"},{"location":"reference/utilities/#Flux.glorot_uniform","page":"Weight Initialisation","title":"Flux.glorot_uniform","text":"glorot_uniform([rng], size...; gain = 1) -> Array\nglorot_uniform([rng]; kw...) 
-> Function\n\nReturn an Array{Float32} of the given size containing random numbers drawn from a uniform distribution on the interval [-x, x], where x = gain * sqrt(6 / (fan_in + fan_out)).\n\nThis method is described in [1] and also known as Xavier initialization.\n\nExamples\n\njulia> Flux.glorot_uniform(3, 4) |> summary\n\"3×4 Matrix{Float32}\"\n\njulia> round.(extrema(Flux.glorot_uniform(10, 100)), digits=3)\n(-0.233f0, 0.233f0)\n\njulia> round.(extrema(Flux.glorot_uniform(100, 10)), digits=3)\n(-0.234f0, 0.233f0)\n\njulia> round.(extrema(Flux.glorot_uniform(100, 100)), digits=3)\n(-0.173f0, 0.173f0)\n\njulia> Dense(3 => 2, tanh; init = Flux.glorot_uniform(MersenneTwister(1)))\nDense(3 => 2, tanh) # 8 parameters\n\njulia> ans.bias\n2-element Vector{Float32}:\n 0.0\n 0.0\n\nReferences\n\n[1] Glorot, Xavier, and Yoshua Bengio. \"Understanding the difficulty of training deep feedforward neural networks.\" Proceedings of the thirteenth international conference on artificial intelligence and statistics. 2010.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.glorot_normal","page":"Weight Initialisation","title":"Flux.glorot_normal","text":"glorot_normal([rng], size...; gain = 1) -> Array\nglorot_normal([rng]; kw...) -> Function\n\nReturn an Array{Float32} of the given size containing random numbers drawn from a normal distribution with standard deviation gain * sqrt(2 / (fan_in + fan_out)), using nfan.\n\nThis method is described in [1] and also known as Xavier initialization.\n\nExamples\n\njulia> using Statistics\n\njulia> round(std(Flux.glorot_normal(10, 1000)), digits=3)\n0.044f0\n\njulia> round(std(Flux.glorot_normal(1000, 10)), digits=3)\n0.045f0\n\njulia> round(std(Flux.glorot_normal(1000, 1000)), digits=3)\n0.032f0\n\njulia> Dense(10 => 1000, tanh; init = Flux.glorot_normal(gain=100))\nDense(10 => 1000, tanh) # 11_000 parameters\n\njulia> round(std(ans.weight), sigdigits=3)\n4.45f0\n\nReferences\n\n[1] Glorot, Xavier, and Yoshua Bengio. 
\"Understanding the difficulty of training deep feedforward neural networks.\" Proceedings of the thirteenth international conference on artificial intelligence and statistics. 2010.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.kaiming_uniform","page":"Weight Initialisation","title":"Flux.kaiming_uniform","text":"kaiming_uniform([rng], size...; gain = √2) -> Array\nkaiming_uniform([rng]; kw...) -> Function\n\nReturn an Array{Float32} of the given size containing random numbers drawn from a uniform distribution on the interval [-x, x], where x = gain * sqrt(3/fan_in) using nfan.\n\nThis method is described in [1] and also known as He initialization.\n\nExamples\n\njulia> round.(extrema(Flux.kaiming_uniform(100, 10)), digits=3)\n(-0.774f0, 0.773f0)\n\njulia> round.(extrema(Flux.kaiming_uniform(10, 100)), digits=3)\n(-0.243f0, 0.245f0)\n\njulia> round.(extrema(Flux.kaiming_uniform(100, 100)), digits=3)\n(-0.245f0, 0.245f0)\n\nReferences\n\n[1] He, Kaiming, et al. \"Delving deep into rectifiers: Surpassing human-level performance on imagenet classification.\" Proceedings of the IEEE international conference on computer vision. 2015.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.kaiming_normal","page":"Weight Initialisation","title":"Flux.kaiming_normal","text":"kaiming_normal([rng], size...; gain = √2) -> Array\nkaiming_normal([rng]; kw...) -> Function\n\nReturn an Array{Float32} of the given size containing random numbers taken from a normal distribution standard deviation gain / sqrt(fan_in), using nfan.\n\nThis method is described in [1] and also known as He initialization.\n\nExamples\n\njulia> using Statistics\n\njulia> round(std(Flux.kaiming_normal(10, 1000)), digits=3)\n0.044f0\n\njulia> round(std(Flux.kaiming_normal(1000, 10)), digits=3)\n0.449f0\n\njulia> round(std(Flux.kaiming_normal(1000, 1000)), digits=3)\n0.045f0\n\nReferences\n\n[1] He, Kaiming, et al. 
\"Delving deep into rectifiers: Surpassing human-level performance on imagenet classification.\" Proceedings of the IEEE international conference on computer vision. 2015.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.truncated_normal","page":"Weight Initialisation","title":"Flux.truncated_normal","text":"truncated_normal([rng], size...; mean = 0, std = 1, lo = -2, hi = 2) -> Array\ntruncated_normal([rng]; kw...) -> Function\n\nReturn an Array{Float32} of the given size where each element is drawn from a truncated normal distribution. The numbers are distributed like filter(x -> lo<=x<=hi, mean .+ std .* randn(100)).\n\nThe values are generated by sampling a Uniform(0, 1) (rand()) and then applying the inverse CDF of the truncated normal distribution. This method works best when lo ≤ mean ≤ hi.\n\nExamples\n\njulia> using Statistics\n\njulia> Flux.truncated_normal(3, 4) |> summary\n\"3×4 Matrix{Float32}\"\n\njulia> round.(extrema(Flux.truncated_normal(10^6)); digits=3)\n(-2.0f0, 2.0f0)\n\njulia> round(std(Flux.truncated_normal(10^6; lo = -100, hi = 100)))\n1.0f0\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.orthogonal","page":"Weight Initialisation","title":"Flux.orthogonal","text":"orthogonal([rng], size...; gain = 1) -> Array\northogonal([rng]; kw...) -> Function\n\nReturn an Array{Float32} of the given size which is a (semi) orthogonal matrix, as described in [1].\n\nCannot construct a vector, i.e. length(size) == 1 is forbidden. 
For length(size) > 2, a prod(size[1:(end - 1)]) by size[end] orthogonal matrix is computed before reshaping it to the original dimensions.\n\nExamples\n\njulia> W = Flux.orthogonal(5, 7);\n\njulia> summary(W)\n\"5×7 Matrix{Float32}\"\n\njulia> W * W' ≈ I(5)\ntrue\n\njulia> W2 = Flux.orthogonal(7, 5);\n\njulia> W2 * W2' ≈ I(7)\nfalse\n\njulia> W2' * W2 ≈ I(5)\ntrue\n\njulia> W3 = Flux.orthogonal(3, 3, 2, 4);\n\njulia> transpose(reshape(W3, :, 4)) * reshape(W3, :, 4) ≈ I(4)\ntrue\n\nReferences\n\n[1] Saxe, McClelland, Ganguli. \"Exact solutions to the nonlinear dynamics of learning in deep linear neural networks\", ICLR 2014, https://arxiv.org/abs/1312.6120\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.sparse_init","page":"Weight Initialisation","title":"Flux.sparse_init","text":"sparse_init([rng], rows, cols; sparsity, std = 0.01) -> Array\nsparse_init([rng]; kw...) -> Function\n\nReturn a Matrix{Float32} of size rows, cols where each column contains a fixed fraction of zero elements given by sparsity. Non-zero elements are normally distributed with a mean of zero and standard deviation std.\n\nThis method is described in [1].\n\nExamples\n\njulia> count(iszero, Flux.sparse_init(10, 10, sparsity=1/5))\n20\n\njulia> sum(0 .== Flux.sparse_init(10, 11, sparsity=0.9), dims=1)\n1×11 Matrix{Int64}:\n 9 9 9 9 9 9 9 9 9 9 9\n\njulia> Dense(3 => 10, tanh; init=Flux.sparse_init(sparsity=0.5))\nDense(3 => 10, tanh) # 40 parameters\n\njulia> count(iszero, ans.weight, dims=1)\n1×3 Matrix{Int64}:\n 5 5 5\n\nReferences\n\n[1] Martens, J, \"Deep learning via Hessian-free optimization\" Proceedings of the 27th International Conference on International Conference on Machine Learning. 2010.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.identity_init","page":"Weight Initialisation","title":"Flux.identity_init","text":"identity_init(size...; gain=1, shift=0) -> Array\nidentity_init(; kw...) 
-> Function\n\nReturn an Array{Float32} of the given size which yields an identity mapping when used as parameters in most Flux layers. Use gain to scale the identity by a constant.\n\nOften useful in the context of transfer learning, i.e. when one wants to add more capacity to a model but start from the same mapping.\n\nHas the following behaviour\n\n1D: A Vector of zeros (useful for an identity bias)\n2D: An identity matrix (useful for an identity matrix multiplication)\nMore than 2D: A dense block array of center tap spatial filters (useful for an identity convolution)\n\nSome caveats: \n\nNot all layers will be identity mapping when used with this init. Exceptions include recurrent layers and normalization layers.\nLayers must have input_size == output_size for identity mapping to be possible. When this is not the case, extra dimensions of the array are padded with zeros.\nFor convolutional layers, in addition to the above, the kernel sizes must also be odd and padding must be applied so that output feature maps have the same size as input feature maps, e.g. by using SamePad.\n\nUse keyword shift (integer or tuple) to apply circular shift to the output, equivalent to Base.circshift(identity_init(size...), shift).\n\nFor consistency with other initialisers, it accepts rng::AbstractRNG as an optional first argument. 
But this is ignored, since the result is not random.\n\nExamples\n\njulia> Flux.identity_init(3,5)\n3×5 Matrix{Float32}:\n 1.0 0.0 0.0 0.0 0.0\n 0.0 1.0 0.0 0.0 0.0\n 0.0 0.0 1.0 0.0 0.0\n\njulia> Dense(5 => 3, relu, init=Flux.identity_init)([1,-2,3,-4,5])\n3-element Vector{Float32}:\n 1.0\n 0.0\n 3.0\n\njulia> Flux.identity_init(3,3,2; gain=100)\n3×3×2 Array{Float32, 3}:\n[:, :, 1] =\n 0.0 0.0 0.0\n 100.0 0.0 0.0\n 0.0 0.0 0.0\n\n[:, :, 2] =\n 0.0 0.0 0.0\n 0.0 100.0 0.0\n 0.0 0.0 0.0\n\njulia> x4 = cat([1 2 3; 4 5 6; 7 8 9]; dims=4);\n\njulia> Conv((2,2), 1 => 1, init=Flux.identity_init(gain=10), pad=SamePad())(x4)\n3×3×1×1 Array{Float32, 4}:\n[:, :, 1, 1] =\n 10.0 20.0 30.0\n 40.0 50.0 60.0\n 70.0 80.0 90.0\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.ones32","page":"Weight Initialisation","title":"Flux.ones32","text":"ones32(size...) = ones(Float32, size...)\n\nReturn an Array{Float32} of the given size filled with 1s.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.zeros32","page":"Weight Initialisation","title":"Flux.zeros32","text":"zeros32(size...) = zeros(Float32, size...)\n\nReturn an Array{Float32} of the given size filled with 0s.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.rand32","page":"Weight Initialisation","title":"Flux.rand32","text":"rand32([rng], size...)\n\nReturn an Array{Float32} of the given size, filled like rand. When the size is not provided, rand32(rng::AbstractRNG) returns a function.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.randn32","page":"Weight Initialisation","title":"Flux.randn32","text":"randn32([rng], size...)\n\nReturn an Array{Float32} of the given size, filled like randn. 
When the size is not provided, randn32(rng::AbstractRNG) returns a function.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.create_bias","page":"Weight Initialisation","title":"Flux.create_bias","text":"create_bias(weights, bias, size...)\n\nReturn a bias parameter for a layer, based on the value given to the constructor's keyword bias=bias.\n\nbias == true creates a trainable array of the given size, of the same type as weights, initialised to zero.\nbias == false returns false, which is understood by AD to be non-differentiable.\nbias::AbstractArray uses the array provided, provided it has the correct size. It will also correct the eltype to match that of weights.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/","page":"Weight Initialisation","title":"Weight Initialisation","text":"These functions call:","category":"page"},{"location":"reference/utilities/","page":"Weight Initialisation","title":"Weight Initialisation","text":"Flux.rng_from_array\nFlux.nfan","category":"page"},{"location":"reference/utilities/#Flux.rng_from_array","page":"Weight Initialisation","title":"Flux.rng_from_array","text":"rng_from_array(x)\n\nCreate an instance of the RNG most appropriate for x. 
The current defaults are:\n\nx isa CuArray: CUDA.default_rng()\nx isa AbstractArray: Random.default_rng()\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.nfan","page":"Weight Initialisation","title":"Flux.nfan","text":"nfan(n_out, n_in=1) -> Tuple\nnfan(dims...)\nnfan(dims::Tuple)\n\nFor a layer characterized by dimensions dims, return a tuple (fan_in, fan_out), where fan_in is the number of input neurons connected to an output one, and fan_out is the number of output neurons connected to an input one.\n\nThis function is mainly used by weight initializers, e.g., kaiming_normal.\n\nExamples\n\njulia> layer = Dense(10, 20);\n\njulia> Flux.nfan(size(layer.weight))\n(10, 20)\n\njulia> layer = Conv((3, 3), 2=>10);\n\njulia> Flux.nfan(size(layer.weight))\n(18, 90)\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Changing-the-type-of-all-parameters","page":"Weight Initialisation","title":"Changing the type of all parameters","text":"","category":"section"},{"location":"reference/utilities/","page":"Weight Initialisation","title":"Weight Initialisation","text":"The default eltype for models is Float32 since models are often trained/run on GPUs. The eltype of model m can be changed to Float64 by f64(m):","category":"page"},{"location":"reference/utilities/","page":"Weight Initialisation","title":"Weight Initialisation","text":"Flux.f64\nFlux.f32\nFlux.f16","category":"page"},{"location":"reference/utilities/#Flux.f64","page":"Weight Initialisation","title":"Flux.f64","text":"f64(m)\n\nConverts the eltype of model's floating point parameters to Float64. Recurses into structs marked with @layer.\n\nSee also f32 and f16.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.f32","page":"Weight Initialisation","title":"Flux.f32","text":"f32(m)\n\nConverts the eltype of model's floating point parameters to Float32 (which is Flux's default). 
Recurses into structs marked with @layer.\n\nSee also f64 and f16.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.f16","page":"Weight Initialisation","title":"Flux.f16","text":"f16(m)\n\nConverts the eltype of model's floating point parameters to Float16. Recurses into structs marked with @layer.\n\nSupport for Float16 is limited on many CPUs. Julia may convert to Float32 for each operation, which is slow.\n\nSee also f32 and f64.\n\nExample\n\njulia> m = Chain(Dense(784, 2048, relu), Dense(2048, 10)) # all Float32\nChain(\n Dense(784 => 2048, relu), # 1_607_680 parameters\n Dense(2048 => 10), # 20_490 parameters\n) # Total: 4 arrays, 1_628_170 parameters, 6.211 MiB.\n\njulia> m |> f16 # takes half the memory\nChain(\n Dense(784 => 2048, relu), # 1_607_680 parameters\n Dense(2048 => 10), # 20_490 parameters\n) # Total: 4 arrays, 1_628_170 parameters, 3.106 MiB.\n\n\n\n\n\n","category":"function"},{"location":"reference/outputsize/#Shape-Inference","page":"Shape Inference","title":"Shape Inference","text":"","category":"section"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":"Flux has some tools to help generate models in an automated fashion, by inferring the size of arrays that layers will receive, without doing any computation. This is especially useful for convolutional models, where the same Conv layer accepts any size of image, but the next layer may not. ","category":"page"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":"The higher-level tool is a macro @autosize which acts on the code defining the layers, and replaces each appearance of _ with the relevant size. 
This simple example returns a model with Dense(845 => 10) as the last layer:","category":"page"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":"@autosize (28, 28, 1, 32) Chain(Conv((3, 3), _ => 5, relu, stride=2), Flux.flatten, Dense(_ => 10))","category":"page"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":"The input size may be provided at runtime, like @autosize (sz..., 1, 32) Chain(Conv(..., but all the layer constructors containing _ must be explicitly written out – the macro sees the code as written.","category":"page"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":"This macro relies on a lower-level function outputsize, which you can also use directly:","category":"page"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":"c = Conv((3, 3), 1 => 5, relu, stride=2)\nFlux.outputsize(c, (28, 28, 1, 32)) # returns (13, 13, 5, 32)","category":"page"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":"The function outputsize works by passing a \"dummy\" array into the model, which propagates through very cheaply. It should work for all layers, including custom layers, out of the box.","category":"page"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":"An example of how to automate model building is this:","category":"page"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":"\"\"\"\n make_model(width, height, [inchannels, nclasses; layer_config])\n\nCreate a CNN for a given set of configuration parameters. 
Arguments:\n- `width`, `height`: the input image size in pixels\n- `inchannels`: the number of channels in the input image, default `1`\n- `nclasses`: the number of output classes, default `10`\n- Keyword `layer_config`: a vector of the number of channels per layer, default `[16, 16, 32, 64]`\n\"\"\"\nfunction make_model(width, height, inchannels = 1, nclasses = 10;\n layer_config = [16, 16, 32, 64])\n # construct a vector of layers:\n conv_layers = []\n push!(conv_layers, Conv((5, 5), inchannels => layer_config[1], relu, pad=SamePad()))\n for (inch, outch) in zip(layer_config, layer_config[2:end])\n push!(conv_layers, Conv((3, 3), inch => outch, sigmoid, stride=2))\n end\n\n # compute the output dimensions after these conv layers:\n conv_outsize = Flux.outputsize(conv_layers, (width, height, inchannels); padbatch=true)\n\n # use this to define appropriate Dense layer:\n last_layer = Dense(prod(conv_outsize) => nclasses)\n return Chain(conv_layers..., Flux.flatten, last_layer)\nend\n\nm = make_model(28, 28, 3, layer_config = [9, 17, 33, 65])\n\nFlux.outputsize(m, (28, 28, 3, 42)) == (10, 42) == size(m(randn(Float32, 28, 28, 3, 42)))","category":"page"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":"Alternatively, using the macro, the definition of make_model could end with:","category":"page"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":" # compute the output dimensions & construct appropriate Dense layer:\n return @autosize (width, height, inchannels, 1) Chain(conv_layers..., Flux.flatten, Dense(_ => nclasses))\nend","category":"page"},{"location":"reference/outputsize/#Listing","page":"Shape Inference","title":"Listing","text":"","category":"section"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":"Flux.@autosize\nFlux.outputsize","category":"page"},{"location":"reference/outputsize/#Flux.@autosize","page":"Shape 
Inference","title":"Flux.@autosize","text":"@autosize (size...,) Chain(Layer(_ => 2), Layer(_), ...)\n\nReturns the specified model, with each _ replaced by an inferred number, for input of the given size.\n\nThe unknown sizes are usually the second-last dimension of that layer's input, which Flux regards as the channel dimension. (A few layers, Dense & LayerNorm, instead always use the first dimension.) The underscore may appear as an argument of a layer, or inside a =>. It may be used in further calculations, such as Dense(_ => _÷4).\n\nExamples\n\njulia> @autosize (3, 1) Chain(Dense(_ => 2, sigmoid), BatchNorm(_, affine=false))\nChain(\n Dense(3 => 2, σ), # 8 parameters\n BatchNorm(2, affine=false),\n) \n\njulia> img = [28, 28];\n\njulia> @autosize (img..., 1, 32) Chain( # size is only needed at runtime\n Chain(c = Conv((3,3), _ => 5; stride=2, pad=SamePad()),\n p = MeanPool((3,3)),\n b = BatchNorm(_),\n f = Flux.flatten),\n Dense(_ => _÷4, relu, init=Flux.rand32), # can calculate output size _÷4\n SkipConnection(Dense(_ => _, relu), +),\n Dense(_ => 10),\n )\nChain(\n Chain(\n c = Conv((3, 3), 1 => 5, pad=1, stride=2), # 50 parameters\n p = MeanPool((3, 3)),\n b = BatchNorm(5), # 10 parameters, plus 10\n f = Flux.flatten,\n ),\n Dense(80 => 20, relu), # 1_620 parameters\n SkipConnection(\n Dense(20 => 20, relu), # 420 parameters\n +,\n ),\n Dense(20 => 10), # 210 parameters\n) # Total: 10 trainable arrays, 2_310 parameters,\n # plus 2 non-trainable, 10 parameters, summarysize 10.469 KiB.\n\njulia> outputsize(ans, (28, 28, 1, 32))\n(10, 32)\n\nLimitations:\n\nWhile @autosize (5, 32) Flux.Bilinear(_ => 7) is OK, something like Bilinear((_, _) => 7) will fail.\nWhile Scale(_) and LayerNorm(_) are fine (and use the first dimension), Scale(_,_) and LayerNorm(_,_) will fail if size(x,1) != size(x,2).\n\n\n\n\n\n","category":"macro"},{"location":"reference/outputsize/#Flux.outputsize","page":"Shape Inference","title":"Flux.outputsize","text":"outputsize(m, x_size, 
y_size, ...; padbatch=false)\n\nFor model or layer m accepting multiple arrays as input, this returns size(m((x, y, ...))) given x_size = size(x), etc.\n\nExamples\n\njulia> x, y = rand(Float32, 5, 64), rand(Float32, 7, 64);\n\njulia> par = Parallel(vcat, Dense(5 => 9), Dense(7 => 11));\n\njulia> Flux.outputsize(par, (5, 64), (7, 64))\n(20, 64)\n\njulia> m = Chain(par, Dense(20 => 13), softmax);\n\njulia> Flux.outputsize(m, (5,), (7,); padbatch=true)\n(13, 1)\n\njulia> par(x, y) == par((x, y)) == Chain(par, identity)((x, y))\ntrue\n\nNotice that Chain only accepts multiple arrays as a tuple, while Parallel also accepts them as multiple arguments; outputsize always supplies the tuple.\n\n\n\n\n\n","category":"function"},{"location":"guide/performance/#man-performance-tips","page":"Performance Tips","title":"Performance Tips","text":"","category":"section"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"All the usual Julia performance tips apply. As always, profiling your code is generally a useful way of finding bottlenecks. Below follow some Flux-specific tips/reminders.","category":"page"},{"location":"guide/performance/#Don't-use-more-precision-than-you-need","page":"Performance Tips","title":"Don't use more precision than you need","text":"","category":"section"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"Flux works great with all kinds of number types. But often you do not need to be working with say Float64 (let alone BigFloat). Switching to Float32 can give you a significant speed up, not because the operations are faster, but because the memory usage is halved. Which means allocations occur much faster. 
And you use less memory.","category":"page"},{"location":"guide/performance/#Preserve-inputs'-types","page":"Performance Tips","title":"Preserve inputs' types","text":"","category":"section"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"Not only should your activation and loss functions be type-stable, they should also preserve the type of their inputs.","category":"page"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"A very artificial example using an activation function like","category":"page"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"my_tanh(x) = Float64(tanh(x))","category":"page"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"will make performance on Float32 input orders of magnitude slower than with the normal tanh, because it forces slow mixed-type multiplication in the dense layers. Similar situations can occur in the loss function during backpropagation.","category":"page"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"This means that if you change your data, say from Float64 to Float32 (which should give a speedup: see above), you will instead see a large slow-down.","category":"page"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"This can occur sneakily, because you can cause type-promotion by interacting with numeric literals. For example, the following will run into the same problem as above:","category":"page"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"leaky_tanh(x) = 0.01*x + tanh(x)","category":"page"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"While one could change the activation function (e.g. 
to use 0.01f0*x), the idiomatic (and safe) way to avoid type casts whenever the input type changes is to use oftype:","category":"page"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"leaky_tanh(x) = oftype(x/1, 0.01)*x + tanh(x)","category":"page"},{"location":"guide/performance/#Evaluate-batches-as-matrices-of-features","page":"Performance Tips","title":"Evaluate batches as matrices of features","text":"","category":"section"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"While it can sometimes be tempting to process your observations (feature vectors) one at a time, e.g.","category":"page"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"function loss_total(xs::AbstractVector{<:Vector}, ys::AbstractVector{<:Vector})\n sum(zip(xs, ys)) do (x, y_target)\n y_pred = model(x) # evaluate the model\n return loss(y_pred, y_target)\n end\nend","category":"page"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"It is much faster to concatenate them into a matrix, as this will hit BLAS matrix-matrix multiplication, which is much faster than the equivalent sequence of matrix-vector multiplications. The improvement is enough that it is worthwhile allocating new memory to store them contiguously.","category":"page"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"x_batch = reduce(hcat, xs)\ny_batch = reduce(hcat, ys)\n...\nfunction loss_total(x_batch::Matrix, y_batch::Matrix)\n y_preds = model(x_batch)\n sum(loss.(y_preds, y_batch))\nend","category":"page"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"When doing this kind of concatenation use reduce(hcat, xs) rather than hcat(xs...). 
This will avoid the splatting penalty, and will hit the optimised reduce method.","category":"page"},{"location":"guide/performance/#Be-aware-of-GPU-memory-inefficiencies","page":"Performance Tips","title":"Be aware of GPU memory inefficiencies","text":"","category":"section"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"Currently, GPU memory is not handled as well as system memory. If your training loop is allocating significantly on the GPU, you can quickly fill your GPU memory and the piecemeal reclamation and shuffling of data between GPU and system memory can become extremely slow. If profiling shows that a significant portion of time is spent in the gpu function and your data sizes are not large, this may be the cause. Running an incremental garbage collection manually (GC.gc(false)) at regular intervals can keep your GPU memory free and responsive. See other tips for CUDA memory management here.","category":"page"},{"location":"#Flux:-The-Julia-Machine-Learning-Library","page":"Welcome","title":"Flux: The Julia Machine Learning Library","text":"","category":"section"},{"location":"","page":"Welcome","title":"Welcome","text":"Flux is a library for machine learning. It comes \"batteries-included\" with many useful tools built in, but also lets you use the full power of the Julia language where you need it. We follow a few key principles:","category":"page"},{"location":"","page":"Welcome","title":"Welcome","text":"Doing the obvious thing. Flux has relatively few explicit APIs. Instead, writing down the mathematical form will work – and be fast.\nExtensible by default. Flux is written to be highly flexible while being performant. Extending Flux is as simple as using your own code as part of the model you want - it is all high-level Julia code.\nPlay nicely with others. 
Flux works well with unrelated Julia libraries from images to differential equation solvers, rather than duplicating them.","category":"page"},{"location":"#Installation","page":"Welcome","title":"Installation","text":"","category":"section"},{"location":"","page":"Welcome","title":"Welcome","text":"Download Julia 1.9 or later, preferably the current stable release. You can add Flux using Julia's package manager, by typing ] add Flux in the Julia prompt. For Nvidia GPU support, you will also need to install the CUDA and cuDNN packages. For AMD GPU support, install the AMDGPU package. For acceleration on Apple Silicon, install the Metal package.","category":"page"},{"location":"#Learning-Flux","page":"Welcome","title":"Learning Flux","text":"","category":"section"},{"location":"","page":"Welcome","title":"Welcome","text":"The quick start page trains a simple neural network.","category":"page"},{"location":"","page":"Welcome","title":"Welcome","text":"The rest of the guide provides a from-scratch introduction to Flux's take on models and how they work, starting with fitting a line. Once you understand these docs, congratulations, you also understand Flux's source code, which is intended to be concise, legible and a good reference for more advanced concepts.","category":"page"},{"location":"","page":"Welcome","title":"Welcome","text":"There are some tutorials about building particular models. The model zoo has starting points for many other common ones. 
And finally, the ecosystem page lists packages which define Flux models.","category":"page"},{"location":"","page":"Welcome","title":"Welcome","text":"The reference section includes, beside Flux's own functions, those of some companion packages: Zygote.jl (automatic differentiation), Optimisers.jl (training) and others.","category":"page"},{"location":"#Community","page":"Welcome","title":"Community","text":"","category":"section"},{"location":"","page":"Welcome","title":"Welcome","text":"Everyone is welcome to join our community on the Julia discourse forum, or the slack chat (channel #machine-learning). If you have questions or issues we'll try to help you out.","category":"page"},{"location":"","page":"Welcome","title":"Welcome","text":"If you're interested in hacking on Flux, the source code is open and easy to understand – it's all just the same Julia code you work with normally. You might be interested in our intro issues to get started, or our contributing guide.","category":"page"},{"location":"tutorials/linear_regression/#man-linear-regression","page":"Linear Regression","title":"Tutorial: Linear Regression","text":"","category":"section"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Flux is a pure Julia ML stack that allows you to build predictive models. Here are the steps for a typical Flux program:","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Provide training and test data\nBuild a model with configurable parameters to make predictions\nIteratively train the model by tweaking the parameters to improve predictions\nVerify your model","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Under the hood, Flux uses a technique called automatic differentiation to take gradients that help improve predictions. 
Flux is also fully written in Julia so you can easily replace any layer of Flux with your own code to improve your understanding or satisfy special requirements.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The following page contains a step-by-step walkthrough of the linear regression algorithm in Julia using Flux! We will start by creating a simple linear regression model for dummy data and then move on to a real dataset. The first part would involve writing some parts of the model on our own, which will later be replaced by Flux.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Let us start by building a simple linear regression model. This model would be trained on the data points of the form (x₁, y₁), (x₂, y₂), ... , (xₙ, yₙ). In the real world, these xs can have multiple features, and the ys denote a label. 
In our example, each x has a single feature; hence, our data would have n data points, each point mapping a single feature to a single label.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Importing the required Julia packages -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> using Flux, Plots","category":"page"},{"location":"tutorials/linear_regression/#Generating-a-dataset","page":"Linear Regression","title":"Generating a dataset","text":"","category":"section"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The data usually comes from the real world, which we will be exploring in the last part of this tutorial, but we don't want to jump straight to the relatively harder part. Here we will generate the xs of our data points and map them to the respective ys using a simple function. Remember, here each x is equivalent to a feature, and each y is the corresponding label. Combining all the xs and ys would create the complete dataset.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> x = hcat(collect(Float32, -3:0.1:3)...)\n1×61 Matrix{Float32}:\n -3.0 -2.9 -2.8 -2.7 -2.6 -2.5 … 2.4 2.5 2.6 2.7 2.8 2.9 3.0","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The hcat call generates a Matrix with numbers ranging from -3.0 to 3.0 with a gap of 0.1 between them. Each column of this matrix holds a single x, a total of 61 xs. The next step would be to generate the corresponding labels or the ys.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> f(x) = @. 
3x + 2;\n\njulia> y = f(x)\n1×61 Matrix{Float32}:\n -7.0 -6.7 -6.4 -6.1 -5.8 -5.5 … 9.5 9.8 10.1 10.4 10.7 11.0","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The function f maps each x to a y, and as x is a Matrix, the expression broadcasts the scalar values using @. macro. Our data points are ready, but they are too perfect. In a real-world scenario, we will not have an f function to generate y values, but instead, the labels would be manually added.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> x = x .* reshape(rand(Float32, 61), (1, 61));","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Visualizing the final data -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> plot(vec(x), vec(y), lw = 3, seriestype = :scatter, label = \"\", title = \"Generated data\", xlabel = \"x\", ylabel= \"y\");","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"(Image: linear-regression-data)","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The data looks random enough now! 
The x and y values are still somewhat correlated; hence, the linear regression algorithm should work fine on our dataset.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"We can now proceed ahead and build a model for our dataset!","category":"page"},{"location":"tutorials/linear_regression/#Building-a-model","page":"Linear Regression","title":"Building a model","text":"","category":"section"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"A linear regression model is defined mathematically as -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"model(W b x) = Wx + b","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"where W is the weight matrix and b is the bias. For our case, the weight matrix (W) would constitute only a single element, as we have only a single feature. We can define our model in Julia using the exact same notation!","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> custom_model(W, b, x) = @. W*x + b\ncustom_model (generic function with 1 method)","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The @. macro allows you to perform the calculations by broadcasting the scalar quantities (for example - the bias).","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The next step would be to initialize the model parameters, which are the weight and the bias. 
There are a lot of initialization techniques available for different machine learning models, but for the sake of this example, let's draw the weight from a uniform distribution and initialize the bias as 0.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> W = rand(Float32, 1, 1)\n1×1 Matrix{Float32}:\n 0.99285793\n\njulia> b = [0.0f0]\n1-element Vector{Float32}:\n 0.0","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Time to test if our model works!","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> custom_model(W, b, x) |> size\n(1, 61)\n\njulia> custom_model(W, b, x)[1], y[1]\n(-1.6116865f0, -7.0f0)","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"It does! But the predictions are way off. We need to train the model to improve the predictions, but before training the model we need to define the loss function. The loss function would ideally output a quantity that we will try to minimize during the entire training process. Here we will use the mean squared error loss function.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> function custom_loss(W, b, x, y)\n ŷ = custom_model(W, b, x)\n sum((y .- ŷ).^2) / length(x)\n end;\n\njulia> custom_loss(W, b, x, y)\n23.772217f0","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Calling the loss function on our xs and ys shows how far our predictions (ŷ) are from the real labels. 
More precisely, it calculates the sum of the squares of residuals and divides it by the total number of data points.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"We have successfully defined our model and the loss function, but surprisingly, we haven't used Flux anywhere till now. Let's see how we can write the same code using Flux. ","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> flux_model = Dense(1 => 1)\nDense(1 => 1) # 2 parameters","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"A Dense(1 => 1) layer denotes a layer of one neuron with one input (one feature) and one output. This layer is exactly the same as the mathematical model we defined above! Under the hood, Flux too calculates the output using the same expression! But we don't have to initialize the parameters ourselves this time; Flux does it for us.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> flux_model.weight, flux_model.bias\n(Float32[-1.2678515;;], Float32[0.0])","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Now we can check if our model is acting right. We can pass the complete data in one go, with each x having exactly one feature (one input) -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> flux_model(x) |> size\n(1, 61)\n\njulia> flux_model(x)[1], y[1]\n(-1.8525281f0, -7.0f0)","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"It is! 
The next step would be defining the loss function using Flux's functions -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> function flux_loss(flux_model, x, y)\n ŷ = flux_model(x)\n Flux.mse(ŷ, y)\n end;\n\njulia> flux_loss(flux_model, x, y)\n22.74856f0","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Everything works as before! It almost feels like Flux provides us with smart wrappers for the functions we could have written on our own. Now, as the last step of this section, let's see how different the flux_model is from our custom_model. A good way to go about this would be to fix the parameters of both models to be the same. Let's change the parameters of our custom_model to match those of the flux_model -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> W = Float32[1.1412252]\n1-element Vector{Float32}:\n 1.1412252","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"To check how both the models are performing on the data, let's find out the losses using the custom_loss and flux_loss functions -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> custom_loss(W, b, x, y), flux_loss(flux_model, x, y)\n(22.74856f0, 22.74856f0)","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The losses are identical! This means that our model and the flux_model are identical on some level, and the loss functions are completely identical! The difference in models would be that Flux's Dense layer supports many other arguments that can be used to customize the layer further. 
But, for this tutorial, let us stick to our simple custom_model.","category":"page"},{"location":"tutorials/linear_regression/#Training-the-model","page":"Linear Regression","title":"Training the model","text":"","category":"section"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Let's train our model using the classic Gradient Descent algorithm. According to the gradient descent algorithm, the weights and biases should be iteratively updated using the following mathematical equations -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"beginaligned\nW = W - eta * fracdLdW \nb = b - eta * fracdLdb\nendaligned","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Here, W is the weight matrix, b is the bias vector, eta is the learning rate, fracdLdW is the derivative of the loss function with respect to the weight, and fracdLdb is the derivative of the loss function with respect to the bias.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The derivatives are calculated using an Automatic Differentiation tool, and Flux uses Zygote.jl for the same. Since Zygote.jl is an independent Julia package, it can be used outside of Flux as well! Refer to the documentation of Zygote.jl for more information on the same.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Our first step would be to obtain the gradient of the loss function with respect to the weights and the biases. 
Flux re-exports Zygote's gradient function; hence, we don't need to import Zygote explicitly to use the functionality.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> dLdW, dLdb, _, _ = gradient(custom_loss, W, b, x, y);","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"We can now update the parameters, following the gradient descent algorithm -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> W .= W .- 0.1 .* dLdW\n1-element Vector{Float32}:\n 1.8144473\n\njulia> b .= b .- 0.1 .* dLdb\n1-element Vector{Float32}:\n 0.41325632","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The parameters have been updated! We can now check the value of the loss function -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> custom_loss(W, b, x, y)\n17.157953f0","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The loss went down! This means that we successfully trained our model for one epoch. We can plug the training code written above into a loop and train the model for a higher number of epochs. It can be customized either to have a fixed number of epochs or to stop when certain conditions are met, for example, change in loss < 0.1. 
The loop can be tailored to suit the user's needs, and the conditions can be specified in plain Julia!","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Let's plug our super training logic inside a function and test it again -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> function train_custom_model()\n dLdW, dLdb, _, _ = gradient(custom_loss, W, b, x, y)\n @. W = W - 0.1 * dLdW\n @. b = b - 0.1 * dLdb\n end;\n\njulia> train_custom_model();\n\njulia> W, b, custom_loss(W, b, x, y)\n(Float32[2.340657], Float32[0.7516814], 13.64972f0)","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"It works, and the loss went down again! This was the second epoch of our training procedure. Let's plug this in a for loop and train the model for 40 epochs.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> for i = 1:40\n train_custom_model()\n end\n\njulia> W, b, custom_loss(W, b, x, y)\n(Float32[4.2422233], Float32[2.2460847], 7.6680417f0)","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"There was a significant reduction in loss, and the parameters were updated!","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"We can train the model even more or tweak the hyperparameters to achieve the desired result faster, but let's stop here. We trained our model for 42 epochs, and loss went down from 22.74856 to 7.6680417. 
Time for some visualization!","category":"page"},{"location":"tutorials/linear_regression/#Results","page":"Linear Regression","title":"Results","text":"","category":"section"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The main objective of this tutorial was to fit a line to our dataset using the linear regression algorithm. The training procedure went well, and the loss went down significantly! Let's see what the fitted line looks like. Remember, Wx + b is nothing more than a line's equation, with slope = W[1] and y-intercept = b[1] (indexing at 1 as W and b are iterable).","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Plotting the line and the data points using Plots.jl -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> plot(reshape(x, (61, 1)), reshape(y, (61, 1)), lw = 3, seriestype = :scatter, label = \"\", title = \"Simple Linear Regression\", xlabel = \"x\", ylabel= \"y\");\n\njulia> plot!((x) -> b[1] + W[1] * x, -3, 3, label=\"Custom model\", lw=2);","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"(Image: linear-regression-line)","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The line fits well! There is room for improvement, but we leave that up to you! You can play with the optimisers, the number of epochs, learning rate, etc. 
to improve the fitting and reduce the loss!","category":"page"},{"location":"tutorials/linear_regression/#Linear-regression-model-on-a-real-dataset","page":"Linear Regression","title":"Linear regression model on a real dataset","text":"","category":"section"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"We now move on to a relatively complex linear regression model. Here we will use a real dataset from MLDatasets.jl, which will not confine our data points to have only one feature. Let's start by importing the required packages -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> using Flux, Statistics, MLDatasets, DataFrames","category":"page"},{"location":"tutorials/linear_regression/#Gathering-real-data","page":"Linear Regression","title":"Gathering real data","text":"","category":"section"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Let's start by initializing our dataset. We will be using the BostonHousing dataset consisting of 506 data points. Each of these data points has 13 features and a corresponding label, the house's price. The xs are still mapped to a single y, but now, a single x data point has 13 features. 
","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> dataset = BostonHousing();\n\njulia> x, y = BostonHousing(as_df=false)[:];\n\njulia> x, y = Float32.(x), Float32.(y);","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"We can now split the obtained data into training and testing data -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> x_train, x_test, y_train, y_test = x[:, 1:400], x[:, 401:end], y[:, 1:400], y[:, 401:end];\n\njulia> x_train |> size, x_test |> size, y_train |> size, y_test |> size\n((13, 400), (13, 106), (1, 400), (1, 106))","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"This data contains a diverse number of features, which means that the features have different scales. A wise option here would be to normalise the data, making the training process more efficient and fast. Let's check the standard deviation of the training data before normalising it.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> std(x_train)\n134.06786f0","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The data is indeed not normalised. We can use the Flux.normalise function to normalise the training data.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> x_train_n = Flux.normalise(x_train);\n\njulia> std(x_train_n)\n1.0000844f0","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The standard deviation is now close to one! 
Our data is ready!","category":"page"},{"location":"tutorials/linear_regression/#Building-a-Flux-model","page":"Linear Regression","title":"Building a Flux model","text":"","category":"section"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"We can now directly use Flux and let it do all the work internally! Let's define a model that takes in 13 inputs (13 features) and gives us a single output (the label). We will then pass our entire data through this model in one go, and Flux will handle everything for us! Remember, we could have declared a model in plain Julia as well. The model will have 14 parameters: 13 weights and 1 bias.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> model = Dense(13 => 1)\nDense(13 => 1) # 14 parameters","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Same as before, our next step would be to define a loss function to quantify our accuracy somehow. 
The lower the loss, the better the model!","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> function loss(model, x, y)\n ŷ = model(x)\n Flux.mse(ŷ, y)\n end;\n\njulia> loss(model, x_train_n, y_train)\n676.1656f0","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"We can now proceed to the training phase!","category":"page"},{"location":"tutorials/linear_regression/#Training-the-Flux-model","page":"Linear Regression","title":"Training the Flux model","text":"","category":"section"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The training procedure would make use of the same mathematics, but now we can pass in the model inside the gradient call and let Flux and Zygote handle the derivatives!","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> function train_model()\n dLdm, _, _ = gradient(loss, model, x_train_n, y_train)\n @. model.weight = model.weight - 0.000001 * dLdm.weight\n @. model.bias = model.bias - 0.000001 * dLdm.bias\n end;","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Contrary to our last training procedure, let's say that this time we don't want to hardcode the number of epochs. We want the training procedure to stop when the loss converges, that is, when change in loss < δ. 
The quantity δ can be altered according to a user's need, but let's fix it to 10⁻⁴ for this tutorial.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"We can write such custom training loops effortlessly using Flux and plain Julia!","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> loss_init = Inf;\n\njulia> while true\n train_model()\n if loss_init == Inf\n loss_init = loss(model, x_train_n, y_train)\n continue\n end\n if abs(loss_init - loss(model, x_train_n, y_train)) < 1e-4\n break\n else\n loss_init = loss(model, x_train_n, y_train)\n end\n end;","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The code starts by initializing the loss to infinity. Next, it runs an infinite loop that breaks if the change in loss is < 10⁻⁴; otherwise, the code sets loss_init to the current loss and moves on to the next iteration.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"This custom loop works! This shows how easily a user can write down any custom training routine using Flux and Julia!","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Let's have a look at the loss -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> loss(model, x_train_n, y_train)\n27.1272f0","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The loss went down significantly! 
It can be minimized further by choosing an even smaller δ.","category":"page"},{"location":"tutorials/linear_regression/#Testing-the-Flux-model","page":"Linear Regression","title":"Testing the Flux model","text":"","category":"section"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The last step of this tutorial would be to test our model using the testing data. We will first normalise the testing data and then calculate the corresponding loss.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> x_test_n = Flux.normalise(x_test);\n\njulia> loss(model, x_test_n, y_test)\n66.91015f0","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The test loss is noticeably larger than the training loss (about 66.9 versus 27.1), which is common for unseen data. The gap is worth keeping in mind before concluding that the model generalises well.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Summarising this tutorial, we started by generating a random yet correlated dataset for our custom model. We then saw how a simple linear regression model could be built with and without Flux, and how they were almost identical. ","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Next, we trained the model by manually writing down the Gradient Descent algorithm and optimising the loss. We also saw how Flux provides various wrapper functionalities and keeps the API extremely intuitive and simple for the users. 
","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"After getting familiar with the basics of Flux and Julia, we moved ahead to build a machine learning model for a real dataset. We repeated the exact same steps, but this time with a lot more features and data points, and by harnessing Flux's full capabilities. In the end, we developed a training loop that was smarter than the hardcoded one and ran the model on our normalised dataset to conclude the tutorial.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"info: Info\nOriginally published on 21 November 2022, by Saransh Chopra.","category":"page"},{"location":"guide/saving/#Saving-and-Loading-Models","page":"Saving & Loading","title":"Saving and Loading Models","text":"","category":"section"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"You may wish to save models so that they can be loaded and run in a later session. Flux provides a number of ways to do this. 
The recommended way, which is the most robust one for long term storage, is to use Flux.state in combination with a serialization format like JLD2.jl or BSON.jl.","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"Save a model:","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"julia> using Flux\n\njulia> struct MyModel\n net\n end\n\njulia> Flux.@layer MyModel\n\njulia> MyModel() = MyModel(Chain(Dense(10 => 5, relu), Dense(5 => 2)));\n\njulia> model = MyModel()\nMyModel(Chain(Dense(10 => 5, relu), Dense(5 => 2))) # 67 parameters\n\njulia> model_state = Flux.state(model);\n\njulia> using JLD2\n\njulia> jldsave(\"mymodel.jld2\"; model_state)","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"Load it again in a new session using Flux.loadmodel!:","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"julia> using Flux, JLD2\n\njulia> model_state = JLD2.load(\"mymodel.jld2\", \"model_state\");\n\njulia> model = MyModel(); # MyModel definition must be available\n\njulia> Flux.loadmodel!(model, model_state);","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"note: Note\nIf a saved model's parameters are stored on the GPU, the model will not load later on if there is no GPU support available. It's best to move your model to the CPU with cpu(model) before saving it.","category":"page"},{"location":"guide/saving/#Checkpointing","page":"Saving & Loading","title":"Checkpointing","text":"","category":"section"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"In longer training runs it's a good idea to periodically save your model, so that you can resume if training is interrupted (for example, if there's a power cut). 
","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"julia> using Flux: throttle\n\njulia> using JLD2\n\njulia> m = Chain(Dense(10 => 5, relu), Dense(5 => 2))\nChain(\n Dense(10 => 5, relu), # 55 parameters\n Dense(5 => 2), # 12 parameters\n) # Total: 4 arrays, 67 parameters, 524 bytes.\n\njulia> for epoch in 1:10\n # ... train model ...\n jldsave(\"model-checkpoint.jld2\", model_state = Flux.state(m))\n end;","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"This will update the \"model-checkpoint.jld2\" every epoch.","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"You can get more advanced by saving a series of models throughout training, for example","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"jldsave(\"model-$(now()).jld2\", model_state = Flux.state(m))","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"will produce a series of models like \"model-2018-03-06T02:57:10.41.jld2\". You could also store the current test set loss, so that it's easy to (for example) revert to an older copy of the model if it starts to overfit.","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"jldsave(\"model-$(now()).jld2\", model_state = Flux.state(m), loss = testloss())","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"Note that to resume a model's training, you might need to restore other stateful parts of your training loop. 
Possible examples are the optimiser state and the randomness used to partition the original data into the training and validation sets.","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"You can store the optimiser state alongside the model, to resume training exactly where you left off: ","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"model = MyModel()\nopt_state = Flux.setup(AdamW(), model)\n\n# ... train model ...\n\nmodel_state = Flux.state(model)\njldsave(\"checkpoint_epoch=42.jld2\"; model_state, opt_state)","category":"page"},{"location":"guide/saving/#Saving-Models-as-Julia-Structs","page":"Saving & Loading","title":"Saving Models as Julia Structs","text":"","category":"section"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"Models are just normal Julia structs, so it's fine to use any Julia storage format to save the struct as it is instead of saving the state returned by Flux.state. 
BSON.jl is particularly convenient for this, since it can also save anonymous functions, which are sometimes part of a model definition.","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"Save a model:","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"julia> using Flux\n\njulia> model = Chain(Dense(10 => 5, NNlib.relu), Dense(5 => 2));\n\njulia> using BSON: @save\n\njulia> @save \"mymodel.bson\" model","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"Load it again in a new session:","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"julia> using Flux, BSON\n\njulia> BSON.@load \"mymodel.bson\" model\n\njulia> model\nChain(\n Dense(10 => 5, relu), # 55 parameters\n Dense(5 => 2), # 12 parameters\n) # Total: 4 arrays, 67 parameters, 524 bytes.","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"warning: Warning\nSaving models this way could lead to compatibility issues across julia versions and across Flux versions if some of the Flux layers' internals are changed. It is therefore not recommended for long term storage, use Flux.state instead.","category":"page"}] +[{"location":"guide/models/quickstart/#man-quickstart","page":"Quick Start","title":"A Neural Network in One Minute","text":"","category":"section"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"If you have used neural networks before, then this simple example might be helpful for seeing how the major parts of Flux work together. 
Try pasting the code into the REPL prompt.","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"If you haven't, then you might prefer the Fitting a Straight Line page.","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"# This will prompt if necessary to install everything, including CUDA:\nusing Flux, CUDA, Statistics, ProgressMeter\n\n# Generate some data for the XOR problem: vectors of length 2, as columns of a matrix:\nnoisy = rand(Float32, 2, 1000) # 2×1000 Matrix{Float32}\ntruth = [xor(col[1]>0.5, col[2]>0.5) for col in eachcol(noisy)] # 1000-element Vector{Bool}\n\n# Define our model, a multi-layer perceptron with one hidden layer of size 3:\nmodel = Chain(\n Dense(2 => 3, tanh), # activation function inside layer\n BatchNorm(3),\n Dense(3 => 2)) |> gpu # move model to GPU, if available\n\n# The model encapsulates parameters, randomly initialised. Its initial output is:\nout1 = model(noisy |> gpu) |> cpu # 2×1000 Matrix{Float32}\nprobs1 = softmax(out1) # normalise to get probabilities\n\n# To train the model, we use batches of 64 samples, and one-hot encoding:\ntarget = Flux.onehotbatch(truth, [true, false]) # 2×1000 OneHotMatrix\nloader = Flux.DataLoader((noisy, target) |> gpu, batchsize=64, shuffle=true);\n# 16-element DataLoader with first element: (2×64 Matrix{Float32}, 2×64 OneHotMatrix)\n\noptim = Flux.setup(Flux.Adam(0.01), model) # will store optimiser momentum, etc.\n\n# Training loop, using the whole data set 1000 times:\nlosses = []\n@showprogress for epoch in 1:1_000\n for (x, y) in loader\n loss, grads = Flux.withgradient(model) do m\n # Evaluate model and loss inside gradient context:\n y_hat = m(x)\n Flux.logitcrossentropy(y_hat, y)\n end\n Flux.update!(optim, model, grads[1])\n push!(losses, loss) # logging, outside gradient context\n end\nend\n\noptim # parameters, momenta and output have all changed\nout2 = model(noisy |> gpu) 
|> cpu # first row is prob. of true, second row p(false)\nprobs2 = softmax(out2) # normalise to get probabilities\nmean((probs2[1,:] .> 0.5) .== truth) # accuracy 94% so far!","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"(Image: )","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"using Plots # to draw the above figure\n\np_true = scatter(noisy[1,:], noisy[2,:], zcolor=truth, title=\"True classification\", legend=false)\np_raw = scatter(noisy[1,:], noisy[2,:], zcolor=probs1[1,:], title=\"Untrained network\", label=\"\", clims=(0,1))\np_done = scatter(noisy[1,:], noisy[2,:], zcolor=probs2[1,:], title=\"Trained network\", legend=false)\n\nplot(p_true, p_raw, p_done, layout=(1,3), size=(1000,330))","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"Here's the loss during training:","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"plot(losses; xaxis=(:log10, \"iteration\"),\n yaxis=\"loss\", label=\"per batch\")\nn = length(loader)\nplot!(n:n:length(losses), mean.(Iterators.partition(losses, n)),\n label=\"epoch mean\", dpi=200)","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"This XOR (\"exclusive or\") problem is a variant of the famous one which drove Minsky and Papert to invent deep neural networks in 1969. For small values of \"deep\" – this has one hidden layer, while earlier perceptrons had none. (What they call a hidden layer, Flux calls the output of the first layer, model[1](noisy).)","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"Since then things have developed a little. 
","category":"page"},{"location":"guide/models/quickstart/#Features-to-Note","page":"Quick Start","title":"Features to Note","text":"","category":"section"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"Some things to notice in this example are:","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"The batch dimension of data is always the last one. Thus a 2×1000 Matrix is a thousand observations, each a column of length 2. Flux defaults to Float32, but most of Julia to Float64.\nThe model can be called like a function, y = model(x). Each layer like Dense is an ordinary struct, which encapsulates some arrays of parameters (and possibly other state, as for BatchNorm).\nBut the model does not contain the loss function, nor the optimisation rule. The momenta needed by Adam are stored in the object returned by setup. And Flux.logitcrossentropy is an ordinary function that combines the softmax and crossentropy functions.\nThe do block creates an anonymous function, as the first argument of gradient. Anything executed within this is differentiated.","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"Instead of calling gradient and update! separately, there is a convenience function train!. 
If we didn't want anything extra (like logging the loss), we could replace the training loop with the following:","category":"page"},{"location":"guide/models/quickstart/","page":"Quick Start","title":"Quick Start","text":"for epoch in 1:1_000\n Flux.train!(model, loader, optim) do m, x, y\n y_hat = m(x)\n Flux.logitcrossentropy(y_hat, y)\n end\nend","category":"page"},{"location":"reference/training/reference/#Training-API-Reference","page":"Training API","title":"Training API Reference","text":"","category":"section"},{"location":"reference/training/reference/","page":"Training API","title":"Training API","text":"The new version of Flux's training code was written as an independent package, Optimisers.jl. Only the function train! belongs to Flux itself.","category":"page"},{"location":"reference/training/reference/","page":"Training API","title":"Training API","text":"The Optimisers package is designed to allow for immutable objects. But at present all Flux models contain parameter arrays (such as Arrays and CuArrays) which can be updated in-place. Because of this:","category":"page"},{"location":"reference/training/reference/","page":"Training API","title":"Training API","text":"The objects returned by Optimisers.update! can be ignored.\nFlux defines its own version of setup which checks this assumption. (Using Optimisers.setup instead will also work; the two return the same thing.)","category":"page"},{"location":"reference/training/reference/","page":"Training API","title":"Training API","text":"The available optimisation rules are listed on the optimisation rules page. 
See the Optimisers documentation for details on how the rules work.","category":"page"},{"location":"reference/training/reference/","page":"Training API","title":"Training API","text":"Flux.Train.setup\nFlux.Train.train!(loss, model, data, state)\nOptimisers.update\nOptimisers.update!\nOptimisers.setup","category":"page"},{"location":"reference/training/reference/#Flux.Train.setup","page":"Training API","title":"Flux.Train.setup","text":"opt_state = setup(rule, model)\n\nThis is a version of Optimisers.setup, and is the first step before using train!. It differs from Optimisers.setup in that it:\n\nhas one extra check for mutability (since Flux expects to mutate the model in-place, while Optimisers.jl is designed to return an updated model)\nhas methods which accept Flux's old optimisers, and convert them. (The old Flux.Optimise.Adam and new Optimisers.Adam are distinct types.)\n\nExample\n\njulia> model = Dense(2 => 1, leakyrelu; init=ones);\n\njulia> opt_state = Flux.setup(Momentum(0.1), model) # this encodes the optimiser and its state\n(weight = Leaf(Momentum(0.1, 0.9), [0.0 0.0]), bias = Leaf(Momentum(0.1, 0.9), [0.0]), σ = ())\n\njulia> x1, y1 = [0.2, -0.3], [0.4]; # use the same data for two steps:\n\njulia> Flux.train!(model, [(x1, y1), (x1, y1)], opt_state) do m, x, y\n sum(abs.(m(x) .- y)) * 100\n end\n\njulia> model.bias # was zero, mutated by Flux.train!\n1-element Vector{Float64}:\n 10.19\n\njulia> opt_state # mutated by Flux.train!\n(weight = Leaf(Momentum(0.1, 0.9), [-2.018 3.027]), bias = Leaf(Momentum(0.1, 0.9), [-10.09]), σ = ())\n\n\n\n\n\n","category":"function"},{"location":"reference/training/reference/#Flux.Optimise.train!-NTuple{4, Any}","page":"Training API","title":"Flux.Optimise.train!","text":"train!(loss, model, data, opt_state)\n\nUses a loss function and training data to improve the model's parameters according to a particular optimisation rule encoded in opt_state. 
Iterates through data once, evaluating for each d in data either loss(model, d...) if d isa Tuple, or else loss(model, d) for other d.\n\nIf model is an Enzyme.Duplicated and Enzyme.jl is loaded, gradients will be computed with Enzyme, otherwise they will be computed with Zygote.\n\nFor example, with these definitions...\n\ndata = [(x1, y1), (x2, y2), (x3, y3)]\n\nloss3(m, x, y) = norm(m(x) .- y) # the model is the first argument\n\nopt_state = Flux.setup(Adam(), model) # explicit setup of optimiser momenta\n\n...calling Flux.train!(loss3, model, data, opt_state) runs a loop much like this:\n\nfor d in data\n ∂L∂m = gradient(loss3, model, d...)[1]\n update!(opt_state, model, ∂L∂m)\nend\n\nYou can also write this loop yourself, if you need more flexibility. For this reason train! is not highly extensible. It adds only a few features to the loop above:\n\nStop with a DomainError if the loss is infinite or NaN at any point.\nShow a progress bar using @withprogress.\n\ncompat: New\nThis method was added in Flux 0.13.9. It has significant changes from the one used by Flux ≤ 0.13:It now takes the model itself, not the result of Flux.params. (This is to move away from Zygote's \"implicit\" parameter handling, with Grads.)\nInstead of loss being a function which accepts only the data, now it must also accept the model itself, as the first argument.\nopt_state should be the result of Flux.setup. Using an optimiser such as Adam() without this step should give you a warning.\nCallback functions are not supported. (But any code can be included in the above for loop.)\n\n\n\n\n\n","category":"method"},{"location":"reference/training/reference/#Optimisers.update","page":"Training API","title":"Optimisers.update","text":"Optimisers.update(tree, model, gradient) -> (tree, model)\n\nUses the optimiser and the gradient to change the trainable parameters in the model. Returns the improved model, and the optimiser states needed for the next update. 
The initial tree of states comes from setup.\n\nSee also update!, which will be faster for models of ordinary Arrays or CuArrays.\n\nExample\n\njulia> m = (x = Float32[1,2,3], y = tanh);\n\njulia> t = Optimisers.setup(Descent(0.1), m)\n(x = Leaf(Descent(0.1), nothing), y = ())\n\njulia> g = (x = [1,1,1], y = nothing); # fake gradient\n\njulia> Optimisers.update(t, m, g)\n((x = Leaf(Descent(0.1), nothing), y = ()), (x = Float32[0.9, 1.9, 2.9], y = tanh))\n\n\n\n\n\n","category":"function"},{"location":"reference/training/reference/#Optimisers.update!","page":"Training API","title":"Optimisers.update!","text":"Optimisers.update!(tree, model, gradient) -> (tree, model)\n\nUses the optimiser and the gradient to change the trainable parameters in the model. Returns the improved model, and the optimiser states needed for the next update. The initial tree of states comes from setup.\n\nThis is used in exactly the same manner as update, but because it may mutate arrays within the old model (and the old state), it will be faster for models of ordinary Arrays or CuArrays. However, you should not rely on the old model being fully updated but rather use the returned model. (The original state tree is always mutated, as each Leaf is mutable.)\n\nExample\n\njulia> using StaticArrays, Zygote, Optimisers\n\njulia> m = (x = [1f0, 2f0], y = SA[4f0, 5f0]); # partly mutable model\n\njulia> t = Optimisers.setup(Momentum(1/30, 0.9), m) # tree of states\n(x = Leaf(Momentum(0.0333333, 0.9), Float32[0.0, 0.0]), y = Leaf(Momentum(0.0333333, 0.9), Float32[0.0, 0.0]))\n\njulia> g = gradient(m -> sum(abs2.(m.x .+ m.y)), m)[1] # structural gradient\n(x = Float32[10.0, 14.0], y = Float32[10.0, 14.0])\n\njulia> t2, m2 = Optimisers.update!(t, m, g);\n\njulia> m2 # after update or update!, this is the new model\n(x = Float32[0.6666666, 1.5333333], y = Float32[3.6666667, 4.5333333])\n\njulia> m2.x === m.x # update! 
has re-used this array, for efficiency\ntrue\n\njulia> m # original should be discarded, may be mutated but no guarantee\n(x = Float32[0.6666666, 1.5333333], y = Float32[4.0, 5.0])\n\njulia> t == t2 # original state tree is guaranteed to be mutated\ntrue\n\n\n\n\n\n","category":"function"},{"location":"reference/training/reference/#Optimisers.setup","page":"Training API","title":"Optimisers.setup","text":"Optimisers.setup(rule, model) -> state_tree\n\nInitialises the given optimiser for every trainable parameter within the model. Returns a tree of the relevant states, which must be passed to update or update!.\n\nExample\n\njulia> m = (x = rand(3), y = (true, false), z = tanh);\n\njulia> Optimisers.setup(Momentum(), m) # same field names as m\n(x = Leaf(Momentum(0.01, 0.9), [0.0, 0.0, 0.0]), y = ((), ()), z = ())\n\nThe recursion into structures uses Functors.jl, and any new structs containing parameters need to be marked with Functors.@functor before use. See the Flux docs for more about this.\n\njulia> struct Layer; mat; fun; end\n\njulia> model = (lay = Layer([1 2; 3 4f0], sin), vec = [5, 6f0]);\n\njulia> Optimisers.setup(Momentum(), model) # new struct is by default ignored\n(lay = (), vec = Leaf(Momentum(0.01, 0.9), Float32[0.0, 0.0]))\n\njulia> destructure(model)\n(Float32[5.0, 6.0], Restructure(NamedTuple, ..., 2))\n\njulia> using Functors; @functor Layer # annotate this type as containing parameters\n\njulia> Optimisers.setup(Momentum(), model)\n(lay = (mat = Leaf(Momentum(0.01, 0.9), Float32[0.0 0.0; 0.0 0.0]), fun = ()), vec = Leaf(Momentum(0.01, 0.9), Float32[0.0, 0.0]))\n\njulia> destructure(model)\n(Float32[1.0, 3.0, 2.0, 4.0, 5.0, 6.0], Restructure(NamedTuple, ..., 6))\n\n\n\n\n\n","category":"function"},{"location":"reference/training/reference/","page":"Training API","title":"Training API","text":"train! uses @progress which should show a progress bar in VSCode automatically. 
To see one in a terminal, you will need to install TerminalLoggers.jl and follow its setup instructions.","category":"page"},{"location":"reference/training/reference/#Optimisation-Modifiers","page":"Training API","title":"Optimisation Modifiers","text":"","category":"section"},{"location":"reference/training/reference/","page":"Training API","title":"Training API","text":"The state returned by setup can be modified to temporarily prevent training of some parts of the model, or to change the learning rate or other hyperparameter. The functions for doing so may be accessed as Flux.freeze!, Flux.thaw!, and Flux.adjust!. All mutate the state (or part of it) and return nothing.","category":"page"},{"location":"reference/training/reference/","page":"Training API","title":"Training API","text":"Optimisers.adjust!\nOptimisers.freeze!\nOptimisers.thaw!","category":"page"},{"location":"reference/training/reference/#Optimisers.adjust!","page":"Training API","title":"Optimisers.adjust!","text":"Optimisers.adjust!(tree, η)\n\nAlters the state tree = setup(rule, model) to change the parameters of the optimisation rule, without destroying its stored state. Typically used mid-way through training.\n\nCan be applied to part of a model, by acting only on the corresponding part of the state tree.\n\nTo change just the learning rate, provide a number η::Real.\n\nExample\n\njulia> m = (vec = rand(Float32, 2), fun = sin);\n\njulia> st = Optimisers.setup(Nesterov(), m) # stored momentum is initialised to zero\n(vec = Leaf(Nesterov(0.001, 0.9), Float32[0.0, 0.0]), fun = ())\n\njulia> st, m = Optimisers.update(st, m, (vec = [16, 88], fun = nothing)); # with fake gradient\n\njulia> st\n(vec = Leaf(Nesterov(0.001, 0.9), Float32[-0.016, -0.088]), fun = ())\n\njulia> Optimisers.adjust!(st, 0.123) # change learning rate, stored momentum untouched\n\njulia> st\n(vec = Leaf(Nesterov(0.123, 0.9), Float32[-0.016, -0.088]), fun = ())\n\nTo change other parameters, adjust! 
also accepts keyword arguments matching the field names of the optimisation rule's type.\n\njulia> fieldnames(Adam)\n(:eta, :beta, :epsilon)\n\njulia> st2 = Optimisers.setup(OptimiserChain(ClipGrad(), Adam()), m)\n(vec = Leaf(OptimiserChain(ClipGrad(10.0), Adam(0.001, (0.9, 0.999), 1.0e-8)), (nothing, (Float32[0.0, 0.0], Float32[0.0, 0.0], (0.9, 0.999)))), fun = ())\n\njulia> Optimisers.adjust(st2; beta = (0.777, 0.909), delta = 11.1) # delta acts on ClipGrad\n(vec = Leaf(OptimiserChain(ClipGrad(11.1), Adam(0.001, (0.777, 0.909), 1.0e-8)), (nothing, (Float32[0.0, 0.0], Float32[0.0, 0.0], (0.9, 0.999)))), fun = ())\n\njulia> Optimisers.adjust(st; beta = \"no such field\") # silently ignored!\n(vec = Leaf(Nesterov(0.123, 0.9), Float32[-0.016, -0.088]), fun = ())\n\n\n\n\n\n","category":"function"},{"location":"reference/training/reference/#Optimisers.freeze!","page":"Training API","title":"Optimisers.freeze!","text":"Optimisers.freeze!(tree)\n\nTemporarily alters the state tree = setup(rule, model) so that parameters will not be updated. Un-done by thaw!.\n\nCan be applied to the state corresponding to only part of a model, for instance with model::Chain, to freeze model.layers[1] you should call freeze!(tree.layers[1]).\n\nExample\n\njulia> m = (x = ([1.0], 2.0), y = [3.0]);\n\njulia> s = Optimisers.setup(Momentum(), m);\n\njulia> Optimisers.freeze!(s.x)\n\njulia> Optimisers.update!(s, m, (x = ([pi], 10pi), y = [100pi])); # with fake gradient\n\njulia> m\n(x = ([1.0], 2.0), y = [-0.14159265358979312])\n\njulia> s\n(x = (Leaf(Momentum(0.01, 0.9), [0.0], frozen = true), ()), y = Leaf(Momentum(0.01, 0.9), [3.14159]))\n\njulia> Optimisers.thaw!(s)\n\njulia> s.x\n(Leaf(Momentum(0.01, 0.9), [0.0]), ())\n\n\n\n\n\n","category":"function"},{"location":"reference/training/reference/#Optimisers.thaw!","page":"Training API","title":"Optimisers.thaw!","text":"Optimisers.thaw!(tree)\n\nThe reverse of freeze!. 
Applies to all parameters, mutating every Leaf(rule, state, frozen = true) to Leaf(rule, state, frozen = false).\n\n\n\n\n\n","category":"function"},{"location":"tutorials/logistic_regression/#Logistic-Regression","page":"Logistic Regression","title":"Logistic Regression","text":"","category":"section"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"The following page contains a step-by-step walkthrough of the logistic regression algorithm in Julia using Flux. We will then create a simple logistic regression model without using Flux and compare the different working parts with Flux's implementation. ","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Let's start by importing the required Julia packages.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> using Flux, Statistics, MLDatasets, DataFrames, OneHotArrays","category":"page"},{"location":"tutorials/logistic_regression/#Dataset","page":"Logistic Regression","title":"Dataset","text":"","category":"section"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Let's start by importing a dataset from MLDatasets.jl. We will use the Iris dataset that contains the data of three different Iris species. The data consists of 150 data points (xs), each having four features. Each of these x is mapped to y, the name of a particular Iris species. 
The following code will download the Iris dataset when run for the first time.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> Iris()\ndataset Iris:\n metadata => Dict{String, Any} with 4 entries\n features => 150×4 DataFrame\n targets => 150×1 DataFrame\n dataframe => 150×5 DataFrame\n\njulia> x, y = Iris(as_df=false)[:];","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Let's have a look at our dataset -","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> y\n1×150 Matrix{InlineStrings.String15}:\n \"Iris-setosa\" \"Iris-setosa\" … \"Iris-virginica\" \"Iris-virginica\"\n\njulia> x |> summary\n\"4×150 Matrix{Float64}\"","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"The y values here correspond to a type of iris plant, with a total of 150 data points. The x values depict the sepal length, sepal width, petal length, and petal width (all in cm) of 150 iris plants (hence the matrix size 4×150). Different types of iris plants have different lengths and widths of sepals and petals associated with them, and there is a definitive pattern for this in nature. We can leverage this to train a simple classifier that outputs the type of iris plant using the length and width of sepals and petals as inputs.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Our next step would be to convert this data into a form that can be fed to a machine learning model. The x values are arranged in a matrix and should ideally be converted to Float32 type (see Performance tips), but the labels must be one hot encoded. 
Here is a great discourse thread on different techniques that can be used to one hot encode data with or without using any external Julia package.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> x = Float32.(x);\n\njulia> y = vec(y);\n\njulia> custom_y_onehot = unique(y) .== permutedims(y)\n3×150 BitMatrix:\n 1 1 1 1 1 1 1 1 1 1 1 1 1 … 0 0 0 0 0 0 0 0 0 0 0 0\n 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"This same operation can also be performed using OneHotArrays' onehotbatch function. We will use both of these outputs parallelly to show how intuitive FluxML is!","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> const classes = [\"Iris-setosa\", \"Iris-versicolor\", \"Iris-virginica\"];\n\njulia> flux_y_onehot = onehotbatch(y, classes)\n3×150 OneHotMatrix(::Vector{UInt32}) with eltype Bool:\n 1 1 1 1 1 1 1 1 1 1 1 1 1 … ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅\n ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅\n ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1 1 1 1 1 1 1 1 1 1 1 1","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Our data is ready. 
The next step would be to build a classifier for the same.","category":"page"},{"location":"tutorials/logistic_regression/#Building-a-model","page":"Logistic Regression","title":"Building a model","text":"","category":"section"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"A logistic regression model is defined mathematically as -","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"model(x) = σ(Wx + b)","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"where W is the weight matrix, b is the bias vector, and σ is any activation function. For our case, let's use the softmax activation function as we will be performing a multiclass classification task.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> m(W, b, x) = W*x .+ b\nm (generic function with 1 method)","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Note that this model lacks an activation function, but we will come back to that.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"We can now move ahead to initialize the parameters of our model. 
Given that our model has four inputs (4 features in every data point), and three outputs (3 different classes), the parameters can be initialized in the following way -","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> W = rand(Float32, 3, 4);\n\njulia> b = [0.0f0, 0.0f0, 0.0f0];","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Now our model can take in the complete dataset and predict the class of each x in one go. But, we need to ensure that our model outputs the probabilities of an input belonging to the respective classes. As our model has three outputs, each would denote the probability of the input belonging to a particular class.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"We will use an activation function to map our outputs to a probability value. It would make sense to use a softmax activation function here, which is defined mathematically as -","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"σ(\\vec{x}) = \\frac{e^{z_i}}{\\sum_{j=1}^{k} e^{z_j}}","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"The softmax function scales down the outputs to probability values such that the sum of all the final outputs equals 1. Let's implement this in Julia.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> custom_softmax(x) = exp.(x) ./ sum(exp.(x), dims=1)\ncustom_softmax (generic function with 1 method)","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"The implementation looks straightforward enough! 
Note that we specify dims=1 in the sum function to calculate the sum of probabilities in each column. Remember, we will have a 3×150 matrix (predicted ys) as the output of our model, where each column would be an output of a corresponding input.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Let's combine this softmax function with our model to construct the complete custom_model.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> custom_model(W, b, x) = m(W, b, x) |> custom_softmax\ncustom_model (generic function with 1 method)","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Let's check if our model works.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> custom_model(W, b, x) |> size\n(3, 150)","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"It works! Let's check if the softmax function is working.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> all(0 .<= custom_model(W, b, x) .<= 1)\ntrue\n\njulia> sum(custom_model(W, b, x), dims=1)\n1×150 Matrix{Float32}:\n 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 … 1.0 1.0 1.0 1.0 1.0 1.0 1.0","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Every output value is between 0 and 1, and every column adds to 1!","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Let's convert our custom_model to a Flux model. 
Flux provides the users with a very elegant API that almost feels like writing your own code!","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Note, all the flux_* variables in this tutorial would be general, that is, they can be used as is with some other similar-looking dataset, but the custom_* variables will remain specific to this tutorial.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> flux_model = Chain(Dense(4 => 3), softmax)\nChain(\n Dense(4 => 3), # 15 parameters\n NNlib.softmax,\n)","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"A Dense(4 => 3) layer denotes a layer with four inputs (four features in every data point) and three outputs (three classes or labels). This layer is the same as the mathematical model defined by us above. Under the hood, Flux too calculates the output using the same expression, but we don't have to initialize the parameters ourselves this time; instead, Flux does it for us.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"The softmax function provided by NNlib.jl is re-exported by Flux, which has been used here. 
Lastly, Flux provides users with a Chain struct which makes stacking layers seamless.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"A model's weights and biases can be accessed as follows -","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> flux_model[1].weight, flux_model[1].bias\n(Float32[0.78588694 -0.45968163 -0.77409476 0.2358028; -0.9049773 -0.58643705 0.466441 -0.79523873; 0.82426906 0.4143493 0.7630932 0.020588955], Float32[0.0, 0.0, 0.0])","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"We can now pass the complete data in one go, with each data point having four features (four inputs)!","category":"page"},{"location":"tutorials/logistic_regression/#Loss-and-accuracy","page":"Logistic Regression","title":"Loss and accuracy","text":"","category":"section"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Our next step should be to define some quantitative values for our model, which we will maximize or minimize during the complete training procedure. 
These values will be the loss function and the accuracy metric.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Let's start by defining a loss function, a logitcrossentropy function.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> custom_logitcrossentropy(ŷ, y) = mean(.-sum(y .* logsoftmax(ŷ; dims = 1); dims = 1));","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Now we can wrap the custom_logitcrossentropy inside a function that takes in the model parameters, xs, and ys, and returns the loss value.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> function custom_loss(W, b, x, y)\n ŷ = custom_model(W, b, x)\n custom_logitcrossentropy(ŷ, y)\n end;\n\njulia> custom_loss(W, b, x, custom_y_onehot)\n1.1714406827505623","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"The loss function works!","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Flux provides us with many minimal yet elegant loss functions. In fact, the custom_logitcrossentropy defined above has been taken directly from Flux. 
The functions present in Flux include sanity checks, ensure efficient performance, and behave well with the overall FluxML ecosystem.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> function flux_loss(flux_model, x, y)\n ŷ = flux_model(x)\n Flux.logitcrossentropy(ŷ, y)\n end;\n\njulia> flux_loss(flux_model, x, flux_y_onehot)\n1.2156688659673647","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Next, let's define an accuracy function, which we will try to maximize during our training procedure. Before jumping to accuracy, let's define a onecold function. The onecold function would convert our output, which, remember, consists of probability values, to the actual class names.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"We can divide this task into two parts -","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Identify the index of the maximum element of each column in the output matrix\nConvert this index to a class name","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"The maximum index should be calculated along the columns (remember, each column is the output of a single x data point). 
We can use Julia's argmax function to achieve this.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> argmax(custom_y_onehot, dims=1) # calculate the cartesian index of max element column-wise\n1×150 Matrix{CartesianIndex{2}}:\n CartesianIndex(1, 1) CartesianIndex(1, 2) … CartesianIndex(3, 150)\n\njulia> max_idx = [x[1] for x in argmax(custom_y_onehot; dims=1)]\n1×150 Matrix{Int64}:\n 1 1 1 1 1 1 1 1 1 1 1 1 1 … 3 3 3 3 3 3 3 3 3 3 3 3","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Now we can write a function that calculates the indices of the maximum element in each column, and maps them to a class name.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> function custom_onecold(custom_y_onehot)\n max_idx = [x[1] for x in argmax(custom_y_onehot; dims=1)]\n vec(classes[max_idx])\n end;\n\njulia> custom_onecold(custom_y_onehot)\n150-element Vector{String}:\n \"Iris-setosa\"\n \"Iris-setosa\"\n \"Iris-setosa\"\n \"Iris-setosa\"\n \"Iris-setosa\"\n \"Iris-setosa\"\n \"Iris-setosa\"\n \"Iris-setosa\"\n \"Iris-setosa\"\n \"Iris-setosa\"\n ⋮\n \"Iris-virginica\"\n \"Iris-virginica\"\n \"Iris-virginica\"\n \"Iris-virginica\"\n \"Iris-virginica\"\n \"Iris-virginica\"\n \"Iris-virginica\"\n \"Iris-virginica\"\n \"Iris-virginica\"","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"It works!","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Flux provides users with the onecold function so that we don't have to write it on our own. 
Let's see how our custom_onecold function compares to Flux.onecold.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> istrue = Flux.onecold(flux_y_onehot, classes) .== custom_onecold(custom_y_onehot);\n\njulia> all(istrue)\ntrue","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Both the functions act identically!","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"We now move to the accuracy metric and run it with the untrained custom_model.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> custom_accuracy(W, b, x, y) = mean(custom_onecold(custom_model(W, b, x)) .== y);\n\njulia> custom_accuracy(W, b, x, y)\n0.3333333333333333","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"We could also have used Flux's built-in functionality to define this accuracy function.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> flux_accuracy(x, y) = mean(Flux.onecold(flux_model(x), classes) .== y);\n\njulia> flux_accuracy(x, y)\n0.24","category":"page"},{"location":"tutorials/logistic_regression/#Training-the-model","page":"Logistic Regression","title":"Training the model","text":"","category":"section"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Let's train our model using the classic Gradient Descent algorithm. 
According to the gradient descent algorithm, the weights and biases should be iteratively updated using the following mathematical equations -","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"\\begin{aligned}\nW &= W - \\eta * \\frac{dL}{dW} \\\\\nb &= b - \\eta * \\frac{dL}{db}\n\\end{aligned}","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Here, W is the weight matrix, b is the bias vector, \\eta is the learning rate, \\frac{dL}{dW} is the derivative of the loss function with respect to the weight, and \\frac{dL}{db} is the derivative of the loss function with respect to the bias.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"The derivatives are calculated using an Automatic Differentiation tool, and Flux uses Zygote.jl for the same. Since Zygote.jl is an independent Julia package, it can be used outside of Flux as well! Refer to the documentation of Zygote.jl for more information on the same.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Our first step would be to obtain the gradient of the loss function with respect to the weights and the biases. Flux re-exports Zygote's gradient function; hence, we don't need to import Zygote explicitly to use the functionality. gradient takes in a function and its arguments, and returns a tuple containing ∂f/∂x for each argument x. Let's pass in custom_loss and the arguments required by custom_loss to gradient. 
We will require the derivatives of the loss function (custom_loss) with respect to the weights (∂f/∂w) and the bias (∂f/∂b) to carry out gradient descent, but we can ignore the partial derivatives of the loss function (custom_loss) with respect to x (∂f/∂x) and one hot encoded y (∂f/∂y).","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> dLdW, dLdb, _, _ = gradient(custom_loss, W, b, x, custom_y_onehot);","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"We can now update the parameters, following the gradient descent algorithm -","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> W .= W .- 0.1 .* dLdW;\n\njulia> b .= b .- 0.1 .* dLdb;","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"The parameters have been updated! We can now check the value of our custom loss function -","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> custom_loss(W, b, x, custom_y_onehot)\n1.164742997664842","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"The loss went down! 
Let's plug our super training logic inside a function.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> function train_custom_model()\n dLdW, dLdb, _, _ = gradient(custom_loss, W, b, x, custom_y_onehot)\n W .= W .- 0.1 .* dLdW\n b .= b .- 0.1 .* dLdb\n end;","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"We can plug the training function inside a loop and train the model for more epochs. The loop can be tailored to suit the user's needs, and the conditions can be specified in plain Julia. Here we will train the model for a maximum of 500 epochs, but to ensure that the model does not overfit, we will break as soon as our accuracy value crosses or becomes equal to 0.98.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> for i = 1:500\n train_custom_model();\n custom_accuracy(W, b, x, y) >= 0.98 && break\n end\n \njulia> @show custom_accuracy(W, b, x, y);\ncustom_accuracy(W, b, x, y) = 0.98","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Everything works! Our model achieved an accuracy of 0.98! Let's have a look at the loss.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> custom_loss(W, b, x, custom_y_onehot)\n0.6520349798243569","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"As expected, the loss went down too! 
Now, let's repeat the same steps with our flux_model.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"We can write a similar-looking training loop for our flux_model and train it similarly.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> flux_loss(flux_model, x, flux_y_onehot)\n1.215731131385928\n\njulia> function train_flux_model()\n dLdm, _, _ = gradient(flux_loss, flux_model, x, flux_y_onehot)\n @. flux_model[1].weight = flux_model[1].weight - 0.1 * dLdm[:layers][1][:weight]\n @. flux_model[1].bias = flux_model[1].bias - 0.1 * dLdm[:layers][1][:bias]\n end;\n\njulia> for i = 1:500\n train_flux_model();\n flux_accuracy(x, y) >= 0.98 && break\n end","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Looking at the accuracy and loss value -","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"julia> @show flux_accuracy(x, y);\nflux_accuracy(x, y) = 0.98\n\njulia> flux_loss(flux_model, x, flux_y_onehot)\n0.6952386604624324","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"We see a very similar final loss and accuracy.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Summarising this tutorial, we saw how we can run a logistic regression algorithm in Julia with and without using Flux. We started by importing the classic Iris dataset, and one hot encoded the labels. 
Next, we defined our model, the loss function, and the accuracy, all by ourselves.","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"Finally, we trained the model by manually writing down the Gradient Descent algorithm and optimising the loss. Interestingly, we implemented most of the functions on our own, and then compared them side by side with the functionalities provided by Flux!","category":"page"},{"location":"tutorials/logistic_regression/","page":"Logistic Regression","title":"Logistic Regression","text":"info: Info\nOriginally published on 1st April 2023, by Saransh Chopra.","category":"page"},{"location":"tutorials/model_zoo/#Model-Zoo","page":"Model Zoo","title":"Model Zoo","text":"","category":"section"},{"location":"tutorials/model_zoo/","page":"Model Zoo","title":"Model Zoo","text":"The model zoo is a collection of examples that demonstrate how to build and train models using Flux. The examples are organised by domain and include vision, text, and audio. 
Each example includes a description of the model, the data used, and the training process.","category":"page"},{"location":"tutorials/model_zoo/","page":"Model Zoo","title":"Model Zoo","text":"Some of the examples are pedagogical, see for instance","category":"page"},{"location":"tutorials/model_zoo/","page":"Model Zoo","title":"Model Zoo","text":"Multilayer Perceptron\nSimple Convolutional Neural Network","category":"page"},{"location":"tutorials/model_zoo/","page":"Model Zoo","title":"Model Zoo","text":"Others are more advanced, see for instance","category":"page"},{"location":"tutorials/model_zoo/","page":"Model Zoo","title":"Model Zoo","text":"Variational Autoencoder","category":"page"},{"location":"guide/models/custom_layers/#man-advanced","page":"Custom Layers","title":"Defining Customised Layers","text":"","category":"section"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Here we will try and describe usage of some more advanced features that Flux provides to give more control over model building.","category":"page"},{"location":"guide/models/custom_layers/#Custom-Model-Example","page":"Custom Layers","title":"Custom Model Example","text":"","category":"section"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Here is a basic example of a custom model. It simply adds the input to the result from the neural network.","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"struct CustomModel{T <: Chain} # Parameter to avoid type instability\n chain::T\nend\n\nfunction (m::CustomModel)(x)\n # Arbitrary code can go here, but note that everything will be differentiated.\n # Zygote does not allow some operations, like mutating arrays.\n\n return m.chain(x) + x\nend\n\n# Call @layer to allow for training. 
Described below in more detail.\nFlux.@layer CustomModel","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Notice that we parameterized the type of the chain field. This is necessary for fast Julia code, so that that struct field can be given a concrete type. Chains have a type parameter fully specifying the types of the layers they contain. By using a type parameter, we are freeing Julia to determine the correct concrete type, so that we do not need to specify the full, possibly quite long, type ourselves.","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"You can then use the model like:","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"chain = Chain(Dense(10 => 10, relu), Dense(10 => 10))\nmodel = CustomModel(chain)\nmodel(rand(Float32, 10))","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"For an intro to Flux and automatic differentiation, see this tutorial.","category":"page"},{"location":"guide/models/custom_layers/#Customising-Parameter-Collection-for-a-Model","page":"Custom Layers","title":"Customising Parameter Collection for a Model","text":"","category":"section"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Taking reference from our example Affine layer from the basics.","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"By default all the fields in the Affine type are collected as its parameters, however, in some cases it may be desired to hold other metadata in our \"layers\" that may not be needed for training, and are hence supposed to be ignored while the parameters are collected. 
With Flux, the way to mark some fields of our layer as trainable is through overloading the trainable function:","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"julia> struct Affine\n W\n b\n end\n\njulia> Affine(in::Int, out::Int) = Affine(randn(out, in), randn(out));\n\njulia> (m::Affine)(x) = m.W * x .+ m.b;\n\njulia> Flux.@layer Affine\n\njulia> a = Affine(Float32[1 2; 3 4; 5 6], Float32[7, 8, 9])\nAffine(Float32[1.0 2.0; 3.0 4.0; 5.0 6.0], Float32[7.0, 8.0, 9.0])\n\njulia> Flux.trainable(a) # default behavior\n(W = Float32[1.0 2.0; 3.0 4.0; 5.0 6.0], b = Float32[7.0, 8.0, 9.0])\n\njulia> Flux.trainable(a::Affine) = (; W = a.W) # returns a NamedTuple using the field's name\n\njulia> Flux.trainable(a)\n(W = Float32[1.0 2.0; 3.0 4.0; 5.0 6.0],)","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Only the fields returned by trainable will be seen by Flux.setup and Flux.update! for training. But all fields will be seen by gpu and similar functions, for example:","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"julia> a |> f16\nAffine(Float16[1.0 2.0; 3.0 4.0; 5.0 6.0], Float16[7.0, 8.0, 9.0])","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Note that there is no need to overload trainable to hide fields which do not contain numerical arrays (for example, activation functions, or Boolean flags). 
These are always ignored by training.","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"The exact same method of trainable can also be defined using the macro, for convenience:","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Flux.@layer Affine trainable=(W,)","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"There is a second, more severe, kind of restriction possible. This is not recommended, but is included here for completeness. Calling Functors.@functor Affine (W,) means that no exploration of the model will ever visit the other fields: They will not be moved to the GPU by gpu, and their precision will not be changed by f32. This requires the struct to have a corresponding constructor that accepts only W as an argument.","category":"page"},{"location":"guide/models/custom_layers/#Custom-multiple-input-or-output-layer","page":"Custom Layers","title":"Custom multiple input or output layer","text":"","category":"section"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Sometimes a model needs to receive several separate inputs at once or produce several separate outputs at once. In other words, there are multiple paths within this high-level layer, each processing a different input or producing a different output. A simple example of this in machine learning literature is the inception module.","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"We could have a struct that stores the weights along each path and implement the joining/splitting in the forward pass function. That would mean a new struct for each different block, e.g. 
one would have a TransformerBlock struct for a transformer block, and a ResNetBlock struct for a ResNet block, each block being composed of smaller sub-blocks. This is often the simplest and cleanest way to implement complex models.","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"This guide instead will show you how to construct a high-level layer (like Chain) that is made of multiple sub-layers for each path.","category":"page"},{"location":"guide/models/custom_layers/#Multiple-inputs:-a-custom-Join-layer","page":"Custom Layers","title":"Multiple inputs: a custom Join layer","text":"","category":"section"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Our custom Join layer will accept multiple inputs at once, pass each input through a separate path, then combine the results together. Note that this layer can already be constructed using Parallel, but we will first walk through how to do this manually.","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"We start by defining a new struct, Join, that stores the different paths and a combine operation as its fields.","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"using Flux\nusing CUDA\n\n# custom join layer\nstruct Join{T, F}\n combine::F\n paths::T\nend\n\n# allow Join(op, m1, m2, ...) as a constructor\nJoin(combine, paths...) = Join(combine, paths)","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Notice again that we parameterized the type of the combine and paths fields. 
In addition to the performance considerations of concrete types, this allows either field to be Vectors, Tuples, or one of each - we don't need to pay attention to which.","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"The next step is to use Flux.@layer to make our struct behave like a Flux layer. This is important so that calling Flux.setup on a Join maps over the underlying trainable arrays on each path.","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Flux.@layer Join","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Finally, we define the forward pass. For Join, this means applying each path in paths to each input array, then using combine to merge the results.","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"(m::Join)(xs::Tuple) = m.combine(map((f, x) -> f(x), m.paths, xs)...)\n(m::Join)(xs...) = m(xs)","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Lastly, we can test our new layer. 
Thanks to the proper abstractions in Julia, our layer works on GPU arrays out of the box!","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"model = Chain(\n Join(vcat,\n Chain(Dense(1 => 5, relu), Dense(5 => 1)), # branch 1\n Dense(1 => 2), # branch 2\n Dense(1 => 1) # branch 3\n ),\n Dense(4 => 1)\n ) |> gpu\n\nxs = map(gpu, (rand(1), rand(1), rand(1)))\n\nmodel(xs)\n# returns a single float vector with one value","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"note: Note\nThis Join layer is available from the Fluxperimental.jl package.","category":"page"},{"location":"guide/models/custom_layers/#Using-Parallel","page":"Custom Layers","title":"Using Parallel","text":"","category":"section"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Flux already provides Parallel that can offer the same functionality. In this case, Join is going to just be syntactic sugar for Parallel.","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Join(combine, paths) = Parallel(combine, paths)\nJoin(combine, paths...) 
= Join(combine, paths)\n\n# use vararg/tuple version of Parallel forward pass\nmodel = Chain(\n Join(vcat,\n Chain(Dense(1 => 5, relu), Dense(5 => 1)),\n Dense(1 => 2),\n Dense(1 => 1)\n ),\n Dense(4 => 1)\n ) |> gpu\n\nxs = map(gpu, (rand(1), rand(1), rand(1)))\n\nmodel(xs)\n# returns a single float vector with one value","category":"page"},{"location":"guide/models/custom_layers/#Multiple-outputs:-a-custom-Split-layer","page":"Custom Layers","title":"Multiple outputs: a custom Split layer","text":"","category":"section"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Our custom Split layer will accept a single input, then pass the input through a separate path to produce multiple outputs.","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"We start by following the same steps as the Join layer: define a struct, use Flux.@layer, and define the forward pass.","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"using Flux\nusing CUDA\n\n# custom split layer\nstruct Split{T}\n paths::T\nend\n\nSplit(paths...) 
= Split(paths)\n\nFlux.@layer Split\n\n(m::Split)(x::AbstractArray) = map(f -> f(x), m.paths)","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"Now we can test to see that our Split does indeed produce multiple outputs.","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"model = Chain(\n Dense(10 => 5),\n Split(Dense(5 => 1, tanh), Dense(5 => 3, tanh), Dense(5 => 2))\n ) |> gpu\n\nmodel(gpu(rand(10)))\n# returns a tuple with three float vectors","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"A custom loss function for the multiple outputs may look like this:","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"using Statistics\n\n# assuming model returns the output of a Split\n# x is a single input\n# ys is a tuple of outputs\nfunction loss(x, ys, model)\n # rms over all the mse\n ŷs = model(x)\n return sqrt(mean(Flux.mse(y, ŷ) for (y, ŷ) in zip(ys, ŷs)))\nend","category":"page"},{"location":"guide/models/custom_layers/","page":"Custom Layers","title":"Custom Layers","text":"note: Note\nThis Split layer is available from the Fluxperimental.jl package.","category":"page"},{"location":"reference/data/mlutils/#Working-with-Data,-using-MLUtils.jl","page":"Batching Data – MLUtils.jl","title":"Working with Data, using MLUtils.jl","text":"","category":"section"},{"location":"reference/data/mlutils/","page":"Batching Data – MLUtils.jl","title":"Batching Data – MLUtils.jl","text":"Flux re-exports the DataLoader type and utility functions for working with data from MLUtils.","category":"page"},{"location":"reference/data/mlutils/#DataLoader","page":"Batching Data – MLUtils.jl","title":"DataLoader","text":"","category":"section"},{"location":"reference/data/mlutils/","page":"Batching Data – 
MLUtils.jl","title":"Batching Data – MLUtils.jl","text":"The DataLoader can be used to create mini-batches of data, in the format train! expects.","category":"page"},{"location":"reference/data/mlutils/","page":"Batching Data – MLUtils.jl","title":"Batching Data – MLUtils.jl","text":"MLUtils.DataLoader","category":"page"},{"location":"reference/data/mlutils/#MLUtils.DataLoader","page":"Batching Data – MLUtils.jl","title":"MLUtils.DataLoader","text":"DataLoader(data; [batchsize, buffer, collate, parallel, partial, rng, shuffle])\n\nAn object that iterates over mini-batches of data, each mini-batch containing batchsize observations (except possibly the last one).\n\nTakes as input a single data array, a tuple (or a named tuple) of arrays, or in general any data object that implements the numobs and getobs methods.\n\nThe last dimension in each array is the observation dimension, i.e. the one divided into mini-batches.\n\nThe original data is preserved in the data field of the DataLoader.\n\nArguments\n\ndata: The data to be iterated over. The data type has to be supported by numobs and getobs.\nbatchsize: If less than 0, iterates over individual observations. Otherwise, each iteration (except possibly the last) yields a mini-batch containing batchsize observations. Default 1.\nbuffer: If buffer=true and supported by the type of data, a buffer will be allocated and reused for memory efficiency. You can also pass a preallocated object to buffer. Default false.\ncollate: Batching behavior. If nothing (default), a batch is getobs(data, indices). If false, each batch is [getobs(data, i) for i in indices]. When true, applies batch to the vector of observations in a batch, recursively collating arrays in the last dimensions. See batch for more information and examples.\nparallel: Whether to load data in parallel using worker threads. Greatly speeds up data loading by a factor of the number of available threads. Requires starting Julia with multiple threads. 
Check Threads.nthreads() to see the number of available threads. Passing parallel = true breaks ordering guarantees. Default false.\npartial: This argument is used only when batchsize > 0. If partial=false and the number of observations is not divisible by the batchsize, then the last mini-batch is dropped. Default true.\nrng: A random number generator. Default Random.GLOBAL_RNG.\nshuffle: Whether to shuffle the observations before iterating. Unlike wrapping the data container with shuffleobs(data), shuffle=true ensures that the observations are shuffled anew every time you start iterating over eachobs. Default false.\n\nExamples\n\njulia> Xtrain = rand(10, 100);\n\njulia> array_loader = DataLoader(Xtrain, batchsize=2);\n\njulia> for x in array_loader\n @assert size(x) == (10, 2)\n # do something with x, 50 times\n end\n\njulia> array_loader.data === Xtrain\ntrue\n\njulia> tuple_loader = DataLoader((Xtrain,), batchsize=2); # similar, but yielding 1-element tuples\n\njulia> for x in tuple_loader\n @assert x isa Tuple{Matrix}\n @assert size(x[1]) == (10, 2)\n end\n\njulia> Ytrain = rand('a':'z', 100); # now make a DataLoader yielding 2-element named tuples\n\njulia> train_loader = DataLoader((data=Xtrain, label=Ytrain), batchsize=5, shuffle=true);\n\njulia> for epoch in 1:100\n for (x, y) in train_loader # access via tuple destructuring\n @assert size(x) == (10, 5)\n @assert size(y) == (5,)\n # loss += f(x, y) # etc, runs 100 * 20 times\n end\n end\n\njulia> first(train_loader).label isa Vector{Char} # access via property name\ntrue\n\njulia> first(train_loader).label == Ytrain[1:5] # because of shuffle=true\nfalse\n\njulia> foreach(println∘summary, DataLoader(rand(Int8, 10, 64), batchsize=30)) # partial=false would omit last\n10×30 Matrix{Int8}\n10×30 Matrix{Int8}\n10×4 Matrix{Int8}\n\n\n\n\n\n","category":"type"},{"location":"reference/data/mlutils/#Utility-Functions","page":"Batching Data – MLUtils.jl","title":"Utility 
Functions","text":"","category":"section"},{"location":"reference/data/mlutils/","page":"Batching Data – MLUtils.jl","title":"Batching Data – MLUtils.jl","text":"The utility functions are meant to be used while working with data; these functions help create inputs for your models or batch your dataset.","category":"page"},{"location":"reference/data/mlutils/","page":"Batching Data – MLUtils.jl","title":"Batching Data – MLUtils.jl","text":"MLUtils.batch\nMLUtils.batchsize\nMLUtils.batchseq\nMLUtils.BatchView\nMLUtils.chunk\nMLUtils.eachobs\nMLUtils.fill_like\nMLUtils.filterobs\nMLUtils.flatten\nMLUtils.getobs\nMLUtils.getobs!\nMLUtils.joinobs\nMLUtils.group_counts\nMLUtils.group_indices\nMLUtils.groupobs\nMLUtils.kfolds\nMLUtils.leavepout\nMLUtils.mapobs\nMLUtils.numobs\nMLUtils.normalise\nMLUtils.obsview\nMLUtils.ObsView\nMLUtils.ones_like\nMLUtils.oversample\nMLUtils.randobs\nMLUtils.rand_like\nMLUtils.randn_like\nMLUtils.rpad_constant\nMLUtils.shuffleobs\nMLUtils.splitobs\nMLUtils.unbatch\nMLUtils.undersample\nMLUtils.unsqueeze\nMLUtils.unstack\nMLUtils.zeros_like","category":"page"},{"location":"reference/data/mlutils/#MLUtils.batch","page":"Batching Data – MLUtils.jl","title":"MLUtils.batch","text":"batch(xs)\n\nBatch the arrays in xs into a single array with an extra dimension.\n\nIf the elements of xs are tuples, named tuples, or dicts, the output will be of the same type. 
\n\nSee also unbatch.\n\nExamples\n\njulia> batch([[1,2,3], \n [4,5,6]])\n3×2 Matrix{Int64}:\n 1 4\n 2 5\n 3 6\n\njulia> batch([(a=[1,2], b=[3,4])\n (a=[5,6], b=[7,8])]) \n(a = [1 5; 2 6], b = [3 7; 4 8])\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.batchsize","page":"Batching Data – MLUtils.jl","title":"MLUtils.batchsize","text":"batchsize(data::BatchView) -> Int\n\nReturn the fixed size of each batch in data.\n\nExamples\n\nusing MLUtils\nX, Y = MLUtils.load_iris()\n\nA = BatchView(X, batchsize=30)\n@assert batchsize(A) == 30\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.batchseq","page":"Batching Data – MLUtils.jl","title":"MLUtils.batchseq","text":"batchseq(seqs, val = 0)\n\nTake a list of N sequences, and turn them into a single sequence where each item is a batch of N. Short sequences will be padded by val.\n\nExamples\n\njulia> batchseq([[1, 2, 3], [4, 5]], 0)\n3-element Vector{Vector{Int64}}:\n [1, 4]\n [2, 5]\n [3, 0]\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.BatchView","page":"Batching Data – MLUtils.jl","title":"MLUtils.BatchView","text":"BatchView(data, batchsize; partial=true, collate=nothing)\nBatchView(data; batchsize=1, partial=true, collate=nothing)\n\nCreate a view of the given data that represents it as a vector of batches. Each batch will contain an equal amount of observations in them. The batch-size can be specified using the parameter batchsize. In the case that the size of the dataset is not dividable by the specified batchsize, the remaining observations will be ignored if partial=false. 
If partial=true instead the last batch-size can be slightly smaller.\n\nNote that any data access is delayed until getindex is called.\n\nIf used as an iterator, the object will iterate over the dataset once, effectively denoting an epoch.\n\nFor BatchView to work on some data structure, the type of the given variable data must implement the data container interface. See ObsView for more info.\n\nArguments\n\ndata : The object describing the dataset. Can be of any type as long as it implements getobs and numobs (see Details for more information).\nbatchsize : The batch-size of each batch. It is the number of observations that each batch must contain (except possibly for the last one).\npartial : If partial=false and the number of observations is not divisible by the batch-size, then the last mini-batch is dropped.\ncollate: Batching behavior. If nothing (default), a batch is getobs(data, indices). If false, each batch is [getobs(data, i) for i in indices]. When true, applies batch to the vector of observations in a batch, recursively collating arrays in the last dimensions. 
See batch for more information and examples.\n\nExamples\n\nusing MLUtils\nX, Y = MLUtils.load_iris()\n\nA = BatchView(X, batchsize=30)\n@assert typeof(A) <: BatchView <: AbstractVector\n@assert eltype(A) <: SubArray{Float64,2}\n@assert length(A) == 5 # Iris has 150 observations\n@assert size(A[1]) == (4,30) # Iris has 4 features\n\n# 5 batches of size 30 observations\nfor x in BatchView(X, batchsize=30)\n @assert typeof(x) <: SubArray{Float64,2}\n @assert numobs(x) === 30\nend\n\n# 7 batches of size 20 observations\n# Note that the iris dataset has 150 observations,\n# which means that with a batchsize of 20, the last\n# 10 observations will be ignored\nfor (x, y) in BatchView((X, Y), batchsize=20, partial=false)\n @assert typeof(x) <: SubArray{Float64,2}\n @assert typeof(y) <: SubArray{String,1}\n @assert numobs(x) == numobs(y) == 20\nend\n\n# collate tuple observations\nfor (x, y) in BatchView((rand(10, 3), [\"a\", \"b\", \"c\"]), batchsize=2, collate=true, partial=false)\n @assert size(x) == (10, 2)\n @assert size(y) == (2,)\nend\n\n\n# randomly assign observations to one and only one batch.\nfor (x, y) in BatchView(shuffleobs((X, Y)), batchsize=20)\n @assert typeof(x) <: SubArray{Float64,2}\n @assert typeof(y) <: SubArray{String,1}\nend\n\n\n\n\n\n","category":"type"},{"location":"reference/data/mlutils/#MLUtils.chunk","page":"Batching Data – MLUtils.jl","title":"MLUtils.chunk","text":"chunk(x, n; [dims])\nchunk(x; [size, dims])\n\nSplit x into n parts or alternatively, if size is an integer, into equal chunks of size size. 
The parts contain the same number of elements except possibly for the last one that can be smaller.\n\nIn case size is a collection of integers instead, the elements of x are split into chunks of the given sizes.\n\nIf x is an array, dims can be used to specify along which dimension to split (defaults to the last dimension).\n\nExamples\n\njulia> chunk(1:10, 3)\n3-element Vector{UnitRange{Int64}}:\n 1:4\n 5:8\n 9:10\n\njulia> chunk(1:10; size = 2)\n5-element Vector{UnitRange{Int64}}:\n 1:2\n 3:4\n 5:6\n 7:8\n 9:10\n\njulia> x = reshape(collect(1:20), (5, 4))\n5×4 Matrix{Int64}:\n 1 6 11 16\n 2 7 12 17\n 3 8 13 18\n 4 9 14 19\n 5 10 15 20\n\njulia> xs = chunk(x, 2, dims=1)\n2-element Vector{SubArray{Int64, 2, Matrix{Int64}, Tuple{UnitRange{Int64}, Base.Slice{Base.OneTo{Int64}}}, false}}:\n [1 6 11 16; 2 7 12 17; 3 8 13 18]\n [4 9 14 19; 5 10 15 20]\n\njulia> xs[1]\n3×4 view(::Matrix{Int64}, 1:3, :) with eltype Int64:\n 1 6 11 16\n 2 7 12 17\n 3 8 13 18\n\njulia> xes = chunk(x; size = 2, dims = 2)\n2-element Vector{SubArray{Int64, 2, Matrix{Int64}, Tuple{Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}}, true}}:\n [1 6; 2 7; … ; 4 9; 5 10]\n [11 16; 12 17; … ; 14 19; 15 20]\n\njulia> xes[2]\n5×2 view(::Matrix{Int64}, :, 3:4) with eltype Int64:\n 11 16\n 12 17\n 13 18\n 14 19\n 15 20\n\njulia> chunk(1:6; size = [2, 4])\n2-element Vector{UnitRange{Int64}}:\n 1:2\n 3:6\n\n\n\n\n\nchunk(x, partition_idxs; [npartitions, dims])\n\nPartition the array x along the dimension dims according to the indexes in partition_idxs.\n\npartition_idxs must be sorted and contain only positive integers between 1 and the number of partitions. 
\n\nIf the number of partition npartitions is not provided, it is inferred from partition_idxs.\n\nIf dims is not provided, it defaults to the last dimension.\n\nSee also unbatch.\n\nExamples\n\njulia> x = reshape([1:10;], 2, 5)\n2×5 Matrix{Int64}:\n 1 3 5 7 9\n 2 4 6 8 10\n\njulia> chunk(x, [1, 2, 2, 3, 3])\n3-element Vector{SubArray{Int64, 2, Matrix{Int64}, Tuple{Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}}, true}}:\n [1; 2;;]\n [3 5; 4 6]\n [7 9; 8 10]\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.eachobs","page":"Batching Data – MLUtils.jl","title":"MLUtils.eachobs","text":"eachobs(data; kws...)\n\nReturn an iterator over data.\n\nSupports the same arguments as DataLoader. The batchsize default is -1 here while it is 1 for DataLoader.\n\nExamples\n\nX = rand(4,100)\n\nfor x in eachobs(X)\n # loop entered 100 times\n @assert typeof(x) <: Vector{Float64}\n @assert size(x) == (4,)\nend\n\n# mini-batch iterations\nfor x in eachobs(X, batchsize=10)\n # loop entered 10 times\n @assert typeof(x) <: Matrix{Float64}\n @assert size(x) == (4,10)\nend\n\n# support for tuples, named tuples, dicts\nfor (x, y) in eachobs((X, Y))\n # ...\nend\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.fill_like","page":"Batching Data – MLUtils.jl","title":"MLUtils.fill_like","text":"fill_like(x, val, [element_type=eltype(x)], [dims=size(x)]))\n\nCreate an array with the given element type and size, based upon the given source array x. All element of the new array will be set to val. The third and fourth arguments are both optional, defaulting to the given array's eltype and size. 
The dimensions may be specified as an integer or as a tuple argument.\n\nSee also zeros_like and ones_like.\n\nExamples\n\njulia> x = rand(Float32, 2)\n2-element Vector{Float32}:\n 0.16087806\n 0.89916044\n\njulia> fill_like(x, 1.7, (3, 3))\n3×3 Matrix{Float32}:\n 1.7 1.7 1.7\n 1.7 1.7 1.7\n 1.7 1.7 1.7\n\njulia> using CUDA\n\njulia> x = CUDA.rand(2, 2)\n2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:\n 0.803167 0.476101\n 0.303041 0.317581\n\njulia> fill_like(x, 1.7, Float64)\n2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:\n 1.7 1.7\n 1.7 1.7\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.filterobs","page":"Batching Data – MLUtils.jl","title":"MLUtils.filterobs","text":"filterobs(f, data)\n\nReturn a subset of data container data including all indices i for which f(getobs(data, i)) === true.\n\ndata = 1:10\nnumobs(data) == 10\nfdata = filterobs(>(5), data)\nnumobs(fdata) == 5\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.flatten","page":"Batching Data – MLUtils.jl","title":"MLUtils.flatten","text":"flatten(x::AbstractArray)\n\nReshape arbitrarily-shaped input into a matrix-shaped output, preserving the size of the last dimension.\n\nSee also unsqueeze.\n\nExamples\n\njulia> rand(3,4,5) |> flatten |> size\n(12, 5)\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.getobs","page":"Batching Data – MLUtils.jl","title":"MLUtils.getobs","text":"getobs(data, [idx])\n\nReturn the observations corresponding to the observation index idx. Note that idx can be any type as long as data has defined getobs for that type. If idx is not provided, then materialize all observations in data.\n\nIf data does not have getobs defined, then in the case of Tables.table(data) == true returns the row(s) in position idx, otherwise returns data[idx].\n\nAuthors of custom data containers should implement Base.getindex for their type instead of getobs. 
getobs should only be implemented for types where there is a difference between getobs and Base.getindex (such as multi-dimensional arrays).\n\nThe returned observation(s) should be in the form intended to be passed as-is to some learning algorithm. There is no strict interface requirement on how this \"actual data\" must look like. Every author behind some custom data container can make this decision themselves. The output should be consistent when idx is a scalar vs vector.\n\ngetobs supports by default nested combinations of array, tuple, named tuples, and dictionaries. \n\nSee also getobs! and numobs.\n\nExamples\n\n# named tuples \nx = (a = [1, 2, 3], b = rand(6, 3))\n\ngetobs(x, 2) == (a = 2, b = x.b[:, 2])\ngetobs(x, [1, 3]) == (a = [1, 3], b = x.b[:, [1, 3]])\n\n\n# dictionaries\nx = Dict(:a => [1, 2, 3], :b => rand(6, 3))\n\ngetobs(x, 2) == Dict(:a => 2, :b => x[:b][:, 2])\ngetobs(x, [1, 3]) == Dict(:a => [1, 3], :b => x[:b][:, [1, 3]])\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.getobs!","page":"Batching Data – MLUtils.jl","title":"MLUtils.getobs!","text":"getobs!(buffer, data, idx)\n\nInplace version of getobs(data, idx). If this method is defined for the type of data, then buffer should be used to store the result, instead of allocating a dedicated object.\n\nImplementing this function is optional. In the case no such method is provided for the type of data, then buffer will be ignored and the result of getobs returned. This could be because the type of data may not lend itself to the concept of copy!. Thus, supporting a custom getobs! is optional and not required.\n\nSee also getobs and numobs. 
\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.joinobs","page":"Batching Data – MLUtils.jl","title":"MLUtils.joinobs","text":"joinobs(datas...)\n\nConcatenate data containers datas.\n\ndata1, data2 = 1:10, 11:20\njdata = joinobs(data1, data2)\ngetobs(jdata, 15) == 15\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.group_counts","page":"Batching Data – MLUtils.jl","title":"MLUtils.group_counts","text":"group_counts(x)\n\nCount the number of times that each element of x appears.\n\nSee also group_indices.\n\nExamples\n\njulia> group_counts(['a', 'b', 'b'])\nDict{Char, Int64} with 2 entries:\n 'a' => 1\n 'b' => 2\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.group_indices","page":"Batching Data – MLUtils.jl","title":"MLUtils.group_indices","text":"group_indices(x) -> Dict\n\nComputes the indices of elements in the vector x for each distinct value contained. This information is useful for resampling strategies, such as stratified sampling.\n\nSee also group_counts.\n\nExamples\n\njulia> x = [:yes, :no, :maybe, :yes];\n\njulia> group_indices(x)\nDict{Symbol, Vector{Int64}} with 3 entries:\n :yes => [1, 4]\n :maybe => [3]\n :no => [2]\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.groupobs","page":"Batching Data – MLUtils.jl","title":"MLUtils.groupobs","text":"groupobs(f, data)\n\nSplit data container data into different data containers, grouping observations by f(obs).\n\ndata = -10:10\ndatas = groupobs(>(0), data)\nlength(datas) == 2\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.kfolds","page":"Batching Data – MLUtils.jl","title":"MLUtils.kfolds","text":"kfolds(n::Integer, k = 5) -> Tuple\n\nCompute the train/validation assignments for k repartitions of n observations, and return them in the form of two vectors. 
The first vector contains the index-vectors for the training subsets, and the second vector the index-vectors for the validation subsets respectively. A general rule of thumb is to use either k = 5 or k = 10. The following code snippet generates the indices assignments for k = 5\n\njulia> train_idx, val_idx = kfolds(10, 5);\n\nEach observation is assigned to the validation subset once (and only once). Thus, a union over all validation index-vectors reproduces the full range 1:n. Note that there is no random assignment of observations to subsets, which means that adjacent observations are likely to be part of the same validation subset.\n\njulia> train_idx\n5-element Array{Array{Int64,1},1}:\n [3,4,5,6,7,8,9,10]\n [1,2,5,6,7,8,9,10]\n [1,2,3,4,7,8,9,10]\n [1,2,3,4,5,6,9,10]\n [1,2,3,4,5,6,7,8]\n\njulia> val_idx\n5-element Array{UnitRange{Int64},1}:\n 1:2\n 3:4\n 5:6\n 7:8\n 9:10\n\n\n\n\n\nkfolds(data, [k = 5])\n\nRepartition a data container k times using a k folds strategy and return the sequence of folds as a lazy iterator. Only data subsets are created, which means that no actual data is copied until getobs is invoked.\n\nConceptually, a k-folds repartitioning strategy divides the given data into k roughly equal-sized parts. Each part will serve as validation set once, while the remaining parts are used for training. This results in k different partitions of data.\n\nIn the case that the size of the dataset is not dividable by the specified k, the remaining observations will be evenly distributed among the parts.\n\nfor (x_train, x_val) in kfolds(X, k=10)\n # code called 10 times\n # nobs(x_val) may differ up to ±1 over iterations\nend\n\nMultiple variables are supported (e.g. for labeled data)\n\nfor ((x_train, y_train), val) in kfolds((X, Y), k=10)\n # ...\nend\n\nBy default the folds are created using static splits. 
Use shuffleobs to randomly assign observations to the folds.\n\nfor (x_train, x_val) in kfolds(shuffleobs(X), k = 10)\n # ...\nend\n\nSee leavepout for a related function.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.leavepout","page":"Batching Data – MLUtils.jl","title":"MLUtils.leavepout","text":"leavepout(n::Integer, [size = 1]) -> Tuple\n\nCompute the train/validation assignments for k ≈ n/size repartitions of n observations, and return them in the form of two vectors. The first vector contains the index-vectors for the training subsets, and the second vector the index-vectors for the validation subsets respectively. Each validation subset will have either size or size+1 observations assigned to it. The following code snippet generates the index-vectors for size = 2.\n\njulia> train_idx, val_idx = leavepout(10, 2);\n\nEach observation is assigned to the validation subset once (and only once). Thus, a union over all validation index-vectors reproduces the full range 1:n. Note that there is no random assignment of observations to subsets, which means that adjacent observations are likely to be part of the same validation subset.\n\njulia> train_idx\n5-element Array{Array{Int64,1},1}:\n [3,4,5,6,7,8,9,10]\n [1,2,5,6,7,8,9,10]\n [1,2,3,4,7,8,9,10]\n [1,2,3,4,5,6,9,10]\n [1,2,3,4,5,6,7,8]\n\njulia> val_idx\n5-element Array{UnitRange{Int64},1}:\n 1:2\n 3:4\n 5:6\n 7:8\n 9:10\n\n\n\n\n\nleavepout(data, p = 1)\n\nRepartition a data container using a k-fold strategy, where k is chosen in such a way, that each validation subset of the resulting folds contains roughly p observations. Defaults to p = 1, which is also known as \"leave-one-out\" partitioning.\n\nThe resulting sequence of folds is returned as a lazy iterator. Only data subsets are created. 
That means no actual data is copied until getobs is invoked.\n\nfor (train, val) in leavepout(X, p=2)\n # if numobs(X) is divisible by 2,\n # then numobs(val) will be 2 for each iteration,\n # otherwise it may be 3 for the first few iterations.\nend\n\nSee kfolds for a related function.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.mapobs","page":"Batching Data – MLUtils.jl","title":"MLUtils.mapobs","text":"mapobs(f, data; batched=:auto)\n\nLazily map f over the observations in a data container data. Returns a new data container mdata that can be indexed and has a length. Indexing triggers the transformation f.\n\nThe batched keyword argument controls the behavior of mdata[idx] and mdata[idxs] where idx is an integer and idxs is a vector of integers:\n\nbatched=:auto (default). Let f handle the two cases. Calls f(getobs(data, idx)) and f(getobs(data, idxs)).\nbatched=:never. The function f is always called on a single observation. Calls f(getobs(data, idx)) and [f(getobs(data, idx)) for idx in idxs].\nbatched=:always. The function f is always called on a batch of observations. Calls getobs(f(getobs(data, [idx])), 1) and f(getobs(data, idxs)).\n\nExamples\n\njulia> data = (a=[1,2,3], b=[1,2,3]);\n\njulia> mdata = mapobs(data) do x\n (c = x.a .+ x.b, d = x.a .- x.b)\n end\nmapobs(#25, (a = [1, 2, 3], b = [1, 2, 3]); batched=:auto))\n\njulia> mdata[1]\n(c = 2, d = 0)\n\njulia> mdata[1:2]\n(c = [2, 4], d = [0, 0])\n\n\n\n\n\nmapobs(fs, data)\n\nLazily map each function in tuple fs over the observations in data container data. Returns a tuple of transformed data containers.\n\n\n\n\n\nmapobs(namedfs::NamedTuple, data)\n\nMap a NamedTuple of functions over data, turning it into a data container of NamedTuples. 
Field syntax can be used to select a column of the resulting data container.\n\ndata = 1:10\nnameddata = mapobs((x = sqrt, y = log), data)\ngetobs(nameddata, 10) == (x = sqrt(10), y = log(10))\ngetobs(nameddata.x, 10) == sqrt(10)\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.numobs","page":"Batching Data – MLUtils.jl","title":"MLUtils.numobs","text":"numobs(data)\n\nReturn the total number of observations contained in data.\n\nIf data does not have numobs defined, then in the case of Tables.table(data) == true returns the number of rows, otherwise returns length(data).\n\nAuthors of custom data containers should implement Base.length for their type instead of numobs. numobs should only be implemented for types where there is a difference between numobs and Base.length (such as multi-dimensional arrays).\n\ngetobs supports by default nested combinations of array, tuple, named tuples, and dictionaries. \n\nSee also getobs.\n\nExamples\n\n\n# named tuples \nx = (a = [1, 2, 3], b = rand(6, 3))\nnumobs(x) == 3\n\n# dictionaries\nx = Dict(:a => [1, 2, 3], :b => rand(6, 3))\nnumobs(x) == 3\n\nAll internal containers must have the same number of observations:\n\njulia> x = (a = [1, 2, 3, 4], b = rand(6, 3));\n\njulia> numobs(x)\nERROR: DimensionMismatch: All data containers must have the same number of observations.\nStacktrace:\n [1] _check_numobs_error()\n @ MLUtils ~/.julia/dev/MLUtils/src/observation.jl:163\n [2] _check_numobs\n @ ~/.julia/dev/MLUtils/src/observation.jl:130 [inlined]\n [3] numobs(data::NamedTuple{(:a, :b), Tuple{Vector{Int64}, Matrix{Float64}}})\n @ MLUtils ~/.julia/dev/MLUtils/src/observation.jl:177\n [4] top-level scope\n @ REPL[35]:1\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.normalise","page":"Batching Data – MLUtils.jl","title":"MLUtils.normalise","text":"normalise(x; dims=ndims(x), ϵ=1e-5)\n\nNormalise the array x to mean 0 and standard deviation 1 across the dimension(s) 
given by dims. By default, dims is the last dimension. \n\nϵ is a small additive factor added to the denominator for numerical stability.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.obsview","page":"Batching Data – MLUtils.jl","title":"MLUtils.obsview","text":"obsview(data, [indices])\n\nReturns a lazy view of the observations in data that correspond to the given indices. No data will be copied except for the indices. It is similar to constructing an ObsView, but returns a SubArray if the type of data is Array or SubArray. Furthermore, this function may be extended for custom types of data that also want to provide their own subset-type.\n\nIn case data is a tuple, the constructor will be mapped over its elements. That means that the constructor returns a tuple of ObsView instead of an ObsView of tuples.\n\nIf instead you want to get the subset of observations corresponding to the given indices in their native type, use getobs.\n\nSee ObsView for more information.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.ObsView","page":"Batching Data – MLUtils.jl","title":"MLUtils.ObsView","text":"ObsView(data, [indices])\n\nUsed to represent a subset of some data of arbitrary type by storing which observation-indices the subset spans. Furthermore, subsequent subsettings are accumulated without needing to access actual data.\n\nThe main purpose for the existence of ObsView is to delay data access and movement until an actual batch of data (or single observation) is needed for some computation. This is particularly useful when the data is not located in memory, but on the hard drive or some remote location. In such a scenario one wants to load the required data only when needed.\n\nAny data access is delayed until getindex is called, and even getindex returns the result of obsview which in general avoids data movement until getobs is called.
If used as an iterator, the view will iterate over the dataset once, effectively denoting an epoch. Each iteration will return a lazy subset to the current observation.\n\nArguments\n\ndata : The object describing the dataset. Can be of any type as long as it implements getobs and numobs (see Details for more information).\nindices : Optional. The index or indices of the observation(s) in data that the subset should represent. Can be of type Int or some subtype of AbstractVector.\n\nMethods\n\ngetindex : Returns the observation(s) of the given index/indices. No data is copied aside from the required indices.\nnumobs : Returns the total number observations in the subset.\ngetobs : Returns the underlying data that the ObsView represents at the given relative indices. Note that these indices are in \"subset space\", and in general will not directly correspond to the same indices in the underlying data set.\n\nDetails\n\nFor ObsView to work on some data structure, the desired type MyType must implement the following interface:\n\ngetobs(data::MyType, idx) : Should return the observation(s) indexed by idx. In what form is up to the user. Note that idx can be of type Int or AbstractVector.\nnumobs(data::MyType) : Should return the total number of observations in data\n\nThe following methods can also be provided and are optional:\n\ngetobs(data::MyType) : By default this function is the identity function. If that is not the behaviour that you want for your type, you need to provide this method as well.\nobsview(data::MyType, idx) : If your custom type has its own kind of subset type, you can return it here. An example for such a case are SubArray for representing a subset of some AbstractArray.\ngetobs!(buffer, data::MyType, [idx]) : Inplace version of getobs(data, idx). If this method is provided for MyType, then eachobs can preallocate a buffer that is then reused every iteration. 
Note: buffer should be equivalent to the return value of getobs(::MyType, ...), since this is how buffer is preallocated by default.\n\nExamples\n\nX, Y = MLUtils.load_iris()\n\n# The iris set has 150 observations and 4 features\n@assert size(X) == (4,150)\n\n# Represents the 80 observations as a ObsView\nv = ObsView(X, 21:100)\n@assert numobs(v) == 80\n@assert typeof(v) <: ObsView\n# getobs indexes into v\n@assert getobs(v, 1:10) == X[:, 21:30]\n\n# Use `obsview` to avoid boxing into ObsView\n# for types that provide a custom \"subset\", such as arrays.\n# Here it instead creates a native SubArray.\nv = obsview(X, 1:100)\n@assert numobs(v) == 100\n@assert typeof(v) <: SubArray\n\n# Also works for tuples of arbitrary length\nsubset = obsview((X, Y), 1:100)\n@assert numobs(subset) == 100\n@assert typeof(subset) <: Tuple # tuple of SubArray\n\n# Use as iterator\nfor x in ObsView(X)\n @assert typeof(x) <: SubArray{Float64,1}\nend\n\n# iterate over each individual labeled observation\nfor (x, y) in ObsView((X, Y))\n @assert typeof(x) <: SubArray{Float64,1}\n @assert typeof(y) <: String\nend\n\n# same but in random order\nfor (x, y) in ObsView(shuffleobs((X, Y)))\n @assert typeof(x) <: SubArray{Float64,1}\n @assert typeof(y) <: String\nend\n\n# Indexing: take first 10 observations\nx, y = ObsView((X, Y))[1:10]\n\nSee also\n\nobsview, getobs, numobs, splitobs, shuffleobs, kfolds.\n\n\n\n\n\n","category":"type"},{"location":"reference/data/mlutils/#MLUtils.ones_like","page":"Batching Data – MLUtils.jl","title":"MLUtils.ones_like","text":"ones_like(x, [element_type=eltype(x)], [dims=size(x)]))\n\nCreate an array with the given element type and size, based upon the given source array x. All element of the new array will be set to 1. The second and third arguments are both optional, defaulting to the given array's eltype and size. 
The dimensions may be specified as an integer or as a tuple argument.\n\nSee also zeros_like and fill_like.\n\nExamples\n\njulia> x = rand(Float32, 2)\n2-element Vector{Float32}:\n 0.8621633\n 0.5158395\n\njulia> ones_like(x, (3, 3))\n3×3 Matrix{Float32}:\n 1.0 1.0 1.0\n 1.0 1.0 1.0\n 1.0 1.0 1.0\n\njulia> using CUDA\n\njulia> x = CUDA.rand(2, 2)\n2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:\n 0.82297 0.656143\n 0.701828 0.391335\n\njulia> ones_like(x, Float64)\n2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:\n 1.0 1.0\n 1.0 1.0\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.oversample","page":"Batching Data – MLUtils.jl","title":"MLUtils.oversample","text":"oversample(data, classes; fraction=1, shuffle=true)\noversample(data::Tuple; fraction=1, shuffle=true)\n\nGenerate a re-balanced version of data by repeatedly sampling existing observations in such a way that every class will have at least fraction times the number of observations of the largest class in classes. This way, all classes will have a minimum number of observations in the resulting data set relative to what the largest class has in the given (original) data.\n\nAs an example, by default (i.e. with fraction = 1) the resulting dataset will be near perfectly balanced. On the other hand, with fraction = 0.5 every class in the resulting data will have at least 50% as many observations as the largest class.\n\nThe classes input is an array with the same length as numobs(data). \n\nThe convenience parameter shuffle determines if the resulting data will be shuffled after its creation; if it is not shuffled then all the repeated samples will be together at the end, sorted by class.
Defaults to true.\n\nThe output will contain both the resampled data and classes.\n\n# 6 observations with 3 features each\nX = rand(3, 6)\n# 2 classes, severely imbalanced\nY = [\"a\", \"b\", \"b\", \"b\", \"b\", \"a\"]\n\n# oversample the class \"a\" to match \"b\"\nX_bal, Y_bal = oversample(X, Y)\n\n# this results in a bigger dataset with repeated data\n@assert size(X_bal) == (3,8)\n@assert length(Y_bal) == 8\n\n# now both \"a\", and \"b\" have 4 observations each\n@assert sum(Y_bal .== \"a\") == 4\n@assert sum(Y_bal .== \"b\") == 4\n\nFor this function to work, the type of data must implement numobs and getobs. \n\nNote that if data is a tuple and classes is not given, then it will be assumed that the last element of the tuple contains the classes.\n\njulia> data = DataFrame(X1=rand(6), X2=rand(6), Y=[:a,:b,:b,:b,:b,:a])\n6×3 DataFrames.DataFrame\n│ Row │ X1 │ X2 │ Y │\n├─────┼───────────┼─────────────┼───┤\n│ 1 │ 0.226582 │ 0.0443222 │ a │\n│ 2 │ 0.504629 │ 0.722906 │ b │\n│ 3 │ 0.933372 │ 0.812814 │ b │\n│ 4 │ 0.522172 │ 0.245457 │ b │\n│ 5 │ 0.505208 │ 0.11202 │ b │\n│ 6 │ 0.0997825 │ 0.000341996 │ a │\n\njulia> getobs(oversample(data, data.Y))\n8×3 DataFrame\n Row │ X1 X2 Y \n │ Float64 Float64 Symbol \n─────┼─────────────────────────────\n 1 │ 0.376304 0.100022 a\n 2 │ 0.467095 0.185437 b\n 3 │ 0.481957 0.319906 b\n 4 │ 0.336762 0.390811 b\n 5 │ 0.376304 0.100022 a\n 6 │ 0.427064 0.0648339 a\n 7 │ 0.427064 0.0648339 a\n 8 │ 0.457043 0.490688 b\n\nSee ObsView for more information on data subsets. See also undersample.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.randobs","page":"Batching Data – MLUtils.jl","title":"MLUtils.randobs","text":"randobs(data, [n])\n\nPick a random observation or a batch of n random observations from data. 
For this function to work, the type of data must implement numobs and getobs.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.rand_like","page":"Batching Data – MLUtils.jl","title":"MLUtils.rand_like","text":"rand_like([rng=default_rng()], x, [element_type=eltype(x)], [dims=size(x)])\n\nCreate an array with the given element type and size, based upon the given source array x. All elements of the new array will be set to a random value. The last two arguments are both optional, defaulting to the given array's eltype and size. The dimensions may be specified as an integer or as a tuple argument.\n\nThe default random number generator is used, unless a custom one is passed in explicitly as the first argument.\n\nSee also Base.rand and randn_like.\n\nExamples\n\njulia> x = ones(Float32, 2)\n2-element Vector{Float32}:\n 1.0\n 1.0\n\njulia> rand_like(x, (3, 3))\n3×3 Matrix{Float32}:\n 0.780032 0.920552 0.53689\n 0.121451 0.741334 0.5449\n 0.55348 0.138136 0.556404\n\njulia> using CUDA\n\njulia> x = CUDA.ones(2, 2)\n2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:\n 1.0 1.0\n 1.0 1.0\n\njulia> rand_like(x, Float64)\n2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:\n 0.429274 0.135379\n 0.718895 0.0098756\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.randn_like","page":"Batching Data – MLUtils.jl","title":"MLUtils.randn_like","text":"randn_like([rng=default_rng()], x, [element_type=eltype(x)], [dims=size(x)])\n\nCreate an array with the given element type and size, based upon the given source array x. All elements of the new array will be set to a random value drawn from a normal distribution. The last two arguments are both optional, defaulting to the given array's eltype and size.
The dimensions may be specified as an integer or as a tuple argument.\n\nThe default random number generator is used, unless a custom one is passed in explicitly as the first argument.\n\nSee also Base.randn and rand_like.\n\nExamples\n\njulia> x = ones(Float32, 2)\n2-element Vector{Float32}:\n 1.0\n 1.0\n\njulia> randn_like(x, (3, 3))\n3×3 Matrix{Float32}:\n -0.385331 0.956231 0.0745102\n 1.43756 -0.967328 2.06311\n 0.0482372 1.78728 -0.902547\n\njulia> using CUDA\n\njulia> x = CUDA.ones(2, 2)\n2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:\n 1.0 1.0\n 1.0 1.0\n\njulia> randn_like(x, Float64)\n2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:\n -0.578527 0.823445\n -1.01338 -0.612053\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.rpad_constant","page":"Batching Data – MLUtils.jl","title":"MLUtils.rpad_constant","text":"rpad_constant(v::AbstractArray, n::Union{Integer, Tuple}, val = 0; dims=:)\n\nReturn the given sequence padded with val along the dimensions dims up to a maximum length in each direction specified by n.\n\nExamples\n\njulia> rpad_constant([1, 2], 4, -1) # padding with -1 up to size 4\n4-element Vector{Int64}:\n 1\n 2\n -1\n -1\n\njulia> rpad_constant([1, 2, 3], 2) # no padding if length is already greater than n\n3-element Vector{Int64}:\n 1\n 2\n 3\n\njulia> rpad_constant([1 2; 3 4], 4; dims=1) # padding along the first dimension\n4×2 Matrix{Int64}:\n 1 2\n 3 4\n 0 0\n 0 0 \n\njulia> rpad_constant([1 2; 3 4], 4) # padding along all dimensions by default\n4×2 Matrix{Int64}:\n 1 2\n 3 4\n 0 0\n 0 0 \n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.shuffleobs","page":"Batching Data – MLUtils.jl","title":"MLUtils.shuffleobs","text":"shuffleobs([rng], data)\n\nReturn a \"subset\" of data that spans all observations, but has the order of the observations shuffled.\n\nThe values of data itself are not copied. Instead only the indices are shuffled.
This function calls obsview to accomplish that, which means that the return value is likely of a different type than data.\n\n# For Arrays the subset will be of type SubArray\n@assert typeof(shuffleobs(rand(4,10))) <: SubArray\n\n# Iterate through all observations in random order\nfor x in eachobs(shuffleobs(X))\n ...\nend\n\nThe optional parameter rng allows one to specify the random number generator used for shuffling. This is useful when reproducible results are desired. By default, the global RNG is used. See Random in Julia's standard library for more info.\n\nFor this function to work, the type of data must implement numobs and getobs. See ObsView for more information.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.splitobs","page":"Batching Data – MLUtils.jl","title":"MLUtils.splitobs","text":"splitobs(n::Int; at) -> Tuple\n\nCompute the indices for two or more disjoint subsets of the range 1:n with splits given by at.\n\nExamples\n\njulia> splitobs(100, at=0.7)\n(1:70, 71:100)\n\njulia> splitobs(100, at=(0.1, 0.4))\n(1:10, 11:50, 51:100)\n\n\n\n\n\nsplitobs(data; at, shuffle=false) -> Tuple\n\nPartition the data into two or more subsets. When at is a number (between 0 and 1) this specifies the proportion in the first subset. When at is a tuple, each entry specifies the proportion in a subset, with the last having 1-sum(at). In all there are length(at)+1 subsets returned.\n\nIf shuffle=true, randomly permute the observations before splitting.\n\nSupports any datatype implementing the numobs and getobs interfaces – including arrays, tuples & NamedTuples of arrays.\n\nExamples\n\njulia> splitobs(permutedims(1:100); at=0.7) # simple 70%-30% split, of a matrix\n([1 2 … 69 70], [71 72 … 99 100])\n\njulia> data = (x=ones(2,10), n=1:10) # a NamedTuple, consistent last dimension\n(x = [1.0 1.0 … 1.0 1.0; 1.0 1.0 … 1.0 1.0], n = 1:10)\n\njulia> splitobs(data, at=(0.5, 0.3)) # a 50%-30%-20% split, e.g.
train/test/validation\n((x = [1.0 1.0 … 1.0 1.0; 1.0 1.0 … 1.0 1.0], n = 1:5), (x = [1.0 1.0 1.0; 1.0 1.0 1.0], n = 6:8), (x = [1.0 1.0; 1.0 1.0], n = 9:10))\n\njulia> train, test = splitobs((permutedims(1.0:100.0), 101:200), at=0.7, shuffle=true); # split a Tuple\n\njulia> vec(test[1]) .+ 100 == test[2]\ntrue\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.unbatch","page":"Batching Data – MLUtils.jl","title":"MLUtils.unbatch","text":"unbatch(x)\n\nReverse of the batch operation, unstacking the last dimension of the array x.\n\nSee also unstack and chunk.\n\nExamples\n\njulia> unbatch([1 3 5 7;\n 2 4 6 8])\n4-element Vector{Vector{Int64}}:\n [1, 2]\n [3, 4]\n [5, 6]\n [7, 8]\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.undersample","page":"Batching Data – MLUtils.jl","title":"MLUtils.undersample","text":"undersample(data, classes; shuffle=true)\n\nGenerate a class-balanced version of data by subsampling its observations in such a way that the resulting number of observations will be the same number for every class. This way, all classes will have as many observations in the resulting data set as the smallest class has in the given (original) data.\n\nThe convenience parameter shuffle determines if the resulting data will be shuffled after its creation; if it is not shuffled then all the observations will be in their original order. 
Defaults to true.\n\nThe output will contain both the resampled data and classes.\n\n# 6 observations with 3 features each\nX = rand(3, 6)\n# 2 classes, severely imbalanced\nY = [\"a\", \"b\", \"b\", \"b\", \"b\", \"a\"]\n\n# subsample the class \"b\" to match \"a\"\nX_bal, Y_bal = undersample(X, Y)\n\n# this results in a smaller dataset\n@assert size(X_bal) == (3,4)\n@assert length(Y_bal) == 4\n\n# now both \"a\", and \"b\" have 2 observations each\n@assert sum(Y_bal .== \"a\") == 2\n@assert sum(Y_bal .== \"b\") == 2\n\nFor this function to work, the type of data must implement numobs and getobs. \n\nNote that if data is a tuple, then it will be assumed that the last element of the tuple contains the targets.\n\njulia> data = DataFrame(X1=rand(6), X2=rand(6), Y=[:a,:b,:b,:b,:b,:a])\n6×3 DataFrames.DataFrame\n│ Row │ X1 │ X2 │ Y │\n├─────┼───────────┼─────────────┼───┤\n│ 1 │ 0.226582 │ 0.0443222 │ a │\n│ 2 │ 0.504629 │ 0.722906 │ b │\n│ 3 │ 0.933372 │ 0.812814 │ b │\n│ 4 │ 0.522172 │ 0.245457 │ b │\n│ 5 │ 0.505208 │ 0.11202 │ b │\n│ 6 │ 0.0997825 │ 0.000341996 │ a │\n\njulia> getobs(undersample(data, data.Y))\n4×3 DataFrame\n Row │ X1 X2 Y \n │ Float64 Float64 Symbol \n─────┼─────────────────────────────\n 1 │ 0.427064 0.0648339 a\n 2 │ 0.376304 0.100022 a\n 3 │ 0.467095 0.185437 b\n 4 │ 0.457043 0.490688 b\n\nSee ObsView for more information on data subsets. See also oversample.\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.unsqueeze","page":"Batching Data – MLUtils.jl","title":"MLUtils.unsqueeze","text":"unsqueeze(x; dims)\n\nReturn x reshaped into an array one dimensionality higher than x, where dims indicates in which dimension x is extended.
dims can be an integer between 1 and ndims(x)+1.\n\nSee also flatten, stack.\n\nExamples\n\njulia> unsqueeze([1 2; 3 4], dims=2)\n2×1×2 Array{Int64, 3}:\n[:, :, 1] =\n 1\n 3\n\n[:, :, 2] =\n 2\n 4\n\n\njulia> xs = [[1, 2], [3, 4], [5, 6]]\n3-element Vector{Vector{Int64}}:\n [1, 2]\n [3, 4]\n [5, 6]\n\njulia> unsqueeze(xs, dims=1)\n1×3 Matrix{Vector{Int64}}:\n [1, 2] [3, 4] [5, 6]\n\n\n\n\n\nunsqueeze(; dims)\n\nReturns a function which, acting on an array, inserts a dimension of size 1 at dims.\n\nExamples\n\njulia> rand(21, 22, 23) |> unsqueeze(dims=2) |> size\n(21, 1, 22, 23)\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.unstack","page":"Batching Data – MLUtils.jl","title":"MLUtils.unstack","text":"unstack(xs; dims)\n\nUnroll the given xs into an array of arrays along the given dimension dims.\n\nSee also stack, unbatch, and chunk.\n\nExamples\n\njulia> unstack([1 3 5 7; 2 4 6 8], dims=2)\n4-element Vector{Vector{Int64}}:\n [1, 2]\n [3, 4]\n [5, 6]\n [7, 8]\n\n\n\n\n\n","category":"function"},{"location":"reference/data/mlutils/#MLUtils.zeros_like","page":"Batching Data – MLUtils.jl","title":"MLUtils.zeros_like","text":"zeros_like(x, [element_type=eltype(x)], [dims=size(x)]))\n\nCreate an array with the given element type and size, based upon the given source array x. All element of the new array will be set to 0. The second and third arguments are both optional, defaulting to the given array's eltype and size. 
The dimensions may be specified as an integer or as a tuple argument.\n\nSee also ones_like and fill_like.\n\nExamples\n\njulia> x = rand(Float32, 2)\n2-element Vector{Float32}:\n 0.4005432\n 0.36934233\n\njulia> zeros_like(x, (3, 3))\n3×3 Matrix{Float32}:\n 0.0 0.0 0.0\n 0.0 0.0 0.0\n 0.0 0.0 0.0\n\njulia> using CUDA\n\njulia> x = CUDA.rand(2, 2)\n2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:\n 0.0695155 0.667979\n 0.558468 0.59903\n\njulia> zeros_like(x, Float64)\n2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:\n 0.0 0.0\n 0.0 0.0\n\n\n\n\n\n","category":"function"},{"location":"reference/training/callbacks/#man-callback-helpers","page":"Callback Helpers","title":"Callback Helpers","text":"","category":"section"},{"location":"reference/training/callbacks/","page":"Callback Helpers","title":"Callback Helpers","text":"Flux.throttle","category":"page"},{"location":"reference/training/callbacks/#Flux.throttle","page":"Callback Helpers","title":"Flux.throttle","text":"throttle(f, timeout; leading=true, trailing=false)\n\nReturn a function that when invoked, will only be triggered at most once during timeout seconds.\n\nNormally, the throttled function will run as much as it can, without ever going more than once per wait duration; but if you'd like to disable the execution on the leading edge, pass leading=false. To enable execution on the trailing edge, pass trailing=true.\n\nExamples\n\njulia> a = Flux.throttle(() -> println(\"Flux\"), 2);\n\njulia> for i = 1:4 # a called in alternate iterations\n a()\n sleep(1)\n end\nFlux\nFlux\n\n\n\n\n\n","category":"function"},{"location":"reference/training/callbacks/#Patience-Helpers","page":"Callback Helpers","title":"Patience Helpers","text":"","category":"section"},{"location":"reference/training/callbacks/","page":"Callback Helpers","title":"Callback Helpers","text":"Flux provides utilities for controlling your training procedure according to some monitored condition and a maximum patience. 
For example, you can use early_stopping to stop training when the model is converging or deteriorating, or you can use plateau to check if the model is stagnating.","category":"page"},{"location":"reference/training/callbacks/","page":"Callback Helpers","title":"Callback Helpers","text":"For example, below we create a pseudo-loss function that decreases, bottoms out, and then increases. The early stopping trigger will break the loop before the loss increases too much.","category":"page"},{"location":"reference/training/callbacks/","page":"Callback Helpers","title":"Callback Helpers","text":"# create a pseudo-loss that decreases for 4 calls, then starts increasing\n# we call this like loss()\nloss = let t = 0\n () -> begin\n t += 1\n (t - 4) ^ 2\n end\nend\n\n# create an early stopping trigger\n# returns true when the loss increases for two consecutive steps\nes = early_stopping(loss, 2; init_score = 9)\n\n# this will stop at the 6th (4 decreasing + 2 increasing calls) epoch\nfor epoch in 1:10\n es() && break\nend","category":"page"},{"location":"reference/training/callbacks/","page":"Callback Helpers","title":"Callback Helpers","text":"The keyword argument distance of early_stopping is a function of the form distance(best_score, score). By default distance is -, which implies that the monitored metric f is expected to be decreasing and minimized. If you use some increasing metric (e.g. 
accuracy), you can customize the distance function: (best_score, score) -> score - best_score.","category":"page"},{"location":"reference/training/callbacks/","page":"Callback Helpers","title":"Callback Helpers","text":"# create a pseudo-accuracy that increases by 0.01 each time from 0 to 1\n# we call this like acc()\nacc = let v = 0\n () -> v = min(1, v + 0.01)\nend\n\n# create an early stopping trigger for accuracy\nes = early_stopping(acc, 3; distance = (best_score, score) -> score - best_score)\n\n# this will iterate until the 10th epoch\nfor epoch in 1:10\n es() && break\nend","category":"page"},{"location":"reference/training/callbacks/","page":"Callback Helpers","title":"Callback Helpers","text":"early_stopping and plateau are both built on top of patience. You can use patience to build your own triggers that use a patient counter. For example, if you want to trigger when the loss is below a threshold for several consecutive iterations:","category":"page"},{"location":"reference/training/callbacks/","page":"Callback Helpers","title":"Callback Helpers","text":"threshold(f, thresh, delay) = patience(delay) do\n f() < thresh\nend","category":"page"},{"location":"reference/training/callbacks/","page":"Callback Helpers","title":"Callback Helpers","text":"Both predicate in patience and f in early_stopping / plateau can accept extra arguments.
You can pass such extra arguments to predicate or f through the returned function:","category":"page"},{"location":"reference/training/callbacks/","page":"Callback Helpers","title":"Callback Helpers","text":"trigger = patience((a; b) -> a > b, 3)\n\n# this will iterate until the 10th epoch\nfor epoch in 1:10\n trigger(1; b = 2) && break\nend\n\n# this will stop at the 3rd epoch\nfor epoch in 1:10\n trigger(3; b = 2) && break\nend","category":"page"},{"location":"reference/training/callbacks/","page":"Callback Helpers","title":"Callback Helpers","text":"Flux.patience\nFlux.early_stopping\nFlux.plateau","category":"page"},{"location":"reference/training/callbacks/#Flux.patience","page":"Callback Helpers","title":"Flux.patience","text":"patience(predicate, wait)\n\nReturn a function that internally counts by one when predicate(...) == true, otherwise the count is reset to zero. If the count is greater than or equal to wait, the function returns true, otherwise it returns false.\n\nExamples\n\njulia> loss() = rand();\n\njulia> trigger = Flux.patience(() -> loss() < 1, 3);\n\n\njulia> for i in 1:10\n @info \"Epoch $i\"\n trigger() && break\n end\n[ Info: Epoch 1\n[ Info: Epoch 2\n[ Info: Epoch 3\n\n\n\n\n\n","category":"function"},{"location":"reference/training/callbacks/#Flux.early_stopping","page":"Callback Helpers","title":"Flux.early_stopping","text":"early_stopping(f, delay; distance = -, init_score = 0, min_dist = 0)\n\nReturn a function that internally counts by one when distance(best_score, f(...)) <= min_dist, where best_score is the last seen best value of f(...). If the count is greater than or equal to delay, the function returns true, otherwise it returns false. 
The count is reset when distance(best_score, f(...)) > min_dist.\n\nExamples\n\njulia> loss = let l = 0\n () -> l += 1\n end; # pseudo loss function that returns increasing values\n\njulia> es = Flux.early_stopping(loss, 3);\n\n\njulia> for i in 1:10\n @info \"Epoch $i\"\n es() && break\n end\n[ Info: Epoch 1\n[ Info: Epoch 2\n[ Info: Epoch 3\n\n\n\n\n\n","category":"function"},{"location":"reference/training/callbacks/#Flux.plateau","page":"Callback Helpers","title":"Flux.plateau","text":"plateau(f, width; distance = -, init_score = 0, min_dist = 1f-6)\n\nReturn a function that internally counts by one when abs(distance(last_score, f(...))) <= min_dist, where last_score holds the last value of f(...). If the count is greater than or equal to width, the function returns true, otherwise it returns false. The count is reset when abs(distance(last_score, f(...))) > min_dist.\n\nExamples\n\njulia> f = let v = 10\n () -> v = v / abs(v) - v\n end; # -9, 8, -7, 6, ...\n\njulia> trigger = Flux.plateau(f, 3; init_score=10, min_dist=18);\n\n\njulia> for i in 1:10\n @info \"Epoch $i\"\n trigger() && break\n end\n[ Info: Epoch 1\n[ Info: Epoch 2\n[ Info: Epoch 3\n[ Info: Epoch 4\n\n\n\n\n\n","category":"function"},{"location":"guide/training/training/#man-training","page":"Training","title":"Training a Flux Model","text":"","category":"section"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Training refers to the process of slowly adjusting the parameters of a model to make it work better. 
Besides the model itself, we will need three things:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"An objective function that evaluates how well a model is doing on some input.\nAn optimisation rule which describes how the model's parameters should be adjusted.\nSome training data to use as the input during this process.","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Usually the training data is some collection of examples (or batches of examples) which are handled one-by-one. One epoch of training means that each example is used once, something like this:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"# Initialise the optimiser for this model:\nopt_state = Flux.setup(rule, model)\n\nfor data in train_set\n # Unpack this element (for supervised training):\n input, label = data\n\n # Calculate the gradient of the objective\n # with respect to the parameters within the model:\n grads = Flux.gradient(model) do m\n result = m(input)\n loss(result, label)\n end\n\n # Update the parameters so as to reduce the objective,\n # according to the chosen optimisation rule:\n Flux.update!(opt_state, model, grads[1])\nend","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"This loop can also be written using the function train!, but it's helpful to understand the pieces first:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"train!(model, train_set, opt_state) do m, x, y\n loss(m(x), y)\nend","category":"page"},{"location":"guide/training/training/#Model-Gradients","page":"Training","title":"Model Gradients","text":"","category":"section"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"First recall from the section on taking gradients that Flux.gradient(f, a, b) always
calls f(a, b), and returns a tuple (∂f_∂a, ∂f_∂b). In the code above, the function f passed to gradient is an anonymous function with one argument, created by the do block, hence grads is a tuple with one element. Instead of a do block, we could have written:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"grads = Flux.gradient(m -> loss(m(input), label), model)","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Since the model is some nested set of layers, grads[1] is a similarly nested set of NamedTuples, ultimately containing gradient components. If (for example) θ = model.layers[1].weight[2,3] is one scalar parameter, an entry in a matrix of weights, then the derivative of the loss with respect to it is ∂f_∂θ = grads[1].layers[1].weight[2,3].","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"It is important that the execution of the model takes place inside the call to gradient, in order for the influence of the model's parameters to be observed by Zygote.","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"It is also important that every update! step receives a newly computed gradient, as it will change whenever the model's parameters are changed, and for each new data point.","category":"page"},{"location":"guide/training/training/#Loss-Functions","page":"Training","title":"Loss Functions","text":"","category":"section"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"The objective function must return a number representing how far the model is from the desired result. 
This is termed the loss of the model.","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"This number can be produced by any ordinary Julia code, but this must be executed within the call to gradient. For instance, we could define a function","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"loss(y_hat, y) = sum((y_hat .- y).^2)","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"or write this directly inside the do block above. Many commonly used functions, like mse for mean-squared error or crossentropy for cross-entropy loss, are available from the Flux.Losses module.","category":"page"},{"location":"guide/training/training/#Optimisation-Rules","page":"Training","title":"Optimisation Rules","text":"","category":"section"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"The simplest kind of optimisation using the gradient is termed gradient descent (or sometimes stochastic gradient descent when, as here, it is not applied to the entire dataset at once).","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Gradient descent needs a learning rate which is a small number describing how fast to walk downhill, usually written as the Greek letter \"eta\", η. This is often described as a hyperparameter, to distinguish it from the parameters which are being updated θ = θ - η * ∂loss_∂θ. 
We want to update all the parameters in the model, like this:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"η = 0.01 # learning rate\n\n# For each parameter array, update\n# according to the corresponding gradient:\nfmap(model, grads[1]) do p, g\n p .= p .- η .* g\nend","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"A slightly more refined version of this loop to update all the parameters is wrapped up as a function update!(opt_state, model, grads[1]). And the learning rate is the only thing stored in the Descent struct.","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"However, there are many other optimisation rules, which adjust the step size and direction in various clever ways. Most require some memory of the gradients from earlier steps, rather than always walking straight downhill – Momentum is the simplest. The function setup creates the necessary storage for this, for a particular model. It should be called once, before training, and returns a tree-like object which is the first argument of update!. Like this:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"# Initialise momentum \nopt_state = Flux.setup(Momentum(0.01, 0.9), model)\n\nfor data in train_set\n grads = [...]\n\n # Update both model parameters and optimiser state:\n Flux.update!(opt_state, model, grads[1])\nend","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Many commonly-used optimisation rules, such as Adam, are built-in. These are listed on the optimisers page.","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"compat: Implicit-style optimiser state\nThis setup makes another tree-like structure. 
Old versions of Flux did not do this, and instead stored a dictionary-like structure within the optimiser Adam(0.001). This was initialised on first use of the version of update! for \"implicit\" parameters.","category":"page"},{"location":"guide/training/training/#Datasets-and-Batches","page":"Training","title":"Datasets & Batches","text":"","category":"section"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"The loop above iterates through train_set, expecting at each step a tuple (input, label). The very simplest such object is a vector of tuples, such as this:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"x = randn(28, 28)\ny = rand(10)\ndata = [(x, y)]","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"or data = [(x, y), (x, y), (x, y)] for the same values three times.","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Very often, the initial data is large arrays which you need to slice into examples. To produce one iterator of pairs (x, y), you might want zip:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"X = rand(28, 28, 60_000); # many images, each 28 × 28\nY = rand(10, 60_000)\ndata = zip(eachslice(X; dims=3), eachcol(Y))\n\nfirst(data) isa Tuple{AbstractMatrix, AbstractVector} # true","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Here each iteration will use one matrix x (an image, perhaps) and one vector y. It is very common to instead train on batches of such inputs (or mini-batches, the two words mean the same thing) both for efficiency and for better results. 
This can be easily done using the DataLoader:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"data = Flux.DataLoader((X, Y), batchsize=32)\n\nx1, y1 = first(data)\nsize(x1) == (28, 28, 32)\nlength(data) == 1875 === 60_000 ÷ 32","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Flux's layers are set up to accept such a batch of input data, and the convolutional layers such as Conv require it. The batch index is always the last dimension.","category":"page"},{"location":"guide/training/training/#Training-Loops","page":"Training","title":"Training Loops","text":"","category":"section"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Simple training loops like the one above can be written compactly using the train! function. Including setup, this reads:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"opt_state = Flux.setup(Adam(), model)\n\nfor epoch in 1:100\n Flux.train!(model, train_set, opt_state) do m, x, y\n loss(m(x), y)\n end\nend","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Or explicitly writing the anonymous function which this do block creates, train!((m,x,y) -> loss(m(x),y), model, train_set, opt_state) is exactly equivalent.","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Real training loops often need more flexibility, and the best way to do this is just to write the loop. This is ordinary Julia code, without any need to work through some callback API. 
Here is an example, in which it may be helpful to note:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"The function withgradient is like gradient but also returns the value of the function, for logging or diagnostic use.\nLogging or printing is best done outside of the gradient call, as there is no need to differentiate these commands.\nTo use result for logging purposes, you could change the do block to end with return my_loss(result, label), result, i.e. make the function passed to withgradient return a tuple. The first element is always the loss.\nJulia's break and continue keywords let you exit from parts of the loop.","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"opt_state = Flux.setup(Adam(), model)\n\nmy_log = []\nfor epoch in 1:100\n losses = Float32[]\n for (i, data) in enumerate(train_set)\n input, label = data\n\n val, grads = Flux.withgradient(model) do m\n # Any code inside here is differentiated.\n # Evaluation of the model and loss must be inside!\n result = m(input)\n my_loss(result, label)\n end\n\n # Save the loss from the forward pass. (Done outside of gradient.)\n push!(losses, val)\n\n # Detect loss of Inf or NaN. 
Print a warning, and then skip update!\n if !isfinite(val)\n @warn \"loss is $val on item $i\" epoch\n continue\n end\n\n Flux.update!(opt_state, model, grads[1])\n end\n\n # Compute some accuracy, and save details as a NamedTuple\n acc = my_accuracy(model, train_set)\n push!(my_log, (; acc, losses))\n\n # Stop training when some criterion is reached\n if acc > 0.95\n println(\"stopping after $epoch epochs\")\n break\n end\nend","category":"page"},{"location":"guide/training/training/#Regularisation","page":"Training","title":"Regularisation","text":"","category":"section"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"The term regularisation covers a wide variety of techniques aiming to improve the result of training. This is often done to avoid overfitting.","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Some of these can be implemented by simply modifying the loss function. L₂ regularisation (sometimes called ridge regression) adds to the loss a penalty proportional to θ^2 for every scalar parameter. A very simple model could be implemented as follows:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"grads = Flux.gradient(densemodel) do m\n result = m(input)\n penalty = sum(abs2, m.weight)/2 + sum(abs2, m.bias)/2\n my_loss(result, label) + 0.42f0 * penalty\nend","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Accessing each individual parameter array by hand won't work well for large models. 
Instead, we can use Flux.trainables to collect all of them, and then apply a function to each one, and sum the result:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"pen_l2(x::AbstractArray) = sum(abs2, x)/2\n\ngrads = Flux.gradient(model) do m\n result = m(input)\n penalty = sum(pen_l2, Flux.trainables(m))\n my_loss(result, label) + 0.42f0 * penalty\nend","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"However, the gradient of this penalty term is very simple: It is proportional to the original weights. So there is a simpler way to implement exactly the same thing, by modifying the optimiser instead of the loss function. This is done by replacing this:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"opt_state = Flux.setup(Adam(0.1), model)","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"with this:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"decay_opt_state = Flux.setup(OptimiserChain(WeightDecay(0.42), Adam(0.1)), model)","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Flux's optimisers are really modifications applied to the gradient before using it to update the parameters, and OptimiserChain applies two such modifications. The first, WeightDecay adds 0.42 times the original parameter to the gradient, matching the gradient of the penalty above (with the same, unrealistically large, constant). After that, in either case, Adam computes the final update.","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"The same trick works for L₁ regularisation (also called Lasso), where the penalty is pen_l1(x::AbstractArray) = sum(abs, x) instead. 
This is implemented by SignDecay(0.42).","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"The same OptimiserChain mechanism can be used for other purposes, such as gradient clipping with ClipGrad or ClipNorm.","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Besides L1 / L2 / weight decay, another common and quite different kind of regularisation is provided by the Dropout layer. This turns off some outputs of the previous layer during training. It should switch automatically, but see trainmode! / testmode! to manually enable or disable this layer.","category":"page"},{"location":"guide/training/training/#Learning-Rate-Schedules","page":"Training","title":"Learning Rate Schedules","text":"","category":"section"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"For finer control of training, you may wish to alter the learning rate mid-way through training. This can be done with adjust!, like this:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"opt_state = Flux.setup(Adam(0.1), model) # initialise once\n\nfor epoch in 1:1000\n train!([...], opt_state) # Train with η = 0.1 for first 100,\n if epoch == 100 # then change to use η = 0.01 for the rest.\n Flux.adjust!(opt_state, 0.01)\n end\nend","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Other hyper-parameters can also be adjusted, such as Flux.adjust!(opt_state, beta = (0.8, 0.99)). And such modifications can be applied to just one part of the model. 
For instance, this sets a different learning rate for the encoder and the decoder:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"# Consider some model with two parts:\nbimodel = Chain(enc = [...], dec = [...])\n\n# This returns a tree whose structure matches the model:\nopt_state = Flux.setup(Adam(0.02), bimodel)\n\n# Adjust the learning rate to be used for bimodel.layers.enc\nFlux.adjust!(opt_state.layers.enc, 0.03)","category":"page"},{"location":"guide/training/training/#Freezing-layer-parameters","page":"Training","title":"Freezing layer parameters","text":"","category":"section"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"To completely disable training of some part of the model, use freeze!. This is a temporary modification, reversed by thaw!:","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"Flux.freeze!(opt_state.layers.enc)\n\n# Now training won't update parameters in bimodel.layers.enc\ntrain!(loss, bimodel, data, opt_state)\n\n# Un-freeze the entire model:\nFlux.thaw!(opt_state)","category":"page"},{"location":"guide/training/training/","page":"Training","title":"Training","text":"While adjust! and freeze!/thaw! 
make temporary modifications to the optimiser state, permanently removing some fields of a new layer type from training is usually done when defining the layer, by calling for example @layer NewLayer trainable=(weight,).","category":"page"},{"location":"reference/models/activation/#man-activations","page":"Activation Functions","title":"Activation Functions from NNlib.jl","text":"","category":"section"},{"location":"reference/models/activation/","page":"Activation Functions","title":"Activation Functions","text":"These non-linearities used between layers of your model are exported by the NNlib package.","category":"page"},{"location":"reference/models/activation/","page":"Activation Functions","title":"Activation Functions","text":"Note that, unless otherwise stated, activation functions operate on scalars. To apply them to an array you can call σ.(xs), relu.(xs) and so on. Alternatively, they can be passed to a layer like Dense(784 => 1024, relu) which will handle this broadcasting.","category":"page"},{"location":"reference/models/activation/","page":"Activation Functions","title":"Activation Functions","text":"Functions like softmax are sometimes described as activation functions, but not by Flux. They must see all the outputs, and hence cannot be broadcasted. See the next page for details.","category":"page"},{"location":"reference/models/activation/#Alphabetical-Listing","page":"Activation Functions","title":"Alphabetical Listing","text":"","category":"section"},{"location":"reference/models/activation/","page":"Activation Functions","title":"Activation Functions","text":"celu\nelu\ngelu\nhardsigmoid\nhardswish\nhardtanh\nleakyrelu\nlisht\nlogcosh\nlogsigmoid\nmish\nrelu\nrelu6\nrrelu\nselu\nsigmoid\nsigmoid_fast\nsoftplus\nsoftshrink\nsoftsign\nswish\ntanhshrink\ntanh_fast\ntrelu","category":"page"},{"location":"reference/models/activation/#NNlib.celu","page":"Activation Functions","title":"NNlib.celu","text":"celu(x, α=1) = x ≥ 0 ? 
x : α * (exp(x/α) - 1)\n\nActivation function from \"Continuously Differentiable Exponential Linear Units\".\n\njulia> lineplot(celu, -2, 2, height=7)\n ┌────────────────────────────────────────┐ \n 2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠤⠒⠉│ celu(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠔⠊⠉⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⣀⡤⠖⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⣀⠤⠖⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⢤⡤⡧⠶⠭⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⣀⠤⠔⠒⠋⠁⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -1 │⠤⠤⠤⠤⠔⠒⠒⠒⠊⠉⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> celu(-10f0)\n-0.9999546f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.elu","page":"Activation Functions","title":"NNlib.elu","text":"elu(x, α=1) = x > 0 ? x : α * (exp(x) - 1)\n\nExponential Linear Unit activation function. See \"Fast and Accurate Deep Network Learning by Exponential Linear Units\". You can also specify the coefficient explicitly, e.g. 
elu(x, 1).\n\njulia> lineplot(elu, -2, 2, height=7)\n ┌────────────────────────────────────────┐ \n 2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠤⠒⠉│ elu(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠔⠊⠉⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⣀⡤⠖⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⣀⠤⠖⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⢤⡤⡧⠶⠭⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⣀⠤⠔⠒⠋⠁⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -1 │⠤⠤⠤⠤⠔⠒⠒⠒⠊⠉⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> elu(-10f0)\n-0.9999546f0\n\njulia> elu(-10f0, 2)\n-1.9999092f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.gelu","page":"Activation Functions","title":"NNlib.gelu","text":"gelu(x) = 0.5x * (1 + tanh(√(2/π) * (x + 0.044715x^3)))\n\nActivation function from \"Gaussian Error Linear Units\".\n\njulia> lineplot(gelu, -2, 2, height=7)\n ┌────────────────────────────────────────┐ \n 2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠔⠊│ gelu(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔⠊⠁⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⣀⠤⠒⠉⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⣀⡠⠤⠒⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⣤⣤⣤⣤⣤⣤⣤⣤⡤⠤⠤⠤⠤⠤⠤⠤⣤⣤⣤⡤⡧⠶⠶⠭⠥⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠈⠉⠉⠉⠉⠉⠉⠉⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> lineplot(gelu, -5, 0, height=7);\n\njulia> lineplot!(ans, swish)\n ┌────────────────────────────────────────┐ \n 0 │⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠒⠒⠤⣄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸│ gelu(x) \n │⠑⠒⠢⠤⣄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠓⢄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇│ swish(x)\n │⠀⠀⠀⠀⠀⠈⠉⠒⠤⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠑⢆⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣸⠁│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠒⢄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠑⢄⠀⠀⠀⠀⠀⠀⠀⠀⢠⡇⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠓⢄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠓⣄⠀⠀⠀⠀⠀⢠⡞⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠦⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠓⢄⣀⣀⡤⢣⠃⠀⠀│ \n -0.2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠓⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢠⠇⠀⠀⠀│ \n └────────────────────────────────────────┘ \n 
⠀-5⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀0⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.hardsigmoid","page":"Activation Functions","title":"NNlib.hardsigmoid","text":"hardσ(x) = max(0, min(1, (x + 3) / 6))\n\nPiecewise linear approximation of sigmoid.\n\njulia> lineplot(hardsigmoid, -5, 5, height=7)\n ┌────────────────────────────────────────┐ \n 1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⢀⡠⠖⠋⠉⠉⠉⠉⠉⠉⠉⠉│ hardσ(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⣀⡤⠒⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⡠⠔⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⡗⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔⠊⠁⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡤⠖⠋⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n 0 │⣀⣀⣀⣀⣀⣀⣀⣀⣀⠤⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-5⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀5⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> lineplot(sigmoid, -5, 5, height=7)\n ┌────────────────────────────────────────┐ \n 1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⣀⡠⠤⠖⠒⠒⠋⠉⠉⠉⠉⠉⠉│ σ(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⢀⡠⠖⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⣀⠔⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⡏⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡔⠋⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠤⠊⠁⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n 0 │⣀⣀⣀⣀⣀⣀⣀⠤⠤⠤⠒⠊⠉⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-5⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀5⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.hardswish","page":"Activation Functions","title":"NNlib.hardswish","text":"hardswish(x) = x * hardσ(x)\n\nHard-Swish activation function. 
See \"Searching for MobileNetV3\".\n\njulia> lineplot(hardswish, -2, 5, height = 7)\n ┌────────────────────────────────────────┐ \n 5 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠔⠒⠉│ hardswish(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡠⠔⠒⠉⠁⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠤⠖⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠔⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⢀⣀⠤⠖⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣇⣤⣤⣖⣚⣉⣁⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀│ \n -1 │⠉⠒⠒⠒⠒⠉⠉⠉⠉⠁⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀5⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> lineplot(hardswish, -4, 0, height = 7);\n\njulia> lineplot!(ans, swish)\n ┌────────────────────────────────────────┐ \n 0 │⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⢣⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡜│ hardswish(x)\n │⠒⠒⠢⠤⢄⣀⡀⠀⠀⠀⠀⠱⡄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⠎⠀│ swish(x) \n │⠀⠀⠀⠀⠀⠀⠈⠉⠑⠒⠦⢄⣘⢄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡴⠃⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠑⡖⠦⢄⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⢔⠏⠁⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠣⣄⠀⠉⠑⠒⠦⠤⢄⣀⣀⣀⣀⡠⠤⠖⣊⠕⠁⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠓⠤⡀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡤⠖⠁⠀⠀⠀⠀⠀⠀⠀│ \n -0.4 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠉⠒⠢⠤⠤⠔⠒⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-4⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀0⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> hardswish.(-5:5)'\n1×11 adjoint(::Vector{Float64}) with eltype Float64:\n -0.0 -0.0 -0.0 -0.333333 -0.333333 0.0 0.666667 1.66667 3.0 4.0 5.0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.hardtanh","page":"Activation Functions","title":"NNlib.hardtanh","text":"hardtanh(x) = max(-1, min(1, x))\n\nSegment-wise linear approximation of tanh, much cheaper to compute. 
See \"Large Scale Machine Learning\".\n\nSee also tanh_fast.\n\njulia> lineplot(hardtanh, -2, 2, height=7)\n ┌────────────────────────────────────────┐ \n 1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⣀⠔⠋⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉│ hardtanh(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⣀⡤⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⢀⡤⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⢤⡤⡷⠥⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⠖⠁⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⠖⠋⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -1 │⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⠔⠋⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x\n\njulia> lineplot(tanh, -2, 2, height=7)\n ┌────────────────────────────────────────┐ \n 1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⣀⡠⠤⠤⠒⠒⠒⠊⠉⠉⠉│ tanh(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⢀⡠⠔⠊⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⢀⡤⠒⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⢤⡤⡷⠥⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡤⠖⠁⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡠⠔⠊⠁⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -1 │⣀⣀⣀⡠⠤⠤⠤⠖⠒⠊⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.leakyrelu","page":"Activation Functions","title":"NNlib.leakyrelu","text":"leakyrelu(x, a=0.01) = max(a*x, x)\n\nLeaky Rectified Linear Unit activation function. You can also specify the coefficient explicitly, e.g. 
leakyrelu(x, 0.01).\n\njulia> lineplot(x -> leakyrelu(x, 0.5), -2, 2, height=7)\n ┌────────────────────────────────────────┐ \n 2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠤⠒⠉│ #42(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠔⠊⠉⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⣀⡤⠖⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⣀⠤⠖⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⣤⣤⡤⡧⠶⠭⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⣀⠤⠤⠒⠒⠋⠉⠁⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -1 │⣀⣀⠤⠤⠒⠒⠊⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> leakyrelu(-10f0, 0.2)\n-2.0f0\n\njulia> leakyrelu(-10f0, 0.02)\n-0.2f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.lisht","page":"Activation Functions","title":"NNlib.lisht","text":"lisht(x) = x * tanh(x)\n\nActivation function from \"LiSHT: Non-Parametric Linearly Scaled Hyperbolic Tangent ...\"\n\njulia> lineplot(lisht, -2, 2, height=7)\n ┌────────────────────────────────────────┐ \n 2 │⠢⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔│ lisht(x)\n │⠀⠈⠑⢦⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡤⠊⠁⠀│ \n │⠀⠀⠀⠀⠈⠣⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔⠁⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠑⢆⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⠊⠁⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠢⡄⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⢀⠔⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠓⢄⡀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⢀⡠⠖⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n 0 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠓⠦⣄⣀⣀⣇⣀⣀⠤⠒⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> lineplot!(ans, logcosh)\n ┌────────────────────────────────────────┐ \n 2 │⠢⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔│ lisht(x) \n │⠀⠈⠑⢦⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡤⠊⠁⠀│ logcosh(x)\n │⠢⣄⠀⠀⠈⠣⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔⠁⠀⠀⣀⠔│ \n f(x) │⠀⠈⠑⠢⣀⠀⠀⠑⢆⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⠊⠁⠀⣀⠔⠊⠁⠀│ \n │⠀⠀⠀⠀⠀⠉⠢⢄⡀⠉⠢⡄⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⢀⠔⠋⠀⡠⠔⠋⠁⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠉⠓⠦⣌⡓⢄⡀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⢀⡠⠖⣁⠤⠒⠉⠀⠀⠀⠀⠀⠀⠀⠀│ \n 0 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠓⠪⠷⣦⣄⣀⣀⣇⣀⣀⣤⠶⠕⠒⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n 
⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.logcosh","page":"Activation Functions","title":"NNlib.logcosh","text":"logcosh(x)\n\nReturn log(cosh(x)) which is computed in a numerically stable way.\n\njulia> lineplot(logcosh, -5, 5, height=7)\n ┌────────────────────────────────────────┐ \n 5 │⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ logcosh(x)\n │⠉⠢⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔⠋│ \n │⠀⠀⠀⠑⠢⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔⠊⠁⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠑⠦⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠊⠁⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠑⠦⡀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⢀⡤⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠓⠦⡀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⢀⡤⠒⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n 0 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠑⠢⢄⣀⣀⣇⣀⡠⠔⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-5⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀5⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.logsigmoid","page":"Activation Functions","title":"NNlib.logsigmoid","text":"logσ(x)\n\nReturn log(σ(x)) which is computed in a numerically stable way.\n\njulia> lineplot(logsigmoid, -5, 5, height=7)\n ┌────────────────────────────────────────┐ \n 0 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡧⠤⠔⠒⠒⠒⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉│ logσ(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠖⠊⠉⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠤⠒⠉⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⢀⡤⠖⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⣀⠔⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⡤⠖⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -6 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-5⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀5⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.mish","page":"Activation Functions","title":"NNlib.mish","text":"mish(x) = x * tanh(softplus(x))\n\nActivation function from \"Mish: A Self Regularized Non-Monotonic Neural Activation Function\".\n\njulia> 
lineplot(mish, -5, 5, height=7)\n ┌────────────────────────────────────────┐ \n 5 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠖⠋│ mish(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠒⠁⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⠔⠋⠁⠀⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⢀⡠⠖⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⢀⡤⠖⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣧⣔⣊⣁⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀│ \n -1 │⠀⠀⠀⠀⠀⠀⠀⠀⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-5⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀5⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.relu","page":"Activation Functions","title":"NNlib.relu","text":"relu(x) = max(0, x)\n\nRectified Linear Unit activation function.\n\njulia> lineplot(relu, -2, 2, height=7)\n ┌────────────────────────────────────────┐ \n 2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔⠋│ relu(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠊⠁⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡤⠊⠁⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⢀⡤⠖⠁⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⡠⠖⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⡠⠖⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n 0 │⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣇⠔⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.relu6","page":"Activation Functions","title":"NNlib.relu6","text":"relu6(x) = min(max(0, x), 6)\n\nRectified Linear Unit activation function capped at 6. 
See \"Convolutional Deep Belief Networks on CIFAR-10\".\n\njulia> lineplot(relu6, -10, 10, height=7)\n ┌────────────────────────────────────────┐ \n 6 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠎⠉⠉⠉⠉⠉⠉⠉⠉│ relu6(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⢀⡔⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⡤⠃⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⡠⠎⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⢀⠖⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⡔⠃⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n 0 │⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⡧⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-10⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀10⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.rrelu","page":"Activation Functions","title":"NNlib.rrelu","text":"rrelu(x, lo=1/8, hi=1/3) = max(a*x, x)\n# where `a` is randomly sampled from uniform distribution `U(lo, hi)`\n\nRandomized Leaky Rectified Linear Unit activation function. See \"Empirical Evaluation of Rectified Activations\". You can also specify the bound explicitly, e.g. rrelu(x, 0.0, 1.0).\n\njulia> lineplot(rrelu, -20, 10, height=7)\n ┌────────────────────────────────────────┐ \n 10 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠖⠋│ rrelu(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⢀⡠⠖⠋⠁⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⠀⢀⡠⠔⠊⠁⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⢤⡤⠤⣤⣤⢤⣤⣤⠤⠤⠤⢼⠮⠥⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⣰⢀⣆⡄⣄⡄⡠⡰⠦⠷⡜⢢⠷⠳⠢⠊⠉⠉⠀⠀⠁⠀⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠃⠉⠙⠘⠃⠈⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -10 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-20⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀10⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> extrema(rrelu.(fill(-10f0, 1000)))\n(-3.3316886f0, -1.2548422f0)\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.selu","page":"Activation Functions","title":"NNlib.selu","text":"selu(x) = λ * (x ≥ 0 ? x : α * (exp(x) - 1))\n\nλ ≈ 1.05070...\nα ≈ 1.67326...\n\nScaled exponential linear units. 
See \"Self-Normalizing Neural Networks\".\n\njulia> lineplot(selu, -3, 2, height=7)\n ┌────────────────────────────────────────┐ \n 3 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ selu(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠤⠒│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⢀⣀⠤⠖⠊⠉⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⣀⡠⠤⠒⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⣉⠭⠛⡏⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⣀⡤⠤⠒⠊⠉⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -2 │⠤⠤⠖⠒⠒⠒⠒⠒⠒⠒⠉⠉⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-3⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> selu(-10f0)\n-1.7580194f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.sigmoid","page":"Activation Functions","title":"NNlib.sigmoid","text":"σ(x) = 1 / (1 + exp(-x))\n\nClassic sigmoid activation function. Unicode σ can be entered as \\sigma then tab, in many editors. The ascii name sigmoid is also exported.\n\nSee also sigmoid_fast.\n\njulia> using UnicodePlots\n\njulia> lineplot(sigmoid, -5, 5, height=7)\n ┌────────────────────────────────────────┐ \n 1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⣀⡠⠤⠖⠒⠒⠋⠉⠉⠉⠉⠉⠉│ σ(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⢀⡠⠖⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⣀⠔⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⡏⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡔⠋⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠤⠊⠁⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n 0 │⣀⣀⣀⣀⣀⣀⣀⠤⠤⠤⠒⠊⠉⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-5⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀5⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> sigmoid === σ\ntrue\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.sigmoid_fast","page":"Activation Functions","title":"NNlib.sigmoid_fast","text":"sigmoid_fast(x)\n\nThis is a faster, and very slightly less accurate, version of sigmoid. 
For `x::Float32, perhaps 3 times faster, and maximum errors 2 eps instead of 1.\n\nSee also tanh_fast.\n\njulia> sigmoid(0.2f0)\n0.54983395f0\n\njulia> sigmoid_fast(0.2f0)\n0.54983395f0\n\njulia> hardσ(0.2f0)\n0.53333336f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.softplus","page":"Activation Functions","title":"NNlib.softplus","text":"softplus(x) = log(exp(x) + 1)\n\nSee \"Deep Sparse Rectifier Neural Networks\", JMLR 2011.\n\njulia> lineplot(softplus, -3, 3, height=7)\n ┌────────────────────────────────────────┐ \n 4 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ softplus(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠔⠊⠁⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠔⠊⠁⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⣀⡠⠤⠒⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⣀⡧⠤⠒⠊⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n 0 │⣀⣀⣀⣀⣀⣀⣀⡠⠤⠤⠤⠤⠔⠒⠒⠚⠉⠉⠁⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-3⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀3⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> lineplot!(ans, relu)\n ┌────────────────────────────────────────┐ \n 4 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ softplus(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣠│ relu(x) \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣠⡴⠞⠋⠁│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⣤⡴⠞⠋⠁⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⣀⡠⢤⡲⠝⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⣀⡧⠤⠒⠊⣉⠥⠚⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n 0 │⣀⣀⣀⣀⣀⣀⣀⣠⣤⣤⣤⣤⣔⣒⣒⣚⣉⣉⣁⣀⣇⠴⠒⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-3⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀3⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> softplus(16f0)\n16.0f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.softshrink","page":"Activation Functions","title":"NNlib.softshrink","text":"softshrink(x, λ=0.5) =\n (x ≥ λ ? x - λ : (-λ ≥ x ? 
x + λ : 0))\n\nSee \"Softshrink Activation Function\".\n\njulia> lineplot(softshrink, -2, 2, height=7)\n ┌────────────────────────────────────────┐ \n 2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀│ softshrink(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⡤⠔⠒⠉⠁│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⣀⡤⠤⠒⠋⠁⠀⠀⠀⠀⠀⠀│ \n f(x) │⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⣤⡤⠤⠤⠤⠤⠤⠤⡧⠤⠤⠤⠤⠶⠮⠭⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠀⠀⠀⠀⠀⠀⢀⣀⠤⠖⠒⠉⠁⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⣀⠤⠔⠒⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -2 │⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> lineplot!(ans, tanhshrink)\n ┌────────────────────────────────────────┐ \n 2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀│ softshrink(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⡤⠔⠒⣉⡡│ tanhshrink(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⣀⡤⠤⣒⣋⠥⠤⠒⠊⠉⠁⠀│ \n f(x) │⠤⠤⠤⠤⠤⠤⠤⠤⠤⣤⣤⣤⣤⡤⠤⠤⠤⠤⠤⠤⡷⠶⠶⠶⠶⠶⠾⠿⠯⠭⠭⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠀⢀⣀⡠⠤⠖⢒⣋⠭⠗⠒⠉⠁⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠊⣉⠤⠔⠒⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -2 │⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀\n\njulia> softshrink.((-10f0, 10f0))\n(-9.5f0, 9.5f0)\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.softsign","page":"Activation Functions","title":"NNlib.softsign","text":"softsign(x) = x / (1 + |x|)\n\nSee \"Quadratic Polynomials Learn Better Image Features\" (2009).\n\njulia> lineplot(softsign, -5, 5, height=7)\n ┌────────────────────────────────────────┐ \n 1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⣀⣀⣀⣀⠤⠤⠤⠤⠤│ softsign(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⣀⡤⠖⠒⠋⠉⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⡔⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⢤⡯⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔⠁⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⣀⣀⠤⠤⠒⠋⠁⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -1 │⠒⠒⠒⠒⠒⠊⠉⠉⠉⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n 
⠀-5⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀5⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> lineplot!(ans, tanh)\n ┌────────────────────────────────────────┐ \n 1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⢀⡤⠖⠊⠉⠉⠉⣉⣉⣉⣉⣉⠭⠭⠭⠭⠭│ softsign(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⡔⣃⡤⠖⠒⠋⠉⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀│ tanh(x) \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣧⡞⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⢤⡯⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡴⠃⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⣀⣀⠤⠤⠒⢋⠕⠁⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -1 │⣒⣒⣒⣒⣒⣊⣉⣉⣉⣉⣁⣀⣀⡠⠤⠒⠁⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-5⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀5⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> softsign(1f0)\n0.5f0\n\njulia> softsign(100f0)\n0.990099f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.swish","page":"Activation Functions","title":"NNlib.swish","text":"swish(x) = x * σ(x)\n\nSelf-gated activation function. See \"Swish: a Self-Gated Activation Function\".\n\njulia> lineplot(swish, -2, 2, height=7)\n ┌────────────────────────────────────────┐ \n 2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤│ swish(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠖⠋⠁⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠤⠖⠋⠁⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⢀⣀⡤⠔⠊⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⢤⣤⣤⡤⡧⠴⠶⠯⠥⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠉⠑⠒⠒⠒⠒⠒⠒⠒⠒⠒⠒⠉⠉⠉⠉⠁⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.tanhshrink","page":"Activation Functions","title":"NNlib.tanhshrink","text":"tanhshrink(x) = x - tanh(x)\n\nSee \"Tanhshrink Activation Function\".\n\njulia> lineplot(tanhshrink, -3, 3, height=7)\n ┌────────────────────────────────────────┐ \n 3 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ tanhshrink(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡠⠤⠖⠊│ \n 
 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⢀⣀⡠⠤⠒⠊⠉⠁⠀⠀⠀⠀│ \n f(x) │⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⢤⣤⡤⠤⠤⠤⠤⠤⠤⡷⠶⠶⠶⠶⠶⠮⠭⠥⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠀⠀⠀⠀⠀⣀⡠⠴⠒⠊⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⡠⠴⠒⠊⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -3 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-3⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀3⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\njulia> tanhshrink.((-10f0, 10f0))\n(-9.0f0, 9.0f0)\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.tanh_fast","page":"Activation Functions","title":"NNlib.tanh_fast","text":"tanh_fast(x)\n\nThis is a faster but slightly less accurate version of tanh.\n\nWhere Julia's tanh function has an error under 2 eps, this may be wrong by 5 eps, a reduction by less than one decimal digit. \n\nFor x::Float32 this is usually about 10 times faster, with a smaller speedup for x::Float64. For any other number types, it just calls tanh.\n\nSee also sigmoid_fast.\n\njulia> tanh(0.5f0)\n0.46211717f0\n\njulia> tanh_fast(0.5f0)\n0.46211714f0\n\njulia> hard_tanh(0.5f0)\n0.5f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#NNlib.trelu","page":"Activation Functions","title":"NNlib.trelu","text":"trelu(x, theta=1) = x > theta ? x : 0\n\nThreshold gated rectified linear activation function. 
See \"Zero-bias autoencoders and the benefits of co-adapting features\"\n\njulia> lineplot(trelu, -2, 4, height=7)\n ┌────────────────────────────────────────┐ \n 4 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠖⠋│ trelu(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠖⠋⠁⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠔⠊⠁⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠴⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⣠⠤⠒⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⡏⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n 0 │⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣇⣀⣀⣀⣀⣀⣀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀4⠀ \n ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ \n\n\n\n\n\n","category":"function"},{"location":"reference/models/activation/#One-More","page":"Activation Functions","title":"One More","text":"","category":"section"},{"location":"reference/models/activation/","page":"Activation Functions","title":"Activation Functions","text":"Julia's Base.Math also provides tanh, which can be used as an activation function.","category":"page"},{"location":"reference/models/activation/","page":"Activation Functions","title":"Activation Functions","text":"Note that many Flux layers will automatically replace this with NNlib.tanh_fast when called, as Base's tanh is slow enough to sometimes be a bottleneck.","category":"page"},{"location":"reference/models/activation/","page":"Activation Functions","title":"Activation Functions","text":"julia> using UnicodePlots\n\njulia> lineplot(tanh, -3, 3, height=7)\n ┌────────────────────────────────────────┐ \n 1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⣀⠤⠔⠒⠒⠉⠉⠉⠉⠉⠉⠉⠉⠉│ tanh(x)\n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⡠⠖⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⡰⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n f(x) │⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⡤⡯⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⠎⠁⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠴⠊⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n -1 │⣀⣀⣀⣀⣀⣀⣀⣀⣀⡤⠤⠔⠒⠉⠁⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ \n └────────────────────────────────────────┘ \n ⠀-3⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀3⠀ \n 
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ ","category":"page"},{"location":"ecosystem/#The-Julia-Ecosystem-around-Flux","page":"Ecosystem","title":"The Julia Ecosystem around Flux","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"One of the main strengths of Julia lies in an ecosystem of packages globally providing a rich and consistent user experience.","category":"page"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"This is a non-exhaustive list of Julia packages, nicely complementing Flux in typical machine learning and deep learning workflows. To add your project please send a PR. See also academic work citing Flux or citing Zygote.","category":"page"},{"location":"ecosystem/#Flux-models","page":"Ecosystem","title":"Flux models","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"Flux's model-zoo contains examples from many domains.","category":"page"},{"location":"ecosystem/#Computer-vision","page":"Ecosystem","title":"Computer vision","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"ObjectDetector.jl provides ready-to-go image detection via YOLO.\nMetalhead.jl includes many state-of-the-art computer vision models which can easily be used for transfer learning.\nUNet.jl is a generic UNet implementation.","category":"page"},{"location":"ecosystem/#Natural-language-processing","page":"Ecosystem","title":"Natural language processing","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"Transformers.jl provides components for Transformer models for NLP, as well as providing several trained models out of the box.\nTextAnalysis.jl provides several NLP algorithms that use Flux models under the hood.","category":"page"},{"location":"ecosystem/#Reinforcement-learning","page":"Ecosystem","title":"Reinforcement 
learning","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"AlphaZero.jl provides a generic, simple and fast implementation of DeepMind's AlphaZero algorithm.\nReinforcementLearning.jl offers a collection of tools for doing reinforcement learning research in Julia.","category":"page"},{"location":"ecosystem/#Graph-learning","page":"Ecosystem","title":"Graph learning","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"GraphNeuralNetworks.jl is a fresh, performant and flexible graph neural network library based on Flux.jl.\nGeometricFlux.jl is the first graph neural network library for Julia. \nNeuralOperators.jl enables training infinite dimensional PDEs by learning a continuous function instead of using the finite element method.\nSeaPearl.jl is a Constraint Programming solver that uses Reinforcement Learning based on graphs as input.","category":"page"},{"location":"ecosystem/#Time-series","page":"Ecosystem","title":"Time series","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"FluxArchitectures.jl is a collection of advanced network architectures for time series forecasting.","category":"page"},{"location":"ecosystem/#Robust-networks","page":"Ecosystem","title":"Robust networks","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"RobustNeuralNetworks.jl includes classes of neural networks that are constructed to naturally satisfy robustness constraints.","category":"page"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"","category":"page"},{"location":"ecosystem/#Tools-closely-associated-with-Flux","page":"Ecosystem","title":"Tools closely associated with Flux","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"Utility tools you're unlikely to have met if you never used 
Flux!","category":"page"},{"location":"ecosystem/#High-level-training-flows","page":"Ecosystem","title":"High-level training flows","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"FastAI.jl is a Julia port of Python's fast.ai library.\nFluxTraining.jl is a package for using and writing powerful, extensible training loops for deep learning models. It supports callbacks for many common use cases like hyperparameter scheduling, metrics tracking and logging, checkpointing, early stopping, and more. It powers training in FastAI.jl.\nIgnite.jl is a Julia port of the Python library ignite for simplifying neural network training and validation loops, using events and handlers.\nTsunami.jl adds high-level ways to control training, parameter schedules & logging, heavily inspired by pytorch-lightning.","category":"page"},{"location":"ecosystem/#Datasets","page":"Ecosystem","title":"Datasets","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"Commonly used machine learning datasets are provided by the following packages in the Julia ecosystem:","category":"page"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"MLDatasets.jl focuses on downloading, unpacking, and accessing benchmark datasets.\nGraphMLDatasets.jl: a library for machine learning datasets on graphs.","category":"page"},{"location":"ecosystem/#Plumbing","page":"Ecosystem","title":"Plumbing","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"Tools to put data into the right order for creating a model.","category":"page"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"Augmentor.jl is a real-time image augmentation library for increasing the number of training images.\nDataAugmentation.jl aims to make it easy to build stochastic, label-preserving augmentation pipelines for vision use cases involving images, 
keypoints and segmentation masks.\nMLUtils.jl (replaces MLDataUtils.jl and MLLabelUtils.jl) is a library for processing Machine Learning datasets.","category":"page"},{"location":"ecosystem/#Parameters","page":"Ecosystem","title":"Parameters","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"ParameterSchedulers.jl standard scheduling policies for machine learning.","category":"page"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"","category":"page"},{"location":"ecosystem/#Differentiable-programming","page":"Ecosystem","title":"Differentiable programming","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"Packages based on differentiable programming but not necessarily related to Machine Learning. ","category":"page"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"The SciML ecosystem uses Flux and Zygote to mix neural nets with differential equations, to get the best of black box and mechanistic modelling.\nDiffEqFlux.jl provides tools for creating Neural Differential Equations.\nFlux3D.jl shows off machine learning on 3D data.\nRayTracer.jl combines ML with computer vision via a differentiable renderer.\nDuckietown.jl Differentiable Duckietown simulator.\nThe Yao.jl project uses Flux and Zygote for Quantum Differentiable Programming.\nAtomicGraphNets.jl enables learning graph based models on atomic systems used in chemistry.\nDiffImages.jl differentiable computer vision modeling in Julia with the Images.jl ecosystem.","category":"page"},{"location":"ecosystem/#Probabilistic-programming","page":"Ecosystem","title":"Probabilistic programming","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"Turing.jl extends Flux's differentiable programming capabilities to probabilistic programming.\nOmega.jl is a research project aimed at causal, higher-order 
probabilistic programming.\nStheno.jl provides flexible Gaussian processes.","category":"page"},{"location":"ecosystem/#Statistics","page":"Ecosystem","title":"Statistics","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"OnlineStats.jl provides single-pass algorithms for statistics.","category":"page"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"","category":"page"},{"location":"ecosystem/#Useful-miscellaneous-packages","page":"Ecosystem","title":"Useful miscellaneous packages","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"Some useful and random packages!","category":"page"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"AdversarialPrediction.jl provides a way to easily optimise generic performance metrics in supervised learning settings using the Adversarial Prediction framework.\nMill.jl helps to prototype flexible multi-instance learning models.\nMLMetrics.jl is a utility for scoring models in data science and machine learning.\nTorch.jl exposes torch in Julia.\nValueHistories.jl is a utility for efficient tracking of optimization histories, training curves or other information of arbitrary types and at arbitrarily spaced sampling times.\nInvertibleNetworks.jl Building blocks for invertible neural networks in the Julia programming language.\nProgressMeter.jl progress meters for long-running computations.\nTensorBoardLogger.jl easy peasy logging to tensorboard in Julia\nArgParse.jl is a package for parsing command-line arguments to Julia programs.\nParameters.jl types with default field values, keyword constructors and (un-)pack macros.\nBSON.jl is a package for working with the Binary JSON serialisation format.\nDataFrames.jl in-memory tabular data in Julia.\nDrWatson.jl is a scientific project assistant 
software.","category":"page"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"This tight integration among Julia packages is shown in some of the examples in the model-zoo repository.","category":"page"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"","category":"page"},{"location":"ecosystem/#Alternatives-to-Flux","page":"Ecosystem","title":"Alternatives to Flux","text":"","category":"section"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"Julia has several other libraries for making neural networks. ","category":"page"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"SimpleChains.jl is focused on making small, simple, CPU-based, neural networks fast. Uses LoopVectorization.jl. (Was FastChain in DiffEqFlux.jl) \nKnet.jl is a neural network library built around AutoGrad.jl.\nLux.jl (earlier ExplicitFluxLayers.jl) shares much of the design, use-case, and NNlib.jl / Optimisers.jl back-end of Flux. But instead of encapsulating all parameters within the model structure, it separates this into 3 components: a model, a tree of parameters, and a tree of model states.","category":"page"},{"location":"ecosystem/","page":"Ecosystem","title":"Ecosystem","text":"compat: Explicit or explicit?\nFlux's training docs talk about changes from Zygote's implicit to explicit gradients, dictionary-like to tree-like structures. (See also Zygote's description of these.) 
Lux also uses Zygote, but uses the word \"explicit\" to mean something unrelated, namely storing the tree of parameters (and of state) separately from the model.","category":"page"},{"location":"reference/models/functors/","page":"Nested Structures – Functors.jl","title":"Nested Structures – Functors.jl","text":"CollapsedDocStrings = true","category":"page"},{"location":"reference/models/functors/#Recursive-transformations-from-Functors.jl","page":"Nested Structures – Functors.jl","title":"Recursive transformations from Functors.jl","text":"","category":"section"},{"location":"reference/models/functors/","page":"Nested Structures – Functors.jl","title":"Nested Structures – Functors.jl","text":"Flux models are deeply nested structures, and Functors.jl provides tools needed to explore such objects, apply functions to the parameters they contain, and re-build them.","category":"page"},{"location":"reference/models/functors/","page":"Nested Structures – Functors.jl","title":"Nested Structures – Functors.jl","text":"compat: Flux ≤ 0.14\nAll layers were previously defined with the Functors.@functor macro. This still works, but it is recommended that you use the new Flux.@layer macro instead. Both allow Flux.setup to see the parameters inside, and gpu to move them to the GPU, but Flux.@layer also overloads printing, and offers a way to define trainable at the same time.","category":"page"},{"location":"reference/models/functors/","page":"Nested Structures – Functors.jl","title":"Nested Structures – Functors.jl","text":"Functors.jl has its own notes on basic usage for more details. 
Additionally, the Advanced Model Building and Customisation page covers the use cases of Functors in greater detail.","category":"page"},{"location":"reference/models/functors/","page":"Nested Structures – Functors.jl","title":"Nested Structures – Functors.jl","text":"Flux.@layer\nFunctors.@functor\nFunctors.fmap\nFunctors.fmap_with_path\nFunctors.isleaf\nFunctors.children\nFunctors.fcollect\nFunctors.functor\nFunctors.fmapstructure\nFunctors.fmapstructure_with_path\nFunctors.execute\nFunctors.AbstractWalk\nFunctors.ExcludeWalk\nFunctors.CachedWalk","category":"page"},{"location":"reference/models/functors/#Flux.@layer","page":"Nested Structures – Functors.jl","title":"Flux.@layer","text":"@layer Dense\n@layer :expand Chain\n@layer BatchNorm trainable=(β,γ)\n\nThis macro replaces most uses of @functor. Its basic purpose is the same: When you define a new layer, this tells Flux to explore inside it to see the parameters it trains, and also to move them to the GPU, change precision, etc.\n\nLike @functor, this assumes your struct has the default constructor, to enable re-building. If you define an inner constructor (i.e. a function within the struct block) things may break.\n\nThe keyword trainable allows you to limit this exploration, instead of visiting all fieldnames(T). 
Note that it is never necessary to tell Flux to ignore non-array objects such as functions or sizes.\n\nThe macro also handles overloads of show for pretty printing.\n\nBy default, it adds methods to 3-arg Base.show to treat your layer much like Dense or Conv.\nIf your layer is a container, more like Chain or Parallel, then :expand makes show unfold its contents.\nTo disable all show overloads, there is an :ignore option too.\n\n(You probably still want to define 2-arg show(io::IO, x::Layer), the macro does not touch this.)\n\nNote that re-running the macro with different options may not remove all methods, you will need to restart.\n\nExample\n\njulia> struct Trio; a; b; c end\n\njulia> tri = Trio(Dense([1.1 2.2], [0.0], tanh), Dense(hcat(3.3), false), Dropout(0.4))\nTrio(Dense(2 => 1, tanh), Dense(1 => 1; bias=false), Dropout(0.4))\n\njulia> Flux.destructure(tri) # parameters are not yet visible to Flux\n(Bool[], Restructure(Trio, ..., 0))\n\njulia> Flux.@layer :expand Trio\n\njulia> Flux.destructure(tri) # now gpu, params, train!, etc will see inside too\n([1.1, 2.2, 0.0, 3.3], Restructure(Trio, ..., 4))\n\njulia> tri # and layer is printed like Chain\nTrio(\n Dense(2 => 1, tanh), # 3 parameters\n Dense(1 => 1; bias=false), # 1 parameters\n Dropout(0.4),\n) # Total: 3 arrays, 4 parameters, 224 bytes.\n\n\n\n\n\n","category":"macro"},{"location":"reference/models/functors/#Functors.@functor","page":"Nested Structures – Functors.jl","title":"Functors.@functor","text":"@functor T\n@functor T (x,)\n\nAdds methods to functor allowing recursion into objects of type T, and reconstruction. 
Assumes that T has a constructor accepting all of its fields, which is true unless you have provided an inner constructor which does not.\n\nBy default all fields of T are considered children; this can be restricted by providing a tuple of field names.\n\nExamples\n\njulia> struct Foo; x; y; end\n\njulia> @functor Foo\n\njulia> Functors.children(Foo(1,2))\n(x = 1, y = 2)\n\njulia> _, re = Functors.functor(Foo(1,2));\n\njulia> re((10, 20))\nFoo(10, 20)\n\njulia> struct TwoThirds a; b; c; end\n\njulia> @functor TwoThirds (a, c)\n\njulia> ch2, re3 = Functors.functor(TwoThirds(10,20,30));\n\njulia> ch2\n(a = 10, c = 30)\n\njulia> re3((\"ten\", \"thirty\"))\nTwoThirds(\"ten\", 20, \"thirty\")\n\njulia> fmap(x -> 10x, TwoThirds(Foo(1,2), Foo(3,4), 56))\nTwoThirds(Foo(10, 20), Foo(3, 4), 560)\n\n\n\n\n\n","category":"macro"},{"location":"reference/models/functors/#Functors.fmap","page":"Nested Structures – Functors.jl","title":"Functors.fmap","text":"fmap(f, x, ys...; exclude = Functors.isleaf, walk = Functors.DefaultWalk(), [prune])\n\nA structure and type preserving map.\n\nBy default it transforms every leaf node (identified by exclude, default isleaf) by applying f, and otherwise traverses x recursively using functor. Optionally, it may also be associated with objects ys with the same tree structure. 
In that case, f is applied to the corresponding leaf nodes in x and ys.\n\nSee also fmap_with_path and fmapstructure.\n\nExamples\n\njulia> fmap(string, (x=1, y=(2, 3)))\n(x = \"1\", y = (\"2\", \"3\"))\n\njulia> nt = (a = [1,2], b = [23, (45,), (x=6//7, y=())], c = [8,9]);\n\njulia> fmap(println, nt)\n[1, 2]\n23\n45\n6//7\n()\n[8, 9]\n(a = nothing, b = Any[nothing, (nothing,), (x = nothing, y = nothing)], c = nothing)\n\njulia> fmap(println, nt; exclude = x -> x isa Array)\n[1, 2]\nAny[23, (45,), (x = 6//7, y = ())]\n[8, 9]\n(a = nothing, b = nothing, c = nothing)\n\njulia> twice = [1, 2]; # println only acts once on this\n\njulia> fmap(println, (i = twice, ii = 34, iii = [5, 6], iv = (twice, 34), v = 34.0))\n[1, 2]\n34\n[5, 6]\n34\n34.0\n(i = nothing, ii = nothing, iii = nothing, iv = (nothing, nothing), v = nothing)\n\njulia> d1 = Dict(\"x\" => [1,2], \"y\" => 3);\n\njulia> d2 = Dict(\"x\" => [4,5], \"y\" => 6, \"z\" => \"an_extra_value\");\n\njulia> fmap(+, d1, d2) == Dict(\"x\" => [5, 7], \"y\" => 9) # Note that \"z\" is ignored\ntrue\n\nMutable objects which appear more than once are only handled once (by caching f(x) in an IdDict). Thus the relationship x.i === x.iv[1] will be preserved. An immutable object which appears twice is not stored in the cache, thus f(34) will be called twice, and the results will agree only if f is pure.\n\nBy default, Tuples, NamedTuples, and some other container-like types in Base have children to recurse into. Arrays of numbers do not. 
To enable recursion into new types, you must provide a method of functor, which can be done using the macro @functor:\n\njulia> struct Foo; x; y; end\n\njulia> @functor Foo\n\njulia> struct Bar; x; end\n\njulia> @functor Bar\n\njulia> m = Foo(Bar([1,2,3]), (4, 5, Bar(Foo(6, 7))));\n\njulia> fmap(x -> 10x, m)\nFoo(Bar([10, 20, 30]), (40, 50, Bar(Foo(60, 70))))\n\njulia> fmap(string, m)\nFoo(Bar(\"[1, 2, 3]\"), (\"4\", \"5\", Bar(Foo(\"6\", \"7\"))))\n\njulia> fmap(string, m, exclude = v -> v isa Bar)\nFoo(\"Bar([1, 2, 3])\", (4, 5, \"Bar(Foo(6, 7))\"))\n\nTo recurse into custom types without reconstructing them afterwards, use fmapstructure.\n\nFor advanced customization of the traversal behaviour, pass a custom walk function that subtypes Functors.AbstractWalk. The call fmap(f, x, ys...; walk = mywalk) will wrap mywalk in ExcludeWalk then CachedWalk. Here, ExcludeWalk is responsible for applying f at excluded nodes. For a low-level interface for executing a user-constructed walk, see execute.\n\njulia> struct MyWalk <: Functors.AbstractWalk end\n\njulia> (::MyWalk)(recurse, x) = x isa Bar ? \"hello\" :\n Functors.DefaultWalk()(recurse, x)\n\njulia> fmap(x -> 10x, m; walk = MyWalk())\nFoo(\"hello\", (40, 50, \"hello\"))\n\nThe behaviour when the same node appears twice can be altered by giving a value to the prune keyword, which is then used in place of all but the first:\n\njulia> twice = [1, 2];\n\njulia> fmap(float, (x = twice, y = [1,2], z = twice); prune = missing)\n(x = [1.0, 2.0], y = [1.0, 2.0], z = missing)\n\n\n\n\n\n","category":"function"},{"location":"reference/models/functors/#Functors.fmap_with_path","page":"Nested Structures – Functors.jl","title":"Functors.fmap_with_path","text":"fmap_with_path(f, x, ys...; exclude = isleaf, walk = DefaultWalkWithPath(), [prune])\n\nLike fmap, but also passes a KeyPath to f for each node in the recursion. The KeyPath is a tuple of the indices used to reach the current node from the root of the recursion. 
The KeyPath is constructed by the walk function, and can be used to reconstruct the path to the current node from the root of the recursion.\n\nf has to accept two arguments: the associated KeyPath and the value of the current node.\n\nexclude also receives the KeyPath as its first argument and a node as its second. It should return true if the recursion should not continue into the node's children and f should instead be applied to the node itself.\n\nprune is used to control the behaviour when the same node appears twice, see fmap for more information.\n\nExamples\n\njulia> x = ([1, 2, 3], 4, (a=5, b=Dict(\"A\"=>6, \"B\"=>7), c=Dict(\"C\"=>8, \"D\"=>9)));\n\njulia> exclude(kp, x) = kp == KeyPath(3, :c) || Functors.isleaf(x);\n\njulia> fmap_with_path((kp, x) -> x isa Dict ? nothing : x.^2, x; exclude = exclude)\n([1, 4, 9], 16, (a = 25, b = Dict(\"B\" => 49, \"A\" => 36), c = nothing))\n\n\n\n\n\n","category":"function"},{"location":"reference/models/functors/#Functors.isleaf","page":"Nested Structures – Functors.jl","title":"Functors.isleaf","text":"Functors.isleaf(x)\n\nReturn true if x has no children according to functor.\n\nExamples\n\njulia> Functors.isleaf(1)\ntrue\n\njulia> Functors.isleaf([2, 3, 4])\ntrue\n\njulia> Functors.isleaf([\"five\", [6, 7]])\nfalse\n\njulia> Functors.isleaf([])\nfalse\n\njulia> Functors.isleaf((8, 9))\nfalse\n\njulia> Functors.isleaf(())\ntrue\n\n\n\n\n\n","category":"function"},{"location":"reference/models/functors/#Functors.children","page":"Nested Structures – Functors.jl","title":"Functors.children","text":"Functors.children(x)\n\nReturn the children of x as defined by functor. 
Equivalent to functor(x)[1].\n\n\n\n\n\n","category":"function"},{"location":"reference/models/functors/#Functors.fcollect","page":"Nested Structures – Functors.jl","title":"Functors.fcollect","text":"fcollect(x; exclude = v -> false)\n\nTraverse x by recursing into each child of x as defined by functor and collecting the results into a flat array, ordered by a breadth-first traversal of x, respecting the iteration order of children calls.\n\nDoesn't recurse inside branches rooted at nodes v for which exclude(v) == true. In such cases, the root v is also excluded from the result. By default, exclude always yields false.\n\nSee also children.\n\nExamples\n\njulia> struct Foo; x; y; end\n\njulia> @functor Foo\n\njulia> struct Bar; x; end\n\njulia> @functor Bar\n\njulia> struct TypeWithNoChildren; x; y; end\n\njulia> m = Foo(Bar([1,2,3]), TypeWithNoChildren(:a, :b))\nFoo(Bar([1, 2, 3]), TypeWithNoChildren(:a, :b))\n\njulia> fcollect(m)\n4-element Vector{Any}:\n Foo(Bar([1, 2, 3]), TypeWithNoChildren(:a, :b))\n Bar([1, 2, 3])\n [1, 2, 3]\n TypeWithNoChildren(:a, :b)\n\njulia> fcollect(m, exclude = v -> v isa Bar)\n2-element Vector{Any}:\n Foo(Bar([1, 2, 3]), TypeWithNoChildren(:a, :b))\n TypeWithNoChildren(:a, :b)\n\njulia> fcollect(m, exclude = v -> Functors.isleaf(v))\n2-element Vector{Any}:\n Foo(Bar([1, 2, 3]), TypeWithNoChildren(:a, :b))\n Bar([1, 2, 3])\n\n\n\n\n\n","category":"function"},{"location":"reference/models/functors/#Functors.functor","page":"Nested Structures – Functors.jl","title":"Functors.functor","text":"Functors.functor(x) = functor(typeof(x), x)\n\nReturns a tuple containing, first, a NamedTuple of the children of x (typically its fields), and second, a reconstruction function. 
This controls the behaviour of fmap.\n\nMethods should be added to functor(::Type{T}, x) for custom types, usually using the macro @functor.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/functors/#Functors.fmapstructure","page":"Nested Structures – Functors.jl","title":"Functors.fmapstructure","text":"fmapstructure(f, x, ys...; exclude = isleaf, [prune])\n\nLike fmap, but doesn't preserve the type of custom structs. Instead, it returns a NamedTuple (or a Tuple, or an array), or a nested set of these.\n\nUseful for when the output must not contain custom structs.\n\nSee also fmap and fmapstructure_with_path.\n\nExamples\n\njulia> struct Foo; x; y; end\n\njulia> @functor Foo\n\njulia> m = Foo([1,2,3], [4, (5, 6), Foo(7, 8)]);\n\njulia> fmapstructure(x -> 2x, m)\n(x = [2, 4, 6], y = Any[8, (10, 12), (x = 14, y = 16)])\n\njulia> fmapstructure(println, m)\n[1, 2, 3]\n4\n5\n6\n7\n8\n(x = nothing, y = Any[nothing, (nothing, nothing), (x = nothing, y = nothing)])\n\n\n\n\n\n","category":"function"},{"location":"reference/models/functors/#Functors.fmapstructure_with_path","page":"Nested Structures – Functors.jl","title":"Functors.fmapstructure_with_path","text":"fmapstructure_with_path(f, x, ys...; [exclude, prune])\n\nLike fmap_with_path, but doesn't preserve the type of custom structs. Instead, it returns a named tuple, a tuple, an array, a dict, or a nested set of these.\n\nSee also fmapstructure.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/functors/#Functors.execute","page":"Nested Structures – Functors.jl","title":"Functors.execute","text":"execute(walk, x, ys...)\n\nExecute a walk that recursively calls itself, starting at a node x in a Functors tree, as well as optional associated nodes ys... in other Functors trees. 
Any custom walk function that subtypes Functors.AbstractWalk is permitted.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/functors/#Functors.AbstractWalk","page":"Nested Structures – Functors.jl","title":"Functors.AbstractWalk","text":"AbstractWalk\n\nAny walk for use with fmap should inherit from this type. A walk subtyping AbstractWalk must satisfy the walk function interface:\n\nstruct MyWalk <: AbstractWalk end\n\nfunction (::MyWalk)(recurse, x, ys...)\n # implement this\nend\n\nThe walk function is called on a node x in a Functors tree. It may also be passed associated nodes ys... in other Functors trees. The walk function recurses further into (x, ys...) by calling recurse on the child nodes. The choice of which nodes to recurse and in what order is custom to the walk.\n\n\n\n\n\n","category":"type"},{"location":"reference/models/functors/#Functors.ExcludeWalk","page":"Nested Structures – Functors.jl","title":"Functors.ExcludeWalk","text":"ExcludeWalk(walk, fn, exclude)\n\nA walk that recurses nodes (x, ys...) according to walk, except when exclude(x) is true. Then, fn(x, ys...) is applied instead of recursing further.\n\nTypically wraps an existing walk for use with fmap.\n\n\n\n\n\n","category":"type"},{"location":"reference/models/functors/#Functors.CachedWalk","page":"Nested Structures – Functors.jl","title":"Functors.CachedWalk","text":"CachedWalk(walk[; prune])\n\nA walk that recurses nodes (x, ys...) according to walk and storing the output of the recursion in a cache indexed by x (based on object ID). Whenever the cache already contains x, either:\n\nprune is specified, then it is returned, or\nprune is unspecified, and the previously cached recursion of (x, ys...) 
is returned.\n\nTypically wraps an existing walk for use with fmap.\n\n\n\n\n\n","category":"type"},{"location":"reference/models/functors/#Moving-models,-or-data,-to-the-GPU","page":"Nested Structures – Functors.jl","title":"Moving models, or data, to the GPU","text":"","category":"section"},{"location":"reference/models/functors/","page":"Nested Structures – Functors.jl","title":"Nested Structures – Functors.jl","text":"Flux provides some convenience functions based on fmap. Some (f16, f32, f64) change the precision of all arrays in a model. Others are used for moving a model to or from GPU memory:","category":"page"},{"location":"reference/models/functors/","page":"Nested Structures – Functors.jl","title":"Nested Structures – Functors.jl","text":"cpu\ngpu(::Any)\ngpu(::Flux.DataLoader)","category":"page"},{"location":"reference/models/functors/#Flux.cpu","page":"Nested Structures – Functors.jl","title":"Flux.cpu","text":"cpu(m)\n\nCopies m onto the CPU, the opposite of gpu. Recurses into structs marked @functor.\n\nExample\n\njulia> m_gpu = Dense(CUDA.randn(2, 5))\nDense(5 => 2) # 12 parameters\n\njulia> m_gpu.bias # matches the given weight matrix\n2-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:\n 0.0\n 0.0\n\njulia> m = m_gpu |> cpu\nDense(5 => 2) # 12 parameters\n\njulia> m.bias\n2-element Vector{Float32}:\n 0.0\n 0.0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/functors/#Flux.gpu-Tuple{Any}","page":"Nested Structures – Functors.jl","title":"Flux.gpu","text":"gpu(m)\n\nCopies m to the current GPU device (using current GPU backend), if one is available. If no GPU is available, it does nothing (but prints a warning the first time).\n\nOn arrays, this calls CUDA's cu, which also changes arrays with Float64 elements to Float32 while copying them to the device (same for AMDGPU). To act on arrays within a struct, the struct type must be marked with @functor.\n\nUse cpu to copy back to ordinary Arrays. 
See also f32 and f16 to change element type only.\n\nSee the CUDA.jl docs to help identify the current device.\n\nExample\n\njulia> m = Dense(rand(2, 3)) # constructed with Float64 weight matrix\nDense(3 => 2) # 8 parameters\n\njulia> typeof(m.weight)\nMatrix{Float64} (alias for Array{Float64, 2})\n\njulia> m_gpu = gpu(m) # can equivalently be written m_gpu = m |> gpu\nDense(3 => 2) # 8 parameters\n\njulia> typeof(m_gpu.weight)\nCUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}\n\n\n\n\n\n","category":"method"},{"location":"reference/models/functors/#Flux.gpu-Tuple{DataLoader}","page":"Nested Structures – Functors.jl","title":"Flux.gpu","text":"gpu(data::DataLoader)\ncpu(data::DataLoader)\n\nTransforms a given DataLoader to apply gpu or cpu to each batch of data, when iterated over. (If no GPU is available, this does nothing.)\n\nExample\n\njulia> dl = Flux.DataLoader((x = ones(2,10), y='a':'j'), batchsize=3)\n4-element DataLoader(::NamedTuple{(:x, :y), Tuple{Matrix{Float64}, StepRange{Char, Int64}}}, batchsize=3)\n with first element:\n (; x = 2×3 Matrix{Float64}, y = 3-element StepRange{Char, Int64})\n\njulia> first(dl)\n(x = [1.0 1.0 1.0; 1.0 1.0 1.0], y = 'a':1:'c')\n\njulia> c_dl = gpu(dl)\n4-element DataLoader(::MLUtils.MappedData{:auto, typeof(gpu), NamedTuple{(:x, :y), Tuple{Matrix{Float64}, StepRange{Char, Int64}}}}, batchsize=3)\n with first element:\n (; x = 2×3 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, y = 3-element StepRange{Char, Int64})\n\njulia> first(c_dl).x\n2×3 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:\n 1.0 1.0 1.0\n 1.0 1.0 1.0\n\nFor large datasets, this is preferred over moving all the data to the GPU before creating the DataLoader, like this:\n\njulia> Flux.DataLoader((x = ones(2,10), y=2:11) |> gpu, batchsize=3)\n4-element DataLoader(::NamedTuple{(:x, :y), Tuple{CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, UnitRange{Int64}}}, batchsize=3)\n with first element:\n (; x = 2×3 CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, y = 3-element 
UnitRange{Int64})\n\nwarning: Warning\nThis only works if gpu is applied directly to the DataLoader. While gpu acts recursively on Flux models and many basic Julia structs, it will not work on (say) a tuple of DataLoaders.\n\n\n\n\n\n","category":"method"},{"location":"reference/models/losses/#man-losses","page":"Loss Functions","title":"Loss Functions","text":"","category":"section"},{"location":"reference/models/losses/","page":"Loss Functions","title":"Loss Functions","text":"Flux provides a large number of common loss functions used for training machine learning models. They are grouped together in the Flux.Losses module.","category":"page"},{"location":"reference/models/losses/","page":"Loss Functions","title":"Loss Functions","text":"Loss functions for supervised learning typically expect as inputs a target y, and a prediction ŷ from your model. In Flux's convention, the order of the arguments is the following","category":"page"},{"location":"reference/models/losses/","page":"Loss Functions","title":"Loss Functions","text":"loss(ŷ, y)","category":"page"},{"location":"reference/models/losses/","page":"Loss Functions","title":"Loss Functions","text":"Most loss functions in Flux have an optional argument agg, denoting the type of aggregation performed over the batch:","category":"page"},{"location":"reference/models/losses/","page":"Loss Functions","title":"Loss Functions","text":"loss(ŷ, y) # defaults to `mean`\nloss(ŷ, y, agg=sum) # use `sum` for reduction\nloss(ŷ, y, agg=x->sum(x, dims=2)) # partial reduction\nloss(ŷ, y, agg=x->mean(w .* x)) # weighted mean\nloss(ŷ, y, agg=identity) # no aggregation.","category":"page"},{"location":"reference/models/losses/#Function-listing","page":"Loss Functions","title":"Function listing","text":"","category":"section"},{"location":"reference/models/losses/","page":"Loss Functions","title":"Loss 
Functions","text":"Flux.Losses.mae\nFlux.Losses.mse\nFlux.Losses.msle\nFlux.Losses.huber_loss\nFlux.Losses.label_smoothing\nFlux.Losses.crossentropy\nFlux.Losses.logitcrossentropy\nFlux.Losses.binarycrossentropy\nFlux.Losses.logitbinarycrossentropy\nFlux.Losses.kldivergence\nFlux.Losses.poisson_loss\nFlux.Losses.hinge_loss\nFlux.Losses.squared_hinge_loss\nFlux.Losses.dice_coeff_loss\nFlux.Losses.tversky_loss\nFlux.Losses.binary_focal_loss\nFlux.Losses.focal_loss\nFlux.Losses.siamese_contrastive_loss","category":"page"},{"location":"reference/models/losses/#Flux.Losses.mae","page":"Loss Functions","title":"Flux.Losses.mae","text":"mae(ŷ, y; agg = mean)\n\nReturn the loss corresponding to mean absolute error:\n\nagg(abs.(ŷ .- y))\n\nExample\n\njulia> y_model = [1.1, 1.9, 3.1];\n\njulia> Flux.mae(y_model, 1:3)\n0.10000000000000009\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.mse","page":"Loss Functions","title":"Flux.Losses.mse","text":"mse(ŷ, y; agg = mean)\n\nReturn the loss corresponding to mean square error:\n\nagg((ŷ .- y) .^ 2)\n\nSee also: mae, msle, crossentropy.\n\nExample\n\njulia> y_model = [1.1, 1.9, 3.1];\n\njulia> y_true = 1:3;\n\njulia> Flux.mse(y_model, y_true)\n0.010000000000000018\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.msle","page":"Loss Functions","title":"Flux.Losses.msle","text":"msle(ŷ, y; agg = mean, eps = eps(eltype(ŷ)))\n\nThe loss corresponding to mean squared logarithmic errors, calculated as\n\nagg((log.(ŷ .+ ϵ) .- log.(y .+ ϵ)) .^ 2)\n\nThe ϵ == eps term provides numerical stability. 
Penalizes an under-estimation more than an over-estimation.\n\nExample\n\njulia> Flux.msle(Float32[1.1, 2.2, 3.3], 1:3)\n0.009084041f0\n\njulia> Flux.msle(Float32[0.9, 1.8, 2.7], 1:3)\n0.011100831f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.huber_loss","page":"Loss Functions","title":"Flux.Losses.huber_loss","text":"huber_loss(ŷ, y; delta = 1, agg = mean)\n\nReturn the mean of the Huber loss given the prediction ŷ and true values y.\n\n | 0.5 * |ŷ - y|^2, for |ŷ - y| <= δ\nHuber loss = |\n | δ * (|ŷ - y| - 0.5 * δ), otherwise\n\nExample\n\njulia> ŷ = [1.1, 2.1, 3.1];\n\njulia> Flux.huber_loss(ŷ, 1:3) # default δ = 1 > |ŷ - y|\n0.005000000000000009\n\njulia> Flux.huber_loss(ŷ, 1:3, delta=0.05) # changes behaviour as |ŷ - y| > δ\n0.003750000000000005\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.label_smoothing","page":"Loss Functions","title":"Flux.Losses.label_smoothing","text":"label_smoothing(y::Union{Number, AbstractArray}, α; dims::Int=1)\n\nReturns smoothed labels, meaning the confidence in the label values is relaxed.\n\nWhen y is given as a one-hot vector or a batch of one-hot vectors, it is calculated as\n\ny .* (1 - α) .+ α / size(y, dims)\n\nWhen y is given as a number or batch of numbers for binary classification, it is calculated as\n\ny .* (1 - α) .+ α / 2\n\nin which case the labels are squeezed towards 0.5.\n\nα is a number in interval (0, 1) called the smoothing factor. 
The higher the value of α, the larger the smoothing of y.\n\ndims denotes the one-hot dimension, unless dims=0 which denotes the application of label smoothing to binary distributions encoded in a single number.\n\nExample\n\njulia> y = Flux.onehotbatch([1, 1, 1, 0, 1, 0], 0:1)\n2×6 OneHotMatrix(::Vector{UInt32}) with eltype Bool:\n ⋅ ⋅ ⋅ 1 ⋅ 1\n 1 1 1 ⋅ 1 ⋅\n\njulia> y_smoothed = Flux.label_smoothing(y, 0.2f0)\n2×6 Matrix{Float32}:\n 0.1 0.1 0.1 0.9 0.1 0.9\n 0.9 0.9 0.9 0.1 0.9 0.1\n\njulia> y_sim = softmax(y .* log(2f0))\n2×6 Matrix{Float32}:\n 0.333333 0.333333 0.333333 0.666667 0.333333 0.666667\n 0.666667 0.666667 0.666667 0.333333 0.666667 0.333333\n\njulia> y_dis = vcat(y_sim[2,:]', y_sim[1,:]')\n2×6 Matrix{Float32}:\n 0.666667 0.666667 0.666667 0.333333 0.666667 0.333333\n 0.333333 0.333333 0.333333 0.666667 0.333333 0.666667\n\njulia> Flux.crossentropy(y_sim, y) < Flux.crossentropy(y_sim, y_smoothed)\ntrue\n\njulia> Flux.crossentropy(y_dis, y) > Flux.crossentropy(y_dis, y_smoothed)\ntrue\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.crossentropy","page":"Loss Functions","title":"Flux.Losses.crossentropy","text":"crossentropy(ŷ, y; dims = 1, eps = eps(eltype(ŷ)), agg = mean)\n\nReturn the cross entropy between the given probability distributions; calculated as\n\nagg(-sum(y .* log.(ŷ .+ ϵ); dims))\n\nCross entropy is typically used as a loss in multi-class classification, in which case the labels y are given in a one-hot format. dims specifies the dimension (or the dimensions) containing the class probabilities. 
The prediction ŷ is supposed to sum to one across dims, as would be the case with the output of a softmax operation.\n\nFor numerical stability, it is recommended to use logitcrossentropy rather than softmax followed by crossentropy .\n\nUse label_smoothing to smooth the true labels as preprocessing before computing the loss.\n\nSee also: logitcrossentropy, binarycrossentropy, logitbinarycrossentropy.\n\nExample\n\njulia> y_label = Flux.onehotbatch([0, 1, 2, 1, 0], 0:2)\n3×5 OneHotMatrix(::Vector{UInt32}) with eltype Bool:\n 1 ⋅ ⋅ ⋅ 1\n ⋅ 1 ⋅ 1 ⋅\n ⋅ ⋅ 1 ⋅ ⋅\n\njulia> y_model = softmax(reshape(-7:7, 3, 5) .* 1f0)\n3×5 Matrix{Float32}:\n 0.0900306 0.0900306 0.0900306 0.0900306 0.0900306\n 0.244728 0.244728 0.244728 0.244728 0.244728\n 0.665241 0.665241 0.665241 0.665241 0.665241\n\njulia> sum(y_model; dims=1)\n1×5 Matrix{Float32}:\n 1.0 1.0 1.0 1.0 1.0\n\njulia> Flux.crossentropy(y_model, y_label)\n1.6076053f0\n\njulia> 5 * ans ≈ Flux.crossentropy(y_model, y_label; agg=sum)\ntrue\n\njulia> y_smooth = Flux.label_smoothing(y_label, 0.15f0)\n3×5 Matrix{Float32}:\n 0.9 0.05 0.05 0.05 0.9\n 0.05 0.9 0.05 0.9 0.05\n 0.05 0.05 0.9 0.05 0.05\n\njulia> Flux.crossentropy(y_model, y_smooth)\n1.5776052f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.logitcrossentropy","page":"Loss Functions","title":"Flux.Losses.logitcrossentropy","text":"logitcrossentropy(ŷ, y; dims = 1, agg = mean)\n\nReturn the cross entropy calculated by\n\nagg(-sum(y .* logsoftmax(ŷ; dims); dims))\n\nThis is mathematically equivalent to crossentropy(softmax(ŷ), y), but is more numerically stable than using functions crossentropy and softmax separately.\n\nSee also: binarycrossentropy, logitbinarycrossentropy, label_smoothing.\n\nExample\n\njulia> y_label = Flux.onehotbatch(collect(\"abcabaa\"), 'a':'c')\n3×7 OneHotMatrix(::Vector{UInt32}) with eltype Bool:\n 1 ⋅ ⋅ 1 ⋅ 1 1\n ⋅ 1 ⋅ ⋅ 1 ⋅ ⋅\n ⋅ ⋅ 1 ⋅ ⋅ ⋅ ⋅\n\njulia> y_model = reshape(vcat(-9:0, 0:9, 7.5f0), 3, 7)\n3×7 
Matrix{Float32}:\n -9.0 -6.0 -3.0 0.0 2.0 5.0 8.0\n -8.0 -5.0 -2.0 0.0 3.0 6.0 9.0\n -7.0 -4.0 -1.0 1.0 4.0 7.0 7.5\n\njulia> Flux.logitcrossentropy(y_model, y_label)\n1.5791205f0\n\njulia> Flux.crossentropy(softmax(y_model), y_label)\n1.5791197f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.binarycrossentropy","page":"Loss Functions","title":"Flux.Losses.binarycrossentropy","text":"binarycrossentropy(ŷ, y; agg = mean, eps = eps(eltype(ŷ)))\n\nReturn the binary cross-entropy loss, computed as\n\nagg(@.(-y * log(ŷ + ϵ) - (1 - y) * log(1 - ŷ + ϵ)))\n\nWhere typically, the prediction ŷ is given by the output of a sigmoid activation. The ϵ == eps term is included to avoid infinity. Using logitbinarycrossentropy is recommended over binarycrossentropy for numerical stability.\n\nUse label_smoothing to smooth the y value as preprocessing before computing the loss.\n\nSee also: crossentropy, logitcrossentropy.\n\nExamples\n\njulia> y_bin = Bool[1,0,1]\n3-element Vector{Bool}:\n 1\n 0\n 1\n\njulia> y_prob = softmax(reshape(vcat(1:3, 3:5), 2, 3) .* 1f0)\n2×3 Matrix{Float32}:\n 0.268941 0.5 0.268941\n 0.731059 0.5 0.731059\n\njulia> Flux.binarycrossentropy(y_prob[2,:], y_bin)\n0.43989f0\n\njulia> all(p -> 0 < p < 1, y_prob[2,:]) # else DomainError\ntrue\n\njulia> y_hot = Flux.onehotbatch(y_bin, 0:1)\n2×3 OneHotMatrix(::Vector{UInt32}) with eltype Bool:\n ⋅ 1 ⋅\n 1 ⋅ 1\n\njulia> Flux.crossentropy(y_prob, y_hot)\n0.43989f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.logitbinarycrossentropy","page":"Loss Functions","title":"Flux.Losses.logitbinarycrossentropy","text":"logitbinarycrossentropy(ŷ, y; agg = mean)\n\nMathematically equivalent to binarycrossentropy(σ(ŷ), y) but is more numerically stable.\n\nSee also: crossentropy, logitcrossentropy.\n\nExamples\n\njulia> y_bin = Bool[1,0,1];\n\njulia> y_model = Float32[2, -1, pi]\n3-element Vector{Float32}:\n 2.0\n -1.0\n 3.1415927\n\njulia> 
Flux.logitbinarycrossentropy(y_model, y_bin)\n0.160832f0\n\njulia> Flux.binarycrossentropy(sigmoid.(y_model), y_bin)\n0.16083185f0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.kldivergence","page":"Loss Functions","title":"Flux.Losses.kldivergence","text":"kldivergence(ŷ, y; agg = mean, eps = eps(eltype(ŷ)))\n\nReturn the Kullback-Leibler divergence between the given probability distributions.\n\nThe KL divergence is a measure of how much one probability distribution is different from the other. It is always non-negative, and zero only when both the distributions are equal.\n\nExample\n\njulia> p1 = [1 0; 0 1]\n2×2 Matrix{Int64}:\n 1 0\n 0 1\n\njulia> p2 = fill(0.5, 2, 2)\n2×2 Matrix{Float64}:\n 0.5 0.5\n 0.5 0.5\n\njulia> Flux.kldivergence(p2, p1) ≈ log(2)\ntrue\n\njulia> Flux.kldivergence(p2, p1; agg = sum) ≈ 2log(2)\ntrue\n\njulia> Flux.kldivergence(p2, p2; eps = 0) # about -2e-16 with the regulator\n0.0\n\njulia> Flux.kldivergence(p1, p2; eps = 0) # about 17.3 with the regulator\nInf\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.poisson_loss","page":"Loss Functions","title":"Flux.Losses.poisson_loss","text":"poisson_loss(ŷ, y; agg = mean)\n\nReturn how much the predicted distribution ŷ diverges from the expected Poisson distribution y; calculated as -\n\nsum(ŷ .- y .* log.(ŷ)) / size(y, 2)\n\nMore information..\n\nExample\n\njulia> y_model = [1, 3, 3]; # data should only take integral values\n\njulia> Flux.poisson_loss(y_model, 1:3)\n0.5023128522198171\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.hinge_loss","page":"Loss Functions","title":"Flux.Losses.hinge_loss","text":"hinge_loss(ŷ, y; agg = mean)\n\nReturn the hinge_loss given the prediction ŷ and true labels y (containing 1 or -1); calculated as\n\nsum(max.(0, 1 .- ŷ .* y)) / size(y, 2)\n\nUsually used with classifiers like Support Vector Machines. 
See also: squared_hinge_loss\n\nExample\n\njulia> y_true = [1, -1, 1, 1];\n\njulia> y_pred = [0.1, 0.3, 1, 1.5];\n\njulia> Flux.hinge_loss(y_pred, y_true)\n0.55\n\njulia> Flux.hinge_loss(y_pred[1], y_true[1]) != 0 # same sign but |ŷ| < 1\ntrue\n\njulia> Flux.hinge_loss(y_pred[end], y_true[end]) == 0 # same sign but |ŷ| >= 1\ntrue\n\njulia> Flux.hinge_loss(y_pred[2], y_true[2]) != 0 # opposite signs\ntrue\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.squared_hinge_loss","page":"Loss Functions","title":"Flux.Losses.squared_hinge_loss","text":"squared_hinge_loss(ŷ, y)\n\nReturn the squared hinge_loss given the prediction ŷ and true labels y (containing 1 or -1); calculated as\n\nsum((max.(0, 1 .- ŷ .* y)).^2) / size(y, 2)\n\nUsually used with classifiers like Support Vector Machines. See also: hinge_loss\n\nExample\n\njulia> y_true = [1, -1, 1, 1];\n\njulia> y_pred = [0.1, 0.3, 1, 1.5];\n\njulia> Flux.squared_hinge_loss(y_pred, y_true)\n0.625\n\njulia> Flux.squared_hinge_loss(y_pred[1], y_true[1]) != 0\ntrue\n\njulia> Flux.squared_hinge_loss(y_pred[end], y_true[end]) == 0\ntrue\n\njulia> Flux.squared_hinge_loss(y_pred[2], y_true[2]) != 0\ntrue\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.dice_coeff_loss","page":"Loss Functions","title":"Flux.Losses.dice_coeff_loss","text":"dice_coeff_loss(ŷ, y; smooth = 1)\n\nReturn a loss based on the dice coefficient. Used in the V-Net image segmentation architecture. The dice coefficient is similar to the F1_score. 
Loss calculated as:\n\n1 - 2*sum(|ŷ .* y| + smooth) / (sum(ŷ.^2) + sum(y.^2) + smooth)\n\nExample\n\njulia> y_pred = [1.1, 2.1, 3.1];\n\njulia> Flux.dice_coeff_loss(y_pred, 1:3)\n0.000992391663909964\n\njulia> 1 - Flux.dice_coeff_loss(y_pred, 1:3) # ~ F1 score for image segmentation\n0.99900760833609\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.tversky_loss","page":"Loss Functions","title":"Flux.Losses.tversky_loss","text":"tversky_loss(ŷ, y; beta = 0.7)\n\nReturn the Tversky loss. Used with imbalanced data to give more weight to false negatives. Larger values of β == beta weigh recall more than precision (by placing more emphasis on false negatives). Calculated as:\n\n1 - sum(|y .* ŷ| + 1) / (sum(y .* ŷ + (1 - β)*(1 .- y) .* ŷ + β*y .* (1 .- ŷ)) + 1)\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.binary_focal_loss","page":"Loss Functions","title":"Flux.Losses.binary_focal_loss","text":"binary_focal_loss(ŷ, y; agg=mean, gamma=2, eps=eps(eltype(ŷ)))\n\nReturn the binary_focal_loss. The input, 'ŷ', is expected to be normalized (i.e. softmax output).\n\nFor gamma = 0, the loss is mathematically equivalent to Losses.binarycrossentropy.\n\nSee also: Losses.focal_loss for multi-class setting\n\nExample\n\njulia> y = [0 1 0\n 1 0 1]\n2×3 Matrix{Int64}:\n 0 1 0\n 1 0 1\n\njulia> ŷ = [0.268941 0.5 0.268941\n 0.731059 0.5 0.731059]\n2×3 Matrix{Float64}:\n 0.268941 0.5 0.268941\n 0.731059 0.5 0.731059\n\njulia> Flux.binary_focal_loss(ŷ, y) ≈ 0.0728675615927385\ntrue\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.focal_loss","page":"Loss Functions","title":"Flux.Losses.focal_loss","text":"focal_loss(ŷ, y; dims=1, agg=mean, gamma=2, eps=eps(eltype(ŷ)))\n\nReturn the focal_loss which can be used in classification tasks with highly imbalanced classes. It down-weights well-classified examples and focuses on hard examples. The input, 'ŷ', is expected to be normalized (i.e. 
softmax output).\n\nThe modulating factor, γ == gamma, controls the down-weighting strength. For γ == 0, the loss is mathematically equivalent to Losses.crossentropy.\n\nExample\n\njulia> y = [1 0 0 0 1\n 0 1 0 1 0\n 0 0 1 0 0]\n3×5 Matrix{Int64}:\n 1 0 0 0 1\n 0 1 0 1 0\n 0 0 1 0 0\n\njulia> ŷ = softmax(reshape(-7:7, 3, 5) .* 1f0)\n3×5 Matrix{Float32}:\n 0.0900306 0.0900306 0.0900306 0.0900306 0.0900306\n 0.244728 0.244728 0.244728 0.244728 0.244728\n 0.665241 0.665241 0.665241 0.665241 0.665241\n\njulia> Flux.focal_loss(ŷ, y) ≈ 1.1277571935622628\ntrue\n\nSee also: Losses.binary_focal_loss for binary (not one-hot) labels\n\n\n\n\n\n","category":"function"},{"location":"reference/models/losses/#Flux.Losses.siamese_contrastive_loss","page":"Loss Functions","title":"Flux.Losses.siamese_contrastive_loss","text":"siamese_contrastive_loss(ŷ, y; margin = 1, agg = mean)\n\nReturn the contrastive loss which can be useful for training Siamese Networks. It is given by\n\nagg(@. (1 - y) * ŷ^2 + y * max(0, margin - ŷ)^2)\n\nSpecify margin to set the baseline for distance at which pairs are dissimilar.\n\nExample\n\njulia> ŷ = [0.5, 1.5, 2.5];\n\njulia> Flux.siamese_contrastive_loss(ŷ, 1:3)\n-4.833333333333333\n\njulia> Flux.siamese_contrastive_loss(ŷ, 1:3, margin = 2)\n-4.0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/layers/#Built-in-Layer-Types","page":"Built-in Layers","title":"Built-in Layer Types","text":"","category":"section"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"If you started at the beginning of the guide, then you have already met the basic Dense layer, and seen Chain for combining layers. 
These core layers form the foundation of almost all neural networks.","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"The Dense layer exemplifies several features:","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"It contains an activation function, which is broadcasted over the output. Because this broadcast can be fused with other operations, doing so is more efficient than applying the activation function separately.\nIt takes an init keyword, which accepts a function acting like rand. That is, init(2,3,4) should create an array of this size. Flux has many such functions built-in. All make a CPU array, moved later with gpu if desired.\nThe bias vector is always initialised with Flux.zeros32. The keyword bias=false will turn this off, i.e. keeping the bias permanently zero.\nIt is annotated with @layer, which means that Flux.setup will see the contents, and gpu will move their arrays to the GPU.","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"By contrast, Chain itself contains no parameters, but connects other layers together. 
The section on dataflow layers introduces others like this.","category":"page"},{"location":"reference/models/layers/#Fully-Connected","page":"Built-in Layers","title":"Fully Connected","text":"","category":"section"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"Dense\nFlux.Bilinear\nFlux.Scale","category":"page"},{"location":"reference/models/layers/#Flux.Dense","page":"Built-in Layers","title":"Flux.Dense","text":"Dense(in => out, σ=identity; bias=true, init=glorot_uniform)\nDense(W::AbstractMatrix, [bias, σ])\n\nCreate a traditional fully connected layer, whose forward pass is given by:\n\ny = σ.(W * x .+ bias)\n\nThe input x should be a vector of length in, or batch of vectors represented as an in × N matrix, or any array with size(x,1) == in. The out y will be a vector of length out, or a batch with size(y) == (out, size(x)[2:end]...)\n\nKeyword bias=false will switch off trainable bias for the layer. The initialisation of the weight matrix is W = init(out, in), calling the function given to keyword init, with default glorot_uniform. 
The weight matrix and/or the bias vector (of length out) may also be provided explicitly.\n\nExamples\n\njulia> model = Dense(5 => 2)\nDense(5 => 2) # 12 parameters\n\njulia> model(rand32(5, 64)) |> size\n(2, 64)\n\njulia> model(rand32(5, 6, 4, 64)) |> size # treated as three batch dimensions\n(2, 6, 4, 64)\n\njulia> model2 = Dense(ones(2, 5), false, tanh) # using provided weight matrix\nDense(5 => 2, tanh; bias=false) # 10 parameters\n\njulia> model2(ones(5))\n2-element Vector{Float64}:\n 0.9999092042625951\n 0.9999092042625951\n\njulia> Flux.trainables(model2) # no trainable bias\n1-element Vector{AbstractArray}:\n [1.0 1.0 … 1.0 1.0; 1.0 1.0 … 1.0 1.0]\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.Bilinear","page":"Built-in Layers","title":"Flux.Bilinear","text":"Bilinear((in1, in2) => out, σ=identity; bias=true, init=glorot_uniform)\nBilinear(W::AbstractArray, [bias, σ])\n\nCreates a layer which is fully connected between two inputs and the output, and otherwise similar to Dense. Its output, given vectors x & y, is another vector z with, for all i ∈ 1:out:\n\nz[i] = σ(x' * W[i,:,:] * y + bias[i])\n\nIf x and y are matrices, then each column of the output z = B(x, y) is of this form, with B the Bilinear layer.\n\nIf the second input y is not given, it is taken to be equal to x, i.e. B(x) == B(x, x)\n\nThe two inputs may also be provided as a tuple, B((x, y)) == B(x, y), which is accepted as the input to a Chain.\n\nIf the two input sizes are the same, in1 == in2, then you may write Bilinear(in => out, σ).\n\nThe initialisation works as for Dense layer, with W = init(out, in1, in2). By default the bias vector is zeros(Float32, out), option bias=false will switch off trainable bias. 
Either of these may be provided explicitly.\n\nExamples\n\njulia> x, y = randn(Float32, 5, 32), randn(Float32, 5, 32);\n\njulia> B = Flux.Bilinear((5, 5) => 7)\nBilinear(5 => 7) # 182 parameters\n\njulia> B(x) |> size # interactions based on one input\n(7, 32)\n\njulia> B(x,y) == B((x,y)) # two inputs, may be given as a tuple\ntrue\n\njulia> sc = SkipConnection(\n Chain(Dense(5 => 20, tanh), Dense(20 => 9, tanh)),\n Flux.Bilinear((9, 5) => 3, bias=false),\n ); # used as the recombinator, with skip as the second input\n\njulia> sc(x) |> size\n(3, 32)\n\njulia> Flux.Bilinear(rand(4,8,16), false, tanh) # first dim of weight is the output\nBilinear((8, 16) => 4, tanh; bias=false) # 512 parameters\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.Scale","page":"Built-in Layers","title":"Flux.Scale","text":"Scale(size::Integer..., σ=identity; bias=true, init=ones32)\nScale(scale::AbstractArray, [bias, σ])\n\nCreate an element-wise layer, whose forward pass is given by:\n\ny = σ.(scale .* x .+ bias)\n\nThis uses .* instead of matrix multiplication * of Dense.\n\nThe learnable scale & bias are initialised init(size...) and zeros32(size...), with init=ones32 by default. 
You may specify the function init, turn off trainable bias with bias=false, or provide the array(s) explicitly.\n\nUsed by LayerNorm with affine=true.\n\nExamples\n\njulia> a = Flux.Scale(2)\nScale(2) # 4 parameters\n\njulia> Flux.trainables(a)\n2-element Vector{AbstractArray}:\n Float32[1.0, 1.0]\n Float32[0.0, 0.0]\n\njulia> a([1 2 3])\n2×3 Matrix{Float32}:\n 1.0 2.0 3.0\n 1.0 2.0 3.0\n\njulia> b = Flux.Scale(Float32[1 2 3 4], false, abs2)\nScale(1, 4, abs2; bias=false) # 4 parameters\n\njulia> b([1, 10])\n2×4 Matrix{Float32}:\n 1.0 4.0 9.0 16.0\n 100.0 400.0 900.0 1600.0\n\njulia> Flux.trainables(b)\n1-element Vector{AbstractArray}:\n Float32[1.0 2.0 3.0 4.0]\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"Perhaps Scale isn't quite fully connected, but it may be thought of as Dense(Diagonal(s.weights), s.bias), and LinearAlgebra's Diagonal is a matrix which just happens to contain many zeros.","category":"page"},{"location":"reference/models/layers/#Convolution-Models","page":"Built-in Layers","title":"Convolution Models","text":"","category":"section"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"These layers are used to build convolutional neural networks (CNNs).","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"They all expect images in what is called WHCN order: a batch of 32 colour images, each 50 x 50 pixels, will have size(x) == (50, 50, 3, 32). A single grayscale image might instead have size(x) == (28, 28, 1, 1).","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"Besides images, 2D data, they also work with 1D data, where for instance stereo sound recording with 1000 samples might have size(x) == (1000, 2, 1). 
They will also work with 3D data, ndims(x) == 5, where again the last two dimensions are channel and batch.","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"To understand how strides and padding work, the article by Dumoulin & Visin has great illustrations.","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"Conv\nConv(weight::AbstractArray)\nConvTranspose\nConvTranspose(weight::AbstractArray)\nCrossCor\nCrossCor(weight::AbstractArray)\nDepthwiseConv\nSamePad\nFlux.flatten","category":"page"},{"location":"reference/models/layers/#Flux.Conv","page":"Built-in Layers","title":"Flux.Conv","text":"Conv(filter, in => out, σ = identity;\n stride = 1, pad = 0, dilation = 1, groups = 1, [bias, init])\n\nStandard convolutional layer. filter is a tuple of integers specifying the size of the convolutional kernel; in and out specify the number of input and output channels.\n\nImage data should be stored in WHCN order (width, height, channels, batch). In other words, a 100×100 RGB image would be a 100×100×3×1 array, and a batch of 50 would be a 100×100×3×50 array. This has N = 2 spatial dimensions, and needs a kernel size like (5,5), a 2-tuple of integers.\n\nTo take convolutions along N feature dimensions, this layer expects as input an array with ndims(x) == N+2, where size(x, N+1) == in is the number of input channels, and size(x, ndims(x)) is (as always) the number of observations in a batch. Then:\n\nfilter should be a tuple of N integers.\nKeywords stride and dilation should each be either a single integer, or a tuple with N integers.\nKeyword pad specifies the number of elements added to the borders of the data array. 
It can be\na single integer for equal padding all around,\na tuple of N integers, to apply the same padding at begin/end of each spatial dimension,\na tuple of 2*N integers, for asymmetric padding, or\nthe singleton SamePad(), to calculate padding such that size(output,d) == size(x,d) / stride (possibly rounded) for each spatial dimension.\nKeyword groups is expected to be an Int. It specifies the number of groups to divide a convolution into.\n\nKeywords to control initialization of the layer:\n\ninit - Function used to generate initial weights. Defaults to glorot_uniform.\nbias - The initial bias vector is all zero by default. Trainable bias can be disabled entirely by setting this to false, or another vector can be provided such as bias = randn(Float32, out).\n\nSee also ConvTranspose, DepthwiseConv, CrossCor.\n\nExamples\n\njulia> xs = rand32(100, 100, 3, 50); # a batch of 50 RGB images\n\njulia> layer = Conv((5,5), 3 => 7, relu; bias = false)\nConv((5, 5), 3 => 7, relu, bias=false) # 525 parameters\n\njulia> layer(xs) |> size\n(96, 96, 7, 50)\n\njulia> Conv((5,5), 3 => 7; stride = 2)(xs) |> size\n(48, 48, 7, 50)\n\njulia> Conv((5,5), 3 => 7; stride = 2, pad = SamePad())(xs) |> size\n(50, 50, 7, 50)\n\njulia> Conv((1,1), 3 => 7; pad = (20,10,0,0))(xs) |> size\n(130, 100, 7, 50)\n\njulia> Conv((5,5), 3 => 7; stride = 2, dilation = 4)(xs) |> size\n(42, 42, 7, 50)\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.Conv-Tuple{AbstractArray}","page":"Built-in Layers","title":"Flux.Conv","text":"Conv(weight::AbstractArray, [bias, activation; stride, pad, dilation])\n\nConstructs a convolutional layer with the given weight and bias. 
Accepts the same keywords and has the same defaults as Conv(k::NTuple{N,Integer}, ch::Pair{<:Integer,<:Integer}, σ; ...).\n\njulia> weight = rand(3, 4, 5);\n\njulia> bias = zeros(5);\n\njulia> layer = Conv(weight, bias, sigmoid) # expects 1 spatial dimension\nConv((3,), 4 => 5, σ) # 65 parameters\n\njulia> layer(randn(100, 4, 64)) |> size\n(98, 5, 64)\n\njulia> Flux.params(layer) |> length\n2\n\n\n\n\n\n","category":"method"},{"location":"reference/models/layers/#Flux.ConvTranspose","page":"Built-in Layers","title":"Flux.ConvTranspose","text":"ConvTranspose(filter, in => out, σ=identity; stride=1, pad=0, outpad=0, dilation=1, [bias, init])\n\nStandard convolutional transpose layer. filter is a tuple of integers specifying the size of the convolutional kernel, while in and out specify the number of input and output channels.\n\nNote that pad=SamePad() here tries to ensure size(output,d) == size(x,d) * stride.\n\nTo conserve Conv invertibility when stride > 1, outpad can be used to increase the size of the output in the desired dimensions. 
Whereas pad is used to zero-pad the input, outpad only affects the output shape.\n\nParameters are controlled by additional keywords, with defaults init=glorot_uniform and bias=true.\n\nSee also Conv for more detailed description of keywords.\n\nExamples\n\njulia> xs = rand32(100, 100, 3, 50); # a batch of 50 RGB images\n\njulia> layer = ConvTranspose((5,5), 3 => 7, relu)\nConvTranspose((5, 5), 3 => 7, relu) # 532 parameters\n\njulia> layer(xs) |> size\n(104, 104, 7, 50)\n\njulia> ConvTranspose((5,5), 3 => 7, stride=2)(xs) |> size\n(203, 203, 7, 50)\n\njulia> ConvTranspose((5,5), 3 => 7, stride=2, outpad=1)(xs) |> size\n(204, 204, 7, 50)\n\njulia> ConvTranspose((5,5), 3 => 7, stride=3, pad=SamePad())(xs) |> size\n(300, 300, 7, 50)\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.ConvTranspose-Tuple{AbstractArray}","page":"Built-in Layers","title":"Flux.ConvTranspose","text":"ConvTranspose(weight::AbstractArray, [bias, activation; stride, pad, outpad, dilation, groups])\n\nConstructs a ConvTranspose layer with the given weight and bias. Accepts the same keywords and has the same defaults as ConvTranspose(k::NTuple{N,Integer}, ch::Pair{<:Integer,<:Integer}, σ; ...).\n\nExamples\n\njulia> weight = rand(3, 4, 5);\n\njulia> bias = zeros(4);\n\njulia> layer = ConvTranspose(weight, bias, sigmoid)\nConvTranspose((3,), 5 => 4, σ) # 64 parameters\n\njulia> layer(randn(100, 5, 64)) |> size # transposed convolution will increase the dimension size (upsampling)\n(102, 4, 64)\n\njulia> Flux.params(layer) |> length\n2\n\n\n\n\n\n","category":"method"},{"location":"reference/models/layers/#Flux.CrossCor","page":"Built-in Layers","title":"Flux.CrossCor","text":"CrossCor(filter, in => out, σ=identity; stride=1, pad=0, dilation=1, [bias, init])\n\nStandard cross correlation layer. 
filter is a tuple of integers specifying the size of the convolutional kernel; in and out specify the number of input and output channels.\n\nParameters are controlled by additional keywords, with defaults init=glorot_uniform and bias=true.\n\nSee also Conv for more detailed description of keywords.\n\nExamples\n\njulia> xs = rand(Float32, 100, 100, 3, 50); # a batch of 50 RGB images\n\njulia> layer = CrossCor((5,5), 3 => 6, relu; bias=false)\nCrossCor((5, 5), 3 => 6, relu, bias=false) # 450 parameters\n\njulia> layer(xs) |> size\n(96, 96, 6, 50)\n\njulia> CrossCor((5,5), 3 => 7, stride=3, pad=(2,0))(xs) |> size\n(34, 32, 7, 50)\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.CrossCor-Tuple{AbstractArray}","page":"Built-in Layers","title":"Flux.CrossCor","text":"CrossCor(weight::AbstractArray, [bias, activation; stride, pad, dilation])\n\nConstructs a CrossCor layer with the given weight and bias. Accepts the same keywords and has the same defaults as CrossCor(k::NTuple{N,Integer}, ch::Pair{<:Integer,<:Integer}, σ; ...).\n\nExamples\n\njulia> weight = rand(3, 4, 5);\n\njulia> bias = zeros(5);\n\njulia> layer = CrossCor(weight, bias, relu)\nCrossCor((3,), 4 => 5, relu) # 65 parameters\n\njulia> layer(randn(100, 4, 64)) |> size\n(98, 5, 64)\n\n\n\n\n\n","category":"method"},{"location":"reference/models/layers/#Flux.DepthwiseConv","page":"Built-in Layers","title":"Flux.DepthwiseConv","text":"DepthwiseConv(filter, in => out, σ=identity; stride=1, pad=0, dilation=1, [bias, init])\nDepthwiseConv(weight::AbstractArray, [bias, activation; stride, pad, dilation])\n\nReturn a depthwise convolutional layer, that is a Conv layer with number of groups equal to the number of input channels.\n\nSee Conv for a description of the arguments.\n\nExamples\n\njulia> xs = rand(Float32, 100, 100, 3, 50); # a batch of 50 RGB images\n\njulia> layer = DepthwiseConv((5,5), 3 => 6, relu; bias=false)\nConv((5, 5), 3 => 6, relu, groups=3, bias=false) # 150 parameters 
\n\njulia> layer(xs) |> size\n(96, 96, 6, 50)\n\njulia> DepthwiseConv((5, 5), 3 => 9, stride=2, pad=2)(xs) |> size\n(50, 50, 9, 50)\n\n\n\n\n\n","category":"function"},{"location":"reference/models/layers/#Flux.SamePad","page":"Built-in Layers","title":"Flux.SamePad","text":"SamePad()\n\nPassed as an option to convolutional layers (and friends), this causes the padding to be chosen such that the input and output sizes agree (on the first N dimensions, the kernel or window) when stride==1. When stride≠1, the output size equals ceil(input_size/stride).\n\nSee also Conv, MaxPool.\n\nExamples\n\njulia> xs = rand32(100, 100, 3, 50); # a batch of images\n\njulia> layer = Conv((2,2), 3 => 7, pad=SamePad())\nConv((2, 2), 3 => 7, pad=(1, 0, 1, 0)) # 91 parameters\n\njulia> layer(xs) |> size # notice how the dimensions stay the same with this padding\n(100, 100, 7, 50)\n\njulia> layer2 = Conv((2,2), 3 => 7)\nConv((2, 2), 3 => 7) # 91 parameters\n\njulia> layer2(xs) |> size # the output dimension changes as the padding was not \"same\"\n(99, 99, 7, 50)\n\njulia> layer3 = Conv((5, 5), 3 => 7, stride=2, pad=SamePad())\nConv((5, 5), 3 => 7, pad=2, stride=2) # 532 parameters\n\njulia> layer3(xs) |> size # output size = `ceil(input_size/stride)` = 50\n(50, 50, 7, 50)\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.flatten","page":"Built-in Layers","title":"Flux.flatten","text":"flatten(x)\n\nSame as MLUtils.flatten, which should be preferred; this method exists only for backward compatibility.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/layers/#MultiHeadAttention","page":"Built-in Layers","title":"MultiHeadAttention","text":"","category":"section"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"The basic blocks needed to implement Transformer architectures. 
See also the functional counterparts documented in NNlib's Attention section.","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"MultiHeadAttention","category":"page"},{"location":"reference/models/layers/#Flux.MultiHeadAttention","page":"Built-in Layers","title":"Flux.MultiHeadAttention","text":"MultiHeadAttention(dims; [nheads, bias, init, dropout_prob])\n\nThe multi-head dot-product attention layer used in Transformer architectures [1].\n\nReturns the transformed input sequence and the attention scores.\n\n[1] Vaswani et al. \"Attention is all you need.\" Advances in Neural Information Processing Systems. 2017.\n\nArguments\n\ndims: The embedding dimensions of inputs, intermediate tensors and outputs. In the most general case, it is given as a) (q_in_dim, k_in_dim, v_in_dim) => (qk_dim, v_dim) => out_dim. Can take also simpler forms as b) dims::Int; c) in_dim::Int => (qk_dim, v_dim) => out_dim; d) in_dim::Int => qkv_dim => out_dim.\nnheads: number of heads. Default 8.\ninit: weight initializer for the Dense layers. Default glorot_uniform.\nbias : whether pointwise QKVO dense transforms use bias. Default false.\ndropout_prob: dropout probability for the attention scores. Default 0.0.\n\nForward\n\n(mha::MultiHeadAttention)(q_in, k_in, v_in, [bias]; [mask])\n\nThe arguments of the forward pass are:\n\nq_in: Input query array of size (q_in_dim, q_len, batch_size).\nk_in: Input key array of size (k_in_dim, kv_len, batch_size).\nv_in: Input value array of size (v_in_dim, kv_len, batch_size).\nbias: Bias array broadcastable to size (kv_len, q_len, nheads, batch_size). It will be added to the attention scores before the softmax. Default nothing.\nmask: Input array broadcastable to size (kv_len, q_len, nheads, batch_size). The mask is applied to the attention scores just before the softmax. See NNlib.make_causal_mask for creating causal masks. 
Default nothing.\n\nAlternative calling signatures are mha(q_in), equivalent to mha(q_in, q_in, q_in) (self-attention), and mha(q_in, k_in), equivalent to mha(q_in, k_in, k_in) (key and value are the same).\n\nSee also NNlib.dot_product_attention.\n\nExamples\n\nmha = MultiHeadAttention(64, nheads = 8)\nq = rand(Float32, (64, 10, 32))\nk = rand(Float32, (64, 20, 32))\nv = rand(Float32, (64, 20, 32))\ny, α = mha(q, k, v) \n# [y] = [64, 10, 32]\n# [α] = [20, 10, 8, 32]\n\nmha = MultiHeadAttention(64 => 1024 => 1024, nheads = 8)\ny, α = mha(q) # self-attention\n# [y] = [1024, 10, 32]\n# [α] = [10, 10, 8, 32]\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Pooling","page":"Built-in Layers","title":"Pooling","text":"","category":"section"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"These layers are commonly used after a convolution layer, and reduce the size of its output. They have no trainable parameters.","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"AdaptiveMaxPool\nMaxPool\nGlobalMaxPool\nAdaptiveMeanPool\nMeanPool\nGlobalMeanPool","category":"page"},{"location":"reference/models/layers/#Flux.AdaptiveMaxPool","page":"Built-in Layers","title":"Flux.AdaptiveMaxPool","text":"AdaptiveMaxPool(out::NTuple)\n\nAdaptive max pooling layer. Calculates the necessary window size such that its output has size(y)[1:N] == out.\n\nExpects as input an array with ndims(x) == N+2, i.e. 
channel and batch dimensions, after the N feature dimensions, where N = length(out).\n\nSee also MaxPool, AdaptiveMeanPool.\n\nExamples\n\njulia> xs = rand(Float32, 100, 100, 3, 50); # batch of 50 RGB images\n\njulia> AdaptiveMaxPool((25, 25))(xs) |> size\n(25, 25, 3, 50)\n\njulia> MaxPool((4,4))(xs) ≈ AdaptiveMaxPool((25, 25))(xs)\ntrue\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.MaxPool","page":"Built-in Layers","title":"Flux.MaxPool","text":"MaxPool(window::NTuple; pad=0, stride=window)\n\nMax pooling layer, which replaces all pixels in a block of size window with one.\n\nExpects as input an array with ndims(x) == N+2, i.e. channel and batch dimensions, after the N feature dimensions, where N = length(window).\n\nBy default the window size is also the stride in each dimension. The keyword pad accepts the same options as for the Conv layer, including SamePad().\n\nSee also Conv, MeanPool, AdaptiveMaxPool, GlobalMaxPool.\n\nExamples\n\njulia> xs = rand(Float32, 100, 100, 3, 50); # batch of 50 RGB images\n\njulia> m = Chain(Conv((5, 5), 3 => 7, pad=SamePad()), MaxPool((5, 5), pad=SamePad()))\nChain(\n Conv((5, 5), 3 => 7, pad=2), # 532 parameters\n MaxPool((5, 5), pad=2),\n)\n\njulia> m[1](xs) |> size\n(100, 100, 7, 50)\n\njulia> m(xs) |> size\n(20, 20, 7, 50)\n\njulia> layer = MaxPool((5,), pad=2, stride=(3,)) # one-dimensional window\nMaxPool((5,), pad=2, stride=3)\n\njulia> layer(rand(Float32, 100, 7, 50)) |> size\n(34, 7, 50)\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.GlobalMaxPool","page":"Built-in Layers","title":"Flux.GlobalMaxPool","text":"GlobalMaxPool()\n\nGlobal max pooling layer.\n\nTransforms (w,h,c,b)-shaped input into (1,1,c,b)-shaped output, by performing max pooling on the complete (w,h)-shaped feature maps.\n\nSee also MaxPool, GlobalMeanPool.\n\njulia> xs = rand(Float32, 100, 100, 3, 50);\n\njulia> m = Chain(Conv((3,3), 3 => 7), GlobalMaxPool());\n\njulia> m(xs) |> size\n(1, 1, 7, 
50)\n\njulia> GlobalMaxPool()(rand(3,5,7)) |> size # preserves 2 dimensions\n(1, 5, 7)\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.AdaptiveMeanPool","page":"Built-in Layers","title":"Flux.AdaptiveMeanPool","text":"AdaptiveMeanPool(out::NTuple)\n\nAdaptive mean pooling layer. Calculates the necessary window size such that its output has size(y)[1:N] == out.\n\nExpects as input an array with ndims(x) == N+2, i.e. channel and batch dimensions, after the N feature dimensions, where N = length(out).\n\nSee also MaxPool, AdaptiveMaxPool.\n\nExamples\n\njulia> xs = rand(Float32, 100, 100, 3, 50); # batch of 50 RGB images\n\njulia> AdaptiveMeanPool((25, 25))(xs) |> size\n(25, 25, 3, 50)\n\njulia> MeanPool((4,4))(xs) ≈ AdaptiveMeanPool((25, 25))(xs)\ntrue\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.MeanPool","page":"Built-in Layers","title":"Flux.MeanPool","text":"MeanPool(window::NTuple; pad=0, stride=window)\n\nMean pooling layer, averaging all pixels in a block of size window.\n\nExpects as input an array with ndims(x) == N+2, i.e. channel and batch dimensions, after the N feature dimensions, where N = length(window).\n\nBy default the window size is also the stride in each dimension. 
The keyword pad accepts the same options as for the Conv layer, including SamePad().\n\nSee also Conv, MaxPool, AdaptiveMeanPool.\n\nExamples\n\njulia> xs = rand(Float32, 100, 100, 3, 50);\n\njulia> m = Chain(Conv((5,5), 3 => 7), MeanPool((5,5), pad=SamePad()))\nChain(\n Conv((5, 5), 3 => 7), # 532 parameters\n MeanPool((5, 5), pad=2),\n)\n\njulia> m[1](xs) |> size\n(96, 96, 7, 50)\n\njulia> m(xs) |> size\n(20, 20, 7, 50)\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.GlobalMeanPool","page":"Built-in Layers","title":"Flux.GlobalMeanPool","text":"GlobalMeanPool()\n\nGlobal mean pooling layer.\n\nTransforms (w,h,c,b)-shaped input into (1,1,c,b)-shaped output, by performing mean pooling on the complete (w,h)-shaped feature maps.\n\njulia> xs = rand(Float32, 100, 100, 3, 50);\n\njulia> m = Chain(Conv((3,3), 3 => 7), GlobalMeanPool());\n\njulia> m(xs) |> size\n(1, 1, 7, 50)\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Upsampling","page":"Built-in Layers","title":"Upsampling","text":"","category":"section"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"The opposite of pooling, these layers increase the size of an array. They have no trainable parameters. ","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"Upsample\nPixelShuffle","category":"page"},{"location":"reference/models/layers/#Flux.Upsample","page":"Built-in Layers","title":"Flux.Upsample","text":"Upsample(mode = :nearest; [scale, size]) \nUpsample(scale, mode = :nearest)\n\nAn upsampling layer. One of two keywords must be given:\n\nIf scale is a number, this applies to all but the last two dimensions (channel and batch) of the input. It may also be a tuple, to control dimensions individually. 
Alternatively, keyword size accepts a tuple, to directly specify the leading dimensions of the output.\n\nCurrently supported upsampling modes and corresponding NNlib's methods are:\n\n:nearest -> NNlib.upsample_nearest \n:bilinear -> NNlib.upsample_bilinear\n:trilinear -> NNlib.upsample_trilinear\n\nExamples\n\njulia> m = Upsample(scale = (2, 3))\nUpsample(:nearest, scale = (2, 3))\n\njulia> m(ones(2, 2, 1, 1)) |> size\n(4, 6, 1, 1)\n\njulia> m = Upsample(:bilinear, size = (4, 5))\nUpsample(:bilinear, size = (4, 5))\n\njulia> m(ones(2, 2, 1, 1)) |> size\n(4, 5, 1, 1)\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.PixelShuffle","page":"Built-in Layers","title":"Flux.PixelShuffle","text":"PixelShuffle(r::Int)\n\nPixel shuffling layer with upscale factor r. Usually used for generating higher resolution images while upscaling them.\n\nSee NNlib.pixel_shuffle.\n\nExamples\n\njulia> p = PixelShuffle(2);\n\njulia> xs = [2row + col + channel/10 for row in 1:2, col in 1:2, channel in 1:4, n in 1:1]\n2×2×4×1 Array{Float64, 4}:\n[:, :, 1, 1] =\n 3.1 4.1\n 5.1 6.1\n\n[:, :, 2, 1] =\n 3.2 4.2\n 5.2 6.2\n\n[:, :, 3, 1] =\n 3.3 4.3\n 5.3 6.3\n\n[:, :, 4, 1] =\n 3.4 4.4\n 5.4 6.4\n\njulia> p(xs)\n4×4×1×1 Array{Float64, 4}:\n[:, :, 1, 1] =\n 3.1 3.3 4.1 4.3\n 3.2 3.4 4.2 4.4\n 5.1 5.3 6.1 6.3\n 5.2 5.4 6.2 6.4\n\njulia> xs = [3row + col + channel/10 for row in 1:2, col in 1:3, channel in 1:4, n in 1:1]\n2×3×4×1 Array{Float64, 4}:\n[:, :, 1, 1] =\n 4.1 5.1 6.1\n 7.1 8.1 9.1\n\n[:, :, 2, 1] =\n 4.2 5.2 6.2\n 7.2 8.2 9.2\n\n[:, :, 3, 1] =\n 4.3 5.3 6.3\n 7.3 8.3 9.3\n\n[:, :, 4, 1] =\n 4.4 5.4 6.4\n 7.4 8.4 9.4\n\njulia> p(xs)\n4×6×1×1 Array{Float64, 4}:\n[:, :, 1, 1] =\n 4.1 4.3 5.1 5.3 6.1 6.3\n 4.2 4.4 5.2 5.4 6.2 6.4\n 7.1 7.3 8.1 8.3 9.1 9.3\n 7.2 7.4 8.2 8.4 9.2 9.4\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Embedding-Vectors","page":"Built-in Layers","title":"Embedding 
Vectors","text":"","category":"section"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"These layers accept an index, and return a vector (or several indices, and several vectors). The possible embedding vectors are learned parameters.","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"Flux.Embedding\nFlux.EmbeddingBag","category":"page"},{"location":"reference/models/layers/#Flux.Embedding","page":"Built-in Layers","title":"Flux.Embedding","text":"Embedding(in => out; init=randn32)\n\nA lookup table that stores embeddings of dimension out for a vocabulary of size in, as a trainable matrix.\n\nThis layer is often used to store word embeddings and retrieve them using indices. The input to the layer can be a vocabulary index in 1:in, an array of indices, or the corresponding onehot encoding.\n\nFor indices x, the result is of size (out, size(x)...), allowing several batch dimensions. For one-hot ohx, the result is of size (out, size(ohx)[2:end]...).\n\nExamples\n\njulia> emb = Embedding(26 => 4, init=Flux.identity_init(gain=22))\nEmbedding(26 => 4) # 104 parameters\n\njulia> emb(2) # one column of e.weight (here not random!)\n4-element Vector{Float32}:\n 0.0\n 22.0\n 0.0\n 0.0\n\njulia> emb([3, 1, 20, 14, 4, 15, 7]) # vocabulary indices, in 1:26\n4×7 Matrix{Float32}:\n 0.0 22.0 0.0 0.0 0.0 0.0 0.0\n 0.0 0.0 0.0 0.0 0.0 0.0 0.0\n 22.0 0.0 0.0 0.0 0.0 0.0 0.0\n 0.0 0.0 0.0 0.0 22.0 0.0 0.0\n\njulia> ans == emb(Flux.onehotbatch(\"cat&dog\", 'a':'z', 'n'))\ntrue\n\njulia> emb(rand(1:26, (10, 1, 12))) |> size # three batch dimensions\n(4, 10, 1, 12)\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.EmbeddingBag","page":"Built-in Layers","title":"Flux.EmbeddingBag","text":"EmbeddingBag(in => out, reduction=mean; init=Flux.randn32)\n\nA lookup table that stores embeddings of dimension out for a vocabulary of size in. 
Differs from Embedding in that, instead of acting on a single vocabulary index, it always acts on a vector of indices which it calls a \"bag\". Their individual embedding vectors are reduced to one, using mean or some other function.\n\nInstead of acting on one \"bag\", such as x::Vector{Int}, the layer can also act on several:\n\nActing on a vector of \"bags\", it produces a matrix whose columns are the reduced vectors. More generally on x::Array{Vector{Int}}, its output is of size (out, size(x)...).\nAny higher-rank array of integers is interpreted as a collection of \"bags\" each along the first dimension. Thus the output is mapslices(e, x; dims=1) when e::EmbeddingBag and x::Array{Int,N}. This method is more efficient, but requires that all \"bags\" have the same length.\nA vector of \"bags\" may also be produced by splitting a vector of indices at specified points. For this case the layer takes two inputs, both vectors of integers. See details below.\n\nThe \"bag\" may equivalently be represented as a OneHotMatrix. A collection of these, or one higher-rank OneHotArray, again produce a stack of embeddings. 
See details below.\n\nExamples\n\njulia> vocab_size = 26; # embed into 3 dimensions, with non-random vectors:\n\njulia> eb = EmbeddingBag(vocab_size => 3, init=Flux.identity_init(gain=100))\nEmbeddingBag(26 => 3) # 78 parameters\n\njulia> eb([2]) # one bag of 1 item\n3-element Vector{Float32}:\n 0.0\n 100.0\n 0.0\n\njulia> eb([3,3,1]) # one bag of 3 items, one mean embedding\n3-element Vector{Float32}:\n 33.333332\n 0.0\n 66.666664\n\njulia> eb([[3,1,3], [2,1]]) # two bags\n3×2 Matrix{Float32}:\n 33.3333 50.0\n 0.0 50.0\n 66.6667 0.0\n\njulia> eb([1 1 1 1; 1 2 3 4]) # 4 bags each of 2 items, eachcol([1 1 1 1; 1 2 3 4])\n3×4 Matrix{Float32}:\n 100.0 50.0 50.0 50.0\n 0.0 50.0 0.0 0.0\n 0.0 0.0 50.0 0.0\n\njulia> eb(rand(1:26, 10, 5, 5)) |> size # 25 bags each of 10 items\n(3, 5, 5)\n\nAnother way to specify \"many bags of many items\" is to provide a vector data (each in 1:in) and a vector at stating where to split that up into \"bags\". The first bag starts with data[at[1]], the second at data[at[2]], and so on, with no overlaps and nothing left out (thus it requires at[1]==1).\n\njulia> data = [11, 1, 12, 2, 13, 3, 14];\n\njulia> data[1:3], data[4:end]\n([11, 1, 12], [2, 13, 3, 14])\n\njulia> eb(data, [1, 4]) # two bags, of 3 and 4 items\n3×2 Matrix{Float32}:\n 33.3333 0.0\n 0.0 25.0\n 0.0 25.0\n\nFinally, each bag may also be represented as a OneHotMatrix.\n\njulia> eb(Flux.onehotbatch(\"bba\", 'a':'z')) # same as [2,2,1], one bag of 3 items\n3-element Vector{Float32}:\n 33.333332\n 66.666664\n 0.0\n\njulia> eb([Flux.onehotbatch(\"bba\", 'a':'z'), Flux.onehotbatch(\"cc\", 'a':'z')]) # two bags\n3×2 Matrix{Float32}:\n 33.3333 0.0\n 66.6667 0.0\n 0.0 100.0\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#man-dataflow-layers","page":"Built-in Layers","title":"Dataflow Layers, or Containers","text":"","category":"section"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"The basic Chain(F, 
G, H) applies the layers it contains in sequence, equivalent to H ∘ G ∘ F. Flux has some other layers which contain layers, but connect them up in a more complicated way: SkipConnection allows ResNet's residual connection.","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"Chain\nFlux.activations\nMaxout\nSkipConnection\nParallel\nPairwiseFusion","category":"page"},{"location":"reference/models/layers/#Flux.Chain","page":"Built-in Layers","title":"Flux.Chain","text":"Chain(layers...)\nChain(name = layer, ...)\n\nCollects multiple layers / functions to be called in sequence on a given input. Supports indexing and slicing, m[2] or m[1:end-1], and if names are given, m[:name] == m[1] etc.\n\nExamples\n\njulia> m = Chain(x -> x^2, x -> x+1);\n\njulia> m(5) == 26\ntrue\n\njulia> m = Chain(Dense(10 => 5, tanh), Dense(5 => 2));\n\njulia> x = rand32(10, 32);\n\njulia> m(x) == m[2](m[1](x))\ntrue\n\njulia> m2 = Chain(enc = Chain(Flux.flatten, Dense(10 => 5, tanh)), \n dec = Dense(5 => 2));\n\njulia> m2(x) == (m2[:dec] ∘ m2[:enc])(x)\ntrue\n\nFor large models, there is a special type-unstable path which can reduce compilation times. This can be used by supplying a vector of layers Chain([layer1, layer2, ...]). This feature is somewhat experimental, beware!\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.activations","page":"Built-in Layers","title":"Flux.activations","text":"activations(c::Chain, input)\n\nLike calling a Chain, but saves the result of each layer as an output.\n\nExamples\n\njulia> using Flux: activations\n\njulia> c = Chain(x -> x + 1, x -> x * 2, x -> x ^ 3);\n\njulia> activations(c, 1)\n(2, 4, 64)\n\n\n\n\n\n","category":"function"},{"location":"reference/models/layers/#Flux.Maxout","page":"Built-in Layers","title":"Flux.Maxout","text":"Maxout(layers...)\nMaxout(f, n_alts)\n\nThis contains a number of internal layers, each of which receives the same input. 
Its output is the elementwise maximum of the internal layers' outputs.\n\nInstead of defining layers individually, you can provide a zero-argument function which constructs them, and the number to construct.\n\nMaxout over linear dense layers satisfies the universal approximation theorem. See Goodfellow, Warde-Farley, Mirza, Courville & Bengio \"Maxout Networks\" https://arxiv.org/abs/1302.4389.\n\nSee also Parallel to reduce with other operators.\n\nExamples\n\njulia> m = Maxout(x -> abs2.(x), x -> x .* 3);\n\njulia> m([-2 -1 0 1 2])\n1×5 Matrix{Int64}:\n 4 1 0 3 6\n\njulia> m3 = Maxout(() -> Dense(5 => 7, tanh), 3)\nMaxout(\n Dense(5 => 7, tanh), # 42 parameters\n Dense(5 => 7, tanh), # 42 parameters\n Dense(5 => 7, tanh), # 42 parameters\n) # Total: 6 arrays, 126 parameters, 888 bytes.\n\njulia> Flux.outputsize(m3, (5, 11))\n(7, 11)\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.SkipConnection","page":"Built-in Layers","title":"Flux.SkipConnection","text":"SkipConnection(layer, connection)\n\nCreate a skip connection which consists of a layer or Chain of consecutive layers and a shortcut connection linking the block's input to the output through a user-supplied 2-argument callable. The first argument to the callable will be propagated through the given layer while the second is the unchanged, \"skipped\" input.\n\nThe simplest \"ResNet\"-type connection is just SkipConnection(layer, +). 
Here is a more complicated example:\n\njulia> m = Conv((3,3), 4 => 7, pad=(1,1));\n\njulia> x = ones(Float32, 5, 5, 4, 10);\n\njulia> size(m(x)) == (5, 5, 7, 10)\ntrue\n\njulia> sm = SkipConnection(m, (mx, x) -> cat(mx, x, dims=3));\n\njulia> size(sm(x)) == (5, 5, 11, 10)\ntrue\n\nSee also Parallel, Maxout.\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.Parallel","page":"Built-in Layers","title":"Flux.Parallel","text":"Parallel(connection, layers...)\nParallel(connection; name = layer, ...)\n\nCreate a layer which passes an input array to each path in layers, before reducing the output with connection.\n\nCalled with one input x, this is equivalent to connection([l(x) for l in layers]...). If called with multiple inputs, one is passed to each layer, thus Parallel(+, f, g)(x, y) = f(x) + g(y).\n\nLike Chain, its sub-layers may be given names using the keyword constructor. These can be accessed by indexing: m[1] == m[:name] is the first layer.\n\nSee also SkipConnection which is Parallel with one identity, and Maxout which reduces by broadcasting max.\n\nExamples\n\njulia> model = Chain(Dense(3 => 5),\n Parallel(vcat, Dense(5 => 4), Chain(Dense(5 => 7), Dense(7 => 4))),\n Dense(8 => 17));\n\njulia> model(rand32(3)) |> size\n(17,)\n\njulia> model2 = Parallel(+; α = Dense(10 => 2, tanh), β = Dense(5 => 2))\nParallel(\n +,\n α = Dense(10 => 2, tanh), # 22 parameters\n β = Dense(5 => 2), # 12 parameters\n) # Total: 4 arrays, 34 parameters, 392 bytes.\n\njulia> model2(rand32(10), rand32(5)) |> size\n(2,)\n\njulia> model2[:α](rand32(10)) |> size\n(2,)\n\njulia> model2[:β] == model2[2]\ntrue\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.PairwiseFusion","page":"Built-in Layers","title":"Flux.PairwiseFusion","text":"PairwiseFusion(connection, layers...)\n\nArguments\n\nconnection: A function taking 2 inputs and combining them into a single output \nlayers: The layers whose outputs are combined\n\nInputs\n\nThis layer 
behaves differently based on input type:\n\nIf input x is a tuple of length N (or the input is xs with N x's), matching the number of layers, \n\nthen each layer receives a new input x[i] combined with the previous output y[i-1] using connection. Thus (y1, y2, y3) = PairwiseFusion(connection, layer1, layer2, layer3)((x1, x2, x3)) may be drawn as:\n\nx1 → layer1 → y1 ↘\n connection → layer2 → y2 ↘\n x2 ↗ connection → layer3 → y3\n x3 ↗\n\n... or written as:\n\ny1 = layer1(x1)\ny2 = layer2(connection(y1, x2))\ny3 = layer3(connection(y2, x3))\n\nWith just one input, each layer receives the same x combined with the previous output. Thus y = PairwiseFusion(connection, layers...)(x) obeys:\n\ny[1] == layers[1](x)\nfor i in 2:length(layers)\n y[i] == layers[i](connection(y[i-1], x))\nend\n\nReturns\n\nA tuple of length N with the output of each fusion ((y1, y2, ..., yN) in the example above).\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Recurrent-Models","page":"Built-in Layers","title":"Recurrent Models","text":"","category":"section"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"Much like the core layers above, but can be used to process sequence data (as well as other kinds of structured data).","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"RNN\nLSTM\nGRU\nGRUv3\nFlux.Recur\nFlux.reset!","category":"page"},{"location":"reference/models/layers/#Flux.RNN","page":"Built-in Layers","title":"Flux.RNN","text":"RNN(in => out, σ = tanh)\n\nThe most basic recurrent layer; essentially acts as a Dense layer, but with the output fed back into the input each time step.\n\nThe arguments in and out describe the size of the feature vectors passed as input and as output. 
That is, it accepts a vector of length in or a batch of vectors represented as a in x B matrix and outputs a vector of length out or a batch of vectors of size out x B.\n\nThis constructor is syntactic sugar for Recur(RNNCell(a...)), and so RNNs are stateful. Note that the state shape can change depending on the inputs, and so it is good to reset! the model between inference calls if the batch size changes. See the examples below.\n\nExamples\n\njulia> r = RNN(3 => 5)\nRecur(\n RNNCell(3 => 5, tanh), # 50 parameters\n) # Total: 4 trainable arrays, 50 parameters,\n # plus 1 non-trainable, 5 parameters, summarysize 432 bytes.\n\njulia> r(rand(Float32, 3)) |> size\n(5,)\n\njulia> Flux.reset!(r);\n\njulia> r(rand(Float32, 3, 10)) |> size # batch size of 10\n(5, 10)\n\nwarning: Batch size changes\nFailing to call reset! when the input batch size changes can lead to unexpected behavior. See the following example:julia> r = RNN(3 => 5)\nRecur(\n RNNCell(3 => 5, tanh), # 50 parameters\n) # Total: 4 trainable arrays, 50 parameters,\n # plus 1 non-trainable, 5 parameters, summarysize 432 bytes.\n\njulia> r.state |> size\n(5, 1)\n\njulia> r(rand(Float32, 3)) |> size\n(5,)\n\njulia> r.state |> size\n(5, 1)\n\njulia> r(rand(Float32, 3, 10)) |> size # batch size of 10\n(5, 10)\n\njulia> r.state |> size # state shape has changed\n(5, 10)\n\njulia> r(rand(Float32, 3)) |> size # erroneously outputs a length 5*10 = 50 vector.\n(50,)\n\nNote:\n\nRNNCells can be constructed directly by specifying the non-linear function, the Wi and Wh internal matrices, a bias vector b, and a learnable initial state state0. 
The Wi and Wh matrices do not need to be the same type, but if Wh is dxd, then Wi should be of shape dxN.\n\njulia> using LinearAlgebra\n\njulia> r = Flux.Recur(Flux.RNNCell(tanh, rand(5, 4), Tridiagonal(rand(5, 5)), rand(5), rand(5, 1)))\n\njulia> r(rand(4, 10)) |> size # batch size of 10\n(5, 10)\n\n\n\n\n\n","category":"function"},{"location":"reference/models/layers/#Flux.LSTM","page":"Built-in Layers","title":"Flux.LSTM","text":"LSTM(in => out)\n\nLong Short Term Memory recurrent layer. Behaves like an RNN but generally exhibits a longer memory span over sequences.\n\nThe arguments in and out describe the size of the feature vectors passed as input and as output. That is, it accepts a vector of length in or a batch of vectors represented as a in x B matrix and outputs a vector of length out or a batch of vectors of size out x B.\n\nThis constructor is syntactic sugar for Recur(LSTMCell(a...)), and so LSTMs are stateful. Note that the state shape can change depending on the inputs, and so it is good to reset! the model between inference calls if the batch size changes. See the examples below.\n\nSee this article for a good overview of the internals.\n\nExamples\n\njulia> l = LSTM(3 => 5)\nRecur(\n LSTMCell(3 => 5), # 190 parameters\n) # Total: 5 trainable arrays, 190 parameters,\n # plus 2 non-trainable, 10 parameters, summarysize 1.062 KiB.\n\njulia> l(rand(Float32, 3)) |> size\n(5,)\n\njulia> Flux.reset!(l);\n\njulia> l(rand(Float32, 3, 10)) |> size # batch size of 10\n(5, 10)\n\nwarning: Batch size changes\nFailing to call reset! when the input batch size changes can lead to unexpected behavior. See the example in RNN.\n\nNote:\n\nLSTMCells can be constructed directly by specifying the non-linear function, the Wi and Wh internal matrices, a bias vector b, and a learnable initial state state0. The Wi and Wh matrices do not need to be the same type. 
See the example in RNN.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/layers/#Flux.GRU","page":"Built-in Layers","title":"Flux.GRU","text":"GRU(in => out)\n\nGated Recurrent Unit layer. Behaves like an RNN but generally exhibits a longer memory span over sequences. This implements the variant proposed in v1 of the referenced paper.\n\nThe integer arguments in and out describe the size of the feature vectors passed as input and as output. That is, it accepts a vector of length in or a batch of vectors represented as a in x B matrix and outputs a vector of length out or a batch of vectors of size out x B.\n\nThis constructor is syntactic sugar for Recur(GRUCell(a...)), and so GRUs are stateful. Note that the state shape can change depending on the inputs, and so it is good to reset! the model between inference calls if the batch size changes. See the examples below.\n\nSee this article for a good overview of the internals.\n\nExamples\n\njulia> g = GRU(3 => 5)\nRecur(\n GRUCell(3 => 5), # 140 parameters\n) # Total: 4 trainable arrays, 140 parameters,\n # plus 1 non-trainable, 5 parameters, summarysize 792 bytes.\n\njulia> g(rand(Float32, 3)) |> size\n(5,)\n\njulia> Flux.reset!(g);\n\njulia> g(rand(Float32, 3, 10)) |> size # batch size of 10\n(5, 10)\n\nwarning: Batch size changes\nFailing to call reset! when the input batch size changes can lead to unexpected behavior. See the example in RNN.\n\nNote:\n\nGRUCells can be constructed directly by specifying the non-linear function, the Wi and Wh internal matrices, a bias vector b, and a learnable initial state state0. The Wi and Wh matrices do not need to be the same type. See the example in RNN.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/layers/#Flux.GRUv3","page":"Built-in Layers","title":"Flux.GRUv3","text":"GRUv3(in => out)\n\nGated Recurrent Unit layer. Behaves like an RNN but generally exhibits a longer memory span over sequences. 
This implements the variant proposed in v3 of the referenced paper.\n\nThe arguments in and out describe the size of the feature vectors passed as input and as output. That is, it accepts a vector of length in or a batch of vectors represented as a in x B matrix and outputs a vector of length out or a batch of vectors of size out x B.\n\nThis constructor is syntactic sugar for Recur(GRUv3Cell(a...)), and so GRUv3s are stateful. Note that the state shape can change depending on the inputs, and so it is good to reset! the model between inference calls if the batch size changes. See the examples below.\n\nSee this article for a good overview of the internals.\n\nExamples\n\njulia> g = GRUv3(3 => 5)\nRecur(\n GRUv3Cell(3 => 5), # 140 parameters\n) # Total: 5 trainable arrays, 140 parameters,\n # plus 1 non-trainable, 5 parameters, summarysize 848 bytes.\n\njulia> g(rand(Float32, 3)) |> size\n(5,)\n\njulia> Flux.reset!(g);\n\njulia> g(rand(Float32, 3, 10)) |> size # batch size of 10\n(5, 10)\n\nwarning: Batch size changes\nFailing to call reset! when the input batch size changes can lead to unexpected behavior. See the example in RNN.\n\nNote:\n\nGRUv3Cells can be constructed directly by specifying the non-linear function, the Wi, Wh, and Wh_h internal matrices, a bias vector b, and a learnable initial state state0. The Wi, Wh, and Wh_h matrices do not need to be the same type. See the example in RNN.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/layers/#Flux.Recur","page":"Built-in Layers","title":"Flux.Recur","text":"Recur(cell)\n\nRecur takes a recurrent cell and makes it stateful, managing the hidden state in the background. 
cell should be a model of the form:\n\nh, y = cell(h, x...)\n\nFor example, here's a recurrent network that keeps a running total of its inputs:\n\nExamples\n\njulia> accum(h, x) = (h + x, x)\naccum (generic function with 1 method)\n\njulia> rnn = Flux.Recur(accum, 0)\nRecur(accum)\n\njulia> rnn(2) \n2\n\njulia> rnn(3)\n3\n\njulia> rnn.state\n5\n\nFolding over a 3d Array of dimensions (features, batch, time) is also supported:\n\njulia> accum(h, x) = (h .+ x, x)\naccum (generic function with 1 method)\n\njulia> rnn = Flux.Recur(accum, zeros(Int, 1, 1))\nRecur(accum)\n\njulia> rnn([2])\n1-element Vector{Int64}:\n 2\n\njulia> rnn([3])\n1-element Vector{Int64}:\n 3\n\njulia> rnn.state\n1×1 Matrix{Int64}:\n 5\n\njulia> out = rnn(reshape(1:10, 1, 1, :)); # apply to a sequence of (features, batch, time)\n\njulia> out |> size\n(1, 1, 10)\n\njulia> vec(out)\n10-element Vector{Int64}:\n 1\n 2\n 3\n 4\n 5\n 6\n 7\n 8\n 9\n 10\n\njulia> rnn.state\n1×1 Matrix{Int64}:\n 60\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.reset!","page":"Built-in Layers","title":"Flux.reset!","text":"reset!(rnn)\n\nReset the hidden state of a recurrent layer back to its original value.\n\nAssuming you have a Recur layer rnn, this is roughly equivalent to:\n\nrnn.state = hidden(rnn.cell)\n\nExamples\n\njulia> r = Flux.RNNCell(relu, ones(1,1), zeros(1,1), ones(1,1), zeros(1,1)); # users should use the RNN wrapper struct instead\n\njulia> y = Flux.Recur(r, ones(1,1));\n\njulia> y.state\n1×1 Matrix{Float64}:\n 1.0\n\njulia> y(ones(1,1)) # relu(1*1 + 1)\n1×1 Matrix{Float64}:\n 2.0\n\njulia> y.state\n1×1 Matrix{Float64}:\n 2.0\n\njulia> Flux.reset!(y)\n1×1 Matrix{Float64}:\n 0.0\n\njulia> y.state\n1×1 Matrix{Float64}:\n 0.0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/layers/#Normalisation-and-Regularisation","page":"Built-in Layers","title":"Normalisation & 
Regularisation","text":"","category":"section"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"These layers don't affect the structure of the network but may improve training times or reduce overfitting. Some of them contain trainable parameters, while others do not.","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"BatchNorm\nDropout\nAlphaDropout\nLayerNorm\nInstanceNorm\nGroupNorm\nFlux.normalise","category":"page"},{"location":"reference/models/layers/#Flux.BatchNorm","page":"Built-in Layers","title":"Flux.BatchNorm","text":"BatchNorm(channels::Integer, λ=identity;\n initβ=zeros32, initγ=ones32,\n affine=true, track_stats=true, active=nothing,\n eps=1f-5, momentum= 0.1f0)\n\nBatch Normalization layer. channels should be the size of the channel dimension in your data (see below).\n\nGiven an array with N dimensions, call the N-1th the channel dimension. For a batch of feature vectors this is just the data dimension, for WHCN images it's the usual channel dimension.\n\nBatchNorm computes the mean and variance for each D_1×...×D_{N-2}×1×D_N input slice and normalises the input accordingly.\n\nIf affine=true, it also applies a shift and a rescale to the input through to learnable per-channel bias β and scale γ parameters.\n\nAfter normalisation, elementwise activation λ is applied.\n\nIf track_stats=true, accumulates mean and var statistics in training phase that will be used to renormalize the input in test phase.\n\nUse testmode! 
during inference.\n\nExamples\n\njulia> using Statistics\n\njulia> xs = rand(3, 3, 3, 2); # a batch of 2 images, each having 3 channels\n\njulia> m = BatchNorm(3);\n\njulia> Flux.trainmode!(m);\n\njulia> isapprox(std(m(xs)), 1, atol=0.1) && std(xs) != std(m(xs))\ntrue\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.Dropout","page":"Built-in Layers","title":"Flux.Dropout","text":"Dropout(p; [dims, rng, active])\n\nLayer implementing dropout with the given probability. This is used as a regularisation, i.e. to reduce overfitting.\n\nWhile training, it sets each input to 0 (with probability p) or else scales it by 1 / (1 - p), using the NNlib.dropout function. While testing, it has no effect.\n\nBy default the mode will switch automatically, but it can also be controlled manually via Flux.testmode!, or by passing keyword active=true for training mode.\n\nBy default every input is treated independently. With the dims keyword, instead it takes a random choice only along that dimension. For example Dropout(p; dims = 3) will randomly zero out entire channels on WHCN input (also called 2D dropout).\n\nKeyword rng lets you specify a custom random number generator. 
(Only supported on the CPU.)\n\nExamples\n\njulia> m = Chain(Dense(ones(3,2)), Dropout(0.4))\nChain(\n Dense(2 => 3), # 9 parameters\n Dropout(0.4),\n)\n\njulia> m(ones(2, 7)) # test mode, no effect\n3×7 Matrix{Float64}:\n 2.0 2.0 2.0 2.0 2.0 2.0 2.0\n 2.0 2.0 2.0 2.0 2.0 2.0 2.0\n 2.0 2.0 2.0 2.0 2.0 2.0 2.0\n\njulia> Flux.trainmode!(m) # equivalent to use within gradient\nChain(\n Dense(2 => 3), # 9 parameters\n Dropout(0.4, active=true),\n)\n\njulia> m(ones(2, 7))\n3×7 Matrix{Float64}:\n 0.0 0.0 3.33333 0.0 0.0 0.0 0.0\n 3.33333 0.0 3.33333 0.0 3.33333 0.0 3.33333\n 3.33333 3.33333 0.0 3.33333 0.0 0.0 3.33333\n\njulia> y = m(ones(2, 10_000));\n\njulia> using Statistics\n\njulia> mean(y) # is about 2.0, same as in test mode\n1.9989999999999961\n\njulia> mean(iszero, y) # is about 0.4\n0.4003\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.AlphaDropout","page":"Built-in Layers","title":"Flux.AlphaDropout","text":"AlphaDropout(p; [rng, active])\n\nA dropout layer. Used in Self-Normalizing Neural Networks. The AlphaDropout layer ensures that mean and variance of activations remain the same as before.\n\nDoes nothing to the input once testmode! is true.\n\nExamples\n\njulia> using Statistics\n\njulia> x = randn32(1000,1);\n\njulia> m = Chain(Dense(1000 => 1000, selu), AlphaDropout(0.2));\n\njulia> Flux.trainmode!(m);\n\njulia> y = m(x);\n\njulia> isapprox(std(x), std(y), atol=0.2)\ntrue\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.LayerNorm","page":"Built-in Layers","title":"Flux.LayerNorm","text":"LayerNorm(size..., λ=identity; affine=true, eps=1f-5)\n\nA normalisation layer designed to be used with recurrent hidden states. The argument size should be an integer or a tuple of integers.\n\nIn the forward pass, the layer normalises the mean and standard deviation of the input, then applies the elementwise activation λ. 
The input is normalised along the first length(size) dimensions for tuple size, and along the first dimension for integer size. The input is expected to have first dimensions' size equal to size.\n\nIf affine=true, it also applies a learnable shift and rescaling using the Scale layer.\n\nSee also BatchNorm, InstanceNorm, GroupNorm, and normalise.\n\nExamples\n\njulia> using Statistics\n\njulia> xs = rand(3, 3, 3, 2); # a batch of 2 images, each having 3 channels\n\njulia> m = LayerNorm(3);\n\njulia> y = m(xs);\n\njulia> isapprox(std(y, dims=1:3), ones(1, 1, 1, 2), atol=0.1) && std(y, dims=1:3) != std(xs, dims=1:3)\ntrue\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.InstanceNorm","page":"Built-in Layers","title":"Flux.InstanceNorm","text":"InstanceNorm(channels::Integer, λ=identity;\n initβ=zeros32, initγ=ones32,\n affine=false, track_stats=false,\n eps=1f-5, momentum=0.1f0)\n\nInstance Normalization layer. channels should be the size of the channel dimension in your data (see below).\n\nGiven an array with N > 2 dimensions, call the N-1th the channel dimension. 
For WHCN images it's the usual channel dimension.\n\nInstanceNorm computes the mean and variance for each D_1×...×D_{N-2}×1×1 input slice and normalises the input accordingly.\n\nIf affine=true, it also applies a shift and a rescale to the input through learnable per-channel bias β and scale γ parameters.\n\nIf track_stats=true, it accumulates mean and var statistics during the training phase, which are then used to renormalise the input in the test phase.\n\nWarning: the defaults for affine and track_stats used to be true in previous Flux versions (< v0.12).\n\nExamples\n\njulia> using Statistics\n\njulia> xs = rand(3, 3, 3, 2); # a batch of 2 images, each having 3 channels\n\njulia> m = InstanceNorm(3);\n\njulia> y = m(xs);\n\njulia> isapprox(std(y, dims=1:2), ones(1, 1, 3, 2), atol=0.2) && std(y, dims=1:2) != std(xs, dims=1:2)\ntrue\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.GroupNorm","page":"Built-in Layers","title":"Flux.GroupNorm","text":"GroupNorm(channels::Int, G::Int, λ = identity;\n initβ = zeros32,\n initγ = ones32,\n affine = true,\n eps = 1f-5,\n momentum = 0.1f0)\n\nGroup Normalization layer.\n\nchannels is the number of channels, the channel dimension of your input. For an array of N dimensions, the N-1th index is the channel dimension.\n\nG is the number of groups along which the statistics are computed. The number of channels must be an integer multiple of the number of groups.\n\nchannels should be the size of the channel dimension in your data (see below).\n\nGiven an array with N > 2 dimensions, call the N-1th the channel dimension. 
For WHCN images it's the usual channel dimension.\n\nIf affine=true, it also applies a shift and a rescale to the input through to learnable per-channel bias β and scale γ parameters.\n\nExamples\n\njulia> using Statistics\n\njulia> xs = rand(3, 3, 4, 2); # a batch of 2 images, each having 4 channels\n\njulia> m = GroupNorm(4, 2);\n\njulia> y = m(xs);\n\njulia> isapprox(std(y[:, :, 1:2, 1]), 1, atol=0.1) && std(xs[:, :, 1:2, 1]) != std(y[:, :, 1:2, 1])\ntrue\n\njulia> isapprox(std(y[:, :, 3:4, 2]), 1, atol=0.1) && std(xs[:, :, 3:4, 2]) != std(y[:, :, 3:4, 2])\ntrue\n\n\n\n\n\n","category":"type"},{"location":"reference/models/layers/#Flux.normalise","page":"Built-in Layers","title":"Flux.normalise","text":"normalise(x; dims=ndims(x), eps=1e-5)\n\nNormalise x to mean 0 and standard deviation 1 across the dimension(s) given by dims. Per default, dims is the last dimension. eps is a small term added to the denominator for numerical stability.\n\nExamples\n\njulia> using Statistics\n\njulia> x = [90, 100, 110, 130, 70];\n\njulia> mean(x), std(x; corrected=false)\n(100.0, 20.0)\n\njulia> y = Flux.normalise(x)\n5-element Vector{Float64}:\n -0.49999975000012503\n 0.0\n 0.49999975000012503\n 1.499999250000375\n -1.499999250000375\n\njulia> isapprox(std(y; corrected=false), 1, atol=1e-5)\ntrue\n\njulia> x = rand(10:100, 10, 10);\n\njulia> y = Flux.normalise(x, dims=1);\n\njulia> isapprox(std(y; dims=1, corrected=false), ones(1, 10), atol=1e-5)\ntrue\n\n\n\n\n\n","category":"function"},{"location":"reference/models/layers/#Test-vs.-Train","page":"Built-in Layers","title":"Test vs. Train","text":"","category":"section"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"Several normalisation layers behave differently under training and inference (testing). By default, Flux will automatically determine when a layer evaluation is part of training or inference. 
","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"warning: Warning\nThis automatic train/test detection works best with Zygote, the default automatic differentiation package. It may not work with other packages such as Tracker, Yota, or ForwardDiff.","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"The functions Flux.trainmode! and Flux.testmode! let you manually specify which behaviour you want. When called on a model, they will place all layers within the model into the specified mode.","category":"page"},{"location":"reference/models/layers/","page":"Built-in Layers","title":"Built-in Layers","text":"testmode!(::Any)\ntestmode!(::Any, ::Any)\ntrainmode!","category":"page"},{"location":"reference/models/layers/#Flux.testmode!-Tuple{Any}","page":"Built-in Layers","title":"Flux.testmode!","text":"testmode!(model, [mode]) -> model\n\nSet a layer, or all layers in a model, to test mode. This disables the effect of Dropout and some other regularisation layers.\n\nIf you manually set a model into test mode, you need to manually place it back into train mode during training phase, using trainmode!.\n\nThere is an optional second argument, which takes a symbol :auto to reset all layers back to the default automatic mode.\n\nExample\n\njulia> d = Dropout(0.3)\nDropout(0.3)\n\njulia> testmode!(d) # dropout is now always disabled\nDropout(0.3, active=false)\n\njulia> trainmode!(d) # dropout is now always enabled\nDropout(0.3, active=true)\n\njulia> testmode!(d, :auto) # back to default\nDropout(0.3)\n\n\n\n\n\n","category":"method"},{"location":"reference/models/layers/#Flux.testmode!-Tuple{Any, Any}","page":"Built-in Layers","title":"Flux.testmode!","text":"testmode!(model, inactive)\n\nThis two-argument method is largely internal. 
It recurses into the model until a method like testmode!(d::Dropout, inactive) alters the activity of a layer. Custom layers can support manual testmode! / trainmode! switching by defining such a method.\n\nPossible values of inactive are:\n\ntrue for testing, i.e. active=false\nfalse for training, same as trainmode!(m)\n:auto or nothing for Flux to detect training automatically.\n\ncompat: Compat\nThis method may be removed in a future breaking change, to separate the user-facing testmode! from the internal recursion.\n\n\n\n\n\n","category":"method"},{"location":"reference/models/layers/#Flux.trainmode!","page":"Built-in Layers","title":"Flux.trainmode!","text":"trainmode!(model) -> model\n\nSet a layer, or all layers in a model, to training mode. This is the opposite of testmode!; see further details there.\n\n\n\n\n\ntrainmode!(m, active)\n\nwarning: Warning\nThis two-argument method is deprecated.\n\nPossible values of active are:\n\ntrue for training, or \nfalse for testing, same as testmode!(m)\n:auto or nothing for Flux to detect training automatically.\n\n\n\n\n\n","category":"function"},{"location":"guide/models/overview/#man-overview","page":"Fitting a Line","title":"Flux Overview: Fitting a Straight Line","text":"","category":"section"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Flux is a pure Julia ML stack that allows you to build predictive models. 
Here are the steps for a typical Flux program:","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Provide training and test data\nBuild a model with configurable parameters to make predictions\nIteratively train the model by tweaking the parameters to improve predictions\nVerify your model","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Under the hood, Flux uses a technique called automatic differentiation to take gradients that help improve predictions. Flux is also fully written in Julia so you can easily replace any layer of Flux with your own code to improve your understanding or satisfy special requirements.","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Here's how you'd use Flux to build and train the most basic of models, step by step.","category":"page"},{"location":"guide/models/overview/#A-Trivial-Prediction","page":"Fitting a Line","title":"A Trivial Prediction","text":"","category":"section"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"This example will predict the output of the function 4x + 2. Making such predictions is called \"linear regression\", and is really too simple to need a neural network. 
But it's a nice toy example.","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"First, import Flux and define the function we want to simulate:","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> using Flux\n\njulia> actual(x) = 4x + 2\nactual (generic function with 1 method)","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"This example will build a model to approximate the actual function.","category":"page"},{"location":"guide/models/overview/#1.-Provide-Training-and-Test-Data","page":"Fitting a Line","title":"1. Provide Training and Test Data","text":"","category":"section"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Use the actual function to build sets of data for training and verification:","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> x_train, x_test = hcat(0:5...), hcat(6:10...)\n([0 1 … 4 5], [6 7 … 9 10])\n\njulia> y_train, y_test = actual.(x_train), actual.(x_test)\n([2 6 … 18 22], [26 30 … 38 42])","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Normally, your training and test data come from real world observations, but here we simulate them.","category":"page"},{"location":"guide/models/overview/#2.-Build-a-Model-to-Make-Predictions","page":"Fitting a Line","title":"2. 
Build a Model to Make Predictions","text":"","category":"section"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Now, build a model to make predictions with 1 input and 1 output:","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> model = Dense(1 => 1)\nDense(1 => 1) # 2 parameters\n\njulia> model.weight\n1×1 Matrix{Float32}:\n 0.95041317\n\njulia> model.bias\n1-element Vector{Float32}:\n 0.0","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Under the hood, a dense layer is a struct with fields weight and bias. The weight field holds a weight matrix and the bias field holds a bias vector. There's another way to think about a model. In Flux, models are conceptually predictive functions: ","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> predict = Dense(1 => 1)\nDense(1 => 1) # 2 parameters","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Dense(1 => 1) also implements the function σ(Wx+b) where W and b are the weights and biases. σ is an activation function (more on activations later). Our model has one weight and one bias, but typical models will have many more. Think of weights and biases as knobs and levers Flux can use to tune predictions. Activation functions are transformations that tailor models to your needs. 
","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"This model will already make predictions, though not accurate ones yet:","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> predict(x_train)\n1×6 Matrix{Float32}:\n 0.0 0.906654 1.81331 2.71996 3.62662 4.53327","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"In order to make better predictions, you'll need to provide a loss function to tell Flux how to objectively evaluate the quality of a prediction. Loss functions compute the cumulative distance between actual values and predictions. ","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> using Statistics\n\njulia> loss(model, x, y) = mean(abs2.(model(x) .- y));\n\njulia> loss(predict, x_train, y_train)\n122.64734f0","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"More accurate predictions will yield a lower loss. You can write your own loss functions or rely on those already provided by Flux. This loss function is called mean squared error (and built-in as mse). Flux works by iteratively reducing the loss through training.","category":"page"},{"location":"guide/models/overview/#3.-Improve-the-Prediction","page":"Fitting a Line","title":"3. Improve the Prediction","text":"","category":"section"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Under the hood, the Flux Flux.train! 
function uses a loss function and training data to improve the parameters of your model based on a pluggable optimiser:","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> using Flux: train!\n\njulia> opt = Descent()\nDescent(0.1)\n\njulia> data = [(x_train, y_train)]\n1-element Vector{Tuple{Matrix{Int64}, Matrix{Int64}}}:\n ([0 1 … 4 5], [2 6 … 18 22])","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Now we have the optimiser and the data we'll pass to train!. All that remains are the parameters of the model. Remember, each model is a Julia struct with a function and configurable parameters. The dense layer has weights and biases that depend on the dimensions of the inputs and outputs: ","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> predict.weight\n1×1 Matrix{Float32}:\n 0.9066542\n\njulia> predict.bias\n1-element Vector{Float32}:\n 0.0","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"The dimensions of these model parameters depend on the number of inputs and outputs.","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Flux will adjust predictions by iteratively changing these parameters according to the optimiser.","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"This optimiser implements the classic gradient descent strategy. Now improve the parameters of the model with a call to Flux.train! 
like this:","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> train!(loss, predict, data, opt)","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"And check the loss:","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> loss(predict, x_train, y_train)\n116.38745f0","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"It went down. Why? ","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> predict.weight, predict.bias\n(Float32[7.246838;;], Float32[1.748103])","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"The parameters have changed. This single step is the essence of machine learning.","category":"page"},{"location":"guide/models/overview/#3.-Iteratively-Train-the-Model","page":"Fitting a Line","title":"3+. Iteratively Train the Model","text":"","category":"section"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"In the previous section, we made a single call to train! which iterates over the data we passed in just once. An epoch refers to one pass over the dataset. Typically, we will run the training for multiple epochs to drive the loss down even further. 
Let's run it a few more times:","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> for epoch in 1:200\n train!(loss, predict, data, opt)\n end\n\njulia> loss(predict, x_train, y_train)\n0.00339581f0\n\njulia> predict.weight, predict.bias\n(Float32[4.0159144;;], Float32[2.004479])","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"After 200 training steps, the loss went down, and the parameters are getting close to those in the function the model is built to predict.","category":"page"},{"location":"guide/models/overview/#4.-Verify-the-Results","page":"Fitting a Line","title":"4. Verify the Results","text":"","category":"section"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Now, let's verify the predictions:","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"julia> predict(x_test)\n1×5 Matrix{Float32}:\n 26.1121 30.13 34.1479 38.1657 42.1836\n\njulia> y_test\n1×5 Matrix{Int64}:\n 26 30 34 38 42","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"The predictions are good. Here's how we got there. ","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"First, we gathered real-world data into the variables x_train, y_train, x_test, and y_test. The x_* data defines inputs, and the y_* data defines outputs. The *_train data is for training the model, and the *_test data is for verifying the model. Our data was based on the function 4x + 2.","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"Then, we built a single input, single output predictive model, predict = Dense(1 => 1). 
The initial predictions weren't accurate, because we had not trained the model yet.","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"After building the model, we trained it with train!(loss, predict, data, opt). The loss function is first, followed by the model itself, the training data, and the Descent optimiser provided by Flux. We ran the training step once, and observed that the parameters changed and the loss went down. Then, we ran train! many times to finish the training process.","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"After we trained the model, we checked its predictions against the test data. ","category":"page"},{"location":"guide/models/overview/","page":"Fitting a Line","title":"Fitting a Line","text":"This overall flow represents how Flux works. Let's drill down a bit to understand what's going on inside the individual layers of Flux.","category":"page"},{"location":"reference/destructure/#man-destructure","page":"Flat vs. Nested","title":"Flat vs. Nested Structures","text":"","category":"section"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. Nested","text":"A Flux model is a nested structure, with parameters stored within many layers. Sometimes you may want a flat representation of them, to interact with functions expecting just one vector. This is provided by destructure:","category":"page"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. 
Nested","text":"julia> model = Chain(Dense(2=>1, tanh), Dense(1=>1))\nChain(\n Dense(2 => 1, tanh), # 3 parameters\n Dense(1 => 1), # 2 parameters\n) # Total: 4 arrays, 5 parameters, 276 bytes.\n\njulia> flat, rebuild = Flux.destructure(model)\n(Float32[0.863101, 1.2454957, 0.0, -1.6345707, 0.0], Restructure(Chain, ..., 5))\n\njulia> rebuild(zeros(5)) # same structure, new parameters\nChain(\n Dense(2 => 1, tanh), # 3 parameters (all zero)\n Dense(1 => 1), # 2 parameters (all zero)\n) # Total: 4 arrays, 5 parameters, 276 bytes.","category":"page"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. Nested","text":"Both destructure and the Restructure function can be used within gradient computations. For instance, this computes the Hessian ∂²L/∂θᵢ∂θⱼ of some loss function, with respect to all parameters of the Flux model. The resulting matrix has off-diagonal entries, which cannot really be expressed in a nested structure:","category":"page"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. 
Nested","text":"julia> x = rand(Float32, 2, 16);\n\njulia> grad = gradient(m -> sum(abs2, m(x)), model) # nested gradient\n((layers = ((weight = Float32[10.339018 11.379145], bias = Float32[22.845667], σ = nothing), (weight = Float32[-29.565302;;], bias = Float32[-37.644184], σ = nothing)),),)\n\njulia> function loss(v::Vector)\n m = rebuild(v)\n y = m(x)\n sum(abs2, y)\n end;\n\njulia> gradient(loss, flat) # flat gradient, same numbers\n(Float32[10.339018, 11.379145, 22.845667, -29.565302, -37.644184],)\n\njulia> Zygote.hessian(loss, flat) # second derivative\n5×5 Matrix{Float32}:\n -7.13131 -5.54714 -11.1393 -12.6504 -8.13492\n -5.54714 -7.11092 -11.0208 -13.9231 -9.36316\n -11.1393 -11.0208 -13.7126 -27.9531 -22.741\n -12.6504 -13.9231 -27.9531 18.0875 23.03\n -8.13492 -9.36316 -22.741 23.03 32.0\n\njulia> Flux.destructure(grad) # acts on non-models, too\n(Float32[10.339018, 11.379145, 22.845667, -29.565302, -37.644184], Restructure(Tuple, ..., 5))","category":"page"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. Nested","text":"In order to collect all parameters of a model into a list instead, you can use the trainables function:","category":"page"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. Nested","text":"julia> Flux.trainables(model)\n5-element Vector{AbstractArray}:\n [0.863101 1.2454957]\n [0.0]\n [1.290355429422727;;]\n [0.0]","category":"page"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. Nested","text":"Any mutation of the elements of the resulting list will affect the model's parameters.","category":"page"},{"location":"reference/destructure/#All-Parameters","page":"Flat vs. Nested","title":"All Parameters","text":"","category":"section"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. 
Nested","text":"The functions destructure and trainables live in Optimisers.jl.","category":"page"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. Nested","text":"Optimisers.destructure\nOptimisers.trainable\nOptimisers.trainables\nOptimisers.isnumeric","category":"page"},{"location":"reference/destructure/#Optimisers.destructure","page":"Flat vs. Nested","title":"Optimisers.destructure","text":"destructure(model) -> vector, reconstructor\n\nCopies all trainable, isnumeric parameters in the model to a vector, and returns also a function which reverses this transformation. Differentiable.\n\nExample\n\njulia> v, re = destructure((x=[1.0, 2.0], y=(sin, [3.0 + 4.0im])))\n(ComplexF64[1.0 + 0.0im, 2.0 + 0.0im, 3.0 + 4.0im], Restructure(NamedTuple, ..., 3))\n\njulia> re([3, 5, 7+11im])\n(x = [3.0, 5.0], y = (sin, ComplexF64[7.0 + 11.0im]))\n\nIf model contains various number types, they are promoted to make vector, and are usually restored by Restructure. Such restoration follows the rules of ChainRulesCore.ProjectTo, and thus will restore floating point precision, but will permit more exotic numbers like ForwardDiff.Dual.\n\nIf model contains only GPU arrays, then vector will also live on the GPU. At present, a mixture of GPU and ordinary CPU arrays is undefined behaviour.\n\n\n\n\n\n","category":"function"},{"location":"reference/destructure/#Optimisers.trainable","page":"Flat vs. Nested","title":"Optimisers.trainable","text":"trainable(x::Layer) -> NamedTuple\n\nThis may be overloaded to make optimisers ignore some fields of every Layer, which would otherwise contain trainable parameters.\n\nwarning: Warning\nThis is very rarely required. Fields of struct Layer which contain functions, or integers like sizes, are always ignored anyway. 
Overloading trainable is only necessary when some arrays of numbers are to be optimised, and some arrays of numbers are not.\n\nThe default is Functors.children(x), usually a NamedTuple of all fields, and trainable(x) must contain a subset of these.\n\n\n\n\n\n","category":"function"},{"location":"reference/destructure/#Optimisers.trainables","page":"Flat vs. Nested","title":"Optimisers.trainables","text":"trainables(x, path = false)\n\nReturn an iterable over all the trainable parameters in x, that is all the numerical arrays (see isnumeric) which are reachable through trainable.\n\nParameters appearing multiple times in the model (tied weights) will be present only once in the output.\n\nIf path = false, the output is a list of numerical arrays.\n\nIf path = true, the output is a list of (KeyPath, AbstractArray) pairs, where KeyPath is a type representing the path to the array in the original structure.\n\nSee also destructure for a similar operation that returns a single flat vector instead.\n\nExamples\n\njulia> struct MyLayer\n w\n b\n end\n\njulia> Functors.@functor MyLayer\n\njulia> Optimisers.trainable(x::MyLayer) = (; w = x.w,) # only w is trainable in this example\n\njulia> x = MyLayer([1.0,2.0,3.0], [4.0,5.0,6.0]);\n\njulia> trainables(x)\n1-element Vector{AbstractArray}:\n [1.0, 2.0, 3.0]\n\n julia> x = MyLayer((a=[1.0,2.0], b=[3.0]), [4.0,5.0,6.0]);\n\n julia> trainables(x) # collects nested parameters\n 2-element Vector{AbstractArray}:\n [1.0, 2.0]\n [3.0]\n\njulia> x = (a = [1.0,2.0], b = (Dict(\"c\" => [3.0, 4.0], \"d\" => 5.0), [6.0,7.0]));\n\njulia> for (kp, y) in trainables(x, path = true)\n println(kp, \" => \", y)\n end\nKeyPath(:a,) => [1.0, 2.0]\nKeyPath(:b, 1, \"c\") => [3.0, 4.0]\nKeyPath(:b, 2) => [6.0, 7.0]\n\njulia> getkeypath(x, KeyPath(:b, 1, \"c\"))\n2-element Vector{Float64}:\n 3.0\n 4.0\n\n\n\n\n\n","category":"function"},{"location":"reference/destructure/#Optimisers.isnumeric","page":"Flat vs. 
Nested","title":"Optimisers.isnumeric","text":"isnumeric(x) -> Bool\n\nReturns true on any parameter to be adjusted by Optimisers.jl, namely arrays of non-integer numbers. Returns false on all other types.\n\nRequires also that Functors.isleaf(x) == true, to focus on e.g. the parent of a transposed matrix, not the wrapper.\n\n\n\n\n\n","category":"function"},{"location":"reference/destructure/#All-Layers","page":"Flat vs. Nested","title":"All Layers","text":"","category":"section"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. Nested","text":"Another kind of flat view of a nested model is provided by the modules command. This extracts a list of all layers:","category":"page"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. Nested","text":"Flux.modules","category":"page"},{"location":"reference/destructure/#Flux.modules","page":"Flat vs. Nested","title":"Flux.modules","text":"modules(m)\n\nReturn an iterator over non-leaf objects that can be reached by recursing m over the children given by functor.\n\nUseful for applying a function (e.g. a regularizer) over specific modules or subsets of the parameters (e.g. 
the weights but not the biases).\n\nExamples\n\njulia> m1 = Chain(Dense(28^2, 64), BatchNorm(64, relu));\n\njulia> m2 = Chain(m1, Dense(64, 10))\nChain(\n Chain(\n Dense(784 => 64), # 50_240 parameters\n BatchNorm(64, relu), # 128 parameters, plus 128\n ),\n Dense(64 => 10), # 650 parameters\n) # Total: 6 trainable arrays, 51_018 parameters,\n # plus 2 non-trainable, 128 parameters, summarysize 200.312 KiB.\n\njulia> Flux.modules(m2)\n7-element Vector{Any}:\n Chain(Chain(Dense(784 => 64), BatchNorm(64, relu)), Dense(64 => 10)) # 51_018 parameters, plus 128 non-trainable\n (Chain(Dense(784 => 64), BatchNorm(64, relu)), Dense(64 => 10))\n Chain(Dense(784 => 64), BatchNorm(64, relu)) # 50_368 parameters, plus 128 non-trainable\n (Dense(784 => 64), BatchNorm(64, relu))\n Dense(784 => 64) # 50_240 parameters\n BatchNorm(64, relu) # 128 parameters, plus 128 non-trainable\n Dense(64 => 10) # 650 parameters\n\njulia> L2(m) = sum(sum(abs2, l.weight) for l in Flux.modules(m) if l isa Dense)\nL2 (generic function with 1 method)\n\njulia> L2(m2) isa Float32\ntrue\n\n\n\n\n\n","category":"function"},{"location":"reference/destructure/#Save-and-Load","page":"Flat vs. Nested","title":"Save and Load","text":"","category":"section"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. Nested","text":"Flux.state\nFlux.loadmodel!","category":"page"},{"location":"reference/destructure/#Flux.state","page":"Flat vs. Nested","title":"Flux.state","text":"state(x)\n\nReturn an object with the same nested structure as x according to Functors.children, but made only of basic containers (e.g. named tuples, tuples, arrays, and dictionaries).\n\nBesides trainable and non-trainable arrays, the state will contain leaf nodes that are not arrays, such as numbers, symbols, strings, and nothing values. 
The leaf types that end up in the state could increase in the future.\n\nThis method is particularly useful for saving and loading models, since the state contains only simple data types that can be easily serialized.\n\nThe state can be passed to loadmodel! to restore the model.\n\nExamples\n\nCopy the state into another model\n\njulia> m1 = Chain(Dense(1, 2, tanh; init=ones), Dense(2, 1; init=ones));\n\njulia> s = Flux.state(m1)\n(layers = ((weight = [1.0; 1.0;;], bias = [0.0, 0.0], σ = ()), (weight = [1.0 1.0], bias = [0.0], σ = ())),)\n\njulia> m2 = Chain(Dense(1, 2, tanh), Dense(2, 1; bias=false)); # weights are random numbers\n\njulia> Flux.loadmodel!(m2, s);\n\njulia> m2[1].weight # now the weights of m2 are the same as m1\n2×1 Matrix{Float32}:\n 1.0\n 1.0\n\njulia> Flux.state(trainmode!(Dropout(0.2))) # contains p & activity, but not RNG state\n(p = 0.2, dims = (), active = true, rng = ())\n\njulia> Flux.state(BatchNorm(1)) # contains non-trainable arrays μ, σ²\n(λ = (), β = Float32[0.0], γ = Float32[1.0], μ = Float32[0.0], σ² = Float32[1.0], ϵ = 1.0f-5, momentum = 0.1f0, affine = true, track_stats = true, active = nothing, chs = 1)\n\nSave and load with BSON\n\njulia> using BSON\n\njulia> BSON.@save \"checkpoint.bson\" model_state = s\n\njulia> Flux.loadmodel!(m2, BSON.load(\"checkpoint.bson\")[:model_state])\n\nSave and load with JLD2\n\njulia> using JLD2\n\njulia> JLD2.jldsave(\"checkpoint.jld2\", model_state = s)\n\njulia> Flux.loadmodel!(m2, JLD2.load(\"checkpoint.jld2\", \"model_state\"))\n\n\n\n\n\n","category":"function"},{"location":"reference/destructure/#Flux.loadmodel!","page":"Flat vs. Nested","title":"Flux.loadmodel!","text":"loadmodel!(dst, src)\n\nCopy all the parameters (trainable and non-trainable) from src into dst.\n\nRecursively walks dst and src together using Functors.children, calling copyto! on parameter arrays or throwing an error when there is a mismatch. 
Non-array elements (such as activation functions) are not copied and need not match. Zero bias vectors and bias=false are considered equivalent (see extended help for more details).\n\nSee also Flux.state.\n\nExamples\n\njulia> dst = Chain(Dense(Flux.ones32(2, 5), Flux.ones32(2), tanh), Dense(2 => 1; bias = [1f0]))\nChain(\n Dense(5 => 2, tanh), # 12 parameters\n Dense(2 => 1), # 3 parameters\n) # Total: 4 arrays, 15 parameters, 316 bytes.\n\njulia> dst[1].weight ≈ ones(2, 5) # by construction\ntrue\n\njulia> src = Chain(Dense(5 => 2, relu), Dense(2 => 1, bias=false));\n\njulia> Flux.loadmodel!(dst, src);\n\njulia> dst[1].weight ≈ ones(2, 5) # values changed\nfalse\n\njulia> iszero(dst[2].bias)\ntrue\n\nExtended help\n\nThrows an error when:\n\ndst and src do not share the same fields (at any level)\nthe sizes of leaf nodes are mismatched between dst and src\ncopying non-array values to/from an array parameter (except inactive parameters described below)\ndst is a \"tied\" parameter (i.e. refers to another parameter) and loaded into multiple times with mismatched source values\n\nInactive parameters can be encoded by using the boolean value false instead of an array. If dst == false and src is an all-zero array, no error will be raised (and no values copied); however, attempting to copy a non-zero array to an inactive parameter will throw an error. Likewise, copying a src value of false to any dst array is valid, but copying a src value of true will error.\n\n\n\n\n\n","category":"function"},{"location":"reference/destructure/#KeyPath","page":"Flat vs. Nested","title":"KeyPath","text":"","category":"section"},{"location":"reference/destructure/","page":"Flat vs. Nested","title":"Flat vs. Nested","text":"Functors.KeyPath\nFunctors.getkeypath\nFunctors.haskeypath","category":"page"},{"location":"reference/destructure/#Functors.KeyPath","page":"Flat vs. 
Nested","title":"Functors.KeyPath","text":"KeyPath(keys...)\n\nA type for representing a path of keys to a value in a nested structure. Can be constructed with a sequence of keys, or by concatenating other KeyPaths. Keys can be of type Symbol, String, or Int.\n\nFor custom types, access through symbol keys is assumed to be done with getproperty. For consistency, the method Base.propertynames is used to get the viable property names.\n\nFor string and integer keys instead, the access is done with getindex.\n\nSee also getkeypath, haskeypath.\n\nExamples\n\njulia> kp = KeyPath(:b, 3)\nKeyPath(:b, 3)\n\njulia> KeyPath(:a, kp, :c, 4) # construct mixing keys and keypaths\nKeyPath(:a, :b, 3, :c, 4)\n\njulia> struct T\n a\n b\n end\n\njulia> function Base.getproperty(x::T, k::Symbol)\n if k in fieldnames(T)\n return getfield(x, k)\n elseif k === :ab\n return \"ab\"\n else \n error()\n end\n end;\n\njulia> Base.propertynames(::T) = (:a, :b, :ab);\n\njulia> x = T(3, Dict(:c => 4, :d => 5));\n\njulia> getkeypath(x, KeyPath(:ab)) # equivalent to x.ab\n\"ab\"\n\njulia> getkeypath(x, KeyPath(:b, :c)) # equivalent to (x.b)[:c]\n4\n\n\n\n\n\n","category":"type"},{"location":"reference/destructure/#Functors.getkeypath","page":"Flat vs. Nested","title":"Functors.getkeypath","text":"getkeypath(x, kp::KeyPath)\n\nReturn the value in x at the path kp.\n\nSee also KeyPath, haskeypath, and setkeypath!.\n\nExamples\n\njulia> x = Dict(:a => 3, :b => Dict(:c => 4, \"d\" => [5, 6, 7]))\nDict{Symbol, Any} with 2 entries:\n :a => 3\n :b => Dict{Any, Any}(:c=>4, \"d\"=>[5, 6, 7])\n\njulia> getkeypath(x, KeyPath(:b, \"d\", 2))\n6\n\n\n\n\n\n","category":"function"},{"location":"reference/destructure/#Functors.haskeypath","page":"Flat vs. 
Nested","title":"Functors.haskeypath","text":"haskeypath(x, kp::KeyPath)\n\nReturn true if x has a value at the path kp.\n\nSee also KeyPath, getkeypath, and setkeypath!.\n\nExamples\n\njulia> x = Dict(:a => 3, :b => Dict(:c => 4, \"d\" => [5, 6, 7]))\nDict{Symbol, Any} with 2 entries:\n :a => 3\n :b => Dict{Any, Any}(:c=>4, \"d\"=>[5, 6, 7])\n\njulia> haskeypath(x, KeyPath(:a))\ntrue\n\njulia> haskeypath(x, KeyPath(:b, \"d\", 1))\ntrue\n\njulia> haskeypath(x, KeyPath(:b, \"d\", 4))\nfalse\n\n\n\n\n\n","category":"function"},{"location":"reference/data/onehot/#One-Hot-Encoding-with-OneHotArrays.jl","page":"OneHotArrays.jl","title":"One-Hot Encoding with OneHotArrays.jl","text":"","category":"section"},{"location":"reference/data/onehot/","page":"OneHotArrays.jl","title":"OneHotArrays.jl","text":"It's common to encode categorical variables (like true, false or cat, dog) in \"one-of-k\" or \"one-hot\" form. OneHotArrays.jl provides the onehot function to make this easy.","category":"page"},{"location":"reference/data/onehot/","page":"OneHotArrays.jl","title":"OneHotArrays.jl","text":"julia> using OneHotArrays\n\njulia> onehot(:b, [:a, :b, :c])\n3-element OneHotVector(::UInt32) with eltype Bool:\n ⋅\n 1\n ⋅\n\njulia> onehot(:c, [:a, :b, :c])\n3-element OneHotVector(::UInt32) with eltype Bool:\n ⋅\n ⋅\n 1","category":"page"},{"location":"reference/data/onehot/","page":"OneHotArrays.jl","title":"OneHotArrays.jl","text":"There is also a onecold function, which is an inverse of onehot. 
It can also be given an array of numbers instead of booleans, in which case it performs an argmax-like operation, returning the label with the highest corresponding weight.","category":"page"},{"location":"reference/data/onehot/","page":"OneHotArrays.jl","title":"OneHotArrays.jl","text":"julia> onecold(ans, [:a, :b, :c])\n:c\n\njulia> onecold([true, false, false], [:a, :b, :c])\n:a\n\njulia> onecold([0.3, 0.2, 0.5], [:a, :b, :c])\n:c","category":"page"},{"location":"reference/data/onehot/","page":"OneHotArrays.jl","title":"OneHotArrays.jl","text":"For multiple samples at once, onehotbatch creates a batch (matrix) of one-hot vectors, and onecold treats matrices as batches.","category":"page"},{"location":"reference/data/onehot/","page":"OneHotArrays.jl","title":"OneHotArrays.jl","text":"julia> using OneHotArrays\n\njulia> onehotbatch([:b, :a, :b], [:a, :b, :c])\n3×3 OneHotMatrix(::Vector{UInt32}) with eltype Bool:\n ⋅ 1 ⋅\n 1 ⋅ 1\n ⋅ ⋅ ⋅\n\njulia> onecold(ans, [:a, :b, :c])\n3-element Vector{Symbol}:\n :b\n :a\n :b","category":"page"},{"location":"reference/data/onehot/","page":"OneHotArrays.jl","title":"OneHotArrays.jl","text":"Note that these operations returned OneHotVector and OneHotMatrix rather than Arrays. OneHotVectors behave like normal vectors but avoid any unnecessary cost compared to using an integer index directly. 
For example, multiplying a matrix with a one-hot vector simply slices out the relevant column of the matrix under the hood.","category":"page"},{"location":"reference/data/onehot/#Function-listing","page":"OneHotArrays.jl","title":"Function listing","text":"","category":"section"},{"location":"reference/data/onehot/","page":"OneHotArrays.jl","title":"OneHotArrays.jl","text":"OneHotArrays.onehot\nOneHotArrays.onecold\nOneHotArrays.onehotbatch\nOneHotArrays.OneHotArray\nOneHotArrays.OneHotVector\nOneHotArrays.OneHotMatrix","category":"page"},{"location":"reference/data/onehot/#OneHotArrays.onehot","page":"OneHotArrays.jl","title":"OneHotArrays.onehot","text":"onehot(x, labels, [default])\n\nReturns a OneHotVector which is roughly a sparse representation of x .== labels.\n\nInstead of storing say Vector{Bool}, it stores the index of the first occurrence of x in labels. If x is not found in labels, then it either returns onehot(default, labels), or gives an error if no default is given.\n\nSee also onehotbatch to apply this to many xs, and onecold to reverse either of these, as well as to generalise argmax.\n\nExamples\n\njulia> β = onehot(:b, (:a, :b, :c))\n3-element OneHotVector(::UInt32) with eltype Bool:\n ⋅\n 1\n ⋅\n\njulia> αβγ = (onehot(0, 0:2), β, onehot(:z, [:a, :b, :c], :c)) # uses default\n(Bool[1, 0, 0], Bool[0, 1, 0], Bool[0, 0, 1])\n\njulia> hcat(αβγ...) 
# preserves sparsity\n3×3 OneHotMatrix(::Vector{UInt32}) with eltype Bool:\n 1 ⋅ ⋅\n ⋅ 1 ⋅\n ⋅ ⋅ 1\n\n\n\n\n\n","category":"function"},{"location":"reference/data/onehot/#OneHotArrays.onecold","page":"OneHotArrays.jl","title":"OneHotArrays.onecold","text":"onecold(y::AbstractArray, labels = 1:size(y,1))\n\nRoughly the inverse operation of onehot or onehotbatch: This finds the index of the largest element of y, or each column of y, and looks them up in labels.\n\nIf labels are not specified, the default is integers 1:size(y,1) – the same operation as argmax(y, dims=1) but sometimes a different return type.\n\nExamples\n\njulia> onecold([false, true, false])\n2\n\njulia> onecold([0.3, 0.2, 0.5], (:a, :b, :c))\n:c\n\njulia> onecold([ 1 0 0 1 0 1 0 1 0 0 1\n 0 1 0 0 0 0 0 0 1 0 0\n 0 0 0 0 1 0 0 0 0 0 0\n 0 0 0 0 0 0 1 0 0 0 0\n 0 0 1 0 0 0 0 0 0 1 0 ], 'a':'e') |> String\n\"abeacadabea\"\n\n\n\n\n\n","category":"function"},{"location":"reference/data/onehot/#OneHotArrays.onehotbatch","page":"OneHotArrays.jl","title":"OneHotArrays.onehotbatch","text":"onehotbatch(xs, labels, [default])\n\nReturns a OneHotMatrix where kth column of the matrix is onehot(xs[k], labels). This is a sparse matrix, which stores just a Vector{UInt32} containing the indices of the nonzero elements.\n\nIf one of the inputs in xs is not found in labels, that column is onehot(default, labels) if default is given, else an error.\n\nIf xs has more dimensions, N = ndims(xs) > 1, then the result is an AbstractArray{Bool, N+1} which is one-hot along the first dimension, i.e. result[:, k...] == onehot(xs[k...], labels).\n\nNote that xs can be any iterable, such as a string. 
And that using a tuple for labels will often speed up construction, certainly for less than 32 classes.\n\nExamples\n\njulia> oh = onehotbatch(\"abracadabra\", 'a':'e', 'e')\n5×11 OneHotMatrix(::Vector{UInt32}) with eltype Bool:\n 1 ⋅ ⋅ 1 ⋅ 1 ⋅ 1 ⋅ ⋅ 1\n ⋅ 1 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1 ⋅ ⋅\n ⋅ ⋅ ⋅ ⋅ 1 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅\n ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1 ⋅ ⋅ ⋅ ⋅\n ⋅ ⋅ 1 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1 ⋅\n\njulia> reshape(1:15, 3, 5) * oh # this matrix multiplication is done efficiently\n3×11 Matrix{Int64}:\n 1 4 13 1 7 1 10 1 4 13 1\n 2 5 14 2 8 2 11 2 5 14 2\n 3 6 15 3 9 3 12 3 6 15 3\n\n\n\n\n\n","category":"function"},{"location":"reference/data/onehot/#OneHotArrays.OneHotArray","page":"OneHotArrays.jl","title":"OneHotArrays.OneHotArray","text":"OneHotArray{T, N, M, I} <: AbstractArray{Bool, M}\nOneHotArray(indices, L)\n\nA one-hot M-dimensional array with L labels (i.e. size(A, 1) == L and sum(A, dims=1) == 1) stored as a compact N == M-1-dimensional array of indices.\n\nTypically constructed by onehot and onehotbatch. Parameter I is the type of the underlying storage, and T its eltype.\n\n\n\n\n\n","category":"type"},{"location":"reference/data/onehot/#OneHotArrays.OneHotVector","page":"OneHotArrays.jl","title":"OneHotArrays.OneHotVector","text":"OneHotVector{T} = OneHotArray{T, 0, 1, T}\nOneHotVector(indices, L)\n\nA one-hot vector with L labels (i.e. length(A) == L and count(A) == 1) typically constructed by onehot. Stored efficiently as a single index of type T, usually UInt32.\n\n\n\n\n\n","category":"type"},{"location":"reference/data/onehot/#OneHotArrays.OneHotMatrix","page":"OneHotArrays.jl","title":"OneHotArrays.OneHotMatrix","text":"OneHotMatrix{T, I} = OneHotArray{T, 1, 2, I}\nOneHotMatrix(indices, L)\n\nA one-hot matrix (with L labels) typically constructed using onehotbatch. 
Stored efficiently as a vector of indices with type I and eltype T.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/zygote/","page":"Gradients – Zygote.jl","title":"Gradients – Zygote.jl","text":"CollapsedDocStrings = true","category":"page"},{"location":"reference/training/zygote/#autodiff-zygote","page":"Gradients – Zygote.jl","title":"Automatic Differentiation using Zygote.jl","text":"","category":"section"},{"location":"reference/training/zygote/","page":"Gradients – Zygote.jl","title":"Gradients – Zygote.jl","text":"Flux re-exports the gradient from Zygote, and uses this function within train! to differentiate the model. Zygote has its own documentation, in particular listing some important limitations.","category":"page"},{"location":"reference/training/zygote/#Explicit-style","page":"Gradients – Zygote.jl","title":"Explicit style","text":"","category":"section"},{"location":"reference/training/zygote/","page":"Gradients – Zygote.jl","title":"Gradients – Zygote.jl","text":"The preferred way of using Zygote, and the only way of using most other AD packages, is to explicitly provide a function and its arguments.","category":"page"},{"location":"reference/training/zygote/","page":"Gradients – Zygote.jl","title":"Gradients – Zygote.jl","text":"Zygote.gradient(f, args...)\nZygote.withgradient(f, args...)\nZygote.jacobian(f, args...)\nZygote.withjacobian(f, args...)\nZygote.hessian\nZygote.hessian_reverse\nZygote.diaghessian\nZygote.pullback","category":"page"},{"location":"reference/training/zygote/#Zygote.gradient-Tuple{Any, Vararg{Any}}","page":"Gradients – Zygote.jl","title":"Zygote.gradient","text":"gradient(f, args...)\n\nReturns a tuple containing ∂f/∂x for each argument x, the derivative (for scalar x) or the gradient. If no gradient is defined, ∂f/∂x will be nothing.\n\nf(args...) 
must be a real number, see jacobian for array output.\n\nSee also withgradient to keep the value f(args...), and pullback for value and back-propagator.\n\njulia> gradient(*, 2.0, 3.0, 5.0)\n(15.0, 10.0, 6.0)\n\njulia> gradient(x -> sum(abs2,x), [7.0, 11.0, 13.0])\n([14.0, 22.0, 26.0],)\n\njulia> gradient([7, 11], 0, 1) do x, y, d\n p = size(x, d)\n sum(x.^p .+ y)\n end\n([14.0, 22.0], 2.0, nothing)\n\n\n\n\n\n","category":"method"},{"location":"reference/training/zygote/#Zygote.withgradient-Tuple{Any, Vararg{Any}}","page":"Gradients – Zygote.jl","title":"Zygote.withgradient","text":"withgradient(f, args...)\nwithgradient(f, ::Params)\n\nReturns both the value of the function and the gradient, as a named tuple.\n\njulia> y, ∇ = withgradient(/, 1, 2)\n(val = 0.5, grad = (0.5, -0.25))\n\njulia> ∇ == gradient(/, 1, 2)\ntrue\n\nAllows you to capture auxiliary outputs, in addition to the scalar used by gradient. To do this, f must return a Tuple or NamedTuple. Then it calculates grad = gradient(first∘f, args...) but returns the whole val = f(args...):\n\njulia> withgradient([1,2,4]) do x\n z = 1 ./ x\n sum(z), z # here z is an auxiliary output\n end\n(val = (1.75, [1.0, 0.5, 0.25]), grad = ([-1.0, -0.25, -0.0625],))\n\njulia> withgradient(3.0, 4.0) do x, y\n (div = x/y, mul = x*y)\n end\n(val = (div = 0.75, mul = 12.0), grad = (0.25, -0.1875))\n\nAlso supports implicit mode:\n\njulia> w = [3.0];\n\njulia> res = withgradient(() -> sum(abs2, w), Params([w]))\n(val = 9.0, grad = Grads(...))\n\njulia> res.grad[w]\n1-element Vector{Float64}:\n 6.0\n\n\n\n\n\n","category":"method"},{"location":"reference/training/zygote/#Zygote.jacobian-Tuple{Any, Vararg{Any}}","page":"Gradients – Zygote.jl","title":"Zygote.jacobian","text":"jacobian(f, args...) -> Tuple\n\nFor each array a ∈ args this returns a matrix with Ja[k,i] = ∂y[k]/∂a[i] where y = f(args...) is usually a vector. 
Arrays of higher dimension are treated like vec(a), or vec(y) for output.\n\nFor scalar x::Number ∈ args, the result is a vector Jx[k] = ∂y[k]/∂x, while for scalar y all results have just one row.\n\nWith any other argument type, no result is produced, even if gradient would work.\n\nThis reverse-mode Jacobian needs to evaluate the pullback once for each element of y. Doing so is usually only efficient when length(y) is small compared to length(a), otherwise forward mode is likely to be better.\n\nSee also withjacobian, hessian, hessian_reverse.\n\nExamples\n\njulia> jacobian(a -> 100*a[1:3].^2, 1:7)[1] # first index (rows) is output\n3×7 Matrix{Int64}:\n 200 0 0 0 0 0 0\n 0 400 0 0 0 0 0\n 0 0 600 0 0 0 0\n\njulia> jacobian((a,x) -> a.^2 .* x, [1,2,3], 1) # scalar argument has vector jacobian\n([2 0 0; 0 4 0; 0 0 6], [1, 4, 9])\n\njulia> jacobian((a,d) -> prod(a, dims=d), [1 2; 3 4; 5 6], 2)\n([2 0 … 0 0; 0 4 … 3 0; 0 0 … 0 5], [0, 0, 0])\n\nwarning: Warning\nFor arguments of any type except Number & AbstractArray, the result is nothing.\n\njulia> jacobian((a,s) -> a.^length(s), [1,2,3], \"str\")\n([3 0 0; 0 12 0; 0 0 27], nothing)\n\njulia> jacobian((a,t) -> sum(a .* t[1]) + t[2], [1,2,3], (4,5))\n([4 4 4], nothing)\n\njulia> gradient((a,t) -> sum(a .* t[1]) + t[2], [1,2,3], (4,5)) # gradient understands the tuple\n([4 4 4], (6, 1))\n\n\n\n\n\n","category":"method"},{"location":"reference/training/zygote/#Zygote.withjacobian-Tuple{Any, Vararg{Any}}","page":"Gradients – Zygote.jl","title":"Zygote.withjacobian","text":"withjacobian(f, args...)\n\nReturns both the value f(args...) and the jacobian as a named tuple.\n\njulia> withjacobian(cumsum, [1,2,3])\n(val = [1, 3, 6], grad = ([1 0 0; 1 1 0; 1 1 1],))\n\n\n\n\n\n","category":"method"},{"location":"reference/training/zygote/#Zygote.hessian","page":"Gradients – Zygote.jl","title":"Zygote.hessian","text":"hessian(f, x)\n\nConstruct the Hessian ∂²f/∂x², where x is a real number or an array, and f(x) is a real number. 
When x is an array, the result is a matrix H[i,j] = ∂²f/∂x[i]∂x[j], using linear indexing x[i] even if the argument is higher-dimensional.\n\nThis uses forward over reverse, ForwardDiff over Zygote, calling hessian_dual(f, x). See hessian_reverse for an all-Zygote alternative.\n\nSee also diaghessian to compute only the diagonal part.\n\nExamples\n\njulia> hessian(x -> x[1]*x[2], randn(2))\n2×2 Matrix{Float64}:\n 0.0 1.0\n 1.0 0.0\n\njulia> hessian(x -> sum(x.^3), [1 2; 3 4]) # uses linear indexing of x\n4×4 Matrix{Int64}:\n 6 0 0 0\n 0 18 0 0\n 0 0 12 0\n 0 0 0 24\n\njulia> hessian(sin, pi/2)\n-1.0\n\n\n\n\n\n","category":"function"},{"location":"reference/training/zygote/#Zygote.hessian_reverse","page":"Gradients – Zygote.jl","title":"Zygote.hessian_reverse","text":"hessian_reverse(f, x)\n\nThis should be equivalent to hessian(f, x), but implemented using reverse over reverse mode, all Zygote. (This is usually much slower, and more likely to find errors.)\n\n\n\n\n\n","category":"function"},{"location":"reference/training/zygote/#Zygote.diaghessian","page":"Gradients – Zygote.jl","title":"Zygote.diaghessian","text":"diaghessian(f, args...) -> Tuple\n\nDiagonal part of the Hessian. Returns a tuple containing, for each argument x, h of the same shape with h[i] = Hᵢᵢ = ∂²y/∂x[i]∂x[i]. The original evaluation y = f(args...) must give a real number y.\n\nFor one vector argument x, this is equivalent to (diag(hessian(f,x)),). Like hessian it uses ForwardDiff over Zygote. 
\n\nwarning: Warning\nFor arguments of any type except Number & AbstractArray, the result is nothing.\n\nExamples\n\njulia> diaghessian(x -> sum(x.^3), [1 2; 3 4])[1]\n2×2 Matrix{Int64}:\n 6 12\n 18 24\n\njulia> Diagonal(vec(ans)) == hessian(x -> sum(x.^3), [1 2; 3 4]) # full Hessian is diagonal\ntrue\n\njulia> diaghessian((x,y) -> sum(x .* y .* y'), [1 22; 333 4], [0.5, 0.666]) # two array arguments\n([0.0 0.0; 0.0 0.0], [2.0, 8.0])\n\njulia> diaghessian(atan, 1, 2) # two scalar arguments\n(-0.16, 0.16)\n\njulia> hessian(xy -> atan(xy[1], xy[2]), [1, 2]) # full Hessian is not diagonal\n2×2 Matrix{Float64}:\n -0.16 -0.12\n -0.12 0.16\n\n\n\n\n\n","category":"function"},{"location":"reference/training/zygote/#ZygoteRules.pullback","page":"Gradients – Zygote.jl","title":"ZygoteRules.pullback","text":"pullback(f, args...)\npullback(f, ::Params)\n\nReturns the value of the function f and a back-propagator function, which can be called to obtain a tuple containing ∂f/∂x for each argument x, the derivative (for scalar x) or gradient.\n\ny, back = pullback(f, args...)\n∇ = back(seed)\n\nback must be called with a start value seed matching the output of f(args...). If f(args...) returns a number, seed should be a number. If f(args...) 
returns an array, seed should be an equally-sized array.\n\nSee also withgradient to obtain the value and gradients in one call, and gradient for obtaining just the gradients.\n\njulia> y, back = pullback(*, 2.0, 3.0, 5.0);\n\njulia> y\n30.0\n\njulia> back(1.0)\n(15.0, 10.0, 6.0)\n\njulia> back(2.0)\n(30.0, 20.0, 12.0)\n\njulia> y, back = pullback(x -> [x, x], 1.0);\n\njulia> y\n2-element Vector{Float64}:\n 1.0\n 1.0\n\njulia> back([1.0, 1.0])\n(2.0,)\n\njulia> back([2.0, nothing])\n(2.0,)\n\n\n\n\n\n","category":"function"},{"location":"reference/training/zygote/#ChainRules","page":"Gradients – Zygote.jl","title":"ChainRules","text":"","category":"section"},{"location":"reference/training/zygote/","page":"Gradients – Zygote.jl","title":"Gradients – Zygote.jl","text":"Sometimes it is necessary to exclude some code, or a whole function, from automatic differentiation. This can be done using ChainRules:","category":"page"},{"location":"reference/training/zygote/","page":"Gradients – Zygote.jl","title":"Gradients – Zygote.jl","text":"ChainRulesCore.ignore_derivatives\nChainRulesCore.@non_differentiable","category":"page"},{"location":"reference/training/zygote/#ChainRulesCore.ignore_derivatives","page":"Gradients – Zygote.jl","title":"ChainRulesCore.ignore_derivatives","text":"ignore_derivatives(f::Function)\n\nTells the AD system to ignore the gradients of the wrapped closure. The primal computation (forward pass) is executed normally.\n\nignore_derivatives() do\n value = rand()\n push!(collection, value)\nend\n\nUsing this incorrectly could lead to incorrect gradients. For example, the following function will have zero gradients with respect to its argument:\n\nfunction wrong_grads(x)\n y = ones(3)\n ignore_derivatives() do\n push!(y, x)\n end\n return sum(y)\nend\n\n\n\n\n\nignore_derivatives(x)\n\nTells the AD system to ignore the gradients of the argument. 
Can be used to avoid unnecessary computation of gradients.\n\nignore_derivatives(x) * w\n\n\n\n\n\n","category":"function"},{"location":"reference/training/zygote/#ChainRulesCore.@non_differentiable","page":"Gradients – Zygote.jl","title":"ChainRulesCore.@non_differentiable","text":"@non_differentiable(signature_expression)\n\nA helper to make it easier to declare that a method is not differentiable. This is a short-hand for defining an frule and rrule that return NoTangent() for all partials (even for the function s̄elf-partial itself)\n\nKeyword arguments should not be included.\n\njulia> @non_differentiable Base.:(==)(a, b)\n\njulia> _, pullback = rrule(==, 2.0, 3.0);\n\njulia> pullback(1.0)\n(NoTangent(), NoTangent(), NoTangent())\n\nYou can place type-constraints in the signature:\n\njulia> @non_differentiable Base.length(xs::Union{Number, Array})\n\njulia> frule((ZeroTangent(), 1), length, [2.0, 3.0])\n(2, NoTangent())\n\nwarning: Warning\nThis helper macro covers only the simple common cases. It does not support where-clauses. For these you can declare the rrule and frule directly\n\n\n\n\n\n","category":"macro"},{"location":"reference/training/zygote/","page":"Gradients – Zygote.jl","title":"Gradients – Zygote.jl","text":"To manually supply the gradient for one function, you should define a method of rrule. ChainRules has detailed documentation on how this works.","category":"page"},{"location":"reference/training/zygote/","page":"Gradients – Zygote.jl","title":"Gradients – Zygote.jl","text":"ChainRulesCore.rrule\nChainRulesCore.frule\nChainRulesCore.@scalar_rule\nChainRulesCore.NoTangent\nChainRulesCore.ZeroTangent\nChainRulesCore.RuleConfig\nChainRulesCore.Tangent\nChainRulesCore.canonicalize","category":"page"},{"location":"reference/training/zygote/#ChainRulesCore.rrule","page":"Gradients – Zygote.jl","title":"ChainRulesCore.rrule","text":"rrule([::RuleConfig,] f, x...)\n\nExpressing x as the tuple (x₁, x₂, ...) and the output tuple of f(x...) 
as Ω, return the tuple:\n\n(Ω, (Ω̄₁, Ω̄₂, ...) -> (s̄elf, x̄₁, x̄₂, ...))\n\nWhere the second return value is the propagation rule or pullback. It takes in cotangents corresponding to the outputs (Ω̄₁, Ω̄₂, ...), and returns the cotangents of the inputs (x̄₁, x̄₂, ...) along with s̄elf, the cotangent for the internal values of the function itself (for closures).\n\nIf no method matching rrule(f, xs...) has been defined, then return nothing.\n\nExamples:\n\nunary input, unary output scalar function:\n\njulia> x = rand();\n\njulia> sinx, sin_pullback = rrule(sin, x);\n\njulia> sinx == sin(x)\ntrue\n\njulia> sin_pullback(1) == (NoTangent(), cos(x))\ntrue\n\nbinary input, unary output scalar function:\n\njulia> x, y = rand(2);\n\njulia> hypotxy, hypot_pullback = rrule(hypot, x, y);\n\njulia> hypotxy == hypot(x, y)\ntrue\n\njulia> hypot_pullback(1) == (NoTangent(), (x / hypot(x, y)), (y / hypot(x, y)))\ntrue\n\nThe optional RuleConfig option allows specifying rrules only for AD systems that support given features. If not needed, then it can be omitted and the rrule without it will be hit as a fallback. This is the case for most rules.\n\nSee also: frule, @scalar_rule, RuleConfig\n\n\n\n\n\n","category":"function"},{"location":"reference/training/zygote/#ChainRulesCore.frule","page":"Gradients – Zygote.jl","title":"ChainRulesCore.frule","text":"frule([::RuleConfig,] (Δf, Δx...), f, x...)\n\nExpressing the output of f(x...) as Ω, return the tuple:\n\n(Ω, ΔΩ)\n\nThe second return value is the tangent w.r.t. the output.\n\nIf no method matching frule((Δf, Δx...), f, x...) 
has been defined, then return nothing.\n\nExamples:\n\nunary input, unary output scalar function:\n\njulia> dself = NoTangent();\n\njulia> x = rand()\n0.8236475079774124\n\njulia> sinx, Δsinx = frule((dself, 1), sin, x)\n(0.7336293678134624, 0.6795498147167869)\n\njulia> sinx == sin(x)\ntrue\n\njulia> Δsinx == cos(x)\ntrue\n\nUnary input, binary output scalar function:\n\njulia> sincosx, Δsincosx = frule((dself, 1), sincos, x);\n\njulia> sincosx == sincos(x)\ntrue\n\njulia> Δsincosx[1] == cos(x)\ntrue\n\njulia> Δsincosx[2] == -sin(x)\ntrue\n\nNote that, technically speaking, Julia does not have multiple-output functions, just functions that return a single output that is iterable, like a Tuple. So this is actually a Tangent:\n\njulia> Δsincosx\nTangent{Tuple{Float64, Float64}}(0.6795498147167869, -0.7336293678134624)\n\nThe optional RuleConfig option allows specifying frules only for AD systems that support given features. If not needed, then it can be omitted and the frule without it will be hit as a fallback. This is the case for most rules.\n\nSee also: rrule, @scalar_rule, RuleConfig\n\n\n\n\n\n","category":"function"},{"location":"reference/training/zygote/#ChainRulesCore.@scalar_rule","page":"Gradients – Zygote.jl","title":"ChainRulesCore.@scalar_rule","text":"@scalar_rule(f(x₁, x₂, ...),\n @setup(statement₁, statement₂, ...),\n (∂f₁_∂x₁, ∂f₁_∂x₂, ...),\n (∂f₂_∂x₁, ∂f₂_∂x₂, ...),\n ...)\n\nA convenience macro that generates simple scalar forward or reverse rules using the provided partial derivatives. 
Specifically, generates the corresponding methods for frule and rrule:\n\nfunction ChainRulesCore.frule((NoTangent(), Δx₁, Δx₂, ...), ::typeof(f), x₁::Number, x₂::Number, ...)\n Ω = f(x₁, x₂, ...)\n $(statement₁, statement₂, ...)\n return Ω, (\n (∂f₁_∂x₁ * Δx₁ + ∂f₁_∂x₂ * Δx₂ + ...),\n (∂f₂_∂x₁ * Δx₁ + ∂f₂_∂x₂ * Δx₂ + ...),\n ...\n )\nend\n\nfunction ChainRulesCore.rrule(::typeof(f), x₁::Number, x₂::Number, ...)\n Ω = f(x₁, x₂, ...)\n $(statement₁, statement₂, ...)\n return Ω, ((ΔΩ₁, ΔΩ₂, ...)) -> (\n NoTangent(),\n (∂f₁_∂x₁ * ΔΩ₁ + ∂f₂_∂x₁ * ΔΩ₂ + ...),\n (∂f₁_∂x₂ * ΔΩ₁ + ∂f₂_∂x₂ * ΔΩ₂ + ...),\n ...\n )\nend\n\nIf no type constraints in f(x₁, x₂, ...) within the call to @scalar_rule are provided, each parameter in the resulting frule/rrule definition is given a type constraint of Number. Constraints may also be explicitly provided to override the Number constraint, e.g. f(x₁::Complex, x₂), which will constrain x₁ to Complex and x₂ to Number.\n\nAt present this does not support defining for closures/functors. Thus in reverse-mode, the first returned partial, representing the derivative with respect to the function itself, is always NoTangent(). And in forward-mode, the first input to the returned propagator is always ignored.\n\nThe result of f(x₁, x₂, ...) is automatically bound to Ω. This allows the primal result to be conveniently referenced (as Ω) within the derivative/setup expressions.\n\nThis macro assumes complex functions are holomorphic. In general, for non-holomorphic functions, the frule and rrule must be defined manually.\n\nIf the derivative is one (e.g. for identity functions), true can be used as the most general multiplicative identity.\n\nThe @setup argument can be elided if no setup code is needed. 
In other words:\n\n@scalar_rule(f(x₁, x₂, ...),\n (∂f₁_∂x₁, ∂f₁_∂x₂, ...),\n (∂f₂_∂x₁, ∂f₂_∂x₂, ...),\n ...)\n\nis equivalent to:\n\n@scalar_rule(f(x₁, x₂, ...),\n @setup(nothing),\n (∂f₁_∂x₁, ∂f₁_∂x₂, ...),\n (∂f₂_∂x₁, ∂f₂_∂x₂, ...),\n ...)\n\nFor examples, see ChainRules' rulesets directory.\n\nSee also: frule, rrule.\n\n\n\n\n\n","category":"macro"},{"location":"reference/training/zygote/#ChainRulesCore.NoTangent","page":"Gradients – Zygote.jl","title":"ChainRulesCore.NoTangent","text":"NoTangent() <: AbstractZero\n\nThis tangent indicates that the derivative does not exist. It is the tangent type for primal types that are not differentiable, such as integers or booleans (when they are not being used to represent floating-point values). The only valid way to perturb such values is to not change them at all. As a consequence, NoTangent is functionally identical to ZeroTangent(), but it provides additional semantic information.\n\nAdding NoTangent() to a primal is generally wrong: gradient-based methods cannot be used to optimize over discrete variables. An optimization package making use of this might want to check for such a case.\n\nnote: Note\nThis does not indicate that the derivative is not implemented, but rather that mathematically it is not defined.\n\nThis mostly shows up as the derivative with respect to dimension, index, or size arguments.\n\n function rrule(::typeof(fill), x, len::Int)\n y = fill(x, len)\n fill_pullback(ȳ) = (NoTangent(), @thunk(sum(ȳ)), NoTangent())\n return y, fill_pullback\n end\n\n\n\n\n\n","category":"type"},{"location":"reference/training/zygote/#ChainRulesCore.ZeroTangent","page":"Gradients – Zygote.jl","title":"ChainRulesCore.ZeroTangent","text":"ZeroTangent() <: AbstractZero\n\nThe additive identity for tangents. This is basically the same as 0. 
A derivative of ZeroTangent() does not propagate through the primal function.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/zygote/#ChainRulesCore.RuleConfig","page":"Gradients – Zygote.jl","title":"ChainRulesCore.RuleConfig","text":"RuleConfig{T}\n\nThe configuration for what rules to use. T: traits. This should be a Union of all special traits needed for rules to be allowed to be defined for your AD. If nothing special is needed, this should be set to Union{}.\n\nAD authors should define a subtype of RuleConfig to use when calling frule/rrule.\n\nRule authors can dispatch on this config when defining rules. For example:\n\n# only define rrule for `pop!` on AD systems where mutation is supported.\nrrule(::RuleConfig{>:SupportsMutation}, ::typeof(pop!), ::Vector) = ...\n\n# this definition of map is for any AD that defines a forwards mode\nrrule(conf::RuleConfig{>:HasForwardsMode}, ::typeof(map), ::Vector) = ...\n\n# this definition of map is for any AD that only defines a reverse mode.\n# It is not as good as the rrule that can be used if the AD defines a forward-mode as well.\nrrule(conf::RuleConfig{>:Union{NoForwardsMode, HasReverseMode}}, ::typeof(map), ::Vector) = ...\n\nFor more details see rule configurations and calling back into AD.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/zygote/#ChainRulesCore.Tangent","page":"Gradients – Zygote.jl","title":"ChainRulesCore.Tangent","text":"Tangent{P, T} <: StructuralTangent{P} <: AbstractTangent\n\nThis type represents the tangent for a struct/NamedTuple, or Tuple. P is the corresponding primal type that this is a tangent for.\n\nTangent{P} should have fields (technically properties), that match to a subset of the fields of the primal type; and each should be a tangent type matching to the primal type of that field. Fields of the P that are not present in the Tangent are treated as Zero.\n\nT is an implementation detail representing the backing data structure. 
For Tuple it will be a Tuple, and for everything else it will be a NamedTuple. It should not be passed in by the user.\n\nFor Tangents of Tuples, iterate and getindex are overloaded to behave as they do for a tuple. For Tangents of structs, getproperty is overloaded to allow for accessing values via tangent.fieldname. Any fields not explicitly present in the Tangent are treated as being set to ZeroTangent(). To make a Tangent have all the fields of the primal, the canonicalize function is provided.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/zygote/#ChainRulesCore.canonicalize","page":"Gradients – Zygote.jl","title":"ChainRulesCore.canonicalize","text":"canonicalize(tangent::Tangent{P}) -> Tangent{P}\n\nReturn the canonical Tangent for the primal type P. The property names of the returned Tangent match the field names of the primal, and all fields of P not present in the input tangent are explicitly set to ZeroTangent().\n\n\n\n\n\n","category":"function"},{"location":"guide/models/basics/#man-basics","page":"Gradients and Layers","title":"How Flux Works: Gradients and Layers","text":"","category":"section"},{"location":"guide/models/basics/#man-taking-gradients","page":"Gradients and Layers","title":"Taking Gradients","text":"","category":"section"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Flux's core feature is taking gradients of Julia code. The gradient function takes another Julia function f and a set of arguments, and returns the gradient with respect to each argument. 
(It's a good idea to try pasting these examples in the Julia terminal.)","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"julia> using Flux\n\njulia> f(x) = 3x^2 + 2x + 1;\n\njulia> df(x) = gradient(f, x)[1]; # df/dx = 6x + 2\n\njulia> df(2)\n14.0\n\njulia> d2f(x) = gradient(df, x)[1]; # d²f/dx² = 6\n\njulia> d2f(2)\n6.0","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"When a function has many parameters, we can get gradients of each one at the same time:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"julia> f(x, y) = sum((x .- y).^2);\n\njulia> gradient(f, [2, 1], [2, 0])\n([0.0, 2.0], [-0.0, -2.0])","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"These gradients are based on x and y. Flux works by instead taking gradients based on the weights and biases that make up the parameters of a model.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Machine learning often can have hundreds of parameter arrays. Instead of passing them to gradient individually, we can store them together in a structure. The simplest example is a named tuple, created by the following syntax:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"julia> nt = (a = [2, 1], b = [2, 0], c = tanh);\n\njulia> g(x::NamedTuple) = sum(abs2, x.a .- x.b);\n\njulia> g(nt)\n1\n\njulia> dg_nt = gradient(g, nt)[1]\n(a = [0.0, 2.0], b = [-0.0, -2.0], c = nothing)","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Notice that gradient has returned a matching structure. 
The field dg_nt.a is the gradient for nt.a, and so on. Some fields have no gradient, indicated by nothing. ","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Rather than define a function like g every time (and think up a name for it), it is often useful to use anonymous functions: this one is x -> sum(abs2, x.a .- x.b). Anonymous functions can be defined either with -> or with do, and such do blocks are often useful if you have a few steps to perform:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"julia> gradient((x, y) -> sum(abs2, x.a ./ y .- x.b), nt, [1, 2])\n((a = [0.0, 0.5], b = [-0.0, -1.0], c = nothing), [-0.0, -0.25])\n\njulia> gradient(nt, [1, 2]) do x, y\n z = x.a ./ y\n sum(abs2, z .- x.b)\n end\n((a = [0.0, 0.5], b = [-0.0, -1.0], c = nothing), [-0.0, -0.25])","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Sometimes you may want to know the value of the function, as well as its gradient. 
Rather than calling the function a second time, you can call withgradient instead:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"julia> Flux.withgradient(g, nt)\n(val = 1, grad = ((a = [0.0, 2.0], b = [-0.0, -2.0], c = nothing),))","category":"page"},{"location":"guide/models/basics/#Building-Simple-Models","page":"Gradients and Layers","title":"Building Simple Models","text":"","category":"section"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Consider a simple linear regression, which tries to predict an output array y from an input x.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"\npredict(W, b, x) = W*x .+ b\n\nfunction loss(W, b, x, y)\n ŷ = predict(W, b, x)\n sum((y .- ŷ).^2)\nend\n\nx, y = rand(5), rand(2) # Dummy data\nW = rand(2, 5)\nb = rand(2)\n\nloss(W, b, x, y) # ~ 3","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"To improve the prediction we can take the gradients of the loss with respect to W and b and perform gradient descent.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"using Flux\n\ndW, db = gradient((W, b) -> loss(W, b, x, y), W, b)","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Now that we have gradients, we can pull them out and update W to train the model.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"W .-= 0.1 .* dW\n\nloss(W, b, x, y) # ~ 2.5","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"The loss has decreased a little, meaning that our prediction ŷ 
is closer to the target y. If we have some data we can already try training the model.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"All deep learning in Flux, however complex, is a simple generalisation of this example. Of course, models can look very different – they might have millions of parameters or complex control flow. Let's see how Flux handles more complex models.","category":"page"},{"location":"guide/models/basics/#Building-Layers","page":"Gradients and Layers","title":"Building Layers","text":"","category":"section"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"It's common to create more complex models than the linear regression above. For example, we might want to have two linear layers with a nonlinearity like sigmoid in between them. We could write this as:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"using Flux\n\nW1 = rand(3, 5)\nb1 = rand(3)\nlayer1(x) = W1 * x .+ b1\n\nW2 = rand(2, 3)\nb2 = rand(2)\nlayer2(x) = W2 * x .+ b2\n\nmodel(x) = layer2(sigmoid.(layer1(x)))\n\nmodel(rand(5)) # => 2-element vector","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"This works but is fairly unwieldy, with a lot of repetition – especially as we add more layers. 
One way to factor this out is to create a function that returns linear layers.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"function linear(in, out)\n W = randn(out, in)\n b = randn(out)\n x -> W * x .+ b\nend\n\nlinear1 = linear(5, 3) # we can access linear1.W etc\nlinear2 = linear(3, 2)\n\nmodel(x) = linear2(sigmoid.(linear1(x)))\n\nmodel(rand(5)) # => 2-element vector","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Another (equivalent) way is to create a struct that explicitly represents the affine layer.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"struct Affine\n W\n b\nend\n\nAffine(in::Integer, out::Integer) =\n Affine(randn(out, in), zeros(out))\n\n# Overload call, so the object can be used as a function\n(m::Affine)(x) = m.W * x .+ m.b\n\na = Affine(10, 5)\n\na(rand(10)) # => 5-element vector","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Congratulations! You just built the Dense layer that comes with Flux. 
Flux has many interesting layers available, but they're all things you could have built yourself very easily.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"(There is one small difference with Dense – for convenience it also takes an activation function, like Dense(10 => 5, sigmoid).)","category":"page"},{"location":"guide/models/basics/#Stacking-It-Up","page":"Gradients and Layers","title":"Stacking It Up","text":"","category":"section"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"It's pretty common to write models that look something like:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"layer1 = Dense(10 => 5, relu)\n# ...\nmodel(x) = layer3(layer2(layer1(x)))","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"For long chains, it might be a bit more intuitive to have a list of layers, like this:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"using Flux\n\nlayers = [Dense(10 => 5, relu), Dense(5 => 2), softmax]\n\nmodel(x) = foldl((x, m) -> m(x), layers, init = x)\n\nmodel(rand(10)) # => 2-element vector","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Handily, this is also provided for in Flux:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"model2 = Chain(\n Dense(10 => 5, relu),\n Dense(5 => 2),\n softmax)\n\nmodel2(rand(10)) # => 2-element vector","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"This quickly starts to look like a high-level deep learning library; yet you can see how 
it falls out of simple abstractions, and we lose none of the power of Julia code.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"A nice property of this approach is that because \"models\" are just functions (possibly with trainable parameters), you can also see this as simple function composition.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"m = Dense(5 => 2) ∘ Dense(10 => 5, σ)\n\nm(rand(10))","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Likewise, Chain will happily work with any Julia function.","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"m = Chain(x -> x^2, x -> x+1)\n\nm(5) # => 26","category":"page"},{"location":"guide/models/basics/#Layer-Helpers","page":"Gradients and Layers","title":"Layer Helpers","text":"","category":"section"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"There is still one problem with this Affine layer, that Flux does not know to look inside it. This means that Flux.train! won't see its parameters, nor will gpu be able to move them to your GPU. These features are enabled by the @layer macro:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Flux.@layer Affine","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"Finally, most Flux layers make bias optional, and allow you to supply the function used for generating random weights. 
We can easily add these refinements to the Affine layer as follows, using the helper function create_bias:","category":"page"},{"location":"guide/models/basics/","page":"Gradients and Layers","title":"Gradients and Layers","text":"function Affine((in, out)::Pair; bias=true, init=glorot_uniform)\n W = init(out, in)\n b = Flux.create_bias(W, bias, out)\n return Affine(W, b)\nend\n\nAffine(3 => 1, bias=false) |> gpu","category":"page"},{"location":"guide/models/recurrence/#Recurrent-Models","page":"Recurrence","title":"Recurrent Models","text":"","category":"section"},{"location":"guide/models/recurrence/#Recurrent-cells","page":"Recurrence","title":"Recurrent cells","text":"","category":"section"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"To introduce Flux's recurrence functionalities, we will consider the following vanilla recurrent neural network structure:","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"(Image: )","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"In the above, we have a sequence of length 3, where x1 to x3 represent the input at each step (could be a timestamp or a word in a sentence), and y1 to y3 are their respective outputs.","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"An aspect to recognise is that in such a model, the recurrent cells A all refer to the same structure. 
What distinguishes it from a simple dense layer is that the cell A is fed, in addition to an input x, with information from the previous state of the model (hidden state denoted as h1 & h2 in the diagram).","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"In the most basic RNN case, cell A could be defined by the following: ","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"output_size = 5\ninput_size = 2\nWxh = randn(Float32, output_size, input_size)\nWhh = randn(Float32, output_size, output_size)\nb = randn(Float32, output_size)\n\nfunction rnn_cell(h, x)\n h = tanh.(Wxh * x .+ Whh * h .+ b)\n return h, h\nend\n\nx = rand(Float32, input_size) # dummy input data\nh = rand(Float32, output_size) # random initial hidden state\n\nh, y = rnn_cell(h, x)","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"Notice how the above is essentially a Dense layer that acts on two inputs, h and x.","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"If you run the last line a few times, you'll notice the output y changing slightly even though the input x is the same.","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"There are various recurrent cells available in Flux, notably RNNCell, LSTMCell and GRUCell, which are documented in the layer reference. 
The hand-written example above can be replaced with:","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"using Flux\n\nrnn = Flux.RNNCell(2, 5)\n\nx = rand(Float32, 2) # dummy data\nh = rand(Float32, 5) # initial hidden state\n\nh, y = rnn(h, x)","category":"page"},{"location":"guide/models/recurrence/#Stateful-Models","page":"Recurrence","title":"Stateful Models","text":"","category":"section"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"For the most part, we don't want to manage hidden states ourselves, but to treat our models as being stateful. Flux provides the Recur wrapper to do this.","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"x = rand(Float32, 2)\nh = rand(Float32, 5)\n\nm = Flux.Recur(rnn, h)\n\ny = m(x)","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"The Recur wrapper stores the state between runs in the m.state field.","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"If we use the RNN(2, 5) constructor, as opposed to RNNCell, we'll see that it's simply a wrapped cell.","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"julia> using Flux\n\njulia> RNN(2, 5) # or equivalently RNN(2 => 5)\nRecur(\n RNNCell(2 => 5, tanh), # 45 parameters\n) # Total: 4 trainable arrays, 45 parameters,\n # plus 1 non-trainable, 5 parameters, summarysize 412 bytes.","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"The stateful constructors LSTM and GRU, which are equivalent to RNN, are also available. 
","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"Using these tools, we can now build the model shown in the above diagram with: ","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"julia> m = Chain(RNN(2 => 5), Dense(5 => 1))\nChain(\n Recur(\n RNNCell(2 => 5, tanh), # 45 parameters\n ),\n Dense(5 => 1), # 6 parameters\n) # Total: 6 trainable arrays, 51 parameters,\n # plus 1 non-trainable, 5 parameters, summarysize 580 bytes. ","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"In this example, each output has only one component.","category":"page"},{"location":"guide/models/recurrence/#Working-with-sequences","page":"Recurrence","title":"Working with sequences","text":"","category":"section"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"Using the recurrent model m defined previously, we can now apply it to a single step from our sequence:","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"julia> x = rand(Float32, 2);\n\njulia> m(x)\n1-element Vector{Float32}:\n 0.45860028","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"The m(x) operation would be represented by x1 -> A -> y1 in our diagram. If we perform this operation a second time, it will be equivalent to x2 -> A -> y2 since the model m has stored the state resulting from the x1 step.","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"Now, instead of computing a single step at a time, we can get the full y1 to y3 sequence in a single pass by iterating the model on a sequence of data. 
","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"To do so, we'll need to structure the input data as a Vector of observations at each time step. This Vector will therefore be of length = seq_length and each of its elements will represent the input features for a given step. In our example, this translates into a Vector of length 3, where each element is a Matrix of size (features, batch_size), or just a Vector of length features if dealing with a single observation. ","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"julia> x = [rand(Float32, 2) for i = 1:3];\n\njulia> [m(xi) for xi in x]\n3-element Vector{Vector{Float32}}:\n [0.36080405]\n [-0.13914406]\n [0.9310162]","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"warning: Use of map and broadcast\nMapping and broadcasting operations with stateful layers are discouraged, since the Julia language doesn't guarantee a specific execution order. 
Therefore, avoid y = m.(x)\n# or \ny = map(m, x)\nand use explicit loops y = [m(x) for x in x]","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"If for some reason one wants to exclude the first step of the RNN chain for the computation of the loss, that can be handled with:","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"using Flux.Losses: mse\n\nfunction loss(x, y)\n m(x[1]) # ignores the output but updates the hidden states\n sum(mse(m(xi), yi) for (xi, yi) in zip(x[2:end], y))\nend\n\ny = [rand(Float32, 1) for i=1:2]\nloss(x, y)","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"In such a model, only the last two outputs are used to compute the loss, hence the target y being of length 2. 
This is a strategy that can be used to easily handle a seq-to-one kind of structure, compared to the seq-to-seq assumed so far. ","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"Alternatively, if one wants to perform some warmup of the sequence, it could be performed once, followed by regular training where all the steps of the sequence would be considered for the gradient update:","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"function loss(m, x, y)\n sum(mse(m(xi), yi) for (xi, yi) in zip(x, y))\nend\n\nseq_init = [rand(Float32, 2)]\nseq_1 = [rand(Float32, 2) for i = 1:3]\nseq_2 = [rand(Float32, 2) for i = 1:3]\n\ny1 = [rand(Float32, 1) for i = 1:3]\ny2 = [rand(Float32, 1) for i = 1:3]\n\nX = [seq_1, seq_2]\nY = [y1, y2]\ndata = zip(X,Y)\n\nFlux.reset!(m)\n[m(x) for x in seq_init]\n\nopt = Flux.setup(Adam(1e-3), m)\nFlux.train!(loss, m, data, opt)","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"In the previous example, the model's state is first reset with Flux.reset!. Then, a warmup is performed over a sequence of length 1 by feeding the model seq_init, resulting in a warmup state. The model can then be trained for 1 epoch, where 2 batches are provided (seq_1 and seq_2) and the outputs of all timesteps are considered for the loss.","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"In this scenario, it is important to note that a single continuous sequence is considered. 
Since the model state is not reset between the 2 batches, the state of the model flows through the batches, which only makes sense in the context where seq_1 is the continuation of seq_init and so on.","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"Batch size would be 1 here as there's only a single sequence within each batch. If the model were to be trained on multiple independent sequences, then these sequences could be added to the input data as a second dimension. For example, in a language model, each batch would contain multiple independent sentences. In such a scenario, if we set the batch size to 4, a single batch would be of the shape:","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"x = [rand(Float32, 2, 4) for i = 1:3]\ny = [rand(Float32, 1, 4) for i = 1:3]","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"That would mean that we have 4 sentences (or samples), each with 2 features (let's say a very small embedding!) and each with a length of 3 (3 words per sentence). Computing m(batch[1]) would still represent x1 -> y1 in our diagram and return the first word output, but now for each of the 4 independent sentences (second dimension of the input matrix). We do not need to use Flux.reset!(m) here; each sentence in the batch will output in its own \"column\", and the outputs of the different sentences won't mix. ","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"To illustrate, we go through an example of batching with our implementation of rnn_cell. 
The implementation doesn't need to change; the batching comes for \"free\" from the way Julia does broadcasting and the rules of matrix multiplication.","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"output_size = 5\ninput_size = 2\nWxh = randn(Float32, output_size, input_size)\nWhh = randn(Float32, output_size, output_size)\nb = randn(Float32, output_size)\n\nfunction rnn_cell(h, x)\n h = tanh.(Wxh * x .+ Whh * h .+ b)\n return h, h\nend","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"Here, we use the last dimension of the input and the hidden state as the batch dimension. I.e., h[:, n] would be the hidden state of the nth sentence in the batch.","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"batch_size = 4\nx = rand(Float32, input_size, batch_size) # dummy input data\nh = rand(Float32, output_size, batch_size) # random initial hidden state\n\nh, y = rnn_cell(h, x)","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"julia> size(h) == size(y) == (output_size, batch_size)\ntrue","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"In many situations, such as when dealing with a language model, the sentences in each batch are independent (i.e. the last item of the first sentence of the first batch is independent from the first item of the first sentence of the second batch), so we cannot handle the model as if each batch was the direct continuation of the previous one. 
To handle such situations, we need to reset the state of the model between each batch, which can be conveniently performed within the loss function:","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"function loss(x, y)\n Flux.reset!(m)\n sum(mse(m(xi), yi) for (xi, yi) in zip(x, y))\nend","category":"page"},{"location":"guide/models/recurrence/","page":"Recurrence","title":"Recurrence","text":"A potential source of ambiguity with RNN in Flux can come from the different data layout compared to some common frameworks where data is typically a 3 dimensional array: (features, seq length, samples). In Flux, those 3 dimensions are provided through a vector of seq length containing a matrix (features, samples).","category":"page"},{"location":"reference/models/nnlib/#Neural-Network-primitives-from-NNlib.jl","page":"Low-level Operations – NNlib.jl","title":"Neural Network primitives from NNlib.jl","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"Flux re-exports all of the functions exported by the NNlib package. This includes activation functions, described on their own page. 
Many of the functions on this page exist primarily as the internal implementation of Flux layers, but can also be used independently.","category":"page"},{"location":"reference/models/nnlib/#Attention","page":"Low-level Operations – NNlib.jl","title":"Attention","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"Primitives for the MultiHeadAttention layer.","category":"page"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"NNlib.dot_product_attention\nNNlib.dot_product_attention_scores\nNNlib.make_causal_mask","category":"page"},{"location":"reference/models/nnlib/#NNlib.dot_product_attention","page":"Low-level Operations – NNlib.jl","title":"NNlib.dot_product_attention","text":"dot_product_attention(query, key, value, [bias]; [fdrop, mask, nheads])\n\nMultihead dot product attention used in transformer architectures.\n\nThe input arrays must have the first two dimensions given by the number of features and the sequence length, then an arbitrary number of batch dimensions or none.\n\nReturns the attention output array of size (v_dim, q_len, batch_size...) and the attention scores of size (kv_len, q_len, nheads, batch_size...).\n\nSee also dot_product_attention_scores if you only need the attention scores.\n\nArguments\n\nquery: Query array of size (qk_dim, q_len, batch_size...).\nkey: Key array of size (qk_dim, kv_len, batch_size...).\nvalue: Value array of size (v_dim, kv_len, batch_size...).\nbias: Either nothing or an array broadcastable to size (kv_len, q_len, nheads, batch_size). It will be added to the attention scores before applying the softmax. Default nothing.\nfdrop: A dropout function or layer to be applied on the attention scores right after the softmax. 
Default identity (no dropout).\nmask: Either nothing or a boolean array broadcastable to size (kv_len, q_len, nheads, batch_size). The mask is applied to the attention scores just before the softmax. See make_causal_mask for creating causal masks. Default nothing.\nnheads: Number of heads to split the input arrays into. Default 1.\n\nExamples\n\nq, k, v = rand(10, 20, 2), rand(10, 30, 2), rand(20, 30, 2)\ny, α = dot_product_attention(q, k, v)\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.dot_product_attention_scores","page":"Low-level Operations – NNlib.jl","title":"NNlib.dot_product_attention_scores","text":"dot_product_attention_scores(query, key, [bias]; [fdrop, mask])\n\nReturn the attention scores for the dot_product_attention. Input arrays must have dimensions (num_features ÷ nheads, nheads, sequence_length, batch_size).\n\nSee dot_product_attention for more details.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.make_causal_mask","page":"Low-level Operations – NNlib.jl","title":"NNlib.make_causal_mask","text":"make_causal_mask(x, dims=2)\n\nReturn a boolean square matrix m of the same type as x and of side size(x, dims). 
Its elements are set such that m[i, j] == i ≤ j.\n\nCan be used to mask the attention scores in dot_product_attention.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#Softmax","page":"Low-level Operations – NNlib.jl","title":"Softmax","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"Flux's Flux.logitcrossentropy uses NNlib.logsoftmax internally.","category":"page"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"softmax\nlogsoftmax","category":"page"},{"location":"reference/models/nnlib/#NNlib.softmax","page":"Low-level Operations – NNlib.jl","title":"NNlib.softmax","text":"softmax(x; dims = 1)\n\nSoftmax turns input array x into probability distributions that sum to 1 along the dimensions specified by dims. It is semantically equivalent to the following:\n\nsoftmax(x; dims = 1) = exp.(x) ./ sum(exp.(x), dims = dims)\n\nwith additional manipulations enhancing numerical stability.\n\nFor a matrix input x it will by default (dims = 1) treat it as a batch of vectors, with each column independent. Keyword dims = 2 will instead treat rows independently, and so on.\n\nSee also logsoftmax.\n\nExamples\n\njulia> softmax([1, 2, 3])\n3-element Vector{Float64}:\n 0.09003057317038046\n 0.24472847105479764\n 0.6652409557748218\n\njulia> softmax([1 2 3; 2 2 2]) # dims=1\n2×3 Matrix{Float64}:\n 0.268941 0.5 0.731059\n 0.731059 0.5 0.268941\n\njulia> softmax([1 2 3; 2 2 2]; dims=2)\n2×3 Matrix{Float64}:\n 0.0900306 0.244728 0.665241\n 0.333333 0.333333 0.333333\n\nNote that, when used with Flux.jl, softmax must not be passed to layers like Dense which accept an activation function. The activation is broadcasted over the result, thus applies to individual numbers. 
But softmax always needs to see the whole column.\n\njulia> using Flux\n\njulia> x = randn(Float32, 4, 4, 3, 13);\n\njulia> model = Chain(Conv((4, 4), 3 => 8, tanh), Flux.flatten, Dense(8 => 7), softmax);\n\njulia> model(x) |> size\n(7, 13)\n\njulia> Dense(4 => 7, softmax)(x)\nERROR: `softmax(x)` called with a number, but it expects an array. \n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.logsoftmax","page":"Low-level Operations – NNlib.jl","title":"NNlib.logsoftmax","text":"logsoftmax(x; dims = 1)\n\nComputes the log of softmax in a more numerically stable way than directly taking log.(softmax(xs)). Commonly used in computing cross entropy loss.\n\nIt is semantically equivalent to the following:\n\nlogsoftmax(x; dims = 1) = x .- log.(sum(exp.(x), dims = dims))\n\nSee also softmax.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#Pooling","page":"Low-level Operations – NNlib.jl","title":"Pooling","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"Flux's AdaptiveMaxPool, AdaptiveMeanPool, GlobalMaxPool, GlobalMeanPool, MaxPool, and MeanPool use NNlib.PoolDims, NNlib.maxpool, and NNlib.meanpool as their backend.","category":"page"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"NNlib.PoolDims\nNNlib.lpnormpool\nNNlib.maxpool\nNNlib.meanpool","category":"page"},{"location":"reference/models/nnlib/#NNlib.PoolDims","page":"Low-level Operations – NNlib.jl","title":"NNlib.PoolDims","text":"PoolDims(x_size::NTuple{M}, k::Union{NTuple{L, Int}, Int};\n stride=k, padding=0, dilation=1) where {M, L}\n\nDimensions for a \"pooling\" operation that can have an arbitrary input size, kernel size, stride, dilation, and channel count. 
Used to dispatch onto efficient implementations at compile-time.\n\n\n\n\n\n","category":"type"},{"location":"reference/models/nnlib/#NNlib.lpnormpool","page":"Low-level Operations – NNlib.jl","title":"NNlib.lpnormpool","text":"lpnormpool(x, p::Real, k::NTuple{N, Integer}; pad=0, stride=k)\n\nPerform Lp pool operation with value of the Lp norm p and window size k on input tensor x, also known as LPPool in PyTorch. This pooling operator is from Learned-Norm Pooling for Deep Feedforward and Recurrent Neural Networks.\n\nArguments:\n\nx and k: Expects ndim(x) ∈ 3:5, and always length(k) == ndim(x) - 2\np is restricted to 0 < p < Inf.\npad: See pad_zeros for details.\nstride: Either a tuple with the same length as k, or one integer for all directions. Default is k.\n\nFor all elements x in a size k window, lpnormpool computes (∑ᵢ xᵢ^p)^(1 / p) as an element of the output.\n\nThus lpnormpool(x, 1, k) ./ prod(k) ≈ meanpool(x, k) and lpnormpool(x, 2, k).^2 ./ prod(k) ≈ meanpool(x.^2, k).\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.maxpool","page":"Low-level Operations – NNlib.jl","title":"NNlib.maxpool","text":"maxpool(x, k::NTuple{N, Integer}; pad=0, stride=k)\n\nPerform max pool operation with window size k on input tensor x.\n\nArguments:\n\nx and k: Expects ndim(x) ∈ 3:5, and always length(k) == ndim(x) - 2\npad: See pad_zeros for details.\nstride: Either a tuple with the same length as k, or one integer for all directions. 
Default is k.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.meanpool","page":"Low-level Operations – NNlib.jl","title":"NNlib.meanpool","text":"meanpool(x, k::NTuple{N, Integer}; pad=0, stride=k)\n\nPerform mean pool operation with window size k on input tensor x.\n\nArguments:\n\nx and k: Expects ndim(x) ∈ 3:5, and always length(k) == ndim(x) - 2\npad: See pad_zeros for details.\nstride: Either a tuple with the same length as k, or one integer for all directions. 
Default is k.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#Padding","page":"Low-level Operations – NNlib.jl","title":"Padding","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"NNlib.pad_circular\nNNlib.pad_constant\nNNlib.pad_reflect\nNNlib.pad_repeat\nNNlib.pad_symmetric\nNNlib.pad_zeros","category":"page"},{"location":"reference/models/nnlib/#NNlib.pad_circular","page":"Low-level Operations – NNlib.jl","title":"NNlib.pad_circular","text":"pad_circular(x, pad::Tuple; [dims])\npad_circular(x, pad::Int; [dims])\n\nPad the array x \"circularly\" across the border by wrapping around values from the opposite side of x. \n\npad can be a tuple of integers (l1, r1, ..., ln, rn) of some length 2n that specifies the left and right padding size for each of the dimensions in dims. If dims is not given, it defaults to the first n dimensions.\n\nIf pad is an integer, it is applied on both sides on every dimension in dims. In this case, dims defaults to the first ndims(x)-2 dimensions (i.e. excludes the channel and batch dimension). \n\nThe pad length on either side in any dimension must not exceed the size of x in that dimension, i.e. 
pad_circular is not able to create arbitrary sized tilings of x.\n\nSee also pad_repeat, pad_reflect, pad_symmetric, and pad_constant.\n\njulia> r = reshape(1:9, 3, 3)\n3×3 reshape(::UnitRange{Int64}, 3, 3) with eltype Int64:\n 1 4 7\n 2 5 8\n 3 6 9\n\njulia> pad_circular(r, (1,2,1,2))\n6×6 Matrix{Int64}:\n 9 3 6 9 3 6\n 7 1 4 7 1 4\n 8 2 5 8 2 5\n 9 3 6 9 3 6\n 7 1 4 7 1 4\n 8 2 5 8 2 5\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.pad_constant","page":"Low-level Operations – NNlib.jl","title":"NNlib.pad_constant","text":"pad_constant(x, pad::Tuple, val = 0; [dims = :])\npad_constant(x, pad::Int, val = 0; [dims = :])\n\nPad the array x with the constant value val.\n\npad can be a tuple of integers. If it has length 2 * length(dims), it specifies the left and right padding size for each of the dimensions in dims as (l1, r1, ..., ln, rn). If supplied with a tuple of length length(dims) instead, it applies symmetric padding. 
If dims is not given, it defaults to all dimensions.\n\nFor integer pad input, it is applied on both sides on every dimension in dims.\n\nSee also pad_zeros, pad_repeat, pad_reflect, pad_symmetric, and pad_circular.\n\njulia> r = reshape(1:4, 2, 2)\n2×2 reshape(::UnitRange{Int64}, 2, 2) with eltype Int64:\n 1 3\n 2 4\n\njulia> pad_constant(r, (1, 2, 3, 4), 8)\n5×9 Matrix{Int64}:\n 8 8 8 8 8 8 8 8 8\n 8 8 8 1 3 8 8 8 8\n 8 8 8 2 4 8 8 8 8\n 8 8 8 8 8 8 8 8 8\n 8 8 8 8 8 8 8 8 8\n\njulia> pad_constant(r, 1, 8)\n4×4 Matrix{Int64}:\n 8 8 8 8\n 8 1 3 8\n 8 2 4 8\n 8 8 8 8\n\njulia> r = reshape(1:27, 3, 3, 3)\n3×3×3 reshape(::UnitRange{Int64}, 3, 3, 3) with eltype Int64:\n[:, :, 1] =\n 1 4 7\n 2 5 8\n 3 6 9\n\n[:, :, 2] =\n 10 13 16\n 11 14 17\n 12 15 18\n\n[:, :, 3] =\n 19 22 25\n 20 23 26\n 21 24 27\n\njulia> pad_constant(r, (2,1), dims = 1) # asymmetric padding\n6×3×3 Array{Int64, 3}:\n[:, :, 1] =\n 0 0 0\n 0 0 0\n 1 4 7\n 2 5 8\n 3 6 9\n 0 0 0\n\n[:, :, 2] =\n 0 0 0\n 0 0 0\n 10 13 16\n 11 14 17\n 12 15 18\n 
0 0 0\n\n[:, :, 3] =\n 0 0 0\n 0 0 0\n 19 22 25\n 20 23 26\n 21 24 27\n 0 0 0\n\njulia> pad_constant(r, (2,1, 3), dims = (1,2)) # padding must always be either the same length as dims, or double it\nERROR: ArgumentError: Could not parse padding (2, 1, 3) and dims (1, 2)\nStacktrace:\n[...]\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.pad_reflect","page":"Low-level Operations – NNlib.jl","title":"NNlib.pad_reflect","text":"pad_reflect(x, pad::Tuple; [dims])\npad_reflect(x, pad::Int; [dims])\n\nPad the array x reflecting its values across the border.\n\npad can be a tuple of integers (l1, r1, ..., ln, rn) of some length 2n that specifies the left and right padding size for each of the dimensions in dims. If dims is not given, it defaults to the first n dimensions.\n\nIf pad is an integer, it is applied on both sides on every dimension in dims. In this case, dims defaults to the first ndims(x)-2 dimensions (i.e. excludes the channel and batch dimension). \n\nSee also pad_repeat, pad_symmetric, pad_circular, and pad_constant.\n\njulia> r = reshape(1:9, 3, 3)\n3×3 reshape(::UnitRange{Int64}, 3, 3) with eltype Int64:\n 1 4 7\n 2 5 8\n 3 6 9\n\njulia> pad_reflect(r, (1,2,1,2))\n6×6 Matrix{Int64}:\n 5 2 5 8 5 2\n 4 1 4 7 4 1\n 5 2 5 8 5 2\n 6 3 6 9 6 3\n 5 2 5 8 5 2\n 4 1 4 7 4 1\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.pad_repeat","page":"Low-level Operations – NNlib.jl","title":"NNlib.pad_repeat","text":"pad_repeat(x, pad::Tuple; [dims])\npad_repeat(x, pad::Int; [dims])\n\nPad the array x repeating the values on the border.\n\npad can be a tuple of integers (l1, r1, ..., ln, rn) of some length 2n that specifies the left and right padding size for each of the dimensions in dims. If dims is not given, it defaults to the first n dimensions.\n\nIf pad is an integer, it is applied on both sides on every dimension in dims. In this case, dims defaults to the first ndims(x)-2 dimensions (i.e. 
excludes the channel and batch dimension). \n\nSee also pad_reflect, pad_symmetric, pad_circular, and pad_constant.\n\njulia> r = reshape(1:9, 3, 3)\n3×3 reshape(::UnitRange{Int64}, 3, 3) with eltype Int64:\n 1 4 7\n 2 5 8\n 3 6 9\n\njulia> pad_repeat(r, (1,2,3,4))\n6×10 Matrix{Int64}:\n 1 1 1 1 4 7 7 7 7 7\n 1 1 1 1 4 7 7 7 7 7\n 2 2 2 2 5 8 8 8 8 8\n 3 3 3 3 6 9 9 9 9 9\n 3 3 3 3 6 9 9 9 9 9\n 3 3 3 3 6 9 9 9 9 9\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.pad_symmetric","page":"Low-level Operations – NNlib.jl","title":"NNlib.pad_symmetric","text":"pad_symmetric(x, pad::Tuple; [dims])\npad_symmetric(x, pad::Int; [dims])\n\nPad the array x reflecting its values symmetrically across the border, i.e. the border values of x are present in the padding values, in contrast to pad_reflect.\n\npad can be a tuple of integers (l1, r1, ..., ln, rn) of some length 2n that specifies the left and right padding size for each of the dimensions in dims. If dims is not given, it defaults to the first n dimensions.\n\nIf pad is an integer, it is applied on both sides on every dimension in dims. In this case, dims defaults to the first ndims(x)-2 dimensions (i.e. excludes the channel and batch dimension). \n\nSee also pad_repeat, pad_reflect, pad_circular, and pad_constant.\n\njulia> r = reshape(1:9, 3, 3)\n3×3 reshape(::UnitRange{Int64}, 3, 3) with eltype Int64:\n 1 4 7\n 2 5 8\n 3 6 9\n\njulia> pad_symmetric(r, (1,2,1,2))\n6×6 Matrix{Int64}:\n 1 1 4 7 7 4\n 1 1 4 7 7 4\n 2 2 5 8 8 5\n 3 3 6 9 9 6\n 3 3 6 9 9 6\n 2 2 5 8 8 5\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.pad_zeros","page":"Low-level Operations – NNlib.jl","title":"NNlib.pad_zeros","text":"pad_zeros(x, pad::Tuple; [dims])\npad_zeros(x, pad::Int; [dims])\n\nPad the array x with zeros. Equivalent to pad_constant with the constant equal to 0. 
\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#Convolution","page":"Low-level Operations – NNlib.jl","title":"Convolution","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"Flux's Conv and CrossCor layers use NNlib.DenseConvDims and NNlib.conv internally. ","category":"page"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"conv\nConvDims\ndepthwiseconv\nDepthwiseConvDims\nDenseConvDims","category":"page"},{"location":"reference/models/nnlib/#NNlib.conv","page":"Low-level Operations – NNlib.jl","title":"NNlib.conv","text":"conv(x, w; stride = 1, pad = 0, dilation = 1, flipped = false, groups = 1)\n\nApply convolution filter w to input x. x and w are 3d/4d/5d tensors in 1d/2d/3d convolutions respectively. x and w may have real or complex element types.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.ConvDims","page":"Low-level Operations – NNlib.jl","title":"NNlib.ConvDims","text":"ConvDims\n\nType system-level information about convolution dimensions. Critical for things like im2col!() to generate efficient code, and helpful to reduce the number of kwargs getting passed around.\n\n\n\n\n\n","category":"type"},{"location":"reference/models/nnlib/#NNlib.depthwiseconv","page":"Low-level Operations – NNlib.jl","title":"NNlib.depthwiseconv","text":"depthwiseconv(x, w; stride=1, pad=0, dilation=1, flipped=false)\n\nDepthwise convolution operation with filter w on input x. x and w are 3d/4d/5d tensors in 1d/2d/3d convolutions respectively.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.DepthwiseConvDims","page":"Low-level Operations – NNlib.jl","title":"NNlib.DepthwiseConvDims","text":"DepthwiseConvDims\n\nConcrete subclass of ConvDims for a depthwise convolution. 
Differs primarily due to characterization by Cin, Cmult, rather than Cin, Cout. Useful to be separate from DenseConvDims primarily for channel calculation differences.\n\n\n\n\n\n","category":"type"},{"location":"reference/models/nnlib/#NNlib.DenseConvDims","page":"Low-level Operations – NNlib.jl","title":"NNlib.DenseConvDims","text":"DenseConvDims\n\nConcrete subclass of ConvDims for a normal, dense, conv2d/conv3d.\n\n\n\n\n\n","category":"type"},{"location":"reference/models/nnlib/#Dropout","page":"Low-level Operations – NNlib.jl","title":"Dropout","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"NNlib.dropout\nNNlib.dropout!","category":"page"},{"location":"reference/models/nnlib/#NNlib.dropout","page":"Low-level Operations – NNlib.jl","title":"NNlib.dropout","text":"dropout([rng], A, p; [dims])\n\nReturns an array in which each element of A is either replaced with zero, with probability p, or else multiplied by 1/(1-p).\n\nBy default every element is treated independently. With keyword dims=1, a choice is made for every value of the 1st index i.e. 
each row of a matrix is either zero or not.\n\nOptional first argument is the random number generator used.\n\nExamples\n\njulia> dropout(ones(2, 10), 0.2)\n2×10 Matrix{Float64}:\n 1.25 1.25 0.0 1.25 1.25 1.25 1.25 1.25 1.25 1.25\n 1.25 1.25 1.25 0.0 1.25 1.25 0.0 1.25 1.25 1.25\n\njulia> mean(dropout(ones(10^4, 5), 0.2), dims=1)\n1×5 Matrix{Float64}:\n 0.998 1.00075 0.99125 0.99575 1.00075\n\njulia> dropout(ones(5, 5), 0.7, dims=1) # whole row the same\n5×5 Matrix{Float64}:\n 3.33333 3.33333 3.33333 3.33333 3.33333\n 0.0 0.0 0.0 0.0 0.0\n 0.0 0.0 0.0 0.0 0.0\n 3.33333 3.33333 3.33333 3.33333 3.33333\n 0.0 0.0 0.0 0.0 0.0\n\njulia> mean(dropout(ones(10^4, 5), 0.3, dims=1), dims=1)\n1×5 Matrix{Float64}:\n 1.00571 1.00571 1.00571 1.00571 1.00571\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.dropout!","page":"Low-level Operations – NNlib.jl","title":"NNlib.dropout!","text":"dropout!(B, A, p; [dims])\n\nThis does exactly B .= dropout(A, p; dims), or rather, it's the implementation of out-of-place dropout.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#Upsampling","page":"Low-level Operations – NNlib.jl","title":"Upsampling","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"Flux's Upsample layer uses NNlib.upsample_nearest, NNlib.upsample_bilinear, and NNlib.upsample_trilinear as its backend. 
Additionally, Flux's PixelShuffle layer uses NNlib.pixel_shuffle as its backend.","category":"page"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"upsample_nearest\nupsample_linear\n∇upsample_linear\nupsample_bilinear\n∇upsample_bilinear\nupsample_trilinear\n∇upsample_trilinear\npixel_shuffle","category":"page"},{"location":"reference/models/nnlib/#NNlib.upsample_nearest","page":"Low-level Operations – NNlib.jl","title":"NNlib.upsample_nearest","text":"upsample_nearest(x, scale::NTuple{S,Int})\nupsample_nearest(x; size::NTuple{S,Int})\n\nUpsamples the array x by integer multiples along the first S dimensions. Subsequent dimensions of x are not altered.\n\nEither the scale factors or the final output size can be specified.\n\nSee also upsample_bilinear, for two dimensions of an N=4 array.\n\nExample\n\njulia> upsample_nearest([1 2 3; 4 5 6], (2, 3))\n4×9 Matrix{Int64}:\n 1 1 1 2 2 2 3 3 3\n 1 1 1 2 2 2 3 3 3\n 4 4 4 5 5 5 6 6 6\n 4 4 4 5 5 5 6 6 6\n\njulia> ans == upsample_nearest([1 2 3; 4 5 6]; size=(4, 9)) # equivalent\ntrue\n\njulia> upsample_nearest([1 2 3; 4 5 6], (2,))\n4×3 Matrix{Int64}:\n 1 2 3\n 1 2 3\n 4 5 6\n 4 5 6\n\njulia> ans == upsample_nearest([1 2 3; 4 5 6], size=(4,))\ntrue\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.upsample_linear","page":"Low-level Operations – NNlib.jl","title":"NNlib.upsample_linear","text":"upsample_linear(x::AbstractArray{T,3}, scale::Real; align_corners::Bool = true)\nupsample_linear(x::AbstractArray{T,3}; size::Integer, align_corners::Bool = true)\n\nUpsamples the first dimension of the array x by the provided scale factor, using linear interpolation. 
As an alternative to using scale, the resulting array size can be directly specified with a keyword argument.\n\nThe size of the output is equal to (scale*S1, S2, S3), where S1, S2, S3 = size(x).\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.∇upsample_linear","page":"Low-level Operations – NNlib.jl","title":"NNlib.∇upsample_linear","text":"∇upsample_linear(Δ::AbstractArray{T,3}; size::Integer, align_corners::Bool = true) where T\n\nArguments\n\nΔ: Incoming gradient array, backpropagated from downstream layers\nsize: Size of the image upsampled in the first place\n\nOutputs\n\ndx: Downsampled version of Δ\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.upsample_bilinear","page":"Low-level Operations – NNlib.jl","title":"NNlib.upsample_bilinear","text":"upsample_bilinear(x::AbstractArray{T,4}, scale::NTuple{2,Real}; align_corners::Bool = true)\nupsample_bilinear(x::AbstractArray{T,4}; size::NTuple{2,Integer}, align_corners::Bool = true)\n\nUpsamples the first 2 dimensions of the array x by the upsample factors stored in scale, using bilinear interpolation. 
As an alternative to using scale, the resulting image size can be directly specified with a keyword argument.\n\nThe size of the output is equal to (scale[1]*S1, scale[2]*S2, S3, S4), where S1, S2, S3, S4 = size(x).\n\nExamples\n\njulia> x = reshape(Float32[1 2 3; 4 5 6], (2,3,1,1))\n2×3×1×1 Array{Float32, 4}:\n[:, :, 1, 1] =\n 1.0 2.0 3.0\n 4.0 5.0 6.0\n\njulia> upsample_bilinear(x, (2, 3))\n4×9×1×1 Array{Float32, 4}:\n[:, :, 1, 1] =\n 1.0 1.25 1.5 1.75 2.0 2.25 2.5 2.75 3.0\n 2.0 2.25 2.5 2.75 3.0 3.25 3.5 3.75 4.0\n 3.0 3.25 3.5 3.75 4.0 4.25 4.5 4.75 5.0\n 4.0 4.25 4.5 4.75 5.0 5.25 5.5 5.75 6.0\n\njulia> ans == upsample_bilinear(x; size=(4, 9)) # specify output size instead\ntrue\n\njulia> upsample_bilinear(x, (2.5, 3.5)) # non-integer scaling factors are allowed\n5×10×1×1 Array{Float32, 4}:\n[:, :, 1, 1] =\n 1.0 1.22222 1.44444 1.66667 1.88889 … 2.33333 2.55556 2.77778 3.0\n 1.75 1.97222 2.19444 2.41667 2.63889 3.08333 3.30556 3.52778 3.75\n 2.5 2.72222 2.94444 3.16667 3.38889 3.83333 4.05556 4.27778 4.5\n 3.25 3.47222 3.69444 3.91667 4.13889 4.58333 4.80556 5.02778 5.25\n 4.0 4.22222 4.44444 4.66667 4.88889 5.33333 5.55556 5.77778 6.0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.∇upsample_bilinear","page":"Low-level Operations – NNlib.jl","title":"NNlib.∇upsample_bilinear","text":"∇upsample_bilinear(Δ::AbstractArray{T,4}; size::NTuple{2,Integer}, align_corners::Bool = true) where T\n\nArguments\n\nΔ: Incoming gradient array, backpropagated from downstream layers\nsize: Lateral (W,H) size of the image upsampled in the first place\n\nOutputs\n\ndx: Downsampled version of Δ\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.upsample_trilinear","page":"Low-level Operations – NNlib.jl","title":"NNlib.upsample_trilinear","text":"upsample_trilinear(x::AbstractArray{T,5}, scale::NTuple{3,Real}; align_corners::Bool = true)\nupsample_trilinear(x::AbstractArray{T,5}; size::NTuple{3,Integer}, 
align_corners::Bool = true)\n\nUpsamples the first 3 dimensions of the array x by the upsample factors stored in scale, using trilinear interpolation. As an alternative to using scale, the resulting image size can be directly specified with a keyword argument.\n\nThe size of the output is equal to (scale[1]*S1, scale[2]*S2, scale[3]*S3, S4, S5), where S1, S2, S3, S4, S5 = size(x).\n\nExamples\n\nupsample_trilinear(x, (2, 3, 4))\nupsample_trilinear(x; size=(4, 9, 11)) # specify output size instead\nupsample_trilinear(x, (2.5, 3.5, pi)) # non-integer scaling factors are allowed\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.∇upsample_trilinear","page":"Low-level Operations – NNlib.jl","title":"NNlib.∇upsample_trilinear","text":"∇upsample_trilinear(Δ::AbstractArray{T,5}; size::NTuple{3,Integer}, align_corners::Bool = true) where T\n\nArguments\n\nΔ: Incoming gradient array, backpropagated from downstream layers\nsize: Lateral size & depth (W,H,D) of the image upsampled in the first place\n\nOutputs\n\ndx: Downsampled version of Δ\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.pixel_shuffle","page":"Low-level Operations – NNlib.jl","title":"NNlib.pixel_shuffle","text":"pixel_shuffle(x, r::Integer)\n\nPixel shuffling operation, upscaling by a factor r.\n\nFor 4-arrays representing N images, the operation converts input size(x) == (W, H, r^2*C, N) to output of size (r*W, r*H, C, N). For D-dimensional data, it expects ndims(x) == D+2 with channel and batch dimensions, and divides the number of channels by r^D.\n\nUsed in super-resolution networks to upsample towards high resolution features. Reference: Shi et 
al., \"Real-Time Single Image and Video Super-Resolution ...\", CVPR 2016, https://arxiv.org/abs/1609.05158\n\nExamples\n\njulia> x = [10i + j + channel/10 for i in 1:2, j in 1:3, channel in 1:4, batch in 1:1]\n2×3×4×1 Array{Float64, 4}:\n[:, :, 1, 1] =\n 11.1 12.1 13.1\n 21.1 22.1 23.1\n\n[:, :, 2, 1] =\n 11.2 12.2 13.2\n 21.2 22.2 23.2\n\n[:, :, 3, 1] =\n 11.3 12.3 13.3\n 21.3 22.3 23.3\n\n[:, :, 4, 1] =\n 11.4 12.4 13.4\n 21.4 22.4 23.4\n\njulia> pixel_shuffle(x, 2) # 4 channels used up as 2x upscaling of image dimensions\n4×6×1×1 Array{Float64, 4}:\n[:, :, 1, 1] =\n 11.1 11.3 12.1 12.3 13.1 13.3\n 11.2 11.4 12.2 12.4 13.2 13.4\n 21.1 21.3 22.1 22.3 23.1 23.3\n 21.2 21.4 22.2 22.4 23.2 23.4\n\njulia> y = [i + channel/10 for i in 1:3, channel in 1:6, batch in 1:1]\n3×6×1 Array{Float64, 3}:\n[:, :, 1] =\n 1.1 1.2 1.3 1.4 1.5 1.6\n 2.1 2.2 2.3 2.4 2.5 2.6\n 3.1 3.2 3.3 3.4 3.5 3.6\n\njulia> pixel_shuffle(y, 2) # 1D image, with 6 channels reduced to 3\n6×3×1 Array{Float64, 3}:\n[:, :, 1] =\n 1.1 1.3 1.5\n 1.2 1.4 1.6\n 2.1 2.3 2.5\n 2.2 2.4 2.6\n 3.1 3.3 3.5\n 3.2 3.4 3.6\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#Batched-Operations","page":"Low-level Operations – NNlib.jl","title":"Batched Operations","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"Flux's Flux.Bilinear layer uses NNlib.batched_mul internally.","category":"page"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"batched_mul\nbatched_mul!\nbatched_adjoint\nbatched_transpose\nbatched_vec","category":"page"},{"location":"reference/models/nnlib/#NNlib.batched_mul","page":"Low-level Operations – NNlib.jl","title":"NNlib.batched_mul","text":"batched_mul(A, B) -> C\nA ⊠ B # \\boxtimes\n\nBatched matrix multiplication. Result has C[:,:,k...] == A[:,:,k...] * B[:,:,k...] where k... 
represent any indices in the last dimensions.\n\nIf ndims(A) == ndims(B) == 3 and size(B,3) == 1 then instead C[:,:,k] == A[:,:,k] * B[:,:,1], and similarly for A.\n\nTo transpose each matrix, apply batched_transpose to the array, or batched_adjoint for conjugate-transpose:\n\njulia> A, B = randn(2,5,17), randn(5,9,17);\n\njulia> A ⊠ B |> size\n(2, 9, 17)\n\njulia> batched_adjoint(A) |> size\n(5, 2, 17)\n\njulia> batched_mul(A, batched_adjoint(randn(9,5,17))) |> size\n(2, 9, 17)\n\njulia> A ⊠ randn(5,9,1) |> size\n(2, 9, 17)\n\njulia> batched_transpose(A) == PermutedDimsArray(A, (2,1,3))\ntrue\n\nThe equivalent PermutedDimsArray may be used in place of batched_transpose. Other permutations are also handled by BLAS, provided that the batch index k is not the first dimension of the underlying array. Thus PermutedDimsArray(::Array, (1,3,2)) and PermutedDimsArray(::Array, (3,1,2)) are fine.\n\nHowever, A = PermutedDimsArray(::Array, (3,2,1)) is not acceptable to BLAS, since the batch dimension is the contiguous one: stride(A,3) == 1. This will be copied, as doing so is faster than batched_mul_generic!.\n\nBoth this copy and batched_mul_generic! produce @debug messages, and setting for instance ENV[\"JULIA_DEBUG\"] = NNlib will display them.\n\n\n\n\n\nbatched_mul(A::Array{T,3}, B::Matrix)\nbatched_mul(A::Matrix, B::Array{T,3})\nA ⊠ B\n\nThis is always matrix-matrix multiplication, but either A or B may lack a batch index.\n\nWhen B is a matrix, result has C[:,:,k] == A[:,:,k] * B[:,:] for all k.\nWhen A is a matrix, then C[:,:,k] == A[:,:] * B[:,:,k]. 
This can also be done by reshaping and calling *, for instance A ⊡ B using TensorCore.jl, but is implemented here using batched_gemm instead of gemm.\n\njulia> randn(16,8,32) ⊠ randn(8,4) |> size\n(16, 4, 32)\n\njulia> randn(16,8,32) ⊠ randn(8,4,1) |> size # equivalent\n(16, 4, 32)\n\njulia> randn(16,8) ⊠ randn(8,4,32) |> size\n(16, 4, 32)\n\nSee also batched_vec to regard B as a batch of vectors, A[:,:,k] * B[:,k].\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.batched_mul!","page":"Low-level Operations – NNlib.jl","title":"NNlib.batched_mul!","text":"batched_mul!(C, A, B) -> C\nbatched_mul!(C, A, B, α=1, β=0)\n\nIn-place batched matrix multiplication, equivalent to mul!(C[:,:,k], A[:,:,k], B[:,:,k], α, β) for all k. If size(B,3) == 1 then every batch uses B[:,:,1] instead.\n\nThis will call batched_gemm! whenever possible. For real arrays this means that, for X ∈ [A,B,C], either stride(X,1)==1 or stride(X,2)==1, the latter may be caused by batched_transpose or by for instance PermutedDimsArray(::Array, (3,1,2)). Unlike batched_mul this will never make a copy.\n\nFor complex arrays, the wrapper made by batched_adjoint must be outermost to be seen. 
In this case the strides accepted by BLAS are more restricted: if stride(C,1)==1 then only stride(AorB::BatchedAdjoint,2) == 1 is accepted.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.batched_adjoint","page":"Low-level Operations – NNlib.jl","title":"NNlib.batched_adjoint","text":"batched_transpose(A::AbstractArray{T,3})\nbatched_adjoint(A)\n\nEquivalent to applying transpose or adjoint to each matrix A[:,:,k].\n\nThese exist to control how batched_mul behaves, as it operates on such matrix slices of an array with ndims(A)==3.\n\nPermutedDimsArray(A, (2,1,3)) is equivalent to batched_transpose(A), and is also understood by batched_mul (and more widely supported elsewhere).\n\nBatchedTranspose{T, S} <: AbstractBatchedMatrix{T, 3}\nBatchedAdjoint{T, S}\n\nLazy wrappers analogous to Transpose and Adjoint, returned by batched_transpose etc.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.batched_transpose","page":"Low-level Operations – NNlib.jl","title":"NNlib.batched_transpose","text":"batched_transpose(A::AbstractArray{T,3})\nbatched_adjoint(A)\n\nEquivalent to applying transpose or adjoint to each matrix A[:,:,k].\n\nThese exist to control how batched_mul behaves, as it operates on such matrix slices of an array with ndims(A)==3.\n\nPermutedDimsArray(A, (2,1,3)) is equivalent to batched_transpose(A), and is also understood by batched_mul (and more widely supported elsewhere).\n\nBatchedTranspose{T, S} <: AbstractBatchedMatrix{T, 3}\nBatchedAdjoint{T, S}\n\nLazy wrappers analogous to Transpose and Adjoint, returned by batched_transpose etc.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.batched_vec","page":"Low-level Operations – NNlib.jl","title":"NNlib.batched_vec","text":"batched_vec(A::Array{T,3}, B::Matrix)\nbatched_vec(A::Array{T,3}, b::Vector)\n\nBatched matrix-vector multiplication: the result has C[:,k] == A[:,:,k] * B[:,k] for all k, or else C[:,k] == 
A[:,:,k] * b for b::Vector.\n\nWith the same argument types, batched_mul(A, B) would regard B as a fixed matrix, not a batch of vectors. Both reshape and then call batched_mul(::Array{T,3}, ::Array{T,3}).\n\njulia> A, B, b = randn(16,8,32), randn(8,32), randn(8);\n\njulia> batched_vec(A,B) |> size\n(16, 32)\n\njulia> batched_vec(A,b) |> size\n(16, 32)\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#Gather-and-Scatter","page":"Low-level Operations – NNlib.jl","title":"Gather and Scatter","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"Flux's Embedding layer uses NNlib.gather as its backend.","category":"page"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"NNlib.gather\nNNlib.gather!\nNNlib.scatter\nNNlib.scatter!","category":"page"},{"location":"reference/models/nnlib/#NNlib.gather","page":"Low-level Operations – NNlib.jl","title":"NNlib.gather","text":"NNlib.gather(src, idx) -> dst\n\nReverse operation of scatter. Gathers data from source src and writes it in a destination dst according to the index array idx. For each k in CartesianIndices(idx), assign values to dst according to\n\ndst[:, ... , k] .= src[:, ... , idx[k]...]\n\nNotice that if idx is a vector containing integers and src is a matrix, previous expression simplifies to\n\ndst[:, k] .= src[:, idx[k]]\n\nand k will run over 1:length(idx).\n\nThe elements of idx can be integers or integer tuples and may be repeated. A single src column can end up being copied into zero, one, or multiple dst columns.\n\nSee gather! 
for an in-place version.\n\nExamples\n\njulia> NNlib.gather([1,20,300,4000], [2,4,2])\n3-element Vector{Int64}:\n 20\n 4000\n 20\n\njulia> NNlib.gather([1 2 3; 4 5 6], [1,3,1,3,1])\n2×5 Matrix{Int64}:\n 1 3 1 3 1\n 4 6 4 6 4\n\n\n\n\n\ngather(src, IJK...)\n\nConvert the tuple of integer vectors IJK to a tuple of CartesianIndex and call gather on it: gather(src, CartesianIndex.(IJK...)).\n\nExamples\n\njulia> src = reshape([1:15;], 3, 5)\n3×5 Matrix{Int64}:\n 1 4 7 10 13\n 2 5 8 11 14\n 3 6 9 12 15\n\njulia> NNlib.gather(src, [1, 2], [2, 4])\n2-element Vector{Int64}:\n 4\n 11\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.gather!","page":"Low-level Operations – NNlib.jl","title":"NNlib.gather!","text":"NNlib.gather!(dst, src, idx)\n\nReverse operation of scatter!. Gathers data from source src and writes it in destination dst according to the index array idx. For each k in CartesianIndices(idx), assign values to dst according to\n\ndst[:, ... , k] .= src[:, ... , idx[k]...]\n\nNotice that if idx is a vector containing integers, and both dst and src are matrices, previous expression simplifies to\n\ndst[:, k] .= src[:, idx[k]]\n\nand k will run over 1:length(idx).\n\nThe elements of idx can be integers or integer tuples and may be repeated. A single src column can end up being copied into zero, one, or multiple dst columns.\n\nSee gather for an allocating version.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.scatter","page":"Low-level Operations – NNlib.jl","title":"NNlib.scatter","text":"NNlib.scatter(op, src, idx; [init, dstsize])\n\nScatter operation allocating a destination array dst and calling scatter!(op, dst, src, idx) on it.\n\nIf keyword init is provided, it is used to initialize the content of dst. Otherwise, the init values is inferred from the reduction operator op for some common operators (e.g. 
init = 0 for op = +).\nIf dstsize is provided, it will be used to define the size of destination array, otherwise it will be inferred by src and idx.\n\nSee scatter! for full details on how idx works.\n\nExamples\n\njulia> NNlib.scatter(+, [10,100,1000], [3,1,2])\n3-element Vector{Int64}:\n 100\n 1000\n 10\n\njulia> NNlib.scatter(+, [1 2 3 4; 5 6 7 8], [2,1,1,5])\n2×5 Matrix{Int64}:\n 5 1 0 0 4\n 13 5 0 0 8\n\njulia> NNlib.scatter(*, [10,200,3000], [1,4,2]; init = 10, dstsize = 6)\n6-element Vector{Int64}:\n 100\n 30000\n 10\n 2000\n 10\n 10\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.scatter!","page":"Low-level Operations – NNlib.jl","title":"NNlib.scatter!","text":"NNlib.scatter!(op, dst, src, idx)\n\nScatter operation, which writes data in src into dst at locations idx. A binary reduction operator op is applied during the scatter. For each index k in idx, accumulates values in dst according to\n\ndst[:, ..., idx[k]...] = (op).(dst[:, ..., idx[k]...], src[:, ..., k...])\n\nSee also scatter, gather.\n\nArguments\n\nop: Operations to be applied on dst and src, e.g. +, -, *, /, max, min and mean.\ndst: The destination for src to aggregate to. This argument will be mutated.\nsrc: The source data for aggregating.\nidx: The mapping for aggregation from source (index) to destination (value). 
The idx array can contain either integers or tuples.\n\nExamples\n\njulia> NNlib.scatter!(+, ones(3), [10,100], [1,3])\n3-element Vector{Float64}:\n 11.0\n 1.0\n 101.0\n\njulia> NNlib.scatter!(*, fill(0.5, 2, 4), [1 10; 100 1000], [3,2])\n2×4 Matrix{Float64}:\n 0.5 5.0 0.5 0.5\n 0.5 500.0 50.0 0.5\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#Sampling","page":"Low-level Operations – NNlib.jl","title":"Sampling","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"grid_sample\n∇grid_sample","category":"page"},{"location":"reference/models/nnlib/#NNlib.grid_sample","page":"Low-level Operations – NNlib.jl","title":"NNlib.grid_sample","text":"grid_sample(input::AbstractArray{T, 4}, grid::AbstractArray{T, 4}; padding_mode = :zeros)\n\nGiven input, compute output by sampling input values at pixel locations from grid. Uses bilinear interpolation to calculate output values.\n\nThis implementation assumes the extrema (-1 and 1) are considered as referring to the center points of the input’s corner pixels (i.e. align corners is true).\n\nArguments\n\ninput: Input array in (W_in, H_in, C, N) shape.\ngrid: Input grid in (2, W_out, H_out, N) shape. Where for each (W_out, H_out, N) grid contains (x, y) coordinates that specify sampling locations normalized by the input shape.\nTherefore, x and y should have values in [-1, 1] range. For example, (x = -1, y = -1) is the left-top pixel of input, and (x = 1, y = 1) is the right-bottom pixel of input.\nOut-of-bound values are handled according to the padding_mode.\npadding_mode: Out-of-bound padding. :zeros to use 0 for out-of-bound grid locations. :border to use border values for out-of-bound grid locations. 
Default is :zeros.\n\nReturns\n\n(W_out, H_out, C, N) sampled grid from input.\n\nExamples\n\nIn the example below, grid contains two out-of-bound sampling locations, which are handled differently, depending on the padding_mode.\n\njulia> x = reshape(collect(1.0:4.0), (2, 2, 1, 1))\n2×2×1×1 Array{Float64, 4}:\n[:, :, 1, 1] =\n 1.0 3.0\n 2.0 4.0\n\njulia> grid = Array{Float64}(undef, 2, 3, 2, 1);\n\njulia> grid[:, 1, 1, 1] .= (-3, -1);\n\njulia> grid[:, 2, 1, 1] .= (0, -1);\n\njulia> grid[:, 3, 1, 1] .= (1, -1);\n\njulia> grid[:, 1, 2, 1] .= (-1, 1);\n\njulia> grid[:, 2, 2, 1] .= (0, 1);\n\njulia> grid[:, 3, 2, 1] .= (3, 1);\n\njulia> grid_sample(x, grid; padding_mode=:zeros)\n3×2×1×1 Array{Float64, 4}:\n[:, :, 1, 1] =\n 0.0 3.0\n 1.5 3.5\n 2.0 0.0\n\njulia> grid_sample(x, grid; padding_mode=:border)\n3×2×1×1 Array{Float64, 4}:\n[:, :, 1, 1] =\n 1.0 3.0\n 1.5 3.5\n 2.0 4.0\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.∇grid_sample","page":"Low-level Operations – NNlib.jl","title":"NNlib.∇grid_sample","text":"∇grid_sample(Δ::AbstractArray{T, 4}, input::AbstractArray{T, 4}, grid::AbstractArray{T, 4}; padding_mode = :zeros) where T\n\nArguments\n\nΔ: Input gradient in (W_out, H_out, C, N) shape (same as output of the primal computation).\ninput: Input from primal computation in (W_in, H_in, C, N) shape.\ngrid: Grid from primal computation in (2, W_out, H_out, N) shape.\npadding_mode: Out-of-bound padding. :zeros to use 0 for out-of-bound grid locations. :border to use border values for out-of-bound grid locations. Should be the same as in primal computation. 
Default is :zeros.\n\nReturns\n\ndinput (same shape as input) and dgrid (same shape as grid) gradients.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#Losses","page":"Low-level Operations – NNlib.jl","title":"Losses","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"ctc_loss","category":"page"},{"location":"reference/models/nnlib/#NNlib.ctc_loss","page":"Low-level Operations – NNlib.jl","title":"NNlib.ctc_loss","text":"ctc_loss(ŷ, y)\n\nComputes the connectionist temporal classification loss between ŷ and y. ŷ must be a classes-by-time matrix, i.e., each row represents a class and each column represents a time step. Additionally, the logsoftmax function will be applied to ŷ, so ŷ must be the raw activation values from the neural network and not, for example, the activations after being passed through a softmax activation function. y must be a 1D array of the labels associated with ŷ. The blank label is assumed to be the last label category in ŷ, so it is equivalent to size(ŷ, 1). Used for sequence-to-sequence classification problems such as speech recognition and handwriting recognition where the exact time-alignment of the output (e.g., letters) is not needed to solve the problem. See Graves et al. (2006) or Graves (2012) for mathematical details.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#Miscellaneous","page":"Low-level Operations – NNlib.jl","title":"Miscellaneous","text":"","category":"section"},{"location":"reference/models/nnlib/","page":"Low-level Operations – NNlib.jl","title":"Low-level Operations – NNlib.jl","text":"logsumexp\nNNlib.glu","category":"page"},{"location":"reference/models/nnlib/#NNlib.logsumexp","page":"Low-level Operations – NNlib.jl","title":"NNlib.logsumexp","text":"logsumexp(x; dims = :)\n\nComputes log.(sum(exp.(x); dims)) in a numerically stable way. 
Without dims keyword this returns a scalar.\n\nSee also logsoftmax.\n\n\n\n\n\n","category":"function"},{"location":"reference/models/nnlib/#NNlib.glu","page":"Low-level Operations – NNlib.jl","title":"NNlib.glu","text":"glu(x, dim = 1)\n\nThe gated linear unit from the \"Language Modeling with Gated Convolutional Networks\" paper.\n\nCalculates a .* sigmoid(b), where x is split in half along given dimension dim to form a and b.\n\n\n\n\n\n","category":"function"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"CurrentModule = Flux\nCollapsedDocStrings = true","category":"page"},{"location":"reference/training/optimisers/#man-optimisers","page":"Optimisation Rules","title":"Optimisation Rules","text":"","category":"section"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"Any optimization rule from Optimisers.jl can be used with train! and other training functions.","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"For full details of how the new interface works, see the Optimisers.jl documentation.","category":"page"},{"location":"reference/training/optimisers/#Optimisers-Reference","page":"Optimisation Rules","title":"Optimisers Reference","text":"","category":"section"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"All optimisers return an object that, when passed to train!, will update the parameters passed to it.","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation 
Rules","text":"Optimisers.Descent\nOptimisers.Momentum\nOptimisers.Nesterov\nOptimisers.RMSProp\nOptimisers.Adam\nOptimisers.RAdam\nOptimisers.AdaMax\nOptimisers.AdaGrad\nOptimisers.AdaDelta\nOptimisers.AMSGrad\nOptimisers.NAdam\nOptimisers.AdamW\nOptimisers.OAdam\nOptimisers.AdaBelief\nOptimisers.Lion","category":"page"},{"location":"reference/training/optimisers/#Optimisers.Descent","page":"Optimisation Rules","title":"Optimisers.Descent","text":"Descent(η = 1f-1)\nDescent(; eta)\n\nClassic gradient descent optimiser with learning rate η. For each parameter p and its gradient dp, this runs p -= η*dp.\n\nParameters\n\nLearning rate (η == eta): Amount by which gradients are discounted before updating the weights.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.Momentum","page":"Optimisation Rules","title":"Optimisers.Momentum","text":"Momentum(η = 0.01, ρ = 0.9)\nMomentum(; [eta, rho])\n\nGradient descent optimizer with learning rate η and momentum ρ.\n\nParameters\n\nLearning rate (η == eta): Amount by which gradients are discounted before updating the weights.\nMomentum (ρ == rho): Controls the acceleration of gradient descent in the prominent direction, in effect dampening oscillations.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.Nesterov","page":"Optimisation Rules","title":"Optimisers.Nesterov","text":"Nesterov(η = 0.001, ρ = 0.9)\n\nGradient descent optimizer with learning rate η and Nesterov momentum ρ.\n\nParameters\n\nLearning rate (η): Amount by which gradients are discounted before updating the weights.\nNesterov momentum (ρ): Controls the acceleration of gradient descent in the prominent direction, in effect dampening oscillations.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.RMSProp","page":"Optimisation Rules","title":"Optimisers.RMSProp","text":"RMSProp(η = 0.001, ρ = 0.9, ϵ = 1e-8; centred = false)\nRMSProp(; [eta, rho, 
epsilon, centred])\n\nOptimizer using the RMSProp algorithm. Often a good choice for recurrent networks. Parameters other than learning rate generally don't need tuning.\n\nCentred RMSProp is a variant which normalises gradients by an estimate of their variance, instead of their second moment.\n\nParameters\n\nLearning rate (η == eta): Amount by which gradients are discounted before updating the weights.\nMomentum (ρ == rho): Controls the acceleration of gradient descent in the prominent direction, in effect dampening oscillations.\nMachine epsilon (ϵ == epsilon): Constant to prevent division by zero (no need to change default)\nKeyword centred (or centered): Indicates whether to use centred variant of the algorithm.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.Adam","page":"Optimisation Rules","title":"Optimisers.Adam","text":"Adam(η = 0.001, β = (0.9, 0.999), ϵ = 1e-8)\n\nAdam optimiser.\n\nParameters\n\nLearning rate (η): Amount by which gradients are discounted before updating the weights.\nDecay of momentums (β::Tuple): Exponential decay for the first (β1) and the second (β2) momentum estimate.\nMachine epsilon (ϵ): Constant to prevent division by zero (no need to change default)\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.RAdam","page":"Optimisation Rules","title":"Optimisers.RAdam","text":"RAdam(η = 0.001, β = (0.9, 0.999), ϵ = 1e-8)\n\nRectified Adam optimizer.\n\nParameters\n\nLearning rate (η): Amount by which gradients are discounted before updating the weights.\nDecay of momentums (β::Tuple): Exponential decay for the first (β1) and the second (β2) momentum estimate.\nMachine epsilon (ϵ): Constant to prevent division by zero (no need to change default)\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.AdaMax","page":"Optimisation Rules","title":"Optimisers.AdaMax","text":"AdaMax(η = 0.001, β = (0.9, 0.999), ϵ = 1e-8)\n\nAdaMax is a 
variant of Adam based on the ∞-norm.\n\nParameters\n\nLearning rate (η): Amount by which gradients are discounted before updating the weights.\nDecay of momentums (β::Tuple): Exponential decay for the first (β1) and the second (β2) momentum estimate.\nMachine epsilon (ϵ): Constant to prevent division by zero (no need to change default)\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.AdaGrad","page":"Optimisation Rules","title":"Optimisers.AdaGrad","text":"AdaGrad(η = 0.1, ϵ = 1e-8)\n\nAdaGrad optimizer. It has parameter specific learning rates based on how frequently it is updated. Parameters don't need tuning.\n\nParameters\n\nLearning rate (η): Amount by which gradients are discounted before updating the weights.\nMachine epsilon (ϵ): Constant to prevent division by zero (no need to change default)\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.AdaDelta","page":"Optimisation Rules","title":"Optimisers.AdaDelta","text":"AdaDelta(ρ = 0.9, ϵ = 1e-8)\n\nAdaDelta is a version of AdaGrad adapting its learning rate based on a window of past gradient updates. Parameters don't need tuning.\n\nParameters\n\nRho (ρ): Factor by which the gradient is decayed at each time step.\nMachine epsilon (ϵ): Constant to prevent division by zero (no need to change default)\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.AMSGrad","page":"Optimisation Rules","title":"Optimisers.AMSGrad","text":"AMSGrad(η = 0.001, β = (0.9, 0.999), ϵ = 1e-8)\n\nThe AMSGrad version of the Adam optimiser. 
Parameters don't need tuning.\n\nParameters\n\nLearning rate (η): Amount by which gradients are discounted before updating the weights.\nDecay of momentums (β::Tuple): Exponential decay for the first (β1) and the second (β2) momentum estimate.\nMachine epsilon (ϵ): Constant to prevent division by zero (no need to change default)\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.NAdam","page":"Optimisation Rules","title":"Optimisers.NAdam","text":"NAdam(η = 0.001, β = (0.9, 0.999), ϵ = 1e-8)\n\nNAdam is a Nesterov variant of Adam. Parameters don't need tuning.\n\nParameters\n\nLearning rate (η): Amount by which gradients are discounted before updating the weights.\nDecay of momentums (β::Tuple): Exponential decay for the first (β1) and the second (β2) momentum estimate.\nMachine epsilon (ϵ): Constant to prevent division by zero (no need to change default)\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.AdamW","page":"Optimisation Rules","title":"Optimisers.AdamW","text":"AdamW(η = 0.001, β = (0.9, 0.999), λ = 0, ϵ = 1e-8)\nAdamW(; [eta, beta, lambda, epsilon])\n\nAdamW is a variant of Adam fixing (as in repairing) its weight decay regularization. 
Implemented as an OptimiserChain of Adam and WeightDecay.\n\nParameters\n\nLearning rate (η == eta): Amount by which gradients are discounted before updating the weights.\nDecay of momentums (β::Tuple == beta): Exponential decay for the first (β1) and the second (β2) momentum estimate.\nWeight decay (λ == lambda): Controls the strength of L_2 regularisation.\nMachine epsilon (ϵ == epsilon): Constant to prevent division by zero (no need to change default)\n\n\n\n\n\n","category":"function"},{"location":"reference/training/optimisers/#Optimisers.OAdam","page":"Optimisation Rules","title":"Optimisers.OAdam","text":"OAdam(η = 0.001, β = (0.5, 0.9), ϵ = 1e-8)\n\nOAdam (Optimistic Adam) is a variant of Adam adding an \"optimistic\" term suitable for adversarial training.\n\nParameters\n\nLearning rate (η): Amount by which gradients are discounted before updating the weights.\nDecay of momentums (β::Tuple): Exponential decay for the first (β1) and the second (β2) momentum estimate.\nMachine epsilon (ϵ): Constant to prevent division by zero (no need to change default)\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.AdaBelief","page":"Optimisation Rules","title":"Optimisers.AdaBelief","text":"AdaBelief(η = 0.001, β = (0.9, 0.999), ϵ = 1e-16)\n\nThe AdaBelief optimiser is a variant of the well-known Adam optimiser.\n\nParameters\n\nLearning rate (η): Amount by which gradients are discounted before updating the weights.\nDecay of momentums (β::Tuple): Exponential decay for the first (β1) and the second (β2) momentum estimate.\nMachine epsilon (ϵ::Float32): Constant to prevent division by zero (no need to change default)\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.Lion","page":"Optimisation Rules","title":"Optimisers.Lion","text":"Lion(η = 0.001, β = (0.9, 0.999))\n\nLion optimiser.\n\nParameters\n\nLearning rate (η): Magnitude by which gradients are updating the weights.\nDecay of momentums 
(β::Tuple): Exponential decay for the first (β1) and the second (β2) momentum estimate.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Composing-Optimisers","page":"Optimisation Rules","title":"Composing Optimisers","text":"","category":"section"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"Flux (through Optimisers.jl) defines a special kind of optimiser called OptimiserChain which takes in arbitrary optimisers as input. Its behaviour is similar to the usual optimisers, but differs in that it acts by calling the optimisers listed in it sequentially. Each optimiser produces a modified gradient that will be fed into the next, and the resultant update will be applied to the parameter as usual. A classic use case is where adding decays is desirable. Optimisers.jl defines the basic decay corresponding to an L_2 regularization in the loss as WeightDecay.","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"opt = OptimiserChain(WeightDecay(1e-4), Descent())","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"Here we apply the weight decay to the Descent optimiser. 
The resulting optimiser opt can be used like any other optimiser.","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"w = [randn(10, 10), randn(10, 10)]\nopt_state = Flux.setup(opt, w)\n\nloss(w, x) = Flux.mse(w[1] * x, w[2] * x)\n\nloss(w, rand(10)) # around 0.9\n\nfor t = 1:10^5\n g = gradient(w -> loss(w, rand(10)), w)[1]\n Flux.update!(opt_state, w, g)\nend\n\nloss(w, rand(10)) # around 0.9","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"It is possible to compose optimisers for some added flexibility.","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"Optimisers.OptimiserChain","category":"page"},{"location":"reference/training/optimisers/#Optimisers.OptimiserChain","page":"Optimisation Rules","title":"Optimisers.OptimiserChain","text":"OptimiserChain(opts...)\n\nCompose a sequence of optimisers so that each opt in opts updates the gradient, in the order specified.\n\nWith an empty sequence, OptimiserChain() is the identity, so update! will subtract the full gradient from the parameters. This is equivalent to Descent(1).\n\nExample\n\njulia> o = OptimiserChain(ClipGrad(1.0), Descent(0.1));\n\njulia> m = (zeros(3),);\n\njulia> s = Optimisers.setup(o, m)\n(Leaf(OptimiserChain(ClipGrad(1.0), Descent(0.1)), (nothing, nothing)),)\n\njulia> Optimisers.update(s, m, ([0.3, 1, 7],))[2] # clips before discounting\n([-0.03, -0.1, -0.1],)\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Scheduling-Optimisers","page":"Optimisation Rules","title":"Scheduling Optimisers","text":"","category":"section"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"In practice, it is fairly common to schedule the learning rate of an optimiser to obtain faster convergence. 
There are a variety of popular scheduling policies, and you can find implementations of them in ParameterSchedulers.jl. The documentation for ParameterSchedulers.jl provides a more detailed overview of the different scheduling policies, and how to use them with Flux optimisers. Below, we provide a brief snippet illustrating a cosine annealing schedule with a momentum optimiser.","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"First, we import ParameterSchedulers.jl and initialize a cosine annealing schedule to vary the learning rate between 1e-4 and 1e-2 every 10 steps. We also create a new Momentum optimiser.","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"using ParameterSchedulers\n\nopt = Momentum()\nschedule = Cos(λ0 = 1e-4, λ1 = 1e-2, period = 10)\nfor (eta, epoch) in zip(schedule, 1:100)\n opt.eta = eta\n # your training code here\nend","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"schedule can also be indexed (e.g. schedule(100)) or iterated like any iterator in Julia.","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"ParameterSchedulers.jl schedules are stateless (they don't store their iteration state). 
If you want a stateful schedule, you can use ParameterSchedulers.Stateful:","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"using ParameterSchedulers: Stateful, next!\n\nschedule = Stateful(Cos(λ0 = 1e-4, λ1 = 1e-2, period = 10))\nfor epoch in 1:100\n opt.eta = next!(schedule)\n # your training code here\nend","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"ParameterSchedulers.jl allows for many more scheduling policies including arbitrary functions, looping any function with a given period, or sequences of many schedules. See the ParameterSchedulers.jl documentation for more info.","category":"page"},{"location":"reference/training/optimisers/#Decays","page":"Optimisation Rules","title":"Decays","text":"","category":"section"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"Similar to optimisers, Flux also defines some simple decays that can be used in conjunction with other optimisers, or standalone.","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"Optimisers.SignDecay\nOptimisers.WeightDecay","category":"page"},{"location":"reference/training/optimisers/#Optimisers.SignDecay","page":"Optimisation Rules","title":"Optimisers.SignDecay","text":"SignDecay(λ = 1e-3)\n\nImplements L_1 regularisation, also known as LASSO regression, when composed with other rules as the first transformation in an OptimiserChain.\n\nIt does this by adding λ .* sign(x) to the gradient. This is equivalent to adding λ * sum(abs, x) == λ * norm(x, 1) to the loss.\n\nSee also [WeightDecay] for L_2 normalisation. 
They can be used together: OptimiserChain(SignDecay(0.012), WeightDecay(0.034), Adam()) is equivalent to adding 0.012 * norm(x, 1) + 0.017 * norm(x, 2)^2 to the loss function.\n\nParameters\n\nPenalty (λ ≥ 0): Controls the strength of the regularisation.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.WeightDecay","page":"Optimisation Rules","title":"Optimisers.WeightDecay","text":"WeightDecay(λ = 5e-4)\n\nImplements L_2 regularisation, also known as ridge regression, when composed with other rules as the first transformation in an OptimiserChain.\n\nIt does this by adding λ .* x to the gradient. This is equivalent to adding λ/2 * sum(abs2, x) == λ/2 * norm(x)^2 to the loss.\n\nSee also [SignDecay] for L_1 normalisation.\n\nParameters\n\nPenalty (λ ≥ 0): Controls the strength of the regularisation.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Gradient-Clipping","page":"Optimisation Rules","title":"Gradient Clipping","text":"","category":"section"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"Gradient clipping is useful for training recurrent neural networks, which have a tendency to suffer from the exploding gradient problem. 
An example usage is","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"opt = OptimiserChain(ClipGrad(1e-3), Adam(1e-3))","category":"page"},{"location":"reference/training/optimisers/","page":"Optimisation Rules","title":"Optimisation Rules","text":"Optimisers.ClipGrad\nOptimisers.ClipNorm","category":"page"},{"location":"reference/training/optimisers/#Optimisers.ClipGrad","page":"Optimisation Rules","title":"Optimisers.ClipGrad","text":"ClipGrad(δ = 10)\n\nRestricts every gradient component to obey -δ ≤ dx[i] ≤ δ.\n\nTypically composed with other rules using OptimiserChain.\n\nSee also ClipNorm.\n\n\n\n\n\n","category":"type"},{"location":"reference/training/optimisers/#Optimisers.ClipNorm","page":"Optimisation Rules","title":"Optimisers.ClipNorm","text":"ClipNorm(ω = 10, p = 2; throw = true)\n\nScales any gradient array for which norm(dx, p) > ω to stay at this threshold (unless p==0).\n\nThrows an error if the norm is infinite or NaN, which you can turn off with throw = false.\n\nTypically composed with other rules using OptimiserChain.\n\nSee also ClipGrad.\n\n\n\n\n\n","category":"type"},{"location":"guide/gpu/#GPU-Support","page":"GPU Support","title":"GPU Support","text":"","category":"section"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Starting with v0.14, Flux doesn't force a specific GPU backend and the corresponding package dependencies on users. Thanks to the package extension mechanism introduced in Julia v1.9, Flux conditionally loads GPU-specific code once a GPU package is made available (e.g. through using CUDA).","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"NVIDIA GPU support requires the packages CUDA.jl and cuDNN.jl to be installed in the environment. In the Julia REPL, type ] add CUDA, cuDNN to install them. 
For more details see the CUDA.jl readme.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"AMD GPU support has been available since Julia 1.9 on systems with ROCm and MIOpen installed. For more details refer to the AMDGPU.jl repository.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Metal GPU acceleration is available on Apple Silicon hardware. For more details refer to the Metal.jl repository. Metal support in Flux is experimental and many features are not yet available.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"In order to trigger GPU support in Flux, you need to call using CUDA, using AMDGPU or using Metal in your code. Note that for CUDA, cuDNN does not need to be loaded explicitly, but the package has to be installed in the environment. ","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"compat: Flux ≤ 0.13\nOld versions of Flux automatically installed CUDA.jl to provide GPU support. Starting from Flux v0.14, CUDA.jl is not a dependency anymore and has to be installed manually.","category":"page"},{"location":"guide/gpu/#Checking-GPU-Availability","page":"GPU Support","title":"Checking GPU Availability","text":"","category":"section"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"By default, Flux runs checks on your system to see if it can support GPU functionality. 
You can check if Flux identified a valid GPU setup by typing the following:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> using CUDA\n\njulia> CUDA.functional()\ntrue","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"For AMD GPU:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> using AMDGPU\n\njulia> AMDGPU.functional()\ntrue\n\njulia> AMDGPU.functional(:MIOpen)\ntrue","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"For Metal GPU:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> using Metal\n\njulia> Metal.functional()\ntrue","category":"page"},{"location":"guide/gpu/#Selecting-GPU-backend","page":"GPU Support","title":"Selecting GPU backend","text":"","category":"section"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Available GPU backends are: CUDA, AMDGPU and Metal.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Flux relies on Preferences.jl for selecting the default GPU backend to use.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"There are two ways you can specify it:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"From the REPL/code in your project, call Flux.gpu_backend!(\"AMDGPU\") and restart the Julia session (if needed) for the changes to take effect.\nIn the LocalPreferences.toml file in your project directory, specify:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"[Flux]\ngpu_backend = \"AMDGPU\"","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"The current GPU backend can be fetched from the Flux.GPU_BACKEND 
variable:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> Flux.GPU_BACKEND\n\"CUDA\"","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"The current backend affects the behaviour of functions such as gpu, described below.","category":"page"},{"location":"guide/gpu/#Basic-GPU-Usage","page":"GPU Support","title":"Basic GPU Usage","text":"","category":"section"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Support for array operations on other hardware backends, like GPUs, is provided by external packages like CUDA.jl, AMDGPU.jl, and Metal.jl. Flux is agnostic to array types, so we simply need to move model weights and data to the GPU and Flux will handle it.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"For example, we can use CUDA.CuArray (with the cu converter) to run our basic example on an NVIDIA GPU.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"(Note that you need to have CUDA available to use CUDA.CuArray – please see the CUDA.jl instructions for more details.)","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"using CUDA\n\nW = cu(rand(2, 5)) # a 2×5 CuArray\nb = cu(rand(2))\n\npredict(x) = W*x .+ b\nloss(x, y) = sum((predict(x) .- y).^2)\n\nx, y = cu(rand(5)), cu(rand(2)) # Dummy data\nloss(x, y) # ~ 3","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Note that we convert both the parameters (W, b) and the data set (x, y) to CUDA arrays. Taking derivatives and training work exactly as before.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"If you define a structured model, like a Dense layer or Chain, you just need to convert the internal parameters. 
Flux provides fmap, which allows you to alter all parameters of a model at once.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"d = Dense(10 => 5, σ)\nd = fmap(cu, d)\nd.weight # CuArray\nd(cu(rand(10))) # CuArray output\n\nm = Chain(Dense(10 => 5, σ), Dense(5 => 2), softmax)\nm = fmap(cu, m)\nm(cu(rand(10)))","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"As a convenience, Flux provides the gpu function to convert models and data to the GPU if one is available. By default, it'll do nothing. So, you can safely call gpu on some data or model (as shown below), and the code will not error, regardless of whether the GPU is available or not. If a GPU library (e.g. CUDA) loads successfully, gpu will move data from the CPU to the GPU. As is shown below, this will change the type of something like a regular array to a CuArray.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> using Flux, CUDA\n\njulia> m = Dense(10, 5) |> gpu\nDense(10 => 5) # 55 parameters\n\njulia> x = rand(10) |> gpu\n10-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:\n 0.066846445\n ⋮\n 0.76706964\n\njulia> m(x)\n5-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:\n -0.99992573\n ⋮\n -0.547261","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"The analogue cpu is also available for moving models and data back off of the GPU.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> x = rand(10) |> gpu\n10-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:\n 0.8019236\n ⋮\n 0.7766742\n\njulia> x |> cpu\n10-element Vector{Float32}:\n 0.8019236\n ⋮\n 0.7766742","category":"page"},{"location":"guide/gpu/#Transferring-Training-Data","page":"GPU Support","title":"Transferring Training Data","text":"","category":"section"},{"location":"guide/gpu/","page":"GPU 
Support","title":"GPU Support","text":"In order to train the model using the GPU both model and the training data have to be transferred to GPU memory. Moving the data can be done in two different ways:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Iterating over the batches in a DataLoader object transferring each one of the training batches at a time to the GPU. This is recommended for large datasets. Done by hand, it might look like this:\ntrain_loader = Flux.DataLoader((X, Y), batchsize=64, shuffle=true)\n# ... model definition, optimiser setup\nfor epoch in 1:epochs\n for (x_cpu, y_cpu) in train_loader\n x = gpu(x_cpu)\n y = gpu(y_cpu)\n grads = gradient(m -> loss(m, x, y), model)\n Flux.update!(opt_state, model, grads[1])\n end\nend\nRather than write this out every time, you can just call gpu(::DataLoader):\ngpu_train_loader = Flux.DataLoader((X, Y), batchsize=64, shuffle=true) |> gpu\n# ... model definition, optimiser setup\nfor epoch in 1:epochs\n for (x, y) in gpu_train_loader\n grads = gradient(m -> loss(m, x, y), model)\n Flux.update!(opt_state, model, grads[1])\n end\nend\nThis is equivalent to DataLoader(MLUtils.mapobs(gpu, (X, Y)); keywords...). Something similar can also be done with CUDA.CuIterator, gpu_train_loader = CUDA.CuIterator(train_loader). However, this only works with a limited number of data types: first(train_loader) should be a tuple (or NamedTuple) of arrays.\nTransferring all training data to the GPU at once before creating the DataLoader. 
This is usually performed for smaller datasets which are sure to fit in the available GPU memory.\ngpu_train_loader = Flux.DataLoader((X, Y) |> gpu, batchsize = 32)\n# ...\nfor epoch in 1:epochs\n for (x, y) in gpu_train_loader\n # ...\nHere (X, Y) |> gpu applies gpu to both arrays, as it recurses into structures.","category":"page"},{"location":"guide/gpu/#Saving-GPU-Trained-Models","page":"GPU Support","title":"Saving GPU-Trained Models","text":"","category":"section"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"After the training process is done, one must always transfer the trained model back to the cpu memory scope before serializing or saving to disk. This can be done, as described in the previous section, with:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"model = cpu(model) # or model = model |> cpu","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"and then","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"using BSON\n# ...\nBSON.@save \"./path/to/trained_model.bson\" model\n\n# in this approach the cpu-transferred model (referenced by the variable `model`)\n# only exists inside the `let` statement\nlet model = cpu(model)\n # ...\n BSON.@save \"./path/to/trained_model.bson\" model\nend\n\n# is equivalent to the above, but uses `key=value` storing directive from BSON.jl\nBSON.@save \"./path/to/trained_model.bson\" model = cpu(model)","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"The reason behind this is that models trained in the GPU but not transferred to the CPU memory scope will expect CuArrays as input. 
In other words, Flux models expect input data coming from the same kind of device on which they were trained.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"In controlled scenarios in which the data fed to the loaded models is guaranteed to be on the GPU, there's no need to transfer them back to CPU memory scope. However, in production environments, where artifacts are shared among different processes, machines or configurations, there is no guarantee that the CUDA.jl package will be available for the process performing inference on the model loaded from the disk.","category":"page"},{"location":"guide/gpu/#Disabling-CUDA-or-choosing-which-GPUs-are-visible-to-Flux","page":"GPU Support","title":"Disabling CUDA or choosing which GPUs are visible to Flux","text":"","category":"section"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Sometimes it is required to control which GPUs are visible to Julia on a system with multiple GPUs or disable GPUs entirely. 
This can be achieved with an environment variable CUDA_VISIBLE_DEVICES.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"To disable all devices:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"$ export CUDA_VISIBLE_DEVICES='-1'","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"To select specific devices by device id:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"$ export CUDA_VISIBLE_DEVICES='0,1'","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"More information for conditional use of GPUs in CUDA.jl can be found in its documentation, and information about the specific use of the variable is described in the Nvidia CUDA blog post.","category":"page"},{"location":"guide/gpu/#Using-device-objects","page":"GPU Support","title":"Using device objects","text":"","category":"section"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"As a more convenient syntax, Flux allows the usage of GPU device objects which can be used to easily transfer models to GPUs (and defaulting to using the CPU if no GPU backend is available). This syntax has a few advantages including automatic selection of the GPU backend and type stability of data movement. To do this, the Flux.get_device function can be used.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Flux.get_device first checks for a GPU preference, and if possible returns a device for the preference backend. 
For instance, consider the following example, where we load the CUDA.jl package to use an NVIDIA GPU (\"CUDA\" is the default preference):","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> using Flux, CUDA;\n\njulia> device = Flux.get_device(; verbose=true) # returns handle to an NVIDIA GPU\n[ Info: Using backend set in preferences: CUDA.\n(::Flux.FluxCUDADevice) (generic function with 1 method)\n\njulia> device.deviceID # check the id of the GPU\nCuDevice(0): NVIDIA GeForce GTX 1650\n\njulia> model = Dense(2 => 3);\n\njulia> model.weight # the model initially lives in CPU memory\n3×2 Matrix{Float32}:\n -0.984794 -0.904345\n 0.720379 -0.486398\n 0.851011 -0.586942\n\njulia> model = model |> device # transfer model to the GPU\nDense(2 => 3) # 9 parameters\n\njulia> model.weight\n3×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:\n -0.984794 -0.904345\n 0.720379 -0.486398\n 0.851011 -0.586942\n","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"The device preference can also be set via the Flux.gpu_backend! function. 
For instance, below we first set our device preference to \"CPU\":","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> using Flux; Flux.gpu_backend!(\"CPU\")\n┌ Info: New GPU backend set: CPU.\n└ Restart your Julia session for this change to take effect!","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Then, after restarting the Julia session, Flux.get_device returns a handle to the \"CPU\":","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> using Flux, CUDA; # even if CUDA is loaded, we'll still get a CPU device\n\njulia> device = Flux.get_device(; verbose=true) # get a CPU device\n[ Info: Using backend set in preferences: CPU.\n(::Flux.FluxCPUDevice) (generic function with 1 method)\n\njulia> model = Dense(2 => 3);\n\njulia> model = model |> device\nDense(2 => 3) # 9 parameters\n\njulia> model.weight # no change; model still lives on CPU\n3×2 Matrix{Float32}:\n -0.942968 0.856258\n 0.440009 0.714106\n -0.419192 -0.471838","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Clearly, this means that the same code will work for any GPU backend and the CPU. ","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"If the preference backend isn't available or isn't functional, then Flux.get_device looks for a CUDA, AMDGPU or Metal backend, and returns a corresponding device (if the backend is available and functional). Otherwise, a CPU device is returned. 
In the below example, the GPU preference is \"CUDA\":","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> using Flux; # preference is CUDA, but CUDA.jl not loaded\n\njulia> device = Flux.get_device(; verbose=true) # this will resort to automatic device selection\n[ Info: Using backend set in preferences: CUDA.\n┌ Warning: Trying to use backend: CUDA but it's trigger package is not loaded.\n│ Please load the package and call this function again to respect the preferences backend.\n└ @ Flux ~/fluxml/Flux.jl/src/functor.jl:637\n[ Info: Using backend: CPU.\n(::Flux.FluxCPUDevice) (generic function with 1 method)","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"For detailed information about how the backend is selected, check the documentation for Flux.get_device.","category":"page"},{"location":"guide/gpu/#Data-movement-across-GPU-devices","page":"GPU Support","title":"Data movement across GPU devices","text":"","category":"section"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Flux also supports getting handles to specific GPU devices, and transferring models from one GPU device to another GPU device from the same backend. Let's try it out for NVIDIA GPUs. First, we list all the available devices:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> using Flux, CUDA;\n\njulia> CUDA.devices()\nCUDA.DeviceIterator() for 3 devices:\n0. GeForce RTX 2080 Ti\n1. GeForce RTX 2080 Ti\n2. 
TITAN X (Pascal)\n","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Then, let's select the device with id 0:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> device0 = Flux.get_device(\"CUDA\", 0) # the currently supported values for backend are \"CUDA\" and \"AMDGPU\"\n(::Flux.FluxCUDADevice) (generic function with 1 method)\n","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Then, let's move a simple dense layer to the GPU represented by device0:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> dense_model = Dense(2 => 3)\nDense(2 => 3) # 9 parameters\n\njulia> dense_model = dense_model |> device0;\n\njulia> dense_model.weight\n3×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:\n 0.695662 0.816299\n -0.204763 -0.10232\n -0.955829 0.538412\n\njulia> CUDA.device(dense_model.weight) # check the GPU to which dense_model is attached\nCuDevice(0): GeForce RTX 2080 Ti\n","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Next, we'll get a handle to the device with id 1, and move dense_model to that device:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> device1 = Flux.get_device(\"CUDA\", 1)\n(::Flux.FluxCUDADevice) (generic function with 1 method)\n\njulia> dense_model = dense_model |> device1; # don't directly print the model; see warning below\n\njulia> CUDA.device(dense_model.weight)\nCuDevice(1): GeForce RTX 2080 Ti\n","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Due to a limitation in Metal.jl, currently this kind of data movement across devices is only supported for CUDA and AMDGPU backends.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"warning: Printing models after moving 
to a different device\nDue to a limitation in how GPU packages currently work, printing models on the REPL after moving them to a GPU device which is different from the current device will lead to an error.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Flux.AbstractDevice\nFlux.FluxCPUDevice\nFlux.FluxCUDADevice\nFlux.FluxAMDGPUDevice\nFlux.FluxMetalDevice\nFlux.supported_devices\nFlux.get_device\nFlux.gpu_backend!","category":"page"},{"location":"guide/gpu/#Flux.AbstractDevice","page":"GPU Support","title":"Flux.AbstractDevice","text":"Flux.AbstractDevice <: Function\n\nAn abstract type representing device objects for different GPU backends. The currently supported backends are \"CUDA\", \"AMDGPU\", \"Metal\" and \"CPU\"; the \"CPU\" backend is the fallback case when no GPU is available. GPU extensions of Flux define subtypes of this type.\n\n\n\n\n\n","category":"type"},{"location":"guide/gpu/#Flux.FluxCPUDevice","page":"GPU Support","title":"Flux.FluxCPUDevice","text":"Flux.FluxCPUDevice <: Flux.AbstractDevice\n\nA type representing device objects for the \"CPU\" backend for Flux. 
This is the fallback case when no GPU is available to Flux.\n\n\n\n\n\n","category":"type"},{"location":"guide/gpu/#Flux.FluxCUDADevice","page":"GPU Support","title":"Flux.FluxCUDADevice","text":"FluxCUDADevice <: AbstractDevice\n\nA type representing device objects for the \"CUDA\" backend for Flux.\n\n\n\n\n\n","category":"type"},{"location":"guide/gpu/#Flux.FluxAMDGPUDevice","page":"GPU Support","title":"Flux.FluxAMDGPUDevice","text":"FluxAMDGPUDevice <: AbstractDevice\n\nA type representing device objects for the \"AMDGPU\" backend for Flux.\n\n\n\n\n\n","category":"type"},{"location":"guide/gpu/#Flux.FluxMetalDevice","page":"GPU Support","title":"Flux.FluxMetalDevice","text":"FluxMetalDevice <: AbstractDevice\n\nA type representing device objects for the \"Metal\" backend for Flux.\n\n\n\n\n\n","category":"type"},{"location":"guide/gpu/#Flux.supported_devices","page":"GPU Support","title":"Flux.supported_devices","text":"Flux.supported_devices()\n\nGet all supported backends for Flux, in order of preference.\n\nExample\n\njulia> using Flux;\n\njulia> Flux.supported_devices()\n(\"CUDA\", \"AMDGPU\", \"Metal\", \"CPU\")\n\n\n\n\n\n","category":"function"},{"location":"guide/gpu/#Flux.get_device","page":"GPU Support","title":"Flux.get_device","text":"Flux.get_device(; verbose=false)::Flux.AbstractDevice\n\nReturns a device object for the most appropriate backend for the current Julia session. \n\nFirst, the function checks whether a backend preference has been set via the Flux.gpu_backend! function. If so, an attempt is made to load this backend. If the corresponding trigger package has been loaded and the backend is functional, a device corresponding to the given backend is loaded. Otherwise, the backend is chosen automatically. 
To update the backend preference, use Flux.gpu_backend!.\n\nIf there is no preference, then for each of the \"CUDA\", \"AMDGPU\", \"Metal\" and \"CPU\" backends in the given order, this function checks whether the given backend has been loaded via the corresponding trigger package, and whether the backend is functional. If so, the device corresponding to the backend is returned. If no GPU backend is available, a Flux.FluxCPUDevice is returned.\n\nIf verbose is set to true, then the function prints informative log messages.\n\nExamples\n\nFor the example given below, the backend preference was set to \"AMDGPU\" via the gpu_backend! function.\n\njulia> using Flux;\n\njulia> model = Dense(2 => 3)\nDense(2 => 3) # 9 parameters\n\njulia> device = Flux.get_device(; verbose=true) # this will just load the CPU device\n[ Info: Using backend set in preferences: AMDGPU.\n┌ Warning: Trying to use backend: AMDGPU but it's trigger package is not loaded.\n│ Please load the package and call this function again to respect the preferences backend.\n└ @ Flux ~/fluxml/Flux.jl/src/functor.jl:638\n[ Info: Using backend: CPU.\n(::Flux.FluxCPUDevice) (generic function with 1 method)\n\njulia> model = model |> device\nDense(2 => 3) # 9 parameters\n\njulia> model.weight\n3×2 Matrix{Float32}:\n -0.304362 -0.700477\n -0.861201 0.67825\n -0.176017 0.234188\n\nHere is the same example, but using \"CUDA\":\n\njulia> using Flux, CUDA;\n\njulia> model = Dense(2 => 3)\nDense(2 => 3) # 9 parameters\n\njulia> device = Flux.get_device(; verbose=true)\n[ Info: Using backend set in preferences: AMDGPU.\n┌ Warning: Trying to use backend: AMDGPU but it's trigger package is not loaded.\n│ Please load the package and call this function again to respect the preferences backend.\n└ @ Flux ~/fluxml/Flux.jl/src/functor.jl:637\n[ Info: Using backend: CUDA.\n(::Flux.FluxCUDADevice) (generic function with 1 method)\n\njulia> model = model |> device\nDense(2 => 3) # 9 parameters\n\njulia> model.weight\n3×2 
CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:\n 0.820013 0.527131\n -0.915589 0.549048\n 0.290744 -0.0592499\n\n\n\n\n\nFlux.get_device(backend::String, idx::Int = 0)::Flux.AbstractDevice\n\nGet a device object for a backend specified by the string backend and idx. The currently supported values of backend are \"CUDA\", \"AMDGPU\" and \"CPU\". idx must be an integer value between 0 and the number of available devices.\n\nExamples\n\njulia> using Flux, CUDA;\n\njulia> CUDA.devices()\nCUDA.DeviceIterator() for 3 devices:\n0. GeForce RTX 2080 Ti\n1. GeForce RTX 2080 Ti\n2. TITAN X (Pascal)\n\njulia> device0 = Flux.get_device(\"CUDA\", 0)\n(::Flux.FluxCUDADevice) (generic function with 1 method)\n\njulia> device0.deviceID\nCuDevice(0): GeForce RTX 2080 Ti\n\njulia> device1 = Flux.get_device(\"CUDA\", 1)\n(::Flux.FluxCUDADevice) (generic function with 1 method)\n\njulia> device1.deviceID\nCuDevice(1): GeForce RTX 2080 Ti\n\njulia> cpu_device = Flux.get_device(\"CPU\")\n(::Flux.FluxCPUDevice) (generic function with 1 method)\n\n\n\n\n\n\n","category":"function"},{"location":"guide/gpu/#Flux.gpu_backend!","page":"GPU Support","title":"Flux.gpu_backend!","text":"gpu_backend!(backend::String)\n\nSet the GPU backend to backend in the LocalPreferences.toml file in your project directory. 
After restarting Julia, the new backend will affect all subsequent calls to gpu and get_device.\n\nThe supported backends are \"CUDA\", \"AMDGPU\", \"Metal\" and \"CPU\".\n\n\n\n\n\n","category":"function"},{"location":"guide/gpu/#Distributed-data-parallel-training","page":"GPU Support","title":"Distributed data parallel training","text":"","category":"section"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"danger: Experimental\nDistributed support is experimental and could change in the future.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Flux now supports distributed data parallel training with the DistributedUtils module. If you want to run your code on multiple GPUs, you have to install MPI.jl (see docs for more info).","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> using MPI\n\njulia> MPI.install_mpiexecjl()","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Now you can run your code with mpiexecjl --project=. -n julia .jl from the CLI.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"You can use either the MPIBackend or NCCLBackend, the latter only if NCCL.jl is also loaded. 
First, initialize a backend with DistributedUtils.initialize, e.g.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> using Flux, MPI, NCCL\n\njulia> DistributedUtils.initialize(NCCLBackend)\n\njulia> backend = DistributedUtils.get_distributed_backend(NCCLBackend)\nNCCLBackend{Communicator, MPIBackend{MPI.Comm}}(Communicator(Ptr{NCCL.LibNCCL.ncclComm} @0x000000000607a660), MPIBackend{MPI.Comm}(MPI.Comm(1140850688)))","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Move your model, as well as any data, to the GPU device.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> model = Chain(Dense(1 => 256, tanh), Dense(256 => 1)) |> gpu\nChain(\n Dense(1 => 256, tanh), # 512 parameters\n Dense(256 => 1), # 257 parameters\n) # Total: 4 arrays, 769 parameters, 744 bytes.\n\njulia> x = rand(Float32, 1, 16) |> gpu\n1×16 CUDA.CuArray{Float32, 2, CUDA.DeviceMemory}:\n 0.239324 0.331029 0.924996 0.55593 0.853093 0.874513 0.810269 0.935858 0.477176 0.564591 0.678907 0.729682 0.96809 0.115833 0.66191 0.75822\n\njulia> y = x .^ 3\n1×16 CUDA.CuArray{Float32, 2, CUDA.DeviceMemory}:\n 0.0137076 0.0362744 0.791443 0.171815 0.620854 0.668804 0.53197 0.819654 0.108651 0.179971 0.312918 0.388508 0.907292 0.00155418 0.29 0.435899","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"In this case, we are training on a total of 16 * number of processes samples. 
You can also use DistributedUtils.DistributedDataContainer to split the data uniformly across processes (or split the data manually).","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> data = DistributedUtils.DistributedDataContainer(backend, x)\nFlux.DistributedUtils.DistributedDataContainer(Float32[0.23932439 0.33102947 … 0.66191036 0.75822026], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16])","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"You have to wrap your model in DistributedUtils.FluxDistributedModel and synchronize it (broadcast across all processes):","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> model = DistributedUtils.synchronize!!(backend, DistributedUtils.FluxDistributedModel(model); root=0)\nChain(\n Dense(1 => 256, tanh), # 512 parameters\n\n Dense(256 => 1), # 257 parameters\n) # Total: 4 arrays, 769 parameters, 744 bytes.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Time to set up an optimizer by using DistributedUtils.DistributedOptimizer and synchronize it as well.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> using Optimisers\n\njulia> opt = DistributedUtils.DistributedOptimizer(backend, Optimisers.Adam(0.001f0))\nDistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8))\n\njulia> st_opt = Optimisers.setup(opt, model)\n(layers = ((weight = Leaf(DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8)), (Float32[0.0; 0.0; … ; 0.0; 0.0;;], Float32[0.0; 0.0; … ; 0.0; 0.0;;], (0.9, 0.999))), bias = Leaf(DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8)), (Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 … 0.0, 
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 … 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], (0.9, 0.999))), σ = ()), (weight = Leaf(DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8)), (Float32[0.0 0.0 … 0.0 0.0], Float32[0.0 0.0 … 0.0 0.0], (0.9, 0.999))), bias = Leaf(DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8)), (Float32[0.0], Float32[0.0], (0.9, 0.999))), σ = ())),)\n\njulia> st_opt = DistributedUtils.synchronize!!(backend, st_opt; root=0) \n(layers = ((weight = Leaf(DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8)), (Float32[0.0; 0.0; … ; 0.0; 0.0;;], Float32[0.0; 0.0; … ; 0.0; 0.0;;], (0.9, 0.999))), bias = Leaf(DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8)), (Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 … 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 … 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], (0.9, 0.999))), σ = ()), (weight = Leaf(DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8)), (Float32[0.0 0.0 … 0.0 0.0], Float32[0.0 0.0 … 0.0 0.0], (0.9, 0.999))), bias = Leaf(DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8)), (Float32[0.0], Float32[0.0], (0.9, 0.999))), σ = ())),)","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Now you can define loss and train the model.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> loss(model) = mean((model(x) .- y).^2)\nloss (generic function with 1 method)\n\njulia> for epoch in 1:100\n global model, st_opt\n l, grad = 
Zygote.withgradient(loss, model)\n println(\"Epoch $epoch: Loss $l\")\n st_opt, model = Optimisers.update(st_opt, model, grad[1])\n end\nEpoch 1: Loss 0.011638729\nEpoch 2: Loss 0.0116432225\nEpoch 3: Loss 0.012763695\n...","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"Remember that in order to run it on multiple GPUs you have to run from CLI mpiexecjl --project=. -n julia .jl, where is the number of processes that you want to use. The number of processes usually corresponds to the number of GPUs.","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"By default, the MPI installation used by MPI.jl is CUDA-unaware, so if you want to run in CUDA-aware mode, read more here on custom installation and rebuilding MPI.jl. Then test whether your MPI is CUDA-aware with:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> import Pkg\njulia> Pkg.test(\"MPI\"; test_args=[\"--backend=CUDA\"])","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"If it is, set your local preference as below:","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"julia> using Preferences\njulia> set_preferences!(\"Flux\", \"FluxDistributedMPICUDAAware\" => true)","category":"page"},{"location":"guide/gpu/","page":"GPU Support","title":"GPU Support","text":"warning: Known shortcomings\nWe don't run CUDA-aware tests, so you use this mode at your own risk.","category":"page"},{"location":"reference/utilities/#man-init-funcs","page":"Weight Initialisation","title":"Random Weight Initialisation","text":"","category":"section"},{"location":"reference/utilities/","page":"Weight Initialisation","title":"Weight Initialisation","text":"Flux initialises convolutional layers and recurrent cells with glorot_uniform by default. Most layers accept a function as an init keyword, which replaces this default. 
For example:","category":"page"},{"location":"reference/utilities/","page":"Weight Initialisation","title":"Weight Initialisation","text":"julia> conv = Conv((3, 3), 3 => 2, relu; init=Flux.glorot_normal)\nConv((3, 3), 3 => 2, relu) # 56 parameters\n\njulia> conv.bias\n2-element Vector{Float32}:\n 0.0\n 0.0","category":"page"},{"location":"reference/utilities/","page":"Weight Initialisation","title":"Weight Initialisation","text":"Note that init creates the weight array, but not the bias vector.","category":"page"},{"location":"reference/utilities/","page":"Weight Initialisation","title":"Weight Initialisation","text":"Many of the initialisation functions accept keywords such as gain, and a random number generator. To make it easy to pass these to layers, there are methods which return a function:","category":"page"},{"location":"reference/utilities/","page":"Weight Initialisation","title":"Weight Initialisation","text":"julia> Dense(4 => 5, tanh; init=Flux.glorot_uniform(gain=2))\nDense(4 => 5, tanh) # 25 parameters\n\njulia> Dense(4 => 5, tanh; init=Flux.randn32(MersenneTwister(1)))\nDense(4 => 5, tanh) # 25 parameters","category":"page"},{"location":"reference/utilities/#Initialisation-functions","page":"Weight Initialisation","title":"Initialisation functions","text":"","category":"section"},{"location":"reference/utilities/","page":"Weight Initialisation","title":"Weight Initialisation","text":"Flux.glorot_uniform\nFlux.glorot_normal\nFlux.kaiming_uniform\nFlux.kaiming_normal\nFlux.truncated_normal\nFlux.orthogonal\nFlux.sparse_init\nFlux.identity_init\nFlux.ones32\nFlux.zeros32\nFlux.rand32\nFlux.randn32\nFlux.create_bias","category":"page"},{"location":"reference/utilities/#Flux.glorot_uniform","page":"Weight Initialisation","title":"Flux.glorot_uniform","text":"glorot_uniform([rng], size...; gain = 1) -> Array\nglorot_uniform([rng]; kw...) 
-> Function\n\nReturn an Array{Float32} of the given size containing random numbers drawn from a uniform distribution on the interval [-x, x], where x = gain * sqrt(6 / (fan_in + fan_out)).\n\nThis method is described in [1] and also known as Xavier initialization.\n\nExamples\n\njulia> Flux.glorot_uniform(3, 4) |> summary\n\"3×4 Matrix{Float32}\"\n\njulia> round.(extrema(Flux.glorot_uniform(10, 100)), digits=3)\n(-0.233f0, 0.233f0)\n\njulia> round.(extrema(Flux.glorot_uniform(100, 10)), digits=3)\n(-0.234f0, 0.233f0)\n\njulia> round.(extrema(Flux.glorot_uniform(100, 100)), digits=3)\n(-0.173f0, 0.173f0)\n\njulia> Dense(3 => 2, tanh; init = Flux.glorot_uniform(MersenneTwister(1)))\nDense(3 => 2, tanh) # 8 parameters\n\njulia> ans.bias\n2-element Vector{Float32}:\n 0.0\n 0.0\n\nReferences\n\n[1] Glorot, Xavier, and Yoshua Bengio. \"Understanding the difficulty of training deep feedforward neural networks.\" Proceedings of the thirteenth international conference on artificial intelligence and statistics. 2010.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.glorot_normal","page":"Weight Initialisation","title":"Flux.glorot_normal","text":"glorot_normal([rng], size...; gain = 1) -> Array\nglorot_normal([rng]; kw...) -> Function\n\nReturn an Array{Float32} of the given size containing random numbers drawn from a normal distribution with standard deviation gain * sqrt(2 / (fan_in + fan_out)), using nfan.\n\nThis method is described in [1] and also known as Xavier initialization.\n\nExamples\n\njulia> using Statistics\n\njulia> round(std(Flux.glorot_normal(10, 1000)), digits=3)\n0.044f0\n\njulia> round(std(Flux.glorot_normal(1000, 10)), digits=3)\n0.045f0\n\njulia> round(std(Flux.glorot_normal(1000, 1000)), digits=3)\n0.032f0\n\njulia> Dense(10 => 1000, tanh; init = Flux.glorot_normal(gain=100))\nDense(10 => 1000, tanh) # 11_000 parameters\n\njulia> round(std(ans.weight), sigdigits=3)\n4.45f0\n\nReferences\n\n[1] Glorot, Xavier, and Yoshua Bengio. 
\"Understanding the difficulty of training deep feedforward neural networks.\" Proceedings of the thirteenth international conference on artificial intelligence and statistics. 2010.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.kaiming_uniform","page":"Weight Initialisation","title":"Flux.kaiming_uniform","text":"kaiming_uniform([rng], size...; gain = √2) -> Array\nkaiming_uniform([rng]; kw...) -> Function\n\nReturn an Array{Float32} of the given size containing random numbers drawn from a uniform distribution on the interval [-x, x], where x = gain * sqrt(3/fan_in) using nfan.\n\nThis method is described in [1] and also known as He initialization.\n\nExamples\n\njulia> round.(extrema(Flux.kaiming_uniform(100, 10)), digits=3)\n(-0.774f0, 0.773f0)\n\njulia> round.(extrema(Flux.kaiming_uniform(10, 100)), digits=3)\n(-0.243f0, 0.245f0)\n\njulia> round.(extrema(Flux.kaiming_uniform(100, 100)), digits=3)\n(-0.245f0, 0.245f0)\n\nReferences\n\n[1] He, Kaiming, et al. \"Delving deep into rectifiers: Surpassing human-level performance on imagenet classification.\" Proceedings of the IEEE international conference on computer vision. 2015.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.kaiming_normal","page":"Weight Initialisation","title":"Flux.kaiming_normal","text":"kaiming_normal([rng], size...; gain = √2) -> Array\nkaiming_normal([rng]; kw...) -> Function\n\nReturn an Array{Float32} of the given size containing random numbers taken from a normal distribution standard deviation gain / sqrt(fan_in), using nfan.\n\nThis method is described in [1] and also known as He initialization.\n\nExamples\n\njulia> using Statistics\n\njulia> round(std(Flux.kaiming_normal(10, 1000)), digits=3)\n0.044f0\n\njulia> round(std(Flux.kaiming_normal(1000, 10)), digits=3)\n0.449f0\n\njulia> round(std(Flux.kaiming_normal(1000, 1000)), digits=3)\n0.045f0\n\nReferences\n\n[1] He, Kaiming, et al. 
\"Delving deep into rectifiers: Surpassing human-level performance on imagenet classification.\" Proceedings of the IEEE international conference on computer vision. 2015.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.truncated_normal","page":"Weight Initialisation","title":"Flux.truncated_normal","text":"truncated_normal([rng], size...; mean = 0, std = 1, lo = -2, hi = 2) -> Array\ntruncated_normal([rng]; kw...) -> Function\n\nReturn an Array{Float32} of the given size where each element is drawn from a truncated normal distribution. The numbers are distributed like filter(x -> lo<=x<=hi, mean .+ std .* randn(100)).\n\nThe values are generated by sampling a Uniform(0, 1) (rand()) and then applying the inverse CDF of the truncated normal distribution. This method works best when lo ≤ mean ≤ hi.\n\nExamples\n\njulia> using Statistics\n\njulia> Flux.truncated_normal(3, 4) |> summary\n\"3×4 Matrix{Float32}\"\n\njulia> round.(extrema(Flux.truncated_normal(10^6)); digits=3)\n(-2.0f0, 2.0f0)\n\njulia> round(std(Flux.truncated_normal(10^6; lo = -100, hi = 100)))\n1.0f0\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.orthogonal","page":"Weight Initialisation","title":"Flux.orthogonal","text":"orthogonal([rng], size...; gain = 1) -> Array\northogonal([rng]; kw...) -> Function\n\nReturn an Array{Float32} of the given size which is a (semi) orthogonal matrix, as described in [1].\n\nCannot construct a vector, i.e. length(size) == 1 is forbidden. 
For length(size) > 2, a prod(size[1:(end - 1)]) by size[end] orthogonal matrix is computed before reshaping it to the original dimensions.\n\nExamples\n\njulia> W = Flux.orthogonal(5, 7);\n\njulia> summary(W)\n\"5×7 Matrix{Float32}\"\n\njulia> W * W' ≈ I(5)\ntrue\n\njulia> W2 = Flux.orthogonal(7, 5);\n\njulia> W2 * W2' ≈ I(7)\nfalse\n\njulia> W2' * W2 ≈ I(5)\ntrue\n\njulia> W3 = Flux.orthogonal(3, 3, 2, 4);\n\njulia> transpose(reshape(W3, :, 4)) * reshape(W3, :, 4) ≈ I(4)\ntrue\n\nReferences\n\n[1] Saxe, McClelland, Ganguli. \"Exact solutions to the nonlinear dynamics of learning in deep linear neural networks\", ICLR 2014, https://arxiv.org/abs/1312.6120\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.sparse_init","page":"Weight Initialisation","title":"Flux.sparse_init","text":"sparse_init([rng], rows, cols; sparsity, std = 0.01) -> Array\nsparse_init([rng]; kw...) -> Function\n\nReturn a Matrix{Float32} of size rows, cols where each column contains a fixed fraction of zero elements given by sparsity. Non-zero elements are normally distributed with a mean of zero and standard deviation std.\n\nThis method is described in [1].\n\nExamples\n\njulia> count(iszero, Flux.sparse_init(10, 10, sparsity=1/5))\n20\n\njulia> sum(0 .== Flux.sparse_init(10, 11, sparsity=0.9), dims=1)\n1×11 Matrix{Int64}:\n 9 9 9 9 9 9 9 9 9 9 9\n\njulia> Dense(3 => 10, tanh; init=Flux.sparse_init(sparsity=0.5))\nDense(3 => 10, tanh) # 40 parameters\n\njulia> count(iszero, ans.weight, dims=1)\n1×3 Matrix{Int64}:\n 5 5 5\n\nReferences\n\n[1] Martens, J, \"Deep learning via Hessian-free optimization\" Proceedings of the 27th International Conference on International Conference on Machine Learning. 2010.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.identity_init","page":"Weight Initialisation","title":"Flux.identity_init","text":"identity_init(size...; gain=1, shift=0) -> Array\nidentity_init(; kw...) 
-> Function\n\nReturn an Array{Float32} of the given size which yields an identity mapping when used as parameters in most Flux layers. Use gain to scale the identity by a constant.\n\nOften useful in the context of transfer learning, i.e when one wants to add more capacity to a model but start from the same mapping.\n\nHas the following behaviour\n\n1D: A Vector of zeros (useful for an identity bias)\n2D: An identity matrix (useful for an identity matrix multiplication)\nMore than 2D: A dense block array of center tap spatial filters (useful for an identity convolution)\n\nSome caveats: \n\nNot all layers will be identity mapping when used with this init. Exceptions include recurrent layers and normalization layers.\nLayers must have input_size == output_size for identity mapping to be possible. When this is not the case, extra dimensions of the array are padded with zeros.\nFor convolutional layers, in addition to the above, the kernel sizes must also be odd and padding must be applied so that output feature maps have the same size as input feature maps, e.g by using SamePad.\n\nUse keyword shift (integer or tuple) to apply circular shift to the output, equivalent to Base.circshift(identity_init(size...), shift).\n\nFor consistency with other initialisers, it accepts rng::AbstractRNG as an optional first argument. 
But this is ignored, since the result is not random.\n\nExamples\n\njulia> Flux.identity_init(3,5)\n3×5 Matrix{Float32}:\n 1.0 0.0 0.0 0.0 0.0\n 0.0 1.0 0.0 0.0 0.0\n 0.0 0.0 1.0 0.0 0.0\n\njulia> Dense(5 => 3, relu, init=Flux.identity_init)([1,-2,3,-4,5])\n3-element Vector{Float32}:\n 1.0\n 0.0\n 3.0\n\njulia> Flux.identity_init(3,3,2; gain=100)\n3×3×2 Array{Float32, 3}:\n[:, :, 1] =\n 0.0 0.0 0.0\n 100.0 0.0 0.0\n 0.0 0.0 0.0\n\n[:, :, 2] =\n 0.0 0.0 0.0\n 0.0 100.0 0.0\n 0.0 0.0 0.0\n\njulia> x4 = cat([1 2 3; 4 5 6; 7 8 9]; dims=4);\n\njulia> Conv((2,2), 1 => 1, init=Flux.identity_init(gain=10), pad=SamePad())(x4)\n3×3×1×1 Array{Float32, 4}:\n[:, :, 1, 1] =\n 10.0 20.0 30.0\n 40.0 50.0 60.0\n 70.0 80.0 90.0\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.ones32","page":"Weight Initialisation","title":"Flux.ones32","text":"ones32(size...) = ones(Float32, size...)\n\nReturn an Array{Float32} of the given size filled with 1s.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.zeros32","page":"Weight Initialisation","title":"Flux.zeros32","text":"zeros32(size...) = zeros(Float32, size...)\n\nReturn an Array{Float32} of the given size filled with 0s.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.rand32","page":"Weight Initialisation","title":"Flux.rand32","text":"rand32([rng], size...)\n\nReturn an Array{Float32} of the given size, filled like rand. When the size is not provided, rand32(rng::AbstractRNG) returns a function.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.randn32","page":"Weight Initialisation","title":"Flux.randn32","text":"randn32([rng], size...)\n\nReturn an Array{Float32} of the given size, filled like randn. 
When the size is not provided, randn32(rng::AbstractRNG) returns a function.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.create_bias","page":"Weight Initialisation","title":"Flux.create_bias","text":"create_bias(weights, bias, size...)\n\nReturn a bias parameter for a layer, based on the value given to the constructor's keyword bias=bias.\n\nbias == true creates a trainable array of the given size, of the same type as weights, initialised to zero.\nbias == false returns false, which is understood by AD to be non-differentiable.\nbias::AbstractArray uses the array provided, provided it has the correct size. It will also correct the eltype to match that of weights.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/","page":"Weight Initialisation","title":"Weight Initialisation","text":"These functions call:","category":"page"},{"location":"reference/utilities/","page":"Weight Initialisation","title":"Weight Initialisation","text":"Flux.rng_from_array\nFlux.nfan","category":"page"},{"location":"reference/utilities/#Flux.rng_from_array","page":"Weight Initialisation","title":"Flux.rng_from_array","text":"rng_from_array(x)\n\nCreate an instance of the RNG most appropriate for x. 
The current defaults are:\n\nx isa CuArray: CUDA.default_rng()\nx isa AbstractArray: Random.default_rng()\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.nfan","page":"Weight Initialisation","title":"Flux.nfan","text":"nfan(n_out, n_in=1) -> Tuple\nnfan(dims...)\nnfan(dims::Tuple)\n\nFor a layer characterized by dimensions dims, return a tuple (fan_in, fan_out), where fan_in is the number of input neurons connected to an output one, and fan_out is the number of output neurons connected to an input one.\n\nThis function is mainly used by weight initializers, e.g., kaiming_normal.\n\nExamples\n\njulia> layer = Dense(10, 20);\n\njulia> Flux.nfan(size(layer.weight))\n(10, 20)\n\njulia> layer = Conv((3, 3), 2=>10);\n\njulia> Flux.nfan(size(layer.weight))\n(18, 90)\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Changing-the-type-of-all-parameters","page":"Weight Initialisation","title":"Changing the type of all parameters","text":"","category":"section"},{"location":"reference/utilities/","page":"Weight Initialisation","title":"Weight Initialisation","text":"The default eltype for models is Float32 since models are often trained/run on GPUs. The eltype of model m can be changed to Float64 by f64(m):","category":"page"},{"location":"reference/utilities/","page":"Weight Initialisation","title":"Weight Initialisation","text":"Flux.f64\nFlux.f32\nFlux.f16","category":"page"},{"location":"reference/utilities/#Flux.f64","page":"Weight Initialisation","title":"Flux.f64","text":"f64(m)\n\nConverts the eltype of model's floating point parameters to Float64. Recurses into structs marked with @layer.\n\nSee also f32 and f16.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.f32","page":"Weight Initialisation","title":"Flux.f32","text":"f32(m)\n\nConverts the eltype of model's floating point parameters to Float32 (which is Flux's default). 
Recurses into structs marked with @layer.\n\nSee also f64 and f16.\n\n\n\n\n\n","category":"function"},{"location":"reference/utilities/#Flux.f16","page":"Weight Initialisation","title":"Flux.f16","text":"f16(m)\n\nConverts the eltype of model's floating point parameters to Float16. Recurses into structs marked with @layer.\n\nSupport for Float16 is limited on many CPUs. Julia may convert to Float32 for each operation, which is slow.\n\nSee also f32 and f64.\n\nExample\n\njulia> m = Chain(Dense(784, 2048, relu), Dense(2048, 10)) # all Float32\nChain(\n Dense(784 => 2048, relu), # 1_607_680 parameters\n Dense(2048 => 10), # 20_490 parameters\n) # Total: 4 arrays, 1_628_170 parameters, 6.211 MiB.\n\njulia> m |> f16 # takes half the memory\nChain(\n Dense(784 => 2048, relu), # 1_607_680 parameters\n Dense(2048 => 10), # 20_490 parameters\n) # Total: 4 arrays, 1_628_170 parameters, 3.106 MiB.\n\n\n\n\n\n","category":"function"},{"location":"reference/outputsize/#Shape-Inference","page":"Shape Inference","title":"Shape Inference","text":"","category":"section"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":"Flux has some tools to help generate models in an automated fashion, by inferring the size of arrays that layers will receive, without doing any computation. This is especially useful for convolutional models, where the same Conv layer accepts any size of image, but the next layer may not. ","category":"page"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":"The higher-level tool is a macro @autosize which acts on the code defining the layers, and replaces each appearance of _ with the relevant size. 
This simple example returns a model with Dense(845 => 10) as the last layer:","category":"page"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":"@autosize (28, 28, 1, 32) Chain(Conv((3, 3), _ => 5, relu, stride=2), Flux.flatten, Dense(_ => 10))","category":"page"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":"The input size may be provided at runtime, like @autosize (sz..., 1, 32) Chain(Conv(..., but all the layer constructors containing _ must be explicitly written out – the macro sees the code as written.","category":"page"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":"This macro relies on a lower-level function outputsize, which you can also use directly:","category":"page"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":"c = Conv((3, 3), 1 => 5, relu, stride=2)\nFlux.outputsize(c, (28, 28, 1, 32)) # returns (13, 13, 5, 32)","category":"page"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":"The function outputsize works by passing a \"dummy\" array into the model, which propagates through very cheaply. It should work for all layers, including custom layers, out of the box.","category":"page"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":"An example of how to automate model building is this:","category":"page"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":"\"\"\"\n make_model(width, height, [inchannels, nclasses; layer_config])\n\nCreate a CNN for a given set of configuration parameters. 
Arguments:\n- `width`, `height`: the input image size in pixels\n- `inchannels`: the number of channels in the input image, default `1`\n- `nclasses`: the number of output classes, default `10`\n- Keyword `layer_config`: a vector of the number of channels per layer, default `[16, 16, 32, 64]`\n\"\"\"\nfunction make_model(width, height, inchannels = 1, nclasses = 10;\n layer_config = [16, 16, 32, 64])\n # construct a vector of layers:\n conv_layers = []\n push!(conv_layers, Conv((5, 5), inchannels => layer_config[1], relu, pad=SamePad()))\n for (inch, outch) in zip(layer_config, layer_config[2:end])\n push!(conv_layers, Conv((3, 3), inch => outch, sigmoid, stride=2))\n end\n\n # compute the output dimensions after these conv layers:\n conv_outsize = Flux.outputsize(conv_layers, (width, height, inchannels); padbatch=true)\n\n # use this to define appropriate Dense layer:\n last_layer = Dense(prod(conv_outsize) => nclasses)\n return Chain(conv_layers..., Flux.flatten, last_layer)\nend\n\nm = make_model(28, 28, 3, layer_config = [9, 17, 33, 65])\n\nFlux.outputsize(m, (28, 28, 3, 42)) == (10, 42) == size(m(randn(Float32, 28, 28, 3, 42)))","category":"page"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":"Alternatively, using the macro, the definition of make_model could end with:","category":"page"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":" # compute the output dimensions & construct appropriate Dense layer:\n return @autosize (width, height, inchannels, 1) Chain(conv_layers..., Flux.flatten, Dense(_ => nclasses))\nend","category":"page"},{"location":"reference/outputsize/#Listing","page":"Shape Inference","title":"Listing","text":"","category":"section"},{"location":"reference/outputsize/","page":"Shape Inference","title":"Shape Inference","text":"Flux.@autosize\nFlux.outputsize","category":"page"},{"location":"reference/outputsize/#Flux.@autosize","page":"Shape 
Inference","title":"Flux.@autosize","text":"@autosize (size...,) Chain(Layer(_ => 2), Layer(_), ...)\n\nReturns the specified model, with each _ replaced by an inferred number, for input of the given size.\n\nThe unknown sizes are usually the second-last dimension of that layer's input, which Flux regards as the channel dimension. (A few layers, Dense & LayerNorm, instead always use the first dimension.) The underscore may appear as an argument of a layer, or inside a =>. It may be used in further calculations, such as Dense(_ => _÷4).\n\nExamples\n\njulia> @autosize (3, 1) Chain(Dense(_ => 2, sigmoid), BatchNorm(_, affine=false))\nChain(\n Dense(3 => 2, σ), # 8 parameters\n BatchNorm(2, affine=false),\n) \n\njulia> img = [28, 28];\n\njulia> @autosize (img..., 1, 32) Chain( # size is only needed at runtime\n Chain(c = Conv((3,3), _ => 5; stride=2, pad=SamePad()),\n p = MeanPool((3,3)),\n b = BatchNorm(_),\n f = Flux.flatten),\n Dense(_ => _÷4, relu, init=Flux.rand32), # can calculate output size _÷4\n SkipConnection(Dense(_ => _, relu), +),\n Dense(_ => 10),\n )\nChain(\n Chain(\n c = Conv((3, 3), 1 => 5, pad=1, stride=2), # 50 parameters\n p = MeanPool((3, 3)),\n b = BatchNorm(5), # 10 parameters, plus 10\n f = Flux.flatten,\n ),\n Dense(80 => 20, relu), # 1_620 parameters\n SkipConnection(\n Dense(20 => 20, relu), # 420 parameters\n +,\n ),\n Dense(20 => 10), # 210 parameters\n) # Total: 10 trainable arrays, 2_310 parameters,\n # plus 2 non-trainable, 10 parameters, summarysize 10.469 KiB.\n\njulia> outputsize(ans, (28, 28, 1, 32))\n(10, 32)\n\nLimitations:\n\nWhile @autosize (5, 32) Flux.Bilinear(_ => 7) is OK, something like Bilinear((_, _) => 7) will fail.\nWhile Scale(_) and LayerNorm(_) are fine (and use the first dimension), Scale(_,_) and LayerNorm(_,_) will fail if size(x,1) != size(x,2).\n\n\n\n\n\n","category":"macro"},{"location":"reference/outputsize/#Flux.outputsize","page":"Shape Inference","title":"Flux.outputsize","text":"outputsize(m, x_size, 
y_size, ...; padbatch=false)\n\nFor model or layer m accepting multiple arrays as input, this returns size(m((x, y, ...))) given size_x = size(x), etc.\n\nExamples\n\njulia> x, y = rand(Float32, 5, 64), rand(Float32, 7, 64);\n\njulia> par = Parallel(vcat, Dense(5 => 9), Dense(7 => 11));\n\njulia> Flux.outputsize(par, (5, 64), (7, 64))\n(20, 64)\n\njulia> m = Chain(par, Dense(20 => 13), softmax);\n\njulia> Flux.outputsize(m, (5,), (7,); padbatch=true)\n(13, 1)\n\njulia> par(x, y) == par((x, y)) == Chain(par, identity)((x, y))\ntrue\n\nNotice that Chain only accepts multiple arrays as a tuple, while Parallel also accepts them as multiple arguments; outputsize always supplies the tuple.\n\n\n\n\n\n","category":"function"},{"location":"guide/performance/#man-performance-tips","page":"Performance Tips","title":"Performance Tips","text":"","category":"section"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"All the usual Julia performance tips apply. As always profiling your code is generally a useful way of finding bottlenecks. Below follow some Flux specific tips/reminders.","category":"page"},{"location":"guide/performance/#Don't-use-more-precision-than-you-need","page":"Performance Tips","title":"Don't use more precision than you need","text":"","category":"section"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"Flux works great with all kinds of number types. But often you do not need to be working with say Float64 (let alone BigFloat). Switching to Float32 can give you a significant speed up, not because the operations are faster, but because the memory usage is halved. Which means allocations occur much faster. 
And you use less memory.","category":"page"},{"location":"guide/performance/#Preserve-inputs'-types","page":"Performance Tips","title":"Preserve inputs' types","text":"","category":"section"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"Not only should your activation and loss functions be type-stable, they should also preserve the type of their inputs.","category":"page"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"A very artificial example using an activation function like","category":"page"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"my_tanh(x) = Float64(tanh(x))","category":"page"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"will result in performance on Float32 input orders of magnitude slower than the normal tanh would, because it results in having to use slow mixed-type multiplication in the dense layers. Similar situations can occur in the loss function during backpropagation.","category":"page"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"This means that if you change your data, say from Float64 to Float32 (which should give a speedup: see above), you will see a large slow-down.","category":"page"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"This can occur sneakily, because you can cause type promotion by interacting with numeric literals. E.g. the following will run into the same problem as above:","category":"page"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"leaky_tanh(x) = 0.01*x + tanh(x)","category":"page"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"While one could change the activation function (e.g. 
to use 0.01f0*x), the idiomatic (and safe) way to avoid type casts whenever the input type changes is to use oftype:","category":"page"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"leaky_tanh(x) = oftype(x/1, 0.01)*x + tanh(x)","category":"page"},{"location":"guide/performance/#Evaluate-batches-as-matrices-of-features","page":"Performance Tips","title":"Evaluate batches as matrices of features","text":"","category":"section"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"While it can sometimes be tempting to process your observations (feature vectors) one at a time, e.g.","category":"page"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"function loss_total(xs::AbstractVector{<:Vector}, ys::AbstractVector{<:Vector})\n sum(zip(xs, ys)) do (x, y_target)\n y_pred = model(x) # evaluate the model\n return loss(y_pred, y_target)\n end\nend","category":"page"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"it is much faster to concatenate them into a matrix, as this will hit BLAS matrix-matrix multiplication, which is much faster than the equivalent sequence of matrix-vector multiplications. The improvement is enough that it is worthwhile allocating new memory to store them contiguously.","category":"page"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"x_batch = reduce(hcat, xs)\ny_batch = reduce(hcat, ys)\n...\nfunction loss_total(x_batch::Matrix, y_batch::Matrix)\n y_preds = model(x_batch)\n sum(loss.(y_preds, y_batch))\nend","category":"page"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"When doing this kind of concatenation, use reduce(hcat, xs) rather than hcat(xs...). 
This will avoid the splatting penalty, and will hit the optimised reduce method.","category":"page"},{"location":"guide/performance/#Be-aware-of-GPU-memory-inefficiencies","page":"Performance Tips","title":"Be aware of GPU memory inefficiencies","text":"","category":"section"},{"location":"guide/performance/","page":"Performance Tips","title":"Performance Tips","text":"Currently, GPU memory is not handled as well as system memory. If your training loop is allocating significantly on the GPU, you can quickly fill your GPU memory and the piecemeal reclamation and shuffling of data between GPU and system memory can become extremely slow. If profiling shows that a significant portion of time is spent in the gpu function and your data sizes are not large, this may be the cause. Running an incremental garbage collection manually (GC.gc(false)) at regular intervals can keep your GPU memory free and responsive. See other tips for CUDA memory management here.","category":"page"},{"location":"#Flux:-The-Julia-Machine-Learning-Library","page":"Welcome","title":"Flux: The Julia Machine Learning Library","text":"","category":"section"},{"location":"","page":"Welcome","title":"Welcome","text":"Flux is a library for machine learning. It comes \"batteries-included\" with many useful tools built in, but also lets you use the full power of the Julia language where you need it. We follow a few key principles:","category":"page"},{"location":"","page":"Welcome","title":"Welcome","text":"Doing the obvious thing. Flux has relatively few explicit APIs. Instead, writing down the mathematical form will work – and be fast.\nExtensible by default. Flux is written to be highly flexible while being performant. Extending Flux is as simple as using your own code as part of the model you want - it is all high-level Julia code.\nPlay nicely with others. 
Flux works well with unrelated Julia libraries from images to differential equation solvers, rather than duplicating them.","category":"page"},{"location":"#Installation","page":"Welcome","title":"Installation","text":"","category":"section"},{"location":"","page":"Welcome","title":"Welcome","text":"Download Julia 1.9 or later, preferably the current stable release. You can add Flux using Julia's package manager, by typing ] add Flux in the Julia prompt. For Nvidia GPU support, you will also need to install the CUDA and the cuDNN packages. For AMD GPU support, install the AMDGPU package. For acceleration on Apple Silicon, install the Metal package.","category":"page"},{"location":"#Learning-Flux","page":"Welcome","title":"Learning Flux","text":"","category":"section"},{"location":"","page":"Welcome","title":"Welcome","text":"The quick start page trains a simple neural network.","category":"page"},{"location":"","page":"Welcome","title":"Welcome","text":"The rest of the guide provides a from-scratch introduction to Flux's take on models and how they work, starting with fitting a line. Once you understand these docs, congratulations, you also understand Flux's source code, which is intended to be concise, legible and a good reference for more advanced concepts.","category":"page"},{"location":"","page":"Welcome","title":"Welcome","text":"There are some tutorials about building particular models. The model zoo has starting points for many other common ones. 
And finally, the ecosystem page lists packages which define Flux models.","category":"page"},{"location":"","page":"Welcome","title":"Welcome","text":"The reference section includes, beside Flux's own functions, those of some companion packages: Zygote.jl (automatic differentiation), Optimisers.jl (training) and others.","category":"page"},{"location":"#Community","page":"Welcome","title":"Community","text":"","category":"section"},{"location":"","page":"Welcome","title":"Welcome","text":"Everyone is welcome to join our community on the Julia discourse forum, or the slack chat (channel #machine-learning). If you have questions or issues we'll try to help you out.","category":"page"},{"location":"","page":"Welcome","title":"Welcome","text":"If you're interested in hacking on Flux, the source code is open and easy to understand – it's all just the same Julia code you work with normally. You might be interested in our intro issues to get started, or our contributing guide.","category":"page"},{"location":"tutorials/linear_regression/#man-linear-regression","page":"Linear Regression","title":"Tutorial: Linear Regression","text":"","category":"section"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Flux is a pure Julia ML stack that allows you to build predictive models. Here are the steps for a typical Flux program:","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Provide training and test data\nBuild a model with configurable parameters to make predictions\nIteratively train the model by tweaking the parameters to improve predictions\nVerify your model","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Under the hood, Flux uses a technique called automatic differentiation to take gradients that help improve predictions. 
Flux is also fully written in Julia so you can easily replace any layer of Flux with your own code to improve your understanding or satisfy special requirements.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The following page contains a step-by-step walkthrough of the linear regression algorithm in Julia using Flux! We will start by creating a simple linear regression model for dummy data and then move on to a real dataset. The first part would involve writing some parts of the model on our own, which will later be replaced by Flux.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Let us start by building a simple linear regression model. This model would be trained on the data points of the form (x₁, y₁), (x₂, y₂), ... , (xₙ, yₙ). In the real world, these xs can have multiple features, and the ys denote a label. 
In our example, each x has a single feature; hence, our data would have n data points, each point mapping a single feature to a single label.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Importing the required Julia packages -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> using Flux, Plots","category":"page"},{"location":"tutorials/linear_regression/#Generating-a-dataset","page":"Linear Regression","title":"Generating a dataset","text":"","category":"section"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The data usually comes from the real world, which we will be exploring in the last part of this tutorial, but we don't want to jump straight to the relatively harder part. Here we will generate the xs of our data points and map them to the respective ys using a simple function. Remember, here each x is equivalent to a feature, and each y is the corresponding label. Combining all the xs and ys would create the complete dataset.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> x = hcat(collect(Float32, -3:0.1:3)...)\n1×61 Matrix{Float32}:\n -3.0 -2.9 -2.8 -2.7 -2.6 -2.5 … 2.4 2.5 2.6 2.7 2.8 2.9 3.0","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The hcat call generates a Matrix with numbers ranging from -3.0 to 3.0 with a gap of 0.1 between them. Each column of this matrix holds a single x, a total of 61 xs. The next step would be to generate the corresponding labels or the ys.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> f(x) = @. 
3x + 2;\n\njulia> y = f(x)\n1×61 Matrix{Float32}:\n -7.0 -6.7 -6.4 -6.1 -5.8 -5.5 … 9.5 9.8 10.1 10.4 10.7 11.0","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The function f maps each x to a y, and as x is a Matrix, the expression broadcasts the scalar values using @. macro. Our data points are ready, but they are too perfect. In a real-world scenario, we will not have an f function to generate y values, but instead, the labels would be manually added.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> x = x .* reshape(rand(Float32, 61), (1, 61));","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Visualizing the final data -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> plot(vec(x), vec(y), lw = 3, seriestype = :scatter, label = \"\", title = \"Generated data\", xlabel = \"x\", ylabel= \"y\");","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"(Image: linear-regression-data)","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The data looks random enough now! 
The x and y values are still somewhat correlated; hence, the linear regression algorithm should work fine on our dataset.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"We can now proceed ahead and build a model for our dataset!","category":"page"},{"location":"tutorials/linear_regression/#Building-a-model","page":"Linear Regression","title":"Building a model","text":"","category":"section"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"A linear regression model is defined mathematically as -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"model(W b x) = Wx + b","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"where W is the weight matrix and b is the bias. For our case, the weight matrix (W) would constitute only a single element, as we have only a single feature. We can define our model in Julia using the exact same notation!","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> custom_model(W, b, x) = @. W*x + b\ncustom_model (generic function with 1 method)","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The @. macro allows you to perform the calculations by broadcasting the scalar quantities (for example - the bias).","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The next step would be to initialize the model parameters, which are the weight and the bias. 
There are a lot of initialization techniques available for different machine learning models, but for the sake of this example, let's pull out the weight from a uniform distribution and initialize the bias as 0.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> W = rand(Float32, 1, 1)\n1×1 Matrix{Float32}:\n 0.99285793\n\njulia> b = [0.0f0]\n1-element Vector{Float32}:\n 0.0","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Time to test if our model works!","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> custom_model(W, b, x) |> size\n(1, 61)\n\njulia> custom_model(W, b, x)[1], y[1]\n(-1.6116865f0, -7.0f0)","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"It does! But the predictions are way off. We need to train the model to improve the predictions, but before training the model we need to define the loss function. The loss function would ideally output a quantity that we will try to minimize during the entire training process. Here we will use the mean sum squared error loss function.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> function custom_loss(W, b, x, y)\n ŷ = custom_model(W, b, x)\n sum((y .- ŷ).^2) / length(x)\n end;\n\njulia> custom_loss(W, b, x, y)\n23.772217f0","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Calling the loss function on our xs and ys shows how far our predictions (ŷ) are from the real labels. 
More precisely, it calculates the sum of the squares of residuals and divides it by the total number of data points.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"We have successfully defined our model and the loss function, but surprisingly, we haven't used Flux anywhere till now. Let's see how we can write the same code using Flux. ","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> flux_model = Dense(1 => 1)\nDense(1 => 1) # 2 parameters","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"A Dense(1 => 1) layer denotes a layer of one neuron with one input (one feature) and one output. This layer is exactly the same as the mathematical model we defined above! Under the hood, Flux calculates the output using the same expression! This time, however, we don't have to initialize the parameters ourselves; Flux does it for us.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> flux_model.weight, flux_model.bias\n(Float32[-1.2678515;;], Float32[0.0])","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Now we can check if our model is acting right. We can pass the complete data in one go, with each x having exactly one feature (one input) -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> flux_model(x) |> size\n(1, 61)\n\njulia> flux_model(x)[1], y[1]\n(-1.8525281f0, -7.0f0)","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"It is! 
The next step would be defining the loss function using Flux's functions -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> function flux_loss(flux_model, x, y)\n ŷ = flux_model(x)\n Flux.mse(ŷ, y)\n end;\n\njulia> flux_loss(flux_model, x, y)\n22.74856f0","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Everything works as before! It almost feels like Flux provides us with smart wrappers for the functions we could have written on our own. Now, as the last step of this section, let's see how different the flux_model is from our custom_model. A good way to go about this would be to fix the parameters of both models to be the same. Let's change the parameters of our custom_model to match those of the flux_model -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> W = Float32[1.1412252]\n1-element Vector{Float32}:\n 1.1412252","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"To check how both the models are performing on the data, let's find out the losses using the custom_loss and flux_loss functions -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> custom_loss(W, b, x, y), flux_loss(flux_model, x, y)\n(22.74856f0, 22.74856f0)","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The losses are identical! This means that our model and the flux_model are identical on some level, and the loss functions are completely identical! The difference in models would be that Flux's Dense layer supports many other arguments that can be used to customize the layer further. 
But, for this tutorial, let us stick to our simple custom_model.","category":"page"},{"location":"tutorials/linear_regression/#Training-the-model","page":"Linear Regression","title":"Training the model","text":"","category":"section"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Let's train our model using the classic Gradient Descent algorithm. According to the gradient descent algorithm, the weights and biases should be iteratively updated using the following mathematical equations -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"beginaligned\nW = W - eta * fracdLdW \nb = b - eta * fracdLdb\nendaligned","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Here, W is the weight matrix, b is the bias vector, eta is the learning rate, fracdLdW is the derivative of the loss function with respect to the weight, and fracdLdb is the derivative of the loss function with respect to the bias.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The derivatives are calculated using an Automatic Differentiation tool, and Flux uses Zygote.jl for the same. Since Zygote.jl is an independent Julia package, it can be used outside of Flux as well! Refer to the documentation of Zygote.jl for more information on the same.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Our first step would be to obtain the gradient of the loss function with respect to the weights and the biases. 
Flux re-exports Zygote's gradient function; hence, we don't need to import Zygote explicitly to use the functionality.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> dLdW, dLdb, _, _ = gradient(custom_loss, W, b, x, y);","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"We can now update the parameters, following the gradient descent algorithm -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> W .= W .- 0.1 .* dLdW\n1-element Vector{Float32}:\n 1.8144473\n\njulia> b .= b .- 0.1 .* dLdb\n1-element Vector{Float32}:\n 0.41325632","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The parameters have been updated! We can now check the value of the loss function -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> custom_loss(W, b, x, y)\n17.157953f0","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The loss went down! This means that we successfully trained our model for one epoch. We can plug the training code written above into a loop and train the model for a higher number of epochs. It can be customized either to have a fixed number of epochs or to stop when certain conditions are met, for example, change in loss < 0.1. 
The loop can be tailored to suit the user's needs, and the conditions can be specified in plain Julia!","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Let's plug our super training logic inside a function and test it again -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> function train_custom_model()\n dLdW, dLdb, _, _ = gradient(custom_loss, W, b, x, y)\n @. W = W - 0.1 * dLdW\n @. b = b - 0.1 * dLdb\n end;\n\njulia> train_custom_model();\n\njulia> W, b, custom_loss(W, b, x, y)\n(Float32[2.340657], Float32[0.7516814], 13.64972f0)","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"It works, and the loss went down again! This was the second epoch of our training procedure. Let's plug this in a for loop and train the model for 40 more epochs.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> for i = 1:40\n train_custom_model()\n end\n\njulia> W, b, custom_loss(W, b, x, y)\n(Float32[4.2422233], Float32[2.2460847], 7.6680417f0)","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"There was a significant reduction in loss, and the parameters were updated!","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"We can train the model even more or tweak the hyperparameters to achieve the desired result faster, but let's stop here. We trained our model for 42 epochs, and the loss went down from 22.74856 to 7.6680417. 
Time for some visualization!","category":"page"},{"location":"tutorials/linear_regression/#Results","page":"Linear Regression","title":"Results","text":"","category":"section"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The main objective of this tutorial was to fit a line to our dataset using the linear regression algorithm. The training procedure went well, and the loss went down significantly! Let's see what the fitted line looks like. Remember, Wx + b is nothing more than a line's equation, with slope = W[1] and y-intercept = b[1] (indexing at 1 as W and b are iterable).","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Plotting the line and the data points using Plots.jl -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> plot(reshape(x, (61, 1)), reshape(y, (61, 1)), lw = 3, seriestype = :scatter, label = \"\", title = \"Simple Linear Regression\", xlabel = \"x\", ylabel= \"y\");\n\njulia> plot!((x) -> b[1] + W[1] * x, -3, 3, label=\"Custom model\", lw=2);","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"(Image: linear-regression-line)","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The line fits well! There is room for improvement, but we leave that up to you! You can play with the optimisers, the number of epochs, learning rate, etc. 
to improve the fitting and reduce the loss!","category":"page"},{"location":"tutorials/linear_regression/#Linear-regression-model-on-a-real-dataset","page":"Linear Regression","title":"Linear regression model on a real dataset","text":"","category":"section"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"We now move on to a relatively complex linear regression model. Here we will use a real dataset from MLDatasets.jl, which will not confine our data points to have only one feature. Let's start by importing the required packages -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> using Flux, Statistics, MLDatasets, DataFrames","category":"page"},{"location":"tutorials/linear_regression/#Gathering-real-data","page":"Linear Regression","title":"Gathering real data","text":"","category":"section"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Let's start by initializing our dataset. We will be using the BostonHousing dataset consisting of 506 data points. Each of these data points has 13 features and a corresponding label, the house's price. The xs are still mapped to a single y, but now, a single x data point has 13 features. 
","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> dataset = BostonHousing();\n\njulia> x, y = BostonHousing(as_df=false)[:];\n\njulia> x, y = Float32.(x), Float32.(y);","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"We can now split the obtained data into training and testing data -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> x_train, x_test, y_train, y_test = x[:, 1:400], x[:, 401:end], y[:, 1:400], y[:, 401:end];\n\njulia> x_train |> size, x_test |> size, y_train |> size, y_test |> size\n((13, 400), (13, 106), (1, 400), (1, 106))","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"This data contains a diverse number of features, which means that the features have different scales. A wise option here would be to normalise the data, making the training process more efficient and fast. Let's check the standard deviation of the training data before normalising it.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> std(x_train)\n134.06786f0","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The data is indeed not normalised. We can use the Flux.normalise function to normalise the training data.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> x_train_n = Flux.normalise(x_train);\n\njulia> std(x_train_n)\n1.0000844f0","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The standard deviation is now close to one! 
Our data is ready!","category":"page"},{"location":"tutorials/linear_regression/#Building-a-Flux-model","page":"Linear Regression","title":"Building a Flux model","text":"","category":"section"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"We can now directly use Flux and let it do all the work internally! Let's define a model that takes in 13 inputs (13 features) and gives us a single output (the label). We will then pass our entire data through this model in one go, and Flux will handle everything for us! Remember, we could have declared a model in plain Julia as well. The model will have 14 parameters: 13 weights and 1 bias.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> model = Dense(13 => 1)\nDense(13 => 1) # 14 parameters","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Same as before, our next step would be to define a loss function to quantify our accuracy somehow. 
The lower the loss, the better the model!","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> function loss(model, x, y)\n ŷ = model(x)\n Flux.mse(ŷ, y)\n end;\n\njulia> loss(model, x_train_n, y_train)\n676.1656f0","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"We can now proceed to the training phase!","category":"page"},{"location":"tutorials/linear_regression/#Training-the-Flux-model","page":"Linear Regression","title":"Training the Flux model","text":"","category":"section"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The training procedure would make use of the same mathematics, but now we can pass in the model inside the gradient call and let Flux and Zygote handle the derivatives!","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> function train_model()\n dLdm, _, _ = gradient(loss, model, x_train_n, y_train)\n @. model.weight = model.weight - 0.000001 * dLdm.weight\n @. model.bias = model.bias - 0.000001 * dLdm.bias\n end;","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Contrary to our last training procedure, let's say that this time we don't want to hardcode the number of epochs. We want the training procedure to stop when the loss converges, that is, when change in loss < δ. 
The quantity δ can be altered according to a user's need, but let's fix it to 10⁻³ for this tutorial.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"We can write such custom training loops effortlessly using Flux and plain Julia!","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> loss_init = Inf;\n\njulia> while true\n train_model()\n if loss_init == Inf\n loss_init = loss(model, x_train_n, y_train)\n continue\n end\n if abs(loss_init - loss(model, x_train_n, y_train)) < 1e-4\n break\n else\n loss_init = loss(model, x_train_n, y_train)\n end\n end;","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The code starts by initializing an initial value for the loss, infinity. Next, it runs an infinite loop that breaks if change in loss < 10⁻³, or the code changes the value of loss_init to the current loss and moves on to the next iteration.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"This custom loop works! This shows how easily a user can write down any custom training routine using Flux and Julia!","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Let's have a look at the loss -","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> loss(model, x_train_n, y_train)\n27.1272f0","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The loss went down significantly! 
It can be minimized further by choosing an even smaller δ.","category":"page"},{"location":"tutorials/linear_regression/#Testing-the-Flux-model","page":"Linear Regression","title":"Testing the Flux model","text":"","category":"section"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The last step of this tutorial would be to test our model using the testing data. We will first normalise the testing data and then calculate the corresponding loss.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"julia> x_test_n = Flux.normalise(x_test);\n\njulia> loss(model, x_test_n, y_test)\n66.91015f0","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"The loss is not as small as the loss of the training data, but it looks good! This also shows that our model is not overfitting!","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Summarising this tutorial, we started by generating a random yet correlated dataset for our custom model. We then saw how a simple linear regression model could be built with and without Flux, and how they were almost identical. ","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"Next, we trained the model by manually writing down the Gradient Descent algorithm and optimising the loss. We also saw how Flux provides various wrapper functionalities and keeps the API extremely intuitive and simple for the users. 
","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"After getting familiar with the basics of Flux and Julia, we moved ahead to build a machine learning model for a real dataset. We repeated the exact same steps, but this time with a lot more features and data points, and by harnessing Flux's full capabilities. In the end, we developed a training loop that was smarter than the hardcoded one and ran the model on our normalised dataset to conclude the tutorial.","category":"page"},{"location":"tutorials/linear_regression/","page":"Linear Regression","title":"Linear Regression","text":"info: Info\nOriginally published on 21 November 2022, by Saransh Chopra.","category":"page"},{"location":"guide/saving/#Saving-and-Loading-Models","page":"Saving & Loading","title":"Saving and Loading Models","text":"","category":"section"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"You may wish to save models so that they can be loaded and run in a later session. Flux provides a number of ways to do this. 
The recommended way, which is the most robust one for long term storage, is to use Flux.state in combination with a serialization format like JLD2.jl or BSON.jl.","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"Save a model:","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"julia> using Flux\n\njulia> struct MyModel\n net\n end\n\njulia> Flux.@layer MyModel\n\njulia> MyModel() = MyModel(Chain(Dense(10 => 5, relu), Dense(5 => 2)));\n\njulia> model = MyModel()\nMyModel(Chain(Dense(10 => 5, relu), Dense(5 => 2))) # 67 parameters\n\njulia> model_state = Flux.state(model);\n\njulia> using JLD2\n\njulia> jldsave(\"mymodel.jld2\"; model_state)","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"Load it again in a new session using Flux.loadmodel!:","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"julia> using Flux, JLD2\n\njulia> model_state = JLD2.load(\"mymodel.jld2\", \"model_state\");\n\njulia> model = MyModel(); # MyModel definition must be available\n\njulia> Flux.loadmodel!(model, model_state);","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"note: Note\nIf a saved model's parameters are stored on the GPU, the model will not load later on if there is no GPU support available. It's best to move your model to the CPU with cpu(model) before saving it.","category":"page"},{"location":"guide/saving/#Checkpointing","page":"Saving & Loading","title":"Checkpointing","text":"","category":"section"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"In longer training runs it's a good idea to periodically save your model, so that you can resume if training is interrupted (for example, if there's a power cut). 
","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"julia> using Flux: throttle\n\njulia> using JLD2\n\njulia> m = Chain(Dense(10 => 5, relu), Dense(5 => 2))\nChain(\n Dense(10 => 5, relu), # 55 parameters\n Dense(5 => 2), # 12 parameters\n) # Total: 4 arrays, 67 parameters, 524 bytes.\n\njulia> for epoch in 1:10\n # ... train model ...\n jldsave(\"model-checkpoint.jld2\", model_state = Flux.state(m))\n end;","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"This will update the \"model-checkpoint.jld2\" every epoch.","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"You can get more advanced by saving a series of models throughout training, for example","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"jldsave(\"model-$(now()).jld2\", model_state = Flux.state(m))","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"will produce a series of models like \"model-2018-03-06T02:57:10.41.jld2\". You could also store the current test set loss, so that it's easy to (for example) revert to an older copy of the model if it starts to overfit.","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"jldsave(\"model-$(now()).jld2\", model_state = Flux.state(m), loss = testloss())","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"Note that to resume a model's training, you might need to restore other stateful parts of your training loop. 
Possible examples are the optimiser state and the randomness used to partition the original data into the training and validation sets.","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"You can store the optimiser state alongside the model, to resume training exactly where you left off: ","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"model = MyModel()\nopt_state = Flux.setup(AdamW(), model)\n\n# ... train model ...\n\nmodel_state = Flux.state(model)\njldsave(\"checkpoint_epoch=42.jld2\"; model_state, opt_state)","category":"page"},{"location":"guide/saving/#Saving-Models-as-Julia-Structs","page":"Saving & Loading","title":"Saving Models as Julia Structs","text":"","category":"section"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"Models are just normal Julia structs, so it's fine to use any Julia storage format to save the struct as it is instead of saving the state returned by Flux.state. 
BSON.jl is particularly convenient for this, since it can also save anonymous functions, which are sometimes part of a model definition.","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"Save a model:","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"julia> using Flux\n\njulia> model = Chain(Dense(10 => 5, NNlib.relu), Dense(5 => 2));\n\njulia> using BSON: @save\n\njulia> @save \"mymodel.bson\" model","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"Load it again in a new session:","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"julia> using Flux, BSON\n\njulia> BSON.@load \"mymodel.bson\" model\n\njulia> model\nChain(\n Dense(10 => 5, relu), # 55 parameters\n Dense(5 => 2), # 12 parameters\n) # Total: 4 arrays, 67 parameters, 524 bytes.","category":"page"},{"location":"guide/saving/","page":"Saving & Loading","title":"Saving & Loading","text":"warning: Warning\nSaving models this way could lead to compatibility issues across julia versions and across Flux versions if some of the Flux layers' internals are changed. It is therefore not recommended for long term storage, use Flux.state instead.","category":"page"}] } diff --git a/previews/PR2464/tutorials/linear_regression/index.html b/previews/PR2464/tutorials/linear_regression/index.html index 3ece515b01..722c4a6f3d 100644 --- a/previews/PR2464/tutorials/linear_regression/index.html +++ b/previews/PR2464/tutorials/linear_regression/index.html @@ -106,4 +106,4 @@ 27.1272f0

The loss went down significantly! It can be minimized further by choosing an even smaller δ.

Testing the Flux model

The last step of this tutorial is to test our model on the testing data. We will first normalise the testing data and then calculate the corresponding loss.

julia> x_test_n = Flux.normalise(x_test);
 
 julia> loss(model, x_test_n, y_test)
-66.91015f0

The test loss (66.91) is higher than the training loss (27.13), but it is of the same order of magnitude, which suggests that our model is not overfitting badly!


Summarising this tutorial, we started by generating a random yet correlated dataset for our custom model. We then saw how a simple linear regression model could be built with and without Flux, and how they were almost identical.

Next, we trained the model by manually writing down the Gradient Descent algorithm and optimising the loss. We also saw how Flux provides various wrapper functionalities and keeps the API extremely intuitive and simple for the users.

After getting familiar with the basics of Flux and Julia, we moved ahead to build a machine learning model for a real dataset. We repeated the exact same steps, but this time with a lot more features and data points, and by harnessing Flux's full capabilities. In the end, we developed a training loop that was smarter than the hardcoded one and ran the model on our normalised dataset to conclude the tutorial.

Info

Originally published on 21 November 2022, by Saransh Chopra.

+66.91015f0

The test loss (66.91) is higher than the training loss (27.13), but it is of the same order of magnitude, which suggests that our model is not overfitting badly!


Summarising this tutorial, we started by generating a random yet correlated dataset for our custom model. We then saw how a simple linear regression model could be built with and without Flux, and how they were almost identical.

Next, we trained the model by manually writing down the Gradient Descent algorithm and optimising the loss. We also saw how Flux provides various wrapper functionalities and keeps the API extremely intuitive and simple for the users.

After getting familiar with the basics of Flux and Julia, we moved ahead to build a machine learning model for a real dataset. We repeated the exact same steps, but this time with a lot more features and data points, and by harnessing Flux's full capabilities. In the end, we developed a training loop that was smarter than the hardcoded one and ran the model on our normalised dataset to conclude the tutorial.

Info

Originally published on 21 November 2022, by Saransh Chopra.

diff --git a/previews/PR2464/tutorials/logistic_regression/index.html b/previews/PR2464/tutorials/logistic_regression/index.html index ca09934f83..9bc9254a93 100644 --- a/previews/PR2464/tutorials/logistic_regression/index.html +++ b/previews/PR2464/tutorials/logistic_regression/index.html @@ -131,4 +131,4 @@ flux_accuracy(x, y) = 0.98 julia> flux_loss(flux_model, x, flux_y_onehot) -0.6952386604624324

We see a very similar final loss and accuracy.


Summarising this tutorial, we saw how we can run a logistic regression algorithm in Julia with and without using Flux. We started by importing the classic Iris dataset and one-hot encoding the labels. Next, we defined our model, the loss function, and the accuracy, all by ourselves.

Finally, we trained the model by manually writing down the Gradient Descent algorithm and optimising the loss. Interestingly, we implemented most of the functions on our own, and then compared them side by side with the equivalents provided by Flux!

Info

Originally published on 1st April 2023, by Saransh Chopra.

+0.6952386604624324

We see a very similar final loss and accuracy.


Summarising this tutorial, we saw how we can run a logistic regression algorithm in Julia with and without using Flux. We started by importing the classic Iris dataset and one-hot encoding the labels. Next, we defined our model, the loss function, and the accuracy, all by ourselves.

Finally, we trained the model by manually writing down the Gradient Descent algorithm and optimising the loss. Interestingly, we implemented most of the functions on our own, and then compared them side by side with the equivalents provided by Flux!

Info

Originally published on 1st April 2023, by Saransh Chopra.

diff --git a/previews/PR2464/tutorials/model_zoo/index.html b/previews/PR2464/tutorials/model_zoo/index.html index f63d6c7e78..09971ec682 100644 --- a/previews/PR2464/tutorials/model_zoo/index.html +++ b/previews/PR2464/tutorials/model_zoo/index.html @@ -3,4 +3,4 @@ function gtag(){dataLayer.push(arguments);} gtag('js', new Date()); gtag('config', 'UA-36890222-9', {'page_path': location.pathname + location.search + location.hash}); -
+