diff --git a/dev/.documenter-siteinfo.json b/dev/.documenter-siteinfo.json index 90ea2ba0c9..79cb38c178 100644 --- a/dev/.documenter-siteinfo.json +++ b/dev/.documenter-siteinfo.json @@ -1 +1 @@ -{"documenter":{"julia_version":"1.10.4","generation_timestamp":"2024-08-03T08:23:32","documenter_version":"1.5.0"}} \ No newline at end of file +{"documenter":{"julia_version":"1.10.4","generation_timestamp":"2024-08-09T18:43:59","documenter_version":"1.5.0"}} \ No newline at end of file diff --git a/dev/ecosystem/index.html b/dev/ecosystem/index.html index 5e34e3bde6..8fcac343ea 100644 --- a/dev/ecosystem/index.html +++ b/dev/ecosystem/index.html @@ -3,4 +3,4 @@ function gtag(){dataLayer.push(arguments);} gtag('js', new Date()); gtag('config', 'UA-36890222-9', {'page_path': location.pathname + location.search + location.hash}); -

The Julia Ecosystem around Flux

One of the main strengths of Julia lies in an ecosystem of packages globally providing a rich and consistent user experience.

This is a non-exhaustive list of Julia packages, nicely complementing Flux in typical machine learning and deep learning workflows. To add your project please send a PR. See also academic work citing Flux or citing Zygote.

Flux models

  • Flux's model-zoo contains examples from many domains.

Computer vision

  • ObjectDetector.jl provides ready-to-go image detection via YOLO.
  • Metalhead.jl includes many state-of-the-art computer vision models which can easily be used for transfer learning.
  • UNet.jl is a generic UNet implementation.

Natural language processing

  • Transformers.jl provides components for Transformer models for NLP, as well as providing several trained models out of the box.
  • TextAnalysis.jl provides several NLP algorithms that use Flux models under the hood.

Reinforcement learning

  • AlphaZero.jl provides a generic, simple and fast implementation of Deepmind's AlphaZero algorithm.
  • ReinforcementLearning.jl offers a collection of tools for doing reinforcement learning research in Julia.

Graph learning

  • GraphNeuralNetworks.jl is a fresh, performant and flexible graph neural network library based on Flux.jl.
  • GeometricFlux.jl is the first graph neural network library for julia.
  • NeuralOperators.jl enables training infinite dimensional PDEs by learning a continuous function instead of using the finite element method.
  • SeaPearl.jl is a Constraint Programming solver that uses Reinforcement Learning based on graphs as input.

Time series

Robust networks

  • RobustNeuralNetworks.jl includes classes of neural networks that are constructed to naturally satisfy robustness constraints.

Tools closely associated with Flux

Utility tools you're unlikely to have met if you never used Flux!

High-level training flows

  • FastAI.jl is a Julia port of Python's fast.ai library.
  • FluxTraining.jl is a package for using and writing powerful, extensible training loops for deep learning models. It supports callbacks for many common use cases like hyperparameter scheduling, metrics tracking and logging, checkpointing, early stopping, and more. It powers training in FastAI.jl
  • Ignite.jl is a Julia port of the Python library ignite for simplifying neural network training and validation loops, using events and handlers.
  • Tsunami.jl adds high-level ways to control training, parameter schedules & logging, heavily inspired by pytorch-lightning.

Datasets

Commonly used machine learning datasets are provided by the following packages in the julia ecosystem:

Plumbing

Tools to put data into the right order for creating a model.

  • Augmentor.jl is a real-time library augmentation library for increasing the number of training images.
  • DataAugmentation.jl aims to make it easy to build stochastic, label-preserving augmentation pipelines for vision use cases involving images, keypoints and segmentation masks.
  • MLUtils.jl (replaces MLDataUtils.jl and MLLabelUtils.jl) is a library for processing Machine Learning datasets.

Parameters


Differentiable programming

Packages based on differentiable programming but not necessarily related to Machine Learning.

  • The SciML ecosystem uses Flux and Zygote to mix neural nets with differential equations, to get the best of black box and mechanistic modelling.
  • DiffEqFlux.jl provides tools for creating Neural Differential Equations.
  • Flux3D.jl shows off machine learning on 3D data.
  • RayTracer.jl combines ML with computer vision via a differentiable renderer.
  • Duckietown.jl Differentiable Duckietown simulator.
  • The Yao.jl project uses Flux and Zygote for Quantum Differentiable Programming.
  • AtomicGraphNets.jl enables learning graph based models on atomic systems used in chemistry.
  • DiffImages.jl differentiable computer vision modeling in Julia with the Images.jl ecosystem.

Probabilistic programming

  • Turing.jl extends Flux's differentiable programming capabilities to probabilistic programming.
  • Omega.jl is a research project aimed at causal, higher-order probabilistic programming.
  • Stheno.jl provides flexible Gaussian processes.

Statistics


Useful miscellaneous packages

Some useful and random packages!

  • AdversarialPrediction.jl provides a way to easily optimise generic performance metrics in supervised learning settings using the Adversarial Prediction framework.
  • Mill.jl helps to prototype flexible multi-instance learning models.
  • MLMetrics.jl is a utility for scoring models in data science and machine learning.
  • Torch.jl exposes torch in Julia.
  • ValueHistories.jl is a utility for efficient tracking of optimization histories, training curves or other information of arbitrary types and at arbitrarily spaced sampling times.
  • InvertibleNetworks.jl Building blocks for invertible neural networks in the Julia programming language.
  • ProgressMeter.jl progress meters for long-running computations.
  • TensorBoardLogger.jl easy peasy logging to tensorboard in Julia
  • ArgParse.jl is a package for parsing command-line arguments to Julia programs.
  • Parameters.jl types with default field values, keyword constructors and (un-)pack macros.
  • BSON.jl is a package for working with the Binary JSON serialisation format.
  • DataFrames.jl in-memory tabular data in Julia.
  • DrWatson.jl is a scientific project assistant software.

This tight integration among Julia packages is shown in some of the examples in the model-zoo repository.


Alternatives to Flux

Julia has several other libraries for making neural networks.

  • SimpleChains.jl is focused on making small, simple, CPU-based, neural networks fast. Uses LoopVectorization.jl. (Was FastChain in DiffEqFlux.jl)

  • Knet.jl is a neural network library built around AutoGrad.jl.

  • Lux.jl (earlier ExplicitFluxLayers.jl) shares much of the design, use-case, and NNlib.jl / Optimisers.jl back-end of Flux. But instead of encapsulating all parameters within the model structure, it separates this into 3 components: a model, a tree of parameters, and a tree of model states.

Explicit or explicit?

Flux's training docs talk about changes from Zygote's implicit to explicit gradients, dictionary-like to tree-like structures. (See also Zygote's description of these.) Lux also uses Zygote, but uses the word "explicit" to mean something unrelated, namely storing the tree of parameters (and of state) separately from the model.

+

The Julia Ecosystem around Flux

One of the main strengths of Julia lies in an ecosystem of packages globally providing a rich and consistent user experience.

This is a non-exhaustive list of Julia packages, nicely complementing Flux in typical machine learning and deep learning workflows. To add your project please send a PR. See also academic work citing Flux or citing Zygote.

Flux models

  • Flux's model-zoo contains examples from many domains.

Computer vision

  • ObjectDetector.jl provides ready-to-go image detection via YOLO.
  • Metalhead.jl includes many state-of-the-art computer vision models which can easily be used for transfer learning.
  • UNet.jl is a generic UNet implementation.

Natural language processing

  • Transformers.jl provides components for Transformer models for NLP, as well as providing several trained models out of the box.
  • TextAnalysis.jl provides several NLP algorithms that use Flux models under the hood.

Reinforcement learning

  • AlphaZero.jl provides a generic, simple and fast implementation of Deepmind's AlphaZero algorithm.
  • ReinforcementLearning.jl offers a collection of tools for doing reinforcement learning research in Julia.

Graph learning

  • GraphNeuralNetworks.jl is a fresh, performant and flexible graph neural network library based on Flux.jl.
  • GeometricFlux.jl is the first graph neural network library for julia.
  • NeuralOperators.jl enables training infinite dimensional PDEs by learning a continuous function instead of using the finite element method.
  • SeaPearl.jl is a Constraint Programming solver that uses Reinforcement Learning based on graphs as input.

Time series

Robust networks

  • RobustNeuralNetworks.jl includes classes of neural networks that are constructed to naturally satisfy robustness constraints.

Tools closely associated with Flux

Utility tools you're unlikely to have met if you never used Flux!

High-level training flows

  • FastAI.jl is a Julia port of Python's fast.ai library.
  • FluxTraining.jl is a package for using and writing powerful, extensible training loops for deep learning models. It supports callbacks for many common use cases like hyperparameter scheduling, metrics tracking and logging, checkpointing, early stopping, and more. It powers training in FastAI.jl
  • Ignite.jl is a Julia port of the Python library ignite for simplifying neural network training and validation loops, using events and handlers.
  • Tsunami.jl adds high-level ways to control training, parameter schedules & logging, heavily inspired by pytorch-lightning.

Datasets

Commonly used machine learning datasets are provided by the following packages in the julia ecosystem:

Plumbing

Tools to put data into the right order for creating a model.

  • Augmentor.jl is a real-time library augmentation library for increasing the number of training images.
  • DataAugmentation.jl aims to make it easy to build stochastic, label-preserving augmentation pipelines for vision use cases involving images, keypoints and segmentation masks.
  • MLUtils.jl (replaces MLDataUtils.jl and MLLabelUtils.jl) is a library for processing Machine Learning datasets.

Parameters


Differentiable programming

Packages based on differentiable programming but not necessarily related to Machine Learning.

  • The SciML ecosystem uses Flux and Zygote to mix neural nets with differential equations, to get the best of black box and mechanistic modelling.
  • DiffEqFlux.jl provides tools for creating Neural Differential Equations.
  • Flux3D.jl shows off machine learning on 3D data.
  • RayTracer.jl combines ML with computer vision via a differentiable renderer.
  • Duckietown.jl Differentiable Duckietown simulator.
  • The Yao.jl project uses Flux and Zygote for Quantum Differentiable Programming.
  • AtomicGraphNets.jl enables learning graph based models on atomic systems used in chemistry.
  • DiffImages.jl differentiable computer vision modeling in Julia with the Images.jl ecosystem.

Probabilistic programming

  • Turing.jl extends Flux's differentiable programming capabilities to probabilistic programming.
  • Omega.jl is a research project aimed at causal, higher-order probabilistic programming.
  • Stheno.jl provides flexible Gaussian processes.

Statistics


Useful miscellaneous packages

Some useful and random packages!

  • AdversarialPrediction.jl provides a way to easily optimise generic performance metrics in supervised learning settings using the Adversarial Prediction framework.
  • Mill.jl helps to prototype flexible multi-instance learning models.
  • MLMetrics.jl is a utility for scoring models in data science and machine learning.
  • Torch.jl exposes torch in Julia.
  • ValueHistories.jl is a utility for efficient tracking of optimization histories, training curves or other information of arbitrary types and at arbitrarily spaced sampling times.
  • InvertibleNetworks.jl Building blocks for invertible neural networks in the Julia programming language.
  • ProgressMeter.jl progress meters for long-running computations.
  • TensorBoardLogger.jl easy peasy logging to tensorboard in Julia
  • ArgParse.jl is a package for parsing command-line arguments to Julia programs.
  • Parameters.jl types with default field values, keyword constructors and (un-)pack macros.
  • BSON.jl is a package for working with the Binary JSON serialisation format.
  • DataFrames.jl in-memory tabular data in Julia.
  • DrWatson.jl is a scientific project assistant software.

This tight integration among Julia packages is shown in some of the examples in the model-zoo repository.


Alternatives to Flux

Julia has several other libraries for making neural networks.

  • SimpleChains.jl is focused on making small, simple, CPU-based, neural networks fast. Uses LoopVectorization.jl. (Was FastChain in DiffEqFlux.jl)

  • Knet.jl is a neural network library built around AutoGrad.jl.

  • Lux.jl (earlier ExplicitFluxLayers.jl) shares much of the design, use-case, and NNlib.jl / Optimisers.jl back-end of Flux. But instead of encapsulating all parameters within the model structure, it separates this into 3 components: a model, a tree of parameters, and a tree of model states.

Explicit or explicit?

Flux's training docs talk about changes from Zygote's implicit to explicit gradients, dictionary-like to tree-like structures. (See also Zygote's description of these.) Lux also uses Zygote, but uses the word "explicit" to mean something unrelated, namely storing the tree of parameters (and of state) separately from the model.

diff --git a/dev/guide/gpu/index.html b/dev/guide/gpu/index.html index cf7c008477..5f61b54351 100644 --- a/dev/guide/gpu/index.html +++ b/dev/guide/gpu/index.html @@ -169,10 +169,10 @@ julia> CUDA.device(dense_model.weight) CuDevice(1): GeForce RTX 2080 Ti -

Due to a limitation in Metal.jl, currently this kind of data movement across devices is only supported for CUDA and AMDGPU backends.

Printing models after moving to a different device

Due to a limitation in how GPU packages currently work, printing models on the REPL after moving them to a GPU device which is different from the current device will lead to an error.

Flux.AbstractDeviceType
Flux.AbstractDevice <: Function

An abstract type representing device objects for different GPU backends. The currently supported backends are "CUDA", "AMDGPU", "Metal" and "CPU"; the "CPU" backend is the fallback case when no GPU is available. GPU extensions of Flux define subtypes of this type.

source
Flux.FluxCPUDeviceType
Flux.FluxCPUDevice <: Flux.AbstractDevice

A type representing device objects for the "CPU" backend for Flux. This is the fallback case when no GPU is available to Flux.

source
Flux.FluxCUDADeviceType
FluxCUDADevice <: AbstractDevice

A type representing device objects for the "CUDA" backend for Flux.

source
Flux.FluxAMDGPUDeviceType
FluxAMDGPUDevice <: AbstractDevice

A type representing device objects for the "AMDGPU" backend for Flux.

source
Flux.FluxMetalDeviceType
FluxMetalDevice <: AbstractDevice

A type representing device objects for the "Metal" backend for Flux.

source
Flux.supported_devicesFunction
Flux.supported_devices()

Get all supported backends for Flux, in order of preference.

Example

julia> using Flux;
+

Due to a limitation in Metal.jl, currently this kind of data movement across devices is only supported for CUDA and AMDGPU backends.

Printing models after moving to a different device

Due to a limitation in how GPU packages currently work, printing models on the REPL after moving them to a GPU device which is different from the current device will lead to an error.

Flux.AbstractDeviceType
Flux.AbstractDevice <: Function

An abstract type representing device objects for different GPU backends. The currently supported backends are "CUDA", "AMDGPU", "Metal" and "CPU"; the "CPU" backend is the fallback case when no GPU is available. GPU extensions of Flux define subtypes of this type.

source
Flux.FluxCPUDeviceType
Flux.FluxCPUDevice <: Flux.AbstractDevice

A type representing device objects for the "CPU" backend for Flux. This is the fallback case when no GPU is available to Flux.

source
Flux.FluxCUDADeviceType
FluxCUDADevice <: AbstractDevice

A type representing device objects for the "CUDA" backend for Flux.

source
Flux.FluxAMDGPUDeviceType
FluxAMDGPUDevice <: AbstractDevice

A type representing device objects for the "AMDGPU" backend for Flux.

source
Flux.FluxMetalDeviceType
FluxMetalDevice <: AbstractDevice

A type representing device objects for the "Metal" backend for Flux.

source
Flux.supported_devicesFunction
Flux.supported_devices()

Get all supported backends for Flux, in order of preference.

Example

julia> using Flux;
 
 julia> Flux.supported_devices()
-("CUDA", "AMDGPU", "Metal", "CPU")
source
Flux.get_deviceFunction
Flux.get_device(; verbose=false)::Flux.AbstractDevice

Returns a device object for the most appropriate backend for the current Julia session.

First, the function checks whether a backend preference has been set via the Flux.gpu_backend! function. If so, an attempt is made to load this backend. If the corresponding trigger package has been loaded and the backend is functional, a device corresponding to the given backend is loaded. Otherwise, the backend is chosen automatically. To update the backend preference, use Flux.gpu_backend!.

If there is no preference, then for each of the "CUDA", "AMDGPU", "Metal" and "CPU" backends in the given order, this function checks whether the given backend has been loaded via the corresponding trigger package, and whether the backend is functional. If so, the device corresponding to the backend is returned. If no GPU backend is available, a Flux.FluxCPUDevice is returned.

If verbose is set to true, then the function prints informative log messages.

Examples

For the example given below, the backend preference was set to "AMDGPU" via the gpu_backend! function.

julia> using Flux;
+("CUDA", "AMDGPU", "Metal", "CPU")
source
Flux.get_deviceFunction
Flux.get_device(; verbose=false)::Flux.AbstractDevice

Returns a device object for the most appropriate backend for the current Julia session.

First, the function checks whether a backend preference has been set via the Flux.gpu_backend! function. If so, an attempt is made to load this backend. If the corresponding trigger package has been loaded and the backend is functional, a device corresponding to the given backend is loaded. Otherwise, the backend is chosen automatically. To update the backend preference, use Flux.gpu_backend!.

If there is no preference, then for each of the "CUDA", "AMDGPU", "Metal" and "CPU" backends in the given order, this function checks whether the given backend has been loaded via the corresponding trigger package, and whether the backend is functional. If so, the device corresponding to the backend is returned. If no GPU backend is available, a Flux.FluxCPUDevice is returned.

If verbose is set to true, then the function prints informative log messages.

Examples

For the example given below, the backend preference was set to "AMDGPU" via the gpu_backend! function.

julia> using Flux;
 
 julia> model = Dense(2 => 3)
 Dense(2 => 3)       # 9 parameters
@@ -212,7 +212,7 @@
 3×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
   0.820013   0.527131
  -0.915589   0.549048
-  0.290744  -0.0592499
source
Flux.get_device(backend::String, idx::Int = 0)::Flux.AbstractDevice

Get a device object for a backend specified by the string backend and idx. The currently supported values of backend are "CUDA", "AMDGPU" and "CPU". idx must be an integer value between 0 and the number of available devices.

Examples

julia> using Flux, CUDA;
+  0.290744  -0.0592499
source
Flux.get_device(backend::String, idx::Int = 0)::Flux.AbstractDevice

Get a device object for a backend specified by the string backend and idx. The currently supported values of backend are "CUDA", "AMDGPU" and "CPU". idx must be an integer value between 0 and the number of available devices.

Examples

julia> using Flux, CUDA;
 
 julia> CUDA.devices()
 CUDA.DeviceIterator() for 3 devices:
@@ -234,4 +234,4 @@
 
 julia> cpu_device = Flux.get_device("CPU")
 (::Flux.FluxCPUDevice) (generic function with 1 method)
-
source
Flux.gpu_backend!Function
gpu_backend!(backend::String)

Set the GPU backend to backend in the LocalPreferences.toml file in you project directory. After restarting Julia, the new backend will affect all subsequent calls to gpu and get_device.

The supported backends are "CUDA", "AMDGPU", "Metal" and "CPU".

source
+source
Flux.gpu_backend!Function
gpu_backend!(backend::String)

Set the GPU backend to backend in the LocalPreferences.toml file in you project directory. After restarting Julia, the new backend will affect all subsequent calls to gpu and get_device.

The supported backends are "CUDA", "AMDGPU", "Metal" and "CPU".

source
diff --git a/dev/guide/models/basics/index.html b/dev/guide/models/basics/index.html index 7ee2332bb4..b87accf0ba 100644 --- a/dev/guide/models/basics/index.html +++ b/dev/guide/models/basics/index.html @@ -109,4 +109,4 @@ return Affine(W, b) end -Affine(3 => 1, bias=false) |> gpu +Affine(3 => 1, bias=false) |> gpu diff --git a/dev/guide/models/custom_layers/index.html b/dev/guide/models/custom_layers/index.html index 700788ac60..88d9732f4e 100644 --- a/dev/guide/models/custom_layers/index.html +++ b/dev/guide/models/custom_layers/index.html @@ -104,4 +104,4 @@ # rms over all the mse ŷs = model(x) return sqrt(mean(Flux.mse(y, ŷ) for (y, ŷ) in zip(ys, ŷs))) -end
Note

This Split layer is available from the Fluxperimental.jl package.

+end
Note

This Split layer is available from the Fluxperimental.jl package.

diff --git a/dev/guide/models/overview/index.html b/dev/guide/models/overview/index.html index 043d3f7fe2..0f0c514554 100644 --- a/dev/guide/models/overview/index.html +++ b/dev/guide/models/overview/index.html @@ -56,4 +56,4 @@ julia> y_test 1×5 Matrix{Int64}: - 26 30 34 38 42

The predictions are good. Here's how we got there.

First, we gathered real-world data into the variables x_train, y_train, x_test, and y_test. The x_* data defines inputs, and the y_* data defines outputs. The *_train data is for training the model, and the *_test data is for verifying the model. Our data was based on the function 4x + 2.

Then, we built a single input, single output predictive model, predict = Dense(1 => 1). The initial predictions weren't accurate, because we had not trained the model yet.

After building the model, we trained it with train!(loss, predict, data, opt). The loss function is first, followed by the model itself, the training data, and the Descent optimiser provided by Flux. We ran the training step once, and observed that the parameters changed and the loss went down. Then, we ran the train! many times to finish the training process.

After we trained the model, we verified it with the test data to verify the results.

This overall flow represents how Flux works. Let's drill down a bit to understand what's going on inside the individual layers of Flux.

+ 26 30 34 38 42

The predictions are good. Here's how we got there.

First, we gathered real-world data into the variables x_train, y_train, x_test, and y_test. The x_* data defines inputs, and the y_* data defines outputs. The *_train data is for training the model, and the *_test data is for verifying the model. Our data was based on the function 4x + 2.

Then, we built a single input, single output predictive model, predict = Dense(1 => 1). The initial predictions weren't accurate, because we had not trained the model yet.

After building the model, we trained it with train!(loss, predict, data, opt). The loss function is first, followed by the model itself, the training data, and the Descent optimiser provided by Flux. We ran the training step once, and observed that the parameters changed and the loss went down. Then, we ran the train! many times to finish the training process.

After we trained the model, we verified it with the test data to verify the results.

This overall flow represents how Flux works. Let's drill down a bit to understand what's going on inside the individual layers of Flux.

diff --git a/dev/guide/models/quickstart/index.html b/dev/guide/models/quickstart/index.html index bc6d328dac..a6db7fa207 100644 --- a/dev/guide/models/quickstart/index.html +++ b/dev/guide/models/quickstart/index.html @@ -59,4 +59,4 @@ y_hat = m(x) Flux.logitcrossentropy(y_hat, y) end -end +end diff --git a/dev/guide/models/recurrence/index.html b/dev/guide/models/recurrence/index.html index 4dbed0c3ad..8d2bb0bcd5 100644 --- a/dev/guide/models/recurrence/index.html +++ b/dev/guide/models/recurrence/index.html @@ -99,4 +99,4 @@ true

In many situations, such as when dealing with a language model, the sentences in each batch are independent (i.e. the last item of the first sentence of the first batch is independent from the first item of the first sentence of the second batch), so we cannot handle the model as if each batch was the direct continuation of the previous one. To handle such situations, we need to reset the state of the model between each batch, which can be conveniently performed within the loss function:

function loss(x, y)
   Flux.reset!(m)
   sum(mse(m(xi), yi) for (xi, yi) in zip(x, y))
-end

A potential source of ambiguity with RNN in Flux can come from the different data layout compared to some common frameworks where data is typically a 3 dimensional array: (features, seq length, samples). In Flux, those 3 dimensions are provided through a vector of seq length containing a matrix (features, samples).

+end

A potential source of ambiguity with RNN in Flux can come from the different data layout compared to some common frameworks where data is typically a 3 dimensional array: (features, seq length, samples). In Flux, those 3 dimensions are provided through a vector of seq length containing a matrix (features, samples).

diff --git a/dev/guide/performance/index.html b/dev/guide/performance/index.html index 6e4c5e52cb..58861e483a 100644 --- a/dev/guide/performance/index.html +++ b/dev/guide/performance/index.html @@ -14,4 +14,4 @@ function loss_total(x_batch::Matrix, y_batch::Matrix) y_preds = model(x_batch) sum(loss.(y_preds, y_batch)) -end

When doing this kind of concatenation use reduce(hcat, xs) rather than hcat(xs...). This will avoid the splatting penalty, and will hit the optimised reduce method.

Be aware of GPU memory inefficiencies

Currently, GPU memory is not handled as well as system memory. If your training loop is allocating significantly on the GPU, you can quickly fill your GPU memory and the piecemeal reclamation and shuffling of data between GPU and system memory can become extremely slow. If profiling shows that a significant portion of time is spent in the gpu function and your data sizes are not large, this may be the cause. Running an incremental garbage collection manually (GC.gc(false)) at regular intervals can keep your GPU memory free and responsive. See other tips for CUDA memory management here.

+end

When doing this kind of concatenation use reduce(hcat, xs) rather than hcat(xs...). This will avoid the splatting penalty, and will hit the optimised reduce method.

Be aware of GPU memory inefficiencies

Currently, GPU memory is not handled as well as system memory. If your training loop is allocating significantly on the GPU, you can quickly fill your GPU memory and the piecemeal reclamation and shuffling of data between GPU and system memory can become extremely slow. If profiling shows that a significant portion of time is spent in the gpu function and your data sizes are not large, this may be the cause. Running an incremental garbage collection manually (GC.gc(false)) at regular intervals can keep your GPU memory free and responsive. See other tips for CUDA memory management here.

diff --git a/dev/guide/saving/index.html b/dev/guide/saving/index.html index e924e3718c..874994d835 100644 --- a/dev/guide/saving/index.html +++ b/dev/guide/saving/index.html @@ -59,4 +59,4 @@ Chain( Dense(10 => 5, relu), # 55 parameters Dense(5 => 2), # 12 parameters -) # Total: 4 arrays, 67 parameters, 524 bytes.
Warning

Saving models this way could lead to compatibility issues across julia versions and across Flux versions if some of the Flux layers' internals are changed. It is therefore not recommended for long term storage, use Flux.state instead.

+) # Total: 4 arrays, 67 parameters, 524 bytes.
Warning

Saving models this way could lead to compatibility issues across julia versions and across Flux versions if some of the Flux layers' internals are changed. It is therefore not recommended for long term storage, use Flux.state instead.

diff --git a/dev/guide/training/training/index.html b/dev/guide/training/training/index.html index cd65104259..b9b17a849c 100644 --- a/dev/guide/training/training/index.html +++ b/dev/guide/training/training/index.html @@ -118,4 +118,4 @@ train!(loss, bimodel, data, opt_state) # Un-freeze the entire model: -Flux.thaw!(opt_state)

While adjust! and freeze!/thaw! make temporary modifications to the optimiser state, permanently removing some fields of a new layer type from training is usually done when defining the layer, by calling for example @layerNewLayer trainable=(weight,).

+Flux.thaw!(opt_state)

While adjust! and freeze!/thaw! make temporary modifications to the optimiser state, permanently removing some fields of a new layer type from training is usually done when defining the layer, by calling for example @layerNewLayer trainable=(weight,).

diff --git a/dev/index.html b/dev/index.html index b800eff5b4..e4b186e9eb 100644 --- a/dev/index.html +++ b/dev/index.html @@ -3,4 +3,4 @@ function gtag(){dataLayer.push(arguments);} gtag('js', new Date()); gtag('config', 'UA-36890222-9', {'page_path': location.pathname + location.search + location.hash}); -

Flux: The Julia Machine Learning Library

Flux is a library for machine learning. It comes "batteries-included" with many useful tools built in, but also lets you use the full power of the Julia language where you need it. We follow a few key principles:

  • Doing the obvious thing. Flux has relatively few explicit APIs. Instead, writing down the mathematical form will work – and be fast.
  • Extensible by default. Flux is written to be highly flexible while being performant. Extending Flux is as simple as using your own code as part of the model you want - it is all high-level Julia code.
  • Play nicely with others. Flux works well with unrelated Julia libraries from images to differential equation solvers, rather than duplicating them.

Installation

Download Julia 1.9 or later, preferably the current stable release. You can add Flux using Julia's package manager, by typing ] add Flux in the Julia prompt. For Nvidia GPU support, you will also need to install the CUDA and the cuDNN packages. For AMD GPU support, install the AMDGPU package. For acceleration on Apple Silicon, install the Metal package.

Learning Flux

The quick start page trains a simple neural network.

This rest of the guide provides a from-scratch introduction to Flux's take on models and how they work, starting with fitting a line. Once you understand these docs, congratulations, you also understand Flux's source code, which is intended to be concise, legible and a good reference for more advanced concepts.

There are some tutorials about building particular models. The model zoo has starting points for many other common ones. And finally, the ecosystem page lists packages which define Flux models.

The reference section includes, beside Flux's own functions, those of some companion packages: Zygote.jl (automatic differentiation), Optimisers.jl (training) and others.

Community

Everyone is welcome to join our community on the Julia discourse forum, or the slack chat (channel #machine-learning). If you have questions or issues we'll try to help you out.

If you're interested in hacking on Flux, the source code is open and easy to understand – it's all just the same Julia code you work with normally. You might be interested in our intro issues to get started, or our contributing guide.

+

Flux: The Julia Machine Learning Library

Flux is a library for machine learning. It comes "batteries-included" with many useful tools built in, but also lets you use the full power of the Julia language where you need it. We follow a few key principles:

  • Doing the obvious thing. Flux has relatively few explicit APIs. Instead, writing down the mathematical form will work – and be fast.
  • Extensible by default. Flux is written to be highly flexible while being performant. Extending Flux is as simple as using your own code as part of the model you want - it is all high-level Julia code.
  • Play nicely with others. Flux works well with unrelated Julia libraries from images to differential equation solvers, rather than duplicating them.

Installation

Download Julia 1.9 or later, preferably the current stable release. You can add Flux using Julia's package manager, by typing ] add Flux in the Julia prompt. For Nvidia GPU support, you will also need to install the CUDA and the cuDNN packages. For AMD GPU support, install the AMDGPU package. For acceleration on Apple Silicon, install the Metal package.

Learning Flux

The quick start page trains a simple neural network.

This rest of the guide provides a from-scratch introduction to Flux's take on models and how they work, starting with fitting a line. Once you understand these docs, congratulations, you also understand Flux's source code, which is intended to be concise, legible and a good reference for more advanced concepts.

There are some tutorials about building particular models. The model zoo has starting points for many other common ones. And finally, the ecosystem page lists packages which define Flux models.

The reference section includes, beside Flux's own functions, those of some companion packages: Zygote.jl (automatic differentiation), Optimisers.jl (training) and others.

Community

Everyone is welcome to join our community on the Julia discourse forum, or the slack chat (channel #machine-learning). If you have questions or issues we'll try to help you out.

If you're interested in hacking on Flux, the source code is open and easy to understand – it's all just the same Julia code you work with normally. You might be interested in our intro issues to get started, or our contributing guide.

diff --git a/dev/objects.inv b/dev/objects.inv index f361a5491c..af88217e97 100644 Binary files a/dev/objects.inv and b/dev/objects.inv differ diff --git a/dev/reference/data/mlutils/index.html b/dev/reference/data/mlutils/index.html index 5c0c5c74e3..b65e548aa7 100644 --- a/dev/reference/data/mlutils/index.html +++ b/dev/reference/data/mlutils/index.html @@ -570,4 +570,4 @@ julia> zeros_like(x, Float64) 2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}: 0.0 0.0 - 0.0 0.0source + 0.0 0.0source diff --git a/dev/reference/data/onehot/index.html b/dev/reference/data/onehot/index.html index c3efd4beaf..8dd4a17c9c 100644 --- a/dev/reference/data/onehot/index.html +++ b/dev/reference/data/onehot/index.html @@ -73,4 +73,4 @@ 3 6 15 3 9 3 12 3 6 15 3source
OneHotArrays.OneHotArrayType
OneHotArray{T, N, M, I} <: AbstractArray{Bool, M}
 OneHotArray(indices, L)

A one-hot M-dimensional array with L labels (i.e. size(A, 1) == L and sum(A, dims=1) == 1) stored as a compact N == M-1-dimensional array of indices.

Typically constructed by onehot and onehotbatch. Parameter I is the type of the underlying storage, and T its eltype.

source
OneHotArrays.OneHotVectorType
OneHotVector{T} = OneHotArray{T, 0, 1, T}
 OneHotVector(indices, L)

A one-hot vector with L labels (i.e. length(A) == L and count(A) == 1) typically constructed by onehot. Stored efficiently as a single index of type T, usually UInt32.

source
OneHotArrays.OneHotMatrixType
OneHotMatrix{T, I} = OneHotArray{T, 1, 2, I}
-OneHotMatrix(indices, L)

A one-hot matrix (with L labels) typically constructed using onehotbatch. Stored efficiently as a vector of indices with type I and eltype T.

source
+OneHotMatrix(indices, L)

A one-hot matrix (with L labels) typically constructed using onehotbatch. Stored efficiently as a vector of indices with type I and eltype T.

source diff --git a/dev/reference/destructure/index.html b/dev/reference/destructure/index.html index 8cd663a592..fcab366ad7 100644 --- a/dev/reference/destructure/index.html +++ b/dev/reference/destructure/index.html @@ -106,7 +106,7 @@ L2 (generic function with 1 method) julia> L2(m2) isa Float32 -truesource

Save and Load

Flux.stateFunction
state(x)

Return an object with the same nested structure as x according to Functors.children, but made only of basic containers (e.g. named tuples, tuples, arrays, and dictionaries).

Besides trainable and non-trainable arrays, the state will contain leaf nodes that are not arrays, such as numbers, symbols, strings, and nothing values. The leaf types that end up in the state could increase in the future.

This method is particularly useful for saving and loading models, since the state contain only simple data types that can be easily serialized.

The state can be passed to loadmodel! to restore the model.

Examples

Copy the state into another model

julia> m1 = Chain(Dense(1, 2, tanh; init=ones), Dense(2, 1; init=ones));
+true
source

Save and Load

Flux.stateFunction
state(x)

Return an object with the same nested structure as x according to Functors.children, but made only of basic containers (e.g. named tuples, tuples, arrays, and dictionaries).

Besides trainable and non-trainable arrays, the state will contain leaf nodes that are not arrays, such as numbers, symbols, strings, and nothing values. The leaf types that end up in the state could increase in the future.

This method is particularly useful for saving and loading models, since the state contain only simple data types that can be easily serialized.

The state can be passed to loadmodel! to restore the model.

Examples

Copy the state into another model

julia> m1 = Chain(Dense(1, 2, tanh; init=ones), Dense(2, 1; init=ones));
 
 julia> s = Flux.state(m1)
 (layers = ((weight = [1.0; 1.0;;], bias = [0.0, 0.0], σ = ()), (weight = [1.0 1.0], bias = [0.0], σ = ())),)
@@ -132,7 +132,7 @@
 
 julia> JLD2.jldsave("checkpoint.jld2", model_state = s)
 
-julia> Flux.loadmodel!(m2, JLD2.load("checkpoint.jld2", "model_state"))
source
Flux.loadmodel!Function
loadmodel!(dst, src)

Copy all the parameters (trainable and non-trainable) from src into dst.

Recursively walks dst and src together using Functors.children, and calling copyto! on parameter arrays or throwing an error when there is a mismatch. Non-array elements (such as activation functions) are not copied and need not match. Zero bias vectors and bias=false are considered equivalent (see extended help for more details).

See also Flux.state.

Examples

julia> dst = Chain(Dense(Flux.ones32(2, 5), Flux.ones32(2), tanh), Dense(2 => 1; bias = [1f0]))
+julia> Flux.loadmodel!(m2, JLD2.load("checkpoint.jld2", "model_state"))
source
Flux.loadmodel!Function
loadmodel!(dst, src)

Copy all the parameters (trainable and non-trainable) from src into dst.

Recursively walks dst and src together using Functors.children, and calling copyto! on parameter arrays or throwing an error when there is a mismatch. Non-array elements (such as activation functions) are not copied and need not match. Zero bias vectors and bias=false are considered equivalent (see extended help for more details).

See also Flux.state.

Examples

julia> dst = Chain(Dense(Flux.ones32(2, 5), Flux.ones32(2), tanh), Dense(2 => 1; bias = [1f0]))
 Chain(
   Dense(5 => 2, tanh),                  # 12 parameters
   Dense(2 => 1),                        # 3 parameters
@@ -149,7 +149,7 @@
 false
 
 julia> iszero(dst[2].bias)
-true

Extended help

Throws an error when:

  • dst and src do not share the same fields (at any level)
  • the sizes of leaf nodes are mismatched between dst and src
  • copying non-array values to/from an array parameter (except inactive parameters described below)
  • dst is a "tied" parameter (i.e. refers to another parameter) and loaded into multiple times with mismatched source values

Inactive parameters can be encoded by using the boolean value false instead of an array. If dst == false and src is an all-zero array, no error will be raised (and no values copied); however, attempting to copy a non-zero array to an inactive parameter will throw an error. Likewise, copying a src value of false to any dst array is valid, but copying a src value of true will error.

source

KeyPath

Functors.KeyPathType
KeyPath(keys...)

A type for representing a path of keys to a value in a nested structure. Can be constructed with a sequence of keys, or by concatenating other KeyPaths. Keys can be of type Symbol, String, or Int.

For custom types, access through symbol keys is assumed to be done with getproperty. For consistency, the method Base.propertynames is used to get the viable property names.

For string and integer keys instead, the access is done with getindex.

See also getkeypath, haskeypath.

Examples

julia> kp = KeyPath(:b, 3)
+true

Extended help

Throws an error when:

  • dst and src do not share the same fields (at any level)
  • the sizes of leaf nodes are mismatched between dst and src
  • copying non-array values to/from an array parameter (except inactive parameters described below)
  • dst is a "tied" parameter (i.e. refers to another parameter) and loaded into multiple times with mismatched source values

Inactive parameters can be encoded by using the boolean value false instead of an array. If dst == false and src is an all-zero array, no error will be raised (and no values copied); however, attempting to copy a non-zero array to an inactive parameter will throw an error. Likewise, copying a src value of false to any dst array is valid, but copying a src value of true will error.

source

KeyPath

Functors.KeyPathType
KeyPath(keys...)

A type for representing a path of keys to a value in a nested structure. Can be constructed with a sequence of keys, or by concatenating other KeyPaths. Keys can be of type Symbol, String, or Int.

For custom types, access through symbol keys is assumed to be done with getproperty. For consistency, the method Base.propertynames is used to get the viable property names.

For string and integer keys instead, the access is done with getindex.

See also getkeypath, haskeypath.

Examples

julia> kp = KeyPath(:b, 3)
 KeyPath(:b, 3)
 
 julia> KeyPath(:a, kp, :c, 4) # construct mixing keys and keypaths
@@ -196,4 +196,4 @@
 true
 
 julia> haskeypath(x, KeyPath(:b, "d", 4))
-false
source
+falsesource diff --git a/dev/reference/models/activation/index.html b/dev/reference/models/activation/index.html index 27607fa70a..05a8bac091 100644 --- a/dev/reference/models/activation/index.html +++ b/dev/reference/models/activation/index.html @@ -17,7 +17,7 @@ ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ julia> celu(-10f0) --0.9999546f0source
NNlib.eluFunction
elu(x, α=1) = x > 0 ? x : α * (exp(x) - 1)

Exponential Linear Unit activation function. See "Fast and Accurate Deep Network Learning by Exponential Linear Units". You can also specify the coefficient explicitly, e.g. elu(x, 1).

julia> lineplot(elu, -2, 2, height=7)
+-0.9999546f0
source
NNlib.eluFunction
elu(x, α=1) = x > 0 ? x : α * (exp(x) - 1)

Exponential Linear Unit activation function. See "Fast and Accurate Deep Network Learning by Exponential Linear Units". You can also specify the coefficient explicitly, e.g. elu(x, 1).

julia> lineplot(elu, -2, 2, height=7)
            ┌────────────────────────────────────────┐       
          2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠤⠒⠉│ elu(x)
            │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠔⠊⠉⠀⠀⠀⠀│       
@@ -34,7 +34,7 @@
 -0.9999546f0
 
 julia> elu(-10f0, 2)
--1.9999092f0
source
NNlib.geluFunction
gelu(x) = 0.5x * (1 + tanh(√(2/π) * (x + 0.044715x^3)))

Activation function from "Gaussian Error Linear Units".

julia> lineplot(gelu, -2, 2, height=7)
+-1.9999092f0
source
NNlib.geluFunction
gelu(x) = 0.5x * (1 + tanh(√(2/π) * (x + 0.044715x^3)))

Activation function from "Gaussian Error Linear Units".

julia> lineplot(gelu, -2, 2, height=7)
            ┌────────────────────────────────────────┐        
          2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠔⠊│ gelu(x)
            │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔⠊⠁⠀⠀⠀│        
@@ -60,7 +60,7 @@
         -0.2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠓⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢠⠇⠀⠀⠀│         
              └────────────────────────────────────────┘         
              ⠀-5⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀0⠀         
-             ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀         
source
NNlib.hardsigmoidFunction
hardσ(x) = max(0, min(1, (x + 3) / 6))

Piecewise linear approximation of sigmoid.

julia> lineplot(hardsigmoid, -5, 5, height=7)
+             ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀         
source
NNlib.hardsigmoidFunction
hardσ(x) = max(0, min(1, (x + 3) / 6))

Piecewise linear approximation of sigmoid.

julia> lineplot(hardsigmoid, -5, 5, height=7)
           ┌────────────────────────────────────────┐         
         1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⢀⡠⠖⠋⠉⠉⠉⠉⠉⠉⠉⠉│ hardσ(x)
           │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⣀⡤⠒⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│         
@@ -84,7 +84,7 @@
         0 │⣀⣀⣀⣀⣀⣀⣀⠤⠤⠤⠒⠊⠉⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│     
           └────────────────────────────────────────┘     
           ⠀-5⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀5⠀     
-          ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀     
source
NNlib.hardswishFunction
hardswish(x) = x * hardσ(x)

Hard-Swish activation function. See "Searching for MobileNetV3".

julia> lineplot(hardswish, -2, 5, height = 7)
+          ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀     
source
NNlib.hardswishFunction
hardswish(x) = x * hardσ(x)

Hard-Swish activation function. See "Searching for MobileNetV3".

julia> lineplot(hardswish, -2, 5, height = 7)
            ┌────────────────────────────────────────┐             
          5 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠔⠒⠉│ hardswish(x)
            │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡠⠔⠒⠉⠁⠀⠀⠀⠀│             
@@ -114,7 +114,7 @@
 
 julia> hardswish.(-5:5)'
 1×11 adjoint(::Vector{Float64}) with eltype Float64:
- -0.0  -0.0  -0.0  -0.333333  -0.333333  0.0  0.666667  1.66667  3.0  4.0  5.0
source
NNlib.hardtanhFunction
hardtanh(x) = max(-1, min(1, x))

Segment-wise linear approximation of tanh, much cheaper to compute. See "Large Scale Machine Learning".

See also tanh_fast.

julia> lineplot(hardtanh, -2, 2, height=7)
+ -0.0  -0.0  -0.0  -0.333333  -0.333333  0.0  0.666667  1.66667  3.0  4.0  5.0
source
NNlib.hardtanhFunction
hardtanh(x) = max(-1, min(1, x))

Segment-wise linear approximation of tanh, much cheaper to compute. See "Large Scale Machine Learning".

See also tanh_fast.

julia> lineplot(hardtanh, -2, 2, height=7)
            ┌────────────────────────────────────────┐            
          1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⣀⠔⠋⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉│ hardtanh(x)
            │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⣀⡤⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│            
@@ -138,7 +138,7 @@
         -1 │⣀⣀⣀⡠⠤⠤⠤⠖⠒⠊⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│        
            └────────────────────────────────────────┘        
            ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀        
-           ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀        
source
NNlib.leakyreluFunction
leakyrelu(x, a=0.01) = max(a*x, x)

Leaky Rectified Linear Unit activation function. You can also specify the coefficient explicitly, e.g. leakyrelu(x, 0.01).

julia> lineplot(x -> leakyrelu(x, 0.5), -2, 2, height=7)
+           ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀        
source
NNlib.leakyreluFunction
leakyrelu(x, a=0.01) = max(a*x, x)

Leaky Rectified Linear Unit activation function. You can also specify the coefficient explicitly, e.g. leakyrelu(x, 0.01).

julia> lineplot(x -> leakyrelu(x, 0.5), -2, 2, height=7)
            ┌────────────────────────────────────────┐       
          2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠤⠒⠉│ #42(x)
            │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠔⠊⠉⠀⠀⠀⠀│       
@@ -155,7 +155,7 @@
 -2.0f0
 
 julia> leakyrelu(-10f0, 0.02)
--0.5f0
source
NNlib.lishtFunction
lisht(x) = x * tanh(x)

Activation function from "LiSHT: Non-Parametric Linearly Scaled Hyperbolic Tangent ..."

julia> lineplot(lisht, -2, 2, height=7)
+-0.5f0
source
NNlib.lishtFunction
lisht(x) = x * tanh(x)

Activation function from "LiSHT: Non-Parametric Linearly Scaled Hyperbolic Tangent ..."

julia> lineplot(lisht, -2, 2, height=7)
           ┌────────────────────────────────────────┐         
         2 │⠢⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔│ lisht(x)
           │⠀⠈⠑⢦⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡤⠊⠁⠀│         
@@ -179,7 +179,7 @@
         0 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠓⠪⠷⣦⣄⣀⣀⣇⣀⣀⣤⠶⠕⠒⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│           
           └────────────────────────────────────────┘           
           ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀           
-          ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀           
source
NNlib.logcoshFunction
logcosh(x)

Return log(cosh(x)) which is computed in a numerically stable way.

julia> lineplot(logcosh, -5, 5, height=7)
+          ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀           
source
NNlib.logcoshFunction
logcosh(x)

Return log(cosh(x)) which is computed in a numerically stable way.

julia> lineplot(logcosh, -5, 5, height=7)
           ┌────────────────────────────────────────┐           
         5 │⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ logcosh(x)
           │⠉⠢⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔⠋│           
@@ -190,7 +190,7 @@
         0 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠑⠢⢄⣀⣀⣇⣀⡠⠔⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│           
           └────────────────────────────────────────┘           
           ⠀-5⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀5⠀           
-          ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀           
source
NNlib.logsigmoidFunction
logσ(x)

Return log(σ(x)) which is computed in a numerically stable way.

julia> lineplot(logsigmoid, -5, 5, height=7)
+          ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀           
source
NNlib.logsigmoidFunction
logσ(x)

Return log(σ(x)) which is computed in a numerically stable way.

julia> lineplot(logsigmoid, -5, 5, height=7)
            ┌────────────────────────────────────────┐        
          0 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡧⠤⠔⠒⠒⠒⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉│ logσ(x)
            │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠖⠊⠉⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│        
@@ -201,7 +201,7 @@
         -6 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│        
            └────────────────────────────────────────┘        
            ⠀-5⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀5⠀        
-           ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀        
source
NNlib.mishFunction
mish(x) = x * tanh(softplus(x))

Activation function from "Mish: A Self Regularized Non-Monotonic Neural Activation Function".

julia> lineplot(mish, -5, 5, height=7)
+           ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀        
source
NNlib.mishFunction
mish(x) = x * tanh(softplus(x))

Activation function from "Mish: A Self Regularized Non-Monotonic Neural Activation Function".

julia> lineplot(mish, -5, 5, height=7)
            ┌────────────────────────────────────────┐        
          5 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠖⠋│ mish(x)
            │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠒⠁⠀⠀⠀│        
@@ -212,7 +212,7 @@
         -1 │⠀⠀⠀⠀⠀⠀⠀⠀⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│        
            └────────────────────────────────────────┘        
            ⠀-5⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀5⠀        
-           ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀        
source
NNlib.reluFunction
relu(x) = max(0, x)

Rectified Linear Unit activation function.

julia> lineplot(relu, -2, 2, height=7)
+           ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀        
source
NNlib.reluFunction
relu(x) = max(0, x)

Rectified Linear Unit activation function.

julia> lineplot(relu, -2, 2, height=7)
           ┌────────────────────────────────────────┐        
         2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔⠋│ relu(x)
           │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠊⠁⠀⠀│        
@@ -223,7 +223,7 @@
         0 │⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣇⠔⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│        
           └────────────────────────────────────────┘        
           ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀        
-          ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀        
source
NNlib.relu6Function
relu6(x) = min(max(0, x), 6)

Rectified Linear Unit activation function capped at 6. See "Convolutional Deep Belief Networks" from CIFAR-10.

julia> lineplot(relu6, -10, 10, height=7)
+          ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀        
source
NNlib.relu6Function
relu6(x) = min(max(0, x), 6)

Rectified Linear Unit activation function capped at 6. See "Convolutional Deep Belief Networks" from CIFAR-10.

julia> lineplot(relu6, -10, 10, height=7)
           ┌────────────────────────────────────────┐         
         6 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠎⠉⠉⠉⠉⠉⠉⠉⠉│ relu6(x)
           │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⢀⡔⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀│         
@@ -234,7 +234,7 @@
         0 │⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⡧⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│         
           └────────────────────────────────────────┘         
           ⠀-10⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀10⠀         
-          ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀         
source
NNlib.rreluFunction
rrelu(x, lo=1/8, hi=1/3) = max(a*x, x)
+          ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀         
source
NNlib.rreluFunction
rrelu(x, lo=1/8, hi=1/3) = max(a*x, x)
 # where `a` is randomly sampled from uniform distribution `U(lo, hi)`

Randomized Leaky Rectified Linear Unit activation function. See "Empirical Evaluation of Rectified Activations" You can also specify the bound explicitly, e.g. rrelu(x, 0.0, 1.0).

julia> lineplot(rrelu, -20, 10, height=7)
             ┌────────────────────────────────────────┐         
          10 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠖⠋│ rrelu(x)
@@ -249,7 +249,7 @@
             ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀         
 
 julia> extrema(rrelu.(fill(-10f0, 1000)))
-(-3.3316886f0, -1.2548422f0)
source
NNlib.seluFunction
selu(x) = λ * (x ≥ 0 ? x : α * (exp(x) - 1))
+(-3.3316886f0, -1.2548422f0)
source
NNlib.seluFunction
selu(x) = λ * (x ≥ 0 ? x : α * (exp(x) - 1))
 
 λ ≈ 1.05070...
 α ≈ 1.67326...

Scaled exponential linear units. See "Self-Normalizing Neural Networks".

julia> lineplot(selu, -3, 2, height=7)
@@ -266,7 +266,7 @@
            ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀        
 
 julia> selu(-10f0)
--1.7580194f0
source
NNlib.sigmoidFunction
σ(x) = 1 / (1 + exp(-x))

Classic sigmoid activation function. Unicode σ can be entered as \sigma then tab, in many editors. The ascii name sigmoid is also exported.

See also sigmoid_fast.

julia> using UnicodePlots
+-1.7580194f0
source
NNlib.sigmoidFunction
σ(x) = 1 / (1 + exp(-x))

Classic sigmoid activation function. Unicode σ can be entered as \sigma then tab, in many editors. The ascii name sigmoid is also exported.

See also sigmoid_fast.

julia> using UnicodePlots
 
 julia> lineplot(sigmoid, -5, 5, height=7)
           ┌────────────────────────────────────────┐     
@@ -282,14 +282,14 @@
           ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀     
 
 julia> sigmoid === σ
-true
source
NNlib.sigmoid_fastFunction
sigmoid_fast(x)

This is a faster, and very slightly less accurate, version of sigmoid. For `x::Float32, perhaps 3 times faster, and maximum errors 2 eps instead of 1.

See also tanh_fast.

julia> sigmoid(0.2f0)
+true
source
NNlib.sigmoid_fastFunction
sigmoid_fast(x)

This is a faster, and very slightly less accurate, version of sigmoid. For `x::Float32, perhaps 3 times faster, and maximum errors 2 eps instead of 1.

See also tanh_fast.

julia> sigmoid(0.2f0)
 0.54983395f0
 
 julia> sigmoid_fast(0.2f0)
 0.54983395f0
 
 julia> hardσ(0.2f0)
-0.53333336f0
source
NNlib.softplusFunction
softplus(x) = log(exp(x) + 1)

See "Deep Sparse Rectifier Neural Networks", JMLR 2011.

julia> lineplot(softplus, -3, 3, height=7)
+0.53333336f0
source
NNlib.softplusFunction
softplus(x) = log(exp(x) + 1)

See "Deep Sparse Rectifier Neural Networks", JMLR 2011.

julia> lineplot(softplus, -3, 3, height=7)
           ┌────────────────────────────────────────┐            
         4 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ softplus(x)
           │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠│            
@@ -316,7 +316,7 @@
           ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀            
 
 julia> softplus(16f0)
-16.0f0
source
NNlib.softshrinkFunction
softshrink(x, λ=0.5) =
+16.0f0
source
NNlib.softshrinkFunction
softshrink(x, λ=0.5) =
     (x ≥ λ ? x - λ : (-λ ≥ x ? x + λ : 0))

See "Softshrink Activation Function".

julia> lineplot(softshrink, -2, 2, height=7)
            ┌────────────────────────────────────────┐              
          2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀│ softshrink(x)
@@ -344,7 +344,7 @@
            ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀
 
 julia> softshrink.((-10f0, 10f0))
-(-9.5f0, 9.5f0)
source
NNlib.softsignFunction
softsign(x) = x / (1 + |x|)

See "Quadratic Polynomials Learn Better Image Features" (2009).

julia> lineplot(softsign, -5, 5, height=7)
+(-9.5f0, 9.5f0)
source
NNlib.softsignFunction
softsign(x) = x / (1 + |x|)

See "Quadratic Polynomials Learn Better Image Features" (2009).

julia> lineplot(softsign, -5, 5, height=7)
            ┌────────────────────────────────────────┐            
          1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⣀⣀⣀⣀⠤⠤⠤⠤⠤│ softsign(x)
            │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⣀⡤⠖⠒⠋⠉⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀│            
@@ -374,7 +374,7 @@
 0.5f0
 
 julia> softsign(100f0)
-0.990099f0
source
NNlib.swishFunction
swish(x) = x * σ(x)

Self-gated activation function. See "Swish: a Self-Gated Activation Function".

julia> lineplot(swish, -2, 2, height=7)
+0.990099f0
source
NNlib.swishFunction
swish(x) = x * σ(x)

Self-gated activation function. See "Swish: a Self-Gated Activation Function".

julia> lineplot(swish, -2, 2, height=7)
            ┌────────────────────────────────────────┐         
          2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤│ swish(x)
            │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠖⠋⠁⠀│         
@@ -385,7 +385,7 @@
         -1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│         
            └────────────────────────────────────────┘         
            ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀2⠀         
-           ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀         
source
NNlib.tanhshrinkFunction
tanhshrink(x) = x - tanh(x)

See "Tanhshrink Activation Function".

julia> lineplot(tanhshrink, -3, 3, height=7)
+           ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀         
source
NNlib.tanhshrinkFunction
tanhshrink(x) = x - tanh(x)

See "Tanhshrink Activation Function".

julia> lineplot(tanhshrink, -3, 3, height=7)
            ┌────────────────────────────────────────┐              
          3 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ tanhshrink(x)
            │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡠⠤⠖⠊│              
@@ -399,14 +399,14 @@
            ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀              
 
 julia> tanhshrink.((-10f0, 10f0))
-(-9.0f0, 9.0f0)
source
NNlib.tanh_fastFunction
tanh_fast(x)

This is a faster but slighly less accurate version of tanh.

Where Julia's tanh function has an error under 2 eps, this may be wrong by 5 eps, a reduction by less than one decimal digit.

For x::Float32 this is usually about 10 times faster, with a smaller speedup for x::Float64. For any other number types, it just calls tanh.

See also sigmoid_fast.

julia> tanh(0.5f0)
+(-9.0f0, 9.0f0)
source
NNlib.tanh_fastFunction
tanh_fast(x)

This is a faster but slighly less accurate version of tanh.

Where Julia's tanh function has an error under 2 eps, this may be wrong by 5 eps, a reduction by less than one decimal digit.

For x::Float32 this is usually about 10 times faster, with a smaller speedup for x::Float64. For any other number types, it just calls tanh.

See also sigmoid_fast.

julia> tanh(0.5f0)
 0.46211717f0
 
 julia> tanh_fast(0.5f0)
 0.46211714f0
 
 julia> hard_tanh(0.5f0)
-0.5f0
source
NNlib.treluFunction
trelu(x, theta=1) = x > theta ? x : 0

Threshold gated rectified linear activation function. See "Zero-bias autoencoders and the benefits of co-adapting features"

julia> lineplot(trelu, -2, 4, height=7)
+0.5f0
source
NNlib.treluFunction
trelu(x, theta=1) = x > theta ? x : 0

Threshold gated rectified linear activation function. See "Zero-bias autoencoders and the benefits of co-adapting features"

julia> lineplot(trelu, -2, 4, height=7)
           ┌────────────────────────────────────────┐         
         4 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠖⠋│ trelu(x)
           │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠖⠋⠁⠀⠀⠀│         
@@ -417,7 +417,7 @@
         0 │⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣇⣀⣀⣀⣀⣀⣀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│         
           └────────────────────────────────────────┘         
           ⠀-2⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀4⠀         
-          ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀         
source

One More

Julia's Base.Math also provides tanh, which can be used as an activation function.

Note that many Flux layers will automatically replace this with NNlib.tanh_fast when called, as Base's tanh is slow enough to sometimes be a bottleneck.

julia> using UnicodePlots
+          ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀         
source

One More

Julia's Base.Math also provides tanh, which can be used as an activation function.

Note that many Flux layers will automatically replace this with NNlib.tanh_fast when called, as Base's tanh is slow enough to sometimes be a bottleneck.

julia> using UnicodePlots
 
 julia> lineplot(tanh, -3, 3, height=7)
            ┌────────────────────────────────────────┐        
@@ -430,4 +430,4 @@
         -1 │⣀⣀⣀⣀⣀⣀⣀⣀⣀⡤⠤⠔⠒⠉⠁⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│        
            └────────────────────────────────────────┘        
            ⠀-3⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀3⠀        
-           ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀        
+ ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ diff --git a/dev/reference/models/functors/index.html b/dev/reference/models/functors/index.html index 8c9b7f985a..0dd0aa4db0 100644 --- a/dev/reference/models/functors/index.html +++ b/dev/reference/models/functors/index.html @@ -23,7 +23,7 @@ Dense(2 => 1, tanh), # 3 parameters Dense(1 => 1; bias=false), # 1 parameters Dropout(0.4), -) # Total: 3 arrays, 4 parameters, 224 bytes.source
Functors.@functorMacro
@functor T
+)                   # Total: 3 arrays, 4 parameters, 224 bytes.
source
Functors.@functorMacro
@functor T
 @functor T (x,)

Adds methods to functor allowing recursion into objects of type T, and reconstruction. Assumes that T has a constructor accepting all of its fields, which is true unless you have provided an inner constructor which does not.

By default all fields of T are considered children; this can be restricted be restructed by providing a tuple of field names.

Examples

julia> struct Foo; x; y; end
 
 julia> @functor Foo
@@ -193,7 +193,7 @@
 julia> m.bias
 2-element Vector{Float32}:
  0.0
- 0.0
source
Flux.gpuMethod
gpu(m)

Copies m to the current GPU device (using current GPU backend), if one is available. If no GPU is available, it does nothing (but prints a warning the first time).

On arrays, this calls CUDA's cu, which also changes arrays with Float64 elements to Float32 while copying them to the device (same for AMDGPU). To act on arrays within a struct, the struct type must be marked with @functor.

Use cpu to copy back to ordinary Arrays. See also f32 and f16 to change element type only.

See the CUDA.jl docs to help identify the current device.

Example

julia> m = Dense(rand(2, 3))  # constructed with Float64 weight matrix
+ 0.0
source
Flux.gpuMethod
gpu(m)

Copies m to the current GPU device (using current GPU backend), if one is available. If no GPU is available, it does nothing (but prints a warning the first time).

On arrays, this calls CUDA's cu, which also changes arrays with Float64 elements to Float32 while copying them to the device (same for AMDGPU). To act on arrays within a struct, the struct type must be marked with @functor.

Use cpu to copy back to ordinary Arrays. See also f32 and f16 to change element type only.

See the CUDA.jl docs to help identify the current device.

Example

julia> m = Dense(rand(2, 3))  # constructed with Float64 weight matrix
 Dense(3 => 2)       # 8 parameters
 
 julia> typeof(m.weight)
@@ -203,7 +203,7 @@
 Dense(3 => 2)       # 8 parameters
 
 julia> typeof(m_gpu.weight)
-CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}
source
Flux.gpuMethod
gpu(data::DataLoader)
+CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}
source
Flux.gpuMethod
gpu(data::DataLoader)
 cpu(data::DataLoader)

Transforms a given DataLoader to apply gpu or cpu to each batch of data, when iterated over. (If no GPU is available, this does nothing.)

Example

julia> dl = Flux.DataLoader((x = ones(2,10), y='a':'j'), batchsize=3)
 4-element DataLoader(::NamedTuple{(:x, :y), Tuple{Matrix{Float64}, StepRange{Char, Int64}}}, batchsize=3)
   with first element:
@@ -223,4 +223,4 @@
  1.0  1.0  1.0

For large datasets, this is preferred over moving all the data to the GPU before creating the DataLoader, like this:

julia> Flux.DataLoader((x = ones(2,10), y=2:11) |> gpu, batchsize=3)
 4-element DataLoader(::NamedTuple{(:x, :y), Tuple{CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, UnitRange{Int64}}}, batchsize=3)
   with first element:
-  (; x = 2×3 CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, y = 3-element UnitRange{Int64})
Warning

This only works if gpu is applied directly to the DataLoader. While gpu acts recursively on Flux models and many basic Julia structs, it will not work on (say) a tuple of DataLoaders.

source
+ (; x = 2×3 CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, y = 3-element UnitRange{Int64})
Warning

This only works if gpu is applied directly to the DataLoader. While gpu acts recursively on Flux models and many basic Julia structs, it will not work on (say) a tuple of DataLoaders.

source
diff --git a/dev/reference/models/layers/index.html b/dev/reference/models/layers/index.html index 47687fe8ef..ba03c7d3b2 100644 --- a/dev/reference/models/layers/index.html +++ b/dev/reference/models/layers/index.html @@ -23,7 +23,7 @@ julia> Flux.trainables(model2) # no trainable bias 1-element Vector{AbstractArray}: - [1.0 1.0 … 1.0 1.0; 1.0 1.0 … 1.0 1.0]source
Flux.BilinearType
Bilinear((in1, in2) => out, σ=identity; bias=true, init=glorot_uniform)
+ [1.0 1.0 … 1.0 1.0; 1.0 1.0 … 1.0 1.0]
source
Flux.BilinearType
Bilinear((in1, in2) => out, σ=identity; bias=true, init=glorot_uniform)
 Bilinear(W::AbstractArray, [bias, σ])

Creates a layer which is fully connected between two inputs and the output, and otherwise similar to Dense. Its output, given vectors x & y, is another vector z with, for all i ∈ 1:out:

z[i] = σ(x' * W[i,:,:] * y + bias[i])

If x and y are matrices, then each column of the output z = B(x, y) is of this form, with B the Bilinear layer.

If the second input y is not given, it is taken to be equal to x, i.e. B(x) == B(x, x)

The two inputs may also be provided as a tuple, B((x, y)) == B(x, y), which is accepted as the input to a Chain.

If the two input sizes are the same, in1 == in2, then you may write Bilinear(in => out, σ).

The initialisation works as for Dense layer, with W = init(out, in1, in2). By default the bias vector is zeros(Float32, out), option bias=false will switch off trainable bias. Either of these may be provided explicitly.

Examples

julia> x, y = randn(Float32, 5, 32), randn(Float32, 5, 32);
 
 julia> B = Flux.Bilinear((5, 5) => 7)
@@ -44,7 +44,7 @@
 (3, 32)
 
 julia> Flux.Bilinear(rand(4,8,16), false, tanh)  # first dim of weight is the output
-Bilinear((8, 16) => 4, tanh; bias=false)  # 512 parameters
source
Flux.ScaleType
Scale(size::Integer..., σ=identity; bias=true, init=ones32)
+Bilinear((8, 16) => 4, tanh; bias=false)  # 512 parameters
source
Flux.ScaleType
Scale(size::Integer..., σ=identity; bias=true, init=ones32)
 Scale(scale::AbstractArray, [bias, σ])

Create an element-wise layer, whose forward pass is given by:

y = σ.(scale .* x .+ bias)

This uses .* instead of matrix multiplication * of Dense.

The learnable scale & bias are initialised init(size...) and zeros32(size...), with init=ones32 by default. You may specify the function init, turn off trainable bias with bias=false, or provide the array(s) explicitly.

Used by LayerNorm with affine=true.

Examples

julia> a = Flux.Scale(2)
 Scale(2)            # 4 parameters
 
@@ -68,7 +68,7 @@
 
 julia> Flux.trainables(b)
 1-element Vector{AbstractArray}:
- Float32[1.0 2.0 3.0 4.0]
source

Perhaps Scale isn't quite fully connected, but it may be thought of as Dense(Diagonal(s.weights), s.bias), and LinearAlgebra's Diagonal is a matrix which just happens to contain many zeros.

Convolution Models

These layers are used to build convolutional neural networks (CNNs).

They all expect images in what is called WHCN order: a batch of 32 colour images, each 50 x 50 pixels, will have size(x) == (50, 50, 3, 32). A single grayscale image might instead have size(x) == (28, 28, 1, 1).

Besides images, 2D data, they also work with 1D data, where for instance stereo sound recording with 1000 samples might have size(x) == (1000, 2, 1). They will also work with 3D data, ndims(x) == 5, where again the last two dimensions are channel and batch.

To understand how strides and padding work, the article by Dumoulin & Visin has great illustrations.

Flux.ConvType
Conv(filter, in => out, σ = identity;
+ Float32[1.0 2.0 3.0 4.0]
source

Perhaps Scale isn't quite fully connected, but it may be thought of as Dense(Diagonal(s.weights), s.bias), and LinearAlgebra's Diagonal is a matrix which just happens to contain many zeros.

Convolution Models

These layers are used to build convolutional neural networks (CNNs).

They all expect images in what is called WHCN order: a batch of 32 colour images, each 50 x 50 pixels, will have size(x) == (50, 50, 3, 32). A single grayscale image might instead have size(x) == (28, 28, 1, 1).

Besides images, 2D data, they also work with 1D data, where for instance stereo sound recording with 1000 samples might have size(x) == (1000, 2, 1). They will also work with 3D data, ndims(x) == 5, where again the last two dimensions are channel and batch.

To understand how strides and padding work, the article by Dumoulin & Visin has great illustrations.

Flux.ConvType
Conv(filter, in => out, σ = identity;
      stride = 1, pad = 0, dilation = 1, groups = 1, [bias, init])

Standard convolutional layer. filter is a tuple of integers specifying the size of the convolutional kernel; in and out specify the number of input and output channels.

Image data should be stored in WHCN order (width, height, channels, batch). In other words, a 100×100 RGB image would be a 100×100×3×1 array, and a batch of 50 would be a 100×100×3×50 array. This has N = 2 spatial dimensions, and needs a kernel size like (5,5), a 2-tuple of integers.

To take convolutions along N feature dimensions, this layer expects as input an array with ndims(x) == N+2, where size(x, N+1) == in is the number of input channels, and size(x, ndims(x)) is (as always) the number of observations in a batch. Then:

  • filter should be a tuple of N integers.
  • Keywords stride and dilation should each be either single integer, or a tuple with N integers.
  • Keyword pad specifies the number of elements added to the borders of the data array. It can be
    • a single integer for equal padding all around,
    • a tuple of N integers, to apply the same padding at begin/end of each spatial dimension,
    • a tuple of 2*N integers, for asymmetric padding, or
    • the singleton SamePad(), to calculate padding such that size(output,d) == size(x,d) / stride (possibly rounded) for each spatial dimension.
  • Keyword groups is expected to be an Int. It specifies the number of groups to divide a convolution into.

Keywords to control initialization of the layer:

  • init - Function used to generate initial weights. Defaults to glorot_uniform.
  • bias - The initial bias vector is all zero by default. Trainable bias can be disabled entirely by setting this to false, or another vector can be provided such as bias = randn(Float32, out).

See also ConvTranspose, DepthwiseConv, CrossCor.

Examples

julia> xs = rand32(100, 100, 3, 50); # a batch of 50 RGB images
 
 julia> layer = Conv((5,5), 3 => 7, relu; bias = false)
@@ -87,7 +87,7 @@
 (130, 100, 7, 50)
 
 julia> Conv((5,5), 3 => 7; stride = 2, dilation = 4)(xs) |> size
-(42, 42, 7, 50)
source
Flux.ConvMethod
Conv(weight::AbstractArray, [bias, activation; stride, pad, dilation])

Constructs a convolutional layer with the given weight and bias. Accepts the same keywords and has the same defaults as Conv(k::NTuple{N,Integer}, ch::Pair{<:Integer,<:Integer}, σ; ...).

julia> weight = rand(3, 4, 5);
+(42, 42, 7, 50)
source
Flux.ConvMethod
Conv(weight::AbstractArray, [bias, activation; stride, pad, dilation])

Constructs a convolutional layer with the given weight and bias. Accepts the same keywords and has the same defaults as Conv(k::NTuple{N,Integer}, ch::Pair{<:Integer,<:Integer}, σ; ...).

julia> weight = rand(3, 4, 5);
 
 julia> bias = zeros(5);
 
@@ -98,7 +98,7 @@
 (98, 5, 64)
 
 julia> Flux.params(layer) |> length
-2
source
Flux.ConvTransposeType
ConvTranspose(filter, in => out, σ=identity; stride=1, pad=0, outpad=0, dilation=1, [bias, init])

Standard convolutional transpose layer. filter is a tuple of integers specifying the size of the convolutional kernel, while in and out specify the number of input and output channels.

Note that pad=SamePad() here tries to ensure size(output,d) == size(x,d) * stride.

To conserve Conv inversability when stride > 1, outpad can be used to increase the size of the output in the desired dimensions. Whereas pad is used to zero-pad the input, outpad only affects the output shape.

Parameters are controlled by additional keywords, with defaults init=glorot_uniform and bias=true.

See also Conv for more detailed description of keywords.

Examples

julia> xs = rand32(100, 100, 3, 50);  # a batch of 50 RGB images
+2
source
Flux.ConvTransposeType
ConvTranspose(filter, in => out, σ=identity; stride=1, pad=0, outpad=0, dilation=1, [bias, init])

Standard convolutional transpose layer. filter is a tuple of integers specifying the size of the convolutional kernel, while in and out specify the number of input and output channels.

Note that pad=SamePad() here tries to ensure size(output,d) == size(x,d) * stride.

To conserve Conv inversability when stride > 1, outpad can be used to increase the size of the output in the desired dimensions. Whereas pad is used to zero-pad the input, outpad only affects the output shape.

Parameters are controlled by additional keywords, with defaults init=glorot_uniform and bias=true.

See also Conv for more detailed description of keywords.

Examples

julia> xs = rand32(100, 100, 3, 50);  # a batch of 50 RGB images
 
 julia> layer = ConvTranspose((5,5), 3 => 7, relu)
 ConvTranspose((5, 5), 3 => 7, relu)  # 532 parameters
@@ -113,7 +113,7 @@
 (204, 204, 7, 50)
 
 julia> ConvTranspose((5,5), 3 => 7, stride=3, pad=SamePad())(xs) |> size
-(300, 300, 7, 50)
source
Flux.ConvTransposeMethod
ConvTranspose(weight::AbstractArray, [bias, activation; stride, pad, outpad, dilation, groups])

Constructs a ConvTranspose layer with the given weight and bias. Accepts the same keywords and has the same defaults as ConvTranspose(k::NTuple{N,Integer}, ch::Pair{<:Integer,<:Integer}, σ; ...).

Examples

julia> weight = rand(3, 4, 5);
+(300, 300, 7, 50)
source
Flux.ConvTransposeMethod
ConvTranspose(weight::AbstractArray, [bias, activation; stride, pad, outpad, dilation, groups])

Constructs a ConvTranspose layer with the given weight and bias. Accepts the same keywords and has the same defaults as ConvTranspose(k::NTuple{N,Integer}, ch::Pair{<:Integer,<:Integer}, σ; ...).

Examples

julia> weight = rand(3, 4, 5);
 
 julia> bias = zeros(4);
 
@@ -124,7 +124,7 @@
 (102, 4, 64)
 
 julia> Flux.params(layer) |> length
-2
source
Flux.CrossCorType
CrossCor(filter, in => out, σ=identity; stride=1, pad=0, dilation=1, [bias, init])

Standard cross correlation layer. filter is a tuple of integers specifying the size of the convolutional kernel; in and out specify the number of input and output channels.

Parameters are controlled by additional keywords, with defaults init=glorot_uniform and bias=true.

See also Conv for more detailed description of keywords.

Examples

julia> xs = rand(Float32, 100, 100, 3, 50);  # a batch of 50 RGB images
+2
source
Flux.CrossCorType
CrossCor(filter, in => out, σ=identity; stride=1, pad=0, dilation=1, [bias, init])

Standard cross correlation layer. filter is a tuple of integers specifying the size of the convolutional kernel; in and out specify the number of input and output channels.

Parameters are controlled by additional keywords, with defaults init=glorot_uniform and bias=true.

See also Conv for more detailed description of keywords.

Examples

julia> xs = rand(Float32, 100, 100, 3, 50);  # a batch of 50 RGB images
 
 julia> layer = CrossCor((5,5), 3 => 6, relu; bias=false)
 CrossCor((5, 5), 3 => 6, relu, bias=false)  # 450 parameters
@@ -133,7 +133,7 @@
 (96, 96, 6, 50)
 
 julia> CrossCor((5,5), 3 => 7, stride=3, pad=(2,0))(xs) |> size
-(34, 32, 7, 50)
source
Flux.CrossCorMethod
CrossCor(weight::AbstractArray, [bias, activation; stride, pad, dilation])

Constructs a CrossCor layer with the given weight and bias. Accepts the same keywords and has the same defaults as CrossCor(k::NTuple{N,Integer}, ch::Pair{<:Integer,<:Integer}, σ; ...).

Examples

julia> weight = rand(3, 4, 5);
+(34, 32, 7, 50)
source
Flux.CrossCorMethod
CrossCor(weight::AbstractArray, [bias, activation; stride, pad, dilation])

Constructs a CrossCor layer with the given weight and bias. Accepts the same keywords and has the same defaults as CrossCor(k::NTuple{N,Integer}, ch::Pair{<:Integer,<:Integer}, σ; ...).

Examples

julia> weight = rand(3, 4, 5);
 
 julia> bias = zeros(5);
 
@@ -141,7 +141,7 @@
 CrossCor((3,), 4 => 5, relu)  # 65 parameters
 
 julia> layer(randn(100, 4, 64)) |> size
-(98, 5, 64)
source
Flux.DepthwiseConvFunction
DepthwiseConv(filter, in => out, σ=identity; stride=1, pad=0, dilation=1, [bias, init])
+(98, 5, 64)
source
Flux.DepthwiseConvFunction
DepthwiseConv(filter, in => out, σ=identity; stride=1, pad=0, dilation=1, [bias, init])
 DepthwiseConv(weight::AbstractArray, [bias, activation; stride, pad, dilation])

Return a depthwise convolutional layer, that is a Conv layer with number of groups equal to the number of input channels.

See Conv for a description of the arguments.

Examples

julia> xs = rand(Float32, 100, 100, 3, 50);  # a batch of 50 RGB images
 
 julia> layer = DepthwiseConv((5,5), 3 => 6, relu; bias=false)
@@ -151,7 +151,7 @@
 (96, 96, 6, 50)
 
 julia> DepthwiseConv((5, 5), 3 => 9, stride=2, pad=2)(xs) |> size
-(50, 50, 9, 50)
source
Flux.SamePadType
SamePad()

Passed as an option to convolutional layers (and friends), this causes the padding to be chosen such that the input and output sizes agree (on the first N dimensions, the kernel or window) when stride==1. When stride≠1, the output size equals ceil(input_size/stride).

See also Conv, MaxPool.

Examples

julia> xs = rand32(100, 100, 3, 50);  # a batch of images
+(50, 50, 9, 50)
source
Flux.SamePadType
SamePad()

Passed as an option to convolutional layers (and friends), this causes the padding to be chosen such that the input and output sizes agree (on the first N dimensions, the kernel or window) when stride==1. When stride≠1, the output size equals ceil(input_size/stride).

See also Conv, MaxPool.

Examples

julia> xs = rand32(100, 100, 3, 50);  # a batch of images
 
 julia> layer = Conv((2,2), 3 => 7, pad=SamePad())
 Conv((2, 2), 3 => 7, pad=(1, 0, 1, 0))  # 91 parameters
@@ -169,7 +169,7 @@
 Conv((5, 5), 3 => 7, pad=2, stride=2)  # 532 parameters
 
 julia> layer3(xs) |> size  # output size = `ceil(input_size/stride)` = 50
-(50, 50, 7, 50)
source
Flux.flattenFunction

flatten(x)

Same as MLUtils.flatten, which should be prefered to this method existing only for backward compatibility.

source

MultiHeadAttention

The basic blocks needed to implement Transformer architectures. See also the functional counterparts documented in NNlib's Attention section.

Flux.MultiHeadAttentionType
MultiHeadAttention(dims; [nheads, bias, init, dropout_prob])

The multi-head dot-product attention layer used in Transformer architectures [1].

Returns the transformed input sequence and the attention scores.

[1] Vaswani et al. "Attention is all you need." Advances in Neural Information Processing Systems. 2017.

Arguments

  • dims: The embedding dimensions of inputs, intermediate tensors and outputs. In the most general case, it is given as a) (q_in_dim, k_in_dim, v_in_dim) => (qk_dim, v_dim) => out_dim. Can take also simpler forms as b) dims::Int; c) in_dim::Int => (qk_dim, v_dim) => out_dim; d) in_dim::Int => qkv_dim => out_dim.
  • nheads: number of heads. Default 8.
  • init: weight initializer for the Dense layers. Default glorot_uniform.
  • bias : whether pointwise QKVO dense transforms use bias. Default false.
  • dropout_prob: dropout probability for the attention scores. Default 0.0.

Forward

(mha::MultiHeadAttention)(q_in, k_in, v_in, [bias]; [mask])

The arguments of the forward pass are:

  • q_in: Input query array of size (q_in_dim, q_len, batch_size).
  • k_in: Input key array of size (k_in_dim, kv_len, batch_size).
  • v_in: Input value array of size (v_in_dim, kv_len, batch_size).
  • bias: Bias array broadcastable to size (kv_len, q_len, nheads, batch_size). It will be added to the attention scores before the softmax. Default nothing.
  • mask: Input array broadcastable to size (kv_len, q_len, nheads, batch_size). The mask is applied to the attention scores just before the softmax. See NNlib.make_causal_mask for creating causal masks. Default nothing.

Alternative calling signatures are mha(q_in), equivalent to mha(q_in, q_in, q_in) (self-attention), and mha(q_in, k_in), equivalent to mha(q_in, k_in, k_in) (key and value are the same).

See also NNlib.dot_product_attention.

Examples

mha = MultiHeadAttention(64, nheads = 8)
+(50, 50, 7, 50)
source
Flux.flattenFunction

flatten(x)

Same as MLUtils.flatten, which should be prefered to this method existing only for backward compatibility.

source

MultiHeadAttention

The basic blocks needed to implement Transformer architectures. See also the functional counterparts documented in NNlib's Attention section.

Flux.MultiHeadAttentionType
MultiHeadAttention(dims; [nheads, bias, init, dropout_prob])

The multi-head dot-product attention layer used in Transformer architectures [1].

Returns the transformed input sequence and the attention scores.

[1] Vaswani et al. "Attention is all you need." Advances in Neural Information Processing Systems. 2017.

Arguments

  • dims: The embedding dimensions of inputs, intermediate tensors and outputs. In the most general case, it is given as a) (q_in_dim, k_in_dim, v_in_dim) => (qk_dim, v_dim) => out_dim. Can take also simpler forms as b) dims::Int; c) in_dim::Int => (qk_dim, v_dim) => out_dim; d) in_dim::Int => qkv_dim => out_dim.
  • nheads: number of heads. Default 8.
  • init: weight initializer for the Dense layers. Default glorot_uniform.
  • bias : whether pointwise QKVO dense transforms use bias. Default false.
  • dropout_prob: dropout probability for the attention scores. Default 0.0.

Forward

(mha::MultiHeadAttention)(q_in, k_in, v_in, [bias]; [mask])

The arguments of the forward pass are:

  • q_in: Input query array of size (q_in_dim, q_len, batch_size).
  • k_in: Input key array of size (k_in_dim, kv_len, batch_size).
  • v_in: Input value array of size (v_in_dim, kv_len, batch_size).
  • bias: Bias array broadcastable to size (kv_len, q_len, nheads, batch_size). It will be added to the attention scores before the softmax. Default nothing.
  • mask: Input array broadcastable to size (kv_len, q_len, nheads, batch_size). The mask is applied to the attention scores just before the softmax. See NNlib.make_causal_mask for creating causal masks. Default nothing.

Alternative calling signatures are mha(q_in), equivalent to mha(q_in, q_in, q_in) (self-attention), and mha(q_in, k_in), equivalent to mha(q_in, k_in, k_in) (key and value are the same).

See also NNlib.dot_product_attention.

Examples

mha = MultiHeadAttention(64, nheads = 8)
 q = rand(Float32, (64, 10, 32))
 k = rand(Float32, (64, 20, 32))
 v = rand(Float32, (64, 20, 32))
@@ -180,13 +180,13 @@
 mha = MultiHeadAttention(64 => 1024 => 1024, nheads = 8)
 y, α = mha(q) # self-attention
 # [y] = [1024, 10, 32]
-# [α] = [10, 10, 8, 32]
source

Pooling

These layers are commonly used after a convolution layer, and reduce the size of its output. They have no trainable parameters.

Flux.AdaptiveMaxPoolType
AdaptiveMaxPool(out::NTuple)

Adaptive max pooling layer. Calculates the necessary window size such that its output has size(y)[1:N] == out.

Expects as input an array with ndims(x) == N+2, i.e. channel and batch dimensions, after the N feature dimensions, where N = length(out).

See also MaxPool, AdaptiveMeanPool.

Examples

julia> xs = rand(Float32, 100, 100, 3, 50);  # batch of 50 RGB images
+# [α] = [10, 10, 8, 32]
source

Pooling

These layers are commonly used after a convolution layer, and reduce the size of its output. They have no trainable parameters.

Flux.AdaptiveMaxPoolType
AdaptiveMaxPool(out::NTuple)

Adaptive max pooling layer. Calculates the necessary window size such that its output has size(y)[1:N] == out.

Expects as input an array with ndims(x) == N+2, i.e. channel and batch dimensions, after the N feature dimensions, where N = length(out).

See also MaxPool, AdaptiveMeanPool.

Examples

julia> xs = rand(Float32, 100, 100, 3, 50);  # batch of 50 RGB images
 
 julia> AdaptiveMaxPool((25, 25))(xs) |> size
 (25, 25, 3, 50)
 
 julia> MaxPool((4,4))(xs) ≈ AdaptiveMaxPool((25, 25))(xs)
-true
source
Flux.MaxPoolType
MaxPool(window::NTuple; pad=0, stride=window)

Max pooling layer, which replaces all pixels in a block of size window with one.

Expects as input an array with ndims(x) == N+2, i.e. channel and batch dimensions, after the N feature dimensions, where N = length(window).

By default the window size is also the stride in each dimension. The keyword pad accepts the same options as for the Conv layer, including SamePad().

See also Conv, MeanPool, AdaptiveMaxPool, GlobalMaxPool.

Examples

julia> xs = rand(Float32, 100, 100, 3, 50);  # batch of 50 RGB images
+true
source
Flux.MaxPoolType
MaxPool(window::NTuple; pad=0, stride=window)

Max pooling layer, which replaces all pixels in a block of size window with one.

Expects as input an array with ndims(x) == N+2, i.e. channel and batch dimensions, after the N feature dimensions, where N = length(window).

By default the window size is also the stride in each dimension. The keyword pad accepts the same options as for the Conv layer, including SamePad().

See also Conv, MeanPool, AdaptiveMaxPool, GlobalMaxPool.

Examples

julia> xs = rand(Float32, 100, 100, 3, 50);  # batch of 50 RGB images
 
 julia> m = Chain(Conv((5, 5), 3 => 7, pad=SamePad()), MaxPool((5, 5), pad=SamePad()))
 Chain(
@@ -204,7 +204,7 @@
 MaxPool((5,), pad=2, stride=3)
 
 julia> layer(rand(Float32, 100, 7, 50)) |> size
-(34, 7, 50)
source
Flux.GlobalMaxPoolType
GlobalMaxPool()

Global max pooling layer.

Transforms (w,h,c,b)-shaped input into (1,1,c,b)-shaped output, by performing max pooling on the complete (w,h)-shaped feature maps.

See also MaxPool, GlobalMeanPool.

julia> xs = rand(Float32, 100, 100, 3, 50);
+(34, 7, 50)
source
Flux.GlobalMaxPoolType
GlobalMaxPool()

Global max pooling layer.

Transforms (w,h,c,b)-shaped input into (1,1,c,b)-shaped output, by performing max pooling on the complete (w,h)-shaped feature maps.

See also MaxPool, GlobalMeanPool.

julia> xs = rand(Float32, 100, 100, 3, 50);
 
 julia> m = Chain(Conv((3,3), 3 => 7), GlobalMaxPool());
 
@@ -212,13 +212,13 @@
 (1, 1, 7, 50)
 
 julia> GlobalMaxPool()(rand(3,5,7)) |> size  # preserves 2 dimensions
-(1, 5, 7)
source
Flux.AdaptiveMeanPoolType
AdaptiveMeanPool(out::NTuple)

Adaptive mean pooling layer. Calculates the necessary window size such that its output has size(y)[1:N] == out.

Expects as input an array with ndims(x) == N+2, i.e. channel and batch dimensions, after the N feature dimensions, where N = length(out).

See also MaxPool, AdaptiveMaxPool.

Examples

julia> xs = rand(Float32, 100, 100, 3, 50);  # batch of 50 RGB images
+(1, 5, 7)
source
Flux.AdaptiveMeanPoolType
AdaptiveMeanPool(out::NTuple)

Adaptive mean pooling layer. Calculates the necessary window size such that its output has size(y)[1:N] == out.

Expects as input an array with ndims(x) == N+2, i.e. channel and batch dimensions, after the N feature dimensions, where N = length(out).

See also MaxPool, AdaptiveMaxPool.

Examples

julia> xs = rand(Float32, 100, 100, 3, 50);  # batch of 50 RGB images
 
 julia> AdaptiveMeanPool((25, 25))(xs) |> size
 (25, 25, 3, 50)
 
 julia> MeanPool((4,4))(xs) ≈ AdaptiveMeanPool((25, 25))(xs)
-true
source
Flux.MeanPoolType
MeanPool(window::NTuple; pad=0, stride=window)

Mean pooling layer, averaging all pixels in a block of size window.

Expects as input an array with ndims(x) == N+2, i.e. channel and batch dimensions, after the N feature dimensions, where N = length(window).

By default the window size is also the stride in each dimension. The keyword pad accepts the same options as for the Conv layer, including SamePad().

See also Conv, MaxPool, AdaptiveMeanPool.

Examples

julia> xs = rand(Float32, 100, 100, 3, 50);
+true
source
Flux.MeanPoolType
MeanPool(window::NTuple; pad=0, stride=window)

Mean pooling layer, averaging all pixels in a block of size window.

Expects as input an array with ndims(x) == N+2, i.e. channel and batch dimensions, after the N feature dimensions, where N = length(window).

By default the window size is also the stride in each dimension. The keyword pad accepts the same options as for the Conv layer, including SamePad().

See also Conv, MaxPool, AdaptiveMeanPool.

Examples

julia> xs = rand(Float32, 100, 100, 3, 50);
 
 julia> m = Chain(Conv((5,5), 3 => 7), MeanPool((5,5), pad=SamePad()))
 Chain(
@@ -230,12 +230,12 @@
 (96, 96, 7, 50)
 
 julia> m(xs) |> size
-(20, 20, 7, 50)
source
Flux.GlobalMeanPoolType
GlobalMeanPool()

Global mean pooling layer.

Transforms (w,h,c,b)-shaped input into (1,1,c,b)-shaped output, by performing mean pooling on the complete (w,h)-shaped feature maps.

julia> xs = rand(Float32, 100, 100, 3, 50);
+(20, 20, 7, 50)
source
Flux.GlobalMeanPoolType
GlobalMeanPool()

Global mean pooling layer.

Transforms (w,h,c,b)-shaped input into (1,1,c,b)-shaped output, by performing mean pooling on the complete (w,h)-shaped feature maps.

julia> xs = rand(Float32, 100, 100, 3, 50);
 
 julia> m = Chain(Conv((3,3), 3 => 7), GlobalMeanPool());
 
 julia> m(xs) |> size
-(1, 1, 7, 50)
source

Upsampling

The opposite of pooling, these layers increase the size of an array. They have no trainable parameters.

Flux.UpsampleType
Upsample(mode = :nearest; [scale, size]) 
+(1, 1, 7, 50)
source

Upsampling

The opposite of pooling, these layers increase the size of an array. They have no trainable parameters.

Flux.UpsampleType
Upsample(mode = :nearest; [scale, size]) 
 Upsample(scale, mode = :nearest)

An upsampling layer. One of two keywords must be given:

If scale is a number, this applies to all but the last two dimensions (channel and batch) of the input. It may also be a tuple, to control dimensions individually. Alternatively, keyword size accepts a tuple, to directly specify the leading dimensions of the output.

Currently supported upsampling modes and corresponding NNlib's methods are:

Examples

julia> m = Upsample(scale = (2, 3))
 Upsample(:nearest, scale = (2, 3))
 
@@ -246,7 +246,7 @@
 Upsample(:bilinear, size = (4, 5))
 
 julia> m(ones(2, 2, 1, 1)) |> size
-(4, 5, 1, 1)
source
Flux.PixelShuffleType
PixelShuffle(r::Int)

Pixel shuffling layer with upscale factor r. Usually used for generating higher resolution images while upscaling them.

See NNlib.pixel_shuffle.

Examples

julia> p = PixelShuffle(2);
+(4, 5, 1, 1)
source
Flux.PixelShuffleType
PixelShuffle(r::Int)

Pixel shuffling layer with upscale factor r. Usually used for generating higher resolution images while upscaling them.

See NNlib.pixel_shuffle.

Examples

julia> p = PixelShuffle(2);
 
 julia> xs = [2row + col + channel/10 for row in 1:2, col in 1:2, channel in 1:4, n in 1:1]
 2×2×4×1 Array{Float64, 4}:
@@ -298,7 +298,7 @@
  4.1  4.3  5.1  5.3  6.1  6.3
  4.2  4.4  5.2  5.4  6.2  6.4
  7.1  7.3  8.1  8.3  9.1  9.3
- 7.2  7.4  8.2  8.4  9.2  9.4
source

Embedding Vectors

These layers accept an index, and return a vector (or several indices, and several vectors). The possible embedding vectors are learned parameters.

Flux.EmbeddingType
Embedding(in => out; init=randn32)

A lookup table that stores embeddings of dimension out for a vocabulary of size in, as a trainable matrix.

This layer is often used to store word embeddings and retrieve them using indices. The input to the layer can be a vocabulary index in 1:in, an array of indices, or the corresponding onehot encoding.

For indices x, the result is of size (out, size(x)...), allowing several batch dimensions. For one-hot ohx, the result is of size (out, size(ohx)[2:end]...).

Examples

julia> emb = Embedding(26 => 4, init=Flux.identity_init(gain=22))
+ 7.2  7.4  8.2  8.4  9.2  9.4
source

Embedding Vectors

These layers accept an index, and return a vector (or several indices, and several vectors). The possible embedding vectors are learned parameters.

Flux.EmbeddingType
Embedding(in => out; init=randn32)

A lookup table that stores embeddings of dimension out for a vocabulary of size in, as a trainable matrix.

This layer is often used to store word embeddings and retrieve them using indices. The input to the layer can be a vocabulary index in 1:in, an array of indices, or the corresponding onehot encoding.

For indices x, the result is of size (out, size(x)...), allowing several batch dimensions. For one-hot ohx, the result is of size (out, size(ohx)[2:end]...).

Examples

julia> emb = Embedding(26 => 4, init=Flux.identity_init(gain=22))
 Embedding(26 => 4)  # 104 parameters
 
 julia> emb(2)  # one column of e.weight (here not random!)
@@ -319,7 +319,7 @@
 true
 
 julia> emb(rand(1:26, (10, 1, 12))) |> size  # three batch dimensions
-(4, 10, 1, 12)
source
Flux.EmbeddingBagType
EmbeddingBag(in => out, reduction=mean; init=Flux.randn32)

A lookup table that stores embeddings of dimension out for a vocabulary of size in. Differs from Embedding in that, instead of acting on a single vocabulary index, it always acts a vector of indices which it calls a "bag". Their individual embedding vectors are reduced to one, using mean or some other function.

Instead of acting on one "bag", such as x::Vector{Int}, the layer can also act on several:

  • Acting on a vector of "bags", it produces a matrix whose columns are the reduced vectors. More generally on x::Array{Vector{Int}}, its output is of size (out, size(x)...).

  • Any higher-rank array of integers is interpreted as a collection of "bags" each along the first dimension. Thus the output is mapslices(e, x; dims=1) when e::EmbeddingBag and x::Array{Int,N}. This method is more efficient, but requires that all "bags" have the same length.

  • A vector of "bags" may also be produced by splitting a vector of indices at specified points. For this case the layer takes two inputs, both vectors of integers. See details below.

The "bag" may equivalently be represented as a OneHotMatrix. A collection of these, or one higher-rank OneHotArray, again produce a stack of embeddings. See details below.

Examples

julia> vocab_size = 26;  # embed into 3 dimensions, with non-random vectors:
+(4, 10, 1, 12)
source
Flux.EmbeddingBagType
EmbeddingBag(in => out, reduction=mean; init=Flux.randn32)

A lookup table that stores embeddings of dimension out for a vocabulary of size in. Differs from Embedding in that, instead of acting on a single vocabulary index, it always acts a vector of indices which it calls a "bag". Their individual embedding vectors are reduced to one, using mean or some other function.

Instead of acting on one "bag", such as x::Vector{Int}, the layer can also act on several:

  • Acting on a vector of "bags", it produces a matrix whose columns are the reduced vectors. More generally on x::Array{Vector{Int}}, its output is of size (out, size(x)...).

  • Any higher-rank array of integers is interpreted as a collection of "bags" each along the first dimension. Thus the output is mapslices(e, x; dims=1) when e::EmbeddingBag and x::Array{Int,N}. This method is more efficient, but requires that all "bags" have the same length.

  • A vector of "bags" may also be produced by splitting a vector of indices at specified points. For this case the layer takes two inputs, both vectors of integers. See details below.

The "bag" may equivalently be represented as a OneHotMatrix. A collection of these, or one higher-rank OneHotArray, again produce a stack of embeddings. See details below.

Examples

julia> vocab_size = 26;  # embed into 3 dimensions, with non-random vectors:
 
 julia> eb = EmbeddingBag(vocab_size => 3, init=Flux.identity_init(gain=100))
 EmbeddingBag(26 => 3)  # 78 parameters
@@ -368,7 +368,7 @@
 3×2 Matrix{Float32}:
  33.3333    0.0
  66.6667    0.0
-  0.0     100.0
source

Dataflow Layers, or Containers

The basic Chain(F, G, H) applies the layers it contains in sequence, equivalent to H ∘ G ∘ F. Flux has some other layers which contain layers, but connect them up in a more complicated way: SkipConnection allows ResNet's residual connection.

Flux.ChainType
Chain(layers...)
+  0.0     100.0
source

Dataflow Layers, or Containers

The basic Chain(F, G, H) applies the layers it contains in sequence, equivalent to H ∘ G ∘ F. Flux has some other layers which contain layers, but connect them up in a more complicated way: SkipConnection allows ResNet's residual connection.

Flux.ChainType
Chain(layers...)
 Chain(name = layer, ...)

Collects multiple layers / functions to be called in sequence on a given input. Supports indexing and slicing, m[2] or m[1:end-1], and if names are given, m[:name] == m[1] etc.

Examples

julia> m = Chain(x -> x^2, x -> x+1);
 
 julia> m(5) == 26
@@ -385,12 +385,12 @@
                   dec = Dense(5 => 2));
 
 julia> m2(x) == (m2[:dec] ∘ m2[:enc])(x)
-true

For large models, there is a special type-unstable path which can reduce compilation times. This can be used by supplying a vector of layers Chain([layer1, layer2, ...]). This feature is somewhat experimental, beware!

source
Flux.activationsFunction
activations(c::Chain, input)

Like calling a Chain, but saves the result of each layer as an output.

Examples

julia> using Flux: activations
+true

For large models, there is a special type-unstable path which can reduce compilation times. This can be used by supplying a vector of layers Chain([layer1, layer2, ...]). This feature is somewhat experimental, beware!

source
Flux.activationsFunction
activations(c::Chain, input)

Like calling a Chain, but saves the result of each layer as an output.

Examples

julia> using Flux: activations
 
 julia> c = Chain(x -> x + 1, x -> x * 2, x -> x ^ 3);
 
 julia> activations(c, 1)
-(2, 4, 64)
source
Flux.MaxoutType
Maxout(layers...)
+(2, 4, 64)
source
Flux.MaxoutType
Maxout(layers...)
 Maxout(f, n_alts)

This contains a number of internal layers, each of which receives the same input. Its output is the elementwise maximum of the internal layers' outputs.

Instead of defining layers individually, you can provide a zero-argument function which constructs them, and the number to construct.

Maxout over linear dense layers satisfies the universal approximation theorem. See Goodfellow, Warde-Farley, Mirza, Courville & Bengio "Maxout Networks" https://arxiv.org/abs/1302.4389.

See also Parallel to reduce with other operators.

Examples

julia> m = Maxout(x -> abs2.(x), x -> x .* 3);
 
 julia> m([-2 -1 0 1 2])
@@ -405,7 +405,7 @@
 )                   # Total: 6 arrays, 126 parameters, 888 bytes.
 
 julia> Flux.outputsize(m3, (5, 11))
-(7, 11)
source
Flux.SkipConnectionType
SkipConnection(layer, connection)

Create a skip connection which consists of a layer or Chain of consecutive layers and a shortcut connection linking the block's input to the output through a user-supplied 2-argument callable. The first argument to the callable will be propagated through the given layer while the second is the unchanged, "skipped" input.

The simplest "ResNet"-type connection is just SkipConnection(layer, +). Here is a more complicated example:

julia> m = Conv((3,3), 4 => 7, pad=(1,1));
+(7, 11)
source
Flux.SkipConnectionType
SkipConnection(layer, connection)

Create a skip connection which consists of a layer or Chain of consecutive layers and a shortcut connection linking the block's input to the output through a user-supplied 2-argument callable. The first argument to the callable will be propagated through the given layer while the second is the unchanged, "skipped" input.

The simplest "ResNet"-type connection is just SkipConnection(layer, +). Here is a more complicated example:

julia> m = Conv((3,3), 4 => 7, pad=(1,1));
 
 julia> x = ones(Float32, 5, 5, 4, 10);
 
@@ -415,7 +415,7 @@
 julia> sm = SkipConnection(m, (mx, x) -> cat(mx, x, dims=3));
 
 julia> size(sm(x)) == (5, 5, 11, 10)
-true

See also Parallel, Maxout.

source
Flux.ParallelType
Parallel(connection, layers...)
+true

See also Parallel, Maxout.

source
Flux.ParallelType
Parallel(connection, layers...)
 Parallel(connection; name = layer, ...)

Create a layer which passes an input array to each path in layers, before reducing the output with connection.

Called with one input x, this is equivalent to connection([l(x) for l in layers]...). If called with multiple inputs, one is passed to each layer, thus Parallel(+, f, g)(x, y) = f(x) + g(y).

Like Chain, its sub-layers may be given names using the keyword constructor. These can be accessed by indexing: m[1] == m[:name] is the first layer.

See also SkipConnection which is Parallel with one identity, and Maxout which reduces by broadcasting max.

Examples

julia> model = Chain(Dense(3 => 5),
                      Parallel(vcat, Dense(5 => 4), Chain(Dense(5 => 7), Dense(7 => 4))),
                      Dense(8 => 17));
@@ -437,7 +437,7 @@
 (2,)
 
 julia> model2[:β] == model2[2]
-true
source
Flux.PairwiseFusionType
PairwiseFusion(connection, layers...)

Arguments

  • connection: A function taking 2 inputs and combining them into a single output
  • layers: The layers whose outputs are combined

Inputs

This layer behaves differently based on input type:

  1. If input x is a tuple of length N (or the input is xs with N x's), matching the number of layers,

then each layer receives a new input x[i] combined with the previous output y[i-1] using connection. Thus (y1, y2, y3) = PairwiseFusion(connection, layer1, layer2, layer3)((x1, x2, x3)) may be drawn as:

x1 → layer1 → y1 ↘
+true
source
Flux.PairwiseFusionType
PairwiseFusion(connection, layers...)

Arguments

  • connection: A function taking 2 inputs and combining them into a single output
  • layers: The layers whose outputs are combined

Inputs

This layer behaves differently based on input type:

  1. If input x is a tuple of length N (or the input is xs with N x's), matching the number of layers,

then each layer receives a new input x[i] combined with the previous output y[i-1] using connection. Thus (y1, y2, y3) = PairwiseFusion(connection, layer1, layer2, layer3)((x1, x2, x3)) may be drawn as:

x1 → layer1 → y1 ↘
                   connection → layer2 → y2 ↘
               x2 ↗                          connection → layer3 → y3
                                         x3 ↗

... or written as:

y1 = layer1(x1)
@@ -445,7 +445,7 @@
 y3 = layer3(connection(y2, x3))
  1. With just one input, each layer receives the same x combined with the previous output. Thus y = PairwiseFusion(connection, layers...)(x) obeys:
y[1] == layers[1](x)
 for i in 2:length(layers)
     y[i] == connection(layers[i](y[i-1]), x)
-end

Returns

A tuple of length N with the output of each fusion ((y1, y2, ..., yN) in the example above).

source

Recurrent Models

Much like the core layers above, but can be used to process sequence data (as well as other kinds of structured data).

Flux.RNNFunction
RNN(in => out, σ = tanh)

The most basic recurrent layer; essentially acts as a Dense layer, but with the output fed back into the input each time step.

The arguments in and out describe the size of the feature vectors passed as input and as output. That is, it accepts a vector of length in or a batch of vectors represented as a in x B matrix and outputs a vector of length out or a batch of vectors of size out x B.

This constructor is syntactic sugar for Recur(RNNCell(a...)), and so RNNs are stateful. Note that the state shape can change depending on the inputs, and so it is good to reset! the model between inference calls if the batch size changes. See the examples below.

Examples

julia> r = RNN(3 => 5)
+end

Returns

A tuple of length N with the output of each fusion ((y1, y2, ..., yN) in the example above).

source

Recurrent Models

Much like the core layers above, but can be used to process sequence data (as well as other kinds of structured data).

Flux.RNNFunction
RNN(in => out, σ = tanh)

The most basic recurrent layer; essentially acts as a Dense layer, but with the output fed back into the input each time step.

The arguments in and out describe the size of the feature vectors passed as input and as output. That is, it accepts a vector of length in or a batch of vectors represented as a in x B matrix and outputs a vector of length out or a batch of vectors of size out x B.

This constructor is syntactic sugar for Recur(RNNCell(a...)), and so RNNs are stateful. Note that the state shape can change depending on the inputs, and so it is good to reset! the model between inference calls if the batch size changes. See the examples below.

Examples

julia> r = RNN(3 => 5)
 Recur(
   RNNCell(3 => 5, tanh),                # 50 parameters
 )         # Total: 4 trainable arrays, 50 parameters,
@@ -484,7 +484,7 @@
 julia> r = Flux.Recur(Flux.RNNCell(tanh, rand(5, 4), Tridiagonal(rand(5, 5)), rand(5), rand(5, 1)))
 
 julia> r(rand(4, 10)) |> size # batch size of 10
-(5, 10)
source
Flux.LSTMFunction
LSTM(in => out)

Long Short Term Memory recurrent layer. Behaves like an RNN but generally exhibits a longer memory span over sequences.

The arguments in and out describe the size of the feature vectors passed as input and as output. That is, it accepts a vector of length in or a batch of vectors represented as a in x B matrix and outputs a vector of length out or a batch of vectors of size out x B.

This constructor is syntactic sugar for Recur(LSTMCell(a...)), and so LSTMs are stateful. Note that the state shape can change depending on the inputs, and so it is good to reset! the model between inference calls if the batch size changes. See the examples below.

See this article for a good overview of the internals.

Examples

julia> l = LSTM(3 => 5)
+(5, 10)
source
Flux.LSTMFunction
LSTM(in => out)

Long Short Term Memory recurrent layer. Behaves like an RNN but generally exhibits a longer memory span over sequences.

The arguments in and out describe the size of the feature vectors passed as input and as output. That is, it accepts a vector of length in or a batch of vectors represented as a in x B matrix and outputs a vector of length out or a batch of vectors of size out x B.

This constructor is syntactic sugar for Recur(LSTMCell(a...)), and so LSTMs are stateful. Note that the state shape can change depending on the inputs, and so it is good to reset! the model between inference calls if the batch size changes. See the examples below.

See this article for a good overview of the internals.

Examples

julia> l = LSTM(3 => 5)
 Recur(
   LSTMCell(3 => 5),                     # 190 parameters
 )         # Total: 5 trainable arrays, 190 parameters,
@@ -496,7 +496,7 @@
 julia> Flux.reset!(l);
 
 julia> l(rand(Float32, 3, 10)) |> size # batch size of 10
-(5, 10)
Batch size changes

Failing to call reset! when the input batch size changes can lead to unexpected behavior. See the example in RNN.

Note:

LSTMCells can be constructed directly by specifying the non-linear function, the Wi and Wh internal matrices, a bias vector b, and a learnable initial state state0. The Wi and Wh matrices do not need to be the same type. See the example in RNN.

source
Flux.GRUFunction
GRU(in => out)

Gated Recurrent Unit layer. Behaves like an RNN but generally exhibits a longer memory span over sequences. This implements the variant proposed in v1 of the referenced paper.

The integer arguments in and out describe the size of the feature vectors passed as input and as output. That is, it accepts a vector of length in or a batch of vectors represented as a in x B matrix and outputs a vector of length out or a batch of vectors of size out x B.

This constructor is syntactic sugar for Recur(GRUCell(a...)), and so GRUs are stateful. Note that the state shape can change depending on the inputs, and so it is good to reset! the model between inference calls if the batch size changes. See the examples below.

See this article for a good overview of the internals.

Examples

julia> g = GRU(3 => 5)
+(5, 10)
Batch size changes

Failing to call reset! when the input batch size changes can lead to unexpected behavior. See the example in RNN.

Note:

LSTMCells can be constructed directly by specifying the non-linear function, the Wi and Wh internal matrices, a bias vector b, and a learnable initial state state0. The Wi and Wh matrices do not need to be the same type. See the example in RNN.

source
Flux.GRUFunction
GRU(in => out)

Gated Recurrent Unit layer. Behaves like an RNN but generally exhibits a longer memory span over sequences. This implements the variant proposed in v1 of the referenced paper.

The integer arguments in and out describe the size of the feature vectors passed as input and as output. That is, it accepts a vector of length in or a batch of vectors represented as a in x B matrix and outputs a vector of length out or a batch of vectors of size out x B.

This constructor is syntactic sugar for Recur(GRUCell(a...)), and so GRUs are stateful. Note that the state shape can change depending on the inputs, and so it is good to reset! the model between inference calls if the batch size changes. See the examples below.

See this article for a good overview of the internals.

Examples

julia> g = GRU(3 => 5)
 Recur(
   GRUCell(3 => 5),                      # 140 parameters
 )         # Total: 4 trainable arrays, 140 parameters,
@@ -508,7 +508,7 @@
 julia> Flux.reset!(g);
 
 julia> g(rand(Float32, 3, 10)) |> size # batch size of 10
-(5, 10)
Batch size changes

Failing to call reset! when the input batch size changes can lead to unexpected behavior. See the example in RNN.

Note:

GRUCells can be constructed directly by specifying the non-linear function, the Wi and Wh internal matrices, a bias vector b, and a learnable initial state state0. The Wi and Wh matrices do not need to be the same type. See the example in RNN.

source
Flux.GRUv3Function
GRUv3(in => out)

Gated Recurrent Unit layer. Behaves like an RNN but generally exhibits a longer memory span over sequences. This implements the variant proposed in v3 of the referenced paper.

The arguments in and out describe the size of the feature vectors passed as input and as output. That is, it accepts a vector of length in or a batch of vectors represented as a in x B matrix and outputs a vector of length out or a batch of vectors of size out x B.

This constructor is syntactic sugar for Recur(GRUv3Cell(a...)), and so GRUv3s are stateful. Note that the state shape can change depending on the inputs, and so it is good to reset! the model between inference calls if the batch size changes. See the examples below.

See this article for a good overview of the internals.

Examples

julia> g = GRUv3(3 => 5)
+(5, 10)
Batch size changes

Failing to call reset! when the input batch size changes can lead to unexpected behavior. See the example in RNN.

Note:

GRUCells can be constructed directly by specifying the non-linear function, the Wi and Wh internal matrices, a bias vector b, and a learnable initial state state0. The Wi and Wh matrices do not need to be the same type. See the example in RNN.

source
Flux.GRUv3Function
GRUv3(in => out)

Gated Recurrent Unit layer. Behaves like an RNN but generally exhibits a longer memory span over sequences. This implements the variant proposed in v3 of the referenced paper.

The arguments in and out describe the size of the feature vectors passed as input and as output. That is, it accepts a vector of length in or a batch of vectors represented as a in x B matrix and outputs a vector of length out or a batch of vectors of size out x B.

This constructor is syntactic sugar for Recur(GRUv3Cell(a...)), and so GRUv3s are stateful. Note that the state shape can change depending on the inputs, and so it is good to reset! the model between inference calls if the batch size changes. See the examples below.

See this article for a good overview of the internals.

Examples

julia> g = GRUv3(3 => 5)
 Recur(
   GRUv3Cell(3 => 5),                    # 140 parameters
 )         # Total: 5 trainable arrays, 140 parameters,
@@ -520,7 +520,7 @@
 julia> Flux.reset!(g);
 
 julia> g(rand(Float32, 3, 10)) |> size # batch size of 10
-(5, 10)
Batch size changes

Failing to call reset! when the input batch size changes can lead to unexpected behavior. See the example in RNN.

Note:

GRUv3Cells can be constructed directly by specifying the non-linear function, the Wi, Wh, and Wh_h internal matrices, a bias vector b, and a learnable initial state state0. The Wi, Wh, and Wh_h matrices do not need to be the same type. See the example in RNN.

source
Flux.RecurType
Recur(cell)

Recur takes a recurrent cell and makes it stateful, managing the hidden state in the background. cell should be a model of the form:

h, y = cell(h, x...)

For example, here's a recurrent network that keeps a running total of its inputs:

Examples

julia> accum(h, x) = (h + x, x)
+(5, 10)
Batch size changes

Failing to call reset! when the input batch size changes can lead to unexpected behavior. See the example in RNN.

Note:

GRUv3Cells can be constructed directly by specifying the non-linear function, the Wi, Wh, and Wh_h internal matrices, a bias vector b, and a learnable initial state state0. The Wi, Wh, and Wh_h matrices do not need to be the same type. See the example in RNN.

source
Flux.RecurType
Recur(cell)

Recur takes a recurrent cell and makes it stateful, managing the hidden state in the background. cell should be a model of the form:

h, y = cell(h, x...)

For example, here's a recurrent network that keeps a running total of its inputs:

Examples

julia> accum(h, x) = (h + x, x)
 accum (generic function with 1 method)
 
 julia> rnn = Flux.Recur(accum, 0)
@@ -571,7 +571,7 @@
 
 julia> rnn.state
 1×1 Matrix{Int64}:
- 60
source
Flux.reset!Function
reset!(rnn)

Reset the hidden state of a recurrent layer back to its original value.

Assuming you have a Recur layer rnn, this is roughly equivalent to:

rnn.state = hidden(rnn.cell)

Examples

julia> r = Flux.RNNCell(relu, ones(1,1), zeros(1,1), ones(1,1), zeros(1,1));  # users should use the RNN wrapper struct instead
+ 60
source
Flux.reset!Function
reset!(rnn)

Reset the hidden state of a recurrent layer back to its original value.

Assuming you have a Recur layer rnn, this is roughly equivalent to:

rnn.state = hidden(rnn.cell)

Examples

julia> r = Flux.RNNCell(relu, ones(1,1), zeros(1,1), ones(1,1), zeros(1,1));  # users should use the RNN wrapper struct instead
 
 julia> y = Flux.Recur(r, ones(1,1));
 
@@ -593,7 +593,7 @@
 
 julia> y.state
 1×1 Matrix{Float64}:
- 0.0
source

Normalisation & Regularisation

These layers don't affect the structure of the network but may improve training times or reduce overfitting. Some of them contain trainable parameters, while others do not.

Flux.BatchNormType
BatchNorm(channels::Integer, λ=identity;
+ 0.0
source

Normalisation & Regularisation

These layers don't affect the structure of the network but may improve training times or reduce overfitting. Some of them contain trainable parameters, while others do not.

Flux.BatchNormType
BatchNorm(channels::Integer, λ=identity;
           initβ=zeros32, initγ=ones32,
           affine=true, track_stats=true, active=nothing,
           eps=1f-5, momentum= 0.1f0)

Batch Normalization layer. channels should be the size of the channel dimension in your data (see below).

Given an array with N dimensions, call the N-1th the channel dimension. For a batch of feature vectors this is just the data dimension, for WHCN images it's the usual channel dimension.

BatchNorm computes the mean and variance for each D_1×...×D_{N-2}×1×D_N input slice and normalises the input accordingly.

If affine=true, it also applies a shift and a rescale to the input through to learnable per-channel bias β and scale γ parameters.

After normalisation, elementwise activation λ is applied.

If track_stats=true, accumulates mean and var statistics in training phase that will be used to renormalize the input in test phase.

Use testmode! during inference.

Examples

julia> using Statistics
@@ -605,7 +605,7 @@
 julia> Flux.trainmode!(m);
 
 julia> isapprox(std(m(xs)), 1, atol=0.1) && std(xs) != std(m(xs))
-true
source
Flux.DropoutType
Dropout(p; [dims, rng, active])

Layer implementing dropout with the given probability. This is used as a regularisation, i.e. to reduce overfitting.

While training, it sets each input to 0 (with probability p) or else scales it by 1 / (1 - p), using the NNlib.dropout function. While testing, it has no effect.

By default the mode will switch automatically, but it can also be controlled manually via Flux.testmode!, or by passing keyword active=true for training mode.

By default every input is treated independently. With the dims keyword, instead it takes a random choice only along that dimension. For example Dropout(p; dims = 3) will randomly zero out entire channels on WHCN input (also called 2D dropout).

Keyword rng lets you specify a custom random number generator. (Only supported on the CPU.)

Examples

julia> m = Chain(Dense(ones(3,2)), Dropout(0.4))
+true
source
Flux.DropoutType
Dropout(p; [dims, rng, active])

Layer implementing dropout with the given probability. This is used as a regularisation, i.e. to reduce overfitting.

While training, it sets each input to 0 (with probability p) or else scales it by 1 / (1 - p), using the NNlib.dropout function. While testing, it has no effect.

By default the mode will switch automatically, but it can also be controlled manually via Flux.testmode!, or by passing keyword active=true for training mode.

By default every input is treated independently. With the dims keyword, instead it takes a random choice only along that dimension. For example Dropout(p; dims = 3) will randomly zero out entire channels on WHCN input (also called 2D dropout).

Keyword rng lets you specify a custom random number generator. (Only supported on the CPU.)

Examples

julia> m = Chain(Dense(ones(3,2)), Dropout(0.4))
 Chain(
   Dense(2 => 3),                        # 9 parameters
   Dropout(0.4),
@@ -637,7 +637,7 @@
 1.9989999999999961
 
 julia> mean(iszero, y)  # is about 0.4
-0.4003
source
Flux.AlphaDropoutType
AlphaDropout(p; [rng, active])

A dropout layer. Used in Self-Normalizing Neural Networks. The AlphaDropout layer ensures that mean and variance of activations remain the same as before.

Does nothing to the input once testmode! is true.

Examples

julia> using Statistics
+0.4003
source
Flux.AlphaDropoutType
AlphaDropout(p; [rng, active])

A dropout layer. Used in Self-Normalizing Neural Networks. The AlphaDropout layer ensures that mean and variance of activations remain the same as before.

Does nothing to the input once testmode! is true.

Examples

julia> using Statistics
 
 julia> x = randn32(1000,1);
 
@@ -648,7 +648,7 @@
 julia> y = m(x);
 
 julia> isapprox(std(x), std(y), atol=0.2)
-true
source
Flux.LayerNormType
LayerNorm(size..., λ=identity; affine=true, eps=1f-5)

A normalisation layer designed to be used with recurrent hidden states. The argument size should be an integer or a tuple of integers.

In the forward pass, the layer normalises the mean and standard deviation of the input, then applies the elementwise activation λ. The input is normalised along the first length(size) dimensions for tuple size, and along the first dimension for integer size. The input is expected to have first dimensions' size equal to size.

If affine=true, it also applies a learnable shift and rescaling using the Scale layer.

See also BatchNorm, InstanceNorm, GroupNorm, and normalise.

Examples

julia> using Statistics
+true
source
Flux.LayerNormType
LayerNorm(size..., λ=identity; affine=true, eps=1f-5)

A normalisation layer designed to be used with recurrent hidden states. The argument size should be an integer or a tuple of integers.

In the forward pass, the layer normalises the mean and standard deviation of the input, then applies the elementwise activation λ. The input is normalised along the first length(size) dimensions for tuple size, and along the first dimension for integer size. The input is expected to have first dimensions' size equal to size.

If affine=true, it also applies a learnable shift and rescaling using the Scale layer.

See also BatchNorm, InstanceNorm, GroupNorm, and normalise.

Examples

julia> using Statistics
 
 julia> xs = rand(3, 3, 3, 2);  # a batch of 2 images, each having 3 channels
 
@@ -657,7 +657,7 @@
 julia> y = m(xs);
 
 julia> isapprox(std(y, dims=1:3), ones(1, 1, 1, 2), atol=0.1) && std(y, dims=1:3) != std(xs, dims=1:3)
-true
source
Flux.InstanceNormType
InstanceNorm(channels::Integer, λ=identity;
+true
source
Flux.InstanceNormType
InstanceNorm(channels::Integer, λ=identity;
              initβ=zeros32, initγ=ones32,
              affine=false, track_stats=false,
              eps=1f-5, momentum=0.1f0)

Instance Normalization layer. channels should be the size of the channel dimension in your data (see below).

Given an array with N > 2 dimensions, call the N-1th the channel dimension. For WHCN images it's the usual channel dimension.

InstanceNorm computes the mean and variance for each D_1×...×D_{N-2}×1×1 input slice and normalises the input accordingly.

If affine=true, it also applies a shift and a rescale to the input through to learnable per-channel bias β and scale γ parameters.

If track_stats=true, accumulates mean and var statistics in training phase that will be used to renormalize the input in test phase.

Warning: the defaults for affine and track_stats used to be true in previous Flux versions (< v0.12).

Examples

julia> using Statistics
@@ -669,7 +669,7 @@
 julia> y = m(xs);
 
 julia> isapprox(std(y, dims=1:2), ones(1, 1, 3, 2), atol=0.2) && std(y, dims=1:2) != std(xs, dims=1:2)
-true
source
Flux.GroupNormType
GroupNorm(channels::Int, G::Int, λ = identity;
+true
source
Flux.GroupNormType
GroupNorm(channels::Int, G::Int, λ = identity;
           initβ = zeros32,
           initγ = ones32,
           affine = true,
@@ -686,7 +686,7 @@
 true
 
 julia> isapprox(std(y[:, :, 3:4, 2]), 1, atol=0.1) && std(xs[:, :, 3:4, 2]) != std(y[:, :, 3:4, 2])
-true
source
Flux.normaliseFunction
normalise(x; dims=ndims(x), eps=1e-5)

Normalise x to mean 0 and standard deviation 1 across the dimension(s) given by dims. Per default, dims is the last dimension. eps is a small term added to the denominator for numerical stability.

Examples

julia> using Statistics
+true
source
Flux.normaliseFunction
normalise(x; dims=ndims(x), eps=1e-5)

Normalise x to mean 0 and standard deviation 1 across the dimension(s) given by dims. Per default, dims is the last dimension. eps is a small term added to the denominator for numerical stability.

Examples

julia> using Statistics
 
 julia> x = [90, 100, 110, 130, 70];
 
@@ -709,7 +709,7 @@
 julia> y = Flux.normalise(x, dims=1);
 
 julia> isapprox(std(y; dims=1, corrected=false), ones(1, 10), atol=1e-5)
-true
source

Test vs. Train

Several normalisation layers behave differently under training and inference (testing). By default, Flux will automatically determine when a layer evaluation is part of training or inference.

Warning

This automatic train/test detection works best with Zygote, the default automatic differentiation package. It may not work with other packages such as Tracker, Yota, or ForwardDiff.

The functions Flux.trainmode! and Flux.testmode! let you manually specify which behaviour you want. When called on a model, they will place all layers within the model into the specified mode.

Flux.testmode!Method
testmode!(model, [mode]) -> model

Set a layer, or all layers in a model, to test mode. This disables the effect of Dropout and some other regularisation layers.

If you manually set a model into test mode, you need to manually place it back into train mode during training phase, using trainmode!.

There is an optional second argument, which takes a symbol :auto to reset all layers back to the default automatic mode.

Example

julia> d = Dropout(0.3)
+true
source

Test vs. Train

Several normalisation layers behave differently under training and inference (testing). By default, Flux will automatically determine when a layer evaluation is part of training or inference.

Warning

This automatic train/test detection works best with Zygote, the default automatic differentiation package. It may not work with other packages such as Tracker, Yota, or ForwardDiff.

The functions Flux.trainmode! and Flux.testmode! let you manually specify which behaviour you want. When called on a model, they will place all layers within the model into the specified mode.

Flux.testmode!Method
testmode!(model, [mode]) -> model

Set a layer, or all layers in a model, to test mode. This disables the effect of Dropout and some other regularisation layers.

If you manually set a model into test mode, you need to manually place it back into train mode during training phase, using trainmode!.

There is an optional second argument, which takes a symbol :auto to reset all layers back to the default automatic mode.

Example

julia> d = Dropout(0.3)
 Dropout(0.3)
 
 julia> testmode!(d)   # dropout is now always disabled
@@ -719,4 +719,4 @@
 Dropout(0.3, active=true)
 
 julia> testmode!(d, :auto)  # back to default
-Dropout(0.3)
source
Flux.testmode!Method
testmode!(model, inactive)

This two-argument method is largely internal. It recurses into the model, and until a method like testmode!(d::Dropout, inactive) alters the activity of a layer. Custom layers can support manual testmode! / trainmode! switching by defining such a method.

Possible values of inactive are:

  • true for testing, i.e. active=false
  • false for training, same as trainmode!(m)
  • :auto or nothing for Flux to detect training automatically.
Compat

This method may be removed in a future breaking change, to separate the user-facing testmode! from the internal recursion.

source
Flux.trainmode!Function
trainmode!(model) -> model

Set a layer, or all layers in a model, to training mode. Opposite to testmode!, see further details there.

source
trainmode!(m, active)
Warning

This two-argument method is deprecated.

Possible values of active are:

  • true for training, or
  • false for testing, same as testmode!(m)
  • :auto or nothing for Flux to detect training automatically.
source
+Dropout(0.3)source
Flux.testmode!Method
testmode!(model, inactive)

This two-argument method is largely internal. It recurses into the model, and until a method like testmode!(d::Dropout, inactive) alters the activity of a layer. Custom layers can support manual testmode! / trainmode! switching by defining such a method.

Possible values of inactive are:

  • true for testing, i.e. active=false
  • false for training, same as trainmode!(m)
  • :auto or nothing for Flux to detect training automatically.
Compat

This method may be removed in a future breaking change, to separate the user-facing testmode! from the internal recursion.

source
Flux.trainmode!Function
trainmode!(model) -> model

Set a layer, or all layers in a model, to training mode. Opposite to testmode!, see further details there.

source
trainmode!(m, active)
Warning

This two-argument method is deprecated.

Possible values of active are:

  • true for training, or
  • false for testing, same as testmode!(m)
  • :auto or nothing for Flux to detect training automatically.
source
diff --git a/dev/reference/models/losses/index.html b/dev/reference/models/losses/index.html index 9527e9c9cd..cdfe68e2eb 100644 --- a/dev/reference/models/losses/index.html +++ b/dev/reference/models/losses/index.html @@ -10,16 +10,16 @@ loss(ŷ, y, agg=identity) # no aggregation.

Function listing

Flux.Losses.maeFunction
mae(ŷ, y; agg = mean)

Return the loss corresponding to mean absolute error:

agg(abs.(ŷ .- y))

Example

julia> y_model = [1.1, 1.9, 3.1];
 
 julia> Flux.mae(y_model, 1:3)
-0.10000000000000009
source
Flux.Losses.mseFunction
mse(ŷ, y; agg = mean)

Return the loss corresponding to mean square error:

agg((ŷ .- y) .^ 2)

See also: mae, msle, crossentropy.

Example

julia> y_model = [1.1, 1.9, 3.1];
+0.10000000000000009
source
Flux.Losses.mseFunction
mse(ŷ, y; agg = mean)

Return the loss corresponding to mean square error:

agg((ŷ .- y) .^ 2)

See also: mae, msle, crossentropy.

Example

julia> y_model = [1.1, 1.9, 3.1];
 
 julia> y_true = 1:3;
 
 julia> Flux.mse(y_model, y_true)
-0.010000000000000018
source
Flux.Losses.msleFunction
msle(ŷ, y; agg = mean, eps = eps(eltype(ŷ)))

The loss corresponding to mean squared logarithmic errors, calculated as

agg((log.(ŷ .+ ϵ) .- log.(y .+ ϵ)) .^ 2)

The ϵ == eps term provides numerical stability. Penalizes an under-estimation more than an over-estimatation.

Example

julia> Flux.msle(Float32[1.1, 2.2, 3.3], 1:3)
+0.010000000000000018
source
Flux.Losses.msleFunction
msle(ŷ, y; agg = mean, eps = eps(eltype(ŷ)))

The loss corresponding to mean squared logarithmic errors, calculated as

agg((log.(ŷ .+ ϵ) .- log.(y .+ ϵ)) .^ 2)

The ϵ == eps term provides numerical stability. Penalizes an under-estimation more than an over-estimatation.

Example

julia> Flux.msle(Float32[1.1, 2.2, 3.3], 1:3)
 0.009084041f0
 
 julia> Flux.msle(Float32[0.9, 1.8, 2.7], 1:3)
-0.011100831f0
source
Flux.Losses.huber_lossFunction
huber_loss(ŷ, y; delta = 1, agg = mean)

Return the mean of the Huber loss given the prediction and true values y.

             | 0.5 * |ŷ - y|^2,            for |ŷ - y| <= δ
+0.011100831f0
source
Flux.Losses.huber_lossFunction
huber_loss(ŷ, y; delta = 1, agg = mean)

Return the mean of the Huber loss given the prediction and true values y.

             | 0.5 * |ŷ - y|^2,            for |ŷ - y| <= δ
 Huber loss = |
              |  δ * (|ŷ - y| - 0.5 * δ), otherwise

Example

julia> ŷ = [1.1, 2.1, 3.1];
 
@@ -27,7 +27,7 @@
 0.005000000000000009
 
 julia> Flux.huber_loss(ŷ, 1:3, delta=0.05)  # changes behaviour as |ŷ - y| > δ
-0.003750000000000005
source
Flux.Losses.label_smoothingFunction
label_smoothing(y::Union{Number, AbstractArray}, α; dims::Int=1)

Returns smoothed labels, meaning the confidence on label values are relaxed.

When y is given as one-hot vector or batch of one-hot, its calculated as

y .* (1 - α) .+ α / size(y, dims)

when y is given as a number or batch of numbers for binary classification, its calculated as

y .* (1 - α) .+ α / 2

in which case the labels are squeezed towards 0.5.

α is a number in interval (0, 1) called the smoothing factor. Higher the value of α larger the smoothing of y.

dims denotes the one-hot dimension, unless dims=0 which denotes the application of label smoothing to binary distributions encoded in a single number.

Example

julia> y = Flux.onehotbatch([1, 1, 1, 0, 1, 0], 0:1)
+0.003750000000000005
source
Flux.Losses.label_smoothingFunction
label_smoothing(y::Union{Number, AbstractArray}, α; dims::Int=1)

Returns smoothed labels, meaning the confidence on label values are relaxed.

When y is given as one-hot vector or batch of one-hot, its calculated as

y .* (1 - α) .+ α / size(y, dims)

when y is given as a number or batch of numbers for binary classification, its calculated as

y .* (1 - α) .+ α / 2

in which case the labels are squeezed towards 0.5.

α is a number in interval (0, 1) called the smoothing factor. Higher the value of α larger the smoothing of y.

dims denotes the one-hot dimension, unless dims=0 which denotes the application of label smoothing to binary distributions encoded in a single number.

Example

julia> y = Flux.onehotbatch([1, 1, 1, 0, 1, 0], 0:1)
 2×6 OneHotMatrix(::Vector{UInt32}) with eltype Bool:
  ⋅  ⋅  ⋅  1  ⋅  1
  1  1  1  ⋅  1  ⋅
@@ -51,7 +51,7 @@
 true
 
 julia> Flux.crossentropy(y_dis, y) > Flux.crossentropy(y_dis, y_smoothed)
-true
source
Flux.Losses.crossentropyFunction
crossentropy(ŷ, y; dims = 1, eps = eps(eltype(ŷ)), agg = mean)

Return the cross entropy between the given probability distributions; calculated as

agg(-sum(y .* log.(ŷ .+ ϵ); dims))

Cross entropy is typically used as a loss in multi-class classification, in which case the labels y are given in a one-hot format. dims specifies the dimension (or the dimensions) containing the class probabilities. The prediction is supposed to sum to one across dims, as would be the case with the output of a softmax operation.

For numerical stability, it is recommended to use logitcrossentropy rather than softmax followed by crossentropy .

Use label_smoothing to smooth the true labels as preprocessing before computing the loss.

See also: logitcrossentropy, binarycrossentropy, logitbinarycrossentropy.

Example

julia> y_label = Flux.onehotbatch([0, 1, 2, 1, 0], 0:2)
+true
source
Flux.Losses.crossentropyFunction
crossentropy(ŷ, y; dims = 1, eps = eps(eltype(ŷ)), agg = mean)

Return the cross entropy between the given probability distributions; calculated as

agg(-sum(y .* log.(ŷ .+ ϵ); dims))

Cross entropy is typically used as a loss in multi-class classification, in which case the labels y are given in a one-hot format. dims specifies the dimension (or the dimensions) containing the class probabilities. The prediction is supposed to sum to one across dims, as would be the case with the output of a softmax operation.

For numerical stability, it is recommended to use logitcrossentropy rather than softmax followed by crossentropy .

Use label_smoothing to smooth the true labels as preprocessing before computing the loss.

See also: logitcrossentropy, binarycrossentropy, logitbinarycrossentropy.

Example

julia> y_label = Flux.onehotbatch([0, 1, 2, 1, 0], 0:2)
 3×5 OneHotMatrix(::Vector{UInt32}) with eltype Bool:
  1  ⋅  ⋅  ⋅  1
  ⋅  1  ⋅  1  ⋅
@@ -80,7 +80,7 @@
  0.05  0.05  0.9   0.05  0.05
 
 julia> Flux.crossentropy(y_model, y_smooth)
-1.5776052f0
source
Flux.Losses.logitcrossentropyFunction
logitcrossentropy(ŷ, y; dims = 1, agg = mean)

Return the cross entropy calculated by

agg(-sum(y .* logsoftmax(ŷ; dims); dims))

This is mathematically equivalent to crossentropy(softmax(ŷ), y), but is more numerically stable than using functions crossentropy and softmax separately.

See also: binarycrossentropy, logitbinarycrossentropy, label_smoothing.

Example

julia> y_label = Flux.onehotbatch(collect("abcabaa"), 'a':'c')
+1.5776052f0
source
Flux.Losses.logitcrossentropyFunction
logitcrossentropy(ŷ, y; dims = 1, agg = mean)

Return the cross entropy calculated by

agg(-sum(y .* logsoftmax(ŷ; dims); dims))

This is mathematically equivalent to crossentropy(softmax(ŷ), y), but is more numerically stable than using functions crossentropy and softmax separately.

See also: binarycrossentropy, logitbinarycrossentropy, label_smoothing.

Example

julia> y_label = Flux.onehotbatch(collect("abcabaa"), 'a':'c')
 3×7 OneHotMatrix(::Vector{UInt32}) with eltype Bool:
  1  ⋅  ⋅  1  ⋅  1  1
  ⋅  1  ⋅  ⋅  1  ⋅  ⋅
@@ -96,7 +96,7 @@
 1.5791205f0
 
 julia> Flux.crossentropy(softmax(y_model), y_label)
-1.5791197f0
source
Flux.Losses.binarycrossentropyFunction
binarycrossentropy(ŷ, y; agg = mean, eps = eps(eltype(ŷ)))

Return the binary cross-entropy loss, computed as

agg(@.(-y * log(ŷ + ϵ) - (1 - y) * log(1 - ŷ + ϵ)))

Where typically, the prediction is given by the output of a sigmoid activation. The ϵ == eps term is included to avoid infinity. Using logitbinarycrossentropy is recomended over binarycrossentropy for numerical stability.

Use label_smoothing to smooth the y value as preprocessing before computing the loss.

See also: crossentropy, logitcrossentropy.

Examples

julia> y_bin = Bool[1,0,1]
+1.5791197f0
source
Flux.Losses.binarycrossentropyFunction
binarycrossentropy(ŷ, y; agg = mean, eps = eps(eltype(ŷ)))

Return the binary cross-entropy loss, computed as

agg(@.(-y * log(ŷ + ϵ) - (1 - y) * log(1 - ŷ + ϵ)))

Where typically, the prediction is given by the output of a sigmoid activation. The ϵ == eps term is included to avoid infinity. Using logitbinarycrossentropy is recomended over binarycrossentropy for numerical stability.

Use label_smoothing to smooth the y value as preprocessing before computing the loss.

See also: crossentropy, logitcrossentropy.

Examples

julia> y_bin = Bool[1,0,1]
 3-element Vector{Bool}:
  1
  0
@@ -119,7 +119,7 @@
  1  ⋅  1
 
 julia> Flux.crossentropy(y_prob, y_hot)
-0.43989f0
source
Flux.Losses.logitbinarycrossentropyFunction
logitbinarycrossentropy(ŷ, y; agg = mean)

Mathematically equivalent to binarycrossentropy(σ(ŷ), y) but is more numerically stable.

See also: crossentropy, logitcrossentropy.

Examples

julia> y_bin = Bool[1,0,1];
+0.43989f0
source
Flux.Losses.logitbinarycrossentropyFunction
logitbinarycrossentropy(ŷ, y; agg = mean)

Mathematically equivalent to binarycrossentropy(σ(ŷ), y) but is more numerically stable.

See also: crossentropy, logitcrossentropy.

Examples

julia> y_bin = Bool[1,0,1];
 
 julia> y_model = Float32[2, -1, pi]
 3-element Vector{Float32}:
@@ -131,7 +131,7 @@
 0.160832f0
 
 julia> Flux.binarycrossentropy(sigmoid.(y_model), y_bin)
-0.16083185f0
source
Flux.Losses.kldivergenceFunction
kldivergence(ŷ, y; agg = mean, eps = eps(eltype(ŷ)))

Return the Kullback-Leibler divergence between the given probability distributions.

The KL divergence is a measure of how much one probability distribution is different from the other. It is always non-negative, and zero only when both the distributions are equal.

Example

julia> p1 = [1 0; 0 1]
+0.16083185f0
source
Flux.Losses.kldivergenceFunction
kldivergence(ŷ, y; agg = mean, eps = eps(eltype(ŷ)))

Return the Kullback-Leibler divergence between the given probability distributions.

The KL divergence is a measure of how much one probability distribution is different from the other. It is always non-negative, and zero only when both the distributions are equal.

Example

julia> p1 = [1 0; 0 1]
 2×2 Matrix{Int64}:
  1  0
  0  1
@@ -151,10 +151,10 @@
 0.0
 
 julia> Flux.kldivergence(p1, p2; eps = 0)  # about 17.3 with the regulator
-Inf
source
Flux.Losses.poisson_lossFunction
poisson_loss(ŷ, y; agg = mean)

Return how much the predicted distribution diverges from the expected Poisson distribution y; calculated as -

sum(ŷ .- y .* log.(ŷ)) / size(y, 2)

More information..

Example

julia> y_model = [1, 3, 3];  # data should only take integral values
+Inf
source
Flux.Losses.poisson_lossFunction
poisson_loss(ŷ, y; agg = mean)

Return how much the predicted distribution diverges from the expected Poisson distribution y; calculated as -

sum(ŷ .- y .* log.(ŷ)) / size(y, 2)

More information..

Example

julia> y_model = [1, 3, 3];  # data should only take integral values
 
 julia> Flux.poisson_loss(y_model, 1:3)
-0.5023128522198171
source
Flux.Losses.hinge_lossFunction
hinge_loss(ŷ, y; agg = mean)

Return the hinge_loss given the prediction and true labels y (containing 1 or -1); calculated as

sum(max.(0, 1 .- ŷ .* y)) / size(y, 2)

Usually used with classifiers like Support Vector Machines. See also: squared_hinge_loss

Example

julia> y_true = [1, -1, 1, 1];
+0.5023128522198171
source
Flux.Losses.hinge_lossFunction
hinge_loss(ŷ, y; agg = mean)

Return the hinge_loss given the prediction and true labels y (containing 1 or -1); calculated as

sum(max.(0, 1 .- ŷ .* y)) / size(y, 2)

Usually used with classifiers like Support Vector Machines. See also: squared_hinge_loss

Example

julia> y_true = [1, -1, 1, 1];
 
 julia> y_pred = [0.1, 0.3, 1, 1.5];
 
@@ -168,7 +168,7 @@
 true
 
 julia> Flux.hinge_loss(y_pred[2], y_true[2]) != 0 # opposite signs
-true
source
Flux.Losses.squared_hinge_lossFunction
squared_hinge_loss(ŷ, y)

Return the squared hinge_loss loss given the prediction and true labels y (containing 1 or -1); calculated as

sum((max.(0, 1 .- ŷ .* y)).^2) / size(y, 2)

Usually used with classifiers like Support Vector Machines. See also: hinge_loss

Example

julia> y_true = [1, -1, 1, 1];
+true
source
Flux.Losses.squared_hinge_lossFunction
squared_hinge_loss(ŷ, y)

Return the squared hinge_loss loss given the prediction and true labels y (containing 1 or -1); calculated as

sum((max.(0, 1 .- ŷ .* y)).^2) / size(y, 2)

Usually used with classifiers like Support Vector Machines. See also: hinge_loss

Example

julia> y_true = [1, -1, 1, 1];
 
 julia> y_pred = [0.1, 0.3, 1, 1.5];
 
@@ -182,13 +182,13 @@
 true
 
 julia> Flux.squared_hinge_loss(y_pred[2], y_true[2]) != 0
-true
source
Flux.Losses.dice_coeff_lossFunction
dice_coeff_loss(ŷ, y; smooth = 1)

Return a loss based on the dice coefficient. Used in the V-Net image segmentation architecture. The dice coefficient is similar to the F1_score. Loss calculated as:

1 - 2*sum(|ŷ .* y| + smooth) / (sum(ŷ.^2) + sum(y.^2) + smooth)

Example

julia> y_pred = [1.1, 2.1, 3.1];
+true
source
Flux.Losses.dice_coeff_lossFunction
dice_coeff_loss(ŷ, y; smooth = 1)

Return a loss based on the dice coefficient. Used in the V-Net image segmentation architecture. The dice coefficient is similar to the F1_score. Loss calculated as:

1 - 2*sum(|ŷ .* y| + smooth) / (sum(ŷ.^2) + sum(y.^2) + smooth)

Example

julia> y_pred = [1.1, 2.1, 3.1];
 
 julia> Flux.dice_coeff_loss(y_pred, 1:3)
 0.000992391663909964
 
 julia> 1 - Flux.dice_coeff_loss(y_pred, 1:3)  # ~ F1 score for image segmentation
-0.99900760833609
source
Flux.Losses.tversky_lossFunction
tversky_loss(ŷ, y; beta = 0.7)

Return the Tversky loss. Used with imbalanced data to give more weight to false negatives. Larger β == beta weigh recall more than precision (by placing more emphasis on false negatives). Calculated as:

1 - sum(|y .* ŷ| + 1) / (sum(y .* ŷ + (1 - β)*(1 .- y) .* ŷ + β*y .* (1 .- ŷ)) + 1)
source
Flux.Losses.binary_focal_lossFunction
binary_focal_loss(ŷ, y; agg=mean, gamma=2, eps=eps(eltype(ŷ)))

Return the binaryfocalloss The input, 'ŷ', is expected to be normalized (i.e. softmax output).

For gamma = 0, the loss is mathematically equivalent to Losses.binarycrossentropy.

See also: Losses.focal_loss for multi-class setting

Example

julia> y = [0  1  0
+0.99900760833609
source
Flux.Losses.tversky_lossFunction
tversky_loss(ŷ, y; beta = 0.7)

Return the Tversky loss. Used with imbalanced data to give more weight to false negatives. Larger β == beta weigh recall more than precision (by placing more emphasis on false negatives). Calculated as:

1 - sum(|y .* ŷ| + 1) / (sum(y .* ŷ + (1 - β)*(1 .- y) .* ŷ + β*y .* (1 .- ŷ)) + 1)
source
Flux.Losses.binary_focal_lossFunction
binary_focal_loss(ŷ, y; agg=mean, gamma=2, eps=eps(eltype(ŷ)))

Return the binaryfocalloss The input, 'ŷ', is expected to be normalized (i.e. softmax output).

For gamma = 0, the loss is mathematically equivalent to Losses.binarycrossentropy.

See also: Losses.focal_loss for multi-class setting

Example

julia> y = [0  1  0
             1  0  1]
 2×3 Matrix{Int64}:
  0  1  0
@@ -201,7 +201,7 @@
  0.731059  0.5  0.731059
 
 julia> Flux.binary_focal_loss(ŷ, y) ≈ 0.0728675615927385
-true
source
Flux.Losses.focal_lossFunction
focal_loss(ŷ, y; dims=1, agg=mean, gamma=2, eps=eps(eltype(ŷ)))

Return the focal_loss which can be used in classification tasks with highly imbalanced classes. It down-weights well-classified examples and focuses on hard examples. The input, 'ŷ', is expected to be normalized (i.e. softmax output).

The modulating factor, γ == gamma, controls the down-weighting strength. For γ == 0, the loss is mathematically equivalent to Losses.crossentropy.

Example

julia> y = [1  0  0  0  1
+true
source
Flux.Losses.focal_lossFunction
focal_loss(ŷ, y; dims=1, agg=mean, gamma=2, eps=eps(eltype(ŷ)))

Return the focal_loss which can be used in classification tasks with highly imbalanced classes. It down-weights well-classified examples and focuses on hard examples. The input, 'ŷ', is expected to be normalized (i.e. softmax output).

The modulating factor, γ == gamma, controls the down-weighting strength. For γ == 0, the loss is mathematically equivalent to Losses.crossentropy.

Example

julia> y = [1  0  0  0  1
             0  1  0  1  0
             0  0  1  0  0]
 3×5 Matrix{Int64}:
@@ -216,10 +216,10 @@
  0.665241   0.665241   0.665241   0.665241   0.665241
 
 julia> Flux.focal_loss(ŷ, y) ≈ 1.1277571935622628
-true

See also: Losses.binary_focal_loss for binary (not one-hot) labels

source
Flux.Losses.siamese_contrastive_lossFunction
siamese_contrastive_loss(ŷ, y; margin = 1, agg = mean)

Return the contrastive loss which can be useful for training Siamese Networks. It is given by

agg(@. (1 - y) * ŷ^2 + y * max(0, margin - ŷ)^2)

Specify margin to set the baseline for distance at which pairs are dissimilar.

Example

julia> ŷ = [0.5, 1.5, 2.5];
+true

See also: Losses.binary_focal_loss for binary (not one-hot) labels

source
Flux.Losses.siamese_contrastive_lossFunction
siamese_contrastive_loss(ŷ, y; margin = 1, agg = mean)

Return the contrastive loss which can be useful for training Siamese Networks. It is given by

agg(@. (1 - y) * ŷ^2 + y * max(0, margin - ŷ)^2)

Specify margin to set the baseline for distance at which pairs are dissimilar.

Example

julia> ŷ = [0.5, 1.5, 2.5];
 
 julia> Flux.siamese_contrastive_loss(ŷ, 1:3)
 -4.833333333333333
 
 julia> Flux.siamese_contrastive_loss(ŷ, 1:3, margin = 2)
--4.0
source
+-4.0source diff --git a/dev/reference/models/nnlib/index.html b/dev/reference/models/nnlib/index.html index 750dc83275..695b9daf0a 100644 --- a/dev/reference/models/nnlib/index.html +++ b/dev/reference/models/nnlib/index.html @@ -4,7 +4,7 @@ gtag('js', new Date()); gtag('config', 'UA-36890222-9', {'page_path': location.pathname + location.search + location.hash});

Neural Network primitives from NNlib.jl

Flux re-exports all of the functions exported by the NNlib package. This includes activation functions, described on their own page. Many of the functions on this page exist primarily as the internal implementation of Flux layer, but can also be used independently.

Attention

Primitives for the MultiHeadAttention layer.

NNlib.dot_product_attentionFunction
dot_product_attention(query, key, value, [bias]; [fdrop, mask, nheads])

Multihead dot product attention used in transformer architectures.

The input arrays must have the first two dimensions given by the number of features and the sequence length, then an arbitrary number of batch dimensions or none.

Returns the attention output array of size (v_dim, q_len, batch_size...) and the attention scores of size (kv_len, q_len, nheads, batch_size...).

See also dot_product_attention_scores if you only need the attention scores.

Arguments

  • query: Query array of size (qk_dim, q_len, batch_size...).
  • key: Key array of size (qk_dim, kv_len, batch_size...).
  • value: Value array of size (v_dim, kv_len, batch_size...).
  • bias: Either nothing or an array broadcastable to size (kv_len, q_len, nheads, batch_size). It will be added to the attention scores before applying the softmax. Default nothing.
  • fdrop: A dropout function or layer to be applied on the attention scores right after the softmax. Default identity (no dropout).
  • mask: Either nothing or a boolean array broadcastable to size (kv_len, q_len, nheads, batch_size). The mask is applied to the attention scores just before the softmax. See make_causal_mask fore creating causal masks. Default nothing.
  • nheads: Number of heads to split the input arrays into. Default 1.

Examples

q, k, v = rand(10, 20, 2), rand(10, 30, 2), rand(20, 30, 2)
-y, α = dot_product_attention(q, k, v)
source
NNlib.make_causal_maskFunction
make_causal_mask(x, dims=2)

Return a boolean square matrix m of the same type as x and of side size(x, dims). Its elements are set such that m[i, j] == i ≤ j.

Can be used to mask the attention scores in dot_product_attention.

source

Softmax

Flux's Flux.logitcrossentropy uses NNlib.logsoftmax internally.

NNlib.softmaxFunction
softmax(x; dims = 1)

Softmax turns input array x into probability distributions that sum to 1 along the dimensions specified by dims. It is semantically equivalent to the following:

softmax(x; dims = 1) = exp.(x) ./ sum(exp.(x), dims = dims)

with additional manipulations enhancing numerical stability.

For a matrix input x it will by default (dims = 1) treat it as a batch of vectors, with each column independent. Keyword dims = 2 will instead treat rows independently, and so on.

See also logsoftmax.

Examples

julia> softmax([1, 2, 3])
+y, α = dot_product_attention(q, k, v)
source
NNlib.make_causal_maskFunction
make_causal_mask(x, dims=2)

Return a boolean square matrix m of the same type as x and of side size(x, dims). Its elements are set such that m[i, j] == i ≤ j.

Can be used to mask the attention scores in dot_product_attention.

source

Softmax

Flux's Flux.logitcrossentropy uses NNlib.logsoftmax internally.

NNlib.softmaxFunction
softmax(x; dims = 1)

Softmax turns input array x into probability distributions that sum to 1 along the dimensions specified by dims. It is semantically equivalent to the following:

softmax(x; dims = 1) = exp.(x) ./ sum(exp.(x), dims = dims)

with additional manipulations enhancing numerical stability.

For a matrix input x it will by default (dims = 1) treat it as a batch of vectors, with each column independent. Keyword dims = 2 will instead treat rows independently, and so on.

See also logsoftmax.

Examples

julia> softmax([1, 2, 3])
 3-element Vector{Float64}:
  0.09003057317038046
  0.24472847105479764
@@ -28,8 +28,8 @@
 (7, 13)
 
 julia> Dense(4 => 7, softmax)(x)
-ERROR: `softmax(x)` called with a number, but it expects an array. 
source
NNlib.logsoftmaxFunction
logsoftmax(x; dims = 1)

Computes the log of softmax in a more numerically stable way than directly taking log.(softmax(xs)). Commonly used in computing cross entropy loss.

It is semantically equivalent to the following:

logsoftmax(x; dims = 1) = x .- log.(sum(exp.(x), dims = dims))

See also softmax.

source

Pooling

Flux's AdaptiveMaxPool, AdaptiveMeanPool, GlobalMaxPool, GlobalMeanPool, MaxPool, and MeanPool use NNlib.PoolDims, NNlib.maxpool, and NNlib.meanpool as their backend.

NNlib.PoolDimsType
PoolDims(x_size::NTuple{M}, k::Union{NTuple{L, Int}, Int};
-        stride=k, padding=0, dilation=1)  where {M, L}

Dimensions for a "pooling" operation that can have an arbitrary input size, kernel size, stride, dilation, and channel count. Used to dispatch onto efficient implementations at compile-time.

source
NNlib.lpnormpoolFunction
lpnormpool(x, p::Real, k::NTuple{N, Integer}; pad=0, stride=k)

Perform Lp pool operation with value of the Lp norm p and window size k on input tensor x, also known as LPPool in pytorch. This pooling operator from Learned-Norm Pooling for Deep Feedforward and Recurrent Neural Networks.

Arguments:

  • x and k: Expects ndim(x) ∈ 3:5, and alwayslength(k) == ndim(x) - 2`
  • p is restricted to 0 < p < Inf.
  • pad: See pad_zeros for details.
  • stride: Either a tuple with the same length as k, or one integer for all directions. Default is k.

For all elements x in a size k window, lpnormpool computes (∑ᵢ xᵢ^p)^(1 / p) as an element of the output.

Thus lpnormpool(x, 1, k) ./ prod(k) ≈ meanpool(x, k) and lpnormpool(x, 2, k).^2 ./ prod(k) ≈ meanpool(x.^2, k).

source
NNlib.maxpoolFunction
maxpool(x, k::NTuple{N, Integer}; pad=0, stride=k)

Perform max pool operation with window size k on input tensor x.

Arguments:

  • x and k: Expects ndim(x) ∈ 3:5, and always length(k) == ndim(x) - 2
  • pad: See pad_zeros for details.
  • stride: Either a tuple with the same length as k, or one integer for all directions. Default is k.
source
NNlib.meanpoolFunction
meanpool(x, k::NTuple{N, Integer}; pad=0, stride=k)

Perform mean pool operation with window size k on input tensor x.

Arguments:

  • x and k: Expects ndim(x) ∈ 3:5, and alwayslength(k) == ndim(x) - 2`
  • pad: See pad_zeros for details.
  • stride: Either a tuple with the same length as k, or one integer for all directions. Default is k.
source

Padding

NNlib.pad_circularFunction
pad_circular(x, pad::Tuple; [dims])
+ERROR: `softmax(x)` called with a number, but it expects an array. 
source
NNlib.logsoftmaxFunction
logsoftmax(x; dims = 1)

Computes the log of softmax in a more numerically stable way than directly taking log.(softmax(xs)). Commonly used in computing cross entropy loss.

It is semantically equivalent to the following:

logsoftmax(x; dims = 1) = x .- log.(sum(exp.(x), dims = dims))

See also softmax.

source

Pooling

Flux's AdaptiveMaxPool, AdaptiveMeanPool, GlobalMaxPool, GlobalMeanPool, MaxPool, and MeanPool use NNlib.PoolDims, NNlib.maxpool, and NNlib.meanpool as their backend.

NNlib.PoolDimsType
PoolDims(x_size::NTuple{M}, k::Union{NTuple{L, Int}, Int};
+        stride=k, padding=0, dilation=1)  where {M, L}

Dimensions for a "pooling" operation that can have an arbitrary input size, kernel size, stride, dilation, and channel count. Used to dispatch onto efficient implementations at compile-time.

source
NNlib.lpnormpoolFunction
lpnormpool(x, p::Real, k::NTuple{N, Integer}; pad=0, stride=k)

Perform Lp pool operation with value of the Lp norm p and window size k on input tensor x, also known as LPPool in pytorch. This pooling operator from Learned-Norm Pooling for Deep Feedforward and Recurrent Neural Networks.

Arguments:

  • x and k: Expects ndim(x) ∈ 3:5, and alwayslength(k) == ndim(x) - 2`
  • p is restricted to 0 < p < Inf.
  • pad: See pad_zeros for details.
  • stride: Either a tuple with the same length as k, or one integer for all directions. Default is k.

For all elements x in a size k window, lpnormpool computes (∑ᵢ xᵢ^p)^(1 / p) as an element of the output.

Thus lpnormpool(x, 1, k) ./ prod(k) ≈ meanpool(x, k) and lpnormpool(x, 2, k).^2 ./ prod(k) ≈ meanpool(x.^2, k).

source
NNlib.maxpoolFunction
maxpool(x, k::NTuple{N, Integer}; pad=0, stride=k)

Perform max pool operation with window size k on input tensor x.

Arguments:

  • x and k: Expects ndim(x) ∈ 3:5, and always length(k) == ndim(x) - 2
  • pad: See pad_zeros for details.
  • stride: Either a tuple with the same length as k, or one integer for all directions. Default is k.
source
NNlib.meanpoolFunction
meanpool(x, k::NTuple{N, Integer}; pad=0, stride=k)

Perform mean pool operation with window size k on input tensor x.

Arguments:

  • x and k: Expects ndim(x) ∈ 3:5, and alwayslength(k) == ndim(x) - 2`
  • pad: See pad_zeros for details.
  • stride: Either a tuple with the same length as k, or one integer for all directions. Default is k.
source

Padding

NNlib.pad_circularFunction
pad_circular(x, pad::Tuple; [dims])
 pad_circular(x, pad::Int; [dims])

Pad the array x "circularly" across the border by wrapping around values from the opposite side of x.

pad can a tuple of integers (l1, r1, ..., ln, rn) of some length 2n that specifies the left and right padding size for each of the dimensions in dims. If dims is not given, it defaults to the first n dimensions.

If pad is an integer, it is applied on both sides on every dimension in dims. In this case, dims defaults to the first ndims(x)-2 dimensions (i.e. excludes the channel and batch dimension).

The pad length on either side in any dimension must not exceed the size of x in that dimension, i.e. pad_circular is not able to create abitrary sized tilings of x.

See also pad_repeat, pad_reflect, pad_symmetric, and pad_constant.

julia> r = reshape(1:9, 3, 3)
 3×3 reshape(::UnitRange{Int64}, 3, 3) with eltype Int64:
  1  4  7
@@ -43,7 +43,7 @@
  8  2  5  8  2  5
  9  3  6  9  3  6
  7  1  4  7  1  4
- 8  2  5  8  2  5
source
NNlib.pad_constantFunction
pad_constant(x, pad::Tuple, val = 0; [dims = :])
 pad_constant(x, pad::Int, val = 0; [dims = :])

Pad the array x with the constant value val.

pad can be a tuple of integers. If it is of some length 2 * length(dims) that specifies the left and right padding size for each of the dimensions in dims as (l1, r1, ..., ln, rn). If supplied with a tuple of length length(dims) instead, it applies symmetric padding. If dims is not given, it defaults to all dimensions.

For integer pad input, it is applied on both sides on every dimension in dims.

See also pad_zeros, pad_repeat, pad_reflect, pad_symmetric, and pad_circular.

julia> r = reshape(1:4, 2, 2)
 2×2 reshape(::UnitRange{Int64}, 2, 2) with eltype Int64:
  1  3
@@ -110,7 +110,7 @@
 julia> pad_constant(r, (2,1, 3), dims = (1,2)) # padding must always be either the same length as dims, or double it
 ERROR: ArgumentError: Could not parse padding (2, 1, 3) and dims (1, 2)
 Stacktrace:
-[...]
source
NNlib.pad_reflectFunction
pad_reflect(x, pad::Tuple; [dims])
 pad_reflect(x, pad::Int; [dims])

Pad the array x reflecting its values across the border.

pad can a tuple of integers (l1, r1, ..., ln, rn) of some length 2n that specifies the left and right padding size for each of the dimensions in dims. If dims is not given, it defaults to the first n dimensions.

If pad is an integer, it is applied on both sides on every dimension in dims. In this case, dims defaults to the first ndims(x)-2 dimensions (i.e. excludes the channel and batch dimension).

See also pad_repeat, pad_symmetric, pad_circular, and pad_constant.

julia> r = reshape(1:9, 3, 3)
 3×3 reshape(::UnitRange{Int64}, 3, 3) with eltype Int64:
  1  4  7
@@ -124,7 +124,7 @@
  5  2  5  8  5  2
  6  3  6  9  6  3
  5  2  5  8  5  2
- 4  1  4  7  4  1
source
NNlib.pad_repeatFunction
pad_repeat(x, pad::Tuple; [dims])
 pad_repeat(x, pad::Int; [dims])

Pad the array x repeating the values on the border.

pad can a tuple of integers (l1, r1, ..., ln, rn) of some length 2n that specifies the left and right padding size for each of the dimensions in dims. If dims is not given, it defaults to the first n dimensions.

If pad is an integer, it is applied on both sides on every dimension in dims. In this case, dims defaults to the first ndims(x)-2 dimensions (i.e. excludes the channel and batch dimension).

See also pad_reflect, pad_symmetric, pad_circular, and pad_constant.

julia> r = reshape(1:9, 3, 3)
 3×3 reshape(::UnitRange{Int64}, 3, 3) with eltype Int64:
  1  4  7
@@ -138,7 +138,7 @@
  2  2  2  2  5  8  8  8  8  8
  3  3  3  3  6  9  9  9  9  9
  3  3  3  3  6  9  9  9  9  9
- 3  3  3  3  6  9  9  9  9  9
source
NNlib.pad_symmetricFunction
pad_symmetric(x, pad::Tuple; [dims])
 pad_symmetric(x, pad::Int; [dims])

Pad the array x reflecting its values symmetrically across the border, i.e. the border values of x are present in the padding values, in contrast to pad_reflect.

pad can a tuple of integers (l1, r1, ..., ln, rn) of some length 2n that specifies the left and right padding size for each of the dimensions in dims. If dims is not given, it defaults to the first n dimensions.

If pad is an integer, it is applied on both sides on every dimension in dims. In this case, dims defaults to the first ndims(x)-2 dimensions (i.e. excludes the channel and batch dimension).

See also pad_repeat, pad_reflect, pad_circular, and pad_constant.

julia> r = reshape(1:9, 3, 3)
 3×3 reshape(::UnitRange{Int64}, 3, 3) with eltype Int64:
  1  4  7
@@ -152,8 +152,8 @@
  2  2  5  8  8  5
  3  3  6  9  9  6
  3  3  6  9  9  6
- 2  2  5  8  8  5
source
NNlib.pad_zerosFunction
pad_zeros(x, pad::Tuple; [dims])
-pad_zeros(x, pad::Int; [dims])

Pad the array x with zeros. Equivalent to pad_constant with the constant equal to 0.

source

Convolution

Flux's Conv and CrossCor layers use NNlib.DenseConvDims and NNlib.conv internally.

NNlib.convFunction
conv(x, w; stride = 1, pad = 0, dilation = 1, flipped = false, groups = 1)

Apply convolution filter w to input x. x and w are 3d/4d/5d tensors in 1d/2d/3d convolutions respectively. x and w may have real or complex element types.

source
NNlib.ConvDimsType
ConvDims

Type system-level information about convolution dimensions. Critical for things like im2col!() to generate efficient code, and helpful to reduce the number of kwargs getting passed around.

source
NNlib.depthwiseconvFunction
depthwiseconv(x, w; stride=1, pad=0, dilation=1, flipped=false)

Depthwise convolution operation with filter w on input x. x and w are 3d/4d/5d tensors in 1d/2d/3d convolutions respectively.

source
NNlib.DepthwiseConvDimsType
DepthwiseConvDims

Concrete subclass of ConvDims for a depthwise convolution. Differs primarily due to characterization by Cin, Cmult, rather than Cin, Cout. Useful to be separate from DenseConvDims primarily for channel calculation differences.

source

Dropout

NNlib.dropoutFunction
dropout([rng], A, p; [dims])

Returns an array in which each element of A is either replaced with zero, with probability p, or else multiplied by 1/(1-p).

By default every element is treated independently. With keyword dims=1, a choice is made for every value of the 1st index i.e. each row of a matrix is either zero or not.

Optional first argument is the random number generator used.

Examples

julia> dropout(ones(2, 10), 0.2)
+ 2  2  5  8  8  5
source
NNlib.pad_zerosFunction
pad_zeros(x, pad::Tuple; [dims])
+pad_zeros(x, pad::Int; [dims])

Pad the array x with zeros. Equivalent to pad_constant with the constant equal to 0.

source

Convolution

Flux's Conv and CrossCor layers use NNlib.DenseConvDims and NNlib.conv internally.

NNlib.convFunction
conv(x, w; stride = 1, pad = 0, dilation = 1, flipped = false, groups = 1)

Apply convolution filter w to input x. x and w are 3d/4d/5d tensors in 1d/2d/3d convolutions respectively. x and w may have real or complex element types.

source
NNlib.ConvDimsType
ConvDims

Type system-level information about convolution dimensions. Critical for things like im2col!() to generate efficient code, and helpful to reduce the number of kwargs getting passed around.

source
NNlib.depthwiseconvFunction
depthwiseconv(x, w; stride=1, pad=0, dilation=1, flipped=false)

Depthwise convolution operation with filter w on input x. x and w are 3d/4d/5d tensors in 1d/2d/3d convolutions respectively.

source
NNlib.DepthwiseConvDimsType
DepthwiseConvDims

Concrete subclass of ConvDims for a depthwise convolution. Differs primarily due to characterization by Cin, Cmult, rather than Cin, Cout. Useful to be separate from DenseConvDims primarily for channel calculation differences.

source

Dropout

NNlib.dropoutFunction
dropout([rng], A, p; [dims])

Returns an array in which each element of A is either replaced with zero, with probability p, or else multiplied by 1/(1-p).

By default every element is treated independently. With keyword dims=1, a choice is made for every value of the 1st index i.e. each row of a matrix is either zero or not.

Optional first argument is the random number generator used.

Examples

julia> dropout(ones(2, 10), 0.2)
 2×10 Matrix{Float64}:
  1.25  1.25  0.0   1.25  1.25  1.25  1.25  1.25  1.25  1.25
  1.25  1.25  1.25  0.0   1.25  1.25  0.0   1.25  1.25  1.25
@@ -172,7 +172,7 @@
 
 julia> mean(dropout(ones(10^4, 5), 0.3, dims=1), dims=1)
 1×5 Matrix{Float64}:
- 1.00571  1.00571  1.00571  1.00571  1.00571
source
NNlib.dropout!Function
dropout!(B, A, p; [dims])

This does exactly B .= dropout(A, p; dims), or rather, it's the implementation of out-of-place dropout.

source

Upsampling

Flux's Upsample layer uses NNlib.upsample_nearest, NNlib.upsample_bilinear, and NNlib.upsample_trilinear as its backend. Additionally, Flux's PixelShuffle layer uses NNlib.pixel_shuffle as its backend.

NNlib.dropout!Function
dropout!(B, A, p; [dims])

This does exactly B .= dropout(A, p; dims), or rather, it's the implementation of out-of-place dropout.

source

Upsampling

Flux's Upsample layer uses NNlib.upsample_nearest, NNlib.upsample_bilinear, and NNlib.upsample_trilinear as its backend. Additionally, Flux's PixelShuffle layer uses NNlib.pixel_shuffle as its backend.

NNlib.upsample_nearestFunction
upsample_nearest(x, scale::NTuple{S,Int})
 upsample_nearest(x; size::NTuple{S,Int})

Upsamples the array x by integer multiples along the first S dimensions. Subsequent dimensions of x are not altered.

Either the scale factors or the final output size can be specified.

See also upsample_bilinear, for two dimensions of an N=4 array.

Example

julia> upsample_nearest([1 2 3; 4 5 6], (2, 3))
 4×9 Matrix{Int64}:
  1  1  1  2  2  2  3  3  3
@@ -191,8 +191,8 @@
  4  5  6
 
 julia> ans == upsample_nearest([1 2 3; 4 5 6], size=(4,))
-true
source
NNlib.upsample_linearFunction
upsample_linear(x::AbstractArray{T,3}, scale::Real; align_corners::Bool = true)
-upsample_linear(x::AbstractArray{T,3}; size::Integer, align_corners::Bool = true)

Upsamples the first dimension of the array x by the upsample provided scale, using linear interpolation. As an alternative to using scale, the resulting array size can be directly specified with a keyword argument.

The size of the output is equal to (scale*S1, S2, S3), where S1, S2, S3 = size(x).

source
NNlib.∇upsample_linearFunction
∇upsample_linear(Δ::AbstractArray{T,3}; size::Integer, align_corners::Bool = true) where T

Arguments

  • Δ: Incoming gradient array, backpropagated from downstream layers
  • size: Size of the image upsampled in the first place

Outputs

  • dx: Downsampled version of Δ
source
NNlib.upsample_linearFunction
upsample_linear(x::AbstractArray{T,3}, scale::Real; align_corners::Bool = true)
+upsample_linear(x::AbstractArray{T,3}; size::Integer, align_corners::Bool = true)

Upsamples the first dimension of the array x by the upsample provided scale, using linear interpolation. As an alternative to using scale, the resulting array size can be directly specified with a keyword argument.

The size of the output is equal to (scale*S1, S2, S3), where S1, S2, S3 = size(x).

source
NNlib.∇upsample_linearFunction
∇upsample_linear(Δ::AbstractArray{T,3}; size::Integer, align_corners::Bool = true) where T

Arguments

  • Δ: Incoming gradient array, backpropagated from downstream layers
  • size: Size of the image upsampled in the first place

Outputs

  • dx: Downsampled version of Δ
source
NNlib.upsample_bilinearFunction
upsample_bilinear(x::AbstractArray{T,4}, scale::NTuple{2,Real}; align_corners::Bool = true)
 upsample_bilinear(x::AbstractArray{T,4}; size::NTuple{2,Integer}, align_corners::Bool = true)

Upsamples the first 2 dimensions of the array x by the upsample factors stored in scale, using bilinear interpolation. As an alternative to using scale, the resulting image size can be directly specified with a keyword argument.

The size of the output is equal to (scale[1]*S1, scale[2]*S2, S3, S4), where S1, S2, S3, S4 = size(x).

Examples

julia> x = reshape(Float32[1 2 3; 4 5 6], (2,3,1,1))
 2×3×1×1 Array{Float32, 4}:
 [:, :, 1, 1] =
@@ -217,10 +217,10 @@
  1.75  1.97222  2.19444  2.41667  2.63889     3.08333  3.30556  3.52778  3.75
  2.5   2.72222  2.94444  3.16667  3.38889     3.83333  4.05556  4.27778  4.5
  3.25  3.47222  3.69444  3.91667  4.13889     4.58333  4.80556  5.02778  5.25
- 4.0   4.22222  4.44444  4.66667  4.88889     5.33333  5.55556  5.77778  6.0
source
NNlib.∇upsample_bilinearFunction
∇upsample_bilinear(Δ::AbstractArray{T,4}; size::NTuple{2,Integer}, align_corners::Bool = true) where T

Arguments

  • Δ: Incoming gradient array, backpropagated from downstream layers
  • size: Lateral (W,H) size of the image upsampled in the first place

Outputs

  • dx: Downsampled version of Δ
source
NNlib.upsample_trilinearFunction
upsample_trilinear(x::AbstractArray{T,5}, scale::NTuple{3,Real}; align_corners::Bool = true)
+ 4.0   4.22222  4.44444  4.66667  4.88889     5.33333  5.55556  5.77778  6.0
source
NNlib.∇upsample_bilinearFunction
∇upsample_bilinear(Δ::AbstractArray{T,4}; size::NTuple{2,Integer}, align_corners::Bool = true) where T

Arguments

  • Δ: Incoming gradient array, backpropagated from downstream layers
  • size: Lateral (W,H) size of the image upsampled in the first place

Outputs

  • dx: Downsampled version of Δ
source
NNlib.upsample_trilinearFunction
upsample_trilinear(x::AbstractArray{T,5}, scale::NTuple{3,Real}; align_corners::Bool = true)
 upsample_trilinear(x::AbstractArray{T,5}; size::NTuple{3,Integer}, align_corners::Bool = true)

Upsamples the first 3 dimensions of the array x by the upsample factors stored in scale, using trilinear interpolation. As an alternative to using scale, the resulting image size can be directly specified with a keyword argument.

The size of the output is equal to (scale[1]*S1, scale[2]*S2, scale[3]*S3, S4, S5), where S1, S2, S3, S4, S5 = size(x).

Examples

upsample_trilinear(x, (2, 3, 4))
 upsample_trilinear(x; size=(4, 9, 11))  # specify ouput size instead
-upsample_trilinear(x, (2.5, 3.5, pi))  # non-integer scaling factors are allowed
source
NNlib.∇upsample_trilinearFunction
∇upsample_trilinear(Δ::AbstractArray{T,5}; size::NTuple{3,Integer}, align_corners::Bool = true) where T

Arguments

  • Δ: Incoming gradient array, backpropagated from downstream layers
  • size: Lateral size & depth (W,H,D) of the image upsampled in the first place

Outputs

  • dx: Downsampled version of Δ
source
NNlib.pixel_shuffleFunction
pixel_shuffle(x, r::Integer)

Pixel shuffling operation, upscaling by a factor r.

For 4-arrays representing N images, the operation converts input size(x) == (W, H, r^2*C, N) to output of size (r*W, r*H, C, N). For D-dimensional data, it expects ndims(x) == D+2 with channel and batch dimensions, and divides the number of channels by r^D.

Used in super-resolution networks to upsample towards high resolution features. Reference: Shi et. al., "Real-Time Single Image and Video Super-Resolution ...", CVPR 2016, https://arxiv.org/abs/1609.05158

Examples

julia> x = [10i + j + channel/10 for i in 1:2, j in 1:3, channel in 1:4, batch in 1:1]
+upsample_trilinear(x, (2.5, 3.5, pi))  # non-integer scaling factors are allowed
source
NNlib.∇upsample_trilinearFunction
∇upsample_trilinear(Δ::AbstractArray{T,5}; size::NTuple{3,Integer}, align_corners::Bool = true) where T

Arguments

  • Δ: Incoming gradient array, backpropagated from downstream layers
  • size: Lateral size & depth (W,H,D) of the image upsampled in the first place

Outputs

  • dx: Downsampled version of Δ
source
NNlib.pixel_shuffleFunction
pixel_shuffle(x, r::Integer)

Pixel shuffling operation, upscaling by a factor r.

For 4-arrays representing N images, the operation converts input size(x) == (W, H, r^2*C, N) to output of size (r*W, r*H, C, N). For D-dimensional data, it expects ndims(x) == D+2 with channel and batch dimensions, and divides the number of channels by r^D.

Used in super-resolution networks to upsample towards high resolution features. Reference: Shi et. al., "Real-Time Single Image and Video Super-Resolution ...", CVPR 2016, https://arxiv.org/abs/1609.05158

Examples

julia> x = [10i + j + channel/10 for i in 1:2, j in 1:3, channel in 1:4, batch in 1:1]
 2×3×4×1 Array{Float64, 4}:
 [:, :, 1, 1] =
  11.1  12.1  13.1
@@ -261,7 +261,7 @@
  2.1  2.3  2.5
  2.2  2.4  2.6
  3.1  3.3  3.5
- 3.2  3.4  3.6
source

Batched Operations

Flux's Flux.Bilinear layer uses NNlib.batched_mul internally.

Batched Operations

Flux's Flux.Bilinear layer uses NNlib.batched_mul internally.

NNlib.batched_mulFunction
batched_mul(A, B) -> C
 A ⊠ B  # \boxtimes

Batched matrix multiplication. Result has C[:,:,k...] == A[:,:,k...] * B[:,:,k...] where k... represent any indices in the last dimensions.

If ndims(A) == ndims(B) == 3 and size(B,3) == 1 then instead C[:,:,k] == A[:,:,k] * B[:,:,1], and similarly for A.

To transpose each matrix, apply batched_transpose to the array, or batched_adjoint for conjugate-transpose:

julia> A, B = randn(2,5,17), randn(5,9,17);
 
 julia> A ⊠ B |> size
@@ -277,7 +277,7 @@
 (2, 9, 17)
 
 julia> batched_transpose(A) == PermutedDimsArray(A, (2,1,3))
-true

The equivalent PermutedDimsArray may be used in place of batched_transpose. Other permutations are also handled by BLAS, provided that the batch index k is not the first dimension of the underlying array. Thus PermutedDimsArray(::Array, (1,3,2)) and PermutedDimsArray(::Array, (3,1,2)) are fine.

However, A = PermutedDimsArray(::Array, (3,2,1)) is not acceptable to BLAS, since the batch dimension is the contiguous one: stride(A,3) == 1. This will be copied, as doing so is faster than batched_mul_generic!.

Both this copy and batched_mul_generic! produce @debug messages, and setting for instance ENV["JULIA_DEBUG"] = NNlib will display them.

source
batched_mul(A::Array{T,3}, B::Matrix)
+true

The equivalent PermutedDimsArray may be used in place of batched_transpose. Other permutations are also handled by BLAS, provided that the batch index k is not the first dimension of the underlying array. Thus PermutedDimsArray(::Array, (1,3,2)) and PermutedDimsArray(::Array, (3,1,2)) are fine.

However, A = PermutedDimsArray(::Array, (3,2,1)) is not acceptable to BLAS, since the batch dimension is the contiguous one: stride(A,3) == 1. This will be copied, as doing so is faster than batched_mul_generic!.

Both this copy and batched_mul_generic! produce @debug messages, and setting for instance ENV["JULIA_DEBUG"] = NNlib will display them.

source
batched_mul(A::Array{T,3}, B::Matrix)
 batched_mul(A::Matrix, B::Array{T,3})
 A ⊠ B

This is always matrix-matrix multiplication, but either A or B may lack a batch index.

  • When B is a matrix, result has C[:,:,k] == A[:,:,k] * B[:,:] for all k.

  • When A is a matrix, then C[:,:,k] == A[:,:] * B[:,:,k]. This can also be done by reshaping and calling *, for instance A ⊡ B using TensorCore.jl, but is implemented here using batched_gemm instead of gemm.

julia> randn(16,8,32) ⊠ randn(8,4) |> size
 (16, 4, 32)
@@ -286,19 +286,19 @@
 (16, 4, 32)
 
 julia> randn(16,8) ⊠ randn(8,4,32) |> size
-(16, 4, 32)

See also batched_vec to regard B as a batch of vectors, A[:,:,k] * B[:,k].

source
NNlib.batched_mul!Function
batched_mul!(C, A, B) -> C
-batched_mul!(C, A, B, α=1, β=0)

In-place batched matrix multiplication, equivalent to mul!(C[:,:,k], A[:,:,k], B[:,:,k], α, β) for all k. If size(B,3) == 1 then every batch uses B[:,:,1] instead.

This will call batched_gemm! whenever possible. For real arrays this means that, for X ∈ [A,B,C], either stride(X,1)==1 or stride(X,2)==1, the latter may be caused by batched_transpose or by for instance PermutedDimsArray(::Array, (3,1,2)). Unlike batched_mul this will never make a copy.

For complex arrays, the wrapper made by batched_adjoint must be outermost to be seen. In this case the strided accepted by BLAS are more restricted, if stride(C,1)==1 then only stride(AorB::BatchedAdjoint,2) == 1 is accepted.

source
NNlib.batched_adjointFunction
batched_transpose(A::AbstractArray{T,3})
+(16, 4, 32)

See also batched_vec to regard B as a batch of vectors, A[:,:,k] * B[:,k].

source
NNlib.batched_mul!Function
batched_mul!(C, A, B) -> C
+batched_mul!(C, A, B, α=1, β=0)

In-place batched matrix multiplication, equivalent to mul!(C[:,:,k], A[:,:,k], B[:,:,k], α, β) for all k. If size(B,3) == 1 then every batch uses B[:,:,1] instead.

This will call batched_gemm! whenever possible. For real arrays this means that, for X ∈ [A,B,C], either stride(X,1)==1 or stride(X,2)==1, the latter may be caused by batched_transpose or by for instance PermutedDimsArray(::Array, (3,1,2)). Unlike batched_mul this will never make a copy.

For complex arrays, the wrapper made by batched_adjoint must be outermost to be seen. In this case the strided accepted by BLAS are more restricted, if stride(C,1)==1 then only stride(AorB::BatchedAdjoint,2) == 1 is accepted.

source
NNlib.batched_adjointFunction
batched_transpose(A::AbstractArray{T,3})
 batched_adjoint(A)

Equivalent to applying transpose or adjoint to each matrix A[:,:,k].

These exist to control how batched_mul behaves, as it operates on such matrix slices of an array with ndims(A)==3.

PermutedDimsArray(A, (2,1,3)) is equivalent to batched_transpose(A), and is also understood by batched_mul (and more widely supported elsewhere).

BatchedTranspose{T, S} <: AbstractBatchedMatrix{T, 3}
-BatchedAdjoint{T, S}

Lazy wrappers analogous to Transpose and Adjoint, returned by batched_transpose etc.

source
NNlib.batched_transposeFunction
batched_transpose(A::AbstractArray{T,3})
+BatchedAdjoint{T, S}

Lazy wrappers analogous to Transpose and Adjoint, returned by batched_transpose etc.

source
NNlib.batched_transposeFunction
batched_transpose(A::AbstractArray{T,3})
 batched_adjoint(A)

Equivalent to applying transpose or adjoint to each matrix A[:,:,k].

These exist to control how batched_mul behaves, as it operates on such matrix slices of an array with ndims(A)==3.

PermutedDimsArray(A, (2,1,3)) is equivalent to batched_transpose(A), and is also understood by batched_mul (and more widely supported elsewhere).

BatchedTranspose{T, S} <: AbstractBatchedMatrix{T, 3}
-BatchedAdjoint{T, S}

Lazy wrappers analogous to Transpose and Adjoint, returned by batched_transpose etc.

source
NNlib.batched_vecFunction
batched_vec(A::Array{T,3}, B::Matrix)
+BatchedAdjoint{T, S}

Lazy wrappers analogous to Transpose and Adjoint, returned by batched_transpose etc.

source
NNlib.batched_vecFunction
batched_vec(A::Array{T,3}, B::Matrix)
 batched_vec(A::Array{T,3}, b::Vector)

Batched matrix-vector multiplication: the result has C[:,:,k] == A[:,:,k] * B[:,k] for all k, or else C[:,:,k] == A[:,:,k] * b for b::Vector.

With the same argument types, batched_mul(A, B) would regard B as a fixed matrix, not a batch of vectors. Both reshape and then call batched_mul(::Array{T,3}, ::Array{T,3}).

julia> A, B, b = randn(16,8,32), randn(8,32), randn(8);
 
 julia> batched_vec(A,B) |> size
 (16, 32)
 
 julia> batched_vec(A,b) |> size
-(16, 32)
source

Gather and Scatter

Flux's Embedding layer uses NNlib.gather as its backend.

NNlib.gatherFunction
NNlib.gather(src, idx) -> dst

Reverse operation of scatter. Gathers data from source src and writes it in a destination dst according to the index array idx. For each k in CartesianIndices(idx), assign values to dst according to

dst[:, ... , k] .= src[:, ... , idx[k]...]

Notice that if idx is a vector containing integers and src is a matrix, previous expression simplifies to

dst[:, k] .= src[:, idx[k]]

and k will run over 1:length(idx).

The elements of idx can be integers or integer tuples and may be repeated. A single src column can end up being copied into zero, one, or multiple dst columns.

See gather! for an in-place version.

Examples

julia> NNlib.gather([1,20,300,4000], [2,4,2])
+(16, 32)
source

Gather and Scatter

Flux's Embedding layer uses NNlib.gather as its backend.

NNlib.gatherFunction
NNlib.gather(src, idx) -> dst

Reverse operation of scatter. Gathers data from source src and writes it in a destination dst according to the index array idx. For each k in CartesianIndices(idx), assign values to dst according to

dst[:, ... , k] .= src[:, ... , idx[k]...]

Notice that if idx is a vector containing integers and src is a matrix, previous expression simplifies to

dst[:, k] .= src[:, idx[k]]

and k will run over 1:length(idx).

The elements of idx can be integers or integer tuples and may be repeated. A single src column can end up being copied into zero, one, or multiple dst columns.

See gather! for an in-place version.

Examples

julia> NNlib.gather([1,20,300,4000], [2,4,2])
 3-element Vector{Int64}:
    20
  4000
@@ -307,7 +307,7 @@
 julia> NNlib.gather([1 2 3; 4 5 6], [1,3,1,3,1])
 2×5 Matrix{Int64}:
  1  3  1  3  1
- 4  6  4  6  4
source
gather(src, IJK...)

Convert the tuple of integer vectors IJK to a tuple of CartesianIndex and call gather on it: gather(src, CartesianIndex.(IJK...)).

Examples

julia> src = reshape([1:15;], 3, 5)
+ 4  6  4  6  4
source
gather(src, IJK...)

Convert the tuple of integer vectors IJK to a tuple of CartesianIndex and call gather on it: gather(src, CartesianIndex.(IJK...)).

Examples

julia> src = reshape([1:15;], 3, 5)
 3×5 Matrix{Int64}:
  1  4  7  10  13
  2  5  8  11  14
@@ -316,7 +316,7 @@
 julia> NNlib.gather(src, [1, 2], [2, 4])
 2-element Vector{Int64}:
   4
- 11
source
NNlib.gather!Function
NNlib.gather!(dst, src, idx)

Reverse operation of scatter!. Gathers data from source src and writes it in destination dst according to the index array idx. For each k in CartesianIndices(idx), assign values to dst according to

dst[:, ... , k] .= src[:, ... , idx[k]...]

Notice that if idx is a vector containing integers, and both dst and src are matrices, previous expression simplifies to

dst[:, k] .= src[:, idx[k]]

and k will run over 1:length(idx).

The elements of idx can be integers or integer tuples and may be repeated. A single src column can end up being copied into zero, one, or multiple dst columns.

See gather for an allocating version.

source
NNlib.scatterFunction
NNlib.scatter(op, src, idx; [init, dstsize])

Scatter operation allocating a destination array dst and calling scatter!(op, dst, src, idx) on it.

  • If keyword init is provided, it is used to initialize the content of dst. Otherwise, the init values is inferred from the reduction operator op for some common operators (e.g. init = 0 for op = +).

  • If dstsize is provided, it will be used to define the size of destination array, otherwise it will be inferred by src and idx.

See scatter! for full details on how idx works.

Examples

julia> NNlib.scatter(+, [10,100,1000], [3,1,2])
+ 11
source
NNlib.gather!Function
NNlib.gather!(dst, src, idx)

Reverse operation of scatter!. Gathers data from source src and writes it in destination dst according to the index array idx. For each k in CartesianIndices(idx), assign values to dst according to

dst[:, ... , k] .= src[:, ... , idx[k]...]

Notice that if idx is a vector containing integers, and both dst and src are matrices, previous expression simplifies to

dst[:, k] .= src[:, idx[k]]

and k will run over 1:length(idx).

The elements of idx can be integers or integer tuples and may be repeated. A single src column can end up being copied into zero, one, or multiple dst columns.

See gather for an allocating version.

source
NNlib.scatterFunction
NNlib.scatter(op, src, idx; [init, dstsize])

Scatter operation allocating a destination array dst and calling scatter!(op, dst, src, idx) on it.

  • If keyword init is provided, it is used to initialize the content of dst. Otherwise, the init values is inferred from the reduction operator op for some common operators (e.g. init = 0 for op = +).

  • If dstsize is provided, it will be used to define the size of destination array, otherwise it will be inferred by src and idx.

See scatter! for full details on how idx works.

Examples

julia> NNlib.scatter(+, [10,100,1000], [3,1,2])
 3-element Vector{Int64}:
   100
  1000
@@ -334,7 +334,7 @@
     10
   2000
     10
-    10
source
NNlib.scatter!Function
NNlib.scatter!(op, dst, src, idx)

Scatter operation, which writes data in src into dst at locations idx. A binary reduction operator op is applied during the scatter. For each index k in idx, accumulates values in dst according to

dst[:, ..., idx[k]...] = (op).(dst[:, ..., idx[k]...], src[:, ..., k...])

See also scatter, gather.

Arguments

  • op: Operations to be applied on dst and src, e.g. +, -, *, /, max, min and mean.
  • dst: The destination for src to aggregate to. This argument will be mutated.
  • src: The source data for aggregating.
  • idx: The mapping for aggregation from source (index) to destination (value). The idx array can contain either integers or tuples.

Examples

julia> NNlib.scatter!(+, ones(3), [10,100], [1,3])
+    10
source
NNlib.scatter!Function
NNlib.scatter!(op, dst, src, idx)

Scatter operation, which writes data in src into dst at locations idx. A binary reduction operator op is applied during the scatter. For each index k in idx, accumulates values in dst according to

dst[:, ..., idx[k]...] = (op).(dst[:, ..., idx[k]...], src[:, ..., k...])

See also scatter, gather.

Arguments

  • op: Operations to be applied on dst and src, e.g. +, -, *, /, max, min and mean.
  • dst: The destination for src to aggregate to. This argument will be mutated.
  • src: The source data for aggregating.
  • idx: The mapping for aggregation from source (index) to destination (value). The idx array can contain either integers or tuples.

Examples

julia> NNlib.scatter!(+, ones(3), [10,100], [1,3])
 3-element Vector{Float64}:
   11.0
    1.0
@@ -343,7 +343,7 @@
 julia> NNlib.scatter!(*, fill(0.5, 2, 4), [1 10; 100 1000], [3,2])
 2×4 Matrix{Float64}:
  0.5    5.0   0.5  0.5
- 0.5  500.0  50.0  0.5
source

Sampling

NNlib.grid_sampleFunction
grid_sample(input::AbstractArray{T, 4}, grid::AbstractArray{T, 4}; padding_mode = :zeros)

Given input, compute output by sampling input values at pixel locations from grid. Uses bilinear interpolation to calculate output values.

This implementation assumes the extrema (-1 and 1) are considered as referring to the center points of the input’s corner pixels (i.e. align corners is true).

Arguments

  • input: Input array in (W_in, H_in, C, N) shape.

  • grid: Input grid in (2, W_out, H_out, N) shape. Where for each (W_out, H_out, N) grid contains (x, y) coordinates that specify sampling locations normalized by the input shape.

    Therefore, x and y should have values in [-1, 1] range. For example, (x = -1, y = -1) is the left-top pixel of input, and (x = 1, y = 1) is the right-bottom pixel of input.

    Out-of-bound values are handled according to the padding_mode.

  • padding_mode: Out-of-bound padding. :zeros to use 0 for out-of-bound grid locations. :border to use border values for out-of-bound grid locations. Default is :zeros.

Returns

(W_out, H_out, C, N) sampled grid from input.

Examples

In the example below, grid contains two out-of-bound sampling locations, which are handled differently, depending on the padding_mode.

julia> x = reshape(collect(1.0:4.0), (2, 2, 1, 1))
+ 0.5  500.0  50.0  0.5
source

Sampling

NNlib.grid_sampleFunction
grid_sample(input::AbstractArray{T, 4}, grid::AbstractArray{T, 4}; padding_mode = :zeros)

Given input, compute output by sampling input values at pixel locations from grid. Uses bilinear interpolation to calculate output values.

This implementation assumes the extrema (-1 and 1) are considered as referring to the center points of the input’s corner pixels (i.e. align corners is true).

Arguments

  • input: Input array in (W_in, H_in, C, N) shape.

  • grid: Input grid in (2, W_out, H_out, N) shape. Where for each (W_out, H_out, N) grid contains (x, y) coordinates that specify sampling locations normalized by the input shape.

    Therefore, x and y should have values in [-1, 1] range. For example, (x = -1, y = -1) is the left-top pixel of input, and (x = 1, y = 1) is the right-bottom pixel of input.

    Out-of-bound values are handled according to the padding_mode.

  • padding_mode: Out-of-bound padding. :zeros to use 0 for out-of-bound grid locations. :border to use border values for out-of-bound grid locations. Default is :zeros.

Returns

(W_out, H_out, C, N) sampled grid from input.

Examples

In the example below, grid contains two out-of-bound sampling locations, which are handled differently, depending on the padding_mode.

julia> x = reshape(collect(1.0:4.0), (2, 2, 1, 1))
 2×2×1×1 Array{Float64, 4}:
 [:, :, 1, 1] =
  1.0  3.0
@@ -375,4 +375,4 @@
 [:, :, 1, 1] =
  1.0  3.0
  1.5  3.5
- 2.0  4.0
source
NNlib.∇grid_sampleFunction
∇grid_sample(Δ::AbstractArray{T, 4}, input::AbstractArray{T, 4}, grid::AbstractArray{T, 4}; padding_mode = :zeros) where T

Arguments

  • Δ: Input gradient in (W_out, H_out, C, N) shape (same as output of the primal computation).
  • input: Input from primal computation in (W_in, H_in, C, N) shape.
  • grid: Grid from primal computation in (2, W_out, H_out, N) shape.
  • padding_mode: Out-of-bound padding. :zeros to use 0 for out-of-bound grid locations. :border to use border values for out-of-bound grid locations. Should be the same as in primal computation. Default is :zeros.

Returns

dinput (same shape as input) and dgrid (same shape as grid) gradients.

source

Losses

NNlib.ctc_lossFunction
ctc_loss(ŷ, y)

Computes the connectionist temporal classification loss between and y. must be a classes-by-time matrices, i.e., each row represents a class and each column represents a time step. Additionally, the logsoftmax function will be applied to , so must be the raw activation values from the neural network and not, for example, the activations after being passed through a softmax activation function. y must be a 1D array of the labels associated with . The blank label is assumed to be the last label category in , so it is equivalent to size(ŷ, 1). Used for sequence-to-sequence classification problems such as speech recognition and handwriting recognition where the exact time-alignment of the output (e.g., letters) is not needed to solve the problem. See Graves et al. (2006) or Graves (2012) for mathematical details.

source

Miscellaneous

NNlib.logsumexpFunction
logsumexp(x; dims = :)

Computes log.(sum(exp.(x); dims)) in a numerically stable way. Without dims keyword this returns a scalar.

See also logsoftmax.

source
+ 2.0 4.0source
NNlib.∇grid_sampleFunction
∇grid_sample(Δ::AbstractArray{T, 4}, input::AbstractArray{T, 4}, grid::AbstractArray{T, 4}; padding_mode = :zeros) where T

Arguments

  • Δ: Input gradient in (W_out, H_out, C, N) shape (same as output of the primal computation).
  • input: Input from primal computation in (W_in, H_in, C, N) shape.
  • grid: Grid from primal computation in (2, W_out, H_out, N) shape.
  • padding_mode: Out-of-bound padding. :zeros to use 0 for out-of-bound grid locations. :border to use border values for out-of-bound grid locations. Should be the same as in primal computation. Default is :zeros.

Returns

dinput (same shape as input) and dgrid (same shape as grid) gradients.

source

Losses

NNlib.ctc_lossFunction
ctc_loss(ŷ, y)

Computes the connectionist temporal classification loss between and y. must be a classes-by-time matrices, i.e., each row represents a class and each column represents a time step. Additionally, the logsoftmax function will be applied to , so must be the raw activation values from the neural network and not, for example, the activations after being passed through a softmax activation function. y must be a 1D array of the labels associated with . The blank label is assumed to be the last label category in , so it is equivalent to size(ŷ, 1). Used for sequence-to-sequence classification problems such as speech recognition and handwriting recognition where the exact time-alignment of the output (e.g., letters) is not needed to solve the problem. See Graves et al. (2006) or Graves (2012) for mathematical details.

source

Miscellaneous

NNlib.logsumexpFunction
logsumexp(x; dims = :)

Computes log.(sum(exp.(x); dims)) in a numerically stable way. Without dims keyword this returns a scalar.

See also logsoftmax.

source
NNlib.gluFunction
glu(x, dim = 1)

The gated linear unit from the "Language Modeling with Gated Convolutional Networks" paper.

Calculates a .* sigmoid(b), where x is split in half along given dimension dim to form a and b.

source
diff --git a/dev/reference/outputsize/index.html b/dev/reference/outputsize/index.html index 7c59b18372..a7479f5bb0 100644 --- a/dev/reference/outputsize/index.html +++ b/dev/reference/outputsize/index.html @@ -68,7 +68,7 @@ # plus 2 non-trainable, 10 parameters, summarysize 10.469 KiB. julia> outputsize(ans, (28, 28, 1, 32)) -(10, 32)

Limitations:

source
Flux.outputsizeFunction
outputsize(m, x_size, y_size, ...; padbatch=false)

For model or layer m accepting multiple arrays as input, this returns size(m((x, y, ...))) given size_x = size(x), etc.

Examples

julia> x, y = rand(Float32, 5, 64), rand(Float32, 7, 64);
+(10, 32)

Limitations:

  • While @autosize (5, 32) Flux.Bilinear(_ => 7) is OK, something like Bilinear((_, _) => 7) will fail.
  • While Scale(_) and LayerNorm(_) are fine (and use the first dimension), Scale(_,_) and LayerNorm(_,_) will fail if size(x,1) != size(x,2).
source
Flux.outputsizeFunction
outputsize(m, x_size, y_size, ...; padbatch=false)

For model or layer m accepting multiple arrays as input, this returns size(m((x, y, ...))) given size_x = size(x), etc.

Examples

julia> x, y = rand(Float32, 5, 64), rand(Float32, 7, 64);
 
 julia> par = Parallel(vcat, Dense(5 => 9), Dense(7 => 11));
 
@@ -81,4 +81,4 @@
 (13, 1)
 
 julia> par(x, y) == par((x, y)) == Chain(par, identity)((x, y))
-true

Notice that Chain only accepts multiple arrays as a tuple, while Parallel also accepts them as multiple arguments; outputsize always supplies the tuple.

source
+true

Notice that Chain only accepts multiple arrays as a tuple, while Parallel also accepts them as multiple arguments; outputsize always supplies the tuple.

source diff --git a/dev/reference/training/callbacks/index.html b/dev/reference/training/callbacks/index.html index 0513a6e2bb..a4c1609869 100644 --- a/dev/reference/training/callbacks/index.html +++ b/dev/reference/training/callbacks/index.html @@ -10,7 +10,7 @@ sleep(1) end Flux -Fluxsource

Patience Helpers

Flux provides utilities for controlling your training procedure according to some monitored condition and a maximum patience. For example, you can use early_stopping to stop training when the model is converging or deteriorating, or you can use plateau to check if the model is stagnating.

For example, below we create a pseudo-loss function that decreases, bottoms out, and then increases. The early stopping trigger will break the loop before the loss increases too much.

# create a pseudo-loss that decreases for 4 calls, then starts increasing
+Flux
source

Patience Helpers

Flux provides utilities for controlling your training procedure according to some monitored condition and a maximum patience. For example, you can use early_stopping to stop training when the model is converging or deteriorating, or you can use plateau to check if the model is stagnating.

For example, below we create a pseudo-loss function that decreases, bottoms out, and then increases. The early stopping trigger will break the loop before the loss increases too much.

# create a pseudo-loss that decreases for 4 calls, then starts increasing
 # we call this like loss()
 loss = let t = 0
   () -> begin
@@ -61,7 +61,7 @@
        end
 [ Info: Epoch 1
 [ Info: Epoch 2
-[ Info: Epoch 3
source
Flux.early_stoppingFunction
early_stopping(f, delay; distance = -, init_score = 0, min_dist = 0)

Return a function that internally counts by one when distance(best_score, f(...)) <= min_dist, where best_score is the last seen best value of f(...). If the count is greater than or equal to delay, the function returns true, otherwise it returns false. The count is reset when distance(best_score, f(...)) > min_dist.

Examples

julia> loss = let l = 0
+[ Info: Epoch 3
source
Flux.early_stoppingFunction
early_stopping(f, delay; distance = -, init_score = 0, min_dist = 0)

Return a function that internally counts by one when distance(best_score, f(...)) <= min_dist, where best_score is the last seen best value of f(...). If the count is greater than or equal to delay, the function returns true, otherwise it returns false. The count is reset when distance(best_score, f(...)) > min_dist.

Examples

julia> loss = let l = 0
          () -> l += 1
        end; # pseudo loss function that returns increasing values
 
@@ -74,7 +74,7 @@
        end
 [ Info: Epoch 1
 [ Info: Epoch 2
-[ Info: Epoch 3
source
Flux.plateauFunction
plateau(f, width; distance = -, init_score = 0, min_dist = 1f-6)

Return a function that internally counts by one when abs(distance(last_score, f(...))) <= min_dist, where last_score holds the last value of f(...). If the count is greater than or equal to width, the function returns true, otherwise it returns false. The count is reset when abs(distance(last_score, f(...))) > min_dist.

Examples

julia> f = let v = 10
+[ Info: Epoch 3
source
Flux.plateauFunction
plateau(f, width; distance = -, init_score = 0, min_dist = 1f-6)

Return a function that internally counts by one when abs(distance(last_score, f(...))) <= min_dist, where last_score holds the last value of f(...). If the count is greater than or equal to width, the function returns true, otherwise it returns false. The count is reset when abs(distance(last_score, f(...))) > min_dist.

Examples

julia> f = let v = 10
          () -> v = v / abs(v) - v
        end; # -9, 8, -7, 6, ...
 
@@ -88,4 +88,4 @@
 [ Info: Epoch 1
 [ Info: Epoch 2
 [ Info: Epoch 3
-[ Info: Epoch 4
source
+[ Info: Epoch 4source diff --git a/dev/reference/training/optimisers/index.html b/dev/reference/training/optimisers/index.html index 1e137510e3..0d08f99e80 100644 --- a/dev/reference/training/optimisers/index.html +++ b/dev/reference/training/optimisers/index.html @@ -40,4 +40,4 @@ for epoch in 1:100 opt.eta = next!(schedule) # your training code here -end

ParameterSchedulers.jl allows for many more scheduling policies including arbitrary functions, looping any function with a given period, or sequences of many schedules. See the ParameterSchedulers.jl documentation for more info.

Decays

Similar to optimisers, Flux also defines some simple decays that can be used in conjunction with other optimisers, or standalone.

Optimisers.SignDecayType
SignDecay(λ = 1e-3)

Implements $L_1$ regularisation, also known as LASSO regression, when composed with other rules as the first transformation in an OptimiserChain.

It does this by adding λ .* sign(x) to the gradient. This is equivalent to adding λ * sum(abs, x) == λ * norm(x, 1) to the loss.

See also [WeightDecay] for $L_2$ normalisation. They can be used together: OptimiserChain(SignDecay(0.012), WeightDecay(0.034), Adam()) is equivalent to adding 0.012 * norm(x, 1) + 0.017 * norm(x, 2)^2 to the loss function.

Parameters

  • Penalty (λ ≥ 0): Controls the strength of the regularisation.
source
Optimisers.WeightDecayType
WeightDecay(λ = 5e-4)

Implements $L_2$ regularisation, also known as ridge regression, when composed with other rules as the first transformation in an OptimiserChain.

It does this by adding λ .* x to the gradient. This is equivalent to adding λ/2 * sum(abs2, x) == λ/2 * norm(x)^2 to the loss.

See also [SignDecay] for $L_1$ normalisation.

Parameters

  • Penalty (λ ≥ 0): Controls the strength of the regularisation.
source

Gradient Clipping

Gradient clipping is useful for training recurrent neural networks, which have a tendency to suffer from the exploding gradient problem. An example usage is

opt = OptimiserChain(ClipValue(1e-3), Adam(1e-3))
Optimisers.ClipGradType
ClipGrad(δ = 10)

Restricts every gradient component to obey -δ ≤ dx[i] ≤ δ.

Typically composed with other rules using OptimiserChain.

See also ClipNorm.

source
Optimisers.ClipNormType
ClipNorm(ω = 10, p = 2; throw = true)

Scales any gradient array for which norm(dx, p) > ω to stay at this threshold (unless p==0).

Throws an error if the norm is infinite or NaN, which you can turn off with throw = false.

Typically composed with other rules using OptimiserChain.

See also ClipGrad.

source
+end

ParameterSchedulers.jl allows for many more scheduling policies including arbitrary functions, looping any function with a given period, or sequences of many schedules. See the ParameterSchedulers.jl documentation for more info.

Decays

Similar to optimisers, Flux also defines some simple decays that can be used in conjunction with other optimisers, or standalone.

Optimisers.SignDecayType
SignDecay(λ = 1e-3)

Implements $L_1$ regularisation, also known as LASSO regression, when composed with other rules as the first transformation in an OptimiserChain.

It does this by adding λ .* sign(x) to the gradient. This is equivalent to adding λ * sum(abs, x) == λ * norm(x, 1) to the loss.

See also [WeightDecay] for $L_2$ normalisation. They can be used together: OptimiserChain(SignDecay(0.012), WeightDecay(0.034), Adam()) is equivalent to adding 0.012 * norm(x, 1) + 0.017 * norm(x, 2)^2 to the loss function.

Parameters

  • Penalty (λ ≥ 0): Controls the strength of the regularisation.
source
Optimisers.WeightDecayType
WeightDecay(λ = 5e-4)

Implements $L_2$ regularisation, also known as ridge regression, when composed with other rules as the first transformation in an OptimiserChain.

It does this by adding λ .* x to the gradient. This is equivalent to adding λ/2 * sum(abs2, x) == λ/2 * norm(x)^2 to the loss.

See also [SignDecay] for $L_1$ normalisation.

Parameters

  • Penalty (λ ≥ 0): Controls the strength of the regularisation.
source

Gradient Clipping

Gradient clipping is useful for training recurrent neural networks, which have a tendency to suffer from the exploding gradient problem. An example usage is

opt = OptimiserChain(ClipValue(1e-3), Adam(1e-3))
Optimisers.ClipGradType
ClipGrad(δ = 10)

Restricts every gradient component to obey -δ ≤ dx[i] ≤ δ.

Typically composed with other rules using OptimiserChain.

See also ClipNorm.

source
Optimisers.ClipNormType
ClipNorm(ω = 10, p = 2; throw = true)

Scales any gradient array for which norm(dx, p) > ω to stay at this threshold (unless p==0).

Throws an error if the norm is infinite or NaN, which you can turn off with throw = false.

Typically composed with other rules using OptimiserChain.

See also ClipGrad.

source
diff --git a/dev/reference/training/reference/index.html b/dev/reference/training/reference/index.html index addb3b3e11..a2e141291d 100644 --- a/dev/reference/training/reference/index.html +++ b/dev/reference/training/reference/index.html @@ -19,14 +19,14 @@ 10.19 julia> opt_state # mutated by Flux.train! -(weight = Leaf(Momentum(0.1, 0.9), [-2.018 3.027]), bias = Leaf(Momentum(0.1, 0.9), [-10.09]), σ = ())source
Flux.Optimise.train!Method
train!(loss, model, data, opt_state)

Uses a loss function and training data to improve the model's parameters according to a particular optimisation rule encoded in opt_state. Iterates through data once, evaluating for each d in data either loss(model, d...) if d isa Tuple, or else loss(model, d) for other d.

If model is an Enzyme.Duplicated and Enzyme.jl is loaded, gradients will be computed with Enzyme, otherwise they will be computed with Zygote.

For example, with these definitions...

data = [(x1, y1), (x2, y2), (x3, y3)]
+(weight = Leaf(Momentum(0.1, 0.9), [-2.018 3.027]), bias = Leaf(Momentum(0.1, 0.9), [-10.09]), σ = ())
source
Flux.Optimise.train!Method
train!(loss, model, data, opt_state)

Uses a loss function and training data to improve the model's parameters according to a particular optimisation rule encoded in opt_state. Iterates through data once, evaluating for each d in data either loss(model, d...) if d isa Tuple, or else loss(model, d) for other d.

If model is an Enzyme.Duplicated and Enzyme.jl is loaded, gradients will be computed with Enzyme, otherwise they will be computed with Zygote.

For example, with these definitions...

data = [(x1, y1), (x2, y2), (x3, y3)]
 
 loss3(m, x, y) = norm(m(x) .- y)        # the model is the first argument
 
 opt_state = Flux.setup(Adam(), model)   # explicit setup of optimiser momenta

...calling Flux.train!(loss3, model, data, opt_state) runs a loop much like this:

for d in data
     ∂L∂m = gradient(loss3, model, d...)[1]
     update!(opt_state, model, ∂L∂m)
-end

You can also write this loop yourself, if you need more flexibility. For this reason train! is not highly extensible. It adds only a few features to the loop above:

  • Stop with a DomainError if the loss is infinite or NaN at any point.

  • Show a progress bar using @withprogress.

New

This method was added in Flux 0.13.9. It has significant changes from the one used by Flux ≤ 0.13:

  • It now takes the model itself, not the result of Flux.params. (This is to move away from Zygote's "implicit" parameter handling, with Grads.)
  • Instead of loss being a function which accepts only the data, now it must also accept the model itself, as the first argument.
  • opt_state should be the result of Flux.setup. Using an optimiser such as Adam() without this step should give you a warning.
  • Callback functions are not supported. (But any code can be included in the above for loop.)
source
Optimisers.updateFunction
Optimisers.update(tree, model, gradient) -> (tree, model)

Uses the optimiser and the gradient to change the trainable parameters in the model. Returns the improved model, and the optimiser states needed for the next update. The initial tree of states comes from setup.

See also update!, which will be faster for models of ordinary Arrays or CuArrays.

Example

julia> m = (x = Float32[1,2,3], y = tanh);
+end

You can also write this loop yourself, if you need more flexibility. For this reason train! is not highly extensible. It adds only a few features to the loop above:

  • Stop with a DomainError if the loss is infinite or NaN at any point.

  • Show a progress bar using @withprogress.

New

This method was added in Flux 0.13.9. It has significant changes from the one used by Flux ≤ 0.13:

  • It now takes the model itself, not the result of Flux.params. (This is to move away from Zygote's "implicit" parameter handling, with Grads.)
  • Instead of loss being a function which accepts only the data, now it must also accept the model itself, as the first argument.
  • opt_state should be the result of Flux.setup. Using an optimiser such as Adam() without this step should give you a warning.
  • Callback functions are not supported. (But any code can be included in the above for loop.)
source
Optimisers.updateFunction
Optimisers.update(tree, model, gradient) -> (tree, model)

Uses the optimiser and the gradient to change the trainable parameters in the model. Returns the improved model, and the optimiser states needed for the next update. The initial tree of states comes from setup.

See also update!, which will be faster for models of ordinary Arrays or CuArrays.

Example

julia> m = (x = Float32[1,2,3], y = tanh);
 
 julia> t = Optimisers.setup(Descent(0.1), m)
 (x = Leaf(Descent(0.1), nothing), y = ())
@@ -115,4 +115,4 @@
 julia> Optimisers.thaw!(s)
 
 julia> s.x
-(Leaf(Momentum(0.01, 0.9), [0.0]), ())
source
Optimisers.thaw!Function
Optimisers.thaw!(tree)

The reverse of freeze!. Applies to all parameters, mutating every Leaf(rule, state, frozen = true) to Leaf(rule, state, frozen = false).

source
+(Leaf(Momentum(0.01, 0.9), [0.0]), ())source
Optimisers.thaw!Function
Optimisers.thaw!(tree)

The reverse of freeze!. Applies to all parameters, mutating every Leaf(rule, state, frozen = true) to Leaf(rule, state, frozen = false).

source
diff --git a/dev/reference/training/zygote/index.html b/dev/reference/training/zygote/index.html index 55f2a9068d..2fded768f4 100644 --- a/dev/reference/training/zygote/index.html +++ b/dev/reference/training/zygote/index.html @@ -203,4 +203,4 @@ # this definition of map is for any AD that only defines a reverse mode. # It is not as good as the rrule that can be used if the AD defines a forward-mode as well. -rrule(conf::RuleConfig{>:Union{NoForwardsMode, HasReverseMode}}, typeof(map), ::Vector) = ...

For more details see rule configurations and calling back into AD.

source
ChainRulesCore.TangentType
Tangent{P, T} <: StructuralTangent{P} <: AbstractTangent

This type represents the tangent for a struct/NamedTuple, or Tuple. P is the the corresponding primal type that this is a tangent for.

Tangent{P} should have fields (technically properties), that match to a subset of the fields of the primal type; and each should be a tangent type matching to the primal type of that field. Fields of the P that are not present in the Tangent are treated as Zero.

T is an implementation detail representing the backing data structure. For Tuple it will be a Tuple, and for everything else it will be a NamedTuple. It should not be passed in by user.

For Tangents of Tuples, iterate and getindex are overloaded to behave similarly to for a tuple. For Tangents of structs, getproperty is overloaded to allow for accessing values via tangent.fieldname. Any fields not explictly present in the Tangent are treated as being set to ZeroTangent(). To make a Tangent have all the fields of the primal the canonicalize function is provided.

source
ChainRulesCore.canonicalizeFunction
canonicalize(tangent::Tangent{P}) -> Tangent{P}

Return the canonical Tangent for the primal type P. The property names of the returned Tangent match the field names of the primal, and all fields of P not present in the input tangent are explictly set to ZeroTangent().

source
+rrule(conf::RuleConfig{>:Union{NoForwardsMode, HasReverseMode}}, typeof(map), ::Vector) = ...

For more details see rule configurations and calling back into AD.

source
ChainRulesCore.TangentType
Tangent{P, T} <: StructuralTangent{P} <: AbstractTangent

This type represents the tangent for a struct/NamedTuple, or Tuple. P is the the corresponding primal type that this is a tangent for.

Tangent{P} should have fields (technically properties), that match to a subset of the fields of the primal type; and each should be a tangent type matching to the primal type of that field. Fields of the P that are not present in the Tangent are treated as Zero.

T is an implementation detail representing the backing data structure. For Tuple it will be a Tuple, and for everything else it will be a NamedTuple. It should not be passed in by user.

For Tangents of Tuples, iterate and getindex are overloaded to behave similarly to for a tuple. For Tangents of structs, getproperty is overloaded to allow for accessing values via tangent.fieldname. Any fields not explictly present in the Tangent are treated as being set to ZeroTangent(). To make a Tangent have all the fields of the primal the canonicalize function is provided.

source
ChainRulesCore.canonicalizeFunction
canonicalize(tangent::Tangent{P}) -> Tangent{P}

Return the canonical Tangent for the primal type P. The property names of the returned Tangent match the field names of the primal, and all fields of P not present in the input tangent are explictly set to ZeroTangent().

source
diff --git a/dev/reference/utilities/index.html b/dev/reference/utilities/index.html index 3cbca8caef..4163da594b 100644 --- a/dev/reference/utilities/index.html +++ b/dev/reference/utilities/index.html @@ -32,7 +32,7 @@ julia> ans.bias 2-element Vector{Float32}: 0.0 - 0.0

References

[1] Glorot, Xavier, and Yoshua Bengio. "Understanding the difficulty of training deep feedforward neural networks." Proceedings of the thirteenth international conference on artificial intelligence and statistics. 2010.

source
Flux.glorot_normalFunction
glorot_normal([rng], size...; gain = 1) -> Array
+ 0.0

References

[1] Glorot, Xavier, and Yoshua Bengio. "Understanding the difficulty of training deep feedforward neural networks." Proceedings of the thirteenth international conference on artificial intelligence and statistics. 2010.

source
Flux.glorot_normalFunction
glorot_normal([rng], size...; gain = 1) -> Array
 glorot_normal([rng]; kw...) -> Function

Return an Array{Float32} of the given size containing random numbers drawn from a normal distribution with standard deviation gain * sqrt(2 / (fan_in + fan_out)), using nfan.

This method is described in [1] and also known as Xavier initialization.

Examples

julia> using Statistics
 
 julia> round(std(Flux.glorot_normal(10, 1000)), digits=3)
@@ -48,7 +48,7 @@
 Dense(10 => 1000, tanh)  # 11_000 parameters
 
 julia> round(std(ans.weight), sigdigits=3)
-4.45f0

References

[1] Glorot, Xavier, and Yoshua Bengio. "Understanding the difficulty of training deep feedforward neural networks." Proceedings of the thirteenth international conference on artificial intelligence and statistics. 2010.

source
Flux.kaiming_uniformFunction
kaiming_uniform([rng], size...; gain = √2) -> Array
+4.45f0

References

[1] Glorot, Xavier, and Yoshua Bengio. "Understanding the difficulty of training deep feedforward neural networks." Proceedings of the thirteenth international conference on artificial intelligence and statistics. 2010.

source
Flux.kaiming_uniformFunction
kaiming_uniform([rng], size...; gain = √2) -> Array
 kaiming_uniform([rng]; kw...) -> Function

Return an Array{Float32} of the given size containing random numbers drawn from a uniform distribution on the interval [-x, x], where x = gain * sqrt(3/fan_in) using nfan.

This method is described in [1] and also known as He initialization.

Examples

julia> round.(extrema(Flux.kaiming_uniform(100, 10)), digits=3)
 (-0.774f0, 0.773f0)
 
@@ -56,7 +56,7 @@
 (-0.243f0, 0.245f0)
 
 julia> round.(extrema(Flux.kaiming_uniform(100, 100)), digits=3)
-(-0.245f0, 0.245f0)

References

[1] He, Kaiming, et al. "Delving deep into rectifiers: Surpassing human-level performance on imagenet classification." Proceedings of the IEEE international conference on computer vision. 2015.

source
Flux.kaiming_normalFunction
kaiming_normal([rng], size...; gain = √2) -> Array
+(-0.245f0, 0.245f0)

References

[1] He, Kaiming, et al. "Delving deep into rectifiers: Surpassing human-level performance on imagenet classification." Proceedings of the IEEE international conference on computer vision. 2015.

source
Flux.kaiming_normalFunction
kaiming_normal([rng], size...; gain = √2) -> Array
 kaiming_normal([rng]; kw...) -> Function

Return an Array{Float32} of the given size containing random numbers taken from a normal distribution standard deviation gain / sqrt(fan_in), using nfan.

This method is described in [1] and also known as He initialization.

Examples

julia> using Statistics
 
 julia> round(std(Flux.kaiming_normal(10, 1000)), digits=3)
@@ -66,7 +66,7 @@
 0.449f0
 
 julia> round(std(Flux.kaiming_normal(1000, 1000)), digits=3)
-0.045f0

References

[1] He, Kaiming, et al. "Delving deep into rectifiers: Surpassing human-level performance on imagenet classification." Proceedings of the IEEE international conference on computer vision. 2015.

source
Flux.truncated_normalFunction
truncated_normal([rng], size...; mean = 0, std = 1, lo = -2, hi = 2) -> Array
+0.045f0

References

[1] He, Kaiming, et al. "Delving deep into rectifiers: Surpassing human-level performance on imagenet classification." Proceedings of the IEEE international conference on computer vision. 2015.

source
Flux.truncated_normalFunction
truncated_normal([rng], size...; mean = 0, std = 1, lo = -2, hi = 2) -> Array
 truncated_normal([rng]; kw...) -> Function

Return an Array{Float32} of the given size where each element is drawn from a truncated normal distribution. The numbers are distributed like filter(x -> lo<=x<=hi, mean .+ std .* randn(100)).

The values are generated by sampling a Uniform(0, 1) (rand()) and then applying the inverse CDF of the truncated normal distribution. This method works best when lo ≤ mean ≤ hi.

Examples

julia> using Statistics
 
 julia> Flux.truncated_normal(3, 4) |> summary
@@ -76,7 +76,7 @@
 (-2.0f0, 2.0f0)
 
 julia> round(std(Flux.truncated_normal(10^6; lo = -100, hi = 100)))
-1.0f0
source
Flux.orthogonalFunction
orthogonal([rng], size...; gain = 1) -> Array
+1.0f0
source
Flux.orthogonalFunction
orthogonal([rng], size...; gain = 1) -> Array
 orthogonal([rng]; kw...) -> Function

Return an Array{Float32} of the given size which is a (semi) orthogonal matrix, as described in [1].

Cannot construct a vector, i.e. length(size) == 1 is forbidden. For length(size) > 2, a prod(size[1:(end - 1)]) by size[end] orthogonal matrix is computed before reshaping it to the original dimensions.

Examples

julia> W = Flux.orthogonal(5, 7);
 
 julia> summary(W)
@@ -96,7 +96,7 @@
 julia> W3 = Flux.orthogonal(3, 3, 2, 4);
 
 julia> transpose(reshape(W3, :, 4)) * reshape(W3, :, 4) ≈ I(4)
-true

References

[1] Saxe, McClelland, Ganguli. "Exact solutions to the nonlinear dynamics of learning in deep linear neural networks", ICLR 2014, https://arxiv.org/abs/1312.6120

source
Flux.sparse_initFunction
sparse_init([rng], rows, cols; sparsity, std = 0.01) -> Array
+true

References

[1] Saxe, McClelland, Ganguli. "Exact solutions to the nonlinear dynamics of learning in deep linear neural networks", ICLR 2014, https://arxiv.org/abs/1312.6120

source
Flux.sparse_initFunction
sparse_init([rng], rows, cols; sparsity, std = 0.01) -> Array
 sparse_init([rng]; kw...) -> Function

Return a Matrix{Float32} of size rows, cols where each column contains a fixed fraction of zero elements given by sparsity. Non-zero elements are normally distributed with a mean of zero and standard deviation std.

This method is described in [1].

Examples

julia> count(iszero, Flux.sparse_init(10, 10, sparsity=1/5))
 20
 
@@ -109,7 +109,7 @@
 
 julia> count(iszero, ans.weight, dims=1)
 1×3 Matrix{Int64}:
- 5  5  5

References

[1] Martens, J, "Deep learning via Hessian-free optimization" Proceedings of the 27th International Conference on International Conference on Machine Learning. 2010.

source
Flux.identity_initFunction
identity_init(size...; gain=1, shift=0) -> Array
+ 5  5  5

References

[1] Martens, J, "Deep learning via Hessian-free optimization" Proceedings of the 27th International Conference on International Conference on Machine Learning. 2010.

source
Flux.identity_initFunction
identity_init(size...; gain=1, shift=0) -> Array
 identity_init(; kw...) -> Function

Return an Array{Float32} of the given size which yields an identity mapping when used as parameters in most Flux layers. Use gain to scale the identity by a constant.

Often useful in the context of transfer learning, i.e when one wants to add more capacity to a model but start from the same mapping.

Has the following behaviour

  • 1D: A Vector of zeros (useful for an identity bias)
  • 2D: An identity matrix (useful for an identity matrix multiplication)
  • More than 2D: A dense block array of center tap spatial filters (useful for an identity convolution)

Some caveats:

  • Not all layers will be identity mapping when used with this init. Exceptions include recurrent layers and normalization layers.

  • Layers must have input_size == output_size for identity mapping to be possible. When this is not the case, extra dimensions of the array are padded with zeros.

  • For convolutional layers, in addition to the above, the kernel sizes must also be odd and padding must be applied so that output feature maps have the same size as input feature maps, e.g by using SamePad.

Use keyword shift (integer or tuple) to apply circular shift to the output, equivalent to Base.circshift(identity_init(size...), shift).

For consistency with other initialisers, it accepts rng::AbstractRNG as an optional first argument. But this is ignored, since the result is not random.

Examples

julia> Flux.identity_init(3,5)
 3×5 Matrix{Float32}:
  1.0  0.0  0.0  0.0  0.0
@@ -141,7 +141,7 @@
 [:, :, 1, 1] =
  10.0  20.0  30.0
  40.0  50.0  60.0
- 70.0  80.0  90.0
source
Flux.ones32Function
ones32(size...) = ones(Float32, size...)

Return an Array{Float32} of the given size filled with 1s.

source
Flux.zeros32Function
zeros32(size...) = zeros(Float32, size...)

Return an Array{Float32} of the given size filled with 0s.

source
Flux.rand32Function
rand32([rng], size...)

Return an Array{Float32} of the given size, filled like rand. When the size is not provided, rand32(rng::AbstractRNG) returns a function.

source
Flux.randn32Function
randn32([rng], size...)

Return an Array{Float32} of the given size, filled like randn. When the size is not provided, randn32(rng::AbstractRNG) returns a function.

source
Flux.create_biasFunction
create_bias(weights, bias, size...)

Return a bias parameter for a layer, based on the value given to the constructor's keyword bias=bias.

  • bias == true creates a trainable array of the given size, of the same type as weights, initialised to zero.
  • bias == false returns false, which is understood by AD to be non-differentiable.
  • bias::AbstractArray uses the array provided, provided it has the correct size. It will also correct the eltype to match that of weights.
source

These functions call:

Flux.rng_from_arrayFunction
rng_from_array(x)

Create an instance of the RNG most appropriate for x. The current defaults are:

  • x isa CuArray: CUDA.default_rng()
  • x isa AbstractArray: `Random.default_rng()
source
Flux.nfanFunction
nfan(n_out, n_in=1) -> Tuple
+ 70.0  80.0  90.0
source
Flux.ones32Function
ones32(size...) = ones(Float32, size...)

Return an Array{Float32} of the given size filled with 1s.

source
Flux.zeros32Function
zeros32(size...) = zeros(Float32, size...)

Return an Array{Float32} of the given size filled with 0s.

source
Flux.rand32Function
rand32([rng], size...)

Return an Array{Float32} of the given size, filled like rand. When the size is not provided, rand32(rng::AbstractRNG) returns a function.

source
Flux.randn32Function
randn32([rng], size...)

Return an Array{Float32} of the given size, filled like randn. When the size is not provided, randn32(rng::AbstractRNG) returns a function.

source
Flux.create_biasFunction
create_bias(weights, bias, size...)

Return a bias parameter for a layer, based on the value given to the constructor's keyword bias=bias.

  • bias == true creates a trainable array of the given size, of the same type as weights, initialised to zero.
  • bias == false returns false, which is understood by AD to be non-differentiable.
  • bias::AbstractArray uses the array provided, provided it has the correct size. It will also correct the eltype to match that of weights.
source

These functions call:

Flux.rng_from_arrayFunction
rng_from_array(x)

Create an instance of the RNG most appropriate for x. The current defaults are:

  • x isa CuArray: CUDA.default_rng()
  • x isa AbstractArray: `Random.default_rng()
source
Flux.nfanFunction
nfan(n_out, n_in=1) -> Tuple
 nfan(dims...)
 nfan(dims::Tuple)

For a layer characterized by dimensions dims, return a tuple (fan_in, fan_out), where fan_in is the number of input neurons connected to an output one, and fan_out is the number of output neurons connected to an input one.

This function is mainly used by weight initializers, e.g., kaiming_normal.

Examples

julia> layer = Dense(10, 20);
 
@@ -151,7 +151,7 @@
 julia> layer = Conv((3, 3), 2=>10);
 
 julia> Flux.nfan(size(layer.weight))
-(18, 90)
source

Changing the type of all parameters

The default eltype for models is Float32 since models are often trained/run on GPUs. The eltype of model m can be changed to Float64 by f64(m):

Flux.f64Function
f64(m)

Converts the eltype of model's floating point parameters to Float64. Recurses into structs marked with @layer.

See also f32 and f16.

source
Flux.f32Function
f32(m)

Converts the eltype of model's floating point parameters to Float32 (which is Flux's default). Recurses into structs marked with @layer.

See also f64 and f16.

source
Flux.f16Function
f16(m)

Converts the eltype of model's floating point parameters to Float16. Recurses into structs marked with @layer.

Support for Float16 is limited on many CPUs. Julia may convert to Float32 for each operation, which is slow.

See also f32 and f64.

Example

julia> m = Chain(Dense(784, 2048, relu), Dense(2048, 10))  # all Float32
+(18, 90)
source

Changing the type of all parameters

The default eltype for models is Float32 since models are often trained/run on GPUs. The eltype of model m can be changed to Float64 by f64(m):

Flux.f64Function
f64(m)

Converts the eltype of model's floating point parameters to Float64. Recurses into structs marked with @layer.

See also f32 and f16.

source
Flux.f32Function
f32(m)

Converts the eltype of model's floating point parameters to Float32 (which is Flux's default). Recurses into structs marked with @layer.

See also f64 and f16.

source
Flux.f16Function
f16(m)

Converts the eltype of model's floating point parameters to Float16. Recurses into structs marked with @layer.

Support for Float16 is limited on many CPUs. Julia may convert to Float32 for each operation, which is slow.

See also f32 and f64.

Example

julia> m = Chain(Dense(784, 2048, relu), Dense(2048, 10))  # all Float32
 Chain(
   Dense(784 => 2048, relu),             # 1_607_680 parameters
   Dense(2048 => 10),                    # 20_490 parameters
@@ -161,4 +161,4 @@
 Chain(
   Dense(784 => 2048, relu),             # 1_607_680 parameters
   Dense(2048 => 10),                    # 20_490 parameters
-)                   # Total: 4 arrays, 1_628_170 parameters, 3.106 MiB.
source
+) # Total: 4 arrays, 1_628_170 parameters, 3.106 MiB.source diff --git a/dev/tutorials/linear_regression/index.html b/dev/tutorials/linear_regression/index.html index 83eb3fedd1..bb4e8586f9 100644 --- a/dev/tutorials/linear_regression/index.html +++ b/dev/tutorials/linear_regression/index.html @@ -106,4 +106,4 @@ 27.1272f0

The loss went down significantly! It can be minimized further by choosing an even smaller δ.

Testing the Flux model

The last step of this tutorial would be to test our model using the testing data. We will first normalise the testing data and then calculate the corresponding loss.

julia> x_test_n = Flux.normalise(x_test);
 
 julia> loss(model, x_test_n, y_test)
-66.91015f0

The loss is not as small as the loss of the training data, but it looks good! This also shows that our model is not overfitting!


Summarising this tutorial, we started by generating a random yet correlated dataset for our custom model. We then saw how a simple linear regression model could be built with and without Flux, and how they were almost identical.

Next, we trained the model by manually writing down the Gradient Descent algorithm and optimising the loss. We also saw how Flux provides various wrapper functionalities and keeps the API extremely intuitive and simple for the users.

After getting familiar with the basics of Flux and Julia, we moved ahead to build a machine learning model for a real dataset. We repeated the exact same steps, but this time with a lot more features and data points, and by harnessing Flux's full capabilities. In the end, we developed a training loop that was smarter than the hardcoded one and ran the model on our normalised dataset to conclude the tutorial.

Info

Originally published on 21 November 2022, by Saransh Chopra.

+66.91015f0

The loss is not as small as the loss of the training data, but it looks good! This also shows that our model is not overfitting!


Summarising this tutorial, we started by generating a random yet correlated dataset for our custom model. We then saw how a simple linear regression model could be built with and without Flux, and how they were almost identical.

Next, we trained the model by manually writing down the Gradient Descent algorithm and optimising the loss. We also saw how Flux provides various wrapper functionalities and keeps the API extremely intuitive and simple for the users.

After getting familiar with the basics of Flux and Julia, we moved ahead to build a machine learning model for a real dataset. We repeated the exact same steps, but this time with a lot more features and data points, and by harnessing Flux's full capabilities. In the end, we developed a training loop that was smarter than the hardcoded one and ran the model on our normalised dataset to conclude the tutorial.

Info

Originally published on 21 November 2022, by Saransh Chopra.

diff --git a/dev/tutorials/logistic_regression/index.html b/dev/tutorials/logistic_regression/index.html index b9ea5cc819..79a55a5a6b 100644 --- a/dev/tutorials/logistic_regression/index.html +++ b/dev/tutorials/logistic_regression/index.html @@ -131,4 +131,4 @@ flux_accuracy(x, y) = 0.98 julia> flux_loss(flux_model, x, flux_y_onehot) -0.6952386604624324

We see a very similar final loss and accuracy.


Summarising this tutorial, we saw how we can run a logistic regression algorithm in Julia with and without using Flux. We started by importing the classic Iris dataset, and one hot encoded the labels. Next, we defined our model, the loss function, and the accuracy, all by ourselves.

Finally, we trained the model by manually writing down the Gradient Descent algorithm and optimising the loss. Interestingly, we implemented most of the functions on our own, and then parallelly compared them with the functionalities provided by Flux!

Info

Originally published on 1st April 2023, by Saransh Chopra.

+0.6952386604624324

We see a very similar final loss and accuracy.


Summarising this tutorial, we saw how we can run a logistic regression algorithm in Julia with and without using Flux. We started by importing the classic Iris dataset, and one hot encoded the labels. Next, we defined our model, the loss function, and the accuracy, all by ourselves.

Finally, we trained the model by manually writing down the Gradient Descent algorithm and optimising the loss. Interestingly, we implemented most of the functions on our own, and then parallelly compared them with the functionalities provided by Flux!

Info

Originally published on 1st April 2023, by Saransh Chopra.

diff --git a/dev/tutorials/model_zoo/index.html b/dev/tutorials/model_zoo/index.html index 4588ecf825..6ecd4adef4 100644 --- a/dev/tutorials/model_zoo/index.html +++ b/dev/tutorials/model_zoo/index.html @@ -3,4 +3,4 @@ function gtag(){dataLayer.push(arguments);} gtag('js', new Date()); gtag('config', 'UA-36890222-9', {'page_path': location.pathname + location.search + location.hash}); -
+