diff --git a/dev/index.html b/dev/index.html index e2d1402..f72d30c 100644 --- a/dev/index.html +++ b/dev/index.html @@ -324,4 +324,4 @@ @show acc_build_val

Output:

acc_learn_train = 0.9983
 acc_learn_val = 0.6866
 acc_build_train = 1.0
-acc_build_val = 0.3284

Alternatively, a wrapper function incorporates all of the functionality above. With this function, you can quickly explore datasets with different parameter settings. See the Test Combo Introduction for details.

Supports

There are two types of support in the output: a single utterance-level support and a set of cue-level supports, one for each cue. The utterance-level support is also called "synthesis-by-analysis" support. It is calculated from the predicted S vector and the original S vector, and it is used to select the best paths. Cue-level supports are slices of the Yt matrices at each timestep. They are used to determine whether a cue is eligible for constructing paths.

Acknowledgments

This project was supported by the ERC advanced grant WIDE-742545 and by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy - EXC number 2064/1 - Project number 390727645.

Citation

If you find this package helpful, please cite it as follows:

Luo, X., Heitmeier, M., Chuang, Y. Y., Baayen, R. H. JudiLing: an implementation of the Discriminative Lexicon Model in Julia. Eberhard Karls Universität Tübingen, Seminar für Sprachwissenschaft.

The following studies have made use of several algorithms now implemented in JudiLing instead of WpmWithLdl:

+acc_build_val = 0.3284

Alternatively, a wrapper function incorporates all of the functionality above. With this function, you can quickly explore datasets with different parameter settings. See the Test Combo Introduction for details.

Supports

There are two types of support in the output: a single utterance-level support and a set of cue-level supports, one for each cue. The utterance-level support is also called "synthesis-by-analysis" support. It is calculated from the predicted S vector and the original S vector, and it is used to select the best paths. Cue-level supports are slices of the Yt matrices at each timestep. They are used to determine whether a cue is eligible for constructing paths.

Acknowledgments

This project was supported by the ERC advanced grant WIDE-742545 and by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy - EXC number 2064/1 - Project number 390727645.

Citation

If you find this package helpful, please cite it as follows:

Luo, X., Heitmeier, M., Chuang, Y. Y., Baayen, R. H. JudiLing: an implementation of the Discriminative Lexicon Model in Julia. Eberhard Karls Universität Tübingen, Seminar für Sprachwissenschaft.

The following studies have made use of several algorithms now implemented in JudiLing instead of WpmWithLdl:

diff --git a/dev/man/all_manual/index.html b/dev/man/all_manual/index.html index 1d2634c..85ef232 100644 --- a/dev/man/all_manual/index.html +++ b/dev/man/all_manual/index.html @@ -1,2 +1,2 @@ -All Manual index · JudiLing.jl
+All Manual index · JudiLing.jl
diff --git a/dev/man/cholesky/index.html b/dev/man/cholesky/index.html index e9dedc3..d7c5804 100644 --- a/dev/man/cholesky/index.html +++ b/dev/man/cholesky/index.html @@ -1,5 +1,5 @@ -Cholesky · JudiLing.jl

Cholesky

JudiLing.make_transform_facFunction

The first step of make_transform_matrix, which produces the factorization consumed by the second step. It is usually called by the learn_paths function to save time and computing resources.

source
JudiLing.make_transform_matrixMethod
make_transform_matrix(fac::Union{LinearAlgebra.Cholesky, SuiteSparse.CHOLMOD.Factor}, X::Union{SparseMatrixCSC, Matrix}, Y::Union{SparseMatrixCSC, Matrix})

Second step in calculating the Cholesky decomposition for the transformation matrix.

source
JudiLing.make_transform_matrixMethod
make_transform_matrix(X::SparseMatrixCSC, Y::Matrix)

Use Cholesky decomposition to calculate the transformation matrix from X to Y, where X is a sparse matrix and Y is a dense matrix.

Obligatory Arguments

  • X::SparseMatrixCSC: the X matrix, where X is a sparse matrix
  • Y::Matrix: the Y matrix, where Y is a dense matrix

Optional Arguments

  • method::Symbol = :additive: whether :additive or :multiplicative decomposition is required
  • shift::Float64 = 0.02: shift value for :additive decomposition
  • multiplier::Float64 = 1.01: multiplier value for :multiplicative decomposition
  • output_format::Symbol = :auto: forces the output format to dense (:dense) or sparse (:sparse); with :auto, the program determines the format
  • sparse_ratio::Float64 = 0.05: the ratio to decide whether a matrix is sparse
  • verbose::Bool = false: if true, more information will be printed out

Examples

# additive mode
+Cholesky · JudiLing.jl

Cholesky

JudiLing.make_transform_facFunction

The first step of make_transform_matrix, which produces the factorization consumed by the second step. It is usually called by the learn_paths function to save time and computing resources.

source
JudiLing.make_transform_matrixMethod
make_transform_matrix(fac::Union{LinearAlgebra.Cholesky, SuiteSparse.CHOLMOD.Factor}, X::Union{SparseMatrixCSC, Matrix}, Y::Union{SparseMatrixCSC, Matrix})

Second step in calculating the Cholesky decomposition for the transformation matrix.

source
JudiLing.make_transform_matrixMethod
make_transform_matrix(X::SparseMatrixCSC, Y::Matrix)

Use Cholesky decomposition to calculate the transformation matrix from X to Y, where X is a sparse matrix and Y is a dense matrix.

Obligatory Arguments

  • X::SparseMatrixCSC: the X matrix, where X is a sparse matrix
  • Y::Matrix: the Y matrix, where Y is a dense matrix

Optional Arguments

  • method::Symbol = :additive: whether :additive or :multiplicative decomposition is required
  • shift::Float64 = 0.02: shift value for :additive decomposition
  • multiplier::Float64 = 1.01: multiplier value for :multiplicative decomposition
  • output_format::Symbol = :auto: forces the output format to dense (:dense) or sparse (:sparse); with :auto, the program determines the format
  • sparse_ratio::Float64 = 0.05: the ratio to decide whether a matrix is sparse
  • verbose::Bool = false: if true, more information will be printed out
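As a rough illustration of what a Cholesky-based transformation with an additive shift can look like, the sketch below solves the normal equations (X'X + shift*I) F = X'Y using only Julia's standard libraries. The helper name shifted_cholesky_transform is hypothetical, and the assumption that the additive mode adds shift to the diagonal is ours rather than something stated on this page; JudiLing's actual implementation may differ.

```julia
using LinearAlgebra, SparseArrays

# Hypothetical helper, not part of JudiLing: estimate F such that X * F ≈ Y
# by solving the normal equations with a shift added to the diagonal.
function shifted_cholesky_transform(X, Y; shift = 0.02)
    XtX = Matrix(X' * X)             # dense Gram matrix, for simplicity
    A = Symmetric(XtX + shift * I)   # assumed additive shift on the diagonal
    fac = cholesky(A)                # step 1: factorization
    return fac \ Matrix(X' * Y)      # step 2: solve for the transformation
end

# Toy data constructed so that Y == X * [1 2; 3 4] exactly.
X = sparse([1.0 0.0; 0.0 1.0; 1.0 1.0])
Y = [1.0 2.0; 3.0 4.0; 4.0 6.0]

# With shift = 0.0 the exact mapping is recovered (up to rounding).
F = shifted_cholesky_transform(X, Y; shift = 0.0)
```

A positive shift trades a small bias in F for a matrix that is safely positive definite, which is why a small default such as 0.02 is plausible when X'X is close to singular.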

Examples

# additive mode
 JudiLing.make_transform_matrix(
     C,
     S,
@@ -20,7 +20,7 @@
   ...
     output_format = :auto,
     sparse_ratio = 0.05,
-  ...)
source
JudiLing.make_transform_matrixMethod
make_transform_matrix(X::Matrix, Y::Union{SparseMatrixCSC, Matrix})

Use the Cholesky decomposition to calculate the transformation matrix from X to Y, where X is a dense matrix and Y is either a dense matrix or a sparse matrix.

Obligatory Arguments

  • X::Matrix: the X matrix, where X is a dense matrix
  • Y::Union{SparseMatrixCSC, Matrix}: the Y matrix, where Y is either a sparse or a dense matrix

Optional Arguments

  • method::Symbol = :additive: whether :additive or :multiplicative decomposition is required
  • shift::Float64 = 0.02: shift value for :additive decomposition
  • multiplier::Float64 = 1.01: multiplier value for :multiplicative decomposition
  • output_format::Symbol = :auto: forces the output format to dense (:dense) or sparse (:sparse); with :auto, the program determines the format
  • sparse_ratio::Float64 = 0.05: the ratio to decide whether a matrix is sparse
  • verbose::Bool = false: if true, more information will be printed out

Examples

# additive mode
+  ...)
source
JudiLing.make_transform_matrixMethod
make_transform_matrix(X::Matrix, Y::Union{SparseMatrixCSC, Matrix})

Use the Cholesky decomposition to calculate the transformation matrix from X to Y, where X is a dense matrix and Y is either a dense matrix or a sparse matrix.

Obligatory Arguments

  • X::Matrix: the X matrix, where X is a dense matrix
  • Y::Union{SparseMatrixCSC, Matrix}: the Y matrix, where Y is either a sparse or a dense matrix

Optional Arguments

  • method::Symbol = :additive: whether :additive or :multiplicative decomposition is required
  • shift::Float64 = 0.02: shift value for :additive decomposition
  • multiplier::Float64 = 1.01: multiplier value for :multiplicative decomposition
  • output_format::Symbol = :auto: forces the output format to dense (:dense) or sparse (:sparse); with :auto, the program determines the format
  • sparse_ratio::Float64 = 0.05: the ratio to decide whether a matrix is sparse
  • verbose::Bool = false: if true, more information will be printed out

Examples

# additive mode
 JudiLing.make_transform_matrix(
     C,
     S,
@@ -41,7 +41,7 @@
     ...
     output_format = :auto,
     sparse_ratio = 0.05,
-    ...)
source
JudiLing.make_transform_matrixMethod
make_transform_matrix(X::SparseMatrixCSC, Y::SparseMatrixCSC)

Use the Cholesky decomposition to calculate the transformation matrix from X to Y, where X is a sparse matrix and Y is a sparse matrix.

Obligatory Arguments

  • X::SparseMatrixCSC: the X matrix, where X is a sparse matrix
  • Y::SparseMatrixCSC: the Y matrix, where Y is a sparse matrix

Optional Arguments

  • method::Symbol = :additive: whether :additive or :multiplicative decomposition is required
  • shift::Float64 = 0.02: shift value for :additive decomposition
  • multiplier::Float64 = 1.01: multiplier value for :multiplicative decomposition
  • output_format::Symbol = :auto: forces the output format to dense (:dense) or sparse (:sparse); with :auto, the program determines the format
  • sparse_ratio::Float64 = 0.05: the ratio to decide whether a matrix is sparse
  • verbose::Bool = false: if true, more information will be printed out

Examples

# additive mode
+    ...)
source
JudiLing.make_transform_matrixMethod
make_transform_matrix(X::SparseMatrixCSC, Y::SparseMatrixCSC)

Use the Cholesky decomposition to calculate the transformation matrix from X to Y, where X is a sparse matrix and Y is a sparse matrix.

Obligatory Arguments

  • X::SparseMatrixCSC: the X matrix, where X is a sparse matrix
  • Y::SparseMatrixCSC: the Y matrix, where Y is a sparse matrix

Optional Arguments

  • method::Symbol = :additive: whether :additive or :multiplicative decomposition is required
  • shift::Float64 = 0.02: shift value for :additive decomposition
  • multiplier::Float64 = 1.01: multiplier value for :multiplicative decomposition
  • output_format::Symbol = :auto: forces the output format to dense (:dense) or sparse (:sparse); with :auto, the program determines the format
  • sparse_ratio::Float64 = 0.05: the ratio to decide whether a matrix is sparse
  • verbose::Bool = false: if true, more information will be printed out

Examples

# additive mode
 JudiLing.make_transform_matrix(
     C,
     S,
@@ -62,4 +62,4 @@
     ...
     output_format = :auto,
     sparse_ratio = 0.05,
-    ...)
source
JudiLing.format_matrixFunction
format_matrix(M::Union{SparseMatrixCSC, Matrix}, output_format=:auto)

Convert output matrix format to either a dense matrix or a sparse matrix.

source
+ ...)
source
JudiLing.format_matrixFunction
format_matrix(M::Union{SparseMatrixCSC, Matrix}, output_format=:auto)

Convert output matrix format to either a dense matrix or a sparse matrix.

source
diff --git a/dev/man/deep_learning/index.html b/dev/man/deep_learning/index.html index 4e7c2f3..d3366e7 100644 --- a/dev/man/deep_learning/index.html +++ b/dev/man/deep_learning/index.html @@ -1,7 +1,7 @@ Deep learning · JudiLing.jl

Deep learning in JudiLing

JudiLing.predict_from_deep_modelMethod
predict_from_deep_model(model::Chain,
-                        X::Union{SparseMatrixCSC,Matrix})

Generates output of a model given input X.

Obligatory arguments

  • model::Chain: Model of type Flux.Chain, as generated by get_and_train_model
  • X::Union{SparseMatrixCSC,Matrix}: Input matrix of size (number_of_samples, inp_dim), where inp_dim is the input dimension of model
source
JudiLing.predict_shatMethod
predict_shat(model::Chain,
-             ci::Vector{Int})

Predicts semantic vector shat given a deep learning comprehension model model and a list of indices of ngrams ci.

Obligatory arguments

  • model::Chain: Deep learning comprehension model as generated by get_and_train_model
  • ci::Vector{Int}: Vector of indices of ngrams in c vector. Essentially, this is a vector indicating which ngrams in a c vector are absent and which are present.
source
JudiLing.get_and_train_modelMethod
get_and_train_model(X_train::Union{SparseMatrixCSC,Matrix},
+                        X::Union{SparseMatrixCSC,Matrix})

Generates output of a model given input X.

Obligatory arguments

  • model::Chain: Model of type Flux.Chain, as generated by get_and_train_model
  • X::Union{SparseMatrixCSC,Matrix}: Input matrix of size (number_of_samples, inp_dim), where inp_dim is the input dimension of model
source
JudiLing.predict_shatMethod
predict_shat(model::Chain,
+             ci::Vector{Int})

Predicts semantic vector shat given a deep learning comprehension model model and a list of indices of ngrams ci.

Obligatory arguments

  • model::Chain: Deep learning comprehension model as generated by get_and_train_model
  • ci::Vector{Int}: Vector of indices of ngrams in c vector. Essentially, this is a vector indicating which ngrams in a c vector are absent and which are present.
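To make the relation between an index vector such as ci and a cue vector concrete, here is a minimal sketch that expands a list of n-gram indices into a binary presence/absence vector. The helper cue_vector and the explicit n_cues argument are hypothetical and only illustrate the idea; they are not part of JudiLing's API.

```julia
# Hypothetical helper: build a 0/1 vector of length n_cues in which the
# positions listed in ci (the n-grams that are present) are set to 1.
function cue_vector(ci::Vector{Int}, n_cues::Int)
    c = zeros(Int, n_cues)
    c[ci] .= 1
    return c
end

# N-grams 2 and 5 are present out of 6 possible n-grams.
c = cue_vector([2, 5], 6)
```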
source
JudiLing.get_and_train_modelMethod
get_and_train_model(X_train::Union{SparseMatrixCSC,Matrix},
                     Y_train::Union{SparseMatrixCSC,Matrix},
                     X_val::Union{SparseMatrixCSC,Matrix,Missing},
                     Y_val::Union{SparseMatrixCSC,Matrix,Missing},
@@ -24,7 +24,7 @@
                     ...kargs
                     )

Trains a deep learning model from X_train to Y_train, saving the model with either the highest validation accuracy or lowest validation loss (depending on optimise_for_acc) to outpath.

The default model looks like this:

inp_dim = size(X_train, 2)
 out_dim = size(Y_train, 2)
-Chain(Dense(inp_dim => hidden_dim, relu), Dense(hidden_dim => out_dim))

Any other model with the same input and output dimensions can be provided to the function with the model argument. The default loss function is mean squared error, but any other loss function can be provided, as long as it fits the model architecture.

By default, the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.001 is used. You can provide any other optimizer. If you want to use a different learning rate, e.g. 0.01, provide optimizer=Flux.Adam(0.01). If you prefer plain gradient descent, provide optimizer=Descent(0.001), again replacing the learning rate with the one of your preference.

Returns a named tuple with the following values:

  • model: the trained model
  • data_train: the training data, including any measures if computed by measures_func
  • data_val: the validation data, including any measures if computed by measures_func
  • losses_train: The losses of the training data for each epoch.
  • losses_val: The losses of the validation data after each epoch.
  • accs_train: The accuracies of the training data after each epoch, if return_train_acc=true.
  • accs_val: The accuracies of the validation data after each epoch.

Obligatory arguments

  • X_train::Union{SparseMatrixCSC,Matrix}: training input matrix of dimension m x n
  • Y_train::Union{SparseMatrixCSC,Matrix}: training output/target matrix of dimension m x k
  • X_val::Union{SparseMatrixCSC,Matrix}: validation input matrix of dimension l x n
  • Y_val::Union{SparseMatrixCSC,Matrix}: validation output/target matrix of dimension l x k
  • data_train::DataFrame: training data
  • data_val::DataFrame: validation data
  • target_col::Union{Symbol, String}: column with target wordforms in data_train and data_val
  • model_outpath::String: filepath to where final model should be stored (in .bson format)

Optional arguments

  • hidden_dim::Int=1000: hidden dimension of the model
  • n_epochs::Int=100: number of epochs for which the model should be trained
  • batchsize::Int=64: batchsize during training
  • loss_func::Function=Flux.mse: Loss function. Per default this is the mse loss, but other options might be a crossentropy loss (Flux.crossentropy). Make sure the model makes sense with the loss function!
  • optimizer=Flux.Adam(0.001): optimizer to use for training
  • model::Union{Missing, Chain} = missing: A custom model can be provided for training. Its requirements are that it has to correspond to the input and output size of the training and validation data
  • early_stopping::Union{Missing, Int}=missing: If missing, no early stopping is used. Otherwise early_stopping indicates how many epochs have to pass without improvement in validation accuracy before the training is stopped.
  • optimise_for_acc::Bool=false: if true, keep model with highest validation accuracy. If false, keep model with lowest validation loss.
  • return_losses::Bool=false: whether, in addition to the model, the per-epoch losses for the training and test data as well as the per-epoch accuracy on the validation data should be returned
  • verbose::Bool=true: Turn on verbose mode
  • measures_func::Union{Missing, Function}=missing: A measures function which is run at the end of every epoch. For more information see The measures_func argument. If a measure is tagged for each epoch, the one tagged with "final" will be the one for the finally returned model.
  • return_train_acc::Bool=false: If true, a vector with training accuracies is returned at the end of the training.
  • ...kargs: any additional keyword arguments are passed to the measures_func
source
JudiLing.get_and_train_modelMethod
get_and_train_model(X_train::Union{SparseMatrixCSC,Matrix},
+Chain(Dense(inp_dim => hidden_dim, relu), Dense(hidden_dim => out_dim))

Any other model with the same input and output dimensions can be provided to the function with the model argument. The default loss function is mean squared error, but any other loss function can be provided, as long as it fits the model architecture.

By default, the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.001 is used. You can provide any other optimizer. If you want to use a different learning rate, e.g. 0.01, provide optimizer=Flux.Adam(0.01). If you prefer plain gradient descent, provide optimizer=Descent(0.001), again replacing the learning rate with the one of your preference.
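To show what the default two-layer architecture computes, the following sketch spells out its forward pass in plain Julia, without Flux. The function forward and the toy weights are hypothetical and chosen only for illustration; in practice the weights are learned and the layers are Flux Dense layers.

```julia
relu(x) = max(zero(x), x)

# Forward pass shaped like Dense(inp_dim => hidden_dim, relu) followed by
# Dense(hidden_dim => out_dim): an affine map, a ReLU, then another affine map.
function forward(W1, b1, W2, b2, x)
    h = relu.(W1 * x .+ b1)   # hidden layer activations
    return W2 * h .+ b2       # linear output layer
end

# Toy weights with inp_dim = 3, hidden_dim = 2, out_dim = 1.
W1 = [1.0 0.0 -1.0; 0.0 1.0 0.0]
b1 = [0.0, -1.0]
W2 = [1.0 2.0]
b2 = [0.5]

x = [2.0, 3.0, 1.0]
yhat = forward(W1, b1, W2, b2, x)
```

A custom model passed through the model argument only needs to keep the same input and output sizes; the number and width of the hidden layers are free.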

Returns a named tuple with the following values:

  • model: the trained model
  • data_train: the training data, including any measures if computed by measures_func
  • data_val: the validation data, including any measures if computed by measures_func
  • losses_train: The losses of the training data for each epoch.
  • losses_val: The losses of the validation data after each epoch.
  • accs_train: The accuracies of the training data after each epoch, if return_train_acc=true.
  • accs_val: The accuracies of the validation data after each epoch.

Obligatory arguments

  • X_train::Union{SparseMatrixCSC,Matrix}: training input matrix of dimension m x n
  • Y_train::Union{SparseMatrixCSC,Matrix}: training output/target matrix of dimension m x k
  • X_val::Union{SparseMatrixCSC,Matrix}: validation input matrix of dimension l x n
  • Y_val::Union{SparseMatrixCSC,Matrix}: validation output/target matrix of dimension l x k
  • data_train::DataFrame: training data
  • data_val::DataFrame: validation data
  • target_col::Union{Symbol, String}: column with target wordforms in data_train and data_val
  • model_outpath::String: filepath to where final model should be stored (in .bson format)

Optional arguments

  • hidden_dim::Int=1000: hidden dimension of the model
  • n_epochs::Int=100: number of epochs for which the model should be trained
  • batchsize::Int=64: batchsize during training
  • loss_func::Function=Flux.mse: Loss function. Per default this is the mse loss, but other options might be a crossentropy loss (Flux.crossentropy). Make sure the model makes sense with the loss function!
  • optimizer=Flux.Adam(0.001): optimizer to use for training
  • model::Union{Missing, Chain} = missing: A custom model can be provided for training. Its requirements are that it has to correspond to the input and output size of the training and validation data
  • early_stopping::Union{Missing, Int}=missing: If missing, no early stopping is used. Otherwise early_stopping indicates how many epochs have to pass without improvement in validation accuracy before the training is stopped.
  • optimise_for_acc::Bool=false: if true, keep model with highest validation accuracy. If false, keep model with lowest validation loss.
  • return_losses::Bool=false: whether, in addition to the model, the per-epoch losses for the training and test data as well as the per-epoch accuracy on the validation data should be returned
  • verbose::Bool=true: Turn on verbose mode
  • measures_func::Union{Missing, Function}=missing: A measures function which is run at the end of every epoch. For more information see The measures_func argument. If a measure is tagged for each epoch, the one tagged with "final" will be the one for the finally returned model.
  • return_train_acc::Bool=false: If true, a vector with training accuracies is returned at the end of the training.
  • ...kargs: any additional keyword arguments are passed to the measures_func
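The early_stopping rule described in the list above can be sketched as a simple patience counter over per-epoch validation accuracies. The function stopping_epoch is a hypothetical illustration of that rule under the assumption that only strict improvements reset the counter; it is not JudiLing's training loop.

```julia
# Hypothetical helper: return the epoch at which training would stop when
# `patience` consecutive epochs pass without a new best validation accuracy.
function stopping_epoch(val_accs::Vector{Float64}, patience::Int)
    best = -Inf
    since_best = 0
    for (epoch, acc) in enumerate(val_accs)
        if acc > best
            best = acc          # new best model would be kept here
            since_best = 0
        else
            since_best += 1
        end
        since_best >= patience && return epoch
    end
    return length(val_accs)     # patience never exhausted: all epochs run
end

accs = [0.50, 0.60, 0.59, 0.58, 0.70]
stop_short = stopping_epoch(accs, 2)
stop_long = stopping_epoch(accs, 3)
```

With a patience of 2, training stops at epoch 4 and never sees the later improvement at epoch 5, which illustrates why a patience that is too small can cut training short.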
source
JudiLing.get_and_train_modelMethod
get_and_train_model(X_train::Union{SparseMatrixCSC,Matrix},
                     Y_train::Union{SparseMatrixCSC,Matrix},
                     model_outpath::String;
                     data_train::Union{Missing, DataFrame}=missing,
@@ -41,7 +41,7 @@
                     return_train_acc::Bool=false,
                     ...kargs)

Trains a deep learning model from X_train to Y_train, saving the model after n_epochs epochs. The default model looks like this:

inp_dim = size(X_train, 2)
 out_dim = size(Y_train, 2)
-Chain(Dense(inp_dim => hidden_dim, relu), Dense(hidden_dim => out_dim))

Any other model with the same input and output dimensions can be provided to the function with the model argument. The default loss function is mean squared error, but any other loss function can be provided, as long as it fits the model architecture.

By default, the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.001 is used. You can provide any other optimizer. If you want to use a different learning rate, e.g. 0.01, provide optimizer=Flux.Adam(0.01). If you prefer plain gradient descent, provide optimizer=Descent(0.001), again replacing the learning rate with the one of your preference.

Returns a named tuple with the following values:

  • model: the trained model
  • data_train: the data, including any measures if computed by measures_func
  • data_val: missing for this function
  • losses_train: The losses of the training data for each epoch.
  • losses_val: missing for this function
  • accs_train: The accuracies of the training data after each epoch, if return_train_acc=true.
  • accs_val: missing for this function

Obligatory arguments

  • X_train::Union{SparseMatrixCSC,Matrix}: training input matrix of dimension m x n
  • Y_train::Union{SparseMatrixCSC,Matrix}: training output/target matrix of dimension m x k
  • model_outpath::String: filepath to where final model should be stored (in .bson format)

Optional arguments

  • data_train::Union{Missing, DataFrame}=missing: The training data. Only necessary if a measures_func is included or return_train_acc=true.
  • target_col::Union{Missing, Symbol, String}=missing: The column with target word forms in the training data. Only necessary if a measures_func is included or return_train_acc=true.
  • hidden_dim::Int=1000: hidden dimension of the model
  • n_epochs::Int=100: number of epochs for which the model should be trained
  • batchsize::Int=64: batchsize during training
  • loss_func::Function=Flux.mse: Loss function. Per default this is the mse loss, but other options might be a crossentropy loss (Flux.crossentropy). Make sure the model makes sense with the loss function!
  • optimizer=Flux.Adam(0.001): optimizer to use for training
  • model::Union{Missing, Chain} = missing: A custom model can be provided for training. Its requirements are that it has to correspond to the input and output size of the training and validation data
  • return_losses::Bool=false: whether, in addition to the model, the per-epoch losses for the training and test data as well as the per-epoch accuracy on the validation data should be returned
  • verbose::Bool=true: Turn on verbose mode
  • measures_func::Union{Missing, Function}=missing: A measures function which is run at the end of every epoch. For more information see The measures_func argument.
  • return_train_acc::Bool=false: If true, a vector with training accuracies is returned at the end of the training.
  • ...kargs: any additional keyword arguments are passed to the measures_func
source
JudiLing.fiddlMethod
fiddl(X_train::Union{SparseMatrixCSC,Matrix},
+Chain(Dense(inp_dim => hidden_dim, relu), Dense(hidden_dim => out_dim))

Any other model with the same input and output dimensions can be provided to the function with the model argument. The default loss function is mean squared error, but any other loss function can be provided, as long as it fits the model architecture.

By default, the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.001 is used. You can provide any other optimizer. If you want to use a different learning rate, e.g. 0.01, provide optimizer=Flux.Adam(0.01). If you prefer plain gradient descent, provide optimizer=Descent(0.001), again replacing the learning rate with the one of your preference.

Returns a named tuple with the following values:

  • model: the trained model
  • data_train: the data, including any measures if computed by measures_func
  • data_val: missing for this function
  • losses_train: The losses of the training data for each epoch.
  • losses_val: missing for this function
  • accs_train: The accuracies of the training data after each epoch, if return_train_acc=true.
  • accs_val: missing for this function

Obligatory arguments

  • X_train::Union{SparseMatrixCSC,Matrix}: training input matrix of dimension m x n
  • Y_train::Union{SparseMatrixCSC,Matrix}: training output/target matrix of dimension m x k
  • model_outpath::String: filepath to where final model should be stored (in .bson format)

Optional arguments

  • data_train::Union{Missing, DataFrame}=missing: The training data. Only necessary if a measures_func is included or return_train_acc=true.
  • target_col::Union{Missing, Symbol, String}=missing: The column with target word forms in the training data. Only necessary if a measures_func is included or return_train_acc=true.
  • hidden_dim::Int=1000: hidden dimension of the model
  • n_epochs::Int=100: number of epochs for which the model should be trained
  • batchsize::Int=64: batchsize during training
  • loss_func::Function=Flux.mse: Loss function. Per default this is the mse loss, but other options might be a crossentropy loss (Flux.crossentropy). Make sure the model makes sense with the loss function!
  • optimizer=Flux.Adam(0.001): optimizer to use for training
  • model::Union{Missing, Chain} = missing: A custom model can be provided for training. Its requirements are that it has to correspond to the input and output size of the training and validation data
  • return_losses::Bool=false: whether, in addition to the model, the per-epoch losses for the training and test data as well as the per-epoch accuracy on the validation data should be returned
  • verbose::Bool=true: Turn on verbose mode
  • measures_func::Union{Missing, Function}=missing: A measures function which is run at the end of every epoch. For more information see The measures_func argument.
  • return_train_acc::Bool=false: If true, a vector with training accuracies is returned at the end of the training.
  • ...kargs: any additional keyword arguments are passed to the measures_func
source
JudiLing.fiddlMethod
fiddl(X_train::Union{SparseMatrixCSC,Matrix},
         Y_train::Union{SparseMatrixCSC,Matrix},
         learn_seq::Vector,
         data::DataFrame,
@@ -57,4 +57,4 @@
         n_batch_eval::Int=100,
         compute_accuracy::Bool=true,
         measures_func::Union{Function, Missing}=missing,
-        kargs...)

Trains a deep learning model using the FIDDL method (frequency-informed deep discriminative learning). Optionally, after each n_batch_eval batches measures_func can be run to compute any measures which are then added to the data.

Note

If you get an OutOfMemory error, chances are that this is due to the eval_SC function being evaluated after each n_batch_eval batches. Setting compute_accuracy=false disables computing the mapping accuracy.

Returns a named tuple with the following values:

  • model: the trained model
  • data: the data, including any measures if computed by measures_func
  • losses_train: The losses of the data the model is trained on within each n_batch_eval batches.
  • losses: The losses of the full dataset after each n_batch_eval batches.
  • accs: The accuracies of the full dataset after each n_batch_eval batches.

Obligatory arguments

  • X_train::Union{SparseMatrixCSC,Matrix}: training input matrix of dimension m x n
  • Y_train::Union{SparseMatrixCSC,Matrix}: training output/target matrix of dimension m x k
  • learn_seq::Vector: List of indices in the order in which the vectors in X_train and Y_train should be presented to the model for training.
  • data::DataFrame: The full data.
  • target_col::Union{Symbol, String}: The column with target word forms in the data.
  • model_outpath::String: filepath to where final model should be stored (in .bson format)

Optional arguments

  • hidden_dim::Int=1000: hidden dimension of the model
  • n_epochs::Int=100: number of epochs for which the model should be trained
  • batchsize::Int=64: batchsize during training
  • loss_func::Function=Flux.mse: Loss function. Per default this is the mse loss, but other options might be a crossentropy loss (Flux.crossentropy). Make sure the model makes sense with the loss function!
  • optimizer=Flux.Adam(0.001): optimizer to use for training
  • model::Union{Missing, Chain} = missing: A custom model can be provided for training. Its requirements are that it has to correspond to the input and output size of the training and validation data
  • return_losses::Bool=false: whether, in addition to the model, the per-epoch losses for the training and test data as well as the per-epoch accuracy on the validation data should be returned
  • verbose::Bool=true: Turn on verbose mode
  • n_batch_eval::Int=100: Loss, accuracy and measures_func are evaluated every n_batch_eval batches.
  • compute_accuracy::Bool=true: Whether accuracy should be computed every n_batch_eval batches.
  • measures_func::Union{Missing, Function}=missing: A measures function which is run each n_batch_eval batches. For more information see The measures_func argument.
source
+ kargs...)

Trains a deep learning model using the FIDDL method (frequency-informed deep discriminative learning). Optionally, after each n_batch_eval batches measures_func can be run to compute any measures which are then added to the data.

Note

If you get an OutOfMemory error, chances are that this is due to the eval_SC function being evaluated after each n_batch_eval batches. Setting compute_accuracy=false disables computing the mapping accuracy.

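Returns a named tuple with the following values:

  • model: the trained model
  • data: the data, including any measures if computed by measures_func
  • losses_train: The losses of the data the model is trained on within each n_batch_eval batches.
  • losses: The losses of the full dataset after each n_batch_eval batches.
  • accs: The accuracies of the full dataset after each n_batch_eval batches.

Obligatory arguments

  • X_train::Union{SparseMatrixCSC,Matrix}: training input matrix of dimension m x n
  • Y_train::Union{SparseMatrixCSC,Matrix}: training output/target matrix of dimension m x k
  • learn_seq::Vector: List of indices in the order in which the vectors in X_train and Y_train should be presented to the model for training.
  • data::DataFrame: The full data.
  • target_col::Union{Symbol, String}: The column with target word forms in the data.
  • model_outpath::String: filepath to where final model should be stored (in .bson format)

Optional arguments

  • hidden_dim::Int=1000: hidden dimension of the model
  • n_epochs::Int=100: number of epochs for which the model should be trained
  • batchsize::Int=64: batchsize during training
  • loss_func::Function=Flux.mse: Loss function. Per default this is the mse loss, but other options might be a crossentropy loss (Flux.crossentropy). Make sure the model makes sense with the loss function!
  • optimizer=Flux.Adam(0.001): optimizer to use for training
  • model::Union{Missing, Chain} = missing: A custom model can be provided for training. Its requirements are that it has to correspond to the input and output size of the training and validation data
  • return_losses::Bool=false: whether, in addition to the model, the per-epoch losses for the training and test data as well as the per-epoch accuracy on the validation data should be returned
  • verbose::Bool=true: Turn on verbose mode
  • n_batch_eval::Int=100: Loss, accuracy and measures_func are evaluated every n_batch_eval batches.
  • compute_accuracy::Bool=true: Whether accuracy should be computed every n_batch_eval batches.
  • measures_func::Union{Missing, Function}=missing: A measures function which is run each n_batch_eval batches. For more information see The measures_func argument.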

source
diff --git a/dev/man/display/index.html b/dev/man/display/index.html index ef48501..0bbf2bd 100644 --- a/dev/man/display/index.html +++ b/dev/man/display/index.html @@ -1,5 +1,5 @@ -Display · JudiLing.jl

Display

JudiLing.display_matrixMethod
display_matrix(data, target_col, cue_pS_obj, M, M_type)

Display matrix with rownames and colnames.

Obligatory Arguments

  • data::DataFrame: the dataset
  • target_col::Union{String, Symbol}: the target column name
  • cue_pS_obj::Union{Cue_Matrix_Struct,PS_Matrix_Struct}: the cue matrix or pS matrix structure
  • M::Union{SparseMatrixCSC, Matrix}: the matrix
  • M_type::Union{String, Symbol}: the type of the matrix; currently supported are :C, :S, :F, :G, :Chat, :Shat, :A, :R and :pS

Optional Arguments

  • nrow::Int64 = 6: the number of rows to display
  • ncol::Int64 = 6: the number of columns to display
  • return_matrix::Bool = false: whether the created dataframe should be returned (and not only displayed)

Examples

JudiLing.display_matrix(latin, :Word, cue_obj, cue_obj.C, :C)
+Display · JudiLing.jl

Cholesky

JudiLing.display_matrixMethod
display_matrix(data, target_col, cue_pS_obj, M, M_type)

Display matrix with rownames and colnames.

Obligatory Arguments

  • data::DataFrame: the dataset
  • target_col::Union{String, Symbol}: the target column name
  • cue_pS_obj::Union{Cue_Matrix_Struct,PS_Matrix_Struct}: the cue matrix or pS matrix structure
  • M::Union{SparseMatrixCSC, Matrix}: the matrix
  • M_type::Union{String, Symbol}: the type of the matrix; currently supported are :C, :S, :F, :G, :Chat, :Shat, :A, :R and :pS

Optional Arguments

  • nrow::Int64 = 6: the number of rows to display
  • ncol::Int64 = 6: the number of columns to display
  • return_matrix::Bool = false: whether the created dataframe should be returned (and not only displayed)

Examples

JudiLing.display_matrix(latin, :Word, cue_obj, cue_obj.C, :C)
 JudiLing.display_matrix(latin, :Word, cue_obj, S, :S)
 JudiLing.display_matrix(latin, :Word, cue_obj, G, :G)
 JudiLing.display_matrix(latin, :Word, cue_obj, Chat, :Chat)
@@ -7,4 +7,4 @@
 JudiLing.display_matrix(latin, :Word, cue_obj, Shat, :Shat)
 JudiLing.display_matrix(latin, :Word, cue_obj, A, :A)
 JudiLing.display_matrix(latin, :Word, cue_obj, R, :R)
-JudiLing.display_matrix(latin, :Word, pS_obj, pS_obj.pS, :pS)
source
+JudiLing.display_matrix(latin, :Word, pS_obj, pS_obj.pS, :pS)
source
diff --git a/dev/man/eval/index.html b/dev/man/eval/index.html index 05ce3af..abf3326 100644 --- a/dev/man/eval/index.html +++ b/dev/man/eval/index.html @@ -1,12 +1,12 @@ -Evaluation · JudiLing.jl

Evaluation

JudiLing.eval_SCFunction

Assess model accuracy on the basis of the correlations of row vectors of Chat and C or Shat and S. Ideally the target words have highest correlations on the diagonal of the pertinent correlation matrices. Homophones support option is implemented.

source
JudiLing.eval_SC_looseFunction

Assess model accuracy on the basis of the correlations of row vectors of Chat and C or Shat and S. Count it as correct if one of the top k candidates is correct. Homophones support option is implemented.

source
JudiLing.accuracy_comprehensionMethod
accuracy_comprehension(S, Shat, data)

Evaluate comprehension accuracy for training data.

Note

In case of homophones/homographs in the dataset, the correct/incorrect values for base and inflections may be misleading! See below for more information.

Obligatory Arguments

  • S::Matrix: the (gold standard) S matrix
  • Shat::Matrix: the (predicted) Shat matrix
  • data::DataFrame: the dataset

Optional Arguments

  • target_col::Union{String, Symbol}=:Words: the column name for target strings
  • base::Vector=nothing: base features (typically a lexeme)
  • inflections::Union{Nothing, Vector}=nothing: other features (typically inflectional features)

Examples

accuracy_comprehension(
+Evaluation · JudiLing.jl

Evaluation

JudiLing.eval_SCFunction

Assess model accuracy on the basis of the correlations of row vectors of Chat and C or Shat and S. Ideally the target words have highest correlations on the diagonal of the pertinent correlation matrices. Homophones support option is implemented.

source
JudiLing.eval_SC_looseFunction

Assess model accuracy on the basis of the correlations of row vectors of Chat and C or Shat and S. Count it as correct if one of the top k candidates is correct. Homophones support option is implemented.

source
JudiLing.accuracy_comprehensionMethod
accuracy_comprehension(S, Shat, data)

Evaluate comprehension accuracy for training data.

Note

In case of homophones/homographs in the dataset, the correct/incorrect values for base and inflections may be misleading! See below for more information.

Obligatory Arguments

  • S::Matrix: the (gold standard) S matrix
  • Shat::Matrix: the (predicted) Shat matrix
  • data::DataFrame: the dataset

Optional Arguments

  • target_col::Union{String, Symbol}=:Words: the column name for target strings
  • base::Vector=nothing: base features (typically a lexeme)
  • inflections::Union{Nothing, Vector}=nothing: other features (typically inflectional features)

Examples

accuracy_comprehension(
     S_train,
     Shat_train,
     latin_val,
     target_col=:Words,
     base=[:Lexeme],
     inflections=[:Person, :Number, :Tense, :Voice, :Mood]
-    )

Note

In case of homophones/homographs in the dataset, the correct/incorrect values for base and inflections may be misleading! Consider the following example: The wordform "Äpfel" in German can be nominative plural, genitive plural and accusative plural. Let's assume we have a dataset in which "Äpfel" occurs in all three case/number combinations (i.e. there are homographs). If all these wordforms have the same semantic vectors (e.g. because they are derived from word2vec or fasttext which typically have a single vector per unique wordform), the predicted semantic vector of the wordform "Äpfel" will be equally correlated with all three case/number combinations in the dataset. In such cases, while the algorithm in this function can unambiguously conclude that the correct surface form "Äpfel" was comprehended, which of the three possible rows is the correct one will be picked somewhat non-deterministically (see https://docs.julialang.org/en/v1/base/collections/#Base.argmax). It is thus possible that the algorithm will then use the genitive plural instead of the intended nominative plural as the ground truth, and will report that "case" was comprehended incorrectly.

source
JudiLing.accuracy_comprehensionMethod
accuracy_comprehension(
+    )

Note

In case of homophones/homographs in the dataset, the correct/incorrect values for base and inflections may be misleading! Consider the following example: The wordform "Äpfel" in German can be nominative plural, genitive plural and accusative plural. Let's assume we have a dataset in which "Äpfel" occurs in all three case/number combinations (i.e. there are homographs). If all these wordforms have the same semantic vectors (e.g. because they are derived from word2vec or fasttext which typically have a single vector per unique wordform), the predicted semantic vector of the wordform "Äpfel" will be equally correlated with all three case/number combinations in the dataset. In such cases, while the algorithm in this function can unambiguously conclude that the correct surface form "Äpfel" was comprehended, which of the three possible rows is the correct one will be picked somewhat non-deterministically (see https://docs.julialang.org/en/v1/base/collections/#Base.argmax). It is thus possible that the algorithm will then use the genitive plural instead of the intended nominative plural as the ground truth, and will report that "case" was comprehended incorrectly.

source
JudiLing.accuracy_comprehensionMethod
accuracy_comprehension(
     S_val,
     S_train,
     Shat_val,
@@ -24,27 +24,27 @@
     target_col=:Words,
     base=[:Lexeme],
     inflections=[:Person, :Number, :Tense, :Voice, :Mood]
-    )

Note

In case of homophones/homographs in the dataset, the correct/incorrect values for base and inflections may be misleading! Consider the following example: The wordform "Äpfel" in German can be nominative plural, genitive plural and accusative plural. Let's assume we have a dataset in which "Äpfel" occurs in all three case/number combinations (i.e. there are homographs). If all these wordforms have the same semantic vectors (e.g. because they are derived from word2vec or fasttext which typically have a single vector per unique wordform), the predicted semantic vector of the wordform "Äpfel" will be equally correlated with all three case/number combinations in the dataset. In such cases, while the algorithm in this function can unambiguously conclude that the correct surface form "Äpfel" was comprehended, which of the three possible rows is the correct one will be picked somewhat non-deterministically (see https://docs.julialang.org/en/v1/base/collections/#Base.argmax). It is thus possible that the algorithm will then use the genitive plural instead of the intended nominative plural as the ground truth, and will report that "case" was comprehended incorrectly.

source
JudiLing.eval_SCMethod
eval_SC(SChat::AbstractArray, SC::AbstractArray)

Assess model accuracy on the basis of the correlations (or Euclidean distances or Cosine Similarities) of row vectors of Chat and C or Shat and S. Ideally the target words have highest correlations (lowest distance/highest similarity) on the diagonal of the pertinent correlation (distance/similarity) matrices.

If freq is added, token-based accuracy is computed. Token-based accuracy weighs accuracy values according to words' frequency, i.e. if a word has a frequency of 30 and overall there are 3000 tokens (the frequencies of all types sum to 3000), this token's accuracy will contribute 30/3000.

Note

If there are homophones/homographs in the dataset, this evaluation method may be misleading: the predicted vector will be equally correlated with the target vector of both words and the one on the diagonal will not necessarily be selected as the most correlated. In such cases, supplying the dataset and target_col is recommended which enables taking into account homophones/homographs.

Obligatory Arguments

  • SChat::Union{SparseMatrixCSC, Matrix}: the Chat or Shat matrix
  • SC::Union{SparseMatrixCSC, Matrix}: the C or S matrix

Optional Arguments

  • digits: the specified number of digits after the decimal place (or before if negative)
  • R::Bool=false: if true, the pairwise correlation/distance/similarity matrix R is returned
  • freq::Union{Missing, Array{Int64, 1}, Array{Float64,1}}=missing: list of frequencies of the wordforms in SChat and SC
  • method::Union{Symbol, String}=:correlation: Method for computing similarities, one of {:correlation, :euclidean, :cosine}.
eval_SC(Chat_train, cue_obj_train.C)
+    )

Note

In case of homophones/homographs in the dataset, the correct/incorrect values for base and inflections may be misleading! Consider the following example: The wordform "Äpfel" in German can be nominative plural, genitive plural and accusative plural. Let's assume we have a dataset in which "Äpfel" occurs in all three case/number combinations (i.e. there are homographs). If all these wordforms have the same semantic vectors (e.g. because they are derived from word2vec or fasttext which typically have a single vector per unique wordform), the predicted semantic vector of the wordform "Äpfel" will be equally correlated with all three case/number combinations in the dataset. In such cases, while the algorithm in this function can unambiguously conclude that the correct surface form "Äpfel" was comprehended, which of the three possible rows is the correct one will be picked somewhat non-deterministically (see https://docs.julialang.org/en/v1/base/collections/#Base.argmax). It is thus possible that the algorithm will then use the genitive plural instead of the intended nominative plural as the ground truth, and will report that "case" was comprehended incorrectly.

source
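The tie-breaking issue described in the note above can be illustrated in isolation. The following is a minimal sketch with made-up scores and labels (both hypothetical, not taken from JudiLing or the latin dataset). It relies only on the documented behaviour of Base.argmax, which returns the first of several maximal elements, and shows that a difference in the last floating-point bit is enough to change which homograph row is picked.

```julia
# Hypothetical correlations of one predicted vector for "Äpfel" with four rows.
# The first three rows are homographs sharing one semantic vector.
labels = ["nom.pl", "gen.pl", "acc.pl", "other"]
scores = [0.93, 0.93, 0.93, 0.12]

# With exact ties, argmax returns the first maximal index.
picked = labels[argmax(scores)]

# A difference in the last bit (e.g. from floating-point round-off)
# moves the maximum to a different homograph row.
noisy_scores = [0.93, nextfloat(0.93), 0.93, 0.12]
picked_noisy = labels[argmax(noisy_scores)]

println(picked, " vs. ", picked_noisy)
```

In both cases the surface form is the same, but the case feature read off the selected row differs, which is how a correct comprehension can be reported as an incorrect "case".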
JudiLing.eval_SCMethod
eval_SC(SChat::AbstractArray, SC::AbstractArray)

Assess model accuracy on the basis of the correlations (or Euclidean distances or Cosine Similarities) of row vectors of Chat and C or Shat and S. Ideally the target words have highest correlations (lowest distance/highest similarity) on the diagonal of the pertinent correlation (distance/similarity) matrices.

If freq is added, token-based accuracy is computed. Token-based accuracy weighs accuracy values according to words' frequency, i.e. if a word has a frequency of 30 and overall there are 3000 tokens (the frequencies of all types sum to 3000), this token's accuracy will contribute 30/3000.

Note

If there are homophones/homographs in the dataset, this evaluation method may be misleading: the predicted vector will be equally correlated with the target vector of both words and the one on the diagonal will not necessarily be selected as the most correlated. In such cases, supplying the dataset and target_col is recommended which enables taking into account homophones/homographs.

Obligatory Arguments

  • SChat::Union{SparseMatrixCSC, Matrix}: the Chat or Shat matrix
  • SC::Union{SparseMatrixCSC, Matrix}: the C or S matrix

Optional Arguments

  • digits: the specified number of digits after the decimal place (or before if negative)
  • R::Bool=false: if true, the pairwise correlation/distance/similarity matrix R is returned
  • freq::Union{Missing, Array{Int64, 1}, Array{Float64,1}}=missing: list of frequencies of the wordforms in SChat and SC
  • method::Union{Symbol, String}=:correlation: Method for computing similarities, one of {:correlation, :euclidean, :cosine}.
eval_SC(Chat_train, cue_obj_train.C)
 eval_SC(Chat_val, cue_obj_val.C)
 eval_SC(Shat_train, S_train)
-eval_SC(Shat_val, S_val)
source
JudiLing.eval_SCMethod
eval_SC(SChat::AbstractArray, SC::AbstractArray, SC_rest::AbstractArray)

Assess model accuracy on the basis of the correlations (or Euclidean distances or Cosine Similarities) of row vectors of Chat and C or Shat and S. Ideally the target words have highest correlations (lowest distance/highest similarity) on the diagonal of the pertinent correlation (distance/similarity) matrices.

If freq is added, token-based accuracy is computed. Token-based accuracy weighs accuracy values according to words' frequency, i.e. if a word has a frequency of 30 and overall there are 3000 tokens (the frequencies of all types sum to 3000), this token's accuracy will contribute 30/3000.

Note

The order is important. The first gold standard matrix has to correspond to the SChat matrix, as in eval_SC(Shat_train, S_train, S_val) or eval_SC(Shat_val, S_val, S_train).

Note

If there are homophones/homographs in the dataset, this evaluation method may be misleading: the predicted vector will be equally correlated with the target vector of both words and the one on the diagonal will not necessarily be selected as the most correlated. In such cases, supplying the dataset and target_col is recommended which enables taking into account homophones/homographs.

Obligatory Arguments

  • SChat::Union{SparseMatrixCSC, Matrix}: the Chat or Shat matrix
  • SC::Union{SparseMatrixCSC, Matrix}: the training/validation C or S matrix
  • SC_rest::Union{SparseMatrixCSC, Matrix}: the validation/training C or S matrix

Optional Arguments

  • digits: the specified number of digits after the decimal place (or before if negative)
  • R::Bool=false: if true, the pairwise correlation/distance/similarity matrix R is returned
  • freq::Union{Missing, Array{Int64, 1}, Array{Float64,1}}=missing: list of frequencies of the wordforms in SChat and SC
  • method::Union{Symbol, String}=:correlation: Method for computing similarities, one of {:correlation, :euclidean, :cosine}.
eval_SC(Chat_train, cue_obj_train.C, cue_obj_val.C)
+eval_SC(Shat_val, S_val)
source
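The type-based and token-based accuracies described above can be sketched from scratch as follows. This is an illustration of the idea only, not JudiLing's implementation: the function names sketch_accuracy and sketch_token_accuracy and the toy matrices are assumptions made for this example, and only the Statistics standard library is used.

```julia
using Statistics

# Toy gold-standard matrix (one row per word) and predicted matrix.
gold = [1.0 0.0 0.0;
        0.0 1.0 0.0;
        0.0 0.0 1.0]
pred = [0.9 0.1 0.0;   # closest to gold row 1 (correct)
        0.1 0.8 0.1;   # closest to gold row 2 (correct)
        0.7 0.2 0.1]   # closest to gold row 1 (incorrect, should be row 3)
freq = [30, 20, 50]    # hypothetical word frequencies

# R[i, j] is the correlation of predicted row i with gold row j.
# cor works column-wise, so both matrices are transposed first.
row_correlations(pred, gold) = cor(permutedims(pred), permutedims(gold))

# A row counts as correct if its highest correlation lies on the diagonal.
function diagonal_hits(pred, gold)
    R = row_correlations(pred, gold)
    return [argmax(R[i, :]) == i for i in axes(R, 1)]
end

# Type-based accuracy: every word counts equally.
sketch_accuracy(pred, gold) = sum(diagonal_hits(pred, gold)) / size(pred, 1)

# Token-based accuracy: every word is weighted by its frequency.
function sketch_token_accuracy(pred, gold, freq)
    hits = diagonal_hits(pred, gold)
    return sum(freq[hits]) / sum(freq)
end

println(sketch_accuracy(pred, gold))
println(sketch_token_accuracy(pred, gold, freq))
```

In this toy example two of three words are recognised, so the type-based accuracy is 2/3, while the token-based accuracy is (30 + 20) / 100 because the misrecognised word carries half of all tokens.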
JudiLing.eval_SCMethod
eval_SC(SChat::AbstractArray, SC::AbstractArray, SC_rest::AbstractArray)

Assess model accuracy on the basis of the correlations (or Euclidean distances or Cosine Similarities) of row vectors of Chat and C or Shat and S. Ideally the target words have highest correlations (lowest distance/highest similarity) on the diagonal of the pertinent correlation (distance/similarity) matrices.

If freq is added, token-based accuracy is computed. Token-based accuracy weighs accuracy values according to words' frequency, i.e. if a word has a frequency of 30 and overall there are 3000 tokens (the frequencies of all types sum to 3000), this token's accuracy will contribute 30/3000.

Note

The order is important. The first gold standard matrix has to correspond to the SChat matrix, as in eval_SC(Shat_train, S_train, S_val) or eval_SC(Shat_val, S_val, S_train).

Note

If there are homophones/homographs in the dataset, this evaluation method may be misleading: the predicted vector will be equally correlated with the target vector of both words and the one on the diagonal will not necessarily be selected as the most correlated. In such cases, supplying the dataset and target_col is recommended which enables taking into account homophones/homographs.

Obligatory Arguments

  • SChat::Union{SparseMatrixCSC, Matrix}: the Chat or Shat matrix
  • SC::Union{SparseMatrixCSC, Matrix}: the training/validation C or S matrix
  • SC_rest::Union{SparseMatrixCSC, Matrix}: the validation/training C or S matrix

Optional Arguments

  • digits: the specified number of digits after the decimal place (or before if negative)
  • R::Bool=false: if true, the pairwise correlation/distance/similarity matrix R is returned
  • freq::Union{Missing, Array{Int64, 1}, Array{Float64,1}}=missing: list of frequencies of the wordforms in SChat and SC
  • method::Union{Symbol, String}=:correlation: Method for computing similarities, one of {:correlation, :euclidean, :cosine}.
eval_SC(Chat_train, cue_obj_train.C, cue_obj_val.C)
 eval_SC(Chat_val, cue_obj_val.C, cue_obj_train.C)
 eval_SC(Shat_train, S_train, S_val)
-eval_SC(Shat_val, S_val, S_train)
source
JudiLing.eval_SCMethod
eval_SC(SChat::AbstractArray, SC::AbstractArray, data::DataFrame, target_col::Union{String, Symbol})

Assess model accuracy on the basis of the correlations (or Euclidean distances or Cosine Similarities) of row vectors of Chat and C or Shat and S. Ideally the target words have highest correlations (lowest distance/highest similarity) on the diagonal of the pertinent correlation (distance/similarity) matrices. Support for homophones.

If freq is added, token-based accuracy is computed. Token-based accuracy weighs accuracy values according to words' frequency, i.e. if a word has a frequency of 30 and overall there are 3000 tokens (the frequencies of all types sum to 3000), this token's accuracy will contribute 30/3000.

Obligatory Arguments

  • SChat::Union{SparseMatrixCSC, Matrix}: the Chat or Shat matrix
  • SC::Union{SparseMatrixCSC, Matrix}: the C or S matrix
  • data::DataFrame: datasets
  • target_col::Union{String, Symbol}: target column name

Optional Arguments

  • digits: the specified number of digits after the decimal place (or before if negative)
  • R::Bool=false: if true, the pairwise correlation/distance/similarity matrix R is returned
  • freq::Union{Missing, Array{Int64, 1}, Array{Float64,1}}=missing: list of frequencies of the wordforms in SChat and SC
  • method::Union{Symbol, String}=:correlation: Method for computing similarities, one of {:correlation, :euclidean, :cosine}.
eval_SC(Chat_train, cue_obj_train.C, latin, :Word)
+eval_SC(Shat_val, S_val, S_train)
source
JudiLing.eval_SCMethod
eval_SC(SChat::AbstractArray, SC::AbstractArray, data::DataFrame, target_col::Union{String, Symbol})

Assess model accuracy on the basis of the correlations (or Euclidean distances or Cosine Similarities) of row vectors of Chat and C or Shat and S. Ideally the target words have highest correlations (lowest distance/highest similarity) on the diagonal of the pertinent correlation (distance/similarity) matrices. Support for homophones.

If freq is added, token-based accuracy is computed. Token-based accuracy weighs accuracy values according to words' frequency, i.e. if a word has a frequency of 30 and overall there are 3000 tokens (the frequencies of all types sum to 3000), this token's accuracy will contribute 30/3000.

Obligatory Arguments

  • SChat::Union{SparseMatrixCSC, Matrix}: the Chat or Shat matrix
  • SC::Union{SparseMatrixCSC, Matrix}: the C or S matrix
  • data::DataFrame: datasets
  • target_col::Union{String, Symbol}: target column name

Optional Arguments

  • digits: the specified number of digits after the decimal place (or before if negative)
  • R::Bool=false: if true, the pairwise correlation/distance/similarity matrix R is returned
  • freq::Union{Missing, Array{Int64, 1}, Array{Float64,1}}=missing: list of frequencies of the wordforms in SChat and SC
  • method::Union{Symbol, String}=:correlation: Method for computing similarities, one of {:correlation, :euclidean, :cosine}.
eval_SC(Chat_train, cue_obj_train.C, latin, :Word)
 eval_SC(Chat_val, cue_obj_val.C, latin, :Word)
 eval_SC(Shat_train, S_train, latin, :Word)
-eval_SC(Shat_val, S_val, latin, :Word)
source
JudiLing.eval_SCMethod
eval_SC(SChat::AbstractArray, SC::AbstractArray, SC_rest::AbstractArray, data::DataFrame, data_rest::DataFrame, target_col::Union{String, Symbol})

Assess model accuracy on the basis of the correlations (or Euclidean distances or Cosine Similarities) of row vectors of Chat and C or Shat and S. Ideally the target words have highest correlations (lowest distance/highest similarity) on the diagonal of the pertinent correlation (distance/similarity) matrices.

If freq is added, token-based accuracy is computed. Token-based accuracy weighs accuracy values according to words' frequency, i.e. if a word has a frequency of 30 and overall there are 3000 tokens (the frequencies of all types sum to 3000), this token's accuracy will contribute 30/3000.

Note

The order is important. The first gold standard matrix has to correspond to the SChat matrix, as in eval_SC(Shat_train, S_train, S_val, latin, :Word) or eval_SC(Shat_val, S_val, S_train, latin, :Word).

Obligatory Arguments

  • SChat::Union{SparseMatrixCSC, Matrix}: the Chat or Shat matrix
  • SC::Union{SparseMatrixCSC, Matrix}: the training/validation C or S matrix
  • SC_rest::Union{SparseMatrixCSC, Matrix}: the validation/training C or S matrix
  • data::DataFrame: the training/validation datasets
  • data_rest::DataFrame: the validation/training datasets
  • target_col::Union{String, Symbol}: target column name

Optional Arguments

  • digits: the specified number of digits after the decimal place (or before if negative)
  • R::Bool=false: if true, the pairwise correlation/distance/similarity matrix R is returned
  • freq::Union{Missing, Array{Int64, 1}, Array{Float64,1}}=missing: list of frequencies of the wordforms in SChat and SC
  • method::Union{Symbol, String}=:correlation: Method for computing similarities, one of {:correlation, :euclidean, :cosine}.
eval_SC(Chat_train, cue_obj_train.C, cue_obj_val.C, latin, :Word)
+eval_SC(Shat_val, S_val, latin, :Word)
source
JudiLing.eval_SCMethod
eval_SC(SChat::AbstractArray, SC::AbstractArray, SC_rest::AbstractArray, data::DataFrame, data_rest::DataFrame, target_col::Union{String, Symbol})

Assess model accuracy on the basis of the correlations (or Euclidean distances or Cosine Similarities) of row vectors of Chat and C or Shat and S. Ideally the target words have highest correlations (lowest distance/highest similarity) on the diagonal of the pertinent correlation (distance/similarity) matrices.

If freq is added, token-based accuracy is computed. Token-based accuracy weighs accuracy values according to words' frequency, i.e. if a word has a frequency of 30 and overall there are 3000 tokens (the frequencies of all types sum to 3000), this token's accuracy will contribute 30/3000.

Note

The order is important. The first gold standard matrix has to correspond to the SChat matrix, as in eval_SC(Shat_train, S_train, S_val, latin, :Word) or eval_SC(Shat_val, S_val, S_train, latin, :Word).

Obligatory Arguments

  • SChat::Union{SparseMatrixCSC, Matrix}: the Chat or Shat matrix
  • SC::Union{SparseMatrixCSC, Matrix}: the training/validation C or S matrix
  • SC_rest::Union{SparseMatrixCSC, Matrix}: the validation/training C or S matrix
  • data::DataFrame: the training/validation datasets
  • data_rest::DataFrame: the validation/training datasets
  • target_col::Union{String, Symbol}: target column name

Optional Arguments

  • digits: the specified number of digits after the decimal place (or before if negative)
  • R::Bool=false: if true, the pairwise correlation/distance/similarity matrix R is returned
  • freq::Union{Missing, Array{Int64, 1}, Array{Float64,1}}=missing: list of frequencies of the wordforms in SChat and SC
  • method::Union{Symbol, String}=:correlation: Method for computing similarities, one of {:correlation, :euclidean, :cosine}.
eval_SC(Chat_train, cue_obj_train.C, cue_obj_val.C, latin, :Word)
 eval_SC(Chat_val, cue_obj_val.C, cue_obj_train.C, latin, :Word)
 eval_SC(Shat_train, S_train, S_val, latin, :Word)
-eval_SC(Shat_val, S_val, S_train, latin, :Word)
source
JudiLing.eval_SCMethod
eval_SC(SChat::AbstractArray, SC::AbstractArray, batch_size::Int64)

Assess model accuracy on the basis of the correlations of row vectors of Chat and C or Shat and S. Ideally the target words have highest correlations on the diagonal of the pertinent correlation matrices. For large datasets, pass batch_size to process evaluation in chunks.

Note

If there are homophones/homographs in the dataset, this evaluation method may be misleading: the predicted vector will be equally correlated with the target vector of both words and the one on the diagonal will not necessarily be selected as the most correlated. In such cases, supplying the dataset and target_col is recommended which enables taking into account homophones/homographs.

Note

Currently only available for correlation.

Obligatory Arguments

  • SChat: the Chat or Shat matrix
  • SC: the C or S matrix
  • batch_size: batch size

Optional Arguments

  • digits: the specified number of digits after the decimal place (or before if negative)
  • verbose::Bool=false: if true, more information is printed
eval_SC(Chat_train, cue_obj_train.C, latin, :Word)
+eval_SC(Shat_val, S_val, S_train, latin, :Word)
source
JudiLing.eval_SCMethod
eval_SC(SChat::AbstractArray, SC::AbstractArray, batch_size::Int64)

Assess model accuracy on the basis of the correlations of row vectors of Chat and C or Shat and S. Ideally the target words have highest correlations on the diagonal of the pertinent correlation matrices. For large datasets, pass batch_size to process evaluation in chunks.

Note

If there are homophones/homographs in the dataset, this evaluation method may be misleading: the predicted vector will be equally correlated with the target vector of both words and the one on the diagonal will not necessarily be selected as the most correlated. In such cases, supplying the dataset and target_col is recommended which enables taking into account homophones/homographs.

Note

Currently only available for correlation.

Obligatory Arguments

  • SChat: the Chat or Shat matrix
  • SC: the C or S matrix
  • batch_size: batch size

Optional Arguments

  • digits: the specified number of digits after the decimal place (or before if negative)
  • verbose::Bool=false: if true, more information is printed
eval_SC(Chat_train, cue_obj_train.C, latin, :Word)
 eval_SC(Chat_val, cue_obj_val.C, latin, :Word)
 eval_SC(Shat_train, S_train, latin, :Word)
-eval_SC(Shat_val, S_val, latin, :Word)
source
JudiLing.eval_SCMethod
eval_SC(SChat::AbstractArray, SC::AbstractArray, data::DataFrame, target_col::Union{String, Symbol}, batch_size::Int64)

Assess model accuracy on the basis of the correlations of row vectors of Chat and C or Shat and S. Ideally the target words have highest correlations on the diagonal of the pertinent correlation matrices. For large datasets, pass batch_size to process evaluation in chunks. Supports homophones.

Note

Currently only available for correlation.

Obligatory Arguments

  • SChat::AbstractArray: the Chat or Shat matrix
  • SC::AbstractArray: the C or S matrix
  • data::DataFrame: datasets
  • target_col::Union{String, Symbol}: target column name
  • batch_size::Int64: batch size

Optional Arguments

  • digits: the specified number of digits after the decimal place (or before if negative)
  • verbose::Bool=false: if true, more information is printed
eval_SC(Chat_train, cue_obj_train.C, latin, :Word, 5000)
+eval_SC(Shat_val, S_val, latin, :Word)
source
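The idea behind evaluating in chunks can be sketched as follows. This is an assumption about how batching might look, not a description of JudiLing's internals: the function name sketch_batched_accuracy and the toy matrices are hypothetical. Only one block of the correlation matrix (batch_size rows by all gold rows) is held in memory at a time, instead of the full square matrix.

```julia
using Statistics

gold = [1.0 0.0 0.0;
        0.0 1.0 0.0;
        0.0 0.0 1.0]
pred = [0.9 0.1 0.0;
        0.1 0.8 0.1;
        0.7 0.2 0.1]   # this row is closest to gold row 1, not row 3

function sketch_batched_accuracy(pred, gold, batch_size)
    n = size(pred, 1)
    gold_t = permutedims(gold)   # cor works column-wise
    n_hits = 0
    for start in 1:batch_size:n
        stop = min(start + batch_size - 1, n)
        # Correlate only this chunk of predicted rows with all gold rows.
        R = cor(permutedims(pred[start:stop, :]), gold_t)
        for (offset, i) in enumerate(start:stop)
            # Row `offset` of the chunk corresponds to word `i` overall.
            n_hits += argmax(R[offset, :]) == i
        end
    end
    return n_hits / n
end

println(sketch_batched_accuracy(pred, gold, 2))
```

The result does not depend on the batch size; a smaller batch_size only trades speed for a lower peak memory footprint.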
JudiLing.eval_SCMethod
eval_SC(SChat::AbstractArray, SC::AbstractArray, data::DataFrame, target_col::Union{String, Symbol}, batch_size::Int64)

Assess model accuracy on the basis of the correlations of row vectors of Chat and C or Shat and S. Ideally the target words have highest correlations on the diagonal of the pertinent correlation matrices. For large datasets, pass batch_size to process evaluation in chunks. Supports homophones.

Note

Currently only available for correlation.

Obligatory Arguments

  • SChat::AbstractArray: the Chat or Shat matrix
  • SC::AbstractArray: the C or S matrix
  • data::DataFrame: datasets
  • target_col::Union{String, Symbol}: target column name
  • batch_size::Int64: batch size

Optional Arguments

  • digits: the specified number of digits after the decimal place (or before if negative)
  • verbose::Bool=false: if true, more information is printed
eval_SC(Chat_train, cue_obj_train.C, latin, :Word, 5000)
 eval_SC(Chat_val, cue_obj_val.C, latin, :Word, 5000)
 eval_SC(Shat_train, S_train, latin, :Word, 5000)
-eval_SC(Shat_val, S_val, latin, :Word, 5000)
source
JudiLing.eval_SC_looseMethod
eval_SC_loose(SChat, SC, k)

Assess model accuracy on the basis of the correlations (or Euclidean distances or Cosine Similarities) of row vectors of Chat and C or Shat and S. Ideally the target words have highest correlations (lowest distance/highest similarity) on the diagonal of the pertinent correlation (distance/similarity) matrices. Count it as correct if one of the top k candidates is correct.

Note

If there are homophones/homographs in the dataset, this evaluation method may be misleading: the predicted vector will be equally correlated with the target vector of both words and it is not guaranteed that the target on the diagonal will be among the k neighbours. In particular, eval_SC and eval_SC_loose with k=1 are not guaranteed to give the same result. In such cases, supplying the dataset and target_col is recommended which enables taking into account homophones/homographs.

Obligatory Arguments

  • SChat::Union{SparseMatrixCSC, Matrix}: the Chat or Shat matrix
  • SC::Union{SparseMatrixCSC, Matrix}: the C or S matrix
  • k: top k candidates

Optional Arguments

  • digits: the specified number of digits after the decimal place (or before if negative)
  • method::Union{Symbol, String}=:correlation: Method for computing similarities, one of {:correlation, :euclidean, :cosine}.
eval_SC_loose(Chat, cue_obj.C, k)
-eval_SC_loose(Shat, S, k)
source
JudiLing.eval_SC_looseMethod
eval_SC_loose(SChat, SC, k, data, target_col)

Assess model accuracy on the basis of the correlations (or Euclidean distances or Cosine Similarities) of row vectors of Chat and C or Shat and S. Ideally the target words have highest correlations (lowest distance/highest similarity) on the diagonal of the pertinent correlation (distance/similarity) matrices. Count it as correct if one of the top k candidates is correct. Support for homophones.

Obligatory Arguments

  • SChat::Union{SparseMatrixCSC, Matrix}: the Chat or Shat matrix
  • SC::Union{SparseMatrixCSC, Matrix}: the C or S matrix
  • k: top k candidates
  • data: datasets
  • target_col: target column name

Optional Arguments

  • digits: the specified number of digits after the decimal place (or before if negative)
  • method::Union{Symbol, String}=:correlation: Method for computing similarities, one of {:correlation, :euclidean, :cosine}.
eval_SC_loose(Chat, cue_obj.C, k, latin, :Word)
-eval_SC_loose(Shat, S, k, latin, :Word)
source
JudiLing.eval_manualMethod
eval_manual(res, data, i2f)

Create extensive reports for the outputs from build_paths and learn_paths.

source
JudiLing.eval_accMethod
eval_acc(res, gold_inds::Array)

Evaluate the accuracy of the results from learn_paths or build_paths.

Obligatory Arguments

  • res::Array: the results from learn_paths or build_paths
  • gold_inds::Array: the gold paths' indices

Optional Arguments

  • digits: the specified number of digits after the decimal place (or before if negative)
  • verbose::Bool=false: if true, more information is printed

Examples

# evaluation on training data
+eval_SC(Shat_val, S_val, latin, :Word, 5000)
source
JudiLing.eval_SC_looseMethod
eval_SC_loose(SChat, SC, k)

Assess model accuracy on the basis of the correlations (or Euclidean distances or Cosine Similarities) of row vectors of Chat and C or Shat and S. Ideally the target words have highest correlations (lowest distance/highest similarity) on the diagonal of the pertinent correlation (distance/similarity) matrices. Count it as correct if one of the top k candidates is correct.

Note

If there are homophones/homographs in the dataset, this evaluation method may be misleading: the predicted vector will be equally correlated with the target vectors of both words, and the target on the diagonal is not guaranteed to be among the k nearest neighbours. In particular, eval_SC and eval_SC_loose with k=1 are not guaranteed to give the same result. In such cases, supplying the dataset and target_col is recommended, as this allows homophones/homographs to be taken into account.

Obligatory Arguments

  • SChat::Union{SparseMatrixCSC, Matrix}: the Chat or Shat matrix
  • SC::Union{SparseMatrixCSC, Matrix}: the C or S matrix
  • k: top k candidates

Optional Arguments

  • digits: the specified number of digits after the decimal place (or before if negative)
  • method::Union{Symbol, String}=:correlation: Method for computing similarities, one of {:correlation, :euclidean, :cosine}.
eval_SC_loose(Chat, cue_obj.C, k)
+eval_SC_loose(Shat, S, k)
source
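The top-k criterion described above can be sketched in plain Julia. This is a minimal illustration under stated assumptions, not JudiLing's implementation: `topk_accuracy` is a hypothetical helper, the matrices are toy data, and only the correlation method is shown.

```julia
using Statistics  # standard library; provides cor

# Hypothetical helper (not part of JudiLing): fraction of rows whose own
# target vector is among the k most correlated rows of SC.
function topk_accuracy(SChat::AbstractMatrix, SC::AbstractMatrix, k::Integer)
    # R[i, j] is the correlation between predicted row i and target row j
    R = cor(SChat, SC, dims=2)
    n = size(R, 1)
    hits = 0
    for i in 1:n
        # indices of the k largest correlations in row i
        top = partialsortperm(R[i, :], 1:k, rev=true)
        hits += (i in top)
    end
    return hits / n
end

# Toy data: three words, three semantic dimensions
S = [1.0 0.0 0.0; 0.0 1.0 0.0; 0.0 0.0 1.0]
Shat = [0.9 0.1 0.0; 0.1 0.3 0.6; 0.1 0.0 0.9]  # row 2 is closest to row 3

acc1 = topk_accuracy(Shat, S, 1)  # row 2 misses its target
acc2 = topk_accuracy(Shat, S, 2)  # row 2's target is the second candidate
```

With k=1 the second word counts as an error, while with k=2 it counts as correct, which is the sense in which this evaluation is "loose".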
JudiLing.eval_SC_looseMethod
eval_SC_loose(SChat, SC, k, data, target_col)

Assess model accuracy on the basis of the correlations (or Euclidean distances or Cosine Similarities) of row vectors of Chat and C or Shat and S. Ideally the target words have highest correlations (lowest distance/highest similarity) on the diagonal of the pertinent correlation (distance/similarity) matrices. Count it as correct if one of the top k candidates is correct. This method takes homophones into account.

Obligatory Arguments

  • SChat::Union{SparseMatrixCSC, Matrix}: the Chat or Shat matrix
  • SC::Union{SparseMatrixCSC, Matrix}: the C or S matrix
  • k: top k candidates
  • data: the dataset
  • target_col: target column name

Optional Arguments

  • digits: the specified number of digits after the decimal place (or before if negative)
  • method::Union{Symbol, String}=:correlation: Method for computing similarities, one of {:correlation, :euclidean, :cosine}.
eval_SC_loose(Chat, cue_obj.C, k, latin, :Word)
+eval_SC_loose(Shat, S, k, latin, :Word)
source
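One way to picture the homophone-aware variant is to compare word forms instead of row positions. The sketch below is an assumption about the general idea, not the package's code: `topk_accuracy_homophones` is a hypothetical helper, and `R` is a toy similarity matrix given directly.

```julia
# Hypothetical helper (not part of JudiLing): a row counts as correct when
# any of its top-k candidates carries the same word form as the target.
function topk_accuracy_homophones(R::AbstractMatrix, words::AbstractVector, k::Integer)
    n = size(R, 1)
    hits = 0
    for i in 1:n
        top = partialsortperm(R[i, :], 1:k, rev=true)
        hits += any(words[j] == words[i] for j in top)
    end
    return hits / n
end

# Toy data: rows 1 and 2 share a word form and are easily confused
words = ["bank", "bank", "river"]
R = [0.90 0.95 0.10;
     0.95 0.90 0.20;
     0.10 0.20 0.99]

# Strict diagonal check: the best candidate must be the row itself
strict = count(i -> argmax(R[i, :]) == i, 1:3) / 3
# Form-based check: the best candidate must have the same word form
loose = topk_accuracy_homophones(R, words, 1)
```

In this toy case the strict check penalises both "bank" rows, while the form-based check accepts them.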
JudiLing.eval_manualMethod
eval_manual(res, data, i2f)

Create extensive reports for the outputs from build_paths and learn_paths.

source
JudiLing.eval_accMethod
eval_acc(res, gold_inds::Array)

Evaluate the accuracy of the results from learn_paths or build_paths.

Obligatory Arguments

  • res::Array: the results from learn_paths or build_paths
  • gold_inds::Array: the gold paths' indices

Optional Arguments

  • digits: the specified number of digits after the decimal place (or before if negative)
  • verbose::Bool=false: if true, more information is printed

Examples

# evaluation on training data
 acc_train = JudiLing.eval_acc(
     res_train,
     cue_obj_train.gold_ind,
@@ -56,7 +56,7 @@
     res_val,
     cue_obj_val.gold_ind,
     verbose=false
-)
source
JudiLing.eval_accMethod
eval_acc(res, cue_obj::Cue_Matrix_Struct)

Evaluate the accuracy of the results from learn_paths or build_paths.

Obligatory Arguments

  • res::Array: the results from learn_paths or build_paths
  • cue_obj::Cue_Matrix_Struct: the C matrix object

Optional Arguments

  • digits: the specified number of digits after the decimal place (or before if negative)
  • verbose::Bool=false: if true, more information is printed

Examples

acc = JudiLing.eval_acc(res, cue_obj)
source
JudiLing.eval_acc_looseMethod
eval_acc_loose(res, gold_inds)

Lenient evaluation of the accuracy of the results from learn_paths or build_paths, counting a prediction as correct when the correlation of the predicted and gold standard semantic vectors is among the n top correlations, where n is equal to max_can in the learn_paths or build_paths function.

Obligatory Arguments

  • res::Array: the results from learn_paths or build_paths
  • gold_inds::Array: the gold paths' indices

Optional Arguments

  • digits: the specified number of digits after the decimal place (or before if negative)
  • verbose::Bool=false: if true, more information is printed

Examples

# evaluation on training data
+)
source
JudiLing.eval_accMethod
eval_acc(res, cue_obj::Cue_Matrix_Struct)

Evaluate the accuracy of the results from learn_paths or build_paths.

Obligatory Arguments

  • res::Array: the results from learn_paths or build_paths
  • cue_obj::Cue_Matrix_Struct: the C matrix object

Optional Arguments

  • digits: the specified number of digits after the decimal place (or before if negative)
  • verbose::Bool=false: if true, more information is printed

Examples

acc = JudiLing.eval_acc(res, cue_obj)
source
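Conceptually, this kind of accuracy is the share of words whose predicted path matches the gold path. The sketch below assumes plain vectors of n-gram indices as stand-ins; the actual `res` returned by learn_paths or build_paths holds richer result objects, and `path_accuracy` is a hypothetical name.

```julia
# Hypothetical stand-ins: each element is the n-gram index sequence of one word.
predicted = [[1, 2, 3], [4, 5], [6, 7, 8]]
gold      = [[1, 2, 3], [4, 9], [6, 7, 8]]

# Share of words whose predicted index sequence equals the gold sequence,
# rounded in the same spirit as the `digits` argument above.
function path_accuracy(predicted, gold; digits=4)
    correct = count(p == g for (p, g) in zip(predicted, gold))
    return round(correct / length(gold), digits=digits)
end

acc = path_accuracy(predicted, gold)
```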
JudiLing.eval_acc_looseMethod
eval_acc_loose(res, gold_inds)

Lenient evaluation of the accuracy of the results from learn_paths or build_paths, counting a prediction as correct when the correlation of the predicted and gold standard semantic vectors is among the n top correlations, where n is equal to max_can in the learn_paths or build_paths function.

Obligatory Arguments

  • res::Array: the results from learn_paths or build_paths
  • gold_inds::Array: the gold paths' indices

Optional Arguments

  • digits: the specified number of digits after the decimal place (or before if negative)
  • verbose::Bool=false: if true, more information is printed

Examples

# evaluation on training data
 acc_train_loose = JudiLing.eval_acc_loose(
     res_train,
     cue_obj_train.gold_ind,
@@ -68,4 +68,4 @@
     res_val,
     cue_obj_val.gold_ind,
     verbose=false
-)
source
JudiLing.extract_gpiFunction

extract_gpi(gpi, threshold=0.1, tolerance=(-1000.0))

Extract, using gold paths' information, how many n-grams for a gold path are below the threshold but above the tolerance.

source
+)
source
JudiLing.extract_gpiFunction

extract_gpi(gpi, threshold=0.1, tolerance=(-1000.0))

Extract, using gold paths' information, how many n-grams for a gold path are below the threshold but above the tolerance.

source
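The counting rule can be illustrated on a plain vector of supports. This is a sketch under assumptions: `supports` is toy data for a single gold path, `count_tolerated` is a hypothetical helper, and whether the real function treats the boundaries as inclusive or exclusive is not stated in the text above.

```julia
# Hypothetical per-n-gram supports for one gold path
supports = [0.35, 0.08, -0.02, 0.12, -1500.0]

# Count supports strictly between tolerance and threshold (boundary handling assumed)
count_tolerated(supports; threshold=0.1, tolerance=-1000.0) =
    count(s -> tolerance < s < threshold, supports)

n = count_tolerated(supports)
```

Here two n-grams (0.08 and -0.02) fall below the threshold but above the tolerance, so a tolerant search would need to admit two weakly supported n-grams to recover this path.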
diff --git a/dev/man/find_path/index.html b/dev/man/find_path/index.html
index ffc0ea4..3419e88 100644
--- a/dev/man/find_path/index.html
+++ b/dev/man/find_path/index.html
@@ -1,5 +1,5 @@
-Find Paths · JudiLing.jl

Find Paths

Structures

JudiLing.Gold_Path_Info_StructType

Store gold paths' information, including the indices, the support for each index, and the total support. It can be used to evaluate how low the threshold needs to be set in order to find most of the correct paths or, if set very low, all of the correct paths.

source

Build paths

JudiLing.build_pathsFunction

The build_paths function constructs paths by only considering those n-grams that are close to the target. It first takes the predicted c-hat vector and finds the closest n neighbors in the C matrix. Then it selects all n-grams of these neighbors, and constructs all valid paths with those n-grams. The path producing the best correlation with the target semantic vector (through synthesis by analysis) is selected.

source
JudiLing.build_pathsMethod
build_paths(
+Find Paths · JudiLing.jl

Find Paths

Structures

JudiLing.Gold_Path_Info_StructType

Store gold paths' information, including the indices, the support for each index, and the total support. It can be used to evaluate how low the threshold needs to be set in order to find most of the correct paths or, if set very low, all of the correct paths.

source

Build paths

JudiLing.build_pathsFunction

The build_paths function constructs paths by only considering those n-grams that are close to the target. It first takes the predicted c-hat vector and finds the closest n neighbors in the C matrix. Then it selects all n-grams of these neighbors, and constructs all valid paths with those n-grams. The path producing the best correlation with the target semantic vector (through synthesis by analysis) is selected.

source
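The first two steps described above (finding the closest neighbours of a predicted c-hat vector and collecting their n-grams) can be sketched as follows. All data here is made up, and the subsequent path construction and synthesis-by-analysis selection are left out; this is an illustration of the idea, not the package's code.

```julia
using Statistics  # standard library; provides cor

# Toy cue matrix: 3 words (rows) x 5 n-grams (columns)
C = [1 1 0 0 0;
     1 0 1 1 0;
     0 0 0 1 1]
chat = [0.9, 0.1, 0.8, 0.7, 0.1]  # predicted c-hat vector for one target word
n_neighbors = 2

# Correlation of the predicted vector with every row of C
sims = [cor(chat, C[i, :]) for i in 1:size(C, 1)]

# Indices of the n_neighbors most similar words
nbrs = partialsortperm(sims, 1:n_neighbors, rev=true)

# Only n-grams that occur in at least one neighbour are considered for paths
allowed = findall(vec(any(C[nbrs, :] .== 1, dims=1)))
```

Restricting the search to `allowed` is what keeps the number of candidate paths small compared with searching over all n-grams.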
JudiLing.build_pathsMethod
build_paths(
     data_val,
     C_train,
     S_val,
@@ -66,7 +66,7 @@
     pca_eval_M=Fo,
     n_neighbors=3,
     verbose=true
-    )
source

Learn paths

JudiLing.learn_pathsFunction

A sequence finding algorithm using discrimination learning to predict, for a given word, which n-grams are best supported for a given position in the sequence of n-grams.

source

Learn paths

JudiLing.learn_pathsFunction

A sequence finding algorithm using discrimination learning to predict, for a given word, which n-grams are best supported for a given position in the sequence of n-grams.

source
JudiLing.learn_pathsMethod
learn_paths(
     data::DataFrame,
     cue_obj::Cue_Matrix_Struct,
     S_val::Union{SparseMatrixCSC, Matrix},
@@ -80,7 +80,7 @@
     max_tolerance::Int = 3,
     activation::Union{Nothing, Function} = nothing,
     ignore_nan::Bool = true,
-    verbose::Bool = true)

A high-level wrapper function for learn_paths with much less control. It is aimed at users who are new to JudiLing and the learn_paths function.

Obligatory Arguments

  • data::DataFrame: the training dataset
  • cue_obj::Cue_Matrix_Struct: the C matrix object containing all information with C
  • S_val::Union{SparseMatrixCSC, Matrix}: the S matrix for validation dataset
  • F_train::Union{SparseMatrixCSC, Matrix, Chain}: either the F matrix for training dataset, or a deep learning comprehension model trained on the training set
  • Chat_val::Union{SparseMatrixCSC, Matrix}: the Chat matrix for validation dataset

Optional Arguments

  • Shat_val::Union{Nothing, Matrix}=nothing: the Shat matrix for the validation dataset
  • check_gold_path::Bool=false: if true, return a list of support values for the gold path; this information is returned as second output value
  • threshold::Float64=0.1: the value set for the support such that if the support of an n-gram is higher than this value, the n-gram will be taken into consideration
  • is_tolerant::Bool=false: if true, select a specified number (given by max_tolerance) of n-grams whose supports are below threshold but above a second tolerance threshold to be added to the path
  • tolerance::Float64=(-1000.0): the value set for the second threshold (in tolerant mode) such that if the support for an n-gram is in between this value and the threshold and the max_tolerance number has not been reached, then allow this n-gram to be added to the path
  • max_tolerance::Int64=4: maximum number of n-grams with support below the threshold that are allowed in a path
  • activation::Function=nothing: the activation function you want to pass
  • ignore_nan::Bool=true: whether to ignore NaN when comparing correlations; otherwise NaN will be selected as the maximum correlation value
  • verbose::Bool=false: if true, more information is printed

Examples

res = learn_paths(latin, cue_obj, S, F, Chat)
source
JudiLing.learn_pathsMethod
learn_paths(
+    verbose::Bool = true)

A high-level wrapper function for learn_paths with much less control. It is aimed at users who are new to JudiLing and the learn_paths function.

Obligatory Arguments

  • data::DataFrame: the training dataset
  • cue_obj::Cue_Matrix_Struct: the C matrix object containing all information with C
  • S_val::Union{SparseMatrixCSC, Matrix}: the S matrix for validation dataset
  • F_train::Union{SparseMatrixCSC, Matrix, Chain}: either the F matrix for training dataset, or a deep learning comprehension model trained on the training set
  • Chat_val::Union{SparseMatrixCSC, Matrix}: the Chat matrix for validation dataset

Optional Arguments

  • Shat_val::Union{Nothing, Matrix}=nothing: the Shat matrix for the validation dataset
  • check_gold_path::Bool=false: if true, return a list of support values for the gold path; this information is returned as second output value
  • threshold::Float64=0.1: the value set for the support such that if the support of an n-gram is higher than this value, the n-gram will be taken into consideration
  • is_tolerant::Bool=false: if true, select a specified number (given by max_tolerance) of n-grams whose supports are below threshold but above a second tolerance threshold to be added to the path
  • tolerance::Float64=(-1000.0): the value set for the second threshold (in tolerant mode) such that if the support for an n-gram is in between this value and the threshold and the max_tolerance number has not been reached, then allow this n-gram to be added to the path
  • max_tolerance::Int64=4: maximum number of n-grams with support below the threshold that are allowed in a path
  • activation::Function=nothing: the activation function you want to pass
  • ignore_nan::Bool=true: whether to ignore NaN when comparing correlations; otherwise NaN will be selected as the maximum correlation value
  • verbose::Bool=false: if true, more information is printed

Examples

res = learn_paths(latin, cue_obj, S, F, Chat)
source
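The interplay of threshold, is_tolerant, tolerance and max_tolerance can be read as a per-n-gram eligibility rule. The sketch below is one interpretation of the argument descriptions above, not the package's code: `is_eligible` is a hypothetical helper, and its defaults follow the signature shown above.

```julia
# Hypothetical sketch of the eligibility rule described in the argument list.
# Returns (eligible, tolerated_used), where tolerated_used counts how many
# below-threshold n-grams the current path has already admitted.
function is_eligible(support, tolerated_used; threshold=0.1, is_tolerant=false,
                     tolerance=-1000.0, max_tolerance=3)
    if support > threshold
        # well supported: always eligible
        return (true, tolerated_used)
    elseif is_tolerant && support > tolerance && tolerated_used < max_tolerance
        # weakly supported: admitted only in tolerant mode, and only while
        # the path has tolerance budget left
        return (true, tolerated_used + 1)
    else
        return (false, tolerated_used)
    end
end

strong  = is_eligible(0.3, 0)
weak    = is_eligible(0.05, 0)
rescued = is_eligible(0.05, 0; is_tolerant=true)
spent   = is_eligible(0.05, 3; is_tolerant=true)
```

The last call shows the budget running out: once max_tolerance weakly supported n-grams are in the path, further ones are rejected even in tolerant mode.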
JudiLing.learn_pathsMethod
learn_paths(
     data_train::DataFrame,
     data_val::DataFrame,
     C_train::Union{Matrix, SparseMatrixCSC},
@@ -227,7 +227,7 @@
 if_pca=true,
 pca_eval_M=Fo,
 verbose=true);
-
source
JudiLing.learn_paths_rpiMethod
learn_paths_rpi(
     data_train::DataFrame,
     data_val::DataFrame,
     C_train::Union{Matrix, SparseMatrixCSC},
@@ -260,5 +260,5 @@
     ignore_nan::Bool = true,
     check_threshold_stat::Bool = false,
     verbose::Bool = false
-)

Run learn_paths and also return the supports of the result indices.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset
  • C_train::Union{SparseMatrixCSC, Matrix}: the C matrix for training dataset
  • S_val::Union{SparseMatrixCSC, Matrix}: the S matrix for validation dataset
  • F_train::Union{SparseMatrixCSC, Matrix, Chain}: the F matrix for training dataset, or a deep learning comprehension model trained on the training data
  • Chat_val::Union{SparseMatrixCSC, Matrix}: the Chat matrix for validation dataset
  • A::SparseMatrixCSC: the adjacency matrix
  • i2f::Dict: the dictionary returning features given indices
  • f2i::Dict: the dictionary returning indices given features

Optional Arguments

  • gold_ind::Union{Nothing, Vector}=nothing: gold paths' indices
  • Shat_val::Union{Nothing, Matrix}=nothing: the Shat matrix for the validation dataset
  • check_gold_path::Bool=false: if true, return a list of support values for the gold path; this information is returned as second output value
  • max_t::Int64=15: maximum timestep
  • max_can::Int64=10: maximum number of candidates to consider
  • threshold::Float64=0.1: the value set for the support such that if the support of an n-gram is higher than this value, the n-gram will be taken into consideration
  • is_tolerant::Bool=false: if true, select a specified number (given by max_tolerance) of n-grams whose supports are below threshold but above a second tolerance threshold to be added to the path
  • tolerance::Float64=(-1000.0): the value set for the second threshold (in tolerant mode) such that if the support for an n-gram is in between this value and the threshold and the max_tolerance number has not been reached, then allow this n-gram to be added to the path
  • max_tolerance::Int64=4: maximum number of n-grams with support below the threshold that are allowed in a path
  • grams::Int64=3: the number n of grams that make up an n-gram
  • tokenized::Bool=false: if true, the dataset target is tokenized
  • sep_token::Union{Nothing, String, Char}=nothing: separator token
  • keep_sep::Bool=false: if true, keep separators in cues
  • target_col::Union{String, Symbol}=:Words: the column name for target strings
  • start_end_token::Union{String, Char}="#": start and end token in boundary cues
  • issparse::Union{Symbol, Bool}=:auto: control of whether output of Mt matrix is a dense matrix or a sparse matrix
  • sparse_ratio::Float64=0.05: the ratio to decide whether a matrix is sparse
  • if_pca::Bool=false: turn on to enable pca mode
  • pca_eval_M::Matrix=nothing: pass original F for pca mode
  • activation::Function=nothing: the activation function you want to pass
  • ignore_nan::Bool=true: whether to ignore NaN when comparing correlations; otherwise NaN will be selected as the maximum correlation value
  • check_threshold_stat::Bool=false: if true, return a threshold and tolerance proportion for each timestep
  • verbose::Bool=false: if true, more information is printed
source

Utility functions

JudiLing.eval_canMethod
eval_can(candidates, S, F::Union{Matrix,SparseMatrixCSC, Chain}, i2f, max_can, if_pca, pca_eval_M)

Calculate for each candidate path the correlation between predicted semantic vector and the gold standard semantic vector, and select as target for production the path with the highest correlation.

source
JudiLing.predict_shatMethod
predict_shat(F::Union{Matrix, SparseMatrixCSC},
-             ci::Vector{Int})

Predicts semantic vector shat given a comprehension matrix F and a list of indices of ngrams ci.

Obligatory arguments

  • F::Union{Matrix, SparseMatrixCSC}: Comprehension matrix F.
  • ci::Vector{Int}: Vector of indices of ngrams in c vector. Essentially, this is a vector indicating which ngrams in a c vector are absent and which are present.
source
+)

Run learn_paths and also return the supports of the result indices.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset
  • C_train::Union{SparseMatrixCSC, Matrix}: the C matrix for training dataset
  • S_val::Union{SparseMatrixCSC, Matrix}: the S matrix for validation dataset
  • F_train::Union{SparseMatrixCSC, Matrix, Chain}: the F matrix for training dataset, or a deep learning comprehension model trained on the training data
  • Chat_val::Union{SparseMatrixCSC, Matrix}: the Chat matrix for validation dataset
  • A::SparseMatrixCSC: the adjacency matrix
  • i2f::Dict: the dictionary returning features given indices
  • f2i::Dict: the dictionary returning indices given features

Optional Arguments

  • gold_ind::Union{Nothing, Vector}=nothing: gold paths' indices
  • Shat_val::Union{Nothing, Matrix}=nothing: the Shat matrix for the validation dataset
  • check_gold_path::Bool=false: if true, return a list of support values for the gold path; this information is returned as second output value
  • max_t::Int64=15: maximum timestep
  • max_can::Int64=10: maximum number of candidates to consider
  • threshold::Float64=0.1: the value set for the support such that if the support of an n-gram is higher than this value, the n-gram will be taken into consideration
  • is_tolerant::Bool=false: if true, select a specified number (given by max_tolerance) of n-grams whose supports are below threshold but above a second tolerance threshold to be added to the path
  • tolerance::Float64=(-1000.0): the value set for the second threshold (in tolerant mode) such that if the support for an n-gram is in between this value and the threshold and the max_tolerance number has not been reached, then allow this n-gram to be added to the path
  • max_tolerance::Int64=4: maximum number of n-grams with support below the threshold that are allowed in a path
  • grams::Int64=3: the number n of grams that make up an n-gram
  • tokenized::Bool=false: if true, the dataset target is tokenized
  • sep_token::Union{Nothing, String, Char}=nothing: separator token
  • keep_sep::Bool=false: if true, keep separators in cues
  • target_col::Union{String, Symbol}=:Words: the column name for target strings
  • start_end_token::Union{String, Char}="#": start and end token in boundary cues
  • issparse::Union{Symbol, Bool}=:auto: control of whether output of Mt matrix is a dense matrix or a sparse matrix
  • sparse_ratio::Float64=0.05: the ratio to decide whether a matrix is sparse
  • if_pca::Bool=false: turn on to enable pca mode
  • pca_eval_M::Matrix=nothing: pass original F for pca mode
  • activation::Function=nothing: the activation function you want to pass
  • ignore_nan::Bool=true: whether to ignore NaN when comparing correlations; otherwise NaN will be selected as the maximum correlation value
  • check_threshold_stat::Bool=false: if true, return a threshold and tolerance proportion for each timestep
  • verbose::Bool=false: if true, more information is printed
source

Utility functions

JudiLing.eval_canMethod
eval_can(candidates, S, F::Union{Matrix,SparseMatrixCSC, Chain}, i2f, max_can, if_pca, pca_eval_M)

Calculate for each candidate path the correlation between predicted semantic vector and the gold standard semantic vector, and select as target for production the path with the highest correlation.

source
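The selection step can be sketched as scoring each candidate path by the correlation between its predicted semantic vector and the gold vector. The sketch assumes a linear mapping in which a path's predicted vector is the sum of the rows of F for its n-grams; the matrix, the gold vector and the candidates are toy data, and `shat_for` is a hypothetical helper.

```julia
using Statistics  # standard library; provides cor

# Toy comprehension matrix: 4 n-grams (rows) x 3 semantic dimensions (columns)
F = [0.5 0.0 0.1;
     0.2 0.3 0.0;
     0.0 0.4 0.2;
     0.1 0.1 0.6]
s_gold = [0.5, 0.4, 0.3]               # gold standard semantic vector
candidates = [[1, 2], [1, 3], [2, 4]]  # candidate paths as n-gram indices

# Predicted semantic vector of a path: sum of the F rows of its n-grams
shat_for(path) = vec(sum(F[path, :], dims=1))

# Score every candidate and keep the one with the highest correlation
scores = [cor(shat_for(p), s_gold) for p in candidates]
best = candidates[argmax(scores)]
```

In this toy case the second candidate reproduces the gold vector exactly, so it is selected.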
JudiLing.predict_shatMethod
predict_shat(F::Union{Matrix, SparseMatrixCSC},
+             ci::Vector{Int})

Predicts semantic vector shat given a comprehension matrix F and a list of indices of ngrams ci.

Obligatory arguments

  • F::Union{Matrix, SparseMatrixCSC}: Comprehension matrix F.
  • ci::Vector{Int}: Vector of indices of ngrams in c vector. Essentially, this is a vector indicating which ngrams in a c vector are absent and which are present.
source
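A minimal sketch of the computation, assuming the usual linear mapping in which a semantic vector is the cue vector multiplied by F. The matrix and indices are toy data, and this is not the package's source.

```julia
# Toy comprehension matrix: 4 n-grams (rows) x 2 semantic dimensions (columns)
F = [0.5 0.0;
     0.2 0.3;
     0.0 0.4;
     0.1 0.1]
ci = [1, 3]  # n-grams 1 and 3 are present in the word

# Binary cue vector: 1 for present n-grams, 0 for absent ones
c = zeros(size(F, 1))
c[ci] .= 1.0

# Predicted semantic vector; equivalent to summing rows 1 and 3 of F
shat = F' * c
```

Because c is binary, the product reduces to `sum(F[ci, :], dims=1)`, which avoids building c explicitly.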
diff --git a/dev/man/input/index.html b/dev/man/input/index.html
index a9141d6..59a0663 100644
--- a/dev/man/input/index.html
+++ b/dev/man/input/index.html
@@ -2,7 +2,7 @@ Loading data · JudiLing.jl

Loading data

JudiLing.load_datasetMethod
load_dataset(filepath::String;
             delim::String=",",
             kargs...)

Load a dataset from file, usually comma- or tab-separated. Returns a DataFrame.

Obligatory arguments

  • filepath::String: Path to file to be loaded.

Optional arguments

  • delim::String=",": Delimiter in the file (usually either "," or "\t").
  • kargs...: Further keyword arguments are passed to CSV.File().

Example

latin = JudiLing.load_dataset("latin.csv")
-first(latin, 10)
source
JudiLing.loading_data_randomly_splitMethod
loading_data_randomly_split(
     data_path::String,
     output_dir_path::String,
     data_prefix::String;
@@ -13,7 +13,7 @@
     "careful",
     "latin",
     ["Lexeme","Person","Number","Tense","Voice","Mood"]
-)
source
JudiLing.loading_data_careful_splitMethod
loading_data_careful_split(
     data_path::String,
     data_prefix::String,
     output_dir_path::String,
@@ -33,4 +33,4 @@
     "latin",
     "careful",
     ["Lexeme","Person","Number","Tense","Voice","Mood"]
-)
source
+)
source
diff --git a/dev/man/make_adjacency_matrix/index.html b/dev/man/make_adjacency_matrix/index.html
index 99ac529..8a9f147 100644
--- a/dev/man/make_adjacency_matrix/index.html
+++ b/dev/man/make_adjacency_matrix/index.html
@@ -8,7 +8,7 @@
 JudiLing.make_adjacency_matrix(
     i2f,
     tokenized=true,
-    sep_token="-")
source
JudiLing.make_full_adjacency_matrixMethod
make_adjacency_matrix(i2f)

Make full adjacency matrix based only on the form of n-grams regardless of whether they are seen in the training data. This usually takes hours for large datasets, as all possible combinations are considered.

Obligatory Arguments

  • i2f::Dict: the dictionary returning features given indices

Optional Arguments

  • tokenized::Bool=false: if true, the dataset target is assumed to be tokenized
  • sep_token::Union{Nothing, String, Char}=nothing: separator token
  • verbose::Bool=false: if true, more information will be printed

Examples

# without tokenization
+    sep_token="-")
source
JudiLing.make_full_adjacency_matrixMethod
make_adjacency_matrix(i2f)

Make full adjacency matrix based only on the form of n-grams regardless of whether they are seen in the training data. This usually takes hours for large datasets, as all possible combinations are considered.

Obligatory Arguments

  • i2f::Dict: the dictionary returning features given indices

Optional Arguments

  • tokenized::Bool=false: if true, the dataset target is assumed to be tokenized
  • sep_token::Union{Nothing, String, Char}=nothing: separator token
  • verbose::Bool=false: if true, more information will be printed

Examples

# without tokenization
 i2f = Dict([(1, "#ab"), (2, "abc"), (3, "bc#"), (4, "#bc"), (5, "ab#")])
 JudiLing.make_adjacency_matrix(i2f)
 
@@ -17,11 +17,11 @@
 JudiLing.make_adjacency_matrix(
     i2f,
     tokenized=true,
-    sep_token="-")
source
JudiLing.make_combined_adjacency_matrixMethod
make_combined_adjacency_matrix(data_train, data_val)

Make combined adjacency matrix.

Obligatory Arguments

  • data_train::DataFrame: training dataset
  • data_val::DataFrame: validation dataset

Optional Arguments

  • grams=3: the number of grams for cues
  • target_col=:Words: the column name for target strings
  • tokenized=false: if true, the dataset target is assumed to be tokenized
  • sep_token=nothing: separator
  • keep_sep=false: if true, keep separators in cues
  • start_end_token="#": start and end token in boundary cues
  • verbose=false: if true, more information is printed

Examples

JudiLing.make_combined_adjacency_matrix(
+    sep_token="-")
source
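A form-based adjacency matrix can be pictured as an overlap test between n-grams. The sketch below assumes that one n-gram may follow another when they overlap in all but one character; this rule and the helper `overlap_adjacency` are illustrative assumptions, not the package's implementation, and tokenized input is not covered.

```julia
# The same toy dictionary as in the example above, written with pairs
i2f = Dict(1 => "#ab", 2 => "abc", 3 => "bc#", 4 => "#bc", 5 => "ab#")

# Assumed overlap rule: n-gram a may be followed by n-gram b when a without
# its first character equals b without its last character.
function overlap_adjacency(i2f::Dict{Int,String})
    n = length(i2f)
    A = falses(n, n)
    for i in 1:n, j in 1:n
        a, b = collect(i2f[i]), collect(i2f[j])
        A[i, j] = a[2:end] == b[1:end-1]
    end
    return A
end

A = overlap_adjacency(i2f)
```

Every pair of n-grams is compared, so the cost grows quadratically with the number of n-grams, which is consistent with the long runtimes mentioned above for large datasets.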
JudiLing.make_combined_adjacency_matrixMethod
make_combined_adjacency_matrix(data_train, data_val)

Make combined adjacency matrix.

Obligatory Arguments

  • data_train::DataFrame: training dataset
  • data_val::DataFrame: validation dataset

Optional Arguments

  • grams=3: the number of grams for cues
  • target_col=:Words: the column name for target strings
  • tokenized=false: if true, the dataset target is assumed to be tokenized
  • sep_token=nothing: separator
  • keep_sep=false: if true, keep separators in cues
  • start_end_token="#": start and end token in boundary cues
  • verbose=false: if true, more information is printed

Examples

JudiLing.make_combined_adjacency_matrix(
     latin_train,
     latin_val,
     grams=3,
     target_col=:Word,
     tokenized=false,
     keep_sep=false
-    )
source
+    )
source
diff --git a/dev/man/make_cue_matrix/index.html b/dev/man/make_cue_matrix/index.html
index 3e8bc69..64cd59f 100644
--- a/dev/man/make_cue_matrix/index.html
+++ b/dev/man/make_cue_matrix/index.html
@@ -1,5 +1,5 @@
-Make Cue Matrix · JudiLing.jl

Make Cue Matrix

JudiLing.Cue_Matrix_StructType

A structure that stores information created by make_cue_matrix: C is the cue matrix; f2i is a dictionary returning the indices for features; i2f is a dictionary returning the features for indices; gold_ind is a list of indices of gold paths; A is the adjacency matrix; grams is the number of grams for cues; target_col is the column name for target strings; tokenized is whether the dataset target is tokenized; sep_token is the separator; keep_sep is whether to keep separators in cues; start_end_token is the start and end token in boundary cues.

source
JudiLing.make_cue_matrixMethod
make_cue_matrix(data::DataFrame)

Make the cue matrix for training datasets and corresponding indices as well as the adjacency matrix and gold paths, given a dataset in the form of a DataFrame.

Obligatory Arguments

  • data::DataFrame: the dataset

Optional Arguments

  • grams::Int64=3: the number of grams for cues
  • target_col::Union{String, Symbol}=:Words: the column name for target strings
  • tokenized::Bool=false: if true, the dataset target is assumed to be tokenized
  • sep_token::Union{Nothing, String, Char}=nothing: separator
  • keep_sep::Bool=false: if true, keep separators in cues
  • start_end_token::Union{String, Char}="#": start and end token in boundary cues
  • verbose::Bool=false: if true, more information is printed

Examples

# make cue matrix without tokenization
+Make Cue Matrix · JudiLing.jl

Make Cue Matrix

JudiLing.Cue_Matrix_StructType

A structure that stores information created by make_cue_matrix: C is the cue matrix; f2i is a dictionary returning the indices for features; i2f is a dictionary returning the features for indices; gold_ind is a list of indices of gold paths; A is the adjacency matrix; grams is the number of grams for cues; target_col is the column name for target strings; tokenized is whether the dataset target is tokenized; sep_token is the separator; keep_sep is whether to keep separators in cues; start_end_token is the start and end token in boundary cues.

source
JudiLing.make_cue_matrixMethod
make_cue_matrix(data::DataFrame)

Make the cue matrix for training datasets and corresponding indices as well as the adjacency matrix and gold paths, given a dataset in the form of a DataFrame.

Obligatory Arguments

  • data::DataFrame: the dataset

Optional Arguments

  • grams::Int64=3: the number of grams for cues
  • target_col::Union{String, Symbol}=:Words: the column name for target strings
  • tokenized::Bool=false: if true, the dataset target is assumed to be tokenized
  • sep_token::Union{Nothing, String, Char}=nothing: separator
  • keep_sep::Bool=false: if true, keep separators in cues
  • start_end_token::Union{String, Char}="#": start and end token in boundary cues
  • verbose::Bool=false: if true, more information is printed

Examples

# make cue matrix without tokenization
 cue_obj_train = JudiLing.make_cue_matrix(
      latin_train,
     grams=3,
@@ -21,7 +21,7 @@
     start_end_token="#",
     keep_sep=true,
     verbose=false
-    )
source
JudiLing.make_cue_matrixMethod
make_cue_matrix(data::DataFrame, cue_obj::Cue_Matrix_Struct)

Make the cue matrix for validation datasets and corresponding indices as well as the adjacency matrix and gold paths, given a dataset in the form of a DataFrame.

Obligatory Arguments

  • data::DataFrame: the dataset
  • cue_obj::Cue_Matrix_Struct: training cue object

Optional Arguments

  • grams::Int64=3: the number of grams for cues
  • target_col::Union{String, Symbol}=:Words: the column name for target strings
  • tokenized::Bool=false: if true, the dataset target is assumed to be tokenized
  • sep_token::Union{Nothing, String, Char}=nothing: separator
  • keep_sep::Bool=false: if true, keep separators in cues
  • start_end_token::Union{String, Char}="#": start and end token in boundary cues
  • verbose::Bool=false: if true, more information is printed

Examples

# make cue matrix without tokenization
+    )
source
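What a cue matrix and its accompanying dictionaries contain can be sketched on two toy words. The sketch is an illustration under assumptions, not the package's code: `ngrams` is a hypothetical helper, the words are made up, tokenization is not handled, and a dense matrix is used where the package may return a sparse one.

```julia
# Hypothetical helper: split a word into overlapping n-grams with boundary tokens.
function ngrams(word::AbstractString; grams=3, start_end_token="#")
    chars = collect(start_end_token * word * start_end_token)
    return [join(chars[i:i+grams-1]) for i in 1:length(chars)-grams+1]
end

words = ["ab", "abc"]  # toy target column

# f2i maps each n-gram to a column index, in order of first appearance;
# gold_ind stores the index sequence (gold path) of every word.
f2i = Dict{String,Int}()
gold_ind = Vector{Vector{Int}}()
for w in words
    inds = [get!(f2i, g, length(f2i) + 1) for g in ngrams(w)]
    push!(gold_ind, inds)
end
i2f = Dict(i => f for (f, i) in f2i)

# Binary cue matrix: one row per word, one column per n-gram
C = zeros(Int, length(words), length(f2i))
for (row, inds) in enumerate(gold_ind)
    C[row, inds] .= 1
end
```

The word "ab" yields the n-grams "#ab" and "ab#", and "abc" yields "#ab", "abc" and "bc#", so the two rows of C share their first column.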
JudiLing.make_cue_matrixMethod
make_cue_matrix(data::DataFrame, cue_obj::Cue_Matrix_Struct)

Make the cue matrix for validation datasets and corresponding indices as well as the adjacency matrix and gold paths, given a dataset in the form of a DataFrame.

Obligatory Arguments

  • data::DataFrame: the dataset
  • cue_obj::Cue_Matrix_Struct: training cue object

Optional Arguments

  • grams::Int64=3: the number of grams for cues
  • target_col::Union{String, Symbol}=:Words: the column name for target strings
  • tokenized::Bool=false: if true, the dataset target is assumed to be tokenized
  • sep_token::Union{Nothing, String, Char}=nothing: separator
  • keep_sep::Bool=false: if true, keep separators in cues
  • start_end_token::Union{String, Char}="#": start and end token in boundary cues
  • verbose::Bool=false: if true, more information is printed

Examples

# make cue matrix without tokenization
 cue_obj_val = JudiLing.make_cue_matrix(
   latin_val,
   cue_obj_train,
@@ -45,7 +45,7 @@
     keep_sep=true,
     start_end_token="#",
     verbose=false
-    )
source
JudiLing.make_cue_matrixMethod
make_cue_matrix(data_train::DataFrame, data_val::DataFrame)

Make the cue matrix for training and validation datasets at the same time.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset

Optional Arguments

  • grams::Int64=3: the number of grams for cues
  • target_col::Union{String, Symbol}=:Words: the column name for target strings
  • tokenized::Bool=false: if true, the dataset target is assumed to be tokenized
  • sep_token::Union{Nothing, String, Char}=nothing: separator
  • keep_sep::Bool=false: if true, keep separators in cues
  • start_end_token::Union{String, Char}="#": start and end token in boundary cues
  • verbose::Bool=false: if true, more information is printed

Examples

# make cue matrix without tokenization
+    )
source
JudiLing.make_cue_matrixMethod
make_cue_matrix(data_train::DataFrame, data_val::DataFrame)

Make the cue matrix for training and validation datasets at the same time.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset

Optional Arguments

  • grams::Int64=3: the number of grams for cues
  • target_col::Union{String, Symbol}=:Words: the column name for target strings
  • tokenized::Bool=false: if true, the dataset target is assumed to be tokenized
  • sep_token::Union{Nothing, String, Char}=nothing: separator
  • keep_sep::Bool=false: if true, keep separators in cues
  • start_end_token::Union{String, Char}="#": start and end token in boundary cues
  • verbose::Bool=false: if true, more information is printed

Examples

# make cue matrix without tokenization
 cue_obj_train, cue_obj_val = JudiLing.make_cue_matrix(
     latin_train,
     latin_val,
@@ -66,7 +66,7 @@
     keep_sep=true,
     start_end_token="#",
     verbose=false
-    )
source
JudiLing.make_combined_cue_matrixMethod
make_combined_cue_matrix(data_train, data_val)

Make the cue matrix for training and validation datasets at the same time, where the features and adjacencies are combined.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset

Optional Arguments

  • grams::Int64=3: the number of grams for cues
  • target_col::Union{String, Symbol}=:Words: the column name for target strings
  • tokenized::Bool=false: if true, the dataset target is assumed to be tokenized
  • sep_token::Union{Nothing, String, Char}=nothing: separator
  • keep_sep::Bool=false: if true, keep separators in cues
  • start_end_token::Union{String, Char}="#": start and end token in boundary cues
  • verbose::Bool=false: if true, more information is printed

Examples

# make cue matrix without tokenization
+    )
source
JudiLing.make_combined_cue_matrixMethod
make_combined_cue_matrix(data_train, data_val)

Make the cue matrix for training and validation datasets at the same time, where the features and adjacencies are combined.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset

Optional Arguments

  • grams::Int64=3: the number of grams for cues
  • target_col::Union{String, Symbol}=:Words: the column name for target strings
  • tokenized::Bool=false: if true, the dataset target is assumed to be tokenized
  • sep_token::Union{Nothing, String, Char}=nothing: separator
  • keep_sep::Bool=false: if true, keep separators in cues
  • start_end_token::Union{String, Char}="#": start and end token in boundary cues
  • verbose::Bool=false: if true, more information is printed

Examples

# make cue matrix without tokenization
 cue_obj_train, cue_obj_val = JudiLing.make_combined_cue_matrix(
     latin_train,
     latin_val,
@@ -87,9 +87,9 @@
     keep_sep=true,
     start_end_token="#",
     verbose=false
-    )
source
JudiLing.make_cue_matrix_from_CFBSMethod
make_cue_matrix_from_CFBS(features::Vector{Vector{T}};
                           pad_val::T = 0.,
-                          ncol::Union{Missing,Int}=missing) where {T}

Create a cue matrix from a vector of feature vectors (usually CFBS vectors). It is expected (though of course not necessary) that the vectors have varying lengths. They are consequently padded on the right with the provided pad_val.

Obligatory arguments

  • features::Vector{Vector{T}}: vector of vectors containing C-FBS features

Optional arguments

  • pad_val::T = 0.: Value with which the feature vectors will be padded
  • ncol::Union{Missing,Int}=missing: Number of columns of the C matrix. If not set, will be set to the maximum number of features

Examples

C = JudiLing.make_cue_matrix_from_CFBS(features)
source
JudiLing.make_combined_cue_matrix_from_CFBSMethod
make_combined_cue_matrix_from_CFBS(features_train::Vector{Vector{T}},
+                          ncol::Union{Missing,Int}=missing) where {T}

Create a cue matrix from a vector of feature vectors (usually CFBS vectors). It is expected (though of course not necessary) that the vectors have varying lengths. They are consequently padded on the right with the provided pad_val.

Obligatory arguments

  • features::Vector{Vector{T}}: vector of vectors containing C-FBS features

Optional arguments

  • pad_val::T = 0.: Value with which the feature vectors will be padded
  • ncol::Union{Missing,Int}=missing: Number of columns of the C matrix. If not set, will be set to the maximum number of features

Examples

C = JudiLing.make_cue_matrix_from_CFBS(features)
source
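The right-padding described above can be illustrated with a minimal sketch. The helper `pad_features` is hypothetical and written only for illustration; it is not JudiLing's implementation, and it assumes that `ncol`, when given, is at least as large as the longest feature vector.

```julia
# Hypothetical helper (not part of JudiLing): right-pad each feature vector
# with pad_val and stack the padded vectors as rows of a matrix.
function pad_features(features::Vector{Vector{T}};
                      pad_val::T = zero(T),
                      ncol::Union{Missing,Int} = missing) where {T}
    # Fall back to the length of the longest feature vector if ncol is not set.
    n = ismissing(ncol) ? maximum(length.(features)) : ncol
    C = fill(pad_val, length(features), n)
    for (i, f) in enumerate(features)
        # Assumes length(f) <= n; the remaining entries keep pad_val.
        C[i, 1:length(f)] = f
    end
    return C
end

features = [[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]]
C = pad_features(features)
```

Here each row of `C` holds one feature vector, and shorter vectors are filled up on the right with `pad_val`.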
JudiLing.make_combined_cue_matrix_from_CFBSMethod
make_combined_cue_matrix_from_CFBS(features_train::Vector{Vector{T}},
                                    features_test::Vector{Vector{T}};
                                    pad_val::T = 0.,
-                                   ncol::Union{Missing,Int}=missing) where {T}

Create cue matrices from two vectors of feature vectors (usually CFBS vectors). It is expected (though of course not necessary) that the vectors have varying lengths. They are consequently padded on the right with the provided pad_val. The number of columns of both cue matrices is set to the maximum number of feature values in features_train and features_test.

Obligatory arguments

  • features_train::Vector{Vector{T}}: vector of vectors containing C-FBS features
  • features_test::Vector{Vector{T}}: vector of vectors containing C-FBS features

Optional arguments

  • pad_val::T = 0.: Value with which the feature vectors will be padded
  • ncol::Union{Missing,Int}=missing: Number of columns of the C matrices. If not set, will be set to the maximum number of features in features_train and features_test

Examples

C_train, C_test = JudiLing.make_combined_cue_matrix_from_CFBS(features_train, features_test)
source
JudiLing.make_ngramsMethod
make_ngrams(tokens, grams, keep_sep, sep_token, start_end_token)

Given a list of string tokens, return a list of all n-grams for these tokens.

source
+ ncol::Union{Missing,Int}=missing) where {T}

Create cue matrices from two vectors of feature vectors (usually CFBS vectors). It is expected (though of course not necessary) that the vectors have varying lengths. They are consequently padded on the right with the provided pad_val. The number of columns of both cue matrices is set to the maximum number of feature values in features_train and features_test.

Obligatory arguments

  • features_train::Vector{Vector{T}}: vector of vectors containing C-FBS features
  • features_test::Vector{Vector{T}}: vector of vectors containing C-FBS features

Optional arguments

  • pad_val::T = 0.: Value with which the feature vectors will be padded
  • ncol::Union{Missing,Int}=missing: Number of columns of the C matrices. If not set, will be set to the maximum number of features in features_train and features_test

Examples

C_train, C_test = JudiLing.make_combined_cue_matrix_from_CFBS(features_train, features_test)
source
JudiLing.make_ngramsMethod
make_ngrams(tokens, grams, keep_sep, sep_token, start_end_token)

Given a list of string tokens, return a list of all n-grams for these tokens.

source
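As a rough illustration of what n-gram extraction with boundary tokens looks like, consider the sketch below. The function `ngrams_sketch` is hypothetical: it ignores `keep_sep` and `sep_token` and may differ in detail from what make_ngrams actually does.

```julia
# Hypothetical sketch (not JudiLing's make_ngrams): add the boundary token on
# both sides of the token list and slide a window of size `grams` over it.
function ngrams_sketch(tokens::Vector{String}, grams::Int;
                       start_end_token::String = "#")
    padded = vcat(start_end_token, tokens, start_end_token)
    return [join(padded[i:i+grams-1]) for i in 1:(length(padded) - grams + 1)]
end

# Split a word into single-character tokens and extract its trigrams.
tokens = string.(collect("vocoo"))
trigrams = ngrams_sketch(tokens, 3)
```

With `start_end_token="#"`, the first and last n-grams contain the boundary marker, which corresponds to the boundary cues mentioned in the make_cue_matrix arguments.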
diff --git a/dev/man/make_semantic_matrix/index.html b/dev/man/make_semantic_matrix/index.html index b26d69d..0e3bb2f 100644 --- a/dev/man/make_semantic_matrix/index.html +++ b/dev/man/make_semantic_matrix/index.html @@ -1,12 +1,12 @@ -Make Semantic Matrix · JudiLing.jl

Make Semantic Matrix

Make binary semantic vectors

JudiLing.PS_Matrix_StructType

A structure that stores the discrete semantic vectors: pS is the discrete semantic matrix; f2i is a dictionary returning the indices for features; i2f is a dictionary returning the features for indices.

source
JudiLing.make_pS_matrixMethod
make_pS_matrix(data)

Create a discrete semantic matrix given a dataframe.

Obligatory Arguments

  • data::DataFrame: the dataset

Optional Arguments

  • features_col::Symbol=:CommunicativeIntention: the column name for target
  • sep_token::String="_": separator

Examples

s_obj_train = JudiLing.make_pS_matrix(
+Make Semantic Matrix · JudiLing.jl

Make Semantic Matrix

Make binary semantic vectors

JudiLing.PS_Matrix_StructType

A structure that stores the discrete semantic vectors: pS is the discrete semantic matrix; f2i is a dictionary returning the indices for features; i2f is a dictionary returning the features for indices.

source
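The relationship between pS, f2i and i2f can be sketched as follows. The function `binary_semantic_sketch` is a hypothetical illustration that uses a dense integer matrix; it is not JudiLing's implementation, and the actual storage format of pS may differ.

```julia
# Hypothetical sketch (not JudiLing's make_pS_matrix): rows are utterances,
# columns are features, and an entry is 1 if the utterance has that feature.
function binary_semantic_sketch(intentions::Vector{String};
                                sep_token::String = "_")
    feats = [String.(split(s, sep_token)) for s in intentions]
    vocab = unique(reduce(vcat, feats))
    f2i = Dict(f => i for (i, f) in enumerate(vocab))  # feature -> column index
    i2f = Dict(i => f for (i, f) in enumerate(vocab))  # column index -> feature
    pS = zeros(Int, length(intentions), length(vocab))
    for (row, fs) in enumerate(feats), f in fs
        pS[row, f2i[f]] = 1
    end
    return pS, f2i, i2f
end

pS, f2i, i2f = binary_semantic_sketch(["give_past", "give_future"])
```

In this toy example the feature strings are split on `sep_token`, so both utterances share the column for `give` and differ in their tense feature.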
JudiLing.make_pS_matrixMethod
make_pS_matrix(data)

Create a discrete semantic matrix given a dataframe.

Obligatory Arguments

  • data::DataFrame: the dataset

Optional Arguments

  • features_col::Symbol=:CommunicativeIntention: the column name for target
  • sep_token::String="_": separator

Examples

s_obj_train = JudiLing.make_pS_matrix(
     utterance,
     features_col=:CommunicativeIntention,
-    sep_token="_")
source
JudiLing.make_pS_matrixMethod
make_pS_matrix(data_val, pS_obj)

Construct the discrete semantic matrix for the validation dataset, given the exemplars in the dataframe and the S matrix of the training dataset.

Obligatory Arguments

  • data_val::DataFrame: the dataset
  • pS_obj::PS_Matrix_Struct: training PS object

Optional Arguments

  • features_col::Symbol=:CommunicativeIntention: the column name for target
  • sep_token::String="_": separator

Examples

s_obj_val = JudiLing.make_pS_matrix(
+    sep_token="_")
source
JudiLing.make_pS_matrixMethod
make_pS_matrix(data_val, pS_obj)

Construct the discrete semantic matrix for the validation dataset, given the exemplars in the dataframe and the S matrix of the training dataset.

Obligatory Arguments

  • data_val::DataFrame: the dataset
  • pS_obj::PS_Matrix_Struct: training PS object

Optional Arguments

  • features_col::Symbol=:CommunicativeIntention: the column name for target
  • sep_token::String="_": separator

Examples

s_obj_val = JudiLing.make_pS_matrix(
     data_val,
     s_obj_train,
     features_col=:CommunicativeIntention,
-    sep_token="_")
source
JudiLing.make_combined_pS_matrixMethod
make_combined_pS_matrix(
     data_train,
     data_val;
     features_col = :CommunicativeIntention,
@@ -15,7 +15,7 @@
     data_train,
     data_val,
     features_col=:CommunicativeIntention,
-    sep_token="_")
source

Simulate semantic vectors

JudiLing.L_Matrix_StructType

A structure that stores Lexome semantic vectors: L is Lexome semantic matrix; f2i is a dictionary returning the indices for features; i2f is a dictionary returning the features for indices.

source
JudiLing.make_S_matrixMethod
make_S_matrix(data::DataFrame, base::Vector, inflections::Vector)

Create a simulated semantic matrix for the training dataset, given the input data, a vector specifying the context lexemes, and a vector specifying the grammatical lexemes. The semantic vector of a word form is constructed by summing the semantic vectors of its content and grammatical lexemes.

Obligatory Arguments

  • data::DataFrame: the dataset
  • base::Vector: context lexemes
  • inflections::Vector: grammatic lexemes

Optional Arguments

  • ncol::Int64=200: dimension of semantic vectors, usually the same as that of cue vectors
  • sd_base_mean::Int64=1: the sd mean of base features
  • sd_inflection_mean::Int64=1: the sd mean of inflectional features
  • sd_base::Int64=4: the sd of base features
  • sd_inflection::Int64=4: the sd of inflectional features
  • seed::Int64=314: the random seed
  • isdeep::Bool=true: if true, mean of each feature is also randomized
  • add_noise::Bool=true: if true, add additional Gaussian noise
  • sd_noise::Int64=1: the sd of the Gaussian noise
  • normalized::Bool=false: if true, most values fall between -1 and 1; some may slightly exceed this range, depending on the sd

Examples

# basic usage
+    sep_token="_")
source

Simulate semantic vectors

JudiLing.L_Matrix_StructType

A structure that stores Lexome semantic vectors: L is Lexome semantic matrix; f2i is a dictionary returning the indices for features; i2f is a dictionary returning the features for indices.

source
JudiLing.make_S_matrixMethod
make_S_matrix(data::DataFrame, base::Vector, inflections::Vector)

Create a simulated semantic matrix for the training dataset, given the input data, a vector specifying the context lexemes, and a vector specifying the grammatical lexemes. The semantic vector of a word form is constructed by summing the semantic vectors of its content and grammatical lexemes.

Obligatory Arguments

  • data::DataFrame: the dataset
  • base::Vector: context lexemes
  • inflections::Vector: grammatic lexemes

Optional Arguments

  • ncol::Int64=200: dimension of semantic vectors, usually the same as that of cue vectors
  • sd_base_mean::Int64=1: the sd mean of base features
  • sd_inflection_mean::Int64=1: the sd mean of inflectional features
  • sd_base::Int64=4: the sd of base features
  • sd_inflection::Int64=4: the sd of inflectional features
  • seed::Int64=314: the random seed
  • isdeep::Bool=true: if true, mean of each feature is also randomized
  • add_noise::Bool=true: if true, add additional Gaussian noise
  • sd_noise::Int64=1: the sd of the Gaussian noise
  • normalized::Bool=false: if true, most values fall between -1 and 1; some may slightly exceed this range, depending on the sd
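
The idea of building a word's semantic vector as a sum of lexeme vectors can be sketched as follows. This is a simplified, hypothetical illustration, not JudiLing's implementation: it draws one Gaussian vector per lexeme with a single standard deviation `sd` and leaves out the randomized means (isdeep), the added noise, and the normalization.

```julia
using Random

# Hypothetical sketch (not JudiLing's make_S_matrix): draw one random vector
# per lexeme, then sum the vectors of the lexemes that make up each word form.
function simulate_S_sketch(words::Vector{Vector{String}};
                           ncol::Int = 4, sd::Real = 4, seed::Int = 314)
    rng = MersenneTwister(seed)
    lexemes = unique(reduce(vcat, words))
    # One simulated semantic vector per base or inflectional lexeme.
    L = Dict(lex => sd .* randn(rng, ncol) for lex in lexemes)
    S = zeros(length(words), ncol)
    for (row, lexs) in enumerate(words)
        S[row, :] = sum(L[lex] for lex in lexs)
    end
    return S, L
end

# Each word form is described by its base lexeme and its inflectional lexemes.
words = [["WALK", "PAST"], ["WALK", "PRESENT"]]
S, L = simulate_S_sketch(words)
```

Because both word forms share the vector for `WALK`, their rows in `S` differ only by the difference between the two tense vectors.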

Examples

# basic usage
 S_train = JudiLing.make_S_matrix(
     french,
     ["Lexeme"],
@@ -51,7 +51,7 @@
     sd_base=4,
     sd_inflection=4,
     sd_noise=1,
-    ...)
source
JudiLing.make_S_matrixMethod
make_S_matrix(data_train::DataFrame, data_val::DataFrame, base::Vector, inflections::Vector)

Create a simulated semantic matrix for the validation dataset, given the input data, a vector specifying the context lexemes, and a vector specifying the grammatical lexemes. The semantic vector of a word form is constructed by summing the semantic vectors of its content and grammatical lexemes.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset
  • base::Vector: context lexemes
  • inflections::Vector: grammatic lexemes

Optional Arguments

  • ncol::Int64=200: dimension of semantic vectors, usually the same as that of cue vectors
  • sd_base_mean::Int64=1: the sd mean of base features
  • sd_inflection_mean::Int64=1: the sd mean of inflectional features
  • sd_base::Int64=4: the sd of base features
  • sd_inflection::Int64=4: the sd of inflectional features
  • seed::Int64=314: the random seed
  • isdeep::Bool=true: if true, mean of each feature is also randomized
  • add_noise::Bool=true: if true, add additional Gaussian noise
  • sd_noise::Int64=1: the sd of the Gaussian noise
  • normalized::Bool=false: if true, most values fall between -1 and 1; some may slightly exceed this range, depending on the sd

Examples

# basic usage
+    ...)
source
JudiLing.make_S_matrixMethod
make_S_matrix(data_train::DataFrame, data_val::DataFrame, base::Vector, inflections::Vector)

Create a simulated semantic matrix for the validation dataset, given the input data, a vector specifying the context lexemes, and a vector specifying the grammatical lexemes. The semantic vector of a word form is constructed by summing the semantic vectors of its content and grammatical lexemes.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset
  • base::Vector: context lexemes
  • inflections::Vector: grammatic lexemes

Optional Arguments

  • ncol::Int64=200: dimension of semantic vectors, usually the same as that of cue vectors
  • sd_base_mean::Int64=1: the sd mean of base features
  • sd_inflection_mean::Int64=1: the sd mean of inflectional features
  • sd_base::Int64=4: the sd of base features
  • sd_inflection::Int64=4: the sd of inflectional features
  • seed::Int64=314: the random seed
  • isdeep::Bool=true: if true, mean of each feature is also randomized
  • add_noise::Bool=true: if true, add additional Gaussian noise
  • sd_noise::Int64=1: the sd of the Gaussian noise
  • normalized::Bool=false: if true, most values fall between -1 and 1; some may slightly exceed this range, depending on the sd

Examples

# basic usage
 S_train, S_val = JudiLing.make_S_matrix(
     french,
     french_val,
@@ -88,7 +88,7 @@
     sd_base=4,
     sd_inflection=4,
     sd_noise=1,
-    ...)
source
JudiLing.make_S_matrixMethod
make_S_matrix(data::DataFrame, base::Vector)

Create a simulated semantic matrix for the training dataset with only base features, given the input data and a vector specifying the context lexemes. The semantic vector of a word form is constructed by summing the semantic vectors of its content lexemes.

Obligatory Arguments

  • data::DataFrame: the dataset
  • base::Vector: context lexemes

Optional Arguments

  • ncol::Int64=200: dimension of semantic vectors, usually the same as that of cue vectors
  • sd_base_mean::Int64=1: the sd mean of base features
  • sd_base::Int64=4: the sd of base features
  • seed::Int64=314: the random seed
  • isdeep::Bool=true: if true, mean of each feature is also randomized
  • add_noise::Bool=true: if true, add additional Gaussian noise
  • sd_noise::Int64=1: the sd of the Gaussian noise
  • normalized::Bool=false: if true, most values fall between -1 and 1; some may slightly exceed this range, depending on the sd

Examples

# basic usage
+    ...)
source
JudiLing.make_S_matrixMethod
make_S_matrix(data::DataFrame, base::Vector)

Create a simulated semantic matrix for the training dataset with only base features, given the input data and a vector specifying the context lexemes. The semantic vector of a word form is constructed by summing the semantic vectors of its content lexemes.

Obligatory Arguments

  • data::DataFrame: the dataset
  • base::Vector: context lexemes

Optional Arguments

  • ncol::Int64=200: dimension of semantic vectors, usually the same as that of cue vectors
  • sd_base_mean::Int64=1: the sd mean of base features
  • sd_base::Int64=4: the sd of base features
  • seed::Int64=314: the random seed
  • isdeep::Bool=true: if true, mean of each feature is also randomized
  • add_noise::Bool=true: if true, add additional Gaussian noise
  • sd_noise::Int64=1: the sd of the Gaussian noise
  • normalized::Bool=false: if true, most values fall between -1 and 1; some may slightly exceed this range, depending on the sd

Examples

# basic usage
 S_train = JudiLing.make_S_matrix(
     french,
     ["Lexeme"],
@@ -123,7 +123,7 @@
     sd_base=4,
     sd_inflection=4,
     sd_noise=1,
-    ...)
source
JudiLing.make_S_matrixMethod
make_S_matrix(data_train::DataFrame, data_val::DataFrame, base::Vector)

Create a simulated semantic matrix for the validation dataset with only base features, given the input data and a vector specifying the context lexemes. The semantic vector of a word form is constructed by summing the semantic vectors of its content lexemes.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset
  • base::Vector: context lexemes

Optional Arguments

  • ncol::Int64=200: dimension of semantic vectors, usually the same as that of cue vectors
  • sd_base_mean::Int64=1: the sd mean of base features
  • sd_base::Int64=4: the sd of base features
  • seed::Int64=314: the random seed
  • isdeep::Bool=true: if true, mean of each feature is also randomized
  • add_noise::Bool=true: if true, add additional Gaussian noise
  • sd_noise::Int64=1: the sd of the Gaussian noise
  • normalized::Bool=false: if true, most values fall between -1 and 1; some may slightly exceed this range, depending on the sd

Examples

# basic usage
+    ...)
source
JudiLing.make_S_matrixMethod
make_S_matrix(data_train::DataFrame, data_val::DataFrame, base::Vector)

Create a simulated semantic matrix for the validation dataset with only base features, given the input data and a vector specifying the context lexemes. The semantic vector of a word form is constructed by summing the semantic vectors of its content lexemes.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset
  • base::Vector: context lexemes

Optional Arguments

  • ncol::Int64=200: dimension of semantic vectors, usually the same as that of cue vectors
  • sd_base_mean::Int64=1: the sd mean of base features
  • sd_base::Int64=4: the sd of base features
  • seed::Int64=314: the random seed
  • isdeep::Bool=true: if true, mean of each feature is also randomized
  • add_noise::Bool=true: if true, add additional Gaussian noise
  • sd_noise::Int64=1: the sd of the Gaussian noise
  • normalized::Bool=false: if true, most values fall between -1 and 1; some may slightly exceed this range, depending on the sd

Examples

# basic usage
 S_train, S_val = JudiLing.make_S_matrix(
     french,
     french_val,
@@ -159,7 +159,7 @@
     sd_base=4,
     sd_inflection=4,
     sd_noise=1,
-    ...)
source
JudiLing.make_S_matrixMethod
make_S_matrix(data_train::DataFrame, base::Vector, inflections::Vector, L::L_Matrix_Struct)

Create simulated semantic matrix where lexome matrix is available.

Obligatory Arguments

  • data::DataFrame: the dataset
  • base::Vector: context lexemes
  • inflections::Vector: grammatic lexemes
  • L::L_Matrix_Struct: the lexome matrix

Optional Arguments

  • add_noise::Bool=true: if true, add additional Gaussian noise
  • sd_noise::Int64=1: the sd of the Gaussian noise
  • normalized::Bool=false: if true, most values fall between -1 and 1; some may slightly exceed this range, depending on the sd

Examples

# basic usage
+    ...)
source
JudiLing.make_S_matrixMethod
make_S_matrix(data_train::DataFrame, base::Vector, inflections::Vector, L::L_Matrix_Struct)

Create simulated semantic matrix where lexome matrix is available.

Obligatory Arguments

  • data::DataFrame: the dataset
  • base::Vector: context lexemes
  • inflections::Vector: grammatic lexemes
  • L::L_Matrix_Struct: the lexome matrix

Optional Arguments

  • add_noise::Bool=true: if true, add additional Gaussian noise
  • sd_noise::Int64=1: the sd of the Gaussian noise
  • normalized::Bool=false: if true, most values fall between -1 and 1; some may slightly exceed this range, depending on the sd

Examples

# basic usage
 S1 = JudiLing.make_S_matrix(
     latin,
     ["Lexeme"],
@@ -168,7 +168,7 @@
      add_noise=true,
     sd_noise=1,
     normalized=false
-    )
source
JudiLing.make_S_matrixMethod
make_S_matrix(data_train::DataFrame, data_val::Union{DataFrame, Nothing}, base::Vector, L::L_Matrix_Struct)

Create simulated semantic matrix where lexome matrix is available.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset
  • base::Vector: context lexemes
  • L::L_Matrix_Struct: the lexome matrix

Optional Arguments

  • add_noise::Bool=true: if true, add additional Gaussian noise
  • sd_noise::Int64=1: the sd of the Gaussian noise
  • normalized::Bool=false: if true, most values fall between -1 and 1; some may slightly exceed this range, depending on the sd

Examples

# basic usage
+    )
source
JudiLing.make_S_matrixMethod
make_S_matrix(data_train::DataFrame, data_val::Union{DataFrame, Nothing}, base::Vector, L::L_Matrix_Struct)

Create simulated semantic matrix where lexome matrix is available.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset
  • base::Vector: context lexemes
  • L::L_Matrix_Struct: the lexome matrix

Optional Arguments

  • add_noise::Bool=true: if true, add additional Gaussian noise
  • sd_noise::Int64=1: the sd of the Gaussian noise
  • normalized::Bool=false: if true, most values fall between -1 and 1; some may slightly exceed this range, depending on the sd

Examples

# basic usage
 S1, S2 = JudiLing.make_S_matrix(
      latin,
     latin_val,
@@ -177,7 +177,7 @@
     add_noise=true,
     sd_noise=1,
     normalized=false
-    )
source
JudiLing.make_S_matrixMethod
make_S_matrix(data::DataFrame, base::Vector, L::L_Matrix_Struct)

Create simulated semantic matrix where lexome matrix is available.

Obligatory Arguments

  • data::DataFrame: the dataset
  • base::Vector: context lexemes
  • L::L_Matrix_Struct: the lexome matrix

Optional Arguments

  • add_noise::Bool=true: if true, add additional Gaussian noise
  • sd_noise::Int64=1: the sd of the Gaussian noise
  • normalized::Bool=false: if true, most values fall between -1 and 1; some may slightly exceed this range, depending on the sd

Examples

# basic usage
+    )
source
JudiLing.make_S_matrixMethod
make_S_matrix(data::DataFrame, base::Vector, L::L_Matrix_Struct)

Create simulated semantic matrix where lexome matrix is available.

Obligatory Arguments

  • data::DataFrame: the dataset
  • base::Vector: context lexemes
  • L::L_Matrix_Struct: the lexome matrix

Optional Arguments

  • add_noise::Bool=true: if true, add additional Gaussian noise
  • sd_noise::Int64=1: the sd of the Gaussian noise
  • normalized::Bool=false: if true, most values fall between -1 and 1; some may slightly exceed this range, depending on the sd

Examples

# basic usage
 S1 = JudiLing.make_S_matrix(
     latin,
     ["Lexeme"],
@@ -185,7 +185,7 @@
     add_noise=true,
     sd_noise=1,
     normalized=false
-    )
source
JudiLing.make_S_matrixMethod
make_S_matrix(data_train::DataFrame, data_val::DataFrame, base::Vector, inflections::Vector, L::L_Matrix_Struct)

Create simulated semantic matrix where lexome matrix is available.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset
  • base::Vector: context lexemes
  • inflections::Vector: grammatic lexemes
  • L::L_Matrix_Struct: the lexome matrix

Optional Arguments

  • add_noise::Bool=true: if true, add additional Gaussian noise
  • sd_noise::Int64=1: the sd of the Gaussian noise
  • normalized::Bool=false: if true, most values fall between -1 and 1; some may slightly exceed this range, depending on the sd

Examples

# basic usage
+    )
source
JudiLing.make_S_matrixMethod
make_S_matrix(data_train::DataFrame, data_val::DataFrame, base::Vector, inflections::Vector, L::L_Matrix_Struct)

Create simulated semantic matrix where lexome matrix is available.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset
  • base::Vector: context lexemes
  • inflections::Vector: grammatic lexemes
  • L::L_Matrix_Struct: the lexome matrix

Optional Arguments

  • add_noise::Bool=true: if true, add additional Gaussian noise
  • sd_noise::Int64=1: the sd of the Gaussian noise
  • normalized::Bool=false: if true, most values fall between -1 and 1; some may slightly exceed this range, depending on the sd

Examples

# basic usage
 S1, S2 = JudiLing.make_S_matrix(
     latin,
     latin_val,
@@ -195,51 +195,51 @@
     add_noise=true,
     sd_noise=1,
     normalized=false
-    )
source
JudiLing.make_L_matrixMethod
make_L_matrix(data::DataFrame, base::Vector)

Create Lexome Matrix with simulated semantic vectors where there are only base features.

Obligatory Arguments

  • data::DataFrame: the dataset
  • base::Vector: context lexemes

Optional Arguments

  • ncol::Int64=200: dimension of semantic vectors, usually the same as that of cue vectors
  • sd_base_mean::Int64=1: the sd mean of base features
  • sd_base::Int64=4: the sd of base features
  • seed::Int64=314: the random seed
  • isdeep::Bool=true: if true, mean of each feature is also randomized

Examples

# basic usage
+    )
source
JudiLing.make_L_matrixMethod
make_L_matrix(data::DataFrame, base::Vector)

Create Lexome Matrix with simulated semantic vectors where there are only base features.

Obligatory Arguments

  • data::DataFrame: the dataset
  • base::Vector: context lexemes

Optional Arguments

  • ncol::Int64=200: dimension of semantic vectors, usually the same as that of cue vectors
  • sd_base_mean::Int64=1: the sd mean of base features
  • sd_base::Int64=4: the sd of base features
  • seed::Int64=314: the random seed
  • isdeep::Bool=true: if true, mean of each feature is also randomized

Examples

# basic usage
 L = JudiLing.make_L_matrix(
     latin,
     ["Lexeme"],
-    ncol=200)
source
JudiLing.make_combined_S_matrixMethod
make_combined_S_matrix(data_train::DataFrame, data_val::DataFrame, base::Vector, inflections::Vector, L::L_Matrix_Struct)

Create simulated semantic matrix for the training datasets and validation datasets with existing Lexome matrix, where features are combined from both training datasets and validation datasets.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset
  • base::Vector: context lexemes
  • inflections::Vector: grammatic lexemes
  • L::L_Matrix_Struct: the Lexome Matrix

Optional Arguments

  • add_noise::Bool=true: if true, add additional Gaussian noise
  • sd_noise::Int64=1: the sd of the Gaussian noise
  • normalized::Bool=false: if true, most values fall between -1 and 1; some may slightly exceed this range, depending on the sd

Examples

# basic usage
+    ncol=200)
source
JudiLing.make_combined_S_matrixMethod
make_combined_S_matrix(data_train::DataFrame, data_val::DataFrame, base::Vector, inflections::Vector, L::L_Matrix_Struct)

Create simulated semantic matrix for the training datasets and validation datasets with existing Lexome matrix, where features are combined from both training datasets and validation datasets.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset
  • base::Vector: context lexemes
  • inflections::Vector: grammatic lexemes
  • L::L_Matrix_Struct: the Lexome Matrix

Optional Arguments

  • add_noise::Bool=true: if true, add additional Gaussian noise
  • sd_noise::Int64=1: the sd of the Gaussian noise
  • normalized::Bool=false: if true, most values fall between -1 and 1; some may slightly exceed this range, depending on the sd

Examples

# basic usage
 S_train, S_val = JudiLing.make_combined_S_matrix(
     latin_train,
     latin_val,
     ["Lexeme"],
     ["Person","Number","Tense","Voice","Mood"],
-    L)
source
JudiLing.make_combined_S_matrixMethod
make_combined_S_matrix(data_train::DataFrame, data_val::Union{DataFrame, Nothing}, base::Vector, L::L_Matrix_Struct)

Create simulated semantic matrix for the training datasets and validation datasets with existing Lexome matrix, where features are combined from both training datasets and validation datasets.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset
  • base::Vector: context lexemes
  • L::L_Matrix_Struct: the Lexome Matrix

Optional Arguments

  • add_noise::Bool=true: if true, add additional Gaussian noise
  • sd_noise::Int64=1: the sd of the Gaussian noise
  • normalized::Bool=false: if true, most values fall between -1 and 1; some may slightly exceed this range, depending on the sd

Examples

# basic usage
+    L)
source
JudiLing.make_combined_S_matrixMethod
make_combined_S_matrix(data_train::DataFrame, data_val::Union{DataFrame, Nothing}, base::Vector, L::L_Matrix_Struct)

Create simulated semantic matrix for the training datasets and validation datasets with existing Lexome matrix, where features are combined from both training datasets and validation datasets.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset
  • base::Vector: context lexemes
  • L::L_Matrix_Struct: the Lexome Matrix

Optional Arguments

  • add_noise::Bool=true: if true, add additional Gaussian noise
  • sd_noise::Int64=1: the sd of the Gaussian noise
  • normalized::Bool=false: if true, most values fall between -1 and 1; some may slightly exceed this range depending on the sd

Examples

# basic usage
 S_train, S_val = JudiLing.make_combined_S_matrix(
     latin_train,
     latin_val,
     ["Lexeme"],
     ["Person","Number","Tense","Voice","Mood"],
-    L)
source
JudiLing.make_combined_S_matrixMethod
make_combined_S_matrix(  data_train::DataFrame, data_val::DataFrame, base::Vector, inflections::Vector)

Create simulated semantic matrix for the training datasets and validation datasets, where features are combined from both training datasets and validation datasets.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset
  • base::Vector: context lexemes
  • inflections::Vector: grammatic lexemes

Optional Arguments

  • ncol::Int64=200: dimension of semantic vectors, usually the same as that of cue vectors
  • sd_base_mean::Int64=1: the sd mean of base features
  • sd_inflection_mean::Int64=1: the sd mean of inflectional features
  • sd_base::Int64=4: the sd of base features
  • sd_inflection::Int64=4: the sd of inflectional features
  • seed::Int64=314: the random seed
  • isdeep::Bool=true: if true, mean of each feature is also randomized
  • add_noise::Bool=true: if true, add additional Gaussian noise
  • sd_noise::Int64=1: the sd of the Gaussian noise
  • normalized::Bool=false: if true, most values fall between -1 and 1; some may slightly exceed this range depending on the sd

Examples

# basic usage
+    L)
source
JudiLing.make_combined_S_matrixMethod
make_combined_S_matrix(  data_train::DataFrame, data_val::DataFrame, base::Vector, inflections::Vector)

Create simulated semantic matrix for the training datasets and validation datasets, where features are combined from both training datasets and validation datasets.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset
  • base::Vector: context lexemes
  • inflections::Vector: grammatic lexemes

Optional Arguments

  • ncol::Int64=200: dimension of semantic vectors, usually the same as that of cue vectors
  • sd_base_mean::Int64=1: the sd mean of base features
  • sd_inflection_mean::Int64=1: the sd mean of inflectional features
  • sd_base::Int64=4: the sd of base features
  • sd_inflection::Int64=4: the sd of inflectional features
  • seed::Int64=314: the random seed
  • isdeep::Bool=true: if true, mean of each feature is also randomized
  • add_noise::Bool=true: if true, add additional Gaussian noise
  • sd_noise::Int64=1: the sd of the Gaussian noise
  • normalized::Bool=false: if true, most values fall between -1 and 1; some may slightly exceed this range depending on the sd

Examples

# basic usage
 S_train, S_val = JudiLing.make_combined_S_matrix(
     latin_train,
     latin_val,
     ["Lexeme"],
     ["Person","Number","Tense","Voice","Mood"],
-    ncol=n_features)
source
JudiLing.make_combined_S_matrixMethod
make_combined_S_matrix(data_train::DataFrame, data_val::DataFrame, base::Vector)

Create simulated semantic matrix for the training datasets and validation datasets, where features are combined from both training datasets and validation datasets.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset
  • base::Vector: context lexemes

Optional Arguments

  • ncol::Int64=200: dimension of semantic vectors, usually the same as that of cue vectors
  • sd_base_mean::Int64=1: the sd mean of base features
  • sd_inflection_mean::Int64=1: the sd mean of inflectional features
  • sd_base::Int64=4: the sd of base features
  • sd_inflection::Int64=4: the sd of inflectional features
  • seed::Int64=314: the random seed
  • isdeep::Bool=true: if true, mean of each feature is also randomized
  • add_noise::Bool=true: if true, add additional Gaussian noise
  • sd_noise::Int64=1: the sd of the Gaussian noise
  • normalized::Bool=false: if true, most values fall between -1 and 1; some may slightly exceed this range depending on the sd

Examples

# basic usage
+    ncol=n_features)
source
JudiLing.make_combined_S_matrixMethod
make_combined_S_matrix(data_train::DataFrame, data_val::DataFrame, base::Vector)

Create simulated semantic matrix for the training datasets and validation datasets, where features are combined from both training datasets and validation datasets.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset
  • base::Vector: context lexemes

Optional Arguments

  • ncol::Int64=200: dimension of semantic vectors, usually the same as that of cue vectors
  • sd_base_mean::Int64=1: the sd mean of base features
  • sd_inflection_mean::Int64=1: the sd mean of inflectional features
  • sd_base::Int64=4: the sd of base features
  • sd_inflection::Int64=4: the sd of inflectional features
  • seed::Int64=314: the random seed
  • isdeep::Bool=true: if true, mean of each feature is also randomized
  • add_noise::Bool=true: if true, add additional Gaussian noise
  • sd_noise::Int64=1: the sd of the Gaussian noise
  • normalized::Bool=false: if true, most values fall between -1 and 1; some may slightly exceed this range depending on the sd

Examples

# basic usage
 S_train, S_val = JudiLing.make_combined_S_matrix(
     latin_train,
     latin_val,
     ["Lexeme"],
     ["Person","Number","Tense","Voice","Mood"],
-    ncol=n_features)
source
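The relation between a lexome matrix and a simulated semantic matrix can be illustrated with a small sketch. It assumes that a word's semantic vector is the sum of the lexome vectors of its base and inflectional features, with optional Gaussian noise controlled by `add_noise` and `sd_noise`. The names `simulate_S`, `f2i`, and `words` are hypothetical and not part of the JudiLing API; sharing one `f2i` across words mirrors the idea of combining features from the training and validation datasets.

```julia
using Random

# Hypothetical toy setup: each feature value owns one row of a lexome matrix L.
f2i = Dict("vocare" => 1, "first" => 2, "sg" => 3)   # feature => row index in L
ncol = 4
L = randn(MersenneTwister(314), length(f2i), ncol)   # simulated lexome vectors

# Each word is described by the features it realizes.
words = [["vocare", "first", "sg"], ["vocare", "sg"]]

# Assumption: a semantic vector is the sum of the word's feature vectors,
# optionally perturbed by Gaussian noise with standard deviation sd_noise.
function simulate_S(L, f2i, words; add_noise=true, sd_noise=1,
                    rng=Random.default_rng())
    S = zeros(length(words), size(L, 2))
    for (i, feats) in enumerate(words)
        for f in feats
            S[i, :] .+= L[f2i[f], :]
        end
    end
    if add_noise
        S .+= sd_noise .* randn(rng, size(S)...)
    end
    return S
end

S = simulate_S(L, f2i, words; add_noise=false)
```

With `add_noise=false`, each row of `S` equals the plain sum of the corresponding rows of `L`, which makes the sketch easy to check by hand.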
JudiLing.make_combined_L_matrixMethod
make_combined_L_matrix(data_train::DataFrame, data_val::DataFrame, base::Vector, inflections::Vector)

Create Lexome Matrix with simulated semantic vectors, where features are combined from both training datasets and validation datasets.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset
  • base::Vector: context lexemes
  • inflections::Vector: grammatic lexemes

Optional Arguments

  • ncol::Int64=200: dimension of semantic vectors, usually the same as that of cue vectors
  • sd_base_mean::Int64=1: the sd mean of base features
  • sd_inflection_mean::Int64=1: the sd mean of inflectional features
  • sd_base::Int64=4: the sd of base features
  • sd_inflection::Int64=4: the sd of inflectional features
  • seed::Int64=314: the random seed
  • isdeep::Bool=true: if true, mean of each feature is also randomized

Examples

# basic usage
+    ncol=n_features)
source
JudiLing.make_combined_L_matrixMethod
make_combined_L_matrix(data_train::DataFrame, data_val::DataFrame, base::Vector, inflections::Vector)

Create Lexome Matrix with simulated semantic vectors, where features are combined from both training datasets and validation datasets.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset
  • base::Vector: context lexemes
  • inflections::Vector: grammatic lexemes

Optional Arguments

  • ncol::Int64=200: dimension of semantic vectors, usually the same as that of cue vectors
  • sd_base_mean::Int64=1: the sd mean of base features
  • sd_inflection_mean::Int64=1: the sd mean of inflectional features
  • sd_base::Int64=4: the sd of base features
  • sd_inflection::Int64=4: the sd of inflectional features
  • seed::Int64=314: the random seed
  • isdeep::Bool=true: if true, mean of each feature is also randomized

Examples

# basic usage
 L = JudiLing.make_combined_L_matrix(
     latin_train,
     latin_val,
     ["Lexeme"],
     ["Person","Number","Tense","Voice","Mood"],
-    ncol=n_features)
source
JudiLing.make_combined_L_matrixMethod
make_combined_L_matrix(data_train::DataFrame, data_val::DataFrame, base::Vector)

Create Lexome Matrix with simulated semantic vectors, where features are combined from both training datasets and validation datasets.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset
  • base::Vector: context lexemes

Optional Arguments

  • ncol::Int64=200: dimension of semantic vectors, usually the same as that of cue vectors
  • sd_base_mean::Int64=1: the sd mean of base features
  • sd_inflection_mean::Int64=1: the sd mean of inflectional features
  • sd_base::Int64=4: the sd of base features
  • sd_inflection::Int64=4: the sd of inflectional features
  • seed::Int64=314: the random seed
  • isdeep::Bool=true: if true, mean of each feature is also randomized

Examples

# basic usage
+    ncol=n_features)
source
JudiLing.make_combined_L_matrixMethod
make_combined_L_matrix(data_train::DataFrame, data_val::DataFrame, base::Vector)

Create Lexome Matrix with simulated semantic vectors, where features are combined from both training datasets and validation datasets.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset
  • base::Vector: context lexemes

Optional Arguments

  • ncol::Int64=200: dimension of semantic vectors, usually the same as that of cue vectors
  • sd_base_mean::Int64=1: the sd mean of base features
  • sd_inflection_mean::Int64=1: the sd mean of inflectional features
  • sd_base::Int64=4: the sd of base features
  • sd_inflection::Int64=4: the sd of inflectional features
  • seed::Int64=314: the random seed
  • isdeep::Bool=true: if true, mean of each feature is also randomized

Examples

# basic usage
 L = JudiLing.make_combined_L_matrix(
     latin_train,
     latin_val,
     ["Lexeme"],
-    ncol=n_features)
source
JudiLing.L_Matrix_StructMethod
L_Matrix_Struct(L, sd_base, sd_base_mean, sd_inflection, sd_inflection_mean, base_f, infl_f, base_f2i, infl_f2i, n_base_f, n_infl_f, ncol)

Construct L_Matrix_Struct with deep mode.

source
JudiLing.L_Matrix_StructMethod
L_Matrix_Struct(L, sd_base, sd_inflection, base_f, infl_f, base_f2i, infl_f2i, n_base_f, n_infl_f, ncol)

Construct L_Matrix_Struct without deep mode.

source

Load from word2vec, fasttext or similar

JudiLing.L_Matrix_StructMethod
L_Matrix_Struct(L, sd_base, sd_base_mean, sd_inflection, sd_inflection_mean, base_f, infl_f, base_f2i, infl_f2i, n_base_f, n_infl_f, ncol)

Construct L_Matrix_Struct with deep mode.

source
JudiLing.L_Matrix_StructMethod
L_Matrix_Struct(L, sd_base, sd_inflection, base_f, infl_f, base_f2i, infl_f2i, n_base_f, n_infl_f, ncol)

Construct L_Matrix_Struct without deep mode.

source

Load from word2vec, fasttext or similar

JudiLing.load_S_matrix_from_fasttextMethod
load_S_matrix_from_fasttext(data::DataFrame,
                                  language::Symbol;
                                  target_col=:Word,
                                  default_file::Int=1)

Load semantic matrix from fasttext, loaded using the Embeddings.jl package. Subset fasttext vectors to include only words in target_col of data, and subset data to only include words in target_col for which semantic vector is available.

The last parameter, default_file, specifies which vectors are loaded. To learn about all available vectors, use the following commands:

using Embeddings
 language_files(FastText_Text{:nl})

replacing the language code (here :nl) with the language you are interested in. In general, for all languages other than English, these files are available:

  • default_file=1 loads from https://fasttext.cc/docs/en/crawl-vectors.html. Paper: E. Grave, P. Bojanowski, P. Gupta, A. Joulin, T. Mikolov, Learning Word Vectors for 157 Languages. License: CC BY-SA 3.0 https://creativecommons.org/licenses/by-sa/3.0/
  • default_file=2 loads from https://fasttext.cc/docs/en/pretrained-vectors.html. Paper: P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information. License: CC BY-SA 3.0 https://creativecommons.org/licenses/by-sa/3.0/

Obligatory Arguments

  • data::DataFrame: the dataset
  • language::Symbol: the language of the words in the dataset, officially an ISO 639-2 code (see https://github.com/JuliaText/Embeddings.jl/issues/34#issuecomment-782604523); in practice the codes appear to follow ISO 639-1, with ISO 639-2 used only where no ISO 639-1 code is available (see https://en.wikipedia.org/wiki/List_of_ISO_639-2_codes)

Optional Arguments

  • target_col=:Word: column with orthographic representation of words in data
  • default_file::Int=1: source of vectors, for more information see above and here: https://github.com/JuliaText/Embeddings.jl#loading-different-embeddings

Examples

# basic usage
-latin_small, S = JudiLing.load_S_matrix_from_fasttext(latin, :la, target_col=:Word)
source
JudiLing.load_S_matrix_from_fasttextMethod
load_S_matrix_from_fasttext(data_train::DataFrame,
                                  data_val::DataFrame,
                                  language::Symbol;
                                  target_col=:Word,
@@ -248,14 +248,14 @@
 latin_small_train, latin_small_val, S_train, S_val = JudiLing.load_S_matrix_from_fasttext(latin_train,
                                                       latin_val,
                                                       :la,
-                                                      target_col=:Word)
source
JudiLing.load_S_matrix_from_word2vec_fileMethod
load_S_matrix_from_word2vec_file(data::DataFrame,
                             filepath::String;
-                            target_col=:Word)

Load semantic matrix from word2vec filepath. Subset word2vec vectors to include only words in target_col of data, and subset data to only include words in target_col for which semantic vector is available. Returns subsetted data and semantic matrix.

Obligatory Arguments

  • data::DataFrame: the training dataset
  • filepath::String: path to file with word2vec vectors in .txt (not compressed in any way)

Optional Arguments

  • target_col=:Word: column with orthographic representation of words in data
source
JudiLing.load_S_matrix_from_word2vec_fileMethod
load_S_matrix_from_word2vec_file(data_train::DataFrame,
+                            target_col=:Word)

Load semantic matrix from word2vec filepath. Subset word2vec vectors to include only words in target_col of data, and subset data to only include words in target_col for which semantic vector is available. Returns subsetted data and semantic matrix.

Obligatory Arguments

  • data::DataFrame: the training dataset
  • filepath::String: path to file with word2vec vectors in .txt (not compressed in any way)

Optional Arguments

  • target_col=:Word: column with orthographic representation of words in data
source
JudiLing.load_S_matrix_from_word2vec_fileMethod
load_S_matrix_from_word2vec_file(data_train::DataFrame,
                             data_val::DataFrame,
                             filepath::String;
-                            target_col=:Word)

Load semantic matrix from word2vec filepath. Subset word2vec vectors to include only words in target_col of data_train and data_val, and subset data to only include words in target_col for which semantic vector is available. Returns subsetted train and val data and train and val semantic matrices.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset
  • filepath::String: path to file with word2vec vectors in .txt (not compressed in any way)

Optional Arguments

  • target_col=:Word: column with orthographic representation of words in data
source
JudiLing.load_S_matrix_from_fasttext_fileMethod
load_S_matrix_from_fasttext_file(data::DataFrame,
+                            target_col=:Word)

Load semantic matrix from word2vec filepath. Subset word2vec vectors to include only words in target_col of data_train and data_val, and subset data to only include words in target_col for which semantic vector is available. Returns subsetted train and val data and train and val semantic matrices.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset
  • filepath::String: path to file with word2vec vectors in .txt (not compressed in any way)

Optional Arguments

  • target_col=:Word: column with orthographic representation of words in data
source
JudiLing.load_S_matrix_from_fasttext_fileMethod
load_S_matrix_from_fasttext_file(data::DataFrame,
                             filepath::String;
-                            target_col=:Word)

Load semantic matrix from fasttext filepath. Subset fasttext vectors to include only words in target_col of data, and subset data to only include words in target_col for which semantic vector is available. Returns subsetted data and semantic matrix.

Obligatory Arguments

  • data::DataFrame: the training dataset
  • filepath::String: path to file with fasttext vectors in .txt or .vec (not compressed in any way)

Optional Arguments

  • target_col=:Word: column with orthographic representation of words in data
source
JudiLing.load_S_matrix_from_fasttext_fileMethod
load_S_matrix_from_fasttext_file(data_train::DataFrame,
+                            target_col=:Word)

Load semantic matrix from fasttext filepath. Subset fasttext vectors to include only words in target_col of data, and subset data to only include words in target_col for which semantic vector is available. Returns subsetted data and semantic matrix.

Obligatory Arguments

  • data::DataFrame: the training dataset
  • filepath::String: path to file with fasttext vectors in .txt or .vec (not compressed in any way)

Optional Arguments

  • target_col=:Word: column with orthographic representation of words in data
source
JudiLing.load_S_matrix_from_fasttext_fileMethod
load_S_matrix_from_fasttext_file(data_train::DataFrame,
                             data_val::DataFrame,
                             filepath::String;
-                            target_col=:Word)

Load semantic matrix from fasttext filepath. Subset fasttext vectors to include only words in target_col of data_train and data_val, and subset data to only include words in target_col for which semantic vector is available. Returns subsetted train and val data and train and val semantic matrices.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset
  • filepath::String: path to file with fasttext vectors in .txt (not compressed in any way)

Optional Arguments

  • target_col=:Word: column with orthographic representation of words in data
source

Utility functions

JudiLing.merge_f2iMethod
merge_f2i(base_f2i, infl_f2i, n_base_f, n_infl_f)

Merge base f2i dictionary and inflectional f2i dictionary.

source
JudiLing.make_StMethod
make_St(L, n, data, base, inflections)

Make S transpose matrix with inflections.

source
+ target_col=:Word)

Load semantic matrix from fasttext filepath. Subset fasttext vectors to include only words in target_col of data_train and data_val, and subset data to only include words in target_col for which semantic vector is available. Returns subsetted train and val data and train and val semantic matrices.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset
  • filepath::String: path to file with fasttext vectors in .txt (not compressed in any way)

Optional Arguments

  • target_col=:Word: column with orthographic representation of words in data
source
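The subsetting step described for the file loaders can be sketched as follows. This is a simplified illustration, not the package's implementation: it assumes the common word2vec/fasttext text layout (a header line with vocabulary size and dimension, then one word and its values per line) and uses a plain vector of words as a stand-in for the `target_col` of a DataFrame.

```julia
# Toy word2vec-style text: header "n_words dim", then one word per line.
raw = """
3 2
vocat 0.1 0.2
clamat 0.3 0.4
currit 0.5 0.6
"""

# Parse the text into a word => vector dictionary.
vectors = Dict{String,Vector{Float64}}()
for line in split(strip(raw), '\n')[2:end]   # skip the header line
    parts = split(line)
    vectors[String(parts[1])] = parse.(Float64, parts[2:end])
end

# Words of the dataset (stand-in for the target column of `data`).
words = ["vocat", "currit", "amat"]

# Keep only words with an available vector, then stack the vectors row-wise.
kept = filter(w -> haskey(vectors, w), words)
S = permutedims(hcat((vectors[w] for w in kept)...))
```

Here `kept` plays the role of the subsetted data and `S` the role of the semantic matrix: "amat" is dropped because the file has no vector for it, and "clamat" is dropped because it does not occur in the data.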

Utility functions

JudiLing.merge_f2iMethod
merge_f2i(base_f2i, infl_f2i, n_base_f, n_infl_f)

Merge base f2i dictionary and inflectional f2i dictionary.

source
JudiLing.make_StMethod
make_St(L, n, data, base, inflections)

Make S transpose matrix with inflections.

source
diff --git a/dev/man/make_yt_matrix/index.html b/dev/man/make_yt_matrix/index.html index e6c5844..19d4ee4 100644 --- a/dev/man/make_yt_matrix/index.html +++ b/dev/man/make_yt_matrix/index.html @@ -1,3 +1,3 @@ -Make Yt Matrix · JudiLing.jl

Make Yt Matrix

JudiLing.make_Yt_matrixMethod
make_Yt_matrix(t, data, f2i)

Make Yt matrix for timestep t. A given column of the Yt matrix specifies the support for the corresponding n-gram predicted for timestep t for each of the observations (rows of Yt).

Obligatory Arguments

  • t::Int64: the timestep t
  • data::DataFrame: the dataset
  • f2i::Dict: the dictionary returning indices given features

Optional Arguments

  • tokenized::Bool=false: if true, the dataset target is assumed to be tokenized
  • sep_token::Union{Nothing, String, Char}=nothing: separator token
  • verbose::Bool=false: if verbose, more information will be printed

Examples

latin = DataFrame(CSV.File(joinpath("data", "latin_mini.csv")))
-JudiLing.make_Yt_matrix(2, latin)
source
+Make Yt Matrix · JudiLing.jl

Make Yt Matrix

JudiLing.make_Yt_matrixMethod
make_Yt_matrix(t, data, f2i)

Make Yt matrix for timestep t. A given column of the Yt matrix specifies the support for the corresponding n-gram predicted for timestep t for each of the observations (rows of Yt).

Obligatory Arguments

  • t::Int64: the timestep t
  • data::DataFrame: the dataset
  • f2i::Dict: the dictionary returning indices given features

Optional Arguments

  • tokenized::Bool=false: if true, the dataset target is assumed to be tokenized
  • sep_token::Union{Nothing, String, Char}=nothing: separator token
  • verbose::Bool=false: if verbose, more information will be printed

Examples

latin = DataFrame(CSV.File(joinpath("data", "latin_mini.csv")))
+JudiLing.make_Yt_matrix(2, latin)
source
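As an illustration of what a Yt matrix encodes, the sketch below builds a toy binary version: row i marks which n-gram word i has at position t. It assumes trigram cues with "#" as boundary marker; the helpers `ngrams` and `toy_Yt` are hypothetical and only mimic the structure, not the actual implementation of `make_Yt_matrix`.

```julia
# Hypothetical helper: split a word into boundary-marked n-grams.
function ngrams(word::AbstractString; grams=3, boundary="#")
    chars = collect(boundary * word * boundary)
    return [join(chars[i:i+grams-1]) for i in 1:length(chars)-grams+1]
end

words = ["voco", "vocas"]

# f2i maps each n-gram (feature) to its column index.
all_grams = unique(vcat(ngrams.(words)...))
f2i = Dict(g => i for (i, g) in enumerate(all_grams))

# Row i has a 1 in the column of the n-gram found at timestep t of word i;
# the row stays all zeros if the word has fewer than t n-grams.
function toy_Yt(t, words, f2i)
    Yt = zeros(Int, length(words), length(f2i))
    for (i, w) in enumerate(words)
        gs = ngrams(w)
        if t <= length(gs)
            Yt[i, f2i[gs[t]]] = 1
        end
    end
    return Yt
end

Yt = toy_Yt(2, words, f2i)
```

Both toy words have "voc" as their second trigram, so the column for "voc" is the only non-zero column of `Yt` at timestep 2.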
diff --git a/dev/man/measures_func/index.html b/dev/man/measures_func/index.html index 7916f40..49e5504 100644 --- a/dev/man/measures_func/index.html +++ b/dev/man/measures_func/index.html @@ -71,4 +71,4 @@ Note: the `kargs` are just keyword arguments that are passed on from the parameters of `get_and_train_model` to the `measures_func`. For example, this could be a suffix that should be added to each added column in `measures_func`. ## Output -The function has to return the dataset. +The function has to return the dataset. diff --git a/dev/man/output/index.html b/dev/man/output/index.html index 047ddde..24da0bb 100644 --- a/dev/man/output/index.html +++ b/dev/man/output/index.html @@ -1,5 +1,5 @@ -Output · JudiLing.jl

Output

JudiLing.write2csvFunction

Write results into a csv file. This function takes as input the results from the learn_paths and build_paths functions, including the information on gold paths that is optionally returned as second output result.

source
JudiLing.write2dfFunction

Reformat results into a dataframe. This function takes as input the results from the learn_paths and build_paths functions, including the information on gold paths that is optionally returned as second output result.

source
JudiLing.write2csvMethod
write2csv(res, data, cue_obj_train, cue_obj_val, filename)

Write results into csv file for the results from learn_paths and build_paths.

Obligatory Arguments

  • res::Array{Array{Result_Path_Info_Struct,1},1}: the results from learn_paths or build_paths
  • data::DataFrame: the dataset
  • cue_obj_train::Cue_Matrix_Struct: the cue object for training dataset
  • cue_obj_val::Cue_Matrix_Struct: the cue object for validation dataset
  • filename::String: the filename

Optional Arguments

  • grams::Int64=3: the number n in n-gram cues
  • tokenized::Bool=false: if true, the dataset target is tokenized
  • sep_token::Union{Nothing, String, Char}=nothing: separator
  • start_end_token::Union{String, Char}="#": start and end token in boundary cues
  • output_sep_token::Union{String, Char}="": output separator
  • path_sep_token::Union{String, Char}=":": path separator
  • target_col::Union{String, Symbol}=:Words: the column name for target strings
  • root_dir::String=".": dir path for project root dir
  • output_dir::String=".": output dir inside root dir

Examples

# writing results for training data
+Output · JudiLing.jl

Output

JudiLing.write2csvFunction

Write results into a csv file. This function takes as input the results from the learn_paths and build_paths functions, including the information on gold paths that is optionally returned as second output result.

source
JudiLing.write2dfFunction

Reformat results into a dataframe. This function takes as input the results from the learn_paths and build_paths functions, including the information on gold paths that is optionally returned as second output result.

source
JudiLing.write2csvMethod
write2csv(res, data, cue_obj_train, cue_obj_val, filename)

Write results into csv file for the results from learn_paths and build_paths.

Obligatory Arguments

  • res::Array{Array{Result_Path_Info_Struct,1},1}: the results from learn_paths or build_paths
  • data::DataFrame: the dataset
  • cue_obj_train::Cue_Matrix_Struct: the cue object for training dataset
  • cue_obj_val::Cue_Matrix_Struct: the cue object for validation dataset
  • filename::String: the filename

Optional Arguments

  • grams::Int64=3: the number n in n-gram cues
  • tokenized::Bool=false: if true, the dataset target is tokenized
  • sep_token::Union{Nothing, String, Char}=nothing: separator
  • start_end_token::Union{String, Char}="#": start and end token in boundary cues
  • output_sep_token::Union{String, Char}="": output separator
  • path_sep_token::Union{String, Char}=":": path separator
  • target_col::Union{String, Symbol}=:Words: the column name for target strings
  • root_dir::String=".": dir path for project root dir
  • output_dir::String=".": output dir inside root dir

Examples

# writing results for training data
 JudiLing.write2csv(
     res_train,
     latin_train,
@@ -31,7 +31,7 @@
     path_sep_token=":",
     target_col=:Word,
     root_dir=".",
-    output_dir="test_out")
source
JudiLing.write2csvMethod
write2csv(gpi::Vector{Gold_Path_Info_Struct}, filename)

Write results into csv file for the gold paths' information optionally returned by learn_paths and build_paths.

Obligatory Arguments

  • gpi::Vector{Gold_Path_Info_Struct}: the gold paths' information
  • filename::String: the filename

Optional Arguments

  • root_dir::String=".": dir path for project root dir
  • output_dir::String=".": output dir inside root dir

Examples

# write gold standard paths to csv for training data
+    output_dir="test_out")
source
JudiLing.write2csvMethod
write2csv(gpi::Vector{Gold_Path_Info_Struct}, filename)

Write results into csv file for the gold paths' information optionally returned by learn_paths and build_paths.

Obligatory Arguments

  • gpi::Vector{Gold_Path_Info_Struct}: the gold paths' information
  • filename::String: the filename

Optional Arguments

  • root_dir::String=".": dir path for project root dir
  • output_dir::String=".": output dir inside root dir

Examples

# write gold standard paths to csv for training data
 JudiLing.write2csv(
     gpi_train,
     "gpi_latin_train.csv",
@@ -45,7 +45,7 @@
     "gpi_latin_val.csv",
     root_dir=".",
     output_dir="test_out"
-    )
source
JudiLing.write2csvMethod
write2csv(ts::Threshold_Stat_Struct, filename)

Write results into csv file for threshold and tolerance proportion for each timestep.

Obligatory Arguments

  • ts::Threshold_Stat_Struct: the threshold and tolerance proportion
  • filename::String: the filename

Optional Arguments

  • root_dir::String=".": dir path for project root dir
  • output_dir::String=".": output dir inside root dir

Examples

JudiLing.write2csv(ts, "ts.csv", root_dir = @__DIR__, output_dir="out")
source
JudiLing.write2dfMethod
write2df(res, data, cue_obj_train, cue_obj_val)

Reformat results into a dataframe for the results from learn_paths and build_paths functions.

Obligatory Arguments

  • res: output of learn_paths or build_paths
  • data::DataFrame: the dataset
  • cue_obj_train: cue object of the training data set
  • cue_obj_val: cue object of the validation data set

Optional Arguments

  • grams::Int64=3: the number n in n-gram cues
  • tokenized::Bool=false: if true, the dataset target is tokenized
  • sep_token::Union{Nothing, String, Char}=nothing: separator
  • start_end_token::Union{String, Char}="#": start and end token in boundary cues
  • output_sep_token::Union{String, Char}="": output separator
  • path_sep_token::Union{String, Char}=":": path separator
  • target_col::Union{String, Symbol}=:Words: the column name for target strings

Examples

# writing results for training data
+    )
source
JudiLing.write2csvMethod
write2csv(ts::Threshold_Stat_Struct, filename)

Write results into csv file for threshold and tolerance proportion for each timestep.

Obligatory Arguments

  • ts::Threshold_Stat_Struct: the threshold and tolerance proportion
  • filename::String: the filename

Optional Arguments

  • root_dir::String=".": dir path for project root dir
  • output_dir::String=".": output dir inside root dir

Examples

JudiLing.write2csv(ts, "ts.csv", root_dir = @__DIR__, output_dir="out")
source
JudiLing.write2dfMethod
write2df(res, data, cue_obj_train, cue_obj_val)

Reformat results into a dataframe for the results from learn_paths and build_paths functions.

Obligatory Arguments

  • res: output of learn_paths or build_paths
  • data::DataFrame: the dataset
  • cue_obj_train: cue object of the training data set
  • cue_obj_val: cue object of the validation data set

Optional Arguments

  • grams::Int64=3: the number n in n-gram cues
  • tokenized::Bool=false: if true, the dataset target is tokenized
  • sep_token::Union{Nothing, String, Char}=nothing: separator
  • start_end_token::Union{String, Char}="#": start and end token in boundary cues
  • output_sep_token::Union{String, Char}="": output separator
  • path_sep_token::Union{String, Char}=":": path separator
  • target_col::Union{String, Symbol}=:Words: the column name for target strings

Examples

# writing results for training data
 JudiLing.write2df(
     res_train,
     latin_train,
@@ -71,10 +71,10 @@
     start_end_token="#",
     output_sep_token="",
     path_sep_token=":",
-    target_col=:Word)
source
JudiLing.write2dfMethod
write2df(gpi::Vector{Gold_Path_Info_Struct})

Write results into a dataframe for the gold paths' information optionally returned by learn_paths and build_paths.

Obligatory Arguments

  • gpi::Vector{Gold_Path_Info_Struct}: the gold paths' information

Examples

# write gold standard paths to df for training data
+    target_col=:Word)
source
JudiLing.write2dfMethod
write2df(gpi::Vector{Gold_Path_Info_Struct})

Write results into a dataframe for the gold paths' information optionally returned by learn_paths and build_paths.

Obligatory Arguments

  • gpi::Vector{Gold_Path_Info_Struct}: the gold paths' information

Examples

# write gold standard paths to df for training data
 JudiLing.write2df(gpi_train)
 
 # write gold standard paths to df for validation data
-JudiLing.write2df(gpi_val)
source
JudiLing.write2dfMethod
write2df(ts::Threshold_Stat_Struct)

Write results into a dataframe for threshold and tolerance proportion for each timestep.

Obligatory Arguments

  • ts::Threshold_Stat_Struct: the threshold and tolerance proportion

Examples

JudiLing.write2df(ts)
source
JudiLing.write_comprehension_evalMethod
write_comprehension_eval(SChat, SC, data, target_col, filename)

Write comprehension evaluation into a CSV file, including target and predicted ids and identifiers and their correlations.

Obligatory Arguments

  • SChat::Matrix: the Shat/Chat matrix
  • SC::Matrix: the S/C matrix
  • data::DataFrame: the data
  • target_col::Symbol: the name of target column
  • filename::String: the filename/filepath

Optional Arguments

  • k: top k candidates
  • root_dir::String=".": dir path for project root dir
  • output_dir::String=".": output dir inside root dir

Examples

JudiLing.write_comprehension_eval(Chat, cue_obj.C, latin, :Word, "output.csv",
-    k=10, root_dir=@__DIR__, output_dir="out")
source
JudiLing.write_comprehension_evalMethod
write_comprehension_eval(SChat, SC, SC_rest, data, data_rest, target_col, filename)

Write the comprehension evaluation into a CSV file for both training and validation datasets, including target and predicted ids and identifiers and their correlations.

Obligatory Arguments

  • SChat::Matrix: the Shat/Chat matrix
  • SC::Matrix: the S/C matrix
  • SC_rest::Matrix: the rest S/C matrix
  • data::DataFrame: the data
  • data_rest::DataFrame: the rest data
  • target_col::Symbol: the name of target column
  • filename::String: the filename/filepath

Optional Arguments

  • k: top k candidates
  • root_dir::String=".": dir path for project root dir
  • output_dir::String=".": output dir inside root dir

Examples

JudiLing.write_comprehension_eval(Shat_val, S_val, S_train, latin_val, latin_train,
-    :Word, "all_output.csv", k=10, root_dir=@__DIR__, output_dir="out")
source
JudiLing.save_L_matrixMethod
save_L_matrix(L, filename)

Save lexome matrix into csv file.

Obligatory Arguments

  • L::L_Matrix_Struct: the lexome matrix struct
  • filename::String: the filename/filepath

Examples

JudiLing.save_L_matrix(L, joinpath(@__DIR__, "L.csv"))
source
JudiLing.load_L_matrixMethod
load_L_matrix(filename)

Load lexome matrix from csv file.

Obligatory Arguments

  • filename::String: the filename/filepath

Optional Arguments

  • header::Bool=false: header in csv

Examples

L_load = JudiLing.load_L_matrix(joinpath(@__DIR__, "L.csv"))
source
JudiLing.save_S_matrixMethod
save_S_matrix(S, filename, data, target_col)

Save S matrix into a csv file.

Obligatory Arguments

  • S::Matrix: the S matrix
  • filename::String: the filename/filepath
  • data::DataFrame: the data
  • target_col::Symbol: the name of target column

Optional Arguments

  • sep::String=" ": separator in CSV file

Examples

JudiLing.save_S_matrix(S, joinpath(@__DIR__, "S.csv"), latin, :Word)
source
JudiLing.load_S_matrixMethod
load_S_matrix(filename)

Load S matrix from a csv file.

Obligatory Arguments

  • filename::String: the filename/filepath

Optional Arguments

  • header::Bool=false: header in csv
  • sep::String=" ": separator in CSV file

Examples

JudiLing.load_S_matrix(joinpath(@__DIR__, "S.csv"))
source
+JudiLing.write2df(gpi_val)
source
JudiLing.write2dfMethod
write2df(ts::Threshold_Stat_Struct)

Write results into a dataframe for threshold and tolerance proportion for each timestep.

Obligatory Arguments

  • ts::Threshold_Stat_Struct: the threshold and tolerance proportion

Examples

JudiLing.write2df(ts)
source
JudiLing.write_comprehension_evalMethod
write_comprehension_eval(SChat, SC, data, target_col, filename)

Write the comprehension evaluation into a CSV file, including target and predicted ids and identifiers and their correlations.

Obligatory Arguments

  • SChat::Matrix: the Shat/Chat matrix
  • SC::Matrix: the S/C matrix
  • data::DataFrame: the data
  • target_col::Symbol: the name of target column
  • filename::String: the filename/filepath

Optional Arguments

  • k: top k candidates
  • root_dir::String=".": dir path for project root dir
  • output_dir::String=".": output dir inside root dir

Examples

JudiLing.write_comprehension_eval(Chat, cue_obj.C, latin, :Word, "output.csv",
+    k=10, root_dir=@__DIR__, output_dir="out")
source
JudiLing.write_comprehension_evalMethod
write_comprehension_eval(SChat, SC, SC_rest, data, data_rest, target_col, filename)

Write the comprehension evaluation into a CSV file for both training and validation datasets, including target and predicted ids and identifiers and their correlations.

Obligatory Arguments

  • SChat::Matrix: the Shat/Chat matrix
  • SC::Matrix: the S/C matrix
  • SC_rest::Matrix: the rest S/C matrix
  • data::DataFrame: the data
  • data_rest::DataFrame: the rest data
  • target_col::Symbol: the name of target column
  • filename::String: the filename/filepath

Optional Arguments

  • k: top k candidates
  • root_dir::String=".": dir path for project root dir
  • output_dir::String=".": output dir inside root dir

Examples

JudiLing.write_comprehension_eval(Shat_val, S_val, S_train, latin_val, latin_train,
+    :Word, "all_output.csv", k=10, root_dir=@__DIR__, output_dir="out")
source
JudiLing.save_L_matrixMethod
save_L_matrix(L, filename)

Save lexome matrix into csv file.

Obligatory Arguments

  • L::L_Matrix_Struct: the lexome matrix struct
  • filename::String: the filename/filepath

Examples

JudiLing.save_L_matrix(L, joinpath(@__DIR__, "L.csv"))
source
JudiLing.load_L_matrixMethod
load_L_matrix(filename)

Load lexome matrix from csv file.

Obligatory Arguments

  • filename::String: the filename/filepath

Optional Arguments

  • header::Bool=false: header in csv

Examples

L_load = JudiLing.load_L_matrix(joinpath(@__DIR__, "L.csv"))
source
JudiLing.save_S_matrixMethod
save_S_matrix(S, filename, data, target_col)

Save S matrix into a csv file.

Obligatory Arguments

  • S::Matrix: the S matrix
  • filename::String: the filename/filepath
  • data::DataFrame: the data
  • target_col::Symbol: the name of target column

Optional Arguments

  • sep::String=" ": separator in CSV file

Examples

JudiLing.save_S_matrix(S, joinpath(@__DIR__, "S.csv"), latin, :Word)
source
JudiLing.load_S_matrixMethod
load_S_matrix(filename)

Load S matrix from a csv file.

Obligatory Arguments

  • filename::String: the filename/filepath

Optional Arguments

  • header::Bool=false: header in csv
  • sep::String=" ": separator in CSV file

Examples

JudiLing.load_S_matrix(joinpath(@__DIR__, "S.csv"))
source
diff --git a/dev/man/pickle/index.html b/dev/man/pickle/index.html index 8b6183b..e292e03 100644 --- a/dev/man/pickle/index.html +++ b/dev/man/pickle/index.html @@ -1,2 +1,2 @@ -Pickle · JudiLing.jl
+Pickle · JudiLing.jl
diff --git a/dev/man/preprocess/index.html b/dev/man/preprocess/index.html index 848503e..d53d8a3 100644 --- a/dev/man/preprocess/index.html +++ b/dev/man/preprocess/index.html @@ -1,2 +1,2 @@ -Preprocess · JudiLing.jl

Preprocess

+Preprocess · JudiLing.jl

Preprocess

diff --git a/dev/man/pyndl/index.html b/dev/man/pyndl/index.html index 689b2fe..6bfcfa1 100644 --- a/dev/man/pyndl/index.html +++ b/dev/man/pyndl/index.html @@ -3,12 +3,12 @@ using JudiLing

Calling pyndl from JudiLing

JudiLing.Pyndl_Weight_StructType
Pyndl_Weight_Struct
     cues::Vector{String}
     outcomes::Vector{String}
-    weight::Matrix{Float64}
  • cues::Vector{String}: Vector of cues, in the order that they appear in the weight matrix.
  • outcomes::Vector{String}: Vector of outcomes, in the order that they appear in the weight matrix.
  • weight::Matrix{Float64}: Weight matrix.
source
JudiLing.pyndlMethod
pyndl(
+    weight::Matrix{Float64}
  • cues::Vector{String}: Vector of cues, in the order that they appear in the weight matrix.
  • outcomes::Vector{String}: Vector of outcomes, in the order that they appear in the weight matrix.
  • weight::Matrix{Float64}: Weight matrix.
source
JudiLing.pyndlMethod
pyndl(
     data_path::String;
     alpha::Float64 = 0.1,
     betas::Tuple{Float64,Float64} = (0.1, 0.1),
     method::String = "openmp"
-)

Compute weights using pyndl. See the documentation of pyndl for more information: https://pyndl.readthedocs.io/en/latest/

Obligatory arguments

  • data_path::String: Path to an events file as generated by pyndl's preprocess.create_event_file

Optional arguments

  • alpha::Float64 = 0.1: α learning rate.
  • betas::Tuple{Float64,Float64} = (0.1, 0.1): β1 and β2 learning rates
  • method::String = "openmp": One of {"openmp", "threading"}. "openmp" only works on Linux.

Example

weights = JudiLing.pyndl("data/latin_train_events.tab.gz")
source

Translating output of pyndl to cue and semantic matrices in JudiLing

With the weights in hand, the cue and semantic matrices can be computed:

JudiLing.make_cue_matrixMethod
make_cue_matrix(
+)

Compute weights using pyndl. See the documentation of pyndl for more information: https://pyndl.readthedocs.io/en/latest/

Obligatory arguments

  • data_path::String: Path to an events file as generated by pyndl's preprocess.create_event_file

Optional arguments

  • alpha::Float64 = 0.1: α learning rate.
  • betas::Tuple{Float64,Float64} = (0.1, 0.1): β1 and β2 learning rates
  • method::String = "openmp": One of {"openmp", "threading"}. "openmp" only works on Linux.

Example

weights = JudiLing.pyndl("data/latin_train_events.tab.gz")
source

Translating output of pyndl to cue and semantic matrices in JudiLing

With the weights in hand, the cue and semantic matrices can be computed:

JudiLing.make_cue_matrixMethod
make_cue_matrix(
     data::DataFrame,
     pyndl_weights::Pyndl_Weight_Struct;
     grams = 3,
@@ -21,7 +21,7 @@
 )

Make the cue matrix based on a dataframe and weights computed with pyndl. Practically this means that the cues are extracted from the weights object and translated to the JudiLing format.

Obligatory arguments

  • data::DataFrame: Dataset with all the word types on which the weights were trained.
  • pyndl_weights::Pyndl_Weight_Struct: Weights trained with JudiLing.pyndl

Optional arguments

  • grams = 3: N-gram size (has to match the n-gram granularity of the cues on which the weights were trained).
  • target_col = "Words": Column with target words.
  • tokenized = false: Whether the target words are already tokenized
  • sep_token = nothing: The string separating the tokens (only used if tokenized=true).
  • keep_sep = false: Whether the sep_token should be retained in the cues.
  • start_end_token = "#": The string with which to mark word boundaries.
  • verbose = false: Verbose mode.

Example

weights = JudiLing.pyndl("data/latin_train_events.tab.gz")
 cue_obj = JudiLing.make_cue_matrix(latin_train, weights,
                                     grams = 3,
-                                    target_col = "Word")
source
JudiLing.make_S_matrixMethod
make_S_matrix(
+                                    target_col = "Word")
source
JudiLing.make_S_matrixMethod
make_S_matrix(
     data::DataFrame,
     pyndl_weights::Pyndl_Weight_Struct,
     n_features_columns::Vector;
@@ -31,7 +31,7 @@
 S = JudiLing.make_S_matrix(data,
                             weights_latin,
                             ["Lexeme", "Person", "Number", "Tense", "Voice", "Mood"],
-                            tokenized=false)
source
JudiLing.make_S_matrixMethod
make_S_matrix(
+                            tokenized=false)
source
JudiLing.make_S_matrixMethod
make_S_matrix(
     data_train::DataFrame,
     data_val::DataFrame,
     pyndl_weights::Pyndl_Weight_Struct,
@@ -43,4 +43,4 @@
                             val,
                             weights_latin,
                             ["Lexeme", "Person", "Number", "Tense", "Voice", "Mood"],
-                            tokenized=false)
source
+ tokenized=false)source diff --git a/dev/man/test_combo/index.html b/dev/man/test_combo/index.html index 2ed9f1f..debc534 100644 --- a/dev/man/test_combo/index.html +++ b/dev/man/test_combo/index.html @@ -1,2 +1,2 @@ -Test Combo · JudiLing.jl

Test Combo

JudiLing.test_comboMethod
test_combo(test_mode;kwargs...)

A wrapper function for a full model for a specific combination of parameters. A detailed introduction is in Test Combo Introduction

Note

test_combo is deprecated. While it will remain in the package, it is no longer actively maintained.

Obligatory Arguments

  • test_mode::Symbol: which test mode, currently supports :train_only, :pre_split, :careful_split and :random_split.

Optional Arguments

  • train_sample_size::Int64=0: the desired number of training data
  • val_sample_size::Int64=0: the desired number of validation data
  • val_ratio::Float64=0.0: the desired proportion of validation data; it only takes effect if val_sample_size is 0.
  • extension::String=".csv": the extension for data nfeaturesinflections
  • n_grams_target_col::Union{String, Symbol}=:Word: the column name for target strings
  • n_grams_tokenized::Bool=false: if true, the dataset target is assumed to be tokenized
  • n_grams_sep_token::String=nothing: separator
  • grams::Int64=3: the number of grams for cues
  • n_grams_keep_sep::Bool=false: if true, keep separators in cues
  • start_end_token::String=":": start and end token in boundary cues
  • path_sep_token::String=":": path separator in the assembled path
  • random_seed::Int64=314: the random seed
  • sd_base_mean::Int64=1: the sd mean of base features
  • sd_inflection_mean::Int64=1: the sd mean of inflectional features
  • sd_base::Int64=4: the sd of base features
  • sd_inflection::Int64=4: the sd of inflectional features
  • isdeep::Bool=true: if true, mean of each feature is also randomized
  • add_noise::Bool=true: if true, add additional Gaussian noise
  • sd_noise::Int64=1: the sd of the Gaussian noise
  • normalized::Bool=false: if true, most values range between -1 and 1; they may slightly exceed these bounds depending on the sd
  • if_combined::Bool=false: if true, features are combined from both training and validation data
  • learn_mode::Symbol=:cholesky: which learning mode, currently supports :cholesky and :wh
  • method::Symbol=:additive: whether :additive or :multiplicative decomposition is required
  • shift::Float64=0.02: shift value for :additive decomposition
  • multiplier::Float64=1.01: multiplier value for :multiplicative decomposition
  • output_format::Symbol=:auto: force the output format to dense (:dense) or sparse (:sparse), or let the program decide (:auto)
  • sparse_ratio::Float64=0.05: the ratio to decide whether a matrix is sparse
  • wh_freq::Vector=nothing: the learning sequence
  • init_weights::Matrix=nothing: the initial weights
  • eta::Float64=0.1: the learning rate
  • n_epochs::Int64=1: the number of epochs to be trained
  • max_t::Int64=0: the maximum number of timesteps
  • A::Matrix=nothing: the adjacency matrix
  • A_mode::Symbol=:combined: the adjacency matrix mode, currently supports :combined or :train_only
  • max_can::Int64=10: the max number of candidate paths to keep in the output
  • threshold_train::Float64=0.1: the value set for the support such that if the support of an n-gram is higher than this value, the n-gram will be taken into consideration for training data
  • is_tolerant_train::Bool=false: if true, select a specified number (given by max_tolerance) of n-grams whose supports are below threshold but above a second tolerance threshold to be added to the path for training data
  • tolerance_train::Float64=-0.1: the value set for the second threshold (in tolerant mode) such that if the support for an n-gram is in between this value and the threshold and the max_tolerance number has not been reached, then allow this n-gram to be added to the path for training data
  • max_tolerance_train::Int64=2: maximum number of n-grams allowed in a path for training data
  • threshold_val::Float64=0.1: the value set for the support such that if the support of an n-gram is higher than this value, the n-gram will be taken into consideration for validation data
  • is_tolerant_val::Bool=false: if true, select a specified number (given by max_tolerance) of n-grams whose supports are below threshold but above a second tolerance threshold to be added to the path for validation data
  • tolerance_val::Float64=-0.1: the value set for the second threshold (in tolerant mode) such that if the support for an n-gram is in between this value and the threshold and the max_tolerance number has not been reached, then allow this n-gram to be added to the path for validation data
  • max_tolerance_val::Int64=2: maximum number of n-grams allowed in a path for validation data
  • n_neighbors_train::Int64=10: the top n form neighbors to be considered for training data
  • n_neighbors_val::Int64=20: the top n form neighbors to be considered for validation data
  • issparse::Bool=false: if true, keep sparse matrix format when learning paths
  • output_dir::String="out": the output directory
  • verbose::Bool=false: if true, more information will be printed
source
+Test Combo · JudiLing.jl

Test Combo

JudiLing.test_comboMethod
test_combo(test_mode;kwargs...)

A wrapper function for a full model for a specific combination of parameters. A detailed introduction is in Test Combo Introduction

Note

test_combo is deprecated. While it will remain in the package, it is no longer actively maintained.

Obligatory Arguments

  • test_mode::Symbol: which test mode, currently supports :train_only, :pre_split, :careful_split and :random_split.

Optional Arguments

  • train_sample_size::Int64=0: the desired number of training data
  • val_sample_size::Int64=0: the desired number of validation data
  • val_ratio::Float64=0.0: the desired proportion of validation data; it only takes effect if val_sample_size is 0.
  • extension::String=".csv": the extension for data nfeaturesinflections
  • n_grams_target_col::Union{String, Symbol}=:Word: the column name for target strings
  • n_grams_tokenized::Bool=false: if true, the dataset target is assumed to be tokenized
  • n_grams_sep_token::String=nothing: separator
  • grams::Int64=3: the number of grams for cues
  • n_grams_keep_sep::Bool=false: if true, keep separators in cues
  • start_end_token::String=":": start and end token in boundary cues
  • path_sep_token::String=":": path separator in the assembled path
  • random_seed::Int64=314: the random seed
  • sd_base_mean::Int64=1: the sd mean of base features
  • sd_inflection_mean::Int64=1: the sd mean of inflectional features
  • sd_base::Int64=4: the sd of base features
  • sd_inflection::Int64=4: the sd of inflectional features
  • isdeep::Bool=true: if true, mean of each feature is also randomized
  • add_noise::Bool=true: if true, add additional Gaussian noise
  • sd_noise::Int64=1: the sd of the Gaussian noise
  • normalized::Bool=false: if true, most values range between -1 and 1; they may slightly exceed these bounds depending on the sd
  • if_combined::Bool=false: if true, features are combined from both training and validation data
  • learn_mode::Symbol=:cholesky: which learning mode, currently supports :cholesky and :wh
  • method::Symbol=:additive: whether :additive or :multiplicative decomposition is required
  • shift::Float64=0.02: shift value for :additive decomposition
  • multiplier::Float64=1.01: multiplier value for :multiplicative decomposition
  • output_format::Symbol=:auto: force the output format to dense (:dense) or sparse (:sparse), or let the program decide (:auto)
  • sparse_ratio::Float64=0.05: the ratio to decide whether a matrix is sparse
  • wh_freq::Vector=nothing: the learning sequence
  • init_weights::Matrix=nothing: the initial weights
  • eta::Float64=0.1: the learning rate
  • n_epochs::Int64=1: the number of epochs to be trained
  • max_t::Int64=0: the maximum number of timesteps
  • A::Matrix=nothing: the adjacency matrix
  • A_mode::Symbol=:combined: the adjacency matrix mode, currently supports :combined or :train_only
  • max_can::Int64=10: the max number of candidate paths to keep in the output
  • threshold_train::Float64=0.1: the value set for the support such that if the support of an n-gram is higher than this value, the n-gram will be taken into consideration for training data
  • is_tolerant_train::Bool=false: if true, select a specified number (given by max_tolerance) of n-grams whose supports are below threshold but above a second tolerance threshold to be added to the path for training data
  • tolerance_train::Float64=-0.1: the value set for the second threshold (in tolerant mode) such that if the support for an n-gram is in between this value and the threshold and the max_tolerance number has not been reached, then allow this n-gram to be added to the path for training data
  • max_tolerance_train::Int64=2: maximum number of n-grams allowed in a path for training data
  • threshold_val::Float64=0.1: the value set for the support such that if the support of an n-gram is higher than this value, the n-gram will be taken into consideration for validation data
  • is_tolerant_val::Bool=false: if true, select a specified number (given by max_tolerance) of n-grams whose supports are below threshold but above a second tolerance threshold to be added to the path for validation data
  • tolerance_val::Float64=-0.1: the value set for the second threshold (in tolerant mode) such that if the support for an n-gram is in between this value and the threshold and the max_tolerance number has not been reached, then allow this n-gram to be added to the path for validation data
  • max_tolerance_val::Int64=2: maximum number of n-grams allowed in a path for validation data
  • n_neighbors_train::Int64=10: the top n form neighbors to be considered for training data
  • n_neighbors_val::Int64=20: the top n form neighbors to be considered for validation data
  • issparse::Bool=false: if true, keep sparse matrix format when learning paths
  • output_dir::String="out": the output directory
  • verbose::Bool=false: if true, more information will be printed
source
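
The threshold and tolerance options above can be read as a three-way decision per n-gram. The following is a minimal sketch of that reading, not JudiLing's implementation: classify_ngram and n_tolerated are hypothetical names, and the strict comparisons are an assumption based on the wording "higher than" and "in between".

```julia
# Hypothetical helper for illustration only; not part of JudiLing.
# Classifies the support of a single n-gram under the documented options.
function classify_ngram(support::Real;
                        threshold::Real = 0.1,
                        is_tolerant::Bool = false,
                        tolerance::Real = -0.1,
                        n_tolerated::Integer = 0,   # tolerated n-grams already in the path (assumed bookkeeping)
                        max_tolerance::Integer = 2)
    if support > threshold
        # Support clears the main threshold: the n-gram is considered.
        return :accepted
    elseif is_tolerant && support > tolerance && n_tolerated < max_tolerance
        # Tolerant mode: below threshold but above the second threshold,
        # and the tolerance budget is not yet used up.
        return :tolerated
    else
        return :rejected
    end
end

classify_ngram(0.25)                                       # :accepted
classify_ngram(0.05)                                       # :rejected
classify_ngram(0.05; is_tolerant = true)                   # :tolerated
classify_ngram(0.05; is_tolerant = true, n_tolerated = 2)  # :rejected
```

With the defaults shown, tolerant mode only widens the set of eligible n-grams to supports between -0.1 and 0.1, and only until max_tolerance such n-grams have been used.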
diff --git a/dev/man/utils/index.html b/dev/man/utils/index.html index 54d4806..559f334 100644 --- a/dev/man/utils/index.html +++ b/dev/man/utils/index.html @@ -1,13 +1,13 @@ -Utils · JudiLing.jl

Utils

JudiLing.is_truly_sparseFunction

Check whether a matrix is truly sparse regardless of its format, where M is originally in a sparse matrix format.

source

Check whether a matrix is truly sparse regardless of its format, where M is originally in a dense matrix format.

source
JudiLing.cal_max_timestepFunction
function cal_max_timestep(
+Utils · JudiLing.jl

Utils

JudiLing.is_truly_sparseFunction

Check whether a matrix is truly sparse regardless of its format, where M is originally in a sparse matrix format.

source

Check whether a matrix is truly sparse regardless of its format, where M is originally in a dense matrix format.

source
JudiLing.cal_max_timestepFunction
function cal_max_timestep(
     data_train::DataFrame,
     data_val::DataFrame,
     target_col::Union{String, Symbol};
     tokenized::Bool = false,
     sep_token::Union{Nothing, String, Char} = "",
-)

Calculate the max timestep given training and validation datasets.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset
  • target_col::Union{String, Symbol}: the column with the target word forms

Optional Arguments

  • tokenized::Bool = false: Whether the word forms in the target_col are already tokenized
  • sep_token::Union{Nothing, String, Char} = "": The token with which the word forms are tokenized

Examples

JudiLing.cal_max_timestep(latin_train, latin_val, :Word)
source
function cal_max_timestep(
+)

Calculate the max timestep given training and validation datasets.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset
  • target_col::Union{String, Symbol}: the column with the target word forms

Optional Arguments

  • tokenized::Bool = false: Whether the word forms in the target_col are already tokenized
  • sep_token::Union{Nothing, String, Char} = "": The token with which the word forms are tokenized

Examples

JudiLing.cal_max_timestep(latin_train, latin_val, :Word)
source
function cal_max_timestep(
     data::DataFrame,
     target_col::Union{String, Symbol};
     tokenized::Bool = false,
     sep_token::Union{Nothing, String, Char} = "",
-)

Calculate the max timestep given a single dataset.

Obligatory Arguments

  • data::DataFrame: the dataset
  • target_col::Union{String, Symbol}: the column with the target word forms

Optional Arguments

  • tokenized::Bool = false: Whether the word forms in the target_col are already tokenized
  • sep_token::Union{Nothing, String, Char} = "": The token with which the word forms are tokenized

Examples

JudiLing.cal_max_timestep(latin, :Word)
source
+)

Calculate the max timestep given a single dataset.

Obligatory Arguments

  • data::DataFrame: the dataset
  • target_col::Union{String, Symbol}: the column with the target word forms

Optional Arguments

  • tokenized::Bool = false: Whether the word forms in the target_col are already tokenized
  • sep_token::Union{Nothing, String, Char} = "": The token with which the word forms are tokenized

Examples

JudiLing.cal_max_timestep(latin, :Word)
source
diff --git a/dev/man/wh/index.html b/dev/man/wh/index.html index bf39baa..ca04ec3 100644 --- a/dev/man/wh/index.html +++ b/dev/man/wh/index.html @@ -10,4 +10,4 @@ history_cols = nothing, history_rows = nothing, verbose = false, - )

Widrow-Hoff Learning.

Obligatory Arguments

Optional Arguments

source
JudiLing.make_learn_seqMethod
make_learn_seq(freq; random_seed = 314)

Make a Widrow-Hoff learning sequence from frequencies. Creates a randomly ordered sequence of indices in which each index appears as many times as its frequency.

Obligatory arguments

  • freq: Vector with frequencies.

Optional arguments

  • random_seed = 314: Random seed to control randomness.

Example

learn_seq = JudiLing.make_learn_seq(data.frequency)
source
+ )

Widrow-Hoff Learning.

Obligatory Arguments

Optional Arguments

source
JudiLing.make_learn_seqMethod
make_learn_seq(freq; random_seed = 314)

Make a Widrow-Hoff learning sequence from frequencies. Creates a randomly ordered sequence of indices in which each index appears as many times as its frequency.

Obligatory arguments

  • freq: Vector with frequencies.

Optional arguments

  • random_seed = 314: Random seed to control randomness.

Example

learn_seq = JudiLing.make_learn_seq(data.frequency)
source
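
To make the description of make_learn_seq concrete, here is a minimal sketch of the idea in plain Julia. It is an illustration under stated assumptions, not JudiLing's source: sketch_learn_seq is a hypothetical name, frequencies are assumed to be non-negative integers, and the use of MersenneTwister is an assumption, so the resulting order need not match what JudiLing produces for the same seed.

```julia
using Random

# Hypothetical re-implementation for illustration; not JudiLing's code.
# Builds a sequence in which index i occurs freq[i] times, then shuffles it.
function sketch_learn_seq(freq::AbstractVector{<:Integer}; random_seed::Integer = 314)
    rng = MersenneTwister(random_seed)   # seeded RNG so the order is reproducible
    # Repeat each index according to its frequency.
    seq = [i for i in eachindex(freq) for _ in 1:freq[i]]
    # Return a randomly ordered copy of the sequence.
    return shuffle(rng, seq)
end

seq = sketch_learn_seq([3, 1, 2])
println(length(seq))        # total number of learning events: 3 + 1 + 2
println(count(==(1), seq))  # how often index 1 occurs
```

Each entry of the resulting sequence points at one row of the data, so more frequent items are presented more often during Widrow-Hoff learning.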
diff --git a/dev/search/index.html b/dev/search/index.html index f6a4643..38522e3 100644 --- a/dev/search/index.html +++ b/dev/search/index.html @@ -1,2 +1,2 @@ -Search · JudiLing.jl

Loading search...

    +Search · JudiLing.jl

    Loading search...