Motivation
There are many parameters available to control the different types of sampling, and the interactions between them are more complex than can be clearly expressed in any individual parameter's entry at https://lightgbm.readthedocs.io/en/latest/Parameters.html.
I believe such documentation would significantly improve users' understanding of how LightGBM works, and help them to make informed decisions about values for LightGBM's parameters.
Description
My idea for this is several paragraphs like the following, mixing an explanation of LightGBM's internal processes with the names of the specific parameters that can be used to control them.
LightGBM does not perform boosting directly on the raw values in input data. Instead, it performs some pre-processing such as binning continuous features into histograms, bundling sparse features together, and performing target encoding on categorical features.
This pre-processing creates an object called a Dataset. To improve the speed of Dataset construction, LightGBM samples the input data to determine characteristics like histogram bin boundaries. Use the parameter bin_construct_sample_cnt (default=200000) to control how many observations are sampled during this process, and data_random_seed to make that sampling reproducible.
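To make the idea concrete, here is a stdlib-only sketch of why sampling speeds up bin construction: bin boundaries can be estimated from quantiles of a random sample of a feature column instead of a full scan. This is my own simplification for illustration, not LightGBM's actual binning algorithm; the function name and signature are hypothetical.

```python
import random

def approx_bin_boundaries(values, max_bin=4, sample_cnt=200, seed=708):
    """Estimate histogram bin boundaries for one feature from a sample.

    sample_cnt plays the role of bin_construct_sample_cnt, and seed plays
    the role of data_random_seed (illustrative analogy only).
    """
    rng = random.Random(seed)
    n = min(sample_cnt, len(values))
    sample = sorted(rng.sample(values, n))
    # Place boundaries at evenly spaced quantiles of the sample.
    return [sample[(i * (n - 1)) // max_bin] for i in range(1, max_bin)]

# A feature with 10,000 distinct values; only 500 of them are inspected.
feature = [float(i) for i in range(10_000)]
boundaries = approx_bin_boundaries(feature, max_bin=4, sample_cnt=500)
```

Because the boundaries come from a random sample, re-running with a different seed gives slightly different bins, which is exactly why a fixed data_random_seed makes Dataset construction reproducible.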
Other themes that I think should be covered:

- explaining that the sampling during Dataset construction and the sampling during boosting are different and separate from each other
- explaining the difference between goss and bagging
  - why choose one over the other?
  - how do the relevant parameters (e.g. bagging_fraction) affect the process?
- sampling features (feature_fraction)
- sampling splits to evaluate (extra_trees)
- how sampling is a core part of distributed training
  - e.g. with tree_learner=data_parallel, the work of determining bin boundaries for features is split up over partitions of the data
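The goss-versus-bagging distinction above can be sketched in a few lines of stdlib Python (my own simplification, not LightGBM's implementation; function names are hypothetical): bagging keeps a uniform random fraction of rows each iteration, while GOSS keeps the rows with the largest gradients plus a random sample of the rest, upweighting that sample so gradient sums stay approximately unbiased.

```python
import random

def bagging_sample(n_rows, bagging_fraction=0.8, seed=0):
    """Uniform row sampling, in the spirit of bagging_fraction."""
    rng = random.Random(seed)
    return rng.sample(range(n_rows), int(n_rows * bagging_fraction))

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """GOSS-style sampling: keep large-gradient rows, sample the rest.

    Returns (row_index, weight) pairs; the sampled small-gradient rows
    get weight (1 - top_rate) / other_rate to compensate for sampling.
    """
    rng = random.Random(seed)
    order = sorted(range(len(gradients)),
                   key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(len(gradients) * top_rate)
    n_other = int(len(gradients) * other_rate)
    rest = rng.sample(order[n_top:], n_other)
    amplify = (1.0 - top_rate) / other_rate
    return [(i, 1.0) for i in order[:n_top]] + [(i, amplify) for i in rest]

grads = [(-1) ** i * (i / 100.0) for i in range(100)]
rows = bagging_sample(100, bagging_fraction=0.8)       # 80 uniformly chosen rows
pairs = goss_sample(grads, top_rate=0.2, other_rate=0.1)  # 20 top rows + 10 reweighted rows
```

The key contrast: bagging's selection ignores the model state entirely, while GOSS's selection depends on the current gradients, which is why the two interact differently with the rest of training.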
Summary
There are several points in the process of training a LightGBM model where less than the full training data is used.
I think it would be valuable to add a section called "Sampling" or similar at https://lightgbm.readthedocs.io/en/latest/Features.html, describing these concepts.
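As a rough inventory of how many knobs such a "Sampling" section would need to tie together, the parameters discussed in this proposal can all appear in a single training config. The parameter names below come from the LightGBM parameter docs; the values are arbitrary placeholders, not recommendations.

```python
# Sampling-related LightGBM parameters, grouped by the stage they affect.
# Values are illustrative only.
params = {
    # Dataset construction: rows sampled to determine bin boundaries
    "bin_construct_sample_cnt": 200000,
    "data_random_seed": 1,
    # Row sampling during boosting (bagging)
    "bagging_fraction": 0.8,
    "bagging_freq": 1,
    # Column (feature) sampling
    "feature_fraction": 0.9,
    # Randomized selection of split thresholds
    "extra_trees": True,
    # Distributed training: bin boundaries computed over data partitions
    "tree_learner": "serial",
}
```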
References
Created this based on the discussion in #4827.