Releases: webis-de/small-text
v2.0.0.dev1
This intermediate release serves as a preliminary version of the upcoming v2.0.0. Consider it an alpha release, where interface changes are still possible.
Added
- General
- Python requirements raised to Python 3.8 since Python 3.7 has reached end of life on 2023-06-27.
- Dropped torchtext as an integration dependency. For individual use cases it can of course still be used.
- Added environment variables `SMALL_TEXT_PROGRESS_BARS` and `SMALL_TEXT_OFFLINE` to control the default behavior for progress bars and model downloading.
- PoolBasedActiveLearner:
- `initialize_data()` has been replaced by `initialize()`, which can now also be used to provide an initial model in cold start scenarios. (#10)
- Classification:
- All PyTorch-classifiers (KimCNN, TransformerBasedClassification, SetFitClassification) now support `torch.compile()`, which can be enabled on demand. (Requires PyTorch >= 2.0.0.)
- All PyTorch-classifiers (KimCNN, TransformerBasedClassification, SetFitClassification) now support Automatic Mixed Precision.
- `SetFitClassification.__init__()` now has a verbosity parameter (similar to `TransformerBasedClassification`) through which you can control the progress bar output of `SetFitClassification.fit()`.
- TransformerBasedClassification:
- Removed unnecessary `token_type_ids` keyword argument in model call.
- Additional keyword args for config, tokenizer, and model can now be configured.
- Embeddings:
- Prevented unnecessary gradient computations for some embedding types and unified code structure.
- Pytorch:
- Added an `inference_mode()` context manager that applies `torch.inference_mode`, or `torch.no_grad` for older PyTorch versions.
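The fallback described for `inference_mode()` can be sketched roughly as follows. This is a minimal illustration and not small-text's implementation: the helper names `select_grad_free_context` and `inference_mode_sketch` are hypothetical, and the module is passed in as an argument only so that the snippet runs without PyTorch installed.

```python
import contextlib


def select_grad_free_context(torch_module):
    """Prefer torch.inference_mode; fall back to torch.no_grad if it is missing."""
    fallback = torch_module.no_grad
    return getattr(torch_module, "inference_mode", fallback)


@contextlib.contextmanager
def inference_mode_sketch(torch_module):
    """Run the enclosed block inside the selected context (hypothetical helper)."""
    with select_grad_free_context(torch_module)():
        yield
```

With PyTorch installed, prediction code could be wrapped as `with inference_mode_sketch(torch): ...`.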
- Query Strategies:
- New strategies: DiscriminativeRepresentationLearning, LabelCardinalityInconsistency, ClassBalancer, and ProbCover.
- Query strategies now have a tie-breaking mechanism that randomly permutes instances with tied scores.
- Added `ScoringMixin` to enable a reusable scoring mechanism for query strategies.
- LightweightCoreset can now process input in batches. (#23)
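One common way to break ties at random is to shuffle the candidates first and then apply a stable sort, so that equal scores keep their shuffled order. The sketch below illustrates that idea only; the function name `top_n_with_random_ties` is hypothetical, and the actual mechanism in small-text may be implemented differently.

```python
import random


def top_n_with_random_ties(scores, n, seed=None):
    """Return indices of the n highest scores, breaking ties at random."""
    rng = random.Random(seed)
    indices = list(range(len(scores)))
    rng.shuffle(indices)  # the random order decides between equal scores
    # list.sort() is stable, so equal scores keep the shuffled order
    indices.sort(key=lambda i: scores[i], reverse=True)
    return indices[:n]
```

Without the shuffle, a plain sort would always favor the tied instances that appear first in the pool.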
- Vector Index Functionality:
- A new vector index API provides a unified interface to different implementations of k-nearest neighbor search.
- Existing strategies that used a hard-coded vector search (ContrastiveActiveLearning, SEALS, AnchorSubsampling) have been adapted and can now be used with different vector index implementations.
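To illustrate what a unified interface over k-nearest neighbor implementations buys, here is a small sketch. All names (`VectorIndexSketch`, `BruteForceIndex`, `nearest_neighbors`) are hypothetical and do not reflect small-text's actual classes or method signatures; the exact search is written in pure Python for readability.

```python
import math
from abc import ABC, abstractmethod


class VectorIndexSketch(ABC):
    """Hypothetical minimal interface; not small-text's actual API."""

    @abstractmethod
    def build(self, vectors):
        """Store or index the given vectors and return self."""

    @abstractmethod
    def search(self, queries, k):
        """Return, per query, the indices of its k nearest stored vectors."""


class BruteForceIndex(VectorIndexSketch):
    """Exact search that computes every Euclidean distance."""

    def build(self, vectors):
        self._vectors = [tuple(v) for v in vectors]
        return self

    def search(self, queries, k):
        results = []
        for q in queries:
            order = sorted(range(len(self._vectors)),
                           key=lambda i: math.dist(q, self._vectors[i]))
            results.append(order[:k])
        return results


def nearest_neighbors(index, vectors, queries, k):
    """Code written against the interface works with any implementation."""
    return index.build(vectors).search(queries, k)
```

A strategy that only calls `build()` and `search()` could swap the exact index for an approximate one without further changes.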
Fixed
- Fixed a bug where the `clone()` operation wrapped the labels, which then raised an error. This affected the single-label scenario for PytorchTextClassificationDataset and TransformersDataset. (#35)
- Fixed a bug where the batching in `greedy_coreset()` and `lightweight_coreset()` resulted in incorrect batch sizes. (#50)
- Fixed a bug where `lightweight_coreset()` failed when computing the norm of the elementwise mean vector.
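The batching fix in #50 is not reproduced here, but the invariant such batching should satisfy is easy to state: consecutive batches cover every item exactly once, and only the last batch may be smaller than the requested size. A minimal sketch, with `batch_slices` as a hypothetical helper name:

```python
def batch_slices(num_items, batch_size):
    """Yield (start, end) pairs that cover range(num_items) in order."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, num_items, batch_size):
        # The final batch is truncated instead of running past the end.
        yield start, min(start + batch_size, num_items)
```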
Changed
- General
- Moved `split_data()` method from `small_text.data.datasets` to `small_text.data.splits`.
- Dependencies
- Raised setfit version to 1.1.0.
- Classification:
- The `initialize()` methods of all PyTorch-classifiers (KimCNN, TransformerBasedClassification, SetFitClassification) are now more unified. (#57)
- KimCNNClassifier / TransformerBasedClassification: model selection is now disabled by default. Also, it no longer saves models when disabled, thereby greatly reducing the runtime.
- Utils
- `init_kmeans_plusplus_safe()` now supports weighted kmeans++ initialization for `scikit-learn>=1.3.0`.
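For readers unfamiliar with weighted kmeans++ initialization, the idea is that each point's chance of becoming a center is scaled by its sample weight. The pure-Python sketch below shows the general technique only; it is not the code behind `init_kmeans_plusplus_safe()` or scikit-learn, and the function name `weighted_kmeans_plusplus` is hypothetical.

```python
import random


def weighted_kmeans_plusplus(points, weights, k, seed=None):
    """Pick up to k initial center indices via weighted k-means++ seeding."""
    rng = random.Random(seed)
    n = len(points)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # First center: sampled proportionally to the sample weights.
    centers = rng.choices(range(n), weights=weights, k=1)
    # Squared distance of every point to its closest chosen center.
    closest = [sq_dist(p, points[centers[0]]) for p in points]

    while len(centers) < k:
        # Next center: probability proportional to weight * squared distance.
        probs = [w * d for w, d in zip(weights, closest)]
        if sum(probs) == 0:
            break  # no remaining point is both distinct and weighted
        nxt = rng.choices(range(n), weights=probs, k=1)[0]
        centers.append(nxt)
        closest = [min(d, sq_dist(p, points[nxt]))
                   for p, d in zip(points, closest)]
    return centers
```

Points with zero weight are never selected, and with uniform weights the procedure reduces to ordinary kmeans++ seeding.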
Removed
- Deprecated functionality
- Removed `default_tensor_type()` method.
- Removed `small_text.utils.labels.get_flattened_unique_labels()`.
- Removed `small_text.integrations.pytorch.utils.labels.get_flattened_unique_labels()`.
- Classification
- Removed early stopping legacy arguments in `__init__()` for KimCNN and TransformerBasedClassification. (Use `fit()` keyword arguments instead.)
- Removed model selection legacy argument in `TransformerBasedClassification.__init__()`.
- The explicit installation instruction for conda was removed, but the small-text conda-forge package will remain.
v1.4.1
Bugfix release.
Fixed
- Fixed an out of bounds error that occurred when `DiscriminativeActiveLearning` queries all remaining unlabeled data.
- Fixed typos/wording in PoolBasedActiveLearner docstrings.
- Pinned SetFit version in notebook example. (#64)
- Fixed an out of bounds error that could occur in `SetFitClassification` for both 32bit systems and Windows. (#66)
- Fixed errors in notebook examples that occurred with more recent seaborn / matplotlib versions.
Changed
- Documentation: added links to bibliography. (#65)
v1.4.0
Fixes SetFit seed control and adds the AnchorSubsampling query strategy.
Added
- New query strategy: AnchorSubsampling.
Fixed
- Changed how the seed is controlled in `SetFitClassification`, since the seed was fixed unless explicitly set via the respective trainer keyword argument.
Changed
- Documentation: Added a section where compatible transformer models are listed.
- Documentation: Updated showcase section.
v1.3.3
v1.3.2
v1.3.1
v1.3.0
SetFitClassification now also supports dropout sampling (like KimCNNClassifier and TransformerBasedClassification).
Added
- Added dropout sampling to SetFitClassification.
Fixed
- Fixed broken link in README.md.
- Fixed typo in README.md. (#26)
Changed
Stopping Criteria
- The ClassificationChange stopping criterion now supports multi-label classification.
Documentation
- Updated the active learning setup figure.
- The documentation of integrations has been reorganized.
v1.2.0
This release adds a SetFit classifier, the BALD query strategy, and two new example notebooks.
Added
Active Learning
- PoolBasedActiveLearner now handles keyword arguments passed to the classifier's `fit()` during the `update()` step.
- New strategy: BALD.
- SubsamplingQueryStrategy now uses the remaining unlabeled pool when more samples are requested than are available.
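As background on the BALD strategy listed above: its score is usually defined as the mutual information between the prediction and the model parameters, estimated from several stochastic forward passes (for example with dropout enabled). The sketch below computes that textbook quantity for a single instance; the function names are hypothetical, and small-text's implementation may differ in details such as how the samples are drawn.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def bald_score(sampled_probs):
    """Entropy of the mean prediction minus the mean per-sample entropy.

    sampled_probs: list of T class-probability vectors for one instance,
    e.g. obtained from T forward passes with dropout enabled.
    """
    num_samples = len(sampled_probs)
    num_classes = len(sampled_probs[0])
    mean_probs = [sum(s[c] for s in sampled_probs) / num_samples
                  for c in range(num_classes)]
    expected_entropy = sum(entropy(s) for s in sampled_probs) / num_samples
    return entropy(mean_probs) - expected_entropy
```

The score is high when the individual passes are confident but disagree with each other, and zero when all passes produce the same distribution.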
Classification
- Added new classifier: SetFitClassification which wraps huggingface/setfit.
Examples
- Revised both existing notebook examples.
- Added a notebook example for active learning with SetFit classifiers.
- Added a notebook example for cold start initialization with SetFit classifiers.
Documentation
- A showcase section has been added to the documentation.
Fixed
- Distances in lightweight_coreset were not correctly projected onto the [0, 1] interval (but ranking was unaffected).
Changed
- Coreset implementations now use the distance-based (as opposed to the similarity-based) formulation.
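The relation between the two formulations, and why a rescaling onto [0, 1] leaves the ranking untouched, can be shown with a small sketch. It assumes cosine distance and min-max scaling purely for illustration; the notes above do not state which distance or projection the coreset implementations use, and all function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Distance-based counterpart of the similarity: 1 - similarity."""
    return 1.0 - cosine_similarity(a, b)


def min_max_scale(values):
    """Map values onto [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

Because min-max scaling is monotonically increasing, sorting by the scaled distances yields the same order as sorting by the raw distances.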
v1.1.1
v1.1.0
This release adds a conda package, more convenient imports, and improves many aspects of the classification functionality. Moreover, one new query strategy and three stopping criteria have been added.
Added
General
- Small-Text package is now available via conda-forge.
- Imports have been reorganized. You can import all public classes and methods from the top-level package (`small_text`): `from small_text import PoolBasedActiveLearner`
Classification
- All classifiers now support weighting of training samples.
- Early stopping has been reworked, improved, and documented (#18).
- Model selection has been reworked and documented.
- [!] `KimCNNClassifier.__init__()`: The default value of the (now deprecated) keyword argument `early_stopping_acc` has been changed from `0.98` to `-1` in order to match `TransformerBasedClassification`.
- [!] Removed weight renormalization after gradient clipping.
Datasets
- The `target_labels` keyword argument in `__init__()` will now raise a warning if not passed.
- Added `from_arrays()` to `SklearnDataset`, `PytorchTextClassificationDataset`, and `TransformersDataset` to construct datasets more conveniently.
Query Strategies
- New multi-label strategy: CategoryVectorInconsistencyAndRanking.
Stopping Criteria
- New stopping criteria: ClassificationChange, OverallUncertainty, and MaxIterations.
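To give an idea of what a prediction-change criterion such as ClassificationChange checks, the sketch below compares the predictions of two consecutive active learning iterations and stops once few enough of them differ. This is an assumption about the general idea, not small-text's implementation: the function names and the exact threshold semantics are hypothetical.

```python
def changed_fraction(previous, current):
    """Fraction of instances whose predicted label differs between iterations."""
    if len(previous) != len(current):
        raise ValueError("both prediction lists must have the same length")
    if not current:
        return 0.0
    changed = sum(1 for p, c in zip(previous, current) if p != c)
    return changed / len(current)


def should_stop(previous, current, threshold=0.0):
    """Stop once at most `threshold` of the predictions have changed."""
    return changed_fraction(previous, current) <= threshold
```

For multi-label predictions, the same comparison could be applied to label sets or tuples instead of single labels.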
Deprecated
- `small_text.integrations.pytorch.utils.misc.default_tensor_type()` is deprecated without replacement (#2).
- `TransformerBasedClassification` and `KimCNNClassifier`: The keyword arguments for early stopping (early_stopping / early_stopping_no_improvement, early_stopping_acc) that are passed to `__init__()` are now deprecated. Use the `early_stopping` keyword argument in the `fit()` method instead (#18).
Fixed
Classification
- `KimCNNClassifier.fit()` and `TransformerBasedClassification.fit()` now correctly process the `scheduler` keyword argument (#16).
Removed
- Removed the strict check that every target label has to occur in the training data.
(This is intended for multi-label settings with many labels; apart from that it is still recommended to make sure that all labels occur.)