Commit

nd format
PhilipMay committed Nov 14, 2023
1 parent 37a23df commit a7e0fa9
Showing 7 changed files with 23 additions and 13 deletions.
2 changes: 1 addition & 1 deletion source/blog/2022-02-20-lightgbm-optuna-demo.md
@@ -4,7 +4,7 @@ This week I published a project to show how to combine
LightGBM and Optuna efficiently to train good models.
The purpose of this work is to be able to be reused as a template for new projects.

-:::{figure} ../_static/img/lightgbm-optuna.png
+:::{figure} ../\_static/img/lightgbm-optuna.png
:width: 50 %

LightGBM & Optuna
2 changes: 1 addition & 1 deletion source/blog/2022-02-22-german-wikipedia-corpus-released.md
@@ -2,7 +2,7 @@

Today I published a new Wikipedia-based German text corpus. It is to be used for NLP machine learning tasks.

-:::{figure} ../_static/img/wikipedia.png
+:::{figure} ../\_static/img/wikipedia.png
:width: 50 %

Wikipedia
2 changes: 1 addition & 1 deletion source/blog/2022-02-23-mlsum-anomalies.md
@@ -5,7 +5,7 @@ my colleague [Michal Harakal](https://www.harakal.de/) and I noticed that in man
sentence of the input text.
Instead, it should generate an independent summary of the whole text.

-:::{figure} ../_static/img/text-unsplash.jpg
+:::{figure} ../\_static/img/text-unsplash.jpg
:width: 50 %

Photo by [Sandy Millar](https://unsplash.com/@sandym10?utm_content=creditCopyText&utm_medium=referral&utm_source=unsplash) on [Unsplash](https://unsplash.com/photos/a-close-up-of-a-book-with-some-type-of-text-Kl4LNdg6on4?utm_content=creditCopyText&utm_medium=referral&utm_source=unsplash)
6 changes: 6 additions & 0 deletions source/blog/2022-07-23-python-conda-pip.md
@@ -5,6 +5,7 @@ It is a subjective article and represents my own opinion and experience.
The article is structured by several recommendations.

## Recommendation 1: Never install Python

This may sound a bit strange, but the first recommendation is to never install Python itself.
The reason is that a direct installation commits you to a single, very specific Python version.
In principle you don't want that, because there are different packages that have
@@ -13,6 +14,7 @@ different version requirements.
But how do you install Python without installing it?

## Recommendation 2: Use conda to install and manage Python

You should use [conda](https://docs.conda.io/) to install and manage Python:

> Conda is an open source package management system and environment management system that
@@ -30,19 +32,22 @@ More details about the use and installation of conda you can find on my
[conda page](/python/conda/).

## Recommendation 3: Disable conda automatic base activation

After the conda installation, the so-called base environment is automatically activated in every shell.
If you then install a package - without explicitly activating another environment first - the
package is installed into this base environment. This clutters up the base environment and
is annoying. To force an explicit environment activation, you can disable conda's automatic base
activation with the following command: `conda config --set auto_activate_base false`

## Recommendation 4: Never install Anaconda

Anaconda also includes conda. During the installation, however, numerous other packages are installed
completely unnecessarily. This is the reason why Anaconda is just unnecessary and
completely bloated software that I cannot recommend to anyone.
Nothing more needs to be said about this.

## Recommendation 5: Do not use conda to install Packages

Conda can be used not only to manage environments and
different Python versions, but also to install Python packages like NumPy or pandas.

@@ -57,6 +62,7 @@ Many maintainers release only unofficially or not at all a conda version of thei
Then the conda package is maintained by someone completely different.

## Recommendation 6: Use pip to install Packages

To avoid the problem described above, I always use [pip](https://pip.pypa.io/en/stable/)
for package installation.
Conda is then only used to create and manage the environments and to install Python.
3 changes: 3 additions & 0 deletions source/blog/2022-10-12-date-encoding.md
@@ -9,20 +9,23 @@ The general options to encode the time dimension like the birth date of a custom
3. relative to "today" - e.g. number of days before today

## Pros and cons: separate encoding of year, month and maybe also day and weekday

If you believe in astrology, this might be your favorite way to encode a birth date, since the month is preserved. If you want to encode a *production date*, it might also be useful to encode the weekday, because there might be a relation between product quality and the weekday of production. Parts manufactured on Mondays may have the most severe quality variations.

The disadvantage is that you need multiple columns to encode the date.
Furthermore, this approach also suffers from a
[concept drift](https://en.wikipedia.org/wiki/Concept_drift) problem (see below).
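
The separate encoding can be sketched in a few lines of Python with the standard library (the function name and feature dictionary are my own illustration, not from the article):

```python
from datetime import date

def encode_date_separately(d: date) -> dict:
    """Encode a date as separate year, month, day and weekday features."""
    return {
        "year": d.year,
        "month": d.month,
        "day": d.day,
        "weekday": d.weekday(),  # Monday == 0, ..., Sunday == 6
    }

# A production date encoded this way keeps the weekday visible to the model:
print(encode_date_separately(date(2022, 10, 12)))
```

Each dictionary key becomes its own feature column, which is exactly where the "multiple columns per date" disadvantage comes from.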

## Pros and cons: relative to a certain point in time in the past

This is easy to calculate because the "point in time in the past" (January 1st 1900, for example) is fixed. This contrasts with the encoding relative to "today". But this encoding has the following problem:

There are circumstances that in reality are related not to the date itself but to the age. The remaining service life of a technical device is much more directly related to its age than to its production date. Whether a customer is interested in an airplane trip or a train ticket also depends on age, not so much on the date of birth. So if you represent the date of birth relative to a point in time in the past, the resulting model has a built-in [concept drift](https://en.wikipedia.org/wiki/Concept_drift).

For example, suppose two predictions are made for the same person with his or her date of birth: one in January 2022 and one in January 2023. The person is obviously one year older at the second prediction in January 2023. But this would not be visible in the encoding of the date of birth (if it is encoded relative to a point in time in the past). The model would therefore experience concept drift and would have to be re-trained.
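
A minimal sketch of this encoding, assuming January 1st 1900 as the fixed reference point (the names are my own):

```python
from datetime import date

REFERENCE = date(1900, 1, 1)  # a fixed "point in time in the past"

def days_since_reference(d: date) -> int:
    """Encode a date as the number of days since the fixed reference date."""
    return (d - REFERENCE).days

birth = date(1990, 5, 17)
# The encoding is identical no matter when the prediction is made,
# so the model cannot "see" that the person is getting older:
print(days_since_reference(birth))
```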

## Pros and cons: relative to "today"

This would be the encoding of choice if there is a relation between age and prediction, because it prevents the concept drift described above. The disadvantage of this encoding is that the reference day "today" is very dynamic and not fixed. So you have to be very careful how you define "today".
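
A sketch of the "today"-relative encoding (the names are my own, not from the article). Note how the same birth date yields a different value at each prediction time, which is exactly what avoids the concept drift:

```python
from datetime import date

def days_before_today(d: date, today: date) -> int:
    """Encode a date as the number of days before the reference day 'today'."""
    return (today - d).days

birth = date(1990, 5, 17)
# The same person, encoded at two different prediction times:
age_2022 = days_before_today(birth, date(2022, 1, 15))
age_2023 = days_before_today(birth, date(2023, 1, 15))
print(age_2023 - age_2022)  # 365 - the model sees the person aging
```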

A distinction must be made between generating the training data (as well as validation and test data) and prediction at production time. Prediction at production time is easy to understand: "today" is simply the day on which the prediction is made. Generating the training data, however, is a bit more difficult. Here, "today" must not be the day on which the training data is generated. Instead, "today" is the day on which the label was created. The easiest way to explain this is with an example:
19 changes: 10 additions & 9 deletions source/index.md
@@ -42,6 +42,7 @@ Talk about this model:\
:::

:::{grid-item-card} German T5 models in 3 different sizes

- [GermanT5/t5-efficient-gc4-all-german-large-nl36](https://huggingface.co/GermanT5/t5-efficient-gc4-all-german-large-nl36)
- [GermanT5/t5-efficient-gc4-german-base-nl36](https://huggingface.co/GermanT5/t5-efficient-gc4-german-base-nl36)
- [GermanT5/t5-efficient-gc4-all-german-small-el32](https://huggingface.co/GermanT5/t5-efficient-gc4-all-german-small-el32)
@@ -103,11 +104,11 @@ Includes also a prepared corpus for English and German language.
This repository contains two datasets:

1. A labeled multi-domain (21 domains) German and
-English dataset with 25K user utterances for human-robot interaction.
-It is also available as a Hugging Face dataset:
-[deutsche-telekom/NLU-Evaluation-Data-en-de](https://huggingface.co/datasets/deutsche-telekom/NLU-Evaluation-Data-en-de)
+English dataset with 25K user utterances for human-robot interaction.
+It is also available as a Hugging Face dataset:
+[deutsche-telekom/NLU-Evaluation-Data-en-de](https://huggingface.co/datasets/deutsche-telekom/NLU-Evaluation-Data-en-de)
2. A dataset with 1,127 German sentence pairs with a similarity score. The sentences originate from the first data set.
-:::
+:::

:::{grid-item-card} [deutsche-telekom/NLU-few-shot-benchmark-en-de](https://huggingface.co/datasets/deutsche-telekom/NLU-few-shot-benchmark-en-de)
This is a few-shot training dataset from the domain of human-robot interaction.
@@ -197,22 +198,22 @@ An [Arch Linux](https://archlinux.org/) package ([AUR](https://wiki.archlinux.or
- refactor slow sentencepiece tokenizers and add tests: [#11716](https://github.com/huggingface/transformers/pull/11716),
[#11737](https://github.com/huggingface/transformers/pull/11737)
- [more fixes and improvements](https://github.com/huggingface/transformers/pulls?q=is%3Apr+author%3APhilipMay)
-:::
+:::

:::{grid-item-card} [Optuna](https://github.com/optuna/optuna)

- add MLflow integration callback: [#1028](https://github.com/optuna/optuna/pull/1028)
- trial level suggest for same variable with different parameters give warning: [#908](https://github.com/optuna/optuna/pull/908)
- [more fixes and improvements](https://github.com/optuna/optuna/pulls?q=is%3Apr+author%3APhilipMay)
:::
:::

:::{grid-item-card} [Sentence Transformers](https://github.com/UKPLab/sentence-transformers)

- add callback so we can do pruning and check for nan values: [#327](https://github.com/UKPLab/sentence-transformers/pull/327)
- add option to pass params to tokenizer: [#342](https://github.com/UKPLab/sentence-transformers/pull/342)
- always store best_score: [#439](https://github.com/UKPLab/sentence-transformers/pull/439)
- fix for OOM problems on GPU with large datasets: [#525](https://github.com/UKPLab/sentence-transformers/pull/525)
-:::
+:::

:::{grid-item-card} [SetFit - Efficient Few-shot Learning with Sentence Transformers](https://github.com/huggingface/setfit)

@@ -222,7 +223,7 @@ An [Arch Linux](https://archlinux.org/) package ([AUR](https://wiki.archlinux.or
- add option to use amp / FP16 [#134](https://github.com/huggingface/setfit/pull/134)
- add num_epochs to train_step calculation [#139](https://github.com/huggingface/setfit/pull/134)
- add more loss function options [#159](https://github.com/huggingface/setfit/pull/159)
-:::
+:::

:::{grid-item-card} Other Fixes and Improvements

@@ -233,7 +234,7 @@ An [Arch Linux](https://archlinux.org/) package ([AUR](https://wiki.archlinux.or
- [deepset-ai/FARM](https://github.com/deepset-ai/FARM): [various fixes and improvements](https://github.com/deepset-ai/FARM/pulls?q=is%3Apr+author%3APhilipMay)
- [hyperopt/hyperopt](https://github.com/hyperopt/hyperopt): add progressbar with tqdm [#455](https://github.com/hyperopt/hyperopt/pull/455)
- [mlflow/mlflow](https://github.com/mlflow/mlflow): add possibility to use client cert. with tracking API [#2843](https://github.com/mlflow/mlflow/pull/2843)
-:::
+:::

::::

2 changes: 1 addition & 1 deletion source/it/freifunk.md
@@ -88,7 +88,7 @@
- USB socket desoldered - see photo below
- WPS and WLAN switches clipped off - see photo below

-:::{figure} ../_static/img/passiv-poe-umbau-fritz-box-4020.jpg
+:::{figure} ../\_static/img/passiv-poe-umbau-fritz-box-4020.jpg

Photo of the hardware modification
:::
