Commit

card design on landing page
PhilipMay committed Nov 14, 2023
1 parent e5318d0 commit bd9a688
Showing 1 changed file with 42 additions and 64 deletions.
106 changes: 42 additions & 64 deletions source/index.md
@@ -1,15 +1,13 @@
# Philip May - Data Science and IT

:::::{grid} 1
:gutter: 2

::::{grid-item-card}
:::{card}
I'm Philip May, a data science expert and open source enthusiast with an NLP focus.
I come from Germany and work for Deutsche Telekom.

This website is a mixture of documentation, blog posts, and personal notes.
::::
::::{grid-item-card}
:::

::::{card}
**Website Topics**
^^^
:::{toctree}
@@ -22,7 +20,6 @@ linux
blog
:::
::::
:::::

## My Open Source Contributions

@@ -31,10 +28,7 @@ This is an overview of my open source [models](#models), [datasets](#datasets),

### Models

::::{grid} 1
:gutter: 2

:::{grid-item-card} [german-nlp-group/electra-base-german-uncased](https://huggingface.co/german-nlp-group/electra-base-german-uncased)
:::{card} [german-nlp-group/electra-base-german-uncased](https://huggingface.co/german-nlp-group/electra-base-german-uncased)
German [Electra](https://arxiv.org/abs/2003.10555) NLP model,
joint work with [Philipp Reißel](https://www.linkedin.com/in/philipp-reissel/)
([ambeRoad](https://amberoad.de/))
@@ -43,7 +37,7 @@ Talk about this model:\
[BEYOND BERT – Challenges and Potentials in the Training of German Language Models](https://www.youtube.com/watch?v=cxgrTd2AQis)
:::

:::{grid-item-card} German T5 models in 3 different sizes
:::{card} German T5 models in 3 different sizes

- [GermanT5/t5-efficient-gc4-all-german-large-nl36](https://huggingface.co/GermanT5/t5-efficient-gc4-all-german-large-nl36)
- [GermanT5/t5-efficient-gc4-german-base-nl36](https://huggingface.co/GermanT5/t5-efficient-gc4-german-base-nl36)
@@ -54,41 +48,36 @@ These models are trained on our [GC4 corpus](https://german-nlp-group.github.io/projects/gc4-corpus.html)
Joint work with [Stefan Schweter](https://github.com/stefan-it) ([schweter.ml](https://schweter.ml)) and [Philipp Schmid](https://www.philschmid.de/) ([Hugging Face](https://huggingface.co/)).
:::

:::{grid-item-card} [T-Systems-onsite/cross-en-de-roberta-sentence-transformer](https://huggingface.co/T-Systems-onsite/cross-en-de-roberta-sentence-transformer)
:::{card} [T-Systems-onsite/cross-en-de-roberta-sentence-transformer](https://huggingface.co/T-Systems-onsite/cross-en-de-roberta-sentence-transformer)
This model is intended to [compute sentence (text) embeddings](https://www.sbert.net/examples/applications/computing-embeddings/README.html)
for English and German text. These embeddings can then be compared with [cosine-similarity](https://en.wikipedia.org/wiki/Cosine_similarity)
to find sentences with a similar semantic meaning.
:::
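
To illustrate the intended use, here is a minimal sketch with the `sentence-transformers` package (the model name is from the card above; everything else is an assumption, not code from the model card):

```python
from sentence_transformers import SentenceTransformer, util

# Load the bilingual (English/German) sentence embedding model.
model = SentenceTransformer("T-Systems-onsite/cross-en-de-roberta-sentence-transformer")

# Encode one English and one German sentence.
embeddings = model.encode(["How are you?", "Wie geht es dir?"])

# Compare the two embeddings with cosine similarity.
print(util.cos_sim(embeddings[0], embeddings[1]))
```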

:::{grid-item-card} [T-Systems-onsite/mt5-small-sum-de-en-v2](https://huggingface.co/T-Systems-onsite/mt5-small-sum-de-en-v2)
:::{card} [T-Systems-onsite/mt5-small-sum-de-en-v2](https://huggingface.co/T-Systems-onsite/mt5-small-sum-de-en-v2)
A bilingual summarization model for English and German.
It is based on the multilingual T5 model [google/mt5-small](https://huggingface.co/google/mt5-small).
:::
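
A minimal usage sketch with the `transformers` pipeline API (the generation parameter and example text are assumptions):

```python
from transformers import pipeline

# Load the bilingual (German/English) summarization model.
summarizer = pipeline("summarization", model="T-Systems-onsite/mt5-small-sum-de-en-v2")

text = "Das Erneuerbare-Energien-Gesetz regelt die bevorzugte Einspeisung von Strom ..."
summary = summarizer(text, max_length=64)
print(summary[0]["summary_text"])
```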

::::

### Datasets

::::{grid} 1
:gutter: 2

:::{grid-item-card} [The German colossal, cleaned Common Crawl corpus (GC4 corpus)](https://german-nlp-group.github.io/projects/gc4-corpus.html)
:::{card} [The German colossal, cleaned Common Crawl corpus (GC4 corpus)](https://german-nlp-group.github.io/projects/gc4-corpus.html)
This is a German text corpus which is based on [Common Crawl](https://commoncrawl.org/).
The corpus is 454 GB packed; unpacked, it is more than 1 TB.
It has been cleaned up and preprocessed and can be used for various NLP tasks.
The dataset is joint work with [Philipp Reißel](https://twitter.com/phil_ipp_)
([ambeRoad](https://amberoad.de/)).
:::

:::{grid-item-card} STSb Multi MT
:::{card} STSb Multi MT
Machine-translated versions of the [STSbenchmark dataset](https://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark)
in multiple languages, plus the English original.
Translations were done with [deepl.com](https://www.deepl.com/).\
This dataset is available on [GitHub](https://github.com/PhilipMay/stsb-multi-mt) and
as a [Hugging Face Dataset](https://huggingface.co/datasets/stsb_multi_mt).
:::
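
A minimal sketch for loading it with the Hugging Face `datasets` library (the config name `de` and the column names are assumptions based on the dataset card):

```python
from datasets import load_dataset

# Load the German train split of the machine-translated STS benchmark.
dataset = load_dataset("stsb_multi_mt", name="de", split="train")

# Each row holds a sentence pair and a similarity score.
print(dataset[0])  # e.g. {'sentence1': ..., 'sentence2': ..., 'similarity_score': ...}
```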

:::{grid-item-card} [German Backtranslated Paraphrase Dataset](https://huggingface.co/datasets/deutsche-telekom/ger-backtrans-paraphrase)
:::{card} [German Backtranslated Paraphrase Dataset](https://huggingface.co/datasets/deutsche-telekom/ger-backtrans-paraphrase)
This is a dataset of more than 21 million German paraphrases.
These are text pairs that have the same meaning but are expressed with different words.
This dataset can be used, for example, to train semantic text embeddings.
@@ -97,22 +86,23 @@ and the [MultipleNegativesRankingLoss](https://www.sbert.net/docs/package_refere
can be used.
:::
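
A minimal training sketch along those lines (the base model and the example pairs are placeholders, not material from the dataset card):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Any German base model can serve as a starting point (placeholder choice).
model = SentenceTransformer("deepset/gbert-base")

# Paraphrase pairs act as positives; other in-batch pairs act as negatives.
train_examples = [
    InputExample(texts=["Das ist ein Beispiel.", "Dies ist ein Beispiel."]),
    InputExample(texts=["Ich mag Katzen.", "Katzen gefallen mir sehr."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```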

:::{grid-item-card} [Wikipedia 2 Corpus](https://github.com/GermanT5/wikipedia2corpus)
:::{card} [Wikipedia 2 Corpus](https://github.com/GermanT5/wikipedia2corpus)
Tools to extract and clean the Wikipedia texts to transform them into a text corpus for self-supervised NLP model training.
It also includes prepared corpora for English and German.
:::

:::{grid-item-card} [NLU Evaluation Data - German and English + Similarity](https://github.com/t-systems-on-site-services-gmbh/NLU-Evaluation-Data-de-en)
:::{card} [NLU Evaluation Data - German and English + Similarity](https://github.com/t-systems-on-site-services-gmbh/NLU-Evaluation-Data-de-en)
This repository contains two datasets:

1. A labeled multi-domain (21 domains) German and
English dataset with 25K user utterances for human-robot interaction.
It is also available as a Hugging Face dataset:
[deutsche-telekom/NLU-Evaluation-Data-en-de](https://huggingface.co/datasets/deutsche-telekom/NLU-Evaluation-Data-en-de)
2. A dataset with 1,127 German sentence pairs with a similarity score. The sentences originate from the first dataset.
:::

:::{grid-item-card} [deutsche-telekom/NLU-few-shot-benchmark-en-de](https://huggingface.co/datasets/deutsche-telekom/NLU-few-shot-benchmark-en-de)
:::

:::{card} [deutsche-telekom/NLU-few-shot-benchmark-en-de](https://huggingface.co/datasets/deutsche-telekom/NLU-few-shot-benchmark-en-de)
This is a few-shot training dataset from the domain of human-robot interaction.
It contains German and English texts with 64 different utterances (classes).
Each utterance (class) has exactly 20 samples in the training set.
@@ -124,29 +114,24 @@ We are building on our
data set.
:::

::::

### Projects

::::{grid} 1
:gutter: 2

:::{grid-item-card} [XLSR – Cross-Lingual Sentence Representations](https://github.com/German-NLP-Group/xlsr)
:::{card} [XLSR – Cross-Lingual Sentence Representations](https://github.com/German-NLP-Group/xlsr)
Models and training code for cross-lingual sentence representations like
[T-Systems-onsite/cross-en-de-roberta-sentence-transformer](https://huggingface.co/T-Systems-onsite/cross-en-de-roberta-sentence-transformer)
:::

:::{grid-item-card} [LightGBM Tools](https://github.com/telekom/lightgbm-tools)
:::{card} [LightGBM Tools](https://github.com/telekom/lightgbm-tools)
This Python package implements tools for [LightGBM](https://lightgbm.readthedocs.io/).
In the current version, lightgbm-tools focuses on binary classification metrics.
:::

:::{grid-item-card} [ML-Cloud-Tools](https://github.com/telekom/ml-cloud-tools)
:::{card} [ML-Cloud-Tools](https://github.com/telekom/ml-cloud-tools)
Tools for machine learning in cloud environments.
At the moment it only provides a tool to easily handle [Amazon S3](https://aws.amazon.com/s3/).
:::

:::{grid-item-card} [Census-Income with LightGBM and Optuna](https://github.com/telekom/census-income-lightgbm)
:::{card} [Census-Income with LightGBM and Optuna](https://github.com/telekom/census-income-lightgbm)
This project uses the [census income data](https://archive-beta.ics.uci.edu/ml/datasets/census+income) and
fits [LightGBM](https://lightgbm.readthedocs.io/) models on it.
It is not intended to deliver top results, but rather to serve as a demo of the interaction between
@@ -156,42 +141,37 @@ We also calculate the feature importances
with [SHAP (SHapley Additive exPlanations)](https://github.com/slundberg/shap).
:::
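
The LightGBM/Optuna interplay looks roughly like this (a minimal sketch with a stand-in dataset and assumed parameter ranges, not the repository's actual code):

```python
import lightgbm as lgb
import optuna
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in data; the repository uses the census income dataset instead.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

def objective(trial):
    # Optuna suggests hyperparameters; LightGBM trains with them.
    params = {
        "objective": "binary",
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 15, 127),
    }
    booster = lgb.train(params, lgb.Dataset(X_train, label=y_train))
    preds = (booster.predict(X_val) > 0.5).astype(int)
    return accuracy_score(y_val, preds)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```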

:::{grid-item-card} [S.M.A.R.T. Prometheus Metrics Exporter](https://github.com/PhilipMay/smart-prom-next)
:::{card} [S.M.A.R.T. Prometheus Metrics Exporter](https://github.com/PhilipMay/smart-prom-next)
smart-prom-next is a [Prometheus](https://prometheus.io/docs/introduction/overview/) metric exporter for
[S.M.A.R.T.](https://en.wikipedia.org/wiki/S.M.A.R.T.) values of hard disks.
:::

:::{grid-item-card} [MLflow Image](https://github.com/PhilipMay/mlflow-image)
:::{card} [MLflow Image](https://github.com/PhilipMay/mlflow-image)
The MLflow Docker image.\
MLflow does not provide an official Docker image. This project fills that gap.
:::

:::{grid-item-card} [Lazy-Imports](https://github.com/telekom/lazy-imports)
:::{card} [Lazy-Imports](https://github.com/telekom/lazy-imports)
Python tool to support lazy imports
:::

:::{grid-item-card} [Style-Doc](https://github.com/telekom/style-doc)
:::{card} [Style-Doc](https://github.com/telekom/style-doc)
This is Black for Python docstrings and reStructuredText (rst). It can be used to format
docstrings ([Google docstring format](https://github.com/google/styleguide/blob/gh-pages/pyguide.md#38-comments-and-docstrings))
in Python files or [reStructuredText](https://www.sphinx-doc.org/en/master/usage/restructuredtext/index.html).
:::

:::{grid-item-card} [PyCharm Community Edition IDE for Python with bundled JRE](https://aur.archlinux.org/packages/pycharm-community-jre)
:::{card} [PyCharm Community Edition IDE for Python with bundled JRE](https://aur.archlinux.org/packages/pycharm-community-jre)
An [Arch Linux](https://archlinux.org/) package ([AUR](https://wiki.archlinux.org/title/Arch_User_Repository))
:::

:::{grid-item-card} [conda-forge/hyperopt-feedstock](https://github.com/conda-forge/hyperopt-feedstock)
:::{card} [conda-forge/hyperopt-feedstock](https://github.com/conda-forge/hyperopt-feedstock)
[conda-forge](https://conda-forge.org/) release of [Hyperopt](https://github.com/hyperopt/hyperopt)
:::

::::

### Pull Requests

::::{grid} 1
:gutter: 2

:::{grid-item-card} [Hugging Face / Transformers](https://github.com/huggingface/transformers)
:::{card} [Hugging Face / Transformers](https://github.com/huggingface/transformers)

- add classifier_dropout to classification heads: [#12794](https://github.com/huggingface/transformers/pull/12794)
- add option for subword regularization in sentencepiece tokenizer (see the sketch after this list): [#11149](https://github.com/huggingface/transformers/pull/11149),
@@ -200,34 +180,38 @@ An [Arch Linux](https://archlinux.org/) package ([AUR](https://wiki.archlinux.org/title/Arch_User_Repository))
- refactor slow sentencepiece tokenizers and add tests: [#11716](https://github.com/huggingface/transformers/pull/11716),
[#11737](https://github.com/huggingface/transformers/pull/11737)
- [more fixes and improvements](https://github.com/huggingface/transformers/pulls?q=is%3Apr+author%3APhilipMay)
:::
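
The subword regularization option added in #11149 can be used roughly like this (a sketch; the `sp_model_kwargs` keys follow the SentencePiece documentation and are assumptions here):

```python
from transformers import XLNetTokenizer

# Enable SentencePiece subword regularization (sampling) on a slow tokenizer.
tokenizer = XLNetTokenizer.from_pretrained(
    "xlnet-base-cased",
    sp_model_kwargs={"enable_sampling": True, "nbest_size": -1, "alpha": 0.1},
)

# With sampling enabled, repeated calls may yield different segmentations.
print(tokenizer.tokenize("subword regularization"))
print(tokenizer.tokenize("subword regularization"))
```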

:::{grid-item-card} [Optuna](https://github.com/optuna/optuna)
:::

:::{card} [Optuna](https://github.com/optuna/optuna)

- add MLflow integration callback (see the sketch after this list): [#1028](https://github.com/optuna/optuna/pull/1028)
- warn when trial-level suggest is used for the same variable with different parameters: [#908](https://github.com/optuna/optuna/pull/908)
- [more fixes and improvements](https://github.com/optuna/optuna/pulls?q=is%3Apr+author%3APhilipMay)
:::
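
The MLflow callback from #1028 is used roughly like this (a minimal sketch with a toy objective; argument names per the Optuna docs):

```python
import optuna
from optuna.integration.mlflow import MLflowCallback

# Logs each trial's parameters and objective value to MLflow.
mlflc = MLflowCallback(metric_name="value")

def objective(trial):
    x = trial.suggest_float("x", -10, 10)
    return (x - 2) ** 2  # toy objective

study = optuna.create_study()
study.optimize(objective, n_trials=10, callbacks=[mlflc])
```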

:::{grid-item-card} [Sentence Transformers](https://github.com/UKPLab/sentence-transformers)
:::

:::{card} [Sentence Transformers](https://github.com/UKPLab/sentence-transformers)

- add callback so we can do pruning and check for NaN values: [#327](https://github.com/UKPLab/sentence-transformers/pull/327)
- add option to pass params to tokenizer: [#342](https://github.com/UKPLab/sentence-transformers/pull/342)
- always store best_score: [#439](https://github.com/UKPLab/sentence-transformers/pull/439)
- fix for OOM problems on GPU with large datasets: [#525](https://github.com/UKPLab/sentence-transformers/pull/525)
:::

:::{grid-item-card} [SetFit - Efficient Few-shot Learning with Sentence Transformers](https://github.com/huggingface/setfit)
:::

:::{card} [SetFit - Efficient Few-shot Learning with Sentence Transformers](https://github.com/huggingface/setfit)

- add option to normalize embeddings [#177](https://github.com/huggingface/setfit/pull/177)
- add option to set `samples_per_label` [#196](https://github.com/huggingface/setfit/pull/196)
- add warmup_proportion param - make warmup_steps configurable [#140](https://github.com/huggingface/setfit/pull/140)
- add option to use amp / FP16 [#134](https://github.com/huggingface/setfit/pull/134) (see the sketch after this list)
- add num_epochs to train_step calculation [#139](https://github.com/huggingface/setfit/pull/139)
- add more loss function options [#159](https://github.com/huggingface/setfit/pull/159)
:::
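
Several of these options show up in the (since superseded) `SetFitTrainer` API, roughly like this (a sketch under the assumption that the listed parameters are available in this form; the tiny dataset is a placeholder):

```python
from datasets import Dataset
from setfit import SetFitModel, SetFitTrainer

# Placeholder few-shot training data.
train_ds = Dataset.from_dict({
    "text": ["great product", "terrible support", "love it", "awful experience"],
    "label": [1, 0, 1, 0],
})

model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_ds,
    use_amp=True,            # FP16 training (#134)
    warmup_proportion=0.1,   # configurable warmup (#140)
    samples_per_label=2,     # pair sampling control (#196)
)
trainer.train()
```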

:::{grid-item-card} Other Fixes and Improvements
:::

:::{card} Other Fixes and Improvements

- [google-research/electra](https://github.com/google-research/electra): add toggle to turn off `strip_accents` [#88](https://github.com/google-research/electra/pull/88)
- [opensearch-project/opensearch-py](https://github.com/opensearch-project/opensearch-py):
@@ -236,27 +220,21 @@ An [Arch Linux](https://archlinux.org/) package ([AUR](https://wiki.archlinux.org/title/Arch_User_Repository))
- [deepset-ai/FARM](https://github.com/deepset-ai/FARM): [various fixes and improvements](https://github.com/deepset-ai/FARM/pulls?q=is%3Apr+author%3APhilipMay)
- [hyperopt/hyperopt](https://github.com/hyperopt/hyperopt): add progressbar with tqdm [#455](https://github.com/hyperopt/hyperopt/pull/455)
- [mlflow/mlflow](https://github.com/mlflow/mlflow): add possibility to use client cert. with tracking API [#2843](https://github.com/mlflow/mlflow/pull/2843)
:::

::::
:::

### Archived Projects

::::{grid} 1
:gutter: 2

:::{grid-item-card} [HPOflow](https://github.com/telekom/HPOflow)
:::{card} [HPOflow](https://github.com/telekom/HPOflow)
Tools for [Optuna](https://optuna.readthedocs.io/),
[MLflow](https://www.mlflow.org/docs/latest/index.html) and
the integration of both
:::

:::{grid-item-card} [Transformer-Tools](https://github.com/telekom/transformer-tools)
:::{card} [Transformer-Tools](https://github.com/telekom/transformer-tools)
Tools for [Hugging Face / Transformers](https://github.com/huggingface/transformers)
:::

::::

%# Indices and tables
%
%- {ref}`genindex`
