A two-model system for reasonable text generation (via vector scoring)
"Think before you speak"
This repo demonstrates how to use two language models (LMs) to achieve more lucid and coherent text generations.
Generated candidates are scored against the corpus vectors: the best-scoring texts are given output priority, while the worst-scoring ones get filtered out.
The system pairs two models:

- `Causal-LM` for text generation (e.g. `distilgpt2`)
- `Masked-LM` for generated-text criticism (e.g. `distilroberta`)
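
A minimal setup sketch, assuming Hugging Face `transformers` and the example checkpoints above (`distilroberta-base` is assumed here as the hub id for `distilroberta`):

```python
from transformers import (
    AutoModelForCausalLM,
    AutoModelForMaskedLM,
    AutoTokenizer,
)

# Causal-LM: proposes candidate texts
gen_tok = AutoTokenizer.from_pretrained("distilgpt2")
gen_lm = AutoModelForCausalLM.from_pretrained("distilgpt2")

# Masked-LM: embeds and criticizes the candidates
critic_tok = AutoTokenizer.from_pretrained("distilroberta-base")
critic_lm = AutoModelForMaskedLM.from_pretrained("distilroberta-base")
```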
The end-to-end workflow is:

- Select a specific domain to build generators for (e.g. Machine Learning Ideas)
- Acquire a text corpus for the domain at hand (e.g. `aalksii/ml-arxiv-papers`)
- Fine-tune the `Causal-LM` on the corpus (e.g. finetune_causal.ipynb; sketched after this list)
- Fine-tune the `Masked-LM` on the corpus (e.g. finetune_masked.ipynb)
- Compute the `Masked-LM` vectors on the corpus, check their quality, and save them (e.g. vectors.ipynb; sketched after this list)
- Determine the generation objective of the `Causal-LM` w.r.t. the embeddings of the `Masked-LM` (sketched after this list); for example:
  - Novelty: a generated ML idea should be at least `0.05` cosine distance away from any existing idea vector
  - Feasibility: a generated ML idea should not be too isolated; it should have at least `10` neighbors within `0.1` cosine distance
- Generate texts from the `Causal-LM`; only output those that pass the objective (e.g. generation.ipynb; sketched after this list)
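
The fine-tuning step might look like the sketch below. The notebooks are the source of truth; the `abstract` column name is an assumption about the `aalksii/ml-arxiv-papers` schema.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

corpus = load_dataset("aalksii/ml-arxiv-papers", split="train")

tok = AutoTokenizer.from_pretrained("distilgpt2")
tok.pad_token = tok.eos_token  # GPT-2 tokenizers ship without a pad token

def tokenize(batch):
    # "abstract" is an assumed column name; adjust to the actual schema
    return tok(batch["abstract"], truncation=True, max_length=256)

tokenized = corpus.map(tokenize, batched=True, remove_columns=corpus.column_names)

trainer = Trainer(
    model=AutoModelForCausalLM.from_pretrained("distilgpt2"),
    args=TrainingArguments(output_dir="causal-ft", num_train_epochs=1),
    train_dataset=tokenized,
    # mlm=False gives the plain next-token (causal) objective
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```

The `Masked-LM` pass is the same recipe with `AutoModelForMaskedLM` and `DataCollatorForLanguageModeling(tok, mlm=True)`, i.e. random-token masking instead of next-token prediction.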
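
For the vector step, one plausible reading (vectors.ipynb has the actual code) is mean pooling over the encoder's last hidden states, L2-normalized so cosine math reduces to dot products:

```python
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("distilroberta-base")  # or the fine-tuned dir
encoder = AutoModel.from_pretrained("distilroberta-base")

@torch.no_grad()
def embed(texts):
    enc = tok(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**enc).last_hidden_state           # (batch, seq, dim)
    mask = enc["attention_mask"].unsqueeze(-1).float()  # zero out padding positions
    pooled = (hidden * mask).sum(1) / mask.sum(1)       # mean over real tokens only
    return torch.nn.functional.normalize(pooled, dim=-1).numpy()

vectors = embed(["A hypothetical ML idea.", "Another hypothetical ML idea."])
np.save("corpus_vectors.npy", vectors)  # assumed file name
```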
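
The novelty/feasibility objective then reduces to distance checks against the saved corpus vectors. A sketch with the thresholds quoted above, assuming unit-normalized vectors (so cosine distance is 1 minus the dot product):

```python
import numpy as np

def passes_objective(candidate_vec, corpus_vecs,
                     novelty_min=0.05, radius=0.1, min_neighbors=10):
    dists = 1.0 - corpus_vecs @ candidate_vec            # cosine distances to all ideas
    novel = dists.min() >= novelty_min                   # far enough from every existing idea
    feasible = (dists <= radius).sum() >= min_neighbors  # enough close neighbors to be plausible
    return novel and feasible
```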
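
Finally, generation.ipynb presumably ties the pieces together along these lines, reusing `gen_tok`/`gen_lm` from the setup sketch and `embed`/`vectors`/`passes_objective` from the sketches above (the prompt is only an illustration):

```python
outputs = gen_lm.generate(
    **gen_tok("A new method for", return_tensors="pt"),
    do_sample=True,
    top_p=0.95,
    max_new_tokens=60,
    num_return_sequences=8,
    pad_token_id=gen_tok.eos_token_id,
)
candidates = gen_tok.batch_decode(outputs, skip_special_tokens=True)

# Keep only candidates the Masked-LM critic scores as novel and feasible
accepted = [c for c in candidates if passes_objective(embed([c])[0], vectors)]
```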
There is also potential for Reinforcement Learning-inspired improvements here, e.g. treating the objective's pass/fail signal as a reward for the `Causal-LM`.