Markovian bash text generator.
We'll use a dataset of Cioran quotes...
grep -Eo "[^ ]+ " dataset.txt
With grep
you can filter texts using REGEX. In this case, we use extended REGEX (-E) and print only matches (-o).
"[^ ]+ "
Selects a group []+
of any characters excluding space ^
followed by a space " "
.
The result is a word per line, including punctuation:
It
is
not
worth
the
bother
of
killing
yourself,
...
grep -Eo "[^ ]+ " dataset.txt | sort #| uniq -c
We'll pipe the output to a sort
:
-
–
—
a
a
a
...
your
your
yourself
yourself
yourself,
zoologist
This way, all occurrences will be together. Uncomment the uniq
command to filter out repetitions and, also, count them -c
.
Tip: we can pass the uniq
output to another sort -n
and get them in incremental order:
...
11 all
12 that
16 in
20 a
21 and
21 is
25 of
30 the
31 I
32 to
We will not use the number of occurrences BUT if we pick a random word from a set which includes all repetitions, the probabilities of each word will depend on it's occurrences. We're working with probabilities without doing any math!
Take a look at this command:
grep -Eo "[^ ]+ " dataset.txt | shuf
We just shuffled all lines!
world,
we
goal!
—That's
living,
...
Let's pick just the first one with head
:
grep -Eo "[^ ]+ " dataset.txt | head -n 1
So far, we're ready to make the cheapest IA:
grep -Eo "[^ ]+ " dataset.txt | shuf | head -n 3 # pick three random words
understand
What
not
And put them together in a "sentence":
grep -Eo "[^ ]+ " dataset.txt | shuf | head -n 3 | tr -d "\n" # remove new lines
absolutely their nothing;
That was nice but sentences can have more than three words and usualy end with a period. Let's look for endings in the dataset:
grep -Eo "[^ ]\." dataset.txt # Show groups of characters excluding spaces followed by a period
The period is a REGEX reserved character, so we must escape it to match a litteral
late.
postponed.
optimists.
dreams.
myself.
...
Great! Run this oneliner and see what happens:
while true; do WORD=$(grep -Eo "[^ ]+ " dataset.txt | shuf | head -n 1); echo -n "$WORD"; [ "$(echo "$WORD" | grep "\." )" != "" ] && break; done
and on a to when lives in metaphysically only and your roots.
It generates a text of random lenght because loops until a word with a period is found.
Here is the same code but clearer:
while true # starts a loop
do
WORD=$(grep -Eo "[^ ]+ " dataset.txt | shuf | head -n 1) # pick a random word
echo -n "$WORD" # print the word, no new line
[ "$(echo "$WORD" | grep "\." )" != "" ] && break # break the loop if there's a period in the word
done
Now that we have an ending, we need a proper start. We can see all words starting with capital letters:
grep -Eo "[A-Z][^ ]+ " dataset.txt
It
Only
The
Wouldn’t
Then
...
Incorporate this to the previous script:
echo -n "$(grep -Eo "[A-Z][^ ]+ " dataset.txt | shuf | head -n 1)" # start with a word with capital letter
while true # loop
do
WORD=$(grep -Eo "[^ ]+ " dataset.txt | shuf | head -n 1) # pick random word
echo -n "$WORD" # print it, no new line
[ "$(echo "$WORD" | grep "\." )" != "" ] && break # break if it has a period
done
We can do better than this. What if we filter words that follows a particular word:
grep -Eo "you [^ ]+ " dataset.txt #| sed "s/you //g"
Uncomment the sed
instruction to remove the first word
you always
you do
you have
you are
you would
you achieve
you cry
you are
A Markov chain is a sequence of possible events that depend only on the previous state. If we could pick a random word but in the set of words that follow the last word, sentences will make a little more "sence".
Let's include all we learned in a final script:
#!/bin/bash
############################
# Markovian Text Generator #
############################
# Select init word with capital letter
WORD="$(grep -Eo "[A-Z][^ ]+ " dataset.txt | shuf | head -n 1)"
echo -n "$WORD"
# Generation loop
while true
do
# Select next word in the set of words that follow previous word
WORD=$(grep -Eo "$WORD[^ ]+ " dataset.txt | sed 's/'"$WORD"'//g' | shuf | head -n 1)
# If no match, break
[ "$WORD" == "" ] && break
# Print word
echo -n "$WORD"
# If string ends with period, break
[ "$(echo "$WORD" | grep "\." )" != "" ] && break
done
# New line
echo
I'll leave you with a final thought from our markovian Cioran:
I'd like someone who observed gorillas in philosophical systems, fight for novelty is all too weak for weaknesses in everyday, in philosophical systems, fight for moral and dreams.