Removed conversion manager.
Main function now:
- creates the parser, and calls the parse method passing the input reader as an argument,
- gets the parsed text, and
- sends the parsed text to the output writers.
Changed parser to receive an input reader and parse the whole stream.
Changed tokenizer to receive an input reader and generate tokens for the whole stream.
Removed source text and source location from the tokenizer, as this information could be obtained from the AST.
Updated grammar to include a list of sentences at the top level.
Updated README.md.
rturrado committed Jan 17, 2023
1 parent 48a28ec commit ce4eddf
Showing 12 changed files with 434 additions and 395 deletions.
58 changes: 20 additions & 38 deletions README.md
@@ -239,10 +239,11 @@ The `main` function logic is quite simple:
- Parses the command line options.
- Creates an input reader.
- Creates a stream output writer (that will write to standard output), and, if requested by the user, a file output writer.
- Runs the conversion manager, passing it the reader and the writers.
- Creates a parser, passing the input reader as an argument, and calls its parse method to receive the parsed text.
- Sends the parsed text to the output writers.
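A minimal sketch of this flow, with plain streams standing in for the `input_reader`/`output_writer` objects and a stub `parse` in place of the real parser (all names here are illustrative, not the project's API):

```cpp
#include <cassert>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Stub for the real parser: it would convert number words to digits here;
// this sketch just forwards the whole stream unchanged.
std::string parse(std::istream& is) {
    std::ostringstream oss;
    oss << is.rdbuf();
    return oss.str();
}

// Parse the whole input once, then fan the parsed text out to every writer.
void run(std::istream& input, const std::vector<std::ostream*>& writers) {
    const std::string parsed_text{ parse(input) };
    for (auto* writer : writers) {
        *writer << parsed_text;
    }
}
```

Note how the parsed text is computed once and shared by all writers, which is what makes adding a second (file) writer cheap.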

Exceptions thrown either during the parsing of the command line options, or while creating the reader or the writers, are captured,
and make the program terminate.<br/>
Exceptions thrown during the parsing of the command line options, while creating the reader or the writers, or by the parser,
are captured, and make the program terminate.<br/>
Both readers and writers are implemented as runtime polymorphic objects. A pure virtual base class, e.g. `input_reader` defines an interface,
and concrete classes, e.g. `file_reader`, implement that interface.
Using polymorphic readers is not mandatory for the task, but makes the implementation symmetric to that of the writers.
@@ -276,44 +277,27 @@ Again, each concrete class holds a stream, in this case an output stream.<br/>
The `file_writer` constructor just checks that the file stream is good. It doesn't check whether the file already exists.
The base class just exposes one `write` method, which grabs the output stream and writes a text to it.

#### Conversion manager

The `conversion_manager`:
- reads an input text from an `input_reader`,
- processes it using a `parser`, and
- writes it out to a list of `output_writer`s.

It basically contains a static `run` function that:
- Keeps reading sentences from an `input_reader` until the end of the file is reached.
- Texts that do not form a sentence (i.e. that do not end in a period) are not converted. All the texts are written out though.
- Every input sentence that needs to be processed is sent to the `parser`, and the result of this parsing is appended to an output sentence.
- Once an input sentence has been processed, the output sentence is sent out to the different writers.

#### Tokenizer

The `tokenizer` receives a text upon construction, and regex searches it for different patterns (space, dash, period, or word).
This search is done at `operator()`, a coroutine that yields the found tokens back to the caller.
The `tokenizer` receives an `input_reader` upon construction, and keeps reading sentences from it until the end of the stream is reached.
Every sentence is regex searched for different patterns (space, dash, period, or word).
The reading of sentences is done at `operator()`, and the regex searches at `get_next_token()`.
Both methods form a nested coroutine that yields the found tokens back to the caller.
Notice that text not fitting any of the patterns will still be captured, either as a prefix of the search operation
or as a remainder of the search loop, and yielded as a token of type `other`.
Once the input text has been completely processed, an `end` token is yielded.

For debugging purposes, the `tokenizer` also keeps track of a *source location*,
an offset to the start of the returned token within the input text.
Once the stream has been completely processed, an `end` token is yielded.
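A non-coroutine sketch of that scan (the token and lexeme names are assumptions; the real `tokenizer` yields these lazily from a coroutine instead of collecting them in a vector):

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

enum class lexeme { space, dash, period, word, other, end };
struct token { lexeme lex; std::string text; };

std::vector<token> tokenize(const std::string& sentence) {
    // One alternative per pattern: space, dash, period, word.
    static const std::regex pattern{ R"((\s+)|(-)|(\.)|([a-zA-Z]+))" };
    std::vector<token> ret{};
    auto last_end = sentence.cbegin();
    for (auto it = std::sregex_iterator{ sentence.begin(), sentence.end(), pattern };
         it != std::sregex_iterator{}; ++it) {
        const auto& m = *it;
        if (m.prefix().length() > 0) {  // unmatched prefix -> 'other' token
            ret.push_back({ lexeme::other, m.prefix().str() });
        }
        const lexeme lex = m[1].matched ? lexeme::space
                         : m[2].matched ? lexeme::dash
                         : m[3].matched ? lexeme::period
                         : lexeme::word;
        ret.push_back({ lex, m.str() });
        last_end = sentence.begin() + m.position() + m.length();
    }
    if (last_end != sentence.cend()) {  // unmatched remainder -> 'other' token
        ret.push_back({ lexeme::other, std::string{ last_end, sentence.cend() } });
    }
    ret.push_back({ lexeme::end, "" });  // end-of-input token
    return ret;
}
```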

#### Lexer

The `lexer` hides the `tokenizer` implementation from the `parser`, and offers:
- two main methods: `advance_to_next_token` and `get_current_token`,
- two helper methods: `get_current_lexeme` and `get_current_text` to access the two members of a token, and
- two methods for debugging purposes: `get_source_text` and `get_source_location`.
- two main methods: `advance_to_next_token` and `get_current_token`, and
- two helper methods: `get_current_lexeme` and `get_current_text` to access the two members of a token.
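A sketch of that facade (a pre-filled vector stands in for the tokenizer coroutine; the method names mirror the list above, while the token shape is an assumption):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

enum class lexeme { word, period, end };
struct token { lexeme lex; std::string text; };

class lexer {
    std::vector<token> tokens_{};  // stands in for the tokenizer coroutine
    std::size_t pos_{ 0 };
public:
    explicit lexer(std::vector<token> tokens) : tokens_{ std::move(tokens) } {}
    void advance_to_next_token() {
        if (pos_ + 1 < tokens_.size()) { ++pos_; }  // stick at the 'end' token
    }
    [[nodiscard]] const token& get_current_token() const { return tokens_[pos_]; }
    [[nodiscard]] lexeme get_current_lexeme() const { return tokens_[pos_].lex; }
    [[nodiscard]] const std::string& get_current_text() const { return tokens_[pos_].text; }
};
```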

#### Parser

The `parser`:
- is constructed from an input text corresponding to a sentence, i.e., a text ending in a period character;
- creates a `lexer`, passing it this input text, and an `AST` (Abstract Syntax Tree);
- calls a `start` method, where all the parsing is effectively done, and
- returns an output text via the `AST`.
The `parser` is constructed from an `input_reader`. It creates a `lexer`, passing it this input reader, and an `AST` (Abstract Syntax Tree).
The `parse` method calls a `start` method, where all the parsing is effectively done, and
returns an output text via the `AST`.

The `start` method is the entry point to a recursive descent parser implementation, based on an LL(1) grammar.
Typical recursive descent parser implementations define a function for each element of the grammar.
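A toy illustration of that pattern over a deliberately tiny grammar (not the project's): one function per rule, each consuming input and recursing.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Toy grammar, one function per rule (illustrative only):
//   digits         ::= digit rest_of_digits
//   rest_of_digits ::= digits | nothing
struct parser_state {
    std::string input{};
    std::size_t pos{ 0 };
};

bool parse_digit(parser_state& s) {
    if (s.pos < s.input.size() && std::isdigit(static_cast<unsigned char>(s.input[s.pos]))) {
        ++s.pos;  // consume one character and advance
        return true;
    }
    return false;
}

bool parse_digits(parser_state& s) {
    if (!parse_digit(s)) { return false; }  // digits ::= digit ...
    parse_digits(s);  // rest_of_digits ::= digits | nothing (stop on failure)
    return true;
}
```

With a single character of lookahead deciding each branch, this is exactly the LL(1) property the README's grammar is written for.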
@@ -326,15 +310,13 @@ Each of these functions can:

#### Abstract Syntax Tree

The `AST` is implemented as a vector of 2 types of nodes:
- text nodes, and
- number expression nodes.

Number expression nodes, likewise, are implemented as a vector of 2 types of nodes:
- integer nodes, and
- text nodes.
The `AST` is implemented as a vector of sentence nodes.
A sentence node, likewise, is implemented as a vector of two types of nodes: text nodes and number expression nodes.
And number expression nodes are, in turn, vectors of two types of nodes: text nodes and integer nodes.

The `AST` composes the output text for the `parser` by:
The `AST` offers two APIs: `dump()` and `evaluate()`. The only difference between these two methods is at the number expression level.
Dumping a number expression returns the original input text for that expression,
while evaluating a number expression performs the conversion from words to numbers. The `AST` performs this evaluation by:
- walking the vector of nodes,
- concatenating the text nodes, and
- for the case of a number expression, concatenating the value of the expression.
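A condensed sketch of that walk over a simplified node layout (the real `ast.h` below adds sentence nodes, a `dump()` counterpart, and the number-to-word map):

```cpp
#include <cassert>
#include <string>
#include <variant>
#include <vector>

struct text_node {
    std::string data{};
    [[nodiscard]] std::string evaluate() const { return data; }
};

struct int_node {
    int data{};
    [[nodiscard]] std::string evaluate() const { return std::to_string(data); }
};

using node_t = std::variant<text_node, int_node>;

// Walk the nodes, concatenating text as-is and number expressions as digits.
std::string evaluate(const std::vector<node_t>& sentence) {
    std::string ret{};
    for (const auto& node : sentence) {
        std::visit([&ret](const auto& n) { ret += n.evaluate(); }, node);
    }
    return ret;
}
```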
91 changes: 77 additions & 14 deletions include/word_converter/ast.h
@@ -5,17 +5,54 @@
#include <numeric> // accumulate
#include <stdexcept> // runtime_error
#include <string> // to_string
#include <unordered_map>
#include <variant> // visit
#include <vector>


inline static const std::unordered_map<int, std::string> number_to_word_map{
{ 0, "zero" }, // zero
{ 1, "one" }, // one
{ 2, "two" }, // two to nine
{ 3, "three" },
{ 4, "four" },
{ 5, "five" },
{ 6, "six" },
{ 7, "seven" },
{ 8, "eight" },
{ 9, "nine" },
{ 10, "ten" }, // ten to nineteen
{ 11, "eleven" },
{ 12, "twelve" },
{ 13, "thirteen" },
{ 14, "fourteen" },
{ 15, "fifteen" },
{ 16, "sixteen" },
{ 17, "seventeen" },
{ 18, "eighteen" },
{ 19, "nineteen" },
{ 20, "twenty" }, // tens
{ 30, "thirty" },
{ 40, "forty" },
{ 50, "fifty" },
{ 60, "sixty" },
{ 70, "seventy" },
{ 80, "eighty" },
{ 90, "ninety" },
{ 100, "hundred" }, // a hundred
{ 1'000, "thousand" }, // a thousand
{ 1'000'000, "million" }, // a million
{ 1'000'000'000, "billion" } // a billion
};


struct invalid_number_expression_error : public std::runtime_error {
explicit invalid_number_expression_error(const std::string& message) : std::runtime_error{ "" } {
message_ += fmt::format("'{}'", message);
explicit invalid_number_expression_error(const std::string& number_expression_str) : std::runtime_error{ "" } {
message_ += fmt::format("'{}'", number_expression_str);
}
[[nodiscard]] const char* what() const noexcept override { return message_.c_str(); };
private:
std::string message_{ "invalid number expression error: " };
std::string message_{ "invalid number expression: " };
};


@@ -51,19 +88,21 @@ namespace ast {
struct text_node {
std::string data{};
explicit text_node(std::string text) : data{ std::move(text) } {}
[[nodiscard]] std::string to_string() const { return data; }
[[nodiscard]] std::string dump() const { return data; }
[[nodiscard]] std::string evaluate() const { return data; }
};


struct int_node {
int data{};
explicit int_node(int value) : data{ value } {}
[[nodiscard]] std::string to_string() const { return std::to_string(data); }
[[nodiscard]] std::string dump() const { return number_to_word_map.at(data); }
[[nodiscard]] std::string evaluate() const { return std::to_string(data); }
};


class number_expression_node {
using node_t = std::variant<int_node, text_node>;
using node_t = std::variant<text_node, int_node>;
using nodes_t = std::vector<node_t>;
private:
nodes_t nodes_{};
@@ -80,7 +119,14 @@ class number_expression_node {
});
return numbers_stack.value();
}
[[nodiscard]] std::string to_string() const {
[[nodiscard]] std::string dump() const {
std::string ret{};
std::ranges::for_each(nodes_, [&ret](auto&& node) {
std::visit([&ret](auto&& arg) { ret += arg.dump(); }, node);
});
return ret;
}
[[nodiscard]] std::string evaluate() const {
if (nodes_.empty()) {
return {};
}
@@ -103,24 +149,41 @@ class sentence_node {
void add(node_t n) {
nodes_.push_back(std::move(n));
}
[[nodiscard]] std::string to_string() const {
[[nodiscard]] std::string dump() const {
std::string ret{};
std::ranges::for_each(nodes_, [&ret](auto&& node) {
std::visit([&ret](auto&& arg) { ret += arg.to_string(); }, node);
std::visit([&ret](auto&& arg) { ret += arg.dump(); }, node);
});
return ret;
}
[[nodiscard]] std::string evaluate() const {
std::string ret{};
std::ranges::for_each(nodes_, [&ret](auto&& node) {
std::visit([&ret](auto&& arg) { ret += arg.evaluate(); }, node);
});
return ret;
}
};


class tree {
sentence_node start_;
using node_t = sentence_node;
using nodes_t = std::vector<node_t>;
private:
nodes_t nodes_{};
public:
void add(sentence_node n) {
start_ = std::move(n);
void add(node_t n) {
nodes_.push_back(std::move(n));
}
[[nodiscard]] std::string to_string() const {
return start_.to_string();
[[nodiscard]] std::string dump() const {
return std::accumulate(nodes_.begin(), nodes_.end(), std::string{}, [](const auto& total, const auto& node) {
return total + node.dump();
});
}
[[nodiscard]] std::string evaluate() const {
return std::accumulate(nodes_.begin(), nodes_.end(), std::string{}, [](const auto& total, const auto& node) {
return total + node.evaluate();
});
}
};

31 changes: 0 additions & 31 deletions include/word_converter/conversion_manager.h

This file was deleted.

8 changes: 7 additions & 1 deletion include/word_converter/grammar.ebnf
@@ -1,11 +1,17 @@
start ::= sentence
start ::= sentences

sentences ::= sentence rest_of_sentences
rest_of_sentences ::= sentences
| nothing

sentence ::= sentence_prefix sentence_body
sentence_prefix ::= text_without_number_expressions
sentence_body ::= number_expression rest_of_sentence_body
| period
| end
rest_of_sentence_body ::= text_without_number_expression sentence_body
| period
| end

text_without_number_expressions ::= text_without_number_expression text_without_number_expressions
| nothing
3 changes: 3 additions & 0 deletions include/word_converter/input_reader.h
@@ -72,3 +72,6 @@ class stream_reader : public input_reader {
return is_;
}
};


using input_reader_up = std::unique_ptr<input_reader>;
