In this assignment, your task is to re-implement the Wordlist
class from
Assignment 1, this time using an AVL tree as its underlying representation.
Vectors, arrays, linked lists, or other container data structures are not
allowed.
Recall that the Wordlist
class stores a count of how many time words appear in
a file.
When it's done, you'll be able to write code like this:
Wordlist lst("tiny_shakespeare.txt");
lst.print_stats();
Which prints:
Number of different words: 25670
Total number of words: 202651
Most frequent word: the 5437
Number of singletons: 14919 (58%)
All the code you'll submit for this assignment goes in Wordlist.h.
Don't put main
in Wordlist.h. Instead, put main
in
a5_main.cpp, along with all the code you need to test your
Wordlist
class.
Be sure to thoroughly test your code before submitting it!
Note You can download all the files for this assignment in a single .zip archive from the Github repository for the course. Click on the green "Code" button, and then click on "Download ZIP".
Wordlist.h contains the class Wordlist
where you should
implement all the virtual methods listed in the Wordlist_base
class in
Wordlist_base.h.
Most of the methods in Wordlist_base
are virtual and abstract, and so you
must write your own version of them in Wordlist
. A couple of methods, such
as print_stats
, are not virtual
and have implementations that you can't
change, i.e. your Wordlist
class must work correctly with them as given.
Do not change Wordlist_base.h in any way: keep it as-is.
Put your implementation of Wordlist
in Wordlist.h. It must
publicly inherit from Wordlist_base
, and use the Node
struct
(given in
Wordlist
) to implement an AVL tree.
Don't use vectors, arrays, linked lists or any other container data structures in Wordlist.h.
In addition to the methods listed in Wordlist_base
, in Wordlist
write a
default constructor that takes no parameters and creates an empty Wordlist
object:
Wordlist lst;
// ... lst is an empty Wordlist object ...
Also, write a constructor that takes the name of a file as input, and adds all
the words in that file to the Wordlist
object. Read the words from the file
using C++'s standard <<
operator. When implemented, you'll be able to write
code like in the example above:
Wordlist lst("tiny_shakespeare.txt");
lst.print_stats();
Write a destructor for Wordlist
that de-allocates all the nodes in the list.
In Wordlist_base
, the destructor is called ~Wordlist_base()
, and the one you
write for Wordlist
should be called ~Wordlist()
.
You can use the test_read()
function in a5_main.cpp to test
your code. For example, small.txt contains the following text:
This is
a test
or is this
a test?
When you run this code:
// ...
void test_read()
{
Wordlist lst;
string w;
while (cin >> w)
{
lst.add_word(w);
}
lst.print_words();
cout << endl;
lst.print_stats();
}
int main()
{
test_read();
}
The output is:
❯ ./a5_main < small.txt
1. {"This", 1}
2. {"a", 2}
3. {"is", 2}
4. {"or", 1}
5. {"test", 1}
6. {"test?", 1}
7. {"this", 1}
Number of different words: 7
Total number of words: 9
Most frequent word: a 2
Number of singletons: 5 (71%)
Notice that case matters, e.g. "This"
and "this"
are counted as
different words. Also, punctuation matters, e.g. "test"
and "test?"
are
counted as different.
Note Real life programs would likely strip out punctuation and ignore letter case, but in this assignment we want to count every word exactly as it appears in the file. This makes the code a littler simpler to implement.
Here's another example using the larger file tiny_shakespeare.txt:
❯ ./a5_main < tiny_shakespeare.txt >tiny_shakespeare_out
On average computer, this should run in a couple of seconds.
There's more than 25,000 lines of output, and so the example uses
>tiny_shakespeare_out
to re-direct the output to the file
tiny_shakespeare_out:
1. {"&C:", 2}
...
25670. {"zodiacs", 1}
Number of different words: 25670
Total number of words: 202651
Most frequent word: the 5437
Number of singletons: 14919 (58%)
One way to test your program is to use the Linux diff
command to compare your
output to this expected output:
> ./a5_main < tiny_shakespeare.txt >out
❯ diff out tiny_shakespeare_out
If diff
prints nothing, then the two files are identical. Otherwise, it prints
each pair of different lines.
When you're done, submit just your Wordlist.h on Canvas. Don't submit anything else. A copy of Wordlist_base.h will be in the same folder as your Wordlist.h when it's compiled.
The marker will use their own a5_main.cpp
that will include your
Wordlist.h and will test the methods in it using their own test
cases. They will compile your code on Ubuntu Linux using makefile,
which runs this command:
> make a5_main
g++ -std=c++17 -Wall -Wextra -Werror -Wfatal-errors -Wno-sign-compare -Wnon-virtual-dtor -g a5_main.cpp -o a5_main
This is a very strict compilation command! No warnings or errors are allowed! Make sure that your code compiles with this command before submitting it.
The marker will test the correctness of your code on at least one text file you have not seen before, and they will also test individual method calls using test functions and inputs you have not seen.
Your program will also be run with valgrind
to check for memory leaks, and
other memory errors, e.g.:
> valgrind ./a5_main
==13731== Memcheck, a memory error detector
==13731== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==13731== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==13731== Command: ./a5_main
==13731==
1. {"This", 1}
2. {"a", 2}
3. {"is", 2}
4. {"or", 1}
5. {"test", 1}
6. {"test?", 1}
7. {"this", 1}
Number of different words: 7
Total number of words: 9
Most frequent word: a 2
Number of singletons: 5 (71%)
==13731==
==13731== HEAP SUMMARY:
==13731== in use at exit: 0 bytes in 0 blocks
==13731== total heap usage: 10 allocs, 10 frees, 78,160 bytes allocated
==13731==
==13731== All heap blocks were freed -- no leaks are possible
==13731==
==13731== For lists of detected and suppressed errors, rerun with: -s
==13731== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
valgrind
should report "no leaks are possible", and should not print any
other errors.
Be sure to test your program, and run it with valgrind
, before submitting it.
- 2 marks for each of the 8 virtual methods in
Wordlist_base
that is implemented correctly.is_sorted
is not included in the marking because it always returnstrue
. - 2 marks for a default constructor that creates an empty
Wordlist
object. - 2 marks for a constructor that takes the name of a file as input, and adds
all the words in that file to the
Wordlist
. Read the words from the file using C++'s standard<<
operator (the first example at the top of this assignment shows how this constructor should work, and what its output ought to be).
- All code is sensibly and consistently indented, and all lines are 100 characters in length, or less.
- Whitespace is used to group related pieces of a code to make it easier for humans to read. All whitespace has a purpose.
- Variable and function names are self-descriptive.
- Appropriate features of C++ are used, as discussed in class and in the notes. Note If you use a feature that we haven't discussed in class, you must explain it in a comment, even if you think it's obvious.
- Comments are used when needed to explain code whose purpose is not obvious from the code itself. There should be no commented-out code from previous versions.
- at least -5 marks for any memory leaks, or other errors, reported by
valgrind
. - at least -1 mark if your file has an incorrect name, or you submit it in the incorrect format, etc.
- up to -3 marks if you do not include your full name, email, and SFU ID in the header of your file.
- a score of 0 if you do any of the following:
- change
Node
in ways that are not allowed, if you modify anything inWordlist_base
, or if you use a vector, array, or any other data structure other than an AVL tree; - don't include the "statement of originality, or its modified in any way;
- submit a "wrong" non-working file, and then after the due date submit the "right" file. If you can provide evidence that you finished the assignment on time, then it may be marked.
- change
You can think of the AVL tree as storing the words in alphabetical order. So,
the tie-breaking rule for the most_frequent()
is to return the word that comes
first alphabetically.
Also, the is_sorted()
method should return true
if the AVL tree is a BST
(binary search tree), and false
otherwise. In a working version of this
assignment it should always return true
: if ever your AVL tree is not a BST,
then you have a serious problem!