Eric Kerfoot edited this page Mar 22, 2023 · 8 revisions

This page describes the design of the RandomDataset library.

Modules Overview

[Figure: modules overview]

The library is composed of the following main modules:

application

This module contains the entry point for the command line utility, generate_dataset. The routine uses the click library to parse command line arguments and to print usage information when given the --help flag.

There is also a test routine print_csv_test which creates a simple schema, generates data using it, then prints the results to stdout.

The main application loop, implemented in generate_dataset, reads the specified schema file, which is expected to yield a list of generators. Each generator is expected to produce one dataset. The generators are visited in order, and each writes to the specified destination path, either filling in a single database file or producing multiple files that use the destination as a prefix. This is illustrated here:

[Figure: main application loop]
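
The loop described above can be sketched roughly as follows. This is a simplified illustration rather than the library's code: `StubGenerator`, its `write_to_target` method, and the file naming scheme are hypothetical stand-ins. Only the overall shape (take the list of generators from the schema, visit each in order, write to a destination prefix) comes from the description above.

```python
import os
import tempfile


class StubGenerator:
    """Hypothetical stand-in for a generator object defined by the schema."""

    def __init__(self, name, lines):
        self.name = name
        self.lines = lines

    def write_to_target(self, dest_prefix):
        # Assumption: each generator writes one file named from the prefix.
        path = f"{dest_prefix}{self.name}.csv"
        with open(path, "w") as f:
            f.write("\n".join(self.lines) + "\n")
        return path


def run_main_loop(generators, dest_prefix):
    """Visit each generator in order and let it write its dataset."""
    written = []
    for gen in generators:
        written.append(gen.write_to_target(dest_prefix))
    return written


def demo():
    # In the real tool this list would come from parsing the schema file.
    gens = [
        StubGenerator("people", ["id,name", "1,Ann"]),
        StubGenerator("orders", ["id,total", "1,9.5"]),
    ]
    with tempfile.TemporaryDirectory() as tmp:
        paths = run_main_loop(gens, os.path.join(tmp, "out_"))
        return [os.path.basename(p) for p in paths]
```

Calling `demo()` writes two files into a temporary directory, one per generator, both sharing the `out_` prefix.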

schemaparser

This module contains the routine parse_schema, which reads a YAML schema file and instantiates the list of objects it specifies. The schema file is expected to define a list of DataGenerator objects, which become the routine's return value. PyYAML is used to parse the schema file.
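
One way the instantiation step might work is sketched below. It assumes the YAML has already been loaded into plain Python lists and dicts, so the example needs no YAML library. The `typename` key and the `instantiate` helper are hypothetical and chosen for illustration; the schema format the library actually accepts may differ.

```python
import importlib


def instantiate(spec):
    """Recursively build objects from a parsed schema fragment.

    Assumption: a dict with a "typename" key names a class by its dotted
    path, and the remaining keys are constructor keyword arguments.
    """
    if isinstance(spec, list):
        return [instantiate(item) for item in spec]
    if isinstance(spec, dict):
        args = {k: instantiate(v) for k, v in spec.items() if k != "typename"}
        if "typename" not in spec:
            return args  # an ordinary mapping, not an object description
        module_name, _, class_name = spec["typename"].rpartition(".")
        cls = getattr(importlib.import_module(module_name), class_name)
        return cls(**args)
    return spec  # scalars pass through unchanged


# The shape of data a YAML loader would return for a two-entry list.
# Standard-library classes are used so the example runs on its own.
parsed = [
    {"typename": "fractions.Fraction", "numerator": 1, "denominator": 2},
    {"typename": "datetime.date", "year": 2023, "month": 3, "day": 22},
]
objects = instantiate(parsed)
```

In the library, the entries would name DataGenerator subclasses rather than standard-library types, and nested entries would describe the dataset and its fields.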

fields

Fields represent the columns of a dataset: the Dataset object queries them to generate one or more values. The kind of data generated depends entirely on which fields are selected in the schema file.

The base FieldGen class is inherited by specialised types that produce specific sorts of data. For example, IntFieldGen generates random integers, and AlphaNameGen produces a first or last name chosen at random from an internal list.

Shown here is the class diagram for FieldGen and a few of its subclasses:

[Figure: class diagram of FieldGen and a few of its subclasses]

The design intent with these classes is to keep data generation decoupled from dataset representation and storage. FieldGen instances mostly do not depend on which Dataset object holds them or on what other fields are present. If sharing is needed, they can communicate through a simple shared state mechanism provided by the Dataset class. Most field objects, however, can ignore this mechanism and be defined simply in terms of a random state generator.

For example, the IntFieldGen class generates integers randomly within a given range using the RandomState object self.R:

class IntFieldGen(FieldGen):
    def __init__(
        self, name: str, vmin: int = 0, vmax: int = 100, rand_state: OptRandStateType = None, shared_state_name=None
    ):
        super().__init__(name, FieldTypes.INTEGER, rand_state, shared_state_name)
        self.vmin: int = vmin
        self.vmax: int = vmax

    def generate(self, shape: OptShapeType = None):
        return self.R.randint(self.vmin, self.vmax, shape)

dataset

The Dataset class represents a set of fields (columns) and provides methods for accessing data by row or column. Generators use datasets as the source of data to write to a destination. If fields need to share data amongst themselves when generating, such as linked fields, sharing is done through the dataset's shared storage mechanism.
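
The idea of columns, row and column access, and shared storage for linked fields can be sketched as below. `MiniDataset`, `id_field`, and `linked_field` are hypothetical names chosen for illustration, and fields are reduced to plain functions; the library's Dataset class will differ in its details.

```python
import random


class MiniDataset:
    """Hypothetical sketch: named columns plus a shared-state dictionary."""

    def __init__(self, name, fields):
        self.name = name
        self.fields = fields  # mapping: column name -> callable(n, shared) -> list
        self.shared = {}      # scratch space that fields may read and write
        self.columns = {}

    def generate(self, n):
        # Fields are queried in order, so later fields can read what
        # earlier fields placed in the shared state.
        self.columns = {col: gen(n, self.shared) for col, gen in self.fields.items()}

    def get_column(self, col):
        return self.columns[col]

    def get_row(self, i):
        return [values[i] for values in self.columns.values()]


def id_field(n, shared):
    ids = list(range(1, n + 1))
    shared["ids"] = ids  # publish the ids for linked fields
    return ids


def linked_field(n, shared):
    rng = random.Random(0)
    # Every generated value refers to an id produced by id_field.
    return [rng.choice(shared["ids"]) for _ in range(n)]


ds = MiniDataset("people", {"id": id_field, "friend_id": linked_field})
ds.generate(4)
```

Here `friend_id` only ever contains values that appear in the `id` column, which is the kind of link the shared storage is meant to support.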

generators

Generators implement the data generation component of the library through subclasses of the DataGenerator class. Methods provided by this class handle writing data to a stream or file, but rely on subclasses to implement the write_stream method, which defines the form in which data is written.

One subclass, CSVGenerator, is provided which generates comma-separated tables of data. Each dataset is written to its own file.

[Figure: generator classes]
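
The split between a base class that handles destinations and a subclass that defines the output format can be sketched as follows. `MiniDataGenerator`, `MiniCSVGenerator`, `write_file`, and `to_string` are hypothetical names for this illustration; only the `write_stream` hook is taken from the description above, and its real signature may differ.

```python
import csv
import io


class MiniDataGenerator:
    """Hypothetical base: handles the destination, defers the format."""

    def __init__(self, header, rows):
        self.header = header
        self.rows = rows

    def write_stream(self, stream):
        raise NotImplementedError  # subclasses decide the output format

    def write_file(self, path):
        with open(path, "w", newline="") as f:
            self.write_stream(f)

    def to_string(self):
        buf = io.StringIO()
        self.write_stream(buf)
        return buf.getvalue()


class MiniCSVGenerator(MiniDataGenerator):
    def write_stream(self, stream):
        # csv.writer quotes any value that contains the delimiter.
        writer = csv.writer(stream, lineterminator="\n")
        writer.writerow(self.header)
        writer.writerows(self.rows)


text = MiniCSVGenerator(["id", "name"], [[1, "Ann"], [2, "Bo, Jr."]]).to_string()
```

Supporting another output format would then mean adding one more subclass with its own `write_stream`, leaving the file and stream handling untouched.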
