Design
This page describes the design of the RandomDataset library.
The library is composed of the following main modules:
This contains the entry point for the command line utility called generate_dataset. This routine uses the click library to expose it as a program which parses command line arguments and can produce helpful information with the --help flag.
There is also a test routine, print_csv_test, which creates a simple schema, generates data using it, then prints the results to stdout.
The main application loop, as implemented in generate_dataset, reads the specified schema file, which is expected to yield a list of generators. Each generator is expected to produce one dataset. These objects are visited in order, and each is used to write to the specified destination path. Depending on the generator, this either fills in a single database file or produces multiple files with that destination as a prefix.
[Figure: the main application loop]
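The loop described above can be sketched roughly as follows. This is only an illustration under stated assumptions: StubGenerator, its write method, and run_generators are hypothetical stand-ins, not the library's actual classes or signatures.

```python
import os
import tempfile


class StubGenerator:
    """Hypothetical stand-in for a generator object returned by the schema parser."""

    def __init__(self, name, rows):
        self.name = name
        self.rows = rows

    def write(self, dest_prefix):
        # One file per dataset, named from the shared destination prefix.
        path = f"{dest_prefix}{self.name}.csv"
        with open(path, "w", newline="") as f:
            for row in self.rows:
                f.write(",".join(str(v) for v in row) + "\n")
        return path


def run_generators(generators, dest_prefix):
    """Visit each generator in order and let it write to the destination."""
    return [gen.write(dest_prefix) for gen in generators]


with tempfile.TemporaryDirectory() as tmp:
    gens = [StubGenerator("people", [(1, "Ann")]), StubGenerator("orders", [(7, 1)])]
    written = run_generators(gens, os.path.join(tmp, "demo_"))
    names = [os.path.basename(p) for p in written]
    contents = []
    for p in written:
        with open(p) as f:
            contents.append(f.read())
```

Here the destination acts purely as a filename prefix, which corresponds to the multiple-file case; a generator targeting a single database file would instead open that one path and add its dataset to it.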
This contains the routine parse_schema, which reads a YAML schema file and instantiates the list of objects it specifies. The schema file is expected to define a list of DataGenerator objects, which become the routine's return value. PyYAML is used to parse the schema file.
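As a rough sketch of the instantiation step only, the following assumes the YAML has already been parsed into plain Python lists and dicts. The typename key, the REGISTRY table, and the Demo classes are all hypothetical and do not reflect the library's real schema format.

```python
class DemoField:
    def __init__(self, name):
        self.name = name


class DemoGenerator:
    def __init__(self, name, fields):
        self.name = name
        self.fields = fields


# Hypothetical registry mapping schema type names to classes.
REGISTRY = {"DemoField": DemoField, "DemoGenerator": DemoGenerator}


def instantiate(node):
    """Recursively turn parsed schema nodes into objects."""
    if isinstance(node, list):
        return [instantiate(item) for item in node]
    if isinstance(node, dict) and "typename" in node:
        # Build constructor arguments from every key except the type marker.
        kwargs = {k: instantiate(v) for k, v in node.items() if k != "typename"}
        return REGISTRY[node["typename"]](**kwargs)
    return node  # plain scalars pass through unchanged


parsed = [
    {
        "typename": "DemoGenerator",
        "name": "people",
        "fields": [
            {"typename": "DemoField", "name": "id"},
            {"typename": "DemoField", "name": "age"},
        ],
    }
]
generators = instantiate(parsed)
```

The result is a list of generator objects, each already holding its instantiated fields, which matches the shape of return value the routine is described as producing.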
Fields represent the columns of datasets: the Dataset object queries them to generate one or more values. What sort of data is generated depends entirely on which fields are selected in the schema file.
The base FieldGen class is inherited by specialised types that produce specific sorts of data. For example, IntFieldGen generates random integers, and AlphaNameGen produces a first or last name chosen at random from an internal list.
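A minimal sketch of this subclassing pattern is shown below. SketchFieldGen and SketchNameGen are hypothetical names, the constructor signature is simplified, and the standard library's random.Random is swapped in for the library's random state object.

```python
import random


class SketchFieldGen:
    """Hypothetical base class: holds a column name and a random source."""

    def __init__(self, name, seed=None):
        self.name = name
        self.R = random.Random(seed)  # stand-in for the library's random state

    def generate(self, count):
        raise NotImplementedError


class SketchNameGen(SketchFieldGen):
    """Picks names at random from a small internal list."""

    NAMES = ("Alice", "Bob", "Carol", "Dave")

    def generate(self, count):
        return [self.R.choice(self.NAMES) for _ in range(count)]


first = SketchNameGen("first_name", seed=42).generate(5)
again = SketchNameGen("first_name", seed=42).generate(5)
```

Because each field owns its random source, two instances seeded identically produce the same column, which keeps generated datasets reproducible.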
[Figure: class diagram for FieldGen and a few of its subclasses]
The design intent with these classes is to keep data generation decoupled from dataset representation and storage. FieldGen instances mostly do not depend on which Dataset object stores them or on which other fields are present; however, they can communicate through a simple shared state mechanism provided by the Dataset class if sharing is needed. Most field objects can ignore this mechanism and be defined simply in terms of random state generators.
For example, the IntFieldGen class generates integers randomly within a given range using the RandomState object self.R:
class IntFieldGen(FieldGen):
    def __init__(
        self, name: str, vmin: int = 0, vmax: int = 100, rand_state: OptRandStateType = None, shared_state_name=None
    ):
        super().__init__(name, FieldTypes.INTEGER, rand_state, shared_state_name)
        self.vmin: int = vmin
        self.vmax: int = vmax

    def generate(self, shape: OptShapeType = None):
        # self.R is the RandomState held by the base class; if it is NumPy's
        # RandomState, randint draws from [vmin, vmax), so vmax is exclusive.
        return self.R.randint(self.vmin, self.vmax, shape)
The Dataset class represents a set of fields (columns) and provides methods for accessing data by row or column. Generators use datasets as the source of data to write to a destination. If fields need to share data amongst themselves when generating, as linked fields do, sharing is done through the dataset's shared storage mechanism.
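The row and column access and the shared storage idea can be sketched as below. Everything here is an assumption for illustration: SketchDataset, its shared dict, get_column, get_row, and the two field functions are hypothetical and are not the library's API.

```python
import random


class SketchDataset:
    """Hypothetical dataset: named column generators plus a shared-state dict."""

    def __init__(self, fields, n_rows):
        self.shared = {}  # shared storage that linked fields read and write
        # Generate columns in order so later fields can see earlier results.
        self.columns = {name: gen(self, n_rows) for name, gen in fields}

    def get_column(self, name):
        return self.columns[name]

    def get_row(self, index):
        return tuple(col[index] for col in self.columns.values())


def make_ids(dataset, n):
    ids = list(range(1, n + 1))
    dataset.shared["ids"] = ids  # publish for linked fields
    return ids


def make_refs(dataset, n):
    # A linked field: draws only from the values the id field published.
    rng = random.Random(0)
    return [rng.choice(dataset.shared["ids"]) for _ in range(n)]


ds = SketchDataset([("id", make_ids), ("friend_id", make_refs)], n_rows=4)
```

Only the linked field touches the shared dict; an independent field would ignore it entirely, which is the decoupling the design aims for.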
Generators implement the data generation component of the library through subclasses of the DataGenerator class. Methods provided by this class handle writing data to a stream or file, but rely on subclasses to implement the write_stream method, which defines what form of data is written.
One subclass, CSVGenerator, is provided, which generates comma-separated tables of data. Each dataset is written to its own file.
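The split between a base class that handles the destination and a subclass that defines the output format might look like the sketch below. SketchGenerator, SketchCSVGenerator, write_file, and the constructor arguments are hypothetical; only the write_stream method name comes from the description above, and its real signature may differ.

```python
import csv
import io


class SketchGenerator:
    """Hypothetical base: handles the destination, defers the format to subclasses."""

    def __init__(self, header, rows):
        self.header = header
        self.rows = rows

    def write_stream(self, stream):
        raise NotImplementedError

    def write_file(self, path):
        # File handling lives in the base class; the format does not.
        with open(path, "w", newline="") as f:
            self.write_stream(f)


class SketchCSVGenerator(SketchGenerator):
    def write_stream(self, stream):
        writer = csv.writer(stream, lineterminator="\n")
        writer.writerow(self.header)
        writer.writerows(self.rows)


buffer = io.StringIO()
SketchCSVGenerator(["id", "name"], [(1, "Ann"), (2, "Bob")]).write_stream(buffer)
text = buffer.getvalue()
```

Supporting another output format under this arrangement would mean adding one more subclass that overrides write_stream, with the file and stream handling left untouched.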