Requirements
This document sets out the requirements for RandomDataset.
The purpose of this document is to describe the function of RandomDataset and its intended uses. It is meant for both developer and user audiences. The goal of RandomDataset is to provide a utility for generating tabular databases with randomised contents, suitable for testing database software and for demonstration purposes. The utility can be used programmatically through its API as well as on the command line with the included utility program.
Databases are composed of multiple datasets. Each dataset is a table of values, with rows representing individual data items and columns (fields) defining which data elements each item has. Databases can be stored in a variety of formats, but the simplest are text-based files such as comma-separated values (CSV). Database software often needs data for testing purposes, but real-world data shouldn't be used due to privacy and data protection concerns. A method for generating randomised datasets would be an effective and flexible tool to aid testing these systems.
RandomDataset generates databases by reading a schema file that describes the datasets and their fields, and producing output data in the selected formats. It is both a command line tool that produces file outputs from a schema file input, and a library that can be used programmatically to build a database and produce output from it. Field values are generated in one of three ways: randomly, such as simple numbers or random strings; by random selection from sets of stored data items, such as personal names; or as concepts such as ID numbers, which start at a value and increment as they are requested to produce sequential results. Fields can also be linked so that values appearing in one dataset are selected to fill fields in another. This is useful for linking instances in one dataset to the IDs of instances in another.
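The field kinds described above (sequential IDs, selection from stored items, and linked fields) can be illustrated with a minimal sketch. This is not the library's actual API: the class names `UIDField`, `ChoiceField`, and `LinkedField`, and their `generate` methods, are hypothetical and exist only for illustration.

```python
import random
from itertools import count


class UIDField:
    """Hypothetical sequential ID field: starts at a value and increments per request."""

    def __init__(self, start=0):
        self._counter = count(start)
        self.issued = []  # remembered so fields in other datasets can link to them

    def generate(self):
        value = next(self._counter)
        self.issued.append(value)
        return value


class ChoiceField:
    """Hypothetical field that selects randomly from a stored set of items."""

    def __init__(self, items, rng=None):
        self.items = list(items)
        self.rng = rng or random.Random()

    def generate(self):
        return self.rng.choice(self.items)


class LinkedField:
    """Hypothetical field filled with values already issued by a field in another dataset."""

    def __init__(self, source, rng=None):
        self.source = source
        self.rng = rng or random.Random()

    def generate(self):
        return self.rng.choice(self.source.issued)


# Three customers, then five payments whose customer ID refers to an existing customer.
customer_id = UIDField()
first_name = ChoiceField(["Alex", "Sam", "Jo"])  # placeholder names, not real data
customers = [(customer_id.generate(), first_name.generate()) for _ in range(3)]

payment_id = UIDField(start=100)
payment_customer = LinkedField(customer_id)
payments = [(payment_id.generate(), payment_customer.generate()) for _ in range(5)]
```

Remembering every issued ID is only reasonable for small datasets. For purely sequential IDs, storing the first and last issued values would be enough to pick a valid link.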
RandomDataset must have a command line interface for generating data from a user-provided schema file. This schema, defined in YAML format, describes what datasets are present, their fields, and what format to generate. The following is an example schema:
- typename: randomdataset.generators.CSVGenerator
  num_lines: 10
  dataset:
    name: customers
    typename: randomdataset.Dataset
    fields:
    - name: id
      typename: randomdataset.UIDFieldGen
    - name: FirstName
      typename: randomdataset.StrFieldGen
      lmin: 6
      lmax: 14
    - name: LastName
      typename: randomdataset.StrFieldGen
      lmin: 6
      lmax: 14
This will generate a dataset called "customers" with three fields: id, FirstName, LastName. The command utility is invoked as follows, passing in this file and the output directory as arguments:
generate_dataset paymentschema.yaml .
This will produce ./customers.csv, which will contain comma-separated values like the following:
id,FirstName,LastName
0,"QDFFgv4XBd5VW","O1Odro"
1,"Gp4mYq","82IPIChjBALg"
2,"LR7KVudB","HcAPBwM"
3,"6FfWGEYS0Q","5NbspSBJk"
4,"si1Tj0xSBB2","eChYKAaW5aa8R"
5,"DYP6OMerUUFOR","pYNXUTNLqdrv"
6,"ltfnhTgrJF","2Rctye"
7,"1tAoaDl57Lo5","xMkVKt6O"
8,"1yJImoqiwf","IJICD8W6B8k"
9,"XkYgS7","8owHyjR"
The following is a list of individual features and elements the system should have. Development progress is measured by how many of these features are accepted as correctly and fully implemented.
The program generate_dataset must be defined in the library and appear as a program in the user's environment when the library is installed. This should use middleware libraries to simplify the process of defining the program.
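As a rough sketch of such a program's argument handling, the following uses the standard library's `argparse` as a stand-in, since this document does not name a specific middleware library. The function names `build_parser` and `main` are assumptions. Making `generate_dataset` appear in the user's environment is typically done by declaring a console-script entry point in the package metadata that points at a function like `main`.

```python
import argparse
from pathlib import Path


def build_parser():
    """Build a parser for the two positional arguments shown in the example invocation."""
    parser = argparse.ArgumentParser(
        prog="generate_dataset",
        description="Generate randomised datasets from a YAML schema file.",
    )
    parser.add_argument("schema", type=Path, help="path to the YAML schema file")
    parser.add_argument("output_dir", type=Path, help="directory to write output files to")
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    # A real implementation would load the schema and run its generators here.
    return args


# Equivalent to running: generate_dataset paymentschema.yaml .
args = main(["paymentschema.yaml", "."])
```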
Simple field definitions must be provided to randomly generate fields with integer, float, string, boolean, and other simple values.
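A minimal sketch of what such simple field generators might look like is given below. The function names are hypothetical. The `lmin` and `lmax` parameters borrow their names from the example schema, and treating both bounds as inclusive is an assumption, since this document does not define them.

```python
import random
import string

ALPHANUMERIC = string.ascii_letters + string.digits


def random_int(rng, vmin=0, vmax=100):
    return rng.randint(vmin, vmax)  # inclusive on both ends


def random_float(rng, vmin=0.0, vmax=1.0):
    return rng.uniform(vmin, vmax)


def random_bool(rng):
    return rng.random() < 0.5


def random_str(rng, lmin=6, lmax=14):
    # Assumption: lmin and lmax are both inclusive length bounds.
    length = rng.randint(lmin, lmax)
    return "".join(rng.choices(ALPHANUMERIC, k=length))


rng = random.Random()
row = (random_int(rng), random_float(rng), random_bool(rng), random_str(rng))
```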
A simple set of generators should be defined for producing databases in common and standard formats. A generator for CSV files should be defined which produces datasets in individual .csv files whose format conforms to known standards.
Further generators should be defined for producing data in JSON, YAML, SQLite, and other simple formats.
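One way to stay close to common CSV conventions (RFC 4180 is the usual reference) is to rely on Python's standard `csv` module rather than formatting lines by hand. The sketch below reproduces the quoting style of the sample output, with string values quoted and numbers left bare. The function name `write_csv` is hypothetical, and the quoting choice is inferred from the sample rather than stated as a requirement.

```python
import csv
import io


def write_csv(out, header, rows):
    """Write a header line, then rows with string values quoted and numbers unquoted."""
    csv.writer(out, quoting=csv.QUOTE_MINIMAL).writerow(header)
    data_writer = csv.writer(out, quoting=csv.QUOTE_NONNUMERIC)
    for row in rows:
        data_writer.writerow(row)


buffer = io.StringIO()  # stands in for a file opened with newline=""
write_csv(buffer, ["id", "FirstName", "LastName"], [(0, "QDFFgv4XBd5VW", "O1Odro")])
text = buffer.getvalue()
```

When writing to a real file, it should be opened with `newline=""` so that the `csv` module controls the line endings.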
Generators should be defined to interface with relational database systems through the SQL query interface to create and fill databases. This can be done directly with running database servers or by generating appropriate SQL scripts.
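As an illustration of creating and filling a table through an SQL interface, the sketch below uses the standard library's `sqlite3` module with an in-memory database standing in for a running server. The table layout mirrors the example schema; the column types are assumptions.

```python
import sqlite3

rows = [(0, "QDFFgv4XBd5VW", "O1Odro"), (1, "Gp4mYq", "82IPIChjBALg")]

conn = sqlite3.connect(":memory:")  # in-memory database stands in for a real server
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT)"
)
# Parameterised inserts let the driver handle quoting of generated values.
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
conn.commit()

stored = conn.execute(
    "SELECT id, FirstName, LastName FROM customers ORDER BY id"
).fetchall()
conn.close()
```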
When generating data, all the information for a dataset shouldn't be stored in memory at once but written to the appropriate output as it is created. This allows large databases to be generated without filling up system memory.
All of the field generation routines should be fast and simple so that randomised data can be generated very rapidly. Caching may be required to speed up generation by selecting from previously created items once a certain number have been generated. The focus, however, should always be on lightweight generation mechanisms.
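The streaming requirement above maps naturally onto Python generators: rows are yielded one at a time and written out immediately, so memory use does not grow with the number of lines. The sketch below is only an illustration; the names `iter_rows` and `stream_to` are hypothetical.

```python
import io
import random
import string


def iter_rows(num_lines, rng):
    """Yield one row at a time so only a single row is held in memory."""
    alphabet = string.ascii_letters + string.digits
    for uid in range(num_lines):
        name = "".join(rng.choices(alphabet, k=rng.randint(6, 14)))
        yield uid, name


def stream_to(out, num_lines, rng):
    """Write each row as soon as it is created, then let it be discarded."""
    for uid, name in iter_rows(num_lines, rng):
        out.write(f"{uid},{name}\n")


out = io.StringIO()  # stands in for an open output file
stream_to(out, 1000, random.Random())
line_count = len(out.getvalue().splitlines())
```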
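The caching idea above could be sketched as follows: generate fresh values until a fixed number exist, then select randomly from those already created. The class name `CachedStrField` and its `cache_size` parameter are hypothetical, and this is only one possible reading of the requirement.

```python
import random
import string


class CachedStrField:
    """Hypothetical field: creates fresh strings until the cache is full, then reuses them."""

    ALPHABET = string.ascii_letters + string.digits

    def __init__(self, cache_size=1000, lmin=6, lmax=14, rng=None):
        self.cache_size = cache_size
        self.lmin = lmin
        self.lmax = lmax
        self.rng = rng or random.Random()
        self.cache = []

    def generate(self):
        if len(self.cache) < self.cache_size:
            length = self.rng.randint(self.lmin, self.lmax)
            value = "".join(self.rng.choices(self.ALPHABET, k=length))
            self.cache.append(value)
            return value
        # Cache is full: picking an existing item is cheaper than building a new string.
        return self.rng.choice(self.cache)


field = CachedStrField(cache_size=5)
values = [field.generate() for _ in range(50)]
```

The trade-off is that values repeat once the cache is full, so the cache size needs to be large enough for the repetition not to matter in the intended tests.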
No private data or otherwise identifiable information should be used when generating datasets.
True randomisation of fields should be used to ensure results do not appear to have bias or patterns. The aim is to produce data without regularity or pattern, which is better suited to a test environment, where incorrect code could otherwise appear to function only because the test data follows a particular pattern. This should also encompass selecting words randomly from dictionaries to ensure no apparent meaning is accidentally introduced into randomised text.