DataFaker is a tool for generating fake data based on a schema. It can dump the results to CSV, SQL, or JSON files. It can also load the data into any database supported by SQLAlchemy.
To install, clone the project to your local machine and install it like so:
python3 -m pip install git+ssh://[email protected]:tools/data-faker.git
or
python3 -m pip install git+https://migym-gitlab.svc.ac2sp.army.mil/migym/webapp.git
or
python3 setup.py install
Usage: datafaker [OPTIONS]
Options:
-h, --host TEXT Hostname of machine [default: localhost]
-t, --port INTEGER Port number to connect on [default: 5432]
-u, --username TEXT Username to connect with
-p, --password TEXT Password to connect with
-d, --database TEXT Database name
-w, --write-db
--truncate / --append Provide --truncate to truncate tables before
populating. Tables will be truncated in the
reverse order of their appearance in the
schema.
-s, --schema TEXT Path to schema file [default: schema.yml;
required]
-o, --output [json|csv|sql|none]
Output options: json=output JSON to
output.json, csv=output to files named after
docs, sql=output to output.sql
--help Show this message and exit.
Using -w
, --write-db
or --truncate
requires you supply database connection parameters: host, port, username, and password.
DataFaker generates data based on a schema you provide in the form of a YAML file. By default, DataFaker looks for schema.yml
in the directory from which you execute it.
The following properties can be specified at the top-level of the YAML file:
Property | Type | Description |
---|---|---|
_seed | integer | Seeds random data generators. If provided, DataFaker will produce the same output for all runs (except for dates). |
_pre_sql | list[string] | SQL commands to be executed before database tables are truncated. |
_after_truncate_sql | list[string] | SQL commands to be executed after database tables are truncated but before data is loaded. |
_post_sql | list[string] | SQL commands to be executed after all data has been loaded. |
The schema file is composed of documents. Each document represents a data entity with related fields. When loading data to a database, each document maps directly to a database table.
Here is a sample schema.yml
file with a _seed
metadata property and 2 documents. DataFaker will create 5 random entries for the users
document. Then, because of the _from_doc
metadata property, DataFaker will create 1 users_roles
entry for each users
entry.
_from_doc
essentially performs a JOIN to the users
document. The from_field('id')
function uses the users.id
property from the JOIN to populate the users_roles.user_id
property.
_seed: 919
users:
_count: 5
id: sequence()
first_name: fake.first_name()
last_name: fake.last_name()
users_roles:
_from_doc:
doc: users
user_id: from_field('id')
role_name: one_of(['ADMIN', 'TEAM_LEADER', 'STAFF'])
Documents can be either dynamic or static. The documents above are dynamic because they are functionally generated.
Static documents contain a list of entries instead of a pattern of properties. Therefore, DataFaker will create each entry you provide and no more.
Here is an example of a static document. This will generate exactly 3 entries. Each entry will be created uniquely according to the specified property values. Static documents are useful when you need to guarantee certain values are used for each entry.
app_settings:
- id: sequence()
name: string('show_welcome')
value: boolean(True)
- id: sequence()
name: string('require_otp')
value: boolean(True)
- id: sequence()
name: string('min_password_length')
value: integer(12)
Documents can have the following metadata properties:
Property | Type | Description |
---|---|---|
_count |
integer | Specifies how many entries will be generated for the document. |
_from_doc |
dict | Metadata that describes how to use a previous document as the source for the current document. |
_from_sql |
dict | Metadata that describes how to use a SQL query as the source for the current document. |
_many_to_many |
dict | Metadata that describes how to use the cartesian product of two documents, two SQL queries, or a document and a SQL query as the source for the current document. |
_empty |
boolean | A flag indicating no entries should be generated for the document. This is only useful if you want any data in the document's corresponding SQL table to be truncated while not inserting any new data. You may set it to either True or False . The effect is the same, either way. |
There are three possible joins:
_from_doc
: join to a previous document_from_sql
: join to the results of a SQL query_many_to_many
: join to the results of a cartesian product between two documents, two SQL queries, or a document and a SQL query
Dynamic documents can be joined to other documents so that the child document creates one entry for each entry created by the parent document.
_from_doc
may contain the following fields:
Field | Type | Description |
---|---|---|
doc |
string | Name of the previous document to use as the source for the current document. |
limit |
integer | A maximum number of records that may be generated from the document specified by _from_doc.doc |
filter |
expression | Python expression that evaluates to a boolean value. If True , a record is emitted. If False , no record is emitted. |
multiply |
integer | A multiplier x that causes x number of records to be produced for every 1 record in the source document. |
In this example, three users_roles
entries are created using the first three users
entries where active
is true as the sources. The users.id
property is used to populate the users_roles.user_id
property.
_seed: 919
users:
_count: 5
id: sequence()
first_name: fake.first_name()
last_name: fake.last_name()
active: fake.boolean()
users_roles:
_from_doc:
doc: users
limit: 3
filter: active == True
user_id: from_field('id')
role_name: one_of(['ADMIN', 'TEAM_LEADER', 'STAFF'])
Dynamic documents can also be joined to pre-existing SQL data. Because changes are not applied to databases until after the all data has been generated, you cannot join to SQL data being created by DataFaker.
_from_sql
may contain the following fields:
Field | Type | Description |
---|---|---|
sql |
string | A query used to pull one or more columns of data that will be stored as a Python dict . Each field will be made accessible via from_field('column_name') . |
limit |
integer | A maximum number of records that may be generated from the result set of the SQL query. |
filter |
expression | Python expression that evaluates to a boolean value. If True , a record is emitted. If False , no record is emitted |
multiply |
integer | A multiplier x that causes x number of records to be produced for every 1 record in the result set of the SQL query. |
The example below joins course_registrations
to pre-existing students
records. Three course_registrations
entries will be created for every row in the students
table up to a maximum of 50 entries. The student_id
field will be populated by the students.id
column.
course_registrations:
_from_sql:
sql: SELECT id FROM students
limit: 50
multiply: 3
student_id: from_field('id')
course_id: one_of_sql('SELECT id FROM courses')
registration_date: (now() - days(random_int(0, 15))).isoformat()
There are times where you need to use the Cartesian product of two documents, two SQL queries, or a document and a SQL query. A good use case for this feature is populating a join table in a database. _many_to_many
makes this possible.
_many_to_many
may contain the following fields:
Field | Type | Description |
---|---|---|
left_doc |
string | Name of the document used on the left side of the join. |
right_doc |
string | Name of the document used on the right side of the join. |
left_sql |
string | SQL query whose result set will be used on the left side of the join. |
right_sql |
string | SQL query whose result set will be used on the right side of the join. |
left_filter |
expression | Python expression that evaluates to a boolean value. You may reference the properties/column names from the left side of the join. If True and right_filter is True , a record is emitted. If False , no record is emitted. |
right_filter |
expression | Python expression that evaluates to a boolean value. You may reference the properties/column names from the right side of the join. If True and left_filter is True , a record is emitted. If False , no record is emitted. |
limit |
integer | A maximum number of records that may be generated from the Cartesian product of this join. |
Properties of a _many_to_many
join MUST preface from_field('...')
field names with either left.
or right.
to specify which side of the join the field is found in.
The example below joins classes
to students
and uses their Cartesian product as the source for registrations
. Up to 1000 entries will be created. Only students
that are enrolled
will be used thanks to the right_filter
expression.
registrations:
_many_to_many:
limit: 1000
left_doc: classes
right_doc: students
right_filter: enrolled == True
user_id: from_field('right.id')
class_id: from_field('left.id')
registration_date: one_of([ (fromisoformat(from_field('left.start_date')) - days(1)).isoformat(), None ])
approved: fake.boolean()
Non-metadata/non-join properties must be valid Python expressions. They may include any standard Python syntax. The following object references and functions are made available in schemas.
The Python Faker library is made available via the fake
object reference. This object reference is limited to the en-US
locale.
Here are some examples of using it in a schema:
users:
_count: 5
first_name: fake.first_name()
last_name: fake.last_name()
home_state: fake.state_abbr()
opt_in_email: fake.boolean()
Below is the comprehensive list of functions available along with examples.
Signature | Description | Example |
---|---|---|
sequence(start: int = 0, step: int = 1) -> int |
Produces a sequence of consecutive integers starting at 0 for the entries within a document. | sequence(10, 10) will produce 10, 20, 30, ... |
first_name() |
Generates a random first name from one of these locales: 'de-DE', 'en-IE', 'es-MX', 'en-US', 'en-GB', 'ga-IE', 'es-ES', 'fil-PH', 'fr-FR', 'it-IT', 'pt-PT', 'nl-BE', 'nl-NL' | first_name() |
last_name() |
Generates a random last name from one of these locales: 'de-DE', 'en-IE', 'es-MX', 'en-US', 'en-GB', 'ga-IE', 'es-ES', 'fil-PH', 'fr-FR', 'it-IT', 'pt-PT', 'nl-BE', 'nl-NL' | last_name() |
random_char() |
Produces a random uppercase character from the English language. | random_char() |
random_int(min: int = 0, max: int = 10) -> int |
Produces a random integer bounded inclusively by min and max . |
random_int(5, 10) could produce 5, 6, 7, 8, 9, or 10. |
digits(length: int) -> str |
Produces a string of randomly generated integers. The string will be left-padded by zeroes. | digits(5) could produce 00359 . |
one_of_doc(key: str) -> Any |
Randomly selects a property value from the entries of a previous document. The list of possible values is cached for performance. | one_of_doc('users.id') would randomly select a value from the id field of the users document. |
one_of_sql(query: str) -> Any |
Randomly selects a value from the results of the provided SQL query. Results are cached for performance. | one_of_sql('SELECT id FROM users') would randomly select a value from the id column of the users table. |
one_of(options: list[Any]) -> Any |
Randomly selects a value from the provided list of values. | one_of(['dogs', 'cats', 'birds', 'rabbits']) |
sql(query: str) -> Any |
Immediately executes the SQL for every record emitted. The first column returned will be used. If performance is a concern, avoid using this function. | sql('SELECT nextval('user_id_sequence')') |
field(name: str) -> Any |
Returns the value of another field in the same document entry. | field('address') |
now() -> datetime |
Returns the current system time as a datetime object. |
now() |
milliseconds(delta: int) |
Creates a time delta for doing datetime math. |
(now() - milliseconds(500)).isoformat() |
seconds(delta: int) |
Creates a time delta for doing datetime math. |
(now() - seconds(10)).isoformat() |
minutes(delta: int) |
Creates a time delta for doing datetime math. |
(now() - minutes(30)).isoformat() |
hours(delta: int) |
Creates a time delta for doing datetime math. |
(now() - hours(12)).isoformat() |
days(delta: int) |
Creates a time delta for doing datetime math. |
(now() - days(1)).isoformat() |
integer(value: Any) |
Casts value to the int type. |
integer('35') |
floating(value: Any) |
Casts value to the float type. |
integer('3.14') |
boolean(value: Any) |
Casts value to the bool type. |
boolean('') == False boolean('some_value') == True |
string(value: Any) |
Casts value to the str type. |
string(1.234) |
upper(value: str) |
Returns value in uppercase. |
upper('abcd') == 'ABCD' |
lower(value: str) |
Returns value in lowercase. |
lower('ABCD') == 'abcd' |
raw(value: str) |
Used to pass raw SQL directly to the INSERT queries produced when loading a database. |
raw('CURRENT_TIMESTAMP') |
The YAML parser used by this tool strips surrounding quotes from all YAML properties. Thus, if you want to set a property to a literal string, you will need to use the string(value: Any)
function. Otherwise, Python will try to evaluate it as an expression rather than a literal string.