Distributions enable Benerator to generate numbers with desired distribution characteristics or following a certain sequence. A distribution may also be applied to groups of data objects to provide them with certain distribution characteristics.
Distributions come in two flavors:
- Sequences: Algorithms for generating numbers
- Weights: Functions which provide the probability of a certain number
For most common needs there exist predefined sequences and weights.
For special needs, you can define and add your own custom ones.
A Distribution is selected with a distribution attribute and can be parameterized with a min and a max value and a granularity. The granularity is applied in such a way that any generated number is min plus an integer multiple of granularity.
Thus, a configuration
<attribute name="price" type="double" distribution="increment" min="0.25" max="100" granularity="0.25"/>
yields the numbers
0.25, 0.50, 0.75, 1.00, ..., 99.75, 100.00
A Sequence is basically a number generator. It can provide a custom random algorithm, a custom weighted number generator or a unique number generation algorithm.
The Sequences used most often are random, increment and cumulated.
For a complete list of predefined sequences, see the 'Component Reference'; for defining a custom Sequence, see 'Extending Benerator'.
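For illustration, a Sequence is applied through the same distribution attribute as any other distribution; the following sketch uses an assumed attribute name and bounds and lets the generated values cluster around the middle of the range:
<attribute name="amount" type="int" min="1" max="10" distribution="cumulated"/>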
A weight function is basically a mathematical function that tells which weight to apply to which number.
The most frequently used weight functions are GaussianFunction and ExponentialFunction.
For a complete list of predefined weight functions, see the 'Component Reference'; for defining a custom weight function, see 'Extending Benerator'.
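A weight function is configured through the distribution attribute as well. A minimal sketch, assuming GaussianFunction can be referenced by its simple class name in your setup (otherwise use its fully qualified class name) and with a purely illustrative attribute name and bounds:
<attribute name="age" type="int" min="18" max="75" distribution="new GaussianFunction()"/>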
WeightedNumbers is a special component for creating a small set of numbers based on a weighted-number literal, for example 1^70, 3^30 for generating 70% 1 values and 30% 3 values.
This is a very convenient and simple approach for controlling parent-child cardinalities in nested data generation.
Example:
<attribute name="n" type="int" distribution="new WeightedNumbers('1^70,3^30')"/>
When using WeightedNumbers to determine the cardinality of an Entity part which is a container, the container type must be declared. Typical settings are container='array' or, in some cases, container='list':
<part name='y' container='array' countDistribution="new WeightedNumbers('0^70,1^20,2^10')">
<attribute name='z' pattern='AAA'/>
</part>
'Other data' usually comes from a data source and is imported by an <iterate> or <attribute> element:
<attribute name="code" type="string" source="codes.csv"/>
When iterating through data (e.g. imported from a file or database), Benerator's default behavior is to serve each item exactly once and in the order provided. When the end of the data set is reached, Benerator stops.
When setting cyclic="true", Benerator serves the imported data consecutively too, but does not stop when it reaches the end. Instead, it restarts the iteration.
Beware: For SQL queries this means that the query is reissued, so it may return a different result set than the previous invocation.
<attribute name="code" type="string" source="codes.csv" cyclic="true"/>
But that is not really a distribution. We can do better and get probability effects:
When importing data from data sources, you can specify weights. The mechanism differs between importing simple data and importing entities:
When importing primitive data from a CSV file, each value is expected to be in its own row. If a row has more than one column, the content of the second column is interpreted as the weight. If there is no such column, a weight of 1 is assumed. Benerator automatically normalizes over all data objects, so there is no need to care about manual weight normalization. Remember to use a filename that indicates that the file contains weighted data, using a suffix like .wgt.csv or .wgt.xls.
If you, for example, create a CSV file roles.wgt.csv:
customer,7
clerk,2
admin,1
and use it in a configuration like this:
<generate type="user" count="100">
<attribute name="role" source="roles.wgt.csv" />
</generate>
this will create 100 users of which about 70 will have the role customer, 20 clerk and 10 admin.
By default, the semicolon is the delimiter between commands: Benerator splits import commands by this delimiter. The default separator can be overwritten by the property separator, e.g. separator="|":
<generate type="user" count="100">
<attribute name="role" source="roles.wgt.csv" separator="|" />
</generate>
It is also possible to specify the separator for the whole project in your <setup> node as:
<setup defaultSeparator="|">
<generate type="user" count="100">
<attribute name="role" source="roles.wgt.csv" />
</generate>
</setup>
When importing entities, one entity attribute can be chosen to represent the weight by specifying distribution="weighted[attribute-name]".
Remember to indicate that the source file contains entity data by using the correct file suffix, e.g. .ent.csv or .ent.xls.
Example: If you are importing cities and want to weigh them by their population, you can define a CSV file cities.ent.csv:
name,population
New York,8274527
Los Angeles,3834340
San Francisco,764976
and, for example, create addresses with city names weighted by population by specifying:
<generate type="address" count="100" consumer="ConsoleExporter">
<variable name="city_data" source="cities.ent.csv" distribution="weighted[population]"/>
<id name="id" type="long" />
<attribute name="city" script="city_data.name"/>
</generate>
If the imported data does not come with weight information, you can apply a Distribution to control probability:
<attribute name="code" type="string" source="codes.csv" distribution="random"/>
For WeightFunctions, all available data is loaded into RAM and then the Weight Function's number generation feature is used to generate indices of the data items.
Most Sequences implement data distribution in the same way as described above for Weight Functions, but this behavior can be implemented individually for each Sequence.
Attention: Most distributions load all the data to be distributed into RAM. Most Sequences should not be applied to data sets of more than 100,000 elements; a weight function should be restricted to at most 10,000 elements.
'Unlimited' Sequences which are suitable for arbitrarily large data sets are:
- expand
- randomWalk
- repeat
- step
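Such a Sequence is applied like any other distribution. As a sketch, with the file name below serving only as a placeholder for a large data source:
<attribute name="code" type="string" source="large_codes.csv" distribution="expand"/>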
For the complete list of predefined Distributions, see the 'Component Reference'; for programming custom Distributions, see 'Custom Sequences' and 'Custom WeightFunctions'.