
Generating Randomized Sample Data in Python

If you have access to a production data set, it is helpful to generate test data that follows a similar format, in whatever quantity you need. By introspecting the database, we can identify declared constraints. Given sufficient data volume, we can also infer implicit business-process constraints. If desired, we can also find records that are likely to cause problems.

As a thought experiment, I’ve considered how to acquire this information, as well as how to present it as rows using Python generators – producing as little or as much data as required.

Consider a simple query that returns datatype information from Oracle:

select
   column_name,
   data_type,
   data_length,
   data_precision,
   nullable
from all_tab_columns
where owner = 'GARY'
and table_name = 'TEST_TABLE'
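
If you want to drive the rest of the analysis from Python, this metadata can be loaded directly into a dictionary. A minimal sketch, assuming the cx_Oracle driver and made-up connection details:

import cx_Oracle

# Placeholder credentials/DSN - substitute your own connection details
conn = cx_Oracle.connect("gary", "password", "localhost/XE")
cursor = conn.cursor()

cursor.execute("""
  select column_name, data_type, data_length, data_precision, nullable
  from all_tab_columns
  where owner = :owner and table_name = :table_name""",
  owner="GARY", table_name="TEST_TABLE")

# Map each column name to its (type, length, precision, nullable) tuple
columns = {row[0]: row[1:] for row in cursor}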

For this example, I assume we want to generate one or two large tables – the effort scales up quickly once this expands to many tables. While it’s pretty easy to get the column sizes, that alone is not very helpful – typically a UI will enforce additional constraints. If a column can hold a string, you typically also don’t want to generate random strings using every possible Unicode character, as you will get many atypical characters and few of the ones most people actually use.

You can improve on this by counting the number of distinct values in each column, and the distribution of rows across those values. In many situations, you may do well to generate these counts for several columns, then pull random values for each:

select count(distinct a) 
from gary.test_table;

select a, count(*) from gary.test_table 
group by a
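
The output of the second query can be fed straight into a weighted random choice. A small sketch, assuming the rows have already been fetched into a list of (value, count) tuples – the example values are made up:

import random

# Illustrative output of "select a, count(*) from gary.test_table group by a"
value_counts = [("NEW", 5000), ("SHIPPED", 3000), ("RETURNED", 200)]

values = [v for (v, _) in value_counts]
weights = [c for (_, c) in value_counts]

def sample_a():
  # Pick values in proportion to how often they occur in production
  # (random.choices requires Python 3.6+)
  while True:
    yield random.choices(values, weights=weights)[0]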

Since databases typically represent their internal information in tables, we can also query for constraint information. Again, we’re using Oracle, but a similar technique can be used in other systems. Oracle lists constraint types such as “P”, “R”, and “U” (primary key, referential/foreign key, and unique).

SELECT DISTINCT cc.table_name,
                c.constraint_type,
                cc.column_name
FROM all_constraints c
JOIN all_cons_columns cc
  ON (c.owner = cc.owner AND c.constraint_name = cc.constraint_name)
WHERE cc.table_name = 'TEST_TABLE';

Similar to the above, you can look up foreign key constraints. This is particularly important – it provides a second way to find out what the possible distinct values for a column can be:

select c.table_name, cc.column_name
from all_constraints c
join all_cons_columns cc
  on (c.owner = cc.owner and c.constraint_name = cc.constraint_name)
where c.constraint_name in
  (select r_constraint_name
   from all_constraints
   where constraint_type = 'R'
   and table_name = 'TEST_TABLE');
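
Once you know which table a foreign key points at, the parent table itself is the authoritative list of legal values to generate. A sketch, again assuming a cx_Oracle cursor; the parent table and column names here are hypothetical:

def fk_values(cursor, parent_table, parent_column):
  # The distinct values in the parent table are the only legal values
  # for the referencing column (identifiers can't be bound as parameters,
  # so the string interpolation here is for illustration only)
  cursor.execute("select distinct %s from %s" % (parent_column, parent_table))
  return [row[0] for row in cursor]

# e.g. allowed_vendors = fk_values(cursor, "GARY.VENDORS", "VENDOR_ID")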

Up to this point, the database has provided several possible lists of distinct values and the known constraints. This alone is not enough for a complete system – to flesh the idea out entirely, we’re going to have to make some inferences.

Consider that each column’s values may form an open or closed set – the list of vendors a company buys from might be large but relatively fixed, whereas a numerical measurement or dollar figure is likely an open set. Primary keys are a special case of open sets – often they are unique and sequential.
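
Sequential keys are the easiest case to generate, since they never need to be sampled from the data at all – a simple sketch:

import itertools

def pk_gen(start=1):
  # Auto-increment style keys: unique and sequential
  return itertools.count(start)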

We can again run some tests to sort columns into these categories. The following query gathers several simple metrics:

select count(a) c, count(distinct a) d, min(a) e, max(a) f
from test_table

From this we can infer quite a bit: if c = d, the values are completely unique. If f – e + 1 = d, we can also infer that we’re looking at an auto-increment column – in fact, it only needs to be near d to draw the same conclusion, as records may have been deleted or skipped.
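
These rules are easy to encode once the metrics have been fetched. A rough sketch of the inference for a numeric column, with the tolerance for deleted or skipped records as an assumed parameter:

def classify(total, distinct, min_val, max_val, tolerance=0.05):
  # total = count(a), distinct = count(distinct a)
  if total == distinct:
    # A sequential column spans roughly (distinct) values from min to max
    if abs((max_val - min_val + 1) - distinct) <= tolerance * distinct:
      return "auto-increment"
    return "unique"
  return "repeating"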

We can also check whether two columns are tied closely to each other:

select a, b, count(*) c from test_table
group by a, b

select distinct d from (
  select (case when x > y then 1 else 0 end) d
  from test_table
)

If c is always 1, there is a one-to-one mapping between the values, and if it is usually close to 1, we know the values are tightly tied. If, in the second query, d is always 1, we know that one column’s value is always higher than the other’s.

If the values of c vary widely, but the result of this query is small (e.g. a few hundred rows), we could use it to establish a probability distribution, choosing b randomly for a given a.
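
A sketch of that approach: load the (a, b, count) rows into a lookup keyed by a, then pick b with probability proportional to its count – the sample rows are made up:

import random
from collections import defaultdict

# Illustrative output of "select a, b, count(*) c from test_table group by a, b"
rows = [("US", "USD", 900), ("US", "CAD", 10), ("FR", "EUR", 500)]

by_a = defaultdict(list)
for (a, b, c) in rows:
  by_a[a].append((b, c))

def pick_b(a):
  # Choose b for a given a, weighted by how often the pair occurs
  bs, counts = zip(*by_a[a])
  return random.choices(bs, weights=counts)[0]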

There is one last class of data which requires special consideration: rare data. If you wish to generate a very small dataset with many special cases to improve testing, you may wish to do something like this:

select x from (
  select x, count(*) c
  from test_table
  group by x
) vals
cross join (
  select count(*) d
  from test_table
) total
where c / d <= 0.05
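
The same filter is easy to apply in Python if the value counts have already been pulled down – anything under a chosen frequency threshold becomes a candidate edge case:

def rare_values(value_counts, threshold=0.05):
  # value_counts: list of (value, count) tuples for one column
  total = sum(c for (_, c) in value_counts)
  return [v for (v, c) in value_counts if float(c) / total <= threshold]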

If we run the above analysis on many sets of columns, we can then establish a series of generator functions in Python, one per column. We might, for instance, generate data for a three-column table like so:

def row_gen():
  # One generator per column; pull one value from each to build a row
  (c1, c2, c3) = (col1(), col2(), col3())
  while True:
    yield (next(c1), next(c2), next(c3))

Then, each column can be defined by a simple generator function, mirroring the output of the above queries for unique values and value ranges:

def col1():
  for w in ["abcd", "efgh", "ijkl"]:
    yield w

def col2():
  for w in range(100, 1000):
    yield w

The above samples merely replay values from a saved dataset in order – in a real system, what we want is the ability to generate infinite data following some distribution, which the following function demonstrates:

import random

def next_val(cumulative_prob, vals, i):
  # Return the value whose cumulative weight bucket contains i
  for idx in range(0, len(cumulative_prob)):
    if i <= cumulative_prob[idx]:
      return vals[idx]

def col3():
  vals = ['a', 'b', 'c']
  prob = [10, 5, 1]

  # Build the cumulative weights: [10, 15, 16]
  c = 0
  cumulative_prob = []
  for p in prob:
    c = c + p
    cumulative_prob.append(c)

  s = sum(prob)
  while True:
    i = random.randint(1, s)
    yield next_val(cumulative_prob, vals, i)

from collections import Counter

cnt = Counter()
x = col3()
for _ in range(10000):
  cnt[next(x)] += 1

cnt
Out[36]: Counter({'a': 6286, 'b': 3096, 'c': 618})

Thus, we can generate data – depending on the data set in question, you will likely want to combine manual and automated analysis to produce sample data, and you will likely make discoveries along the way, especially where implicit dependencies and rare items are concerned (rare data values are often the cause of UI failures, and occasionally indicate error conditions accrued over time).

In some situations, it may also be helpful to simulate statistical distributions – the above examples tend to assume that you are dealing with large sets of text columns, which clearly works poorly in other scenarios. For numeric columns, this would likely mean fitting several candidate probability distributions to the data and seeing which fits best.
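
A sketch of that idea, assuming scipy/numpy are available and that the column values have already been pulled into a list – the sample values are made up:

import numpy as np
from scipy import stats

# Illustrative values pulled from a numeric column
observed = np.array([10.5, 12.1, 9.8, 11.0, 10.2, 13.4, 10.9, 11.7])

candidates = [stats.norm, stats.lognorm, stats.expon]
fits = []
for dist in candidates:
  params = dist.fit(observed)
  # Kolmogorov-Smirnov statistic as a rough goodness-of-fit measure
  ks_stat, _ = stats.kstest(observed, dist.name, args=params)
  fits.append((ks_stat, dist, params))

_, best, params = min(fits, key=lambda f: f[0])

def numeric_col():
  # Generate values from whichever distribution matched the data best
  while True:
    yield best.rvs(*params)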
