Financial Services Example
In this scenario, we want to create a realistic replica of our production database called bank_db
for the purpose of
application testing.
#
Example Databank_db
has two tables:
- The
users
table which has all the information pertaining to our customers.
{ "id": 1, "created_at_date": "2009-03-19", "created_at_time": "01:00:02", "credit_card": "346215176014733", "currency": "GIP", "email": "tracy31@gmail.com", "is_active": false, "last_login_at": "2020-06-22T05:46:41+0000", "num_logins": 19, "password_hash": "eebed079e19dcf5b936e8ca5a648bee38e30bec0212...", "username": "tracy31@gmail.com"}
- The
transactions
table which has transactions referring to the customers in theusers
table:
{ "id": 1, "amount": 5001.70, "currency": "GIP", "timestamp": "2020-05-13T20:48:01+0000", "description": "Walmart Credit Purchase", "user_id": 1}
#
Feeding Data into SynthWe first create a new workspace to import our dataset into Synth:
mkdir synth_workspace && cd synth_workspace
Synth supports importing from JSON files. To create a namespace, copy the JSON blob below to a file outside your workspace and use the import
command:
synth import bank_db/ --from json:/path/to/the/file.json
{ "users": [ { "id": 1, "created_at_date": "2009-03-19", "created_at_time": "01:00:02", "credit_card": "346215176014733", "currency": "GIP", "email": "tracy31@gmail.com", "is_active": false, "last_login_at": "2020-06-22T05:46:41+0000", "num_logins": 19, "password_hash": "eebed079e19dcf5b936e8ca5a648bee38e30bec02129790eabdd7084919d7972", "username": "tracy31@gmail.com" }, { "id": 2, "created_at_date": "1989-08-02", "created_at_time": "14:01:03", "credit_card": "347805284578857", "currency": "RON", "email": "lunamark@green.com", "is_active": false, "last_login_at": "2020-06-09T12:25:34+0000", "num_logins": 96, "password_hash": "5fd241333135a2852c124a28401460d6c7bbd1851a256becddd95f9141e8a74b", "username": "lunamark@green.com" }, { "id": 3, "created_at_date": "1993-04-12", "created_at_time": "18:28:28", "credit_card": "347652531949402", "currency": "EUR", "email": "nicholsonjoseph@whitehead.org", "is_active": false, "last_login_at": "2020-07-24T18:30:45+0000", "num_logins": 56, "password_hash": "2a3e0851d34b71088fb77c20b840689169b89acef2e74ae687fc1b99387e100b", "username": "nick_j" }, { "id": 4, "created_at_date": "2001-09-11", "created_at_time": "06:25:41", "credit_card": "341034072996090", "currency": "KHR", "email": "fullerangela@williams.biz", "is_active": false, "last_login_at": "2020-03-02T08:31:01+0000", "num_logins": 39, "password_hash": "10ab7dcd6b7ac468bd442febf904f4025f38017f847774f91ca2bdbdfe029ce1", "username": "fullerangela" } ], "transactions": [ { "id": 1, "amount": 5001.7, "currency": "GIP", "timestamp": "2020-05-13T20:48:01+0000", "description": "Walmart Credit Purchase", "user_id": 1 }, { "id": 2, "amount": 274.4, "currency": "GIP", "timestamp": "2020-04-07T20:19:23+0000", "description": "Amazon - Kindle reading tablet", "user_id": 1 }, { "id": 3, "amount": 6199.9, "currency": "KHR", "timestamp": "2020-02-03T11:24:36+0000", "description": "Apple store subscription", "user_id": 2 }, { "id": 4, "amount": 3747.6, "currency": "KHR", "timestamp": "2020-04-02T02:37:22+0000", "description": "Google store rental", "user_id": 2 }, { "id": 5, "amount": 4358.4, "currency": "KHR", "timestamp": "2020-03-20T04:12:11+0000", "description": "Apple accessories", "user_id": 2 }, { "id": 6, "amount": 6597.5, "currency": "EUR", "timestamp": "2020-09-16T07:26:02+0000", "description": "Amazon Kindle subscription", "user_id": 3 } ]}
At this stage, we can run the tree
command to see how the synth import
sub-command updated our workspace.
$ tree -a.โโโ bank_db โโโ transactions.json โโโ users.json
The directory bank_db
(remember from Core Concepts a subdirectory in a workspace represents a
namespace) was created automatically as well as the two collections - transactions
and users
.
We can now generate data from our namespace using the synth generate
sub-command. (We are piping this
into jq
for the auto-formatting but this is optional.)
synth generate bank_db | jq{ "transactions": [ { "amount": 5336.4, "currency": "k1BFV", "description": "z86x3QSvEY9pCVdTsr3zDyuf04ccmf5", "id": 1, "timestamp": "kfAuUrNEb8dgGT5", "user_id": 2 } ], "users": [ { "created_at_date": "Mg", "created_at_time": "Me2kBYEDb", "credit_card": "LnafugyfWMLf8Gns", "currency": "8e9u8h5KYg", "email": "yLXGebLNS5ZmZWifCqv20", "id": 3, "is_active": false, "last_login_at": "gH7zB0nkU0ScpmOhWr3vm", "num_logins": 78, "password_hash": "OJUsm0b4d", "username": "3ouuDKRsR7a" }, ... ]}
Notice, that the data generated has the right schema, but looks kind of useless. For example the timestamp
field is
not even a timestamp, it's just a random string.
The semantic meaning of the data has not been perfectly captured by the Synth inference engine.
As synth
evolves, inference will get better - but for now, we need to tweak the schema.
#
Tweaking the SchemaTo modify the schema, open the workspace in your favourite editor. Let's take a look at bank_db/transactions.json
first.
{ "type": "array", "length": { "type": "number", "subtype": "u64", "range": { "low": 1, "high": 6, "step": 1 } }, "content": { "type": "object", "amount": { "optional": false, "type": "number", "subtype": "f64", "range": { "low": 274.4, "high": 6597.5, "step": 1.0 } }, "id": { "optional": false, "type": "number", "subtype": "u64", "range": { "low": 1, "high": 6, "step": 1 } }, "timestamp": { "optional": false, "type": "string", "pattern": "[a-zA-Z0-9]*" }, "user_id": { "optional": false, "type": "number", "subtype": "u64", "range": { "low": 1, "high": 3, "step": 1 } }, "currency": { "optional": false, "type": "string", "pattern": "[a-zA-Z0-9]*" }, "description": { "type": "string", "pattern": "[a-zA-Z0-9]*" } }}
There is quite a bit going on here, so let's break it down. This file represents a schema for
a collection
. Collections are arrays under the hood and so they
have 2 fields.
1) The content
of an Array. This can be any valid JSON, but since bank_db
originates from a SQL database with column
names and so on, it is a JSON object.
2) The length
of an Array. The length of an Array is actually also a Content node. This gives you flexibility - for
example you can make the length of an array be a number::range
For more information on how to compose schemas, see the Schema page.
#
Tweaking Individual FieldsReading through the schema, we can see that Synth inferred id
as being a number::range
.
What we actually need, is for id
to be a monotonically increasing number::id
type starting
at 0
.
{ "type": "number", "subtype": "u64", "id": { "start_at": 0 }}
The amount
field is almost right. Synth inferred the right low
and high
bounds, but, the step should be 0.01
as
we are dealing with currencies. So let's replace the amount
field:
{ "type": "number", "subtype": "f64", "range": { "low": 274.4, "high": 6597.5, "step": 0.01 }}
Next, we see Synth detected the timestamp
field as a string following a random pattern. Consulting the documentation
it should be a date_time.
{ "type": "date_time", "format": "%Y-%m-%dT%H:%M:%S%z", "begin": "2000-01-01T00:00:00+0000", "end": "2020-01-01T00:00:00+0000"}
The user_id
field should point to a valid entry in the users
collection, so let's use
the same_as content type to express this foreign key relationship.
{ "type": "same_as", "ref": "users.content.id"}
The currency
field should reflect the real currencies that the bank supports. We could use
the string::faker support currency_code
generator to do this, but the bank only supports USD
, GBP
and EUR
. So she uses a string::categorical instead. Roughly 80% of transactions are
in USD
so let's assign a higher probability to that variant.
{ "type": "string", "categorical": { "USD": 8, "GBP": 1, "EUR": 1 }}
Finally, the description
can be more descriptive. We can again use a string::categorical
generator, but don't want to put thousands of descriptive lines in our declaration file. Instead, we'll use an
external JSON datasource to pull descriptions from. First copy the following JSON to
a descriptions.json
file in the workspace.
[ "Walmart Credit Purchase", "Amazon - Kindle reading tablet", "Apple store subscription", "Google store rental", "Apple accessories", "Amazon Kindle subscription"]
Next, edit the description
field as follow:
{ "type": "datasource", "path": "json:descriptions.json", "cycle": true}
Now let's generate data from the transactions
collection again:
synth generate bank_db --collection transactions --size 10 | jq[ { "amount": 1458.2, "currency": "GBP", "description": "Walmart Credit Purchase", "id": 0, "timestamp": "2014-12-15T22:49:23+0000", "user_id": 3 }, { "amount": 6043.2, "currency": "USD", "description": "Amazon - Kindle reading tablet", "id": 1, "timestamp": "2002-10-10T23:41:32+0000", "user_id": 1 }, { "amount": 2515.7000000000003, "currency": "GBP", "description": "Apple store subscription", "id": 2, "timestamp": "2000-07-17T05:50:27+0000", "user_id": 3 },...]
Ah, much better.
As an exercise for the reader, try to do the same with the collection users.json
.