In this scenario, we want to create a realistic replica of our production database called
bank_db for the purpose of
bank_db has two tables:
users table, which has all the information pertaining to our customers.
transactions table, which has transactions referring to the customers in the users table.
We first create a new workspace to import our dataset into Synth:
Synth supports importing from JSON files. To create a namespace, copy the JSON blob below to a file outside your workspace and use the synth import sub-command.
At this stage, we can run the tree command to see how the synth import sub-command updated our workspace.
The bank_db namespace (remember from Core Concepts that a subdirectory in a workspace represents a namespace) was created automatically, as well as the two collections, users and transactions.
We can now generate data from our namespace using the synth generate sub-command. (We are piping the output through jq for auto-formatting, but this is optional.)
Notice that the generated data has the right schema but is not very useful. For example, the timestamp field is not even a timestamp; it is just a random string.
The semantic meaning of the data has not been perfectly captured by the Synth inference engine.
As synth evolves, inference will get better, but for now we need to tweak the schema.
To modify the schema, open the workspace in your favourite editor. Let's take a look at
content of an Array. This can be any valid JSON, but since
bank_db originates from a SQL database with column
names and so on, it is a JSON object.
length of an Array. The length of an Array is actually also a Content node. This gives you flexibility - for
example you can make the length of an array be a
For more information on how to compose schemas, see the Schema page.
Reading through the schema, we can see that Synth inferred
id as being a
What we actually need is for
id to be a monotonically increasing
number::id type starting
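To make the intent concrete, here is a minimal Python sketch of what a monotonically increasing id amounts to. It only illustrates the behaviour we want from the generated data and is not Synth's implementation; the helper name id_generator and the starting value of 1 are assumptions chosen for the example.

```python
from itertools import count, islice


def id_generator(start_at=1):
    """Yield start_at, start_at + 1, ... with no gaps or repeats (hypothetical helper)."""
    return count(start_at)


# Take the first five ids; the starting value 1 is an assumption for this sketch.
ids = list(islice(id_generator(start_at=1), 5))
print(ids)  # prints [1, 2, 3, 4, 5]
```

Unlike a value drawn from a random range, each id here is unique and strictly larger than the previous one, which is what we expect from a primary key column.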
The amount field is almost right. Synth inferred the right
high bounds, but the step should be
we are dealing with currencies. So let's replace the
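The effect of a step on a numeric range can be sketched in plain Python. This is an illustration under stated assumptions, not what Synth does internally: the helper random_amount, the bounds of 0 and 1000, and the step of 0.01 (one cent) are all hypothetical values picked for the example.

```python
import random
from decimal import Decimal


def random_amount(low, high, step, rng):
    """Pick one value from low, low + step, ..., up to high (hypothetical helper)."""
    low, high, step = Decimal(low), Decimal(high), Decimal(step)
    steps = int((high - low) / step)  # number of whole steps that fit in the range
    return low + step * rng.randint(0, steps)


rng = random.Random(0)  # seeded so the sketch is repeatable
# Bounds and step are assumptions for illustration only.
amounts = [random_amount("0", "1000", "0.01", rng) for _ in range(5)]
print(amounts)
```

Using Decimal rather than float keeps every amount an exact multiple of the step, so no value ends up with a fraction of a cent.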
Next, we see that Synth detected the
timestamp field as a string following a random pattern. Consulting the documentation, we find that
it should be a string::date_time.
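As a rough picture of what a date-time string generator produces, the following Python sketch draws random timestamps between two bounds and renders them with a fixed format. The format string, the bounds, and the helper name random_timestamp are assumptions made for this example; they are not taken from the bank_db schema.

```python
import random
from datetime import datetime, timedelta

FORMAT = "%Y-%m-%dT%H:%M:%S"  # assumed format, chosen for illustration


def random_timestamp(begin, end, rng):
    """Return a formatted timestamp between begin and end (hypothetical helper)."""
    span = int((end - begin).total_seconds())
    moment = begin + timedelta(seconds=rng.randint(0, span))
    return moment.strftime(FORMAT)


rng = random.Random(0)  # seeded so the sketch is repeatable
begin = datetime(2020, 1, 1)  # assumed lower bound
end = datetime(2021, 1, 1)    # assumed upper bound
stamps = [random_timestamp(begin, end, rng) for _ in range(3)]
print(stamps)
```

Every generated string parses back with the same format, which is exactly the property the random-pattern string lacked.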
The user_id field should point to a valid entry in the
users collection, so let's use
the same_as content type to express this foreign key relationship.
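The property we are after can be stated as a simple check: every user_id in transactions must match an id that exists in users. The Python sketch below builds stand-in data and verifies that property. The record shapes and sizes are assumptions for the example, and the sketch says nothing about how Synth resolves the reference internally.

```python
import random

rng = random.Random(0)  # seeded so the sketch is repeatable

# Stand-in users with only an id field (assumed shape, for illustration).
users = [{"id": i} for i in range(1, 6)]
user_ids = [u["id"] for u in users]

# Each transaction borrows its user_id from an existing user.
transactions = [{"id": i, "user_id": rng.choice(user_ids)} for i in range(1, 11)]

# A dangling reference would be a user_id with no matching user.
known_ids = set(user_ids)
dangling = [t for t in transactions if t["user_id"] not in known_ids]
print(len(dangling))  # prints 0
```

Running the same kind of check against generated output is a quick way to confirm that the foreign key relationship holds.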
The currency field should reflect the real currencies that the bank supports. We could use
the currency_code generator of string::faker
to do this, but the bank only supports
EUR. So we use a string::categorical instead. Roughly 80% of transactions are
in USD, so let's assign a higher probability to that variant.
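To see what weighting one variant more heavily does in practice, here is a small Python sketch of weighted categorical sampling. It uses only the two currency codes named in the text and gives USD a weight of 8 out of 10; the exact weights, the absence of other variants, and the helper name random_currency are assumptions for the example.

```python
import random
from collections import Counter

# Relative weights per variant (assumed values: USD gets 8 parts in 10).
WEIGHTS = {"USD": 8, "EUR": 2}


def random_currency(rng):
    """Draw one currency code according to WEIGHTS (hypothetical helper)."""
    codes = list(WEIGHTS)
    return rng.choices(codes, weights=[WEIGHTS[c] for c in codes], k=1)[0]


rng = random.Random(0)  # seeded so the sketch is repeatable
sample = Counter(random_currency(rng) for _ in range(10_000))
usd_share = sample["USD"] / 10_000
print(usd_share)  # close to 0.8
```

The weights are relative, so 8 and 2 behave the same as 80 and 20; only the ratio between variants matters.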
Now let's generate data from the
transactions collection again:
Ah, much better.
As an exercise for the reader, try to do the same with the users collection.