Interactive Tutorial - Creating fake HTTP logs with Synth
In this tutorial we're going to be creating fake HTTP logs with synth. Whether you're benchmarking some log-collector or are experimenting with log compression, this tutorial aims to teach you how to build a customisable data model that will let you generate as many logs as you need.
#
Data ModelWe want to generate HTTP logs - but what is an HTTP log actually? Different HTTP servers will have different logging conventions - but we'll stick to the Common Log Format (CLF) used as a default by web servers like the Apache Web Server.
CLF has the following syntax:
host ident authuser date request status bytes
concretely:
127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
or as json:
{ "host": "127.0.0.1", // ip address of the client "ident": "user-identifier", // RFC 1413 identity of the cient "authuser": "frank", // userid "date": "10/Oct/2000:13:55:36 -0700", // %d/%b/%Y:%H:%M:%S %z "request": "GET /apache_pb.gif HTTP/1.0", // HTTP request from client "status": 200, // HTTP status code "bytes": 2326 // size of object returned in bytes}
#
Installing SynthTo install synth
head over to the download page.
#
ImplementationSo let's get started!
To get started with synth let's create a new namespace by simply
creating a new directory - we'll call it clf-logs
:
$ mkdir clf-logs
Next let's create a collection called logs which will define the schema of our log data:
$ cd clf-logs && touch logs.json
#
ScaffoldingWe're going to base the meat of our CLF log schema on the date
field. We
want our logs to look and behave realistically. We can model requests
arriving at our web server using a poisson process - a model which is often used to model
independent random events with a mean interval between events (like
customers arriving a store.) synth
has a poisson generator which we can use to do this.
First let's open up logs.json
in our favourite IDE and define an array of
objects which have a field date
:
{ "type": "array", "length": 1, "content": { "type": "object", "date": 0 }}
Now lets sanity check our schema by generating some data:
$ synth generate . --collection logs | jq[ { "date": 0 }]
#
Poisson GeneratorCool! We generated a 0
. So far so good. Now let's swap out the 0
for the
poisson generator in logs.json
.
{ "type": "array", "length": 1, "content": { "type": "object", "date": { "type": "series", "format": "%d/%b/%Y:%H:%M:%S", "poisson": { "start": "10/Oct/2000:13:55:36", "rate": "10m" } } }}
Our poisson series generator has 2 parameters:
start
: the earliest time an event can occurrate
: the average arrival rate for events
#
HostNext let's add the host
field. Here we can simply use one of our faker
generators for generating ipv4
addresses:
{ "type": "array", "length": 1, "content": { "type": "object", "date": { "type": "series", "format": "%d/%b/%Y:%H:%M:%S", "poisson": { "start": "10/Oct/2000:13:55:36", "rate": "10m" } }, "host": { "type": "string", "faker": { "generator": "ipv4" } } }}
Easy!
#
IdentIdent corresponds to the RFC 1413 Identification Protocol. We can leave this
as "-"
using the pattern generator for now, but you can get
creative if you need something more elaborate for your use case:
{ "type": "array", "length": 1, "content": { "type": "object", "date": { "type": "series", "format": "%d/%b/%Y:%H:%M:%S", "poisson": { "start": "10/Oct/2000:13:55:36", "rate": "10m" } }, "host": { "type": "string", "faker": { "generator": "ipv4" } }, "ident": { "type": "string", "pattern": "-" } }}
#
Authuserauthuser
is the userid of the person requesting the document. Usually "-" unless .htaccess has requested authentication.
Here we'll just use the faker
first_name
generator get a bunch of first
names:
{ "type": "array", "length": 1, "content": { "type": "object", "date": { "type": "series", "format": "%d/%b/%Y:%H:%M:%S", "poisson": { "start": "10/Oct/2000:13:55:36", "rate": "10m" } }, "host": { "type": "string", "faker": { "generator": "ipv4" } }, "ident": { "type": "string", "pattern": "-" }, "authuser": { "type": "string", "faker": { "generator": "first_name" } } }}
#
RequestRequest is a little involved - we're going to need to use the format
generator to string together multiple generators. format
takes arbitrarily
many arguments and a format string. Our format string is of the form:
{http_method} {endpoint} HTTP/1.0
We can build this compositionally with 2 child generators and a format
generator:
{ "type": "string", "format": { "format": "{http_method} /{endpoint} HTTP/1.0", "arguments": { "http_method": { "type": "string", "categorical": { "GET": 1, "PUT": 1, "POST": 1, "PATCH": 1 } }, "endpoint": { "type": "string", "faker": { "generator": "file_name" } } } }}
Here http_method
is a categorical generator, with equal probability to
yield any of the 4 HTTP methods defined, and endpoint
is a faker
generator which generates file names.
#
StatusFor status
we'll be using the categorical
generator as well - easy:
{ "type": "array", "length": 1, "content": { "type": "object", "date": { "type": "series", "format": "%d/%b/%Y:%H:%M:%S", "poisson": { "start": "10/Oct/2000:13:55:36", "rate": "10m" } }, "host": { "type": "string", "faker": { "generator": "ipv4" } }, "ident": { "type": "string", "pattern": "-" }, "authuser": { "type": "string", "faker": { "generator": "first_name" } }, "request": { "type": "string", "format": { "format": "{http_method} /{endpoint} HTTP/1.0", "arguments": { "http_method": { "type": "string", "categorical": { "GET": 1, "PUT": 1, "POST": 1, "PATCH": 1 } }, "endpoint": { "type": "string", "faker": { "generator": "file_name" } } } } }, "status": { "type": "number", "categorical": { "200": 8, "404": 1, "500": 1 } } }}
In this case the we're assigning weights to the variants of the categorical.
80% of the time we'll be getting 200
, 10% of the time we'll get 404
and
10% of the time we'll get 500
.
#
BytesAnd finally bytes
. For bytes
we'll use a number range generator
from 1 b to 1 MiB:
{ "type": "array", "length": 1, "content": { "type": "object", "date": { "type": "series", "format": "%d/%b/%Y:%H:%M:%S", "poisson": { "start": "10/Oct/2000:13:55:36", "rate": "10m" } }, "host": { "type": "string", "faker": { "generator": "ipv4" } }, "ident": { "type": "string", "pattern": "-" }, "authuser": { "type": "string", "faker": { "generator": "first_name" } }, "request": { "type": "string", "format": { "format": "{http_method} /{endpoint} HTTP/1.0", "arguments": { "http_method": { "type": "string", "categorical": { "GET": 1, "PUT": 1, "POST": 1, "PATCH": 1 } }, "endpoint": { "type": "string", "faker": { "generator": "file_name" } } } } }, "status": { "type": "number", "categorical": { "200": 8, "404": 1, "500": 1 } }, "bytes": { "type": "number", "range": { "low": 1, "high": 1048576 } } }}
And we're done!
#
Tying it all togetherNow we have our raw data structure - we need to compose it into the proper
log string. To do this, we'll create a separate collection formatted
by
creating another file in our namespace called formatted.json
. We'll then
use a combination of the format
and same_as
generators to compose
together fields from our original collection logs.json
.
{ "type": "array", "length": 1, "content": { "type": "string", "format": { "format": "{host} {ident} {authuser} [{date}] {request} {status} {bytes}", "arguments": { "host": "@logs.content.host", "ident": "@logs.content.ident", "authuser": "@logs.content.authuser", "date": "@logs.content.date", "request": "@logs.content.request", "status": "@logs.content.status", "bytes": "@logs.content.bytes" } } }}
And we're done!
$ synth generate . --collection formatted --size 5 | jq[ "187.123.85.239 - Margarete [10/Oct/2000:13:58:04] GET /woman.7z HTTP/1.0 200 274906", "151.58.227.40 - Ewald [10/Oct/2000:14:41:24] POST /oliver.mp3 HTTP/1.0 404 912953", "90.255.24.42 - Demarcus [10/Oct/2000:14:51:11] POST /young.doc HTTP/1.0 200 1047925", "123.174.213.110 - Carlos [10/Oct/2000:15:07:59] GET /way.rar HTTP/1.0 200 49926", "253.115.73.9 - Paolo [10/Oct/2000:15:14:53] PATCH /next.png HTTP/1.0 200 884672"]