The goal of this project is to generate semi-structured JSON data that you can use in your data projects. To simulate real-world behavior, you can control the correctness of the generated dataset by specifying what percentage of the records should be incomplete or inconsistent with the perfect schema.
For the configuration without data quality issues, the generated events will have the following schema:
Field name | Field type | Explanation |
---|---|---|
visit_id | text | The identifier for a user's visit. If a user makes multiple actions without leaving the website, this identifier remains the same for all actions. |
event_time | date time | The action time, in the format year-month-dayThour:minutes:seconds+timezone, for example: "2020-02-27T12:49:41+00:00" |
user_id | long | The unique identifier of the user related to the action. |
keep_private | boolean | If true, everything under the user structure field must be anonymized. |
page | structure | |
-current | text | The currently visited page. |
-previous | text - optional | The previously visited page. May be null if it's the user's first visit. |
source | structure | |
-site | text | The visited website address. |
-api_version | text | The API version that sent the event. Can be: v1, v2 or v3. |
user | structure | |
-ip | text | The IP address of the user. |
-latitude | double | The geographical latitude of the user. |
-longitude | double | The geographical longitude of the user. |
technical | structure | |
-browser | text | The browser name. |
-os | text | The operating system of the user. |
-lang | text | The browser language of the user. |
-network | text | The network type, can be one of: adsl, fiber_optic, 3g, 4g |
-device | structure | |
--type | text | The device type: pc, tablet, smartphone. |
--version | text | The device version, e.g. Apple iPhone XS; missing for pc. |
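To make the schema concrete, here is a sketch of one fully valid event written as a Python dictionary, with a small shape check. The field names and nesting follow the table above; every value except the event_time example is made up for illustration, and the helper check_event_shape is hypothetical, not part of the project.

```python
import json
from datetime import datetime

# Illustrative values only; field names and nesting follow the schema table.
sample_event = {
    "visit_id": "visit-1",
    "event_time": "2020-02-27T12:49:41+00:00",
    "user_id": 1,
    "keep_private": False,
    "page": {"current": "home", "previous": None},
    "source": {"site": "example.com", "api_version": "v2"},
    "user": {"ip": "203.0.113.10", "latitude": 48.85, "longitude": 2.35},
    "technical": {
        "browser": "Firefox",
        "os": "Linux",
        "lang": "en",
        "network": "fiber_optic",
        "device": {"type": "smartphone", "version": "Apple iPhone XS"},
    },
}

REQUIRED_TOP_LEVEL = {"visit_id", "event_time", "user_id", "keep_private",
                      "page", "source", "user", "technical"}


def check_event_shape(event):
    """Return a list of problems; an empty list means the shape looks valid."""
    missing = sorted(REQUIRED_TOP_LEVEL - set(event))
    problems = [f"missing field: {name}" for name in missing]
    if "event_time" in event:
        try:
            parsed = datetime.fromisoformat(event["event_time"])
            if parsed.tzinfo is None:
                problems.append("event_time has no timezone offset")
        except (TypeError, ValueError):
            problems.append("event_time is not an ISO 8601 date time")
    source = event.get("source")
    if isinstance(source, dict) and source.get("api_version") not in {"v1", "v2", "v3"}:
        problems.append("source.api_version is not v1, v2 or v3")
    return problems


print(json.dumps(sample_event, indent=2))
print(check_event_shape(sample_event))
```

A check like this is only a starting point; a consumer would typically extend it with the nested fields it actually relies on.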
To configure data issues, you have to define these properties:
dataset:
  # ...
  composition_percentage:
    incomplete: 2
    inconsistent: 2
    fully_valid: 96
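One way to read these percentages is as weights for a random draw made for each generated event. The sketch below illustrates that reading only; it is an assumption about the behavior, not the generator's actual code, and pick_event_kind is a hypothetical name.

```python
import random
from collections import Counter

# Mirrors the composition_percentage values from the configuration above.
composition_percentage = {"incomplete": 2, "inconsistent": 2, "fully_valid": 96}


def pick_event_kind(composition, rng):
    """Draw one event kind, weighted by its configured percentage."""
    kinds = list(composition)
    weights = [composition[kind] for kind in kinds]
    return rng.choices(kinds, weights=weights, k=1)[0]


rng = random.Random(42)  # fixed seed so the run is repeatable
counts = Counter(pick_event_kind(composition_percentage, rng) for _ in range(10_000))
print(counts)
```

Over many events the observed shares approach the configured ones, while a small dataset can deviate noticeably from them.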
What's the difference between them?
An incomplete event lacks one or more of the following fields: device, network, technical.browser, source, visit_id, user_id. An affected field is removed from the event, except user_id, which is set to 0 instead.
Please note that the device.version field is also absent if the device.type is pc.
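The effect on a single record can be sketched as follows. The helper make_incomplete and the REMOVABLE_PATHS mapping are hypothetical names introduced for this illustration, and the nesting of device and network under technical is taken from the schema table.

```python
import copy

# Hypothetical mapping from the field names listed above to their location
# in the event; the nesting follows the schema table.
REMOVABLE_PATHS = {
    "device": ("technical", "device"),
    "network": ("technical", "network"),
    "technical.browser": ("technical", "browser"),
    "source": ("source",),
    "visit_id": ("visit_id",),
    "user_id": ("user_id",),
}


def make_incomplete(event, fields):
    """Return a copy of event with the given fields removed (user_id set to 0)."""
    result = copy.deepcopy(event)
    for field in fields:
        if field == "user_id":
            result["user_id"] = 0  # user_id is zeroed rather than removed
            continue
        *parents, leaf = REMOVABLE_PATHS[field]
        target = result
        for key in parents:
            target = target[key]
        target.pop(leaf, None)
    return result


event = {
    "visit_id": "visit-1",
    "user_id": 1,
    "source": {"site": "example.com", "api_version": "v1"},
    "technical": {"browser": "Firefox", "network": "4g", "device": {"type": "pc"}},
}
broken = make_incomplete(event, ["visit_id", "user_id", "network"])
print(broken)
```

Consumers therefore need to treat both an absent key and a user_id of 0 as signs of an incomplete record.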
An inconsistent event has one or more of the following data issues:
- device can be defined as {"type": {"name": "string"}, "version": "string"}
- network can be defined as {"short_name": "string", "long_name": "string"}
- browser can be defined as {"name": "string", "language": "string"}, and in that case technical.lang is missing
- source can be defined as www.name of the website, whereas the expected format doesn't include www
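A consumer could detect these deviations by checking the type of each affected field. The sketch below is one possible check, not code from the project: find_inconsistencies is a hypothetical helper, and it assumes that device, network and browser live under technical and that the www prefix shows up in source.site.

```python
def find_inconsistencies(event):
    """Return the names of fields whose shape deviates from the expected schema."""
    issues = []
    technical = event.get("technical", {})
    device = technical.get("device", {})
    # Expected: device.type is a text value, not a nested structure.
    if isinstance(device, dict) and isinstance(device.get("type"), dict):
        issues.append("device")
    # Expected: network and browser are plain text values.
    if isinstance(technical.get("network"), dict):
        issues.append("network")
    if isinstance(technical.get("browser"), dict):
        issues.append("browser")
    # Assumption: the unexpected www prefix appears in source.site.
    source = event.get("source", {})
    site = source.get("site", "") if isinstance(source, dict) else ""
    if isinstance(site, str) and site.startswith("www."):
        issues.append("source")
    return issues


valid_event = {
    "source": {"site": "example.com", "api_version": "v1"},
    "technical": {"browser": "Firefox", "lang": "en", "network": "4g",
                  "device": {"type": "tablet", "version": "Example Tab 1"}},
}
inconsistent_event = {
    "source": {"site": "www.example.com", "api_version": "v1"},
    "technical": {"browser": {"name": "Firefox", "language": "en"},
                  "network": {"short_name": "4g", "long_name": "4th generation"},
                  "device": {"type": {"name": "tablet"}, "version": "Example Tab 1"}},
}
print(find_inconsistencies(valid_event))
print(find_inconsistencies(inconsistent_event))
```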
With users_no_data_consent_percentage you can define the percentage of users who would like to keep their data private. For them, the keep_private field will be set to true. In the snippet below, 2% of 10 users will have this flag set to true:
dataset:
  all_users: 10
  # ...
  users_no_data_consent_percentage: 2
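On the consuming side, the keep_private flag could be honored as sketched below. This is an illustration under stated assumptions: anonymize_if_private is a hypothetical helper, and "anonymizing" is interpreted here as replacing every value under the user structure with None, which is only one of several possible strategies.

```python
import copy


def anonymize_if_private(event):
    """Return the event, with the user structure blanked when keep_private is true."""
    if not event.get("keep_private", False):
        return event
    result = copy.deepcopy(event)
    # Assumption: anonymizing means replacing every value under "user" with None.
    result["user"] = {key: None for key in result.get("user", {})}
    return result


private_event = {
    "user_id": 1,
    "keep_private": True,
    "user": {"ip": "203.0.113.10", "latitude": 48.85, "longitude": 2.35},
}
public_event = dict(private_event, keep_private=False)

print(anonymize_if_private(private_event))
print(anonymize_if_private(public_event))
```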
Late data can be simulated with the data_generator.model.unordered_data.UnorderedDataContainer class. You have to create the instance with a method controlling whether an event should be buffered or sent directly to the sink. All buffered events are then considered late and, therefore, unordered.
You can find an example of its use in examples/kafka/generate_dataset_to_kafka.py:
from random import choice

from data_generator.model.unordered_data import UnorderedDataContainer

# configuration, dataset, visit and the helper functions used below are defined earlier in the example script
# create the container from the main configuration file
# generation > late_data_percentage will be used to determine whether the action should be sent directly or not
unordered_data_container = UnorderedDataContainer.from_yaml_with_random_distribution(configuration)


# Method to control if the late data should be delivered
def should_send_late_data_to_kafka():
    # returns 1 (truthy) in roughly 10% of the calls
    flags = [0] * 90 + [1] * 10
    return choice(flags)


while True:
    # usual events generation
    action = visit.generate_new_action(dataset.pages, get_random_duration_in_seconds(), is_valid_log())
    # buffer the action as late data or send it directly to the sink
    unordered_data_container.wrap_action((visit.visit_id, action),
                                         lambda generated_action: configuration.send_message(
                                             output_topic_name,
                                             generated_action[0],
                                             generated_action[1]
                                         ))
    # from time to time, deliver the buffered (late) actions
    if should_send_late_data_to_kafka():
        unordered_data_container.send_buffered_actions(
            lambda late_action: configuration.send_message(output_topic_name, late_action[0], late_action[1]))
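The buffering idea itself can be illustrated without Kafka. The class below is a toy stand-in written for this explanation, not the project's UnorderedDataContainer; it only borrows the method names wrap_action and send_buffered_actions from the example above, and the 30% delay ratio and the flush interval are arbitrary choices.

```python
import random


class SimpleLateDataBuffer:
    """Toy stand-in for the buffering idea; NOT the project's UnorderedDataContainer."""

    def __init__(self, should_buffer):
        self._should_buffer = should_buffer  # callable returning True to delay an event
        self._buffered = []

    def wrap_action(self, action, send):
        """Buffer the action or pass it to send right away."""
        if self._should_buffer():
            self._buffered.append(action)
        else:
            send(action)

    def send_buffered_actions(self, send):
        """Deliver every delayed action; they reach the sink later than generated."""
        pending, self._buffered = self._buffered, []
        for action in pending:
            send(action)


rng = random.Random(7)
sink = []
late_buffer = SimpleLateDataBuffer(lambda: rng.random() < 0.3)  # delay roughly 30% of events
for event_number in range(10):
    late_buffer.wrap_action(event_number, sink.append)
    if event_number % 5 == 4:  # deliver the late events every fifth iteration
        late_buffer.send_buffered_actions(sink.append)
print(sink)
```

Every event eventually reaches the sink, but the delayed ones arrive after events that were generated later, which is what makes the stream unordered.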
As of this writing, the versions_percentage part doesn't change the generated events in any way; it is only there to simulate some variability for analytics along the app version axis.
To execute the available generators, you can either use the scripts provided in the examples directory or build Docker images as described below.
To create a dockerized environment of the generator, execute the following commands:
make build_kafka_runner_image
cd examples/kafka-docker
docker-compose down --volumes
docker-compose up
To change the configuration, modify the examples/kafka-docker/configuration.yaml file.
The broker will be available at localhost:29092.
To start Apache Kafka sink from the script, execute the following commands:
- start the broker
cd examples/kafka
docker-compose down --volumes
docker-compose up
The broker will be available at localhost:29092.
- start the generator
python examples/kafka/generate_dataset_to_kafka.py
To change the configuration, modify the examples/kafka/configuration.yaml file.
To launch the tests in PyCharm, you need to enable pytest as the test runner for the project. The PyCharm documentation on jetbrains.com explains how to do this.
To execute all tests from the command line, use the make test_all command. To check test coverage, execute make test_coverage.
Set up a virtualenv environment:
virtualenv -p python3 .venv/
Activate the installed environment:
source .venv/bin/activate
Install dependencies (venv activated):
pip3 install -r requirements.txt
Deactivate the virtualenv:
deactivate
Check code format:
make lint_all
Reformat code:
make reformat_all
The pre-commit hook runs the code formatting before each commit and the unit tests before each push. To install it, use the pre-commit plugin and run the pre-commit install command.