✨ Test generation support introduced (#22)
benrutter authored Nov 14, 2024
1 parent 6c60d68 commit 20a9b47
Showing 13 changed files with 447 additions and 13 deletions.
78 changes: 78 additions & 0 deletions docs/building-tests.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,78 @@
Alongside *running* tests, Wimsey also has some functions to aid *building* tests. This can be useful if you want to automagically create some sensible initial tests for multiple datasets, without needing to type them out by hand, or create them manually in code.

As with the rest of Wimsey, your own dataframe engine is used to compute the relevant statistics. Wimsey can either generate starter tests from *a list of samples*, or it can use *sampling with replacement* to generate samples for you from a single dataframe. If you use the latter, note that Wimsey will need to *evaluate each sample individually*, so if you are using a lazy framework such as Polars' LazyFrames, Dask or Modin, you will likely want to collect your results first, or implement a caching mechanism to avoid unnecessary repeated computation.


## What is margin?

You'll see the keyword *margin* throughout Wimsey test building, so it's worth explaining it up front.

Margin is the amount of *extra allowance* tests give, based on the sample. For instance, if Wimsey has three samples with a "column_a" maximum of 1, 2 and 3, rather than creating a test for the maximum being 3 (the highest value seen in the samples), Wimsey will allow an amount of 'give' in the test.

This 'give' is based on the *standard deviation of the statistical metric*, which for the above example is 1, meaning that Wimsey would build a test expecting the maximum to be less than or equal to 4.

If this is all gibberish to you, don't worry: the `margin` keyword defaults to 1, which is often a sensible choice. If you find that Wimsey is creating too-strict tests, bump it up slightly; if tests are too lax, you can reduce margin to a smaller positive number.

Setting `margin` to a negative value means you're creating a test that your given sample would fail, and while supported, is unlikely to be what you're looking to do.
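To make the arithmetic concrete, here's a plain-Python sketch of the idea (an illustration only, not Wimsey's internal code):

```python
import statistics

# Maxima of "column_a" observed across three samples
sampled_maxima = [1, 2, 3]
margin = 1  # the default

# The 'give' is the standard deviation of the metric across samples
spread = statistics.stdev(sampled_maxima)  # 1.0 here

# The generated test then expects the maximum to be at most:
upper_bound = max(sampled_maxima) + margin * spread
print(upper_bound)  # 4.0
```

Raising `margin` widens `upper_bound` and loosens the test; shrinking it toward zero tightens the test to exactly what the samples showed.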

## From Sampling

From a single dataframe, Wimsey will sample with replacement to build a starter test. The `samples` keyword specifies the number of samples you want Wimsey to build, while `n` or `fraction` tell Wimsey the size (in rows) or fraction (as a float) of each sample to take. Note that you *can't supply both the `n` and `fraction` keywords*.
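If sampling with replacement is unfamiliar: it means every draw is made from the full dataframe, so the same row can appear more than once within a sample. A rough stdlib illustration of the concept (Wimsey delegates the actual sampling to your dataframe engine):

```python
import random

random.seed(0)  # reproducibility for this sketch
rows = [{"a": 1}, {"a": 2}, {"a": 3}]

# Draw 5 samples of 2 rows each; rows may repeat within a sample
samples = [random.choices(rows, k=2) for _ in range(5)]

assert len(samples) == 5
assert all(len(sample) == 2 for sample in samples)
```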

Wimsey has a `starter_tests_from_sampling` function and a `save_starter_tests_from_sampling` function, depending on whether you intend to return the tests as a dictionary or save them to a file. `save_starter_tests_from_sampling` takes the exact same arguments, with the addition of a `path` argument and an optional `storage_options` argument.

=== "starter_tests_from_sampling"
```python
import polars as pl
from wimsey.profile import starter_tests_from_sampling

df = pl.DataFrame({"a": [1, 2, 3], "b": ["cool", "bat", "hat"]})
tests: list[dict] = starter_tests_from_sampling(df, samples=5_000, n=2, margin=3)
```
=== "save_starter_tests_from_sampling"
```python
import pandas as pd
from wimsey.profile import starter_tests_from_sampling

df = pd.DataFrame({"a": [1, 2, 3], "b": ["cool", "bat", "hat"]})
tests: list[dict] = save_starter_tests_from_sampling(
path="my-first-test.json",
df=df,
samples=5_000,
fraction=0.5,
margin=3,
)
```

## From Samples

From a list (or other iterable, such as a generator) of supported dataframes, Wimsey will produce a list of passing tests.

Wimsey has a `starter_tests_from_samples` function and a `save_starter_tests_from_samples` function, depending on whether you intend to return the tests as a dictionary or save them to a file. `save_starter_tests_from_samples` takes the exact same arguments, with the addition of a `path` argument and an optional `storage_options` argument.

=== "starter_tests_from_samples"
```python
from glob import glob

import pandas as pd
from wimsey.profile import starter_tests_from_samples

dfs = [pd.read_csv(i) for i in glob("folder/of/samples/*.csv")]
tests: list[dict] = starter_tests_from_samples(dfs, margin=1.5)
```
=== "save_starter_tests_from_samples"
```python
from glob import glob

import polars as pl
from wimsey.profile import save_starter_tests_from_samples

from config import my_storage_options

save_starter_tests_from_samples(
path="s3://test-store/cooltest.yaml",
samples=[pl.read_parquet(i) for i in glob("folder/of/samples/*.parquet")],
margin=0.8,
storage_options=my_storage_options,
)
```
2 changes: 2 additions & 0 deletions docs/index.md
@@ -12,6 +12,8 @@ Ideally, all data would be usable when you receive it, but you probably already

A data contract is an expression of what *should* be true of some data, such as that it should 'only have columns x and y' or 'the values of column a should never exceed 1'. Wimsey is a library built to run these contracts on a dataframe during python runtime.
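For instance, a contract might look something like the following YAML (the field names here are purely illustrative; see the test catalogue for the real options):

```yaml
# Hypothetical contract: consult the test catalogue for actual test names
- test: columns_should
  be: [a, b]
- test: max_should
  column: a
  be_less_than_or_equal_to: 1
```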

Additionally, Wimsey has tools to [help you generate sensible tests from a data sample](building-tests.md).

Wimsey is built on top of the awesome [Narwhals](https://github.com/narwhals-dev/narwhals) and natively supports any dataframes that Narwhals does. At the time of writing, that includes Polars, Pandas, Arrow, Dask, Rapids and Modin.

If you're looking to get a quick feel for Wimsey, check out the [quick start documentation](quick-start.md)
6 changes: 4 additions & 2 deletions docs/quick-start.md
@@ -74,7 +74,7 @@ We can test for a lot more than that, but that works for our example. Our first
]
```

See [Possible Tests](possible_tests.md) for a full catalogue of runnable tests and their configurations.
See [Possible Tests](possible-tests.md) for a full catalogue of runnable tests and their configurations.

### Executing Tests

@@ -210,4 +210,6 @@ Validate will run tests in the exact same way as `test`, but simply raises an e
print(f"{top_sleuth} is the best sleuth!")
```

And that's it, to keep things simple `validate` and `test` are the only public-intended functions in Wimsey, aside from test creation, which is covered further in the *possible tests* section.
And that's it for testing. To keep things simple, `validate` and `test` are the only public-intended functions in Wimsey, aside from test creation, which is covered further in the *possible tests* section.

Wimsey also supports *generating tests*; see [the building tests section](building-tests.md) for how to get started.
5 changes: 4 additions & 1 deletion mkdocs.yml
@@ -2,12 +2,15 @@ site_name: Wimsey

nav:
- Intro: index.md
- Motivation: motivation.md
- Quick Start: quick-start.md
- Building Tests: building-tests.md
- Motivation: motivation.md
- Test Catalogue: possible-tests.md

theme:
name: material
icon:
logo: fontawesome/solid/magnifying-glass-chart
palette:
- scheme: default
primary: brown
12 changes: 8 additions & 4 deletions tests/test_config.py
@@ -47,7 +47,8 @@ class DummyOpenFile:
def __enter__(self, *args, **kwargs):
return self

def __exit__(self, *args, **kwargs): ...
def __exit__(self, *args, **kwargs):
...

def read(self, *args, **kwargs):
return yaml.dump(test_suite)
@@ -65,7 +66,8 @@ class DummyOpenFile:
def __enter__(self, *args, **kwargs):
return self

def __exit__(self, *args, **kwargs): ...
def __exit__(self, *args, **kwargs):
...

def read(self, *args, **kwargs):
return json.dumps(test_suite)
@@ -83,7 +85,8 @@ class DummyOpenFile:
def __enter__(self, *args, **kwargs):
return self

def __exit__(self, *args, **kwargs): ...
def __exit__(self, *args, **kwargs):
...

def read(self, *args, **kwargs):
return json.dumps(test_suite)
@@ -104,7 +107,8 @@ class DummyOpenFile:
def __enter__(self, *args, **kwargs):
return self

def __exit__(self, *args, **kwargs): ...
def __exit__(self, *args, **kwargs):
...

def read(self, *args, **kwargs):
return "dsafasdfasdf"
19 changes: 19 additions & 0 deletions tests/test_dataframe.py
@@ -37,3 +37,22 @@ def test_that_describe_excludes_non_specified_column_and_metric_combos() -> None
assert "count_a" in actual
assert "count_b" not in actual
assert "min_a" not in actual


def test_that_profile_by_sampling_returns_list_of_dicts_of_expected_length() -> None:
df = pl.DataFrame({"a": [1.2, 1.3, 1.4], "b": ["one", "two", None]})
actual = dataframe.profile_from_sampling(df, samples=10, n=1)
assert len(actual) == 10
assert actual[0]["mean_a"] in [1.2, 1.3, 1.4]
assert actual[4]["columns"] == "a_^&^_b"


def test_that_profile_from_samples_returns_list_of_dicts_of_expected_length() -> None:
dfs = [
pl.DataFrame({"a": [1.2, 1.3, 1.4], "b": ["one", "two", None]})
for _ in range(20)
]
actual = dataframe.profile_from_samples(dfs)
assert len(actual) == 20
assert actual[10]["mean_a"] == 1.3
assert actual[4]["columns"] == "a_^&^_b"
79 changes: 79 additions & 0 deletions tests/test_profile.py
@@ -0,0 +1,79 @@
import polars as pl

from wimsey import profile
from wimsey import execution


def test_starter_tests_from_sampling_returns_passing_test() -> None:
df = pl.DataFrame(
{
"a": [1, 2, 3, 4, 5],
"b": ["hat", "bat", "cat", "mat", "sat"],
"c": [0.2, 0.4, 0.2, 0.56, 0.1],
}
)
starter_test = profile.starter_tests_from_sampling(df, samples=100, n=5)
result = execution.test(df, starter_test)
assert result.success


def test_starter_tests_from_samples_returns_passing_test() -> None:
df = pl.DataFrame(
{
"a": [1, 2, 3, 4, 5],
"b": ["hat", "bat", "cat", "mat", "sat"],
"c": [0.2, 0.4, 0.2, 0.56, 0.1],
}
)
starter_test = profile.starter_tests_from_samples(
[df.sample(fraction=0.5) for _ in range(100)]
)
result = execution.test(df, starter_test)
assert result.success


def test_margin_works_as_anticipated() -> None:
df = pl.DataFrame(
{
"a": [1, 2, 3, 4, 5],
"b": ["hat", "bat", "cat", "mat", "sat"],
"c": [0.2, 0.4, 0.2, 0.56, 0.1],
}
)
starter_test = profile.starter_tests_from_sampling(df, n=5, margin=50)
result = execution.test(df, starter_test)
assert result.success
impossible_test = profile.starter_tests_from_sampling(df, n=5, margin=-500)
result = execution.test(df, impossible_test)
assert not result.success


def test_save_tests_from_sampling_creates_expected_and_runnable_file(tmp_path) -> None:
df = pl.DataFrame(
{
"a": [1, 2, 3, 4, 5],
"b": ["hat", "bat", "cat", "mat", "sat"],
"c": [0.2, 0.4, 0.2, 0.56, 0.1],
}
)
profile.save_starter_tests_from_sampling(
str(tmp_path / "cool.yaml"), df, n=5, margin=1
)
result = execution.test(df, str(tmp_path / "cool.yaml"))
assert result.success


def test_save_tests_from_samples_creates_expected_and_runnable_file(tmp_path) -> None:
df = pl.DataFrame(
{
"a": [1, 2, 3, 4, 5],
"b": ["hat", "bat", "cat", "mat", "sat"],
"c": [0.2, 0.4, 0.2, 0.56, 0.1],
}
)
profile.save_starter_tests_from_samples(
str(tmp_path / "cool.json"),
[df.sample(fraction=0.5) for _ in range(10)],
)
result = execution.test(df, str(tmp_path / "cool.json"))
assert result.success
2 changes: 1 addition & 1 deletion wimsey/_version.py
@@ -1 +1 @@
__version__ = "0.3.2"
__version__ = "0.4.0"
2 changes: 1 addition & 1 deletion wimsey/config.py
@@ -39,7 +39,7 @@ def read_config(path: str, storage_options: dict | None = None) -> list[Callable
config: dict
with fsspec.open(path, "rt", **storage_options_dict) as file:
contents = file.read()
if path.endswith(".yaml"):
if path.endswith(".yaml") or path.endswith(".yml"):
try:
import yaml

28 changes: 26 additions & 2 deletions wimsey/dataframe.py
@@ -25,6 +25,7 @@ def describe(
"type",
"count",
"null",
"null_percentage",
"length",
]

@@ -59,9 +60,14 @@ def describe(
required_exprs += [
nw.lit(str(df.schema[c])).alias(f"type_{c}") for c in columns_to_check
]
if "count" in metrics or "null" in metrics or "leghth" in metrics:
if (
"count" in metrics
or "null" in metrics
or "length" in metrics
or "null_percentage" in metrics
):
required_exprs += [nw.col(*columns_to_check).count().name.prefix("count_")]
if "null" in metrics or "length" in metrics:
if "null" in metrics or "length" in metrics or "null_percentage" in metrics:
required_exprs += [
nw.col(*columns_to_check).null_count().name.prefix("null_count_")
]
@@ -88,3 +94,21 @@ def describe(
k: v[0]
for k, v in df_metrics.collect().to_dict(as_series=False).items() # type: ignore[union-attr]
}


def profile_from_sampling(
df: FrameT,
samples: int = 100,
n: int | None = None,
fraction: float | None = None,
) -> list[dict[str, float]]:
return [
describe(df.sample(n=n, fraction=fraction, with_replacement=True))
for _ in range(samples)
]


def profile_from_samples(
samples: list[FrameT],
) -> list[dict[str, float]]:
return [describe(i) for i in samples]
3 changes: 2 additions & 1 deletion wimsey/execution.py
@@ -14,7 +14,8 @@ class final_result:
results: list[result]


class DataValidationException(Exception): ...
class DataValidationException(Exception):
...


def _as_set(val: Any) -> set: