-
Notifications
You must be signed in to change notification settings - Fork 1
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
✨ Test generation support introduced (#22)
- Loading branch information
Showing
13 changed files
with
447 additions
and
13 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,78 @@ | ||
Alongside *running* tests, Wimsey also has some functions to aid *building* tests. This can be useful if you want to automagically create some sensible initial tests for multiple datasets, without needing to type them out by hand, or create them manually in code. | ||
|
||
As with the rest of Wimsey, your own dataframe engine will be used to sample the relevant statistics. Wimsey can either generate starter tests from *a list of samples* or it can use *sampling with replacement* to generate samples for you from a single dataframe. If you use the latter, note that Wimsey will need to *evaluate each sample individually* so if you are using a lazy framework such as Polars' LazyFrames, Dask or Modin you will likely want to collect your results first, or implement a caching mechanism to avoid unnecessary repeated computation. | ||
|
||
|
||
## What is margin? | ||
|
||
You'll see the keyword *margin* throughout Wimsey test building, it's worth explaining here first. | ||
|
||
Margin is the amount of *extra allowance* tests give, based on the sample. For instance if Wimsey has three samples, with a "column_a" maximum of 1, 2 and 3, rather than creating a test for the maximum being 3 (the highest value seen in the samples), Wimsey will allow an amount of 'give' for the tests. | ||
|
||
This is based on the *standard deviation of the statistical metric*, and for the above example would be 1, meaning that Wimsey would build a test that expects the maximum to be less than or equal to 4. | ||
|
||
If this is all giberish to you, don't worry, the 'margin' keyword defaults to 1, which is often a sensible choice, if you find that Wimsey is creating to strict tests, bump it up slightly, if tests are too lax, you can reduce margin to a smaller positive number. | ||
|
||
Setting `margin` to a negative value means that your creating a test that your given sample would fail, and while supported, is unlikely to be what you're looking to do. | ||
|
||
## From Sampling | ||
|
||
From a single dataframe, Wimsey will sample with replacement to build a starter test. The `samples` keyword specifies the number of times you want Wimsey to build a sample, while `n` or `fraction` tell Wimsey the size (in rows) or fraction (as a float) of the sample to take. Note that you *can't supply both n AND fraction keywords to Wimsey*. | ||
|
||
Wimsey has a `starter_tests_from_sampling` function, and a `save_starter_tests_from_sampling` function dependent on whether you're intending to return the tests as a dictionary, or save them to a file. `save_starter_tests_from_sampling` takes the exact same arguments, but with the addition of a `path` and an optional `storage_options` argument. | ||
|
||
=== "starter_tests_from_sampling" | ||
```python | ||
import polars as pl | ||
from wimsey.profile import starter_tests_from_sampling | ||
|
||
df = pl.DataFrame({"a": [1, 2, 3], "b": ["cool", "bat", "hat"]}) | ||
tests: list[dict] = starter_tests_from_sampling(df, samples=5_000, n=2, margin=3) | ||
``` | ||
=== "save_starter_tests_from_sampling" | ||
```python | ||
import pandas as pd | ||
from wimsey.profile import starter_tests_from_sampling | ||
|
||
df = pd.DataFrame({"a": [1, 2, 3], "b": ["cool", "bat", "hat"]}) | ||
tests: list[dict] = save_starter_tests_from_sampling( | ||
path="my-first-test.json", | ||
df=df, | ||
samples=5_000, | ||
fraction=0.5, | ||
margin=3, | ||
) | ||
``` | ||
|
||
## From Samples | ||
|
||
From a list (or other iterable such as a generate) of supported dataframes, Wimsey will produce a list of passing tests. | ||
|
||
Wimsey has a `starter_tests_from_samples` function, and a `save_starter_tests_from_samples` function dependent on whether you're intending to return the tests as a dictionary, or save them to a file. `save_starter_tests_from_samples` takes the exact same arguments, but with the addition of a `path` and an optional `storage_options` argument. | ||
|
||
=== "starter_tests_from_samples" | ||
```python | ||
from glob import glob | ||
|
||
import pandas as pd | ||
from wimsey.profile import starter_tests_from_samples | ||
|
||
dfs = [pd.read_csv(i) for i in glob("folder/of/samples/*.csv")] | ||
tests: list[dict] = starter_tests_from_samples(dfs, margin=1.5) | ||
``` | ||
=== "save_starter_tests_from_samples" | ||
```python | ||
from glob import glob | ||
|
||
import polars as pl | ||
from wimsey.profile import save_starter_tests_from_samples | ||
|
||
from config import my_storage_options | ||
|
||
save_starter_tests_from_samples( | ||
path="s3://test-store/cooltest.yaml", | ||
samples=[pl.read_parquet(i) for i in glob("folder/of/samples/*.parquet")], | ||
margin=0.8, | ||
storage_options=my_storage_options, | ||
) | ||
``` |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,79 @@ | ||
import polars as pl | ||
|
||
from wimsey import profile | ||
from wimsey import execution | ||
|
||
|
||
def test_starter_tests_from_sampling_returns_passing_test() -> None: | ||
df = pl.DataFrame( | ||
{ | ||
"a": [1, 2, 3, 4, 5], | ||
"b": ["hat", "bat", "cat", "mat", "sat"], | ||
"c": [0.2, 0.4, 0.2, 0.56, 0.1], | ||
} | ||
) | ||
starter_test = profile.starter_tests_from_sampling(df, samples=100, n=5) | ||
result = execution.test(df, starter_test) | ||
assert result.success | ||
|
||
|
||
def test_starter_tests_from_samples_returns_passing_test() -> None: | ||
df = pl.DataFrame( | ||
{ | ||
"a": [1, 2, 3, 4, 5], | ||
"b": ["hat", "bat", "cat", "mat", "sat"], | ||
"c": [0.2, 0.4, 0.2, 0.56, 0.1], | ||
} | ||
) | ||
starter_test = profile.starter_tests_from_samples( | ||
[df.sample(fraction=0.5) for _ in range(100)] | ||
) | ||
result = execution.test(df, starter_test) | ||
assert result.success | ||
|
||
|
||
def test_margin_works_as_anticipated() -> None: | ||
df = pl.DataFrame( | ||
{ | ||
"a": [1, 2, 3, 4, 5], | ||
"b": ["hat", "bat", "cat", "mat", "sat"], | ||
"c": [0.2, 0.4, 0.2, 0.56, 0.1], | ||
} | ||
) | ||
starter_test = profile.starter_tests_from_sampling(df, n=5, margin=50) | ||
result = execution.test(df, starter_test) | ||
assert result.success | ||
impossible_test = profile.starter_tests_from_sampling(df, n=5, margin=-500) | ||
result = execution.test(df, impossible_test) | ||
assert not result.success | ||
|
||
|
||
def test_save_tests_from_sampling_creates_expected_and_runnable_file(tmp_path) -> None: | ||
df = pl.DataFrame( | ||
{ | ||
"a": [1, 2, 3, 4, 5], | ||
"b": ["hat", "bat", "cat", "mat", "sat"], | ||
"c": [0.2, 0.4, 0.2, 0.56, 0.1], | ||
} | ||
) | ||
profile.save_starter_tests_from_sampling( | ||
str(tmp_path / "cool.yaml"), df, n=5, margin=1 | ||
) | ||
result = execution.test(df, str(tmp_path / "cool.yaml")) | ||
assert result.success | ||
|
||
|
||
def test_save_tests_from_samples_creates_expected_and_runnable_file(tmp_path) -> None: | ||
df = pl.DataFrame( | ||
{ | ||
"a": [1, 2, 3, 4, 5], | ||
"b": ["hat", "bat", "cat", "mat", "sat"], | ||
"c": [0.2, 0.4, 0.2, 0.56, 0.1], | ||
} | ||
) | ||
profile.save_starter_tests_from_samples( | ||
str(tmp_path / "cool.json"), | ||
[df.sample(fraction=0.5) for _ in range(10)], | ||
) | ||
result = execution.test(df, str(tmp_path / "cool.json")) | ||
assert result.success |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1 @@ | ||
__version__ = "0.3.2" | ||
__version__ = "0.4.0" |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.