✨ Test generation support introduced (#22)
benrutter authored Nov 14, 2024
1 parent 6c60d68 commit 20a9b47
Showing 13 changed files with 447 additions and 13 deletions.
78 changes: 78 additions & 0 deletions docs/building-tests.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,78 @@
Alongside *running* tests, Wimsey also has some functions to aid *building* tests. This can be useful if you want to automagically create some sensible initial tests for multiple datasets, without needing to type them out by hand, or create them manually in code.

As with the rest of Wimsey, your own dataframe engine is used to compute the relevant statistics. Wimsey can either generate starter tests from *a list of samples*, or it can use *sampling with replacement* to generate samples for you from a single dataframe. If you use the latter, note that Wimsey will need to *evaluate each sample individually*, so if you are using a lazy framework such as Polars' LazyFrames, Dask or Modin, you will likely want to collect your results first, or implement a caching mechanism to avoid unnecessary repeated computation.


## What is margin?

You'll see the keyword *margin* throughout Wimsey test building, so it's worth explaining it up front.

Margin is the amount of *extra allowance* tests give, based on the sample. For instance, if Wimsey has three samples with a "column_a" maximum of 1, 2 and 3, rather than creating a test for the maximum being 3 (the highest value seen in the samples), Wimsey will allow an amount of 'give' in the test.

This 'give' is based on the *standard deviation of the statistical metric*, which for the above example is 1, meaning that Wimsey would build a test expecting the maximum to be less than or equal to 4.

If this is all gibberish to you, don't worry: the `margin` keyword defaults to 1, which is often a sensible choice. If you find that Wimsey is creating too-strict tests, bump it up slightly; if tests are too lax, you can reduce margin to a smaller positive number.

Setting `margin` to a negative value means you're creating a test that your given sample would fail, and while supported, is unlikely to be what you're looking to do.
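To make the arithmetic concrete, here's a plain-Python sketch of the idea (an illustration only, not Wimsey's internal code):

```python
import statistics

# Maxima of "column_a" observed across three samples
sampled_maxima = [1, 2, 3]
margin = 1  # the default

# The 'give' is the standard deviation of the metric across samples
spread = statistics.stdev(sampled_maxima)  # 1.0 here

# The generated test then expects the maximum to be at most:
upper_bound = max(sampled_maxima) + margin * spread
print(upper_bound)  # 4.0
```

Raising `margin` widens `upper_bound` and loosens the test; shrinking it toward zero tightens the test to exactly what the samples showed.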

## From Sampling

From a single dataframe, Wimsey will sample with replacement to build a starter test. The `samples` keyword specifies the number of samples you want Wimsey to build, while `n` or `fraction` tell Wimsey the size (in rows) or fraction (as a float) of each sample to take. Note that you *can't supply both the `n` and `fraction` keywords*.
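If sampling with replacement is unfamiliar: it means every draw is made from the full dataframe, so the same row can appear more than once within a sample. A rough stdlib illustration of the concept (Wimsey delegates the actual sampling to your dataframe engine):

```python
import random

random.seed(0)  # reproducibility for this sketch
rows = [{"a": 1}, {"a": 2}, {"a": 3}]

# Draw 5 samples of 2 rows each; rows may repeat within a sample
samples = [random.choices(rows, k=2) for _ in range(5)]

assert len(samples) == 5
assert all(len(sample) == 2 for sample in samples)
```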

Wimsey has a `starter_tests_from_sampling` function and a `save_starter_tests_from_sampling` function, depending on whether you intend to return the tests as a dictionary or save them to a file. `save_starter_tests_from_sampling` takes the exact same arguments, with the addition of a `path` argument and an optional `storage_options` argument.

=== "starter_tests_from_sampling"
```python
import polars as pl
from wimsey.profile import starter_tests_from_sampling

df = pl.DataFrame({"a": [1, 2, 3], "b": ["cool", "bat", "hat"]})
tests: list[dict] = starter_tests_from_sampling(df, samples=5_000, n=2, margin=3)
```
=== "save_starter_tests_from_sampling"
```python
import pandas as pd
from wimsey.profile import starter_tests_from_sampling

df = pd.DataFrame({"a": [1, 2, 3], "b": ["cool", "bat", "hat"]})
tests: list[dict] = save_starter_tests_from_sampling(
path="my-first-test.json",
df=df,
samples=5_000,
fraction=0.5,
margin=3,
)
```

## From Samples

From a list (or other iterable, such as a generator) of supported dataframes, Wimsey will produce a list of passing tests.

Wimsey has a `starter_tests_from_samples` function and a `save_starter_tests_from_samples` function, depending on whether you intend to return the tests as a dictionary or save them to a file. `save_starter_tests_from_samples` takes the exact same arguments, with the addition of a `path` argument and an optional `storage_options` argument.

=== "starter_tests_from_samples"
```python
from glob import glob

import pandas as pd
from wimsey.profile import starter_tests_from_samples

dfs = [pd.read_csv(i) for i in glob("folder/of/samples/*.csv")]
tests: list[dict] = starter_tests_from_samples(dfs, margin=1.5)
```
=== "save_starter_tests_from_samples"
```python
from glob import glob

import polars as pl
from wimsey.profile import save_starter_tests_from_samples

from config import my_storage_options

save_starter_tests_from_samples(
path="s3://test-store/cooltest.yaml",
samples=[pl.read_parquet(i) for i in glob("folder/of/samples/*.parquet")],
margin=0.8,
storage_options=my_storage_options,
)
```
2 changes: 2 additions & 0 deletions docs/index.md
@@ -12,6 +12,8 @@ Ideally, all data would be usable when you receive it, but you probably already

A data contract is an expression of what *should* be true of some data, such as that it should 'only have columns x and y' or 'the values of column a should never exceed 1'. Wimsey is a library built to run these contracts on a dataframe during python runtime.
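For instance, a contract might look something like the following YAML (the field names here are purely illustrative; see the test catalogue for the real options):

```yaml
# Hypothetical contract: consult the test catalogue for actual test names
- test: columns_should
  be: [a, b]
- test: max_should
  column: a
  be_less_than_or_equal_to: 1
```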

Additionally, Wimsey has tools to [help you generate sensible tests from a data sample](building-tests.md).

Wimsey is built on top of the awesome [Narwhals](https://github.com/narwhals-dev/narwhals) and natively supports any dataframes that Narwhals does. At the time of writing, that includes Polars, Pandas, Arrow, Dask, Rapids and Modin.

If you're looking to get a quick feel for Wimsey, check out the [quick start documentation](quick-start.md)
6 changes: 4 additions & 2 deletions docs/quick-start.md
@@ -74,7 +74,7 @@ We can test for a lot more than that, but that works for our example. Our first
]
```

See [Possible Tests](possible_tests.md) for a full catalogue of runnable tests and their configurations.
See [Possible Tests](possible-tests.md) for a full catalogue of runnable tests and their configurations.

### Executing Tests

@@ -210,4 +210,6 @@ Validate will run tests in the exact same way as `test`, but simply raises an e
print(f"{top_sleuth} is the best sleuth!")
```

And that's it, to keep things simple `validate` and `test` are the only public-intended functions in Wimsey, aside from test creation, which is covered further in the *possible tests* section.
And that's it for testing. To keep things simple, `validate` and `test` are the only public-intended functions in Wimsey, aside from test creation, which is covered further in the *possible tests* section.

Wimsey also supports *generating tests*; see [the building tests section](building-tests.md) for how to get started.
5 changes: 4 additions & 1 deletion mkdocs.yml
@@ -2,12 +2,15 @@ site_name: Wimsey

nav:
- Intro: index.md
- Motivation: motivation.md
- Quick Start: quick-start.md
- Building Tests: building-tests.md
- Motivation: motivation.md
- Test Catalogue: possible-tests.md

theme:
name: material
icon:
logo: fontawesome/solid/magnifying-glass-chart
palette:
- scheme: default
primary: brown
12 changes: 8 additions & 4 deletions tests/test_config.py
@@ -47,7 +47,8 @@ class DummyOpenFile:
def __enter__(self, *args, **kwargs):
return self

def __exit__(self, *args, **kwargs): ...
def __exit__(self, *args, **kwargs):
...

def read(self, *args, **kwargs):
return yaml.dump(test_suite)
@@ -65,7 +66,8 @@ class DummyOpenFile:
def __enter__(self, *args, **kwargs):
return self

def __exit__(self, *args, **kwargs): ...
def __exit__(self, *args, **kwargs):
...

def read(self, *args, **kwargs):
return json.dumps(test_suite)
@@ -83,7 +85,8 @@ class DummyOpenFile:
def __enter__(self, *args, **kwargs):
return self

def __exit__(self, *args, **kwargs): ...
def __exit__(self, *args, **kwargs):
...

def read(self, *args, **kwargs):
return json.dumps(test_suite)
@@ -104,7 +107,8 @@ class DummyOpenFile:
def __enter__(self, *args, **kwargs):
return self

def __exit__(self, *args, **kwargs): ...
def __exit__(self, *args, **kwargs):
...

def read(self, *args, **kwargs):
return "dsafasdfasdf"
19 changes: 19 additions & 0 deletions tests/test_dataframe.py
@@ -37,3 +37,22 @@ def test_that_describe_excludes_non_specified_column_and_metric_combos() -> None
assert "count_a" in actual
assert "count_b" not in actual
assert "min_a" not in actual


def test_that_profile_by_sampling_returns_list_of_dicts_of_expected_length() -> None:
df = pl.DataFrame({"a": [1.2, 1.3, 1.4], "b": ["one", "two", None]})
actual = dataframe.profile_from_sampling(df, samples=10, n=1)
assert len(actual) == 10
assert actual[0]["mean_a"] in [1.2, 1.3, 1.4]
assert actual[4]["columns"] == "a_^&^_b"


def test_that_profile_from_samples_returns_list_of_dicts_of_expected_length() -> None:
dfs = [
pl.DataFrame({"a": [1.2, 1.3, 1.4], "b": ["one", "two", None]})
for _ in range(20)
]
actual = dataframe.profile_from_samples(dfs)
assert len(actual) == 20
assert actual[10]["mean_a"] == 1.3
assert actual[4]["columns"] == "a_^&^_b"
79 changes: 79 additions & 0 deletions tests/test_profile.py
@@ -0,0 +1,79 @@
import polars as pl

from wimsey import profile
from wimsey import execution


def test_starter_tests_from_sampling_returns_passing_test() -> None:
df = pl.DataFrame(
{
"a": [1, 2, 3, 4, 5],
"b": ["hat", "bat", "cat", "mat", "sat"],
"c": [0.2, 0.4, 0.2, 0.56, 0.1],
}
)
starter_test = profile.starter_tests_from_sampling(df, samples=100, n=5)
result = execution.test(df, starter_test)
assert result.success


def test_starter_tests_from_samples_returns_passing_test() -> None:
df = pl.DataFrame(
{
"a": [1, 2, 3, 4, 5],
"b": ["hat", "bat", "cat", "mat", "sat"],
"c": [0.2, 0.4, 0.2, 0.56, 0.1],
}
)
starter_test = profile.starter_tests_from_samples(
[df.sample(fraction=0.5) for _ in range(100)]
)
result = execution.test(df, starter_test)
assert result.success


def test_margin_works_as_anticipated() -> None:
df = pl.DataFrame(
{
"a": [1, 2, 3, 4, 5],
"b": ["hat", "bat", "cat", "mat", "sat"],
"c": [0.2, 0.4, 0.2, 0.56, 0.1],
}
)
starter_test = profile.starter_tests_from_sampling(df, n=5, margin=50)
result = execution.test(df, starter_test)
assert result.success
impossible_test = profile.starter_tests_from_sampling(df, n=5, margin=-500)
result = execution.test(df, impossible_test)
assert not result.success


def test_save_tests_from_sampling_creates_expected_and_runnable_file(tmp_path) -> None:
df = pl.DataFrame(
{
"a": [1, 2, 3, 4, 5],
"b": ["hat", "bat", "cat", "mat", "sat"],
"c": [0.2, 0.4, 0.2, 0.56, 0.1],
}
)
profile.save_starter_tests_from_sampling(
str(tmp_path / "cool.yaml"), df, n=5, margin=1
)
result = execution.test(df, str(tmp_path / "cool.yaml"))
assert result.success


def test_save_tests_from_samples_creates_expected_and_runnable_file(tmp_path) -> None:
df = pl.DataFrame(
{
"a": [1, 2, 3, 4, 5],
"b": ["hat", "bat", "cat", "mat", "sat"],
"c": [0.2, 0.4, 0.2, 0.56, 0.1],
}
)
profile.save_starter_tests_from_samples(
str(tmp_path / "cool.json"),
[df.sample(fraction=0.5) for _ in range(10)],
)
result = execution.test(df, str(tmp_path / "cool.json"))
assert result.success
2 changes: 1 addition & 1 deletion wimsey/_version.py
@@ -1 +1 @@
__version__ = "0.3.2"
__version__ = "0.4.0"
2 changes: 1 addition & 1 deletion wimsey/config.py
@@ -39,7 +39,7 @@ def read_config(path: str, storage_options: dict | None = None) -> list[Callable
config: dict
with fsspec.open(path, "rt", **storage_options_dict) as file:
contents = file.read()
if path.endswith(".yaml"):
if path.endswith(".yaml") or path.endswith(".yml"):
try:
import yaml

28 changes: 26 additions & 2 deletions wimsey/dataframe.py
@@ -25,6 +25,7 @@ def describe(
"type",
"count",
"null",
"null_percentage",
"length",
]

@@ -59,9 +60,14 @@ def describe(
required_exprs += [
nw.lit(str(df.schema[c])).alias(f"type_{c}") for c in columns_to_check
]
if "count" in metrics or "null" in metrics or "leghth" in metrics:
if (
"count" in metrics
or "null" in metrics
or "length" in metrics
or "null_percentage" in metrics
):
required_exprs += [nw.col(*columns_to_check).count().name.prefix("count_")]
if "null" in metrics or "length" in metrics:
if "null" in metrics or "length" in metrics or "null_percentage" in metrics:
required_exprs += [
nw.col(*columns_to_check).null_count().name.prefix("null_count_")
]
@@ -88,3 +94,21 @@ def describe(
k: v[0]
for k, v in df_metrics.collect().to_dict(as_series=False).items() # type: ignore[union-attr]
}


def profile_from_sampling(
df: FrameT,
samples: int = 100,
n: int | None = None,
fraction: float | None = None,
) -> list[dict[str, float]]:
return [
describe(df.sample(n=n, fraction=fraction, with_replacement=True))
for _ in range(samples)
]


def profile_from_samples(
samples: list[FrameT],
) -> list[dict[str, float]]:
return [describe(i) for i in samples]
3 changes: 2 additions & 1 deletion wimsey/execution.py
@@ -14,7 +14,8 @@ class final_result:
results: list[result]


class DataValidationException(Exception): ...
class DataValidationException(Exception):
...


def _as_set(val: Any) -> set: