Fix installation process for Serverless (#150)
## Changes
* Removed the pyspark dependency from the library; it is now enabled for testing and the CLI only.
* Updated Databricks CLI version requirement
* Added nightly CI
* Refactored ci commands

Requires PR for supporting extras in Databricks CLI to be merged and
released: databricks/cli#2288

### Linked issues
Resolves #139 

### Tests

- [x] manually tested
- [ ] added unit tests
- [ ] added integration tests
mwojtyczka authored Feb 13, 2025
1 parent de11239 commit 6e4bcdc
Showing 15 changed files with 224 additions and 115 deletions.
File renamed without changes.
7 changes: 7 additions & 0 deletions .github/workflows/acceptance.yml
@@ -45,6 +45,13 @@ jobs:
- name: Run unit tests and generate test coverage report
run: make test

# Integration tests are run from within tests/integration folder.
# We need to make sure .coveragerc is there so that code coverage is generated for the right modules.
- name: Prepare code coverage configuration for integration tests
run: cp .coveragerc tests/integration

# Run tests from `tests/integration` as defined in .codegen.json
# and generate code coverage for modules defined in .coveragerc
- name: Run integration tests and generate test coverage report
uses: databrickslabs/sandbox/acceptance@acceptance/v0.4.3
with:
2 changes: 1 addition & 1 deletion .github/workflows/docs-release.yml
@@ -4,7 +4,7 @@ on:
push:
tags:
- 'v[0-9]+.[0-9]+.[0-9]+' # Must match semantic version tags like 'v1.2.3'
workflow_dispatch: # Enables manual triggering of the workflow
workflow_dispatch: # Allows manual triggering of the workflow

jobs:
build:
62 changes: 62 additions & 0 deletions .github/workflows/nightly.yml
@@ -0,0 +1,62 @@
name: nightly

on:
workflow_dispatch: # Allows manual triggering of the workflow
schedule:
- cron: '0 4 * * *' # Runs automatically at 4:00 AM UTC every day

permissions:
id-token: write
issues: write
contents: read
pull-requests: read

concurrency:
group: single-acceptance-job-per-repo

jobs:
integration:
environment: tool
runs-on: larger
steps:
- name: Checkout Code
uses: actions/checkout@v4
with:
fetch-depth: 0

- name: Install Python
uses: actions/setup-python@v5
with:
cache: 'pip'
cache-dependency-path: '**/pyproject.toml'
python-version: '3.10'

- name: Install hatch
run: pip install hatch==1.9.4

- name: Run unit tests and generate test coverage report
run: make test

# Acceptance tests are run from within tests/integration folder.
# We need to make sure .coveragerc is there so that code coverage is generated for the right modules.
- name: Prepare .coveragerc for integration tests
run: cp .coveragerc tests/integration

# Run tests from `tests/integration` as defined in .codegen.json
# and generate code coverage for modules defined in .coveragerc
- name: Run integration tests and generate test coverage report
uses: databrickslabs/sandbox/acceptance@acceptance/v0.4.3
with:
vault_uri: ${{ secrets.VAULT_URI }}
timeout: 2h
create_issues: true
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
ARM_CLIENT_ID: ${{ secrets.ARM_CLIENT_ID }}
ARM_TENANT_ID: ${{ secrets.ARM_TENANT_ID }}

# collects all coverage reports: coverage.xml from integration tests, coverage-unit.xml from unit tests
- name: Publish test coverage
uses: codecov/codecov-action@v5
with:
use_oidc: true
2 changes: 1 addition & 1 deletion Makefile
@@ -29,7 +29,7 @@ setup_spark_remote:
test: setup_spark_remote ci-test

coverage:
hatch run coverage && open htmlcov/index.html
hatch run coverage; open htmlcov/index.html

docs-build:
yarn --cwd docs/dqx build
2 changes: 1 addition & 1 deletion demos/dqx_demo_tool.py
@@ -227,6 +227,6 @@
from databricks.labs.dqx.contexts.workspace import WorkspaceContext

ctx = WorkspaceContext(WorkspaceClient())
dashboards_folder_link = f"{ctx.installation.workspace_link("")}dashboards/"
dashboards_folder_link = f"{ctx.installation.workspace_link('')}dashboards/"
print(f"Open a dashboard from the following folder and refresh it:")
print(dashboards_folder_link)
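The quote change above is needed because, before Python 3.12 (PEP 701), an f-string could not reuse its own quote character inside a replacement field. A minimal illustration:

```python
# Inner quotes must differ from the outer ones on Python 3.11 and earlier:
folder = f"{'dashboards'}/"    # OK: single quotes inside a double-quoted f-string
# folder = f"{"dashboards"}/"  # SyntaxError before Python 3.12
print(folder)
```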
3 changes: 3 additions & 0 deletions docs/dqx/docs/demos.mdx
@@ -8,3 +8,6 @@ Install the framework per the [installation](/docs/installation) guide, and import the following
* [DQX Demo Notebook (library)](https://github.com/databrickslabs/dqx/blob/main/demos/dqx_demo_library.py) - demonstrates how to use DQX as a library.
* [DQX Demo Notebook (tool)](https://github.com/databrickslabs/dqx/blob/main/demos/dqx_demo_tool.py) - demonstrates how to use DQX as a tool when installed in the workspace.
* [DQX DLT Demo Notebook](https://github.com/databrickslabs/dqx/blob/main/demos/dqx_dlt_demo.py) - demonstrates how to use DQX with Delta Live Tables (DLT).

Note that DQX does not have to be run from a notebook. You can run it from any Python script, as long as it runs on Databricks.
For example, you can add DQX as a library to your job or cluster, as sketched below.
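As a minimal sketch (assuming `databricks-labs-dqx` is installed in the job or cluster environment; the import paths match the demo notebooks):

```python
# Minimal standalone script using DQX as a library on Databricks.
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine

# When running on Databricks, the workspace client authenticates automatically.
dq_engine = DQEngine(WorkspaceClient())
```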
191 changes: 116 additions & 75 deletions docs/dqx/docs/dev/contributing.mdx
@@ -19,49 +19,43 @@ While minimizing external dependencies is essential, exceptions can be made case by case when
justified, such as when a well-established and actively maintained library provides significant benefits, like time savings, performance improvements,
or specialized functionality unavailable in standard libraries.

## Common fixes for `mypy` errors

See https://mypy.readthedocs.io/en/stable/cheat_sheet_py3.html for more details

### ..., expression has type "None", variable has type "str"

* Add `assert ... is not None` if it's a body of a method. Example:

```
# error: Argument 1 to "delete" of "DashboardWidgetsAPI" has incompatible type "str | None"; expected "str"
self._ws.dashboard_widgets.delete(widget.id)
```

after

```
assert widget.id is not None
self._ws.dashboard_widgets.delete(widget.id)
```

* Add `... | None` if it's in the dataclass. Example: `cloud: str = None` -> `cloud: str | None = None`

### ..., has incompatible type "Path"; expected "str"

Add `.as_posix()` to convert Path to str

### Argument 2 to "get" of "dict" has incompatible type "None"; expected ...
## First contribution

Add a valid default value for the dictionary return.
If you're interested in contributing, please create a PR, reach out to us, or open an issue to discuss your ideas.

Example:
```python
def viz_type(self) -> str:
    return self.viz.get("type", None)
```
Here are example steps for submitting your first contribution:

after:
1. Fork the repo. You can also create a branch if you are added as a writer to the repo.
2. Clone it locally: `git clone`
3. `git checkout main` (or `gcm` if you're using [ohmyzsh](https://ohmyz.sh/)).
4. `git pull` (or `gl` if you're using [ohmyzsh](https://ohmyz.sh/)).
5. `git checkout -b FEATURENAME` (or `gcb FEATURENAME` if you're using [ohmyzsh](https://ohmyz.sh/)).
6. .. do the work
7. `make fmt`
8. `make lint`
9. .. fix if any issues reported
10. `make setup_spark_remote`, `make test` and `make integration`, and optionally `make coverage` (generate coverage report)
11. .. fix if any issues reported
12. `git commit -S -a -m "message"`

Make sure to enter a meaningful commit message title.
You need to sign commits with your GPG key (hence the `-S` option).
To set up a GPG key in your GitHub account, follow [these instructions](https://docs.github.com/en/github/authenticating-to-github/managing-commit-signature-verification).
You can configure Git to sign all commits with your GPG key by default: `git config --global commit.gpgsign true`

If you have not signed your commits initially, you can re-apply all of them and sign as follows:
```shell
git reset --soft HEAD~<how-many-commits-to-go-back>
git commit -S --reuse-message=ORIG_HEAD
git push -f origin <remote-branch-name>
```
13. `git push origin FEATURENAME`

Example:
```python
def viz_type(self) -> str:
    return self.viz.get("type", "UNKNOWN")
```
To access the repository, you must use the HTTPS remote with a personal access token, or SSH with an SSH key and passphrase that has been authorized for the `databrickslabs` organization.
14. Go to the GitHub UI and create a PR. Alternatively, run `gh pr create` (if you have the [GitHub CLI](https://cli.github.com/) installed).
Use a meaningful pull request title because it'll appear in the release notes. Use `Resolves #NUMBER` in pull
request description to [automatically link it](https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/using-keywords-in-issues-and-pull-requests#linking-a-pull-request-to-an-issue)
to an existing issue.

## Local Setup

@@ -98,9 +92,9 @@ The command `make setup_spark_remote` sets up the environment for running unit tests.
DQX uses Databricks Connect as a test dependency, which restricts the creation of a Spark session in local mode.
To enable local Spark execution for unit testing, the command installs Spark remote, as illustrated below.
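For illustration, a unit test could then obtain its session roughly as follows (a sketch; the `sc://localhost` endpoint is an assumption about the default local Spark Connect setup):

```python
from pyspark.sql import SparkSession

# Connect to the local Spark Connect server started by `make setup_spark_remote`.
spark = SparkSession.builder.remote("sc://localhost").getOrCreate()
spark.createDataFrame([(1, "a")], "id int, value string").show()
```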

### Local setup for integration tests and code coverage
### Running integration tests and code coverage

Note that integration tests and code coverage are run automatically when you create a Pull Request in Github.
Integration tests and code coverage are run automatically when you create a Pull Request in GitHub.
You can also trigger the tests from a local machine by configuring authentication to a Databricks workspace.
You can use any Unity Catalog enabled Databricks workspace.

@@ -171,12 +165,29 @@ To run integration tests on serverless compute, add the `DATABRICKS_SERVERLESS_COMPUTE_ID` environment variable to your debug configuration:
```
{
  "env": {
    "DATABRICKS_SERVERLESS_COMPUTE_ID": "auto"
  }
}
```
When `DATABRICKS_SERVERLESS_COMPUTE_ID` is set the `DATABRICKS_CLUSTER_ID` is ignored, and tests will run on serverless compute.
When `DATABRICKS_SERVERLESS_COMPUTE_ID` is set, `DATABRICKS_CLUSTER_ID` is ignored and tests run on serverless compute.

## Running CLI from the local repo
## Manual testing of the framework

We require that all changes be covered by unit tests and integration tests. A pull request (PR) will be blocked if the code coverage is negatively impacted by the proposed change.
However, manual testing may still be useful before creating or merging a PR.

To test DQX from your feature branch, you can install it directly as follows:
```commandline
pip install git+https://github.com/databrickslabs/dqx.git@feature_branch_name
```

Replace `feature_branch_name` with the name of your branch.

## Manual testing of the CLI commands from the current codebase

Once you clone the repo locally and install the Databricks CLI, you can run labs CLI commands from the root of the repository.
Similar to other databricks cli commands we can specify profile to use with `--profile`.
As with other Databricks CLI commands, you can specify the Databricks profile to use with `--profile`.

Build the project:
```commandline
make dev
```

Authenticate your current machine to your Databricks Workspace:
```commandline
databricks auth login --host <WORKSPACE_HOST>
```

@@ -190,6 +201,7 @@
```commandline
databricks labs show .
```

Install dqx:
```commandline
# use the current codebase
databricks labs install .
```

@@ -203,43 +215,72 @@
Uninstall DQX:
```commandline
databricks labs uninstall dqx
```

## First contribution
## Manual testing of the CLI commands from a pre-release version

If you're interested in contributing, please reach out to us or open an issue to discuss your ideas.
To contribute, you need to be added as a writer to the repository.
Please note that we currently do not accept external contributors.
In most cases, installing DQX directly from the current codebase is sufficient to test CLI commands. However, this approach may not be ideal in some cases because the CLI would use the current development virtual environment.
When DQX is installed from a released version, it creates a fresh and isolated Python virtual environment locally and installs all the required packages, ensuring a clean setup.
If you need to perform end-to-end testing of the CLI before an official release, follow the process outlined below.

Here are the example steps to submit your first contribution:
Note: this is only available for GitHub accounts that have write access to the repository; if you contribute from a fork, this method is not available.

1. Make a branch in the dqx repo
2. `git clone`
3. `git checkout main` (or `gcm` if you're using [ohmyzsh](https://ohmyz.sh/)).
4. `git pull` (or `gl` if you're using [ohmyzsh](https://ohmyz.sh/)).
5. `git checkout -b FEATURENAME` (or `gcb FEATURENAME` if you're using [ohmyzsh](https://ohmyz.sh/)).
6. .. do the work
7. `make fmt`
8. `make lint`
9. .. fix if any
10. `make setup_spark_remote`, `make test` and `make integration`, optionally `make coverage` to get test coverage report
11. .. fix if any issues
12. `git commit -S -a -m "message"`.
Make sure to enter a meaningful commit message title.
You need to sign commits with your GPG key (hence -S option).
To setup GPG key in your Github account follow [these instructions](https://docs.github.com/en/github/authenticating-to-github/managing-commit-signature-verification).
You can configure Git to sign all commits with your GPG key by default: `git config --global commit.gpgsign true`
13. `git push origin FEATURENAME`
14. Go to GitHub UI and create PR. Alternatively, `gh pr create` (if you have [GitHub CLI](https://cli.github.com/) installed).
Use a meaningful pull request title because it'll appear in the release notes. Use `Resolves #NUMBER` in pull
request description to [automatically link it](https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/using-keywords-in-issues-and-pull-requests#linking-a-pull-request-to-an-issue)
to an existing issue.
If you have not signed your commits initially, you can re-apply all of them and sign as follows:
```shell
git reset --soft HEAD~<how-many-commits-to-go-back>
git commit -S --reuse-message=ORIG_HEAD
git push -f origin <remote-branch-name>
```

```commandline
# create new tag
git tag v0.1.12-alpha
# push the tag
git push origin v0.1.12-alpha
# specify the tag (pre-release version)
databricks labs install dqx@v0.1.12-alpha
```

The release pipeline only triggers when a valid semantic version is provided (e.g. v0.1.12).
Pre-release versions (e.g. v0.1.12-alpha) do not trigger the release pipeline, allowing you to test changes safely before making an official release.
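To make the distinction concrete, here is a sketch mirroring the `v[0-9]+.[0-9]+.[0-9]+` tag filter from `.github/workflows/docs-release.yml`:

```python
import re

# Tags matching this pattern trigger the release pipeline;
# suffixed pre-release tags do not.
RELEASE_TAG = re.compile(r"^v\d+\.\d+\.\d+$")

assert RELEASE_TAG.match("v0.1.12")            # triggers a release
assert not RELEASE_TAG.match("v0.1.12-alpha")  # safe for testing
```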

## Troubleshooting

If you encounter any package dependency errors after `git pull`, run `make clean`.

### Common fixes for `mypy` errors

See https://mypy.readthedocs.io/en/stable/cheat_sheet_py3.html for more details

**..., expression has type "None", variable has type "str"**

* Add `assert ... is not None` if it's in the body of a method. Example:

```
# error: Argument 1 to "delete" of "DashboardWidgetsAPI" has incompatible type "str | None"; expected "str"
self._ws.dashboard_widgets.delete(widget.id)
```

after

```
assert widget.id is not None
self._ws.dashboard_widgets.delete(widget.id)
```

* Add `... | None` if it's in the dataclass. Example: `cloud: str = None` -> `cloud: str | None = None`
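A minimal sketch of the dataclass case (the class and field names are illustrative):

```python
from dataclasses import dataclass


@dataclass
class ClusterSpec:
    # cloud: str = None   # mypy: expression has type "None", variable has type "str"
    cloud: str | None = None  # widen the annotation to allow the None default
```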

**..., has incompatible type "Path"; expected "str"**

Add `.as_posix()` to convert Path to str
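For example (a sketch with a hypothetical helper typed to accept `str`):

```python
from pathlib import Path


def read_config(path: str) -> str:  # hypothetical function expecting a str
    with open(path) as f:
        return f.read()


cfg = Path("tests") / "integration" / ".coveragerc"
# read_config(cfg)                  # mypy: incompatible type "Path"; expected "str"
text = read_config(cfg.as_posix())  # convert the Path to a POSIX-style str
```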

**Argument 2 to "get" of "dict" has incompatible type "None"; expected ...**

Add a valid default value for the dictionary return.

Example:
```python
def viz_type(self) -> str:
    return self.viz.get("type", None)
```

after:

```python
def viz_type(self) -> str:
    return self.viz.get("type", "UNKNOWN")
```
14 changes: 8 additions & 6 deletions docs/dqx/docs/guide.mdx
@@ -419,20 +419,22 @@ Discover the full list of available data quality rules and learn how to define your own.
To perform data quality checking with DQX, you need to create a `DQEngine` object.
The engine requires a Databricks workspace client for authentication and interaction with the Databricks workspace.

When running the code on a Databricks workspace (e.g. in a notebook or as a job), the workspace client is automatically authenticated.
For external environments (e.g. CI servers or local machines), you can authenticate using any method supported by the Databricks SDK. Detailed instructions are available in the [default authentication flow](https://databricks-sdk-py.readthedocs.io/en/latest/authentication.html#default-authentication-flow).

If you use Databricks [configuration profiles](https://docs.databricks.com/dev-tools/auth.html#configuration-profiles) or Databricks-specific [environment variables](https://docs.databricks.com/dev-tools/auth.html#environment-variables) for authentication, you only need the following code to create a workspace client:
When running the code on a Databricks workspace, the workspace client is automatically authenticated, whether DQX is used in a notebook, script, or as part of a job/workflow.
If you run DQX on a Databricks workspace, you only need the following code to create the workspace client:
```python
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine

ws = WorkspaceClient()

# use the workspace client to create the DQX engine
dq_engine = DQEngine(ws)
```

For external environments, such as CI servers or local machines, you can authenticate to Databricks using any method supported by the Databricks SDK. For detailed instructions, refer to the [default authentication flow](https://databricks-sdk-py.readthedocs.io/en/latest/authentication.html#default-authentication-flow).
If you're using Databricks [configuration profiles](https://docs.databricks.com/dev-tools/auth.html#configuration-profiles) or Databricks-specific [environment variables](https://docs.databricks.com/dev-tools/auth.html#environment-variables) for authentication, you can easily create the workspace client without needing to provide additional arguments:
```python
ws = WorkspaceClient()
```
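For example, to pick a named profile explicitly (a sketch; the profile name is illustrative and must exist in your `~/.databrickscfg`):

```python
from databricks.sdk import WorkspaceClient

# Use a specific configuration profile instead of the default resolution order.
ws = WorkspaceClient(profile="my-workspace")
```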

For details on the specific methods available in the engine, refer to the [reference](/docs/reference#dq-engine-methods) section.

Information on testing applications that use `DQEngine` can be found [here](/docs/reference#testing-applications-using-dqx).