feat(ingest/cassandra): Add support for Cassandra as a source #11822

Merged
24 commits
d15793c
feat(ingest/cassandra): Add support for Cassandra as a source
sagar-salvi-apptware Nov 8, 2024
6e5885f
fix: linting error
sagar-salvi-apptware Nov 8, 2024
6251a30
fix: minor changes
sagar-salvi-apptware Nov 8, 2024
57a658f
test: fix ci test
sagar-salvi-apptware Nov 8, 2024
c978ead
fix: added view properties and dataset properties for views
sagar-salvi-apptware Nov 11, 2024
0ae30d4
fix: Fixed PR Comments and Added changes for profilling
sagar-salvi-apptware Nov 12, 2024
dcc3aa6
test: added profiling changes
sagar-salvi-apptware Nov 12, 2024
bbd377b
fix: minor fixes for profiling
sagar-salvi-apptware Nov 12, 2024
291e364
test: fixed tests for profiling and column lineage
sagar-salvi-apptware Nov 12, 2024
0d91919
fix: minor change
sagar-salvi-apptware Nov 12, 2024
9b0c143
docs: Created Pre-Requiste doc for cassandra
sagar-salvi-apptware Nov 13, 2024
13e6d91
fix: minor changes for profilling
sagar-salvi-apptware Nov 13, 2024
671a11c
fix: pr comments
sagar-salvi-apptware Nov 13, 2024
024e350
fix: change field path from v2 to v1
sagar-salvi-apptware Nov 13, 2024
1c1d745
tests: update golden files for fieldpaths
sagar-salvi-apptware Nov 13, 2024
cd278b5
fix: PR comments regarding Dataclasses
sagar-salvi-apptware Nov 14, 2024
9c7dded
tests: update golden files for latest changes
sagar-salvi-apptware Nov 14, 2024
e1063e0
fix: common profiling and minor changes
sagar-salvi-apptware Nov 14, 2024
72f9a90
test: fix ci issue
sagar-salvi-apptware Nov 14, 2024
9e247ba
fix: PR comments
sagar-salvi-apptware Nov 14, 2024
159c354
tests: update golden files for latest changes
sagar-salvi-apptware Nov 14, 2024
3186e79
fix: minor hardning changes
sagar-salvi-apptware Nov 15, 2024
f7dd7aa
fix: added platform for cassandra
sagar-salvi-apptware Nov 15, 2024
b07a681
fix: minor change
sagar-salvi-apptware Nov 15, 2024
4 changes: 4 additions & 0 deletions datahub-web-react/src/app/ingest/source/builder/constants.ts
@@ -36,6 +36,7 @@ import csvLogo from '../../../../images/csv-logo.png';
import qlikLogo from '../../../../images/qliklogo.png';
import sigmaLogo from '../../../../images/sigmalogo.png';
import sacLogo from '../../../../images/saclogo.svg';
import cassandraLogo from '../../../../images/cassandralogo.png';
import datahubLogo from '../../../../images/datahublogo.png';

export const ATHENA = 'athena';
@@ -129,6 +130,8 @@ export const SIGMA = 'sigma';
export const SIGMA_URN = `urn:li:dataPlatform:${SIGMA}`;
export const SAC = 'sac';
export const SAC_URN = `urn:li:dataPlatform:${SAC}`;
export const CASSANDRA = 'cassandra';
export const CASSANDRA_URN = `urn:li:dataPlatform:${CASSANDRA}`;
export const DATAHUB = 'datahub';
export const DATAHUB_GC = 'datahub-gc';
export const DATAHUB_LINEAGE_FILE = 'datahub-lineage-file';
@@ -175,6 +178,7 @@ export const PLATFORM_URN_TO_LOGO = {
[QLIK_SENSE_URN]: qlikLogo,
[SIGMA_URN]: sigmaLogo,
[SAC_URN]: sacLogo,
[CASSANDRA_URN]: cassandraLogo,
[DATAHUB_URN]: datahubLogo,
};

7 changes: 7 additions & 0 deletions datahub-web-react/src/app/ingest/source/builder/sources.json
@@ -310,5 +310,12 @@
"description": "Import Spaces, Sources, Tables and statistics from Dremio.",
"docsUrl": "https://datahubproject.io/docs/metadata-ingestion/",
"recipe": "source:\n type: dremio\n config:\n # Coordinates\n hostname: null\n port: null\n #true if https, otherwise false\n tls: true\n\n #For cloud instance\n #is_dremio_cloud: True\n #dremio_cloud_project_id: <project_id>\n\n #Credentials with personal access token\n authentication_method: PAT\n password: pass\n\n #Or Credentials with basic auth\n #authentication_method: password\n #username: null\n #password: null\n\n stateful_ingestion:\n enabled: true"
},
{
"urn": "urn:li:dataPlatform:cassandra",
"name": "cassandra",
"displayName": "CassandraDB",
"docsUrl": "https://datahubproject.io/docs/generated/ingestion/sources/cassandra",
"recipe": "source:\n  type: cassandra\n  config:\n    # Credentials for on prem cassandra\n    contact_point: localhost\n    port: 9042\n    username: admin\n    password: password\n\n    # Or\n    # Credentials Astra Cloud\n    #cloud_config:\n    #  secure_connect_bundle: Path to Secure Connect Bundle (.zip)\n    #  token: Application Token\n\n    # Optional Allow / Deny extraction of particular keyspaces.\n    keyspace_pattern:\n      allow: [.*]\n\n    # Optional Allow / Deny extraction of particular tables.\n    table_pattern:\n      allow: [.*]"
}
]
Binary file added datahub-web-react/src/images/cassandralogo.png
40 changes: 40 additions & 0 deletions metadata-ingestion/docs/sources/cassandra/cassandra_pre.md
@@ -0,0 +1,40 @@
### Setup

This integration pulls metadata directly from Cassandra databases, including both **DataStax Astra DB** and **Cassandra Enterprise Edition (EE)**.

You’ll need to have a Cassandra instance or an Astra DB setup with appropriate access permissions.

#### Steps to Get the Required Information

1. **Set Up User Credentials**:

   - **For Astra DB**:
     - Log in to your Astra DB Console.
     - Navigate to **Organization Settings** > **Token Management**.
     - Generate an **Application Token** with the required permissions for read access.
     - Download the **Secure Connect Bundle** from the Astra DB Console.
   - **For Cassandra EE**:
     - Ensure you have a **username** and **password** with read access to the necessary keyspaces.

2. **Permissions**:

   - The user or token must have `SELECT` permissions that allow it to:
     - Access metadata in system keyspaces (e.g., `system_schema`) to retrieve information about keyspaces, tables, columns, and views.
     - Perform `SELECT` operations on the data tables if data profiling is enabled.

3. **Verify Database Access**:
   - For Astra DB: Ensure the **Secure Connect Bundle** is used and configured correctly.
   - For open-source Cassandra: Ensure the **contact point** and **port** are accessible.
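
The reachability check in step 3 can be approximated before running ingestion. The following is a minimal sketch that uses only Python's standard library; the helper name `is_port_reachable` is hypothetical and is not part of DataHub or the Cassandra driver. It only confirms that a TCP connection to the contact point can be opened. It does not validate credentials or `SELECT` permissions.

```python
import socket


def is_port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened.

    Hypothetical helper for illustration only; it checks network
    reachability, not authentication or authorization.
    """
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

With the values from the example recipe, the call would be `is_port_reachable("localhost", 9042)`.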


:::caution

When enabling profiling, make sure to set a limit on the number of rows to sample. Profiling large tables without a limit may lead to excessive resource consumption and slow performance.

:::

:::note

For Astra DB, the configuration must include the path to the Secure Connect Bundle, which is a `.zip` file on the local filesystem. For that reason, use the CLI to ingest metadata into DataHub.

:::
30 changes: 30 additions & 0 deletions metadata-ingestion/docs/sources/cassandra/cassandra_recipe.yml
@@ -0,0 +1,30 @@
source:
  type: "cassandra"
  config:
    # Credentials for on prem cassandra
    contact_point: "localhost"
    port: 9042
    username: "admin"
    password: "password"

    # Or
    # Credentials Astra Cloud
    #cloud_config:
    #  secure_connect_bundle: "Path to Secure Connect Bundle (.zip)"
    #  token: "Application Token"

    # Optional Allow / Deny extraction of particular keyspaces.
    keyspace_pattern:
      allow: [".*"]

    # Optional Allow / Deny extraction of particular tables.
    table_pattern:
      allow: [".*"]

    # Optional
    profiling:
      enabled: true
      profile_table_level_only: true

sink:
  # config sinks
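
To illustrate how the `keyspace_pattern` and `table_pattern` allow/deny lists in the recipe above could select names, here is a simplified sketch. It is not DataHub's implementation: the function `is_allowed` is hypothetical, and it assumes that patterns are regular expressions matched case-insensitively from the start of the name, with deny patterns taking precedence. The `deny` key is implied by the recipe comments but is not shown in the recipe itself.

```python
import re
from typing import Sequence


def is_allowed(
    name: str,
    allow: Sequence[str] = (".*",),
    deny: Sequence[str] = (),
) -> bool:
    """Return True if `name` matches an allow pattern and no deny pattern.

    Hypothetical helper; the matching rules are assumptions made
    for illustration.
    """
    # A deny match wins over any allow match.
    if any(re.match(pattern, name, re.IGNORECASE) for pattern in deny):
        return False
    return any(re.match(pattern, name, re.IGNORECASE) for pattern in allow)


keyspaces = ["sales", "inventory", "system_schema"]
selected = [k for k in keyspaces if is_allowed(k, allow=[".*"], deny=["system.*"])]
print(selected)  # ['sales', 'inventory']
```

With the recipe's default of `allow: [".*"]` and no deny list, every keyspace and table name is selected.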
9 changes: 9 additions & 0 deletions metadata-ingestion/setup.py
@@ -404,6 +404,13 @@
    # https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/release-notes.html#rn-7-14-0
    # https://github.com/elastic/elasticsearch-py/issues/1639#issuecomment-883587433
    "elasticsearch": {"elasticsearch==7.13.4"},
    "cassandra": {
        "cassandra-driver>=3.28.0",
        # We were seeing an error like this `numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject`
        # with numpy 2.0. This likely indicates a mismatch between scikit-learn and numpy versions.
        # https://stackoverflow.com/questions/40845304/runtimewarning-numpy-dtype-size-changed-may-indicate-binary-incompatibility
        "numpy<2",
    },
    "feast": {
        "feast>=0.34.0,<1",
        "flask-openid>=1.3.0",
@@ -660,6 +667,7 @@
"qlik-sense",
"sigma",
"sac",
"cassandra",
]
if plugin
for dependency in plugins[plugin]
@@ -778,6 +786,7 @@
"qlik-sense = datahub.ingestion.source.qlik_sense.qlik_sense:QlikSenseSource",
"sigma = datahub.ingestion.source.sigma.sigma:SigmaSource",
"sac = datahub.ingestion.source.sac.sac:SACSource",
"cassandra = datahub.ingestion.source.cassandra.cassandra:CassandraSource",
],
"datahub.ingestion.transformer.plugins": [
"pattern_cleanup_ownership = datahub.ingestion.transformer.pattern_cleanup_ownership:PatternCleanUpOwnership",
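
The `cassandra = datahub.ingestion.source.cassandra.cassandra:CassandraSource` line added to `setup.py` uses the standard `name = module:attribute` entry-point syntax. As a small sketch of how that string is split into a module path and a class name, the standard library's `importlib.metadata.EntryPoint` can parse it without DataHub installed. The group name `datahub.ingestion.source.plugins` is an assumption, since the hunk above does not show it, and the `module` and `attr` properties require Python 3.9 or later.

```python
from importlib.metadata import EntryPoint

# Mirror the entry point registered in setup.py. The group name is an
# assumption for illustration; it is not visible in the diff hunk.
ep = EntryPoint(
    name="cassandra",
    value="datahub.ingestion.source.cassandra.cassandra:CassandraSource",
    group="datahub.ingestion.source.plugins",
)

# The part before the colon is the module to import, the part after it
# is the attribute (here, the source class) looked up in that module.
print(ep.module)
print(ep.attr)
```

Calling `ep.load()` would import the module and return the class, which only works in an environment where the `cassandra` extra is installed.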