Support for table sharing when a catalog account is being used #904

Closed
blitzmohit opened this issue Dec 7, 2023 · 2 comments · Fixed by #1021
Labels
effort: medium priority: high status: in-progress This issue has been picked and is being implemented

@blitzmohit
Contributor

Is your feature request related to a problem? Please describe.

In certain data mesh architectures, such as the ones described in https://aws.amazon.com/blogs/big-data/design-a-data-mesh-architecture-using-aws-lake-formation-and-aws-glue/ and https://aws.amazon.com/blogs/big-data/how-jpmorgan-chase-built-a-data-mesh-architecture-to-drive-significant-value-to-enhance-their-enterprise-data-platform/,
a catalog account owns the Glue databases and tables instead of the producer.

Currently, data.all does not support sharing tables through a catalog account.

If a dataset is imported using a database that was shared to the producer from a catalog account, i.e. a resource link, the import works fine. However, any attempt to share one of that dataset's tables outside the producer account fails, because Lake Formation does not allow resharing of databases or tables.

Describe the solution you'd like
Proposed solution is as follows:

  1. On share approval, detect whether the source Glue database is a resource link.
  2. If it is a resource link, identify the catalog account.
  3. Check that data.all has access to this account, i.e. that it has been onboarded as a data.all environment.
  4. Validate permissions, i.e. confirm that the dataset owner approving the request is allowed to share this table/database. To support this, we check for an “owner-account-id” tag on the database, whose value must match the dataset owner’s account id.
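
The four steps above can be sketched roughly as follows. This is only an illustration, not the code from the linked PR: the function names, the `onboarded_accounts` set and the error messages are hypothetical. It assumes `database` is the `Database` dict of a Glue `GetDatabase` response (where a resource link carries a `TargetDatabase` entry) and `tags` is the tag dict of the catalog database, e.g. the `Tags` field of a Glue `GetTags` response.

```python
from typing import Optional

# Tag key as written in this issue; the key used by a real deployment is an assumption.
OWNER_TAG_KEY = "owner-account-id"


def catalog_account_of(database: dict) -> Optional[str]:
    """Steps 1-2: return the catalog account id if `database` is a resource link, else None."""
    target = database.get("TargetDatabase")
    if not target:
        return None  # a regular database, no catalog account involved
    return target.get("CatalogId")


def validate_catalog_share(
    database: dict,
    tags: dict,
    onboarded_accounts: set,
    owner_account_id: str,
) -> Optional[str]:
    """Steps 1-4: return the catalog account id, or None when no catalog account is used.

    Raises PermissionError when the catalog account is not onboarded or the
    owner tag is missing or does not match the dataset owner's account id.
    """
    catalog_account = catalog_account_of(database)
    if catalog_account is None:
        return None

    # Step 3: the catalog account must be known to data.all (hypothetical lookup).
    if catalog_account not in onboarded_accounts:
        raise PermissionError(f"Catalog account {catalog_account} is not onboarded")

    # Step 4: the tag must exist and match the dataset owner's account id.
    if tags.get(OWNER_TAG_KEY) != owner_account_id:
        raise PermissionError("Owner tag is missing or does not match the dataset owner")

    return catalog_account
```

A caller would fetch `database` and `tags` with the Glue client of the relevant account and fall back to the normal sharing path when the function returns `None`.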

Additional context
In terms of support, the catalog could be an additional high-level object in data.all that could power additional use cases.

@dlpzx
Contributor

dlpzx commented Jan 3, 2024

Discussion happening directly in PR #905

@dlpzx dlpzx added status: in-progress This issue has been picked and is being implemented priority: high effort: medium labels Jan 3, 2024
@noah-paige noah-paige linked a pull request Jan 12, 2024 that will close this issue
TejasRGitHub added a commit to TejasRGitHub/aws-dataall that referenced this issue Jan 30, 2024
TejasRGitHub added a commit to TejasRGitHub/aws-dataall that referenced this issue Jan 30, 2024
TejasRGitHub pushed a commit to TejasRGitHub/aws-dataall that referenced this issue Feb 15, 2024
# Conflicts:
#	backend/dataall/modules/dataset_sharing/aws/glue_client.py
#	backend/dataall/modules/dataset_sharing/services/data_sharing_service.py
#	backend/dataall/modules/dataset_sharing/services/share_managers/lf_share_manager.py
#	backend/dataall/modules/dataset_sharing/services/share_processors/lf_process_cross_account_share.py
#	tests/modules/datasets/tasks/test_lf_share_manager.py
TejasRGitHub pushed a commit to TejasRGitHub/aws-dataall that referenced this issue Feb 21, 2024
# Conflicts:
#	backend/dataall/modules/dataset_sharing/services/share_processors/lakeformation_process_share.py
noah-paige pushed a commit that referenced this issue Feb 23, 2024
### Feature or Bugfix
- Feature

### Detail

This PR contains all the code raised in PR
#905, plus unit tests and changes
addressing the comments raised on that PR. Details copied from that PR:

- Detect if the source database is a resource link
- If it is a resource link, check that the catalog account has been
onboarded to data.all
- Check for the presence of owner_account_id tag on the database
- The tag needs to exist and the value has to match the account id of the
share approver

Credits - @blitzmohit 

## Testing 

Running Unit tests - ✅ 
Testing on AWS Deployed data.all instance with the Original PR -  ✅ 
Sanity testing after addressing comments - **[EDIT]** ✅ ( Testing done )

### Relates
- #904

### Security
Please answer the questions below briefly where applicable, or write
`N/A`. Based on
[OWASP 10](https://owasp.org/Top10/en/).

- Does this PR introduce or modify any input fields or queries - this
includes
fetching data from storage outside the application (e.g. a database, an
S3 bucket)? No
  - Is the input sanitized?
- What precautions are you taking before deserializing the data you
consume?
  - Is injection prevented by parametrizing queries?
  - Have you ensured no `eval` or similar functions are used?
- Does this PR introduce any functionality or component that requires
authorization? No
- How have you ensured it respects the existing AuthN/AuthZ mechanisms?
  - Are you logging failed auth attempts?
- Are you using or adding any cryptographic features? No
  - Do you use a standard proven implementations?
  - Are the used keys controlled by the customer? Where are they stored?
- Are you introducing any new policies/roles/users? Yes
  - Have you used the least-privilege principle? How? Yes


By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.

---------

Co-authored-by: trajopadhye <[email protected]>
@noah-paige noah-paige linked a pull request Feb 23, 2024 that will close this issue
@noah-paige
Contributor

Closing this issue - as completed in #1021

noah-paige pushed a commit that referenced this issue Mar 4, 2024
### Feature or Bugfix
- Bugfix


### Detail

When using a worksheet with a share made through a catalog account
(following the steps described in PR
#1021), the worksheet drop-down
list doesn't display the correct DB name. This is because the DB name
is picked from the producer account (where the S3 bucket is present
but the actual DB is not), which only has the resource-linked DB.
As a result, the autogenerated query doesn't work.
Please refer to the screenshot:
<img width="1482" alt="image"
src="https://github.com/data-dot-all/dataall/assets/71188245/fbc28286-0ca7-47de-a6ae-3020b1188dcb">

Also, on the share view, the DB name shown in the query (under
"Data Consumption details") is the resource-linked DB name and not the
correct DB name.
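
One way to picture the naming problem is the sketch below. It is not the fix from this commit: both helper names are hypothetical, and it assumes that the "correct" name is the name of the database the resource link points at (the `TargetDatabase.DatabaseName` field of a Glue `Database` dict). Which name data.all actually resolves is not stated here.

```python
def query_database_name(database: dict) -> str:
    """Return the target database name for a resource link, else the database's own name."""
    target = database.get("TargetDatabase") or {}
    return target.get("DatabaseName") or database["Name"]


def preview_query(database: dict, table: str, limit: int = 50) -> str:
    """Build a preview query against the resolved database name (illustrative only)."""
    return f'SELECT * FROM "{query_database_name(database)}"."{table}" limit {limit}'
```

With a resource link named `sales_link` that targets `sales`, the generated query refers to `sales` rather than `sales_link`.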

### Relates
- #904

### Security
Please answer the questions below briefly where applicable, or write
`N/A`. Based on
[OWASP 10](https://owasp.org/Top10/en/).

- Does this PR introduce or modify any input fields or queries - this
includes
fetching data from storage outside the application (e.g. a database, an
S3 bucket)? No
  - Is the input sanitized?
- What precautions are you taking before deserializing the data you
consume?
  - Is injection prevented by parametrizing queries?
  - Have you ensured no `eval` or similar functions are used?
- Does this PR introduce any functionality or component that requires
authorization? No
- How have you ensured it respects the existing AuthN/AuthZ mechanisms?
  - Are you logging failed auth attempts?
- Are you using or adding any cryptographic features? No
  - Do you use a standard proven implementations?
- Are the used keys controlled by the customer? Where are they stored?
No
- Are you introducing any new policies/roles/users?
  - Have you used the least-privilege principle? How?


By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.

---------

Co-authored-by: trajopadhye <[email protected]>
4 participants