Pull dataset from studio if not available locally #901

Open
wants to merge 3 commits into
base: main

amritghimire
Contributor

This pulls the dataset from Studio when all of the following conditions are met:

  • The user is logged in to Studio.
  • The dataset or version doesn't exist locally.
  • The user has not passed studio=False to from_dataset.

In that case, the dataset is pulled from Studio before continuing.
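
The three conditions above can be read as a single predicate. This is only a minimal sketch of that reading, not the code in this PR; `should_pull_from_studio` and its parameter names are hypothetical.

```python
def should_pull_from_studio(
    *, exists_locally: bool, logged_in: bool, studio: bool = True
) -> bool:
    """Return True only when all three conditions from the description hold.

    Hypothetical helper: the PR itself spreads these checks across
    DatasetQuery rather than using a function like this.
    """
    return studio and logged_in and not exists_locally
```

With the default studio=True, a logged-in user with no local copy triggers the pull; failing any one condition leaves the original "not found" behavior in place.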

A test is added to check this behavior.

Closes #874


cloudflare-workers-and-pages bot commented Feb 6, 2025

Deploying datachain-documentation with Cloudflare Pages

Latest commit: c4cdeed
Status: ✅  Deploy successful!
Preview URL: https://d641f349.datachain-documentation.pages.dev
Branch Preview URL: https://amrit-from-dataset.datachain-documentation.pages.dev


@amritghimire amritghimire self-assigned this Feb 6, 2025
@amritghimire amritghimire requested review from ilongin, dreadatour and a team February 6, 2025 14:36

codecov bot commented Feb 6, 2025

Codecov Report

Attention: Patch coverage is 87.50000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 87.70%. Comparing base (79f6cf9) to head (c4cdeed).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/datachain/query/dataset.py | 86.36% | 1 Missing and 2 partials ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #901   +/-   ##
=======================================
  Coverage   87.69%   87.70%           
=======================================
  Files         130      130           
  Lines       11664    11683   +19     
  Branches     1586     1589    +3     
=======================================
+ Hits        10229    10246   +17     
+ Misses       1038     1037    -1     
- Partials      397      400    +3     
| Flag | Coverage Δ |
|---|---|
| datachain | 87.62% <87.50%> (+<0.01%) ⬆️ |


@@ -1112,6 +1135,21 @@ def __iter__(self):
    def __or__(self, other):
        return self.union(other)

    def pull_dataset(self, name: str, version: Optional[int] = None) -> "DatasetRecord":
        print("Dataset not found in local catalog, trying to get from studio")
Member

Let's use a logger here in debug mode? @skshetry, what is your take?

Contributor Author

I tried using the logger at info level here at first, but print seemed consistent with other similar messages. Also, we definitely want the user to know we are trying to get the dataset from Studio, so that they can expect the delay in execution.

if not studio:
    raise

if not is_token_set():
Contributor

Can we merge these two checks for studio and token into one line?
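
A minimal sketch of what the merged check might look like. It assumes a missing token should re-raise the same "not found" error as studio=False, which the visible diff does not confirm. `DatasetNotFoundError`, `load_dataset`, `get_local`, and `pull_remote` are hypothetical names, and `is_token_set` is passed in as a parameter here so the example stays self-contained.

```python
from typing import Callable


class DatasetNotFoundError(Exception):
    """Hypothetical stand-in for the error raised on a local miss."""


def load_dataset(
    name: str,
    get_local: Callable[[str], object],
    pull_remote: Callable[[str], object],
    *,
    studio: bool = True,
    is_token_set: Callable[[], bool] = lambda: False,
):
    try:
        return get_local(name)
    except DatasetNotFoundError:
        # One guard instead of two: re-raise the original error unless
        # Studio is enabled and a token is available.
        if not studio or not is_token_set():
            raise
        return pull_remote(name)
```

If the two cases need different error messages (for example, a hint to log in when the token is missing), the checks would have to stay separate.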

ds = self.catalog.get_dataset(name)
self.version = version or ds.latest_version
self.feature_schema = ds.get_version(self.version).feature_schema
try:
Contributor

I think this logic should be in Catalog instead. Here we are putting a lot of custom code about how and when we pull datasets into DatasetQuery, which should not be its domain IMO.
We currently have the method Catalog.get_dataset_with_version_uuid(...), so I would create a similar method with this signature:

def get_dataset_with_version(name: str, version: int, studio=False) -> DatasetRecord
    ...

This method would pull the dataset from Studio if the one with the specific name + version doesn't exist locally and the studio flag is set to True. Otherwise it would raise an exception.
Then we wouldn't even need this special pull_dataset() method here; we would just call ds = self.catalog.get_dataset_with_version(...) and that's it.
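
To make the proposal concrete, here is a rough sketch of such a method on a toy catalog. Only the method name and signature come from the comment above; `ToyCatalog`, `fetch_remote`, `DatasetNotFoundError`, and the simplified `DatasetRecord` are hypothetical stand-ins, and the real Catalog's storage and pull mechanics are not modeled.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


class DatasetNotFoundError(Exception):
    """Hypothetical error for a missing name + version pair."""


@dataclass(frozen=True)
class DatasetRecord:
    """Toy stand-in for the real DatasetRecord."""
    name: str
    version: int


@dataclass
class ToyCatalog:
    # Hypothetical remote fetcher; the real pull mechanism is not shown here.
    fetch_remote: Callable[[str, int], DatasetRecord]
    _local: Dict[Tuple[str, int], DatasetRecord] = field(default_factory=dict)

    def get_dataset_with_version(
        self, name: str, version: int, studio: bool = False
    ) -> DatasetRecord:
        key = (name, version)
        if key not in self._local:
            if not studio:
                raise DatasetNotFoundError(
                    f"{name} version {version} not found locally"
                )
            # Pull once and cache, so later lookups are served locally.
            self._local[key] = self.fetch_remote(name, version)
        return self._local[key]
```

The caller in DatasetQuery then shrinks to a single lookup, and the decision about when to pull lives in one place.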

@@ -481,6 +481,7 @@ def from_dataset(
        version: Optional[int] = None,
        session: Optional[Session] = None,
        settings: Optional[dict] = None,
        studio: bool = True,
Contributor

This flag is a little bit ambiguous to me. As a user, I wouldn't know whether it means we fetch from Studio first or locally first. In any case, it's not clear that fetching from Studio happens only when the dataset doesn't exist locally.

Development

Successfully merging this pull request may close these issues.

Support Studio datasets in Python
3 participants