Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes #1601 Add initial support for Unizin Synthetic data #1600

Draft
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

jxiao21
Copy link
Contributor

@jxiao21 jxiao21 commented Jul 31, 2024

Draft as there's a few things to fix up

  • Re-add metadata query to run, though it might not work with synthetic data. If it doesn't work with synthetic only run if DATASET_PROJECT_ID is not set.
  • Verify code works if DATASET_PROJECT_ID is not set if that's used for above. It might not work with the regular data unless we remove the connection_property.
  • Possibly rename or move DEFAULT_PROJECT_ID to env since GOOGLE_CLOUD_PROJECT is one that is expected to be defined in the environment
  • Test with service account

This is initial support

Future support ideas include

  • Adding on a per course basis whether to pull from synthetic data or actual data. This could allow courses to use this data as a preview. However there's more to this
  • To be able to preview synthetic data, we'll need to time-shift the events somehow so they fit more in the present. This might be looking at the term data for the synthetic course and adding in the difference in days to all other dates present in the data. This feels like it will be a semi-complex transformation but some views won't work well unless we do this.
  • Having a separate synthetic data increment value (or getting rid of this entirely and using short ids?). I think to be able to hold multiple institution data we might still need long ids unless we also change the data model to link up to the LTI deployment id.

@jxiao21 jxiao21 linked an issue Jul 31, 2024 that may be closed by this pull request
@jonespm jonespm changed the title Limit BQ Query Fixes #1601 Add initial support for Unizin Synthetic data Jul 31, 2024
@jonespm jonespm linked an issue Jul 31, 2024 that may be closed by this pull request
config/env_sample.hjson Show resolved Hide resolved
# Change the default Bigquery Project ID
"DEFAULT_PROJECT_ID": "udp-umich-prod",
# Change the dataset project ID where queries are run against
"DATASET_PROJECT_ID": "unizin-shared"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will leave this line commented by default, so developer needs to enable DATASET_PROJECT_ID in purpose, to connect to Unizin synthetic data

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I don't think it would be needed otherwise. Though there could be a case where a default project is used but the data sets are in another project. Like when/if we get true shared repositories.

Anyway, this is a temporary solution and I believe we'll need a future issue to set this value on a per course level in the admin.

dashboard/cron.py Outdated Show resolved Hide resolved
DEFAULT_PROJECT_ID = ENV.get("DEFAULT_PROJECT_ID", None)

# Override the default project ID for BigQuery if needed, like to unizin-shared
DATASET_PROJECT_ID = ENV.get("DATASET_PROJECT_ID", None)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the default value be DEFAULT_PROJECT_ID, instead of None?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, but then we might need a new variable to indicate whether or not this is running on synthetic data or not.

Maybe it's better to make that explicit. I'm not sure yet, this still has some more work on it.

"CRON_QUERY_FILE": "config/cron_udp.hjson",

# Change the default Bigquery Project ID
"DEFAULT_PROJECT_ID": "udp-umich-prod",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would use a placeholder value here, instead of "udp-umich-prod", like

"DEFAULT_PROJECT_ID": "<UDP_institution_id>",

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this was just empty before. This should all work with the service account and the regular data if it's not set and I don't think it will work yet.

@zqian
Copy link
Member

zqian commented Jul 31, 2024

Maybe it is worthwhile to add a section into this loading_data.md, with the direction of using Unizin synthetic dataset.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add initial support for Unizin Synthetic data
3 participants