Improve query management in BigQuery integration
mkuthan committed Nov 22, 2024
1 parent 22762bd commit 55e5f7b
Showing 3 changed files with 26 additions and 5 deletions.
7 changes: 7 additions & 0 deletions .streamlit/config.toml
@@ -6,6 +6,13 @@ gatherUsageStats = false
# Show only options set externally (e.g. through st.set_page_config)
toolbarMode = "minimal"

[logger]
# Info is the default logging level, kept here for reference
level = "info"

# Add logging level and logger name to make logs more complete
messageFormat = "%(asctime)s %(levelname) -7s %(name)s: %(message)s"

[runner]
# Raise an exception when unserializable data is added to Session State.
enforceSerializableSessionState = true
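The `messageFormat` value above uses the standard Python `logging` format fields, so its effect can be previewed outside Streamlit. A minimal standalone sketch (the logger name is illustrative, not taken from the project):

```python
import logging

# Demo of the messageFormat configured in .streamlit/config.toml;
# Streamlit applies an equivalent format to its own loggers.
fmt = "%(asctime)s %(levelname) -7s %(name)s: %(message)s"

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(fmt))

# Illustrative logger name
logger = logging.getLogger("example.infrastructure.big_query")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("query submitted")
# e.g.: 2024-11-22 12:00:00,000 INFO    example.infrastructure.big_query: query submitted
```

The ` -7s` width padding keeps the level column aligned, which makes multi-level logs easier to scan.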
3 changes: 1 addition & 2 deletions README.md
@@ -28,15 +28,14 @@ This project demonstrates how to leverage the built-in power of Streamlit using
* πŸ—‚οΈ BigQuery integration using the New York Taxi public dataset
* πŸ”’ Authentication skeleton, easily replaceable with OAuth
* πŸ”— Application state sharing via URL
* πŸ’Ύ Dataframe export to CSV and XLS
* πŸ’Ύ Dataframe export buttons to CSV and XLS

### TODO

* 🐳 Create Docker image
* πŸ§ͺ Implement BigQuery integration tests
* πŸ“ˆ Add more visualizations for integrated public dataset
* πŸ” Integration with external OAuth provider, see [roadmap](https://roadmap.streamlit.app/)
* πŸ“‹ Better table with sorting and filtering
* πŸ“ Add request logging
* πŸ”„ Redirect to the original page after login
* βš–οΈ Describe load balancer strategies, for example: sticky session
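The dataframe-export feature from the list above typically boils down to serializing the dataframe and handing the bytes to a Streamlit download button. A hedged sketch with illustrative data (the `zone`/`trips` columns are not from the project):

```python
import pandas as pd

# Illustrative data; the real app queries the New York Taxi public dataset.
df = pd.DataFrame({"zone": ["Midtown", "JFK Airport"], "trips": [120, 45]})

# CSV payload for a Streamlit download button, wired up roughly as:
# st.download_button("Download CSV", csv_bytes, file_name="trips.csv", mime="text/csv")
csv_bytes = df.to_csv(index=False).encode("utf-8")
```

XLS export works the same way, swapping `to_csv` for `to_excel` with an in-memory buffer and the matching MIME type.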
21 changes: 18 additions & 3 deletions example/infrastructure/big_query.py
@@ -1,8 +1,16 @@
import pandas as pd
from google.cloud import bigquery

# Don't allow unbounded results to avoid OOM errors
_MAX_RESULTS = 100_000

# Limit the time a query can run to avoid long waits
_JOB_TIMEOUT_MS = 60_000

# Define labels for better query management
_JOB_LABELS = {"application": "example-streamlit"}


# TODO: add error handling, define timeouts, etc.
def query(q: str, params: dict = None) -> pd.DataFrame:
client = __get_client()

@@ -11,10 +19,17 @@ def query(q: str, params: dict = None) -> pd.DataFrame:
# TODO: add support for other types
bigquery.ScalarQueryParameter(name, "STRING", value)
for name, value in (params or {}).items()
]
],
job_timeout_ms=_JOB_TIMEOUT_MS,
labels=_JOB_LABELS,
)

# The default timeout is None for a good reason; see the sources for details.
# The default retry and job_retry policies look good, so we don't change them.
results = client.query(q, job_config=job_config)
return results.to_dataframe()

# Use regular Job instead of Storage API to avoid costs
return results.to_dataframe(max_results=_MAX_RESULTS, create_bqstorage_client=False)


def __get_client() -> bigquery.Client:
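The remaining TODO in the diff ("add support for other types") could be addressed by inferring the BigQuery scalar type from each Python value. A hedged sketch; `infer_param_type` and `_PARAM_TYPES` are hypothetical names, not part of the module:

```python
# Map Python types to BigQuery scalar parameter types; a sketch for the
# "add support for other types" TODO, not the module's actual behavior.
_PARAM_TYPES = {bool: "BOOL", int: "INT64", float: "FLOAT64", str: "STRING"}


def infer_param_type(value) -> str:
    # bool is listed before int because bool is a subclass of int
    for py_type, bq_type in _PARAM_TYPES.items():
        if isinstance(value, py_type):
            return bq_type
    return "STRING"  # fall back to the module's current default
```

With something like this, the comprehension that builds `ScalarQueryParameter` objects could pass `infer_param_type(value)` instead of the hard-coded `"STRING"`.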
