Added Road-map in README file
Darshan Prajapati committed Jan 10, 2024
1 parent 0c2165c commit 0b0b7ae
Showing 2 changed files with 202 additions and 39 deletions.
108 changes: 83 additions & 25 deletions xql/README.md
@@ -1,58 +1,116 @@
# `xql` - Querying Xarray Datasets with SQL

Running SQL like queries on Xarray Datasets.
Run SQL-like queries on Xarray Datasets. Think of a dataset as a table and a data variable as a column.
> Note: For now, we support only zarr datasets.
# Supported Features

* **`Select` Variables** - From a large dataset having hundreds of variables select only needed variables.
* **Apply `where` clause** - A general `where` condition, as in SQL. Useful for queries that cover only a specific time range or specific regions.
* > Note: For now, we support conditions on coordinates.
* **`group by` and `order by` Functions** - Both are supported on the coordinates only. e.g. time, latitude, longitude, etc.
* **`aggregate` Functions** - Aggregate functions `AVG()`, `MIN()`, `MAX()`, etc. are supported on any coordinate like time.
* For more, check out the [road-map](https://github.com/google/weather-tools/tree/xql-init/xql#roadmap).
> Note: For now, we support `where` conditions on coordinates only.
# Quickstart

## Prerequisites

Get an access to the dataset you want to query. As an example we're using the analysis ready era5 public dataset. [full_37-1h-0p25deg-chunk-1.zarr-v3](https://pantheon.corp.google.com/storage/browser/gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3?project=gcp-public-data-signals).
Get access to the dataset you want to query. As an example, we use the analysis-ready ERA5 public dataset [full_37-1h-0p25deg-chunk-1.zarr-v3](https://pantheon.corp.google.com/storage/browser/gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3?project=gcp-public-data-signals).

For this gcloud must be configured in the environment. [Initializing the gcloud CLI](https://cloud.google.com/sdk/docs/initializing).
For this, `gcloud` must be configured in your local environment. Refer to [Initializing the gcloud CLI](https://cloud.google.com/sdk/docs/initializing) to configure `gcloud` locally.

## Usage

Install required packages
```
# Install required packages
pip install -r xql/requirements.txt
```
Jump into xql
```
# Jump into xql
python xql/main.py
```
---
### Supported meta commands
`.help`: For usage info.

Running a simple query on dataset. Comparing with SQL a data variable is like a column and table is like a dataset.
```
SELECT evaporation, geopotential_at_surface, temperature FROM '{TABLE}'
```
Replace `{TABLE}` with dataset uri. Eg. `gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3`.
`.exit`: To exit from the xql interpreter.

---
Apply a conditions. Query to get temperature of arctic region in last winter:
`.set`: To set the dataset uri as a shortened key.
```
SELECT temperature FROM '{TABLE}' WHERE time >= '2022-12-01' AND time < '2023-03-01' AND latitude >= 66.5
.set era5 gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3
```
---
Aggregating results using Group By and Aggregate function. Daily average of temperature of last winter in arctic region.

`.show`: To list the shortened dataset keys. E.g. `.show` or `.show [key]`

```
SELECT AVG(temperature) FROM '{TABLE}' WHERE time >= '2022-12-01' AND time < '2023-03-01' AND latitude >= 66.5
GROUP BY time_day
.show era5
```
Replace `time_day` to `time_month` or `time_year` if monthly or yearly average is needed. Also use MIN() and MAX() functions same way as AVG().

`[query]` => Any valid SQL-like query.
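
The `.set` and `.show` commands described above amount to a per-session mapping from a short key to a dataset URI. A minimal sketch of that idea follows. The helper names `set_alias` and `resolve_table` and the URI `gs://other-bucket/data.zarr` are hypothetical and are not part of `xql`:

```python
from typing import Dict


def set_alias(aliases: Dict[str, str], command: str) -> bool:
    """Handle a '.set <key> <uri>' line; return True if the alias was stored."""
    # split() with no argument tolerates repeated spaces and tabs.
    parts = command.split()
    if len(parts) != 3 or parts[0] != ".set":
        return False
    _, key, uri = parts
    aliases[key] = uri
    return True


def resolve_table(aliases: Dict[str, str], name: str) -> str:
    """Return the URI stored for a key, or the name itself if it is not a key."""
    return aliases.get(name, name)


aliases: Dict[str, str] = {}
set_alias(
    aliases,
    ".set era5 gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3",
)

# A known key resolves to its URI.
print(resolve_table(aliases, "era5"))
# Anything else is passed through unchanged (hypothetical URI).
print(resolve_table(aliases, "gs://other-bucket/data.zarr"))
```

Because unknown names pass through unchanged, a query can use either a shortened key or a full dataset URI in its `FROM` clause.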

---
Order by latitude, longitude in ascending and descending order.
```
SELECT surface_pressure FROM 'gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3' WHERE time >= '2021-06-01T00:00:00Z' AND time <= '2021-06-30T23:59:59Z' ORDER BY latitude, longitude DESC LIMIT 1
```
### Example Queries

1. Apply conditions. Query to get the temperature of the Arctic region in January 2022:
```
SELECT
temperature
FROM 'gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3'
WHERE
time >= '2022-01-01' AND
time < '2022-02-01' AND
latitude >= 66.5
```
> Note: Multiline queries are not yet supported. Convert copied queries into a single line before execution.
2. Aggregate results using `GROUP BY` and an aggregate function. Daily average temperature of the Arctic region in January 2022.
Set the table name as a shortened key:
```
.set era5 gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3
```
```
SELECT
AVG(temperature)
FROM era5
WHERE
time >= '2022-01-01' AND
time < '2022-02-01' AND
latitude >= 66.5
GROUP BY time_day
```
Replace `time_day` with `time_month` or `time_year` if a monthly or yearly average is needed. `MIN()` and `MAX()` can be used the same way as `AVG()`.
3. `caveat`: The queries above run on the client's local machine and generate a large two-dimensional array, so querying a very large amount of data will run out of memory.
e.g. A query like the one below will give OOM errors if the client machine doesn't have enough RAM.
```
SELECT
evaporation,
geopotential_at_surface,
temperature
FROM era5
```
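
Since multiline queries are not yet supported, the example queries above need to be collapsed into one line before they are pasted into the interpreter. A minimal sketch of one way to do that, assuming a hypothetical helper `to_single_line` (not part of `xql`). Note that it also collapses runs of whitespace inside quoted literals, which does not matter for the queries shown here:

```python
def to_single_line(query: str) -> str:
    """Collapse all runs of whitespace (including newlines) into single spaces."""
    return " ".join(query.split())


multiline = """
SELECT
    temperature
FROM era5
WHERE
    time >= '2022-01-01' AND
    time < '2022-02-01' AND
    latitude >= 66.5
"""

print(to_single_line(multiline))
# SELECT temperature FROM era5 WHERE time >= '2022-01-01' AND time < '2022-02-01' AND latitude >= 66.5
```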
# Roadmap
_Updated on 2024-01-08_
1. [x] **Select Variables**
    1. [ ] On Coordinates
    2. [x] On Variables
2. [x] **Where Clause**: `=`, `>`, `>=`, `<`, `<=`, etc.
    1. [x] On Coordinates
    2. [ ] On Variables
3. [x] **Aggregate Functions**: Only `AVG()`, `MIN()`, `MAX()`, `SUM()` are supported.
    1. [x] With Group By
    2. [ ] Without Group By
    3. [ ] Multiple aggregate functions in a single query
4. [x] **Order By**: Only supported for coordinates.
5. [ ] **Limit**: Limiting the result to display.
6. [ ] **Mathematical Operators** `(+, -, *, /)`: Add support for mathematical operators in the query.
7. [ ] **Aliases**: Add support for aliases while querying.
8. [ ] **Join Operations**: Support joining tables and querying the result.
9. [ ] **Nested Queries**: Add support for nested queries.
10. [ ] **Custom Aggregate Functions**: Support custom aggregate functions.
133 changes: 119 additions & 14 deletions xql/main.py
@@ -22,6 +22,15 @@
 from sqlglot import parse_one, exp
 from xarray.core.groupby import DatasetGroupBy

+command_info = {
+    ".exit": "To exit from the current session.",
+    ".set": "To set the dataset uri as a shortened key. e.g. .set era5 gs://{BUCKET}/dataset-uri",
+    ".show": "To list down dataset shortened key. e.g. .show or .show [key]",
+    "[query]": "Any valid sql like query."
+}
+
+table_dataset_map = {} # To store dataset shortened keys for a single session.
+
 operate = {
     "and" : lambda a, b: a & b,
     "or" : lambda a, b: a | b,
@@ -184,14 +193,34 @@ def apply_aggregation(groups: t.Union[xr.Dataset, DatasetGroupBy], fun: str, dim
     return aggregate_function_map[fun](groups, dim)


+def get_table(e: exp.Expression) -> str:
+    """
+    Get the table name from an expression.
+
+    Args:
+        e (Expression): The expression containing table information.
+
+    Returns:
+        str: The table name.
+    """
+    # Extract the table name from the expression
+    table = e.find(exp.Table).args['this'].args['this']
+
+    # Check if the table is mapped in table_dataset_map
+    if table in table_dataset_map:
+        table = table_dataset_map[table]
+
+    return table
+
+
 def parse_query(query: str) -> xr.Dataset:

     expr = parse_one(query)

     if not isinstance(expr, exp.Select):
         return "ERROR: Only select queries are supported."

-    table = expr.find(exp.Table).args['this'].args['this']
+    table = get_table(expr)

     is_star = expr.find(exp.Star)

@@ -201,7 +230,6 @@ def parse_query(query: str) -> xr.Dataset:

     where = expr.find(exp.Where)
     group_by = expr.find(exp.Group)
-    order_by = expr.find(exp.Order)

     agg_funcs = {
         var.args['this'].args['this'].args['this']: var.key
@@ -224,28 +252,105 @@ def parse_query(query: str) -> xr.Dataset:
         groupby_fields = [ e.args['this'].args['this'] for e in group_by.args['expressions'] ]
         ds = apply_group_by(groupby_fields, ds, agg_funcs)

-    if order_by:
-        orderby_fields = [(str(e)) for e in order_by.args['expressions'] ]
-        ds = apply_order_by(orderby_fields, ds)
-
     return ds


+def set_dataset_table(cmd: str) -> None:
+    """
+    Set the mapping between a key and a dataset.
+
+    Args:
+        cmd (str): The command string in the format ".set key val"
+            where key is the identifier and val is the dataset table.
+    """
+    # Split the command into parts
+    cmd_parts = cmd.split(" ")
+
+    # Check if the command has the correct number of arguments
+    if len(cmd_parts) == 3:
+        # Extract key and val from the command
+        _, key, val = cmd_parts
+        # Update the dataset table mapping
+        table_dataset_map[key] = val
+    else:
+        # Print an error message for incorrect arguments
+        print("Incorrect args. Run .help .set for usage info.")
+
+
+def list_key_values(input: t.Dict[str, str]) -> None:
+    """
+    Display key-value pairs from a dictionary.
+
+    Args:
+        input (Dict[str, str]): The dictionary containing key-value pairs.
+    """
+    for cmd, desc in input.items():
+        print(f"{cmd} => {desc}")
+
+
+def display_help(cmd: str) -> None:
+    """
+    Display help information for commands.
+
+    Args:
+        cmd (str): The command string.
+    """
+    cmd_parts = cmd.split(" ")
+
+    if len(cmd_parts) == 2:
+        if cmd_parts[1] in command_info:
+            print(f"{cmd_parts[1]} => {command_info[cmd_parts[1]]}")
+        else:
+            list_key_values(command_info)
+    elif len(cmd_parts) == 1:
+        list_key_values(command_info)
+    else:
+        print("Incorrect usage. Run .help or .help [cmd] for usage info.")
+
+
+def display_table_dataset_map(cmd: str) -> None:
+    """
+    Display information from the table_dataset_map.
+
+    Args:
+        cmd (str): The command string.
+    """
+    cmd_parts = cmd.split(" ")
+
+    if len(cmd_parts) == 2:
+        if cmd_parts[1] in table_dataset_map:
+            print(f"{cmd_parts[1]} => {table_dataset_map[cmd_parts[1]]}")
+        else:
+            list_key_values(table_dataset_map)
+    else:
+        list_key_values(table_dataset_map)
+
+
 if __name__ == "__main__":

     while True:

         query = input("xql>")

-        if query == "exit":
+        if query == ".exit":
             break

-        try:
-            result = parse_query(query)
-        except Exception:
-            result = "Something wrong with the query."
+        elif ".help" in query:
+            display_help(query)
+
+        elif ".set" in query:
+            set_dataset_table(query)
+
+        elif ".show" in query:
+            display_table_dataset_map(query)

-        if isinstance(result, xr.Dataset):
-            print(result.to_dataframe())
         else:
-            print(result)
+            try:
+                result = parse_query(query)
+            except Exception as e:
+                result = f"ERROR: {type(e).__name__}: {e.__str__()}."
+
+            if isinstance(result, xr.Dataset):
+                print(result.to_dataframe())
+            else:
+                print(result)
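
The loop in this commit routes input with substring checks such as `".set" in query`, so a SQL query whose text happens to contain `.set`, `.show` or `.help` (for example inside a dataset URI) would be handled as a meta command. A minimal sketch of routing on the first token instead follows. The names `classify` and `META_COMMANDS` and the bucket paths are hypothetical and are not part of this commit:

```python
from typing import Tuple

# Meta commands listed in the README of this commit.
META_COMMANDS = (".exit", ".help", ".set", ".show")


def classify(line: str) -> Tuple[str, str]:
    """Return (kind, rest) where kind is a meta command, 'query' or 'empty'."""
    # split(maxsplit=1) separates the first token from the remainder and
    # tolerates leading, trailing and repeated whitespace.
    parts = line.strip().split(maxsplit=1)
    if not parts:
        return "empty", ""
    head = parts[0]
    rest = parts[1] if len(parts) > 1 else ""
    if head in META_COMMANDS:
        return head, rest
    # Anything that does not start with a meta command is treated as a query.
    return "query", line.strip()


print(classify(".set era5 gs://bucket/data.zarr"))
# ('.set', 'era5 gs://bucket/data.zarr')

# Contains ".set" as a substring, but is still routed as a query.
print(classify("SELECT temperature FROM 'gs://bucket/archive.set'")[0])
# query
```

Matching on the first token keeps the meta commands and the query text from colliding, at the cost of requiring that a meta command be the first thing on the line.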
