Added Road-map in README file
Darshan Prajapati committed Jan 9, 2024
1 parent 0c2165c commit a89cbf0
Showing 2 changed files with 212 additions and 37 deletions.
116 changes: 93 additions & 23 deletions xql/README.md
@@ -1,15 +1,16 @@
# `xql` - Querying Xarray Datasets with SQL

Running SQL like queries on Xarray Datasets.
Run SQL-like queries on Xarray Datasets. Think of a dataset as a table and a data variable as a column.
> Note: For now, only Zarr datasets are supported.
# Supported Features

* **`Select` Variables** - From a large dataset with hundreds of variables, select only the variables you need.
* **Apply `where` clause** - A general `where` condition, as in SQL. Useful for queries that cover a specific time range or specific regions.
* > Note: For now, we support conditions on coordinates.
* **`group by` and `order by` Functions** - Both are supported on the coordinates only. e.g. time, latitude, longitude, etc.
* **`aggregate` Functions** - Aggregate functions `AVG()`, `MIN()`, `MAX()`, etc. are supported on any coordinate like time.
* For more, check out the [road-map](https://github.com/google/weather-tools/tree/xql-init/xql#roadmap).
> Note: For now, we support `where` conditions on coordinates only.
# Quickstart

Expand All @@ -21,38 +22,107 @@ For this gcloud must be configured in the environment. [Initializing the gcloud

## Usage

Install required packages
```
# Install required packages
pip install -r xql/requirements.txt
```
Jump into xql
```
# Jump into xql
python xql/main.py
```
---
### Supported commands
`.help`: For usage info.

Running a simple query on dataset. Comparing with SQL a data variable is like a column and table is like a dataset.
```
SELECT evaporation, geopotential_at_surface, temperature FROM '{TABLE}'
```
Replace `{TABLE}` with dataset uri. Eg. `gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3`.
`.exit`: To exit from the current session.

---
Apply a conditions. Query to get temperature of arctic region in last winter:
`.set`: To set the dataset uri as a shortened key.
```
SELECT temperature FROM '{TABLE}' WHERE time >= '2022-12-01' AND time < '2023-03-01' AND latitude >= 66.5
.set era5 gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3
```
---
Aggregating results using Group By and Aggregate function. Daily average of temperature of last winter in arctic region.

`.show`: To list the shortened dataset keys. E.g. `.show` or `.show [key]`

```
SELECT AVG(temperature) FROM '{TABLE}' WHERE time >= '2022-12-01' AND time < '2023-03-01' AND latitude >= 66.5
GROUP BY time_day
.show era5
```
Replace `time_day` to `time_month` or `time_year` if monthly or yearly average is needed. Also use MIN() and MAX() functions same way as AVG().

`[query]` => Any valid sql like query.

---
Order by latitude, longitude in ascending and descending order.
```
SELECT surface_pressure FROM 'gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3' WHERE time >= '2021-06-01T00:00:00Z' AND time <= '2021-06-30T23:59:59Z' ORDER BY latitude, longitude DESC LIMIT 1
```
### Example Queries

1. Running a simple query on dataset.
```
SELECT
evaporation,
geopotential_at_surface,
temperature
FROM '{TABLE}'
```
Replace `{TABLE}` with the dataset URI, e.g. `gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3`.
> Note: Multiline queries are not yet supported. Convert the query into a single line before executing it. See the example below.
Set the table name as a shortened key:
```
.set era5 gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3
```
Now `{TABLE}` can be replaced with `era5`
2. Apply conditions. Query to get the temperature of the Arctic region over last winter:
```
SELECT
temperature
FROM '{TABLE}'
WHERE
time >= '2022-12-01' AND
time < '2023-03-01' AND
latitude >= 66.5
```
3. Aggregate results using `GROUP BY` and an aggregate function. Daily average temperature over last winter in the Arctic region:
```
SELECT
AVG(temperature)
FROM '{TABLE}'
WHERE
time >= '2022-12-01' AND
time < '2023-03-01' AND
latitude >= 66.5
GROUP BY time_day
```
Replace `time_day` with `time_month` or `time_year` if a monthly or yearly average is needed. `MIN()` and `MAX()` can be used the same way as `AVG()`.
4. Order by latitude, longitude in ascending and descending order.
```
SELECT
surface_pressure
FROM 'gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3'
WHERE
time >= '2021-06-01T00:00:00Z' AND
time <= '2021-06-30T23:59:59Z'
ORDER BY latitude, longitude
```
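Since the note above says multiline queries are not yet supported, the formatted examples have to be collapsed into one line before they are pasted at the `xql>` prompt. A minimal sketch of one way to do that follows; `to_single_line` is a hypothetical helper and not part of `xql`. It assumes the query has no string literals whose internal whitespace matters, since every run of whitespace becomes a single space.

```python
def to_single_line(query: str) -> str:
    """Collapse every run of whitespace (including newlines) into one space.

    Hypothetical helper, not part of xql. Whitespace inside quoted
    literals is collapsed too, which is fine for the examples above.
    """
    return " ".join(query.split())


multiline = """
SELECT
    temperature
FROM 'era5'
WHERE
    time >= '2022-12-01' AND
    latitude >= 66.5
"""

# The single-line form can be pasted directly at the xql> prompt.
print(to_single_line(multiline))
```

Here `era5` stands for a key registered earlier with `.set`.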
# Roadmap
_Updated on 2024-01-08_
1. [x] **Select Variables**
1. [ ] On Coordinates
2. [x] On Variables
2. [x] **Where Clause**: `=`, `>`, `>=`, `<`, `<=`, etc.
1. [x] On Coordinates
2. [ ] On Variables
3. [x] **Aggregate Functions**: Only `AVG()`, `MIN()`, `MAX()`, `SUM()` are supported.
1. [x] With Group By
2. [ ] Without Group By
3. [ ] Multiple Aggregate function in a single query
4. [x] **Order By**: Only supported for coordinates.
5. [ ] **Limit**: Limiting the result to display.
6. [ ] **Mathematical Operators** `(+, - , *, / )`: Add support to use mathematical operators in the query.
7. [ ] **Aliases**: Add support to alias while querying.
8. [ ] **Join Operations**: Support joining tables and apply query.
9. [ ] **Nested Queries**: Add support to write nested queries.
10. [ ] **Custom Aggregate Functions**: Support custom aggregate functions.
133 changes: 119 additions & 14 deletions xql/main.py
@@ -22,6 +22,15 @@
from sqlglot import parse_one, exp
from xarray.core.groupby import DatasetGroupBy

command_info = {
    ".exit": "To exit from the current session.",
    ".set": "To set the dataset uri as a shortened key. e.g. .set era5 gs://{BUCKET}/dataset-uri",
    ".show": "To list down dataset shortened key. e.g. .show or .show [key]",
    "[query]": "Any valid sql like query."
}

table_dataset_map = {} # To store dataset shortened keys for a single session.

operate = {
"and" : lambda a, b: a & b,
"or" : lambda a, b: a | b,
@@ -184,14 +193,34 @@ def apply_aggregation(groups: t.Union[xr.Dataset, DatasetGroupBy], fun: str, dim
return aggregate_function_map[fun](groups, dim)


def get_table(e: exp.Expression) -> str:
    """
    Get the table name from an expression.
    Args:
        e (Expression): The expression containing table information.
    Returns:
        str: The table name.
    """
    # Extract the table name from the expression
    table = e.find(exp.Table).args['this'].args['this']

    # Check if the table is mapped in table_dataset_map
    if table in table_dataset_map:
        table = table_dataset_map[table]

    return table
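
The alias lookup in `get_table` can be illustrated in isolation, without sqlglot. The sketch below is not the commit's code: `resolve_table` is a hypothetical name, and the second URI is a made-up placeholder. It shows the same fall-through behaviour, where a registered key is swapped for its dataset URI and anything else is returned unchanged.

```python
import typing as t


def resolve_table(name: str, aliases: t.Dict[str, str]) -> str:
    """Return the URI registered for `name`, or `name` itself if it is not a key.

    Hypothetical stand-in for the lookup step in get_table.
    """
    return aliases.get(name, name)


# Mapping as it would look after:
#   .set era5 gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3
aliases = {"era5": "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3"}

print(resolve_table("era5", aliases))                         # registered key
print(resolve_table("gs://some-bucket/other.zarr", aliases))  # placeholder URI, passed through
```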


def parse_query(query: str) -> xr.Dataset:

expr = parse_one(query)

if not isinstance(expr, exp.Select):
return "ERROR: Only select queries are supported."

table = expr.find(exp.Table).args['this'].args['this']
table = get_table(expr)

is_star = expr.find(exp.Star)

@@ -201,7 +230,6 @@ def parse_query(query: str) -> xr.Dataset:

where = expr.find(exp.Where)
group_by = expr.find(exp.Group)
order_by = expr.find(exp.Order)

agg_funcs = {
var.args['this'].args['this'].args['this']: var.key
@@ -224,28 +252,105 @@ def parse_query(query: str) -> xr.Dataset:
groupby_fields = [ e.args['this'].args['this'] for e in group_by.args['expressions'] ]
ds = apply_group_by(groupby_fields, ds, agg_funcs)

if order_by:
orderby_fields = [(str(e)) for e in order_by.args['expressions'] ]
ds = apply_order_by(orderby_fields, ds)

return ds


def set_dataset_table(cmd: str) -> None:
    """
    Set the mapping between a key and a dataset.
    Args:
        cmd (str): The command string in the format ".set key val"
            where key is the identifier and val is the dataset table.
    """
    # Split the command into parts
    cmd_parts = cmd.split(" ")

    # Check if the command has the correct number of arguments
    if len(cmd_parts) == 3:
        # Extract key and val from the command
        _, key, val = cmd_parts
        # Update the dataset table mapping
        table_dataset_map[key] = val
    else:
        # Print an error message for incorrect arguments
        print("Incorrect args. Run .help .set for usage info.")
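
One thing to be aware of in `set_dataset_table`: `cmd.split(" ")` produces empty strings when the input has consecutive spaces, so `.set  era5 gs://...` (two spaces) yields four parts and is rejected. A possible alternative, sketched below under the assumption that keys and URIs never contain whitespace, is the no-argument `str.split()`, which collapses whitespace runs. `parse_set_command` is a hypothetical helper, not part of the commit.

```python
import typing as t


def parse_set_command(cmd: str) -> t.Optional[t.Tuple[str, str]]:
    """Parse '.set key val' and return (key, val), or None if malformed.

    Hypothetical helper. str.split() with no argument splits on any run
    of whitespace and ignores leading and trailing whitespace.
    """
    parts = cmd.split()
    if len(parts) != 3 or parts[0] != ".set":
        return None
    _, key, val = parts
    return key, val


# Extra spaces between the arguments are tolerated.
print(parse_set_command(".set  era5   gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3"))
# A missing value is reported as None rather than raising.
print(parse_set_command(".set era5"))
```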


def list_key_values(input: t.Dict[str, str]) -> None:
    """
    Display key-value pairs from a dictionary.
    Args:
        input (Dict[str, str]): The dictionary containing key-value pairs.
    """
    for cmd, desc in input.items():
        print(f"{cmd} => {desc}")


def display_help(cmd: str) -> None:
    """
    Display help information for commands.
    Args:
        cmd (str): The command string.
    """
    cmd_parts = cmd.split(" ")

    if len(cmd_parts) == 2:
        if cmd_parts[1] in command_info:
            print(f"{cmd_parts[1]} => {command_info[cmd_parts[1]]}")
        else:
            list_key_values(command_info)
    elif len(cmd_parts) == 1:
        list_key_values(command_info)
    else:
        print("Incorrect usage. Run .help or .help [cmd] for usage info.")


def display_table_dataset_map(cmd: str) -> None:
    """
    Display information from the table_dataset_map.
    Args:
        cmd (str): The command string.
    """
    cmd_parts = cmd.split(" ")

    if len(cmd_parts) == 2:
        if cmd_parts[1] in table_dataset_map:
            print(f"{cmd_parts[1]} => {table_dataset_map[cmd_parts[1]]}")
        else:
            list_key_values(table_dataset_map)
    else:
        list_key_values(table_dataset_map)


if __name__ == "__main__":

while True:

query = input("xql>")

if query == "exit":
if query == ".exit":
break

try:
result = parse_query(query)
except Exception:
result = "Something wrong with the query."
elif ".help" in query:
display_help(query)

elif ".set" in query:
set_dataset_table(query)

elif ".show" in query:
display_table_dataset_map(query)

if isinstance(result, xr.Dataset):
print(result.to_dataframe())
else:
print(result)
try:
result = parse_query(query)
except Exception as e:
result = f"ERROR: {type(e).__name__}: {e.__str__()}."

if isinstance(result, xr.Dataset):
print(result.to_dataframe())
else:
print(result)
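
The loop above routes input with substring checks such as `".set" in query`. A query whose dataset path happens to contain `.set`, `.show` or `.help` would then be treated as a command rather than parsed. The sketch below shows one way to classify input by its first token instead. `classify_input` is a hypothetical helper and the bucket path is a made-up example, so treat this as an illustration rather than the project's behaviour.

```python
COMMANDS = (".exit", ".help", ".set", ".show")


def classify_input(line: str) -> str:
    """Return the command name if the first token is a known command,
    'empty' for blank input, and 'query' otherwise.

    Hypothetical helper, not part of xql.
    """
    tokens = line.split()
    if not tokens:
        return "empty"
    if tokens[0] in COMMANDS:
        return tokens[0]
    return "query"


# The path contains ".set" as a substring, but the first token is SELECT.
print(classify_input("SELECT temperature FROM 'gs://bucket/run.settings.zarr'"))
print(classify_input(".set era5 gs://bucket/data.zarr"))
print(classify_input("  .help .set"))
```

Matching on the first token keeps the command set easy to extend, since each new command only needs an entry in `COMMANDS`.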
