Added Road-map in README file

google · Jan 10, 2024 · 0b0b7ae · 0b0b7ae
1 parent 0c2165c
commit 0b0b7ae
Showing 2 changed files with 202 additions and 39 deletions.
diff --git a/xql/README.md b/xql/README.md
@@ -1,58 +1,116 @@
 # `xql` - Querying Xarray Datasets with SQL
 
-Running SQL like queries on Xarray Datasets.
+Running SQL-like queries on Xarray Datasets. Consider a dataset as a table and a data variable as a column.
 > Note: For now, we support only zarr datasets.
 
 # Supported Features
 
 * **`Select` Variables** - From a large dataset having hundreds of variables select only needed variables.
 * **Apply `where` clause** - A general where condition like SQL. Applicable for queries which includes data for specific time range or only for specific regions. 
-* > Note: For now, we support conditions on coordinates.
 * **`group by` and `order by` Functions** - Both are supported on the coordinates  only. e.g. time, latitude, longitude, etc.
 * **`aggregate` Functions** - Aggregate functions `AVG()`, `MIN()`, `MAX()`, etc. are supported on any coordinate like time.
+* For more, check out the [roadmap](https://github.com/google/weather-tools/tree/xql-init/xql#roadmap).
+> Note: For now, we support `where` conditions on coordinates only.
 
 # Quickstart
 
 ## Prerequisites
 
-Get an access to the dataset you want to query. As an example we're using the analysis ready era5 public dataset. [full_37-1h-0p25deg-chunk-1.zarr-v3](https://pantheon.corp.google.com/storage/browser/gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3?project=gcp-public-data-signals).
+Get access to the dataset you want to query. As an example, we use the analysis-ready ERA5 public dataset: [full_37-1h-0p25deg-chunk-1.zarr-v3](https://pantheon.corp.google.com/storage/browser/gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3?project=gcp-public-data-signals).
 
-For this gcloud must be configured in the environment. [Initializing the gcloud CLI](https://cloud.google.com/sdk/docs/initializing).
+For this, `gcloud` must be configured in your local environment. Refer to [Initializing the gcloud CLI](https://cloud.google.com/sdk/docs/initializing) to configure `gcloud` locally.
 
 ## Usage
 
-Install required packages
 ```
+# Install required packages
 pip install -r xql/requirements.txt
-```
 
-Jump into xql
-```
+# Jump into xql
 python xql/main.py
 ```
 ---
+### Supported meta commands
+`.help`: For usage info.
 
-Running a simple query on dataset. Comparing with SQL a data variable is like a column and table is like a dataset.
-```
-SELECT evaporation, geopotential_at_surface, temperature FROM '{TABLE}'
-```
-Replace `{TABLE}` with dataset uri. Eg. `gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3`.
+`.exit`: To exit from the xql interpreter.
 
----
-Apply a conditions. Query to get temperature of arctic region in last winter:
+`.set`: To set the dataset uri as a shortened key.
 ```
-SELECT temperature FROM '{TABLE}' WHERE time >= '2022-12-01' AND time < '2023-03-01' AND latitude >= 66.5
+.set era5 gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3
 ```
----
-Aggregating results using Group By and Aggregate function. Daily average of temperature of last winter in arctic region.
+
+`.show`: To list the shortened dataset keys. E.g. `.show` or `.show [key]`
+
 ```
-SELECT AVG(temperature) FROM '{TABLE}' WHERE time >= '2022-12-01' AND time < '2023-03-01' AND latitude >= 66.5
-GROUP BY time_day
+.show era5
 ```
-Replace `time_day` to `time_month` or `time_year` if monthly or yearly average is needed. Also use MIN() and MAX() functions same way as AVG().
+
+`[query]`  =>  Any valid SQL-like query.
 
 ---
-Order by latitude, longitude in ascending and descending order.
-```
-SELECT surface_pressure FROM 'gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3' WHERE time >= '2021-06-01T00:00:00Z' AND time <= '2021-06-30T23:59:59Z' ORDER BY latitude, longitude DESC LIMIT 1
-```
+### Example Queries
+
+1. Apply conditions. Query to get the temperature of the Arctic region in January 2022:
+    ```
+    SELECT 
+        temperature 
+    FROM 'gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3' 
+    WHERE
+        time >= '2022-01-01' AND 
+        time < '2022-02-01' AND 
+        latitude >= 66.5
+    ```
+    > Note: Multiline queries are not yet supported. Convert copied queries into a single line before execution.
+
+2. Aggregate results using `GROUP BY` and an aggregate function: daily average temperature of the Arctic region in January 2022.
+    First, set a shortened key for the table name.
+
+    ```
+    .set era5 gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3
+    ```
+    ```
+    SELECT 
+        AVG(temperature) 
+    FROM era5
+    WHERE 
+        time >= '2022-01-01' AND 
+        time < '2022-02-01' AND 
+        latitude >= 66.5
+    GROUP BY time_day
+    ```
+    Replace `time_day` with `time_month` or `time_year` if a monthly or yearly average is needed. `MIN()` and `MAX()` can be used the same way as `AVG()`.
+
+3. `caveat`: The above queries run on the client's local machine and generate a large two-dimensional array, so querying a very large amount of data will cause out-of-memory errors.
+
+    e.g. A query like the one below will give OOM errors if the client machine doesn't have enough RAM.
+
+    ```
+    SELECT 
+        evaporation,
+        geopotential_at_surface,
+        temperature 
+    FROM era5
+    ```
+
+# Roadmap
+
+_Updated on 2024-01-08_
+
+1. [x] **Select Variables**
+    1. [ ] On Coordinates
+    2. [x] On Variables 
+2. [x] **Where Clause**: `=`, `>`, `>=`, `<`, `<=`, etc.
+    1. [x] On Coordinates
+    2. [ ] On Variables 
+3. [x] **Aggregate Functions**: Only `AVG()`, `MIN()`, `MAX()`, `SUM()` are supported.
+   1. [x] With Group By
+   2. [ ] Without Group By
+   3. [ ] Multiple Aggregate function in a single query
+4. [x] **Order By**: Only supported for coordinates.
+5. [ ] **Limit**: Limiting the result to display.
+6. [ ] **Mathematical Operators** `(+, -, *, /)`: Add support to use mathematical operators in the query.
+7. [ ] **Aliases**: Add support to alias while querying.
+8. [ ] **Join Operations**: Support joining tables and apply query.
+9. [ ] **Nested Queries**: Add support to write nested queries.
+10. [ ] **Custom Aggregate Functions**: Support custom aggregate functions.
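
To put a rough number on the README's out-of-memory caveat, the sketch below estimates the in-memory size of the January 2022 Arctic `temperature` selection from the example queries. The grid shape (0.25 degree spacing, 1440 longitudes), the 37 pressure levels, the hourly time step, and 4-byte float32 values are assumptions inferred from the dataset name `full_37-1h-0p25deg`, not values read from the dataset. The helper names `latitudes_at_or_above` and `estimate_bytes` are hypothetical.

```python
from math import prod

# All sizes below are assumptions inferred from the dataset name
# "full_37-1h-0p25deg"; check the real dataset's dims and dtype before
# relying on them.
BYTES_PER_VALUE = 4        # assumed float32 values
LEVELS = 37                # assumed number of pressure levels
LONGITUDES = 1440          # 360 degrees / 0.25 degree spacing
HOURS_JAN_2022 = 31 * 24   # hourly steps in January


def latitudes_at_or_above(min_lat: float, step: float = 0.25) -> int:
    """Count grid latitudes in [min_lat, 90] on a regular grid."""
    return int(round((90.0 - min_lat) / step)) + 1


def estimate_bytes(dims: dict, bytes_per_value: int = BYTES_PER_VALUE) -> int:
    """Size of one dense variable: product of dimension lengths times value size."""
    return prod(dims.values()) * bytes_per_value


arctic_dims = {
    "time": HOURS_JAN_2022,
    "level": LEVELS,
    "latitude": latitudes_at_or_above(66.5),
    "longitude": LONGITUDES,
}
arctic_bytes = estimate_bytes(arctic_dims)
print(f"{arctic_bytes / 2**30:.1f} GiB for one variable")
```

Under these assumptions a single variable for one month of the Arctic region already comes to roughly 14 GiB before any DataFrame overhead, which is why an unfiltered `SELECT` over several variables is likely to exhaust client memory.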
diff --git a/xql/main.py b/xql/main.py
@@ -22,6 +22,15 @@
 from sqlglot import parse_one, exp
 from xarray.core.groupby import DatasetGroupBy
 
+command_info = {
+    ".exit": "To exit from the current session.",
+    ".set": "To set the dataset uri as a shortened key. e.g. .set era5 gs://{BUCKET}/dataset-uri",
+    ".show": "To list the shortened dataset keys. e.g. .show or .show [key]",
+    "[query]": "Any valid SQL-like query."
+}
+
+table_dataset_map = {} # To store dataset shortened keys for a single session.
+
 operate = {
     "and" : lambda a, b: a & b,
     "or" : lambda a, b: a | b,
@@ -184,14 +193,34 @@ def apply_aggregation(groups: t.Union[xr.Dataset, DatasetGroupBy], fun: str, dim
     return aggregate_function_map[fun](groups, dim)
 
 
+def get_table(e: exp.Expression) -> str:
+    """
+    Get the table name from an expression.
+
+    Args:
+        e (Expression): The expression containing table information.
+
+    Returns:
+        str: The table name.
+    """
+    # Extract the table name from the expression
+    table = e.find(exp.Table).args['this'].args['this']
+
+    # Check if the table is mapped in table_dataset_map
+    if table in table_dataset_map:
+        table = table_dataset_map[table]
+
+    return table
+
+
 def parse_query(query: str) -> xr.Dataset:
 
     expr = parse_one(query)
 
     if not isinstance(expr, exp.Select):
         return "ERROR: Only select queries are supported."
 
-    table = expr.find(exp.Table).args['this'].args['this']
+    table = get_table(expr)
 
     is_star = expr.find(exp.Star)
 
@@ -201,7 +230,6 @@ def parse_query(query: str) -> xr.Dataset:
 
     where = expr.find(exp.Where)
     group_by = expr.find(exp.Group)
-    order_by = expr.find(exp.Order)
 
     agg_funcs = {
         var.args['this'].args['this'].args['this']: var.key
@@ -224,28 +252,105 @@ def parse_query(query: str) -> xr.Dataset:
         groupby_fields = [ e.args['this'].args['this'] for e in group_by.args['expressions'] ]
         ds = apply_group_by(groupby_fields, ds, agg_funcs)
 
-    if order_by:
-        orderby_fields = [(str(e)) for e in order_by.args['expressions'] ]
-        ds = apply_order_by(orderby_fields, ds)
-
     return ds
 
 
+def set_dataset_table(cmd: str) -> None:
+    """
+    Set the mapping between a key and a dataset.
+
+    Args:
+        cmd (str): The command string in the format ".set key val"
+            where key is the identifier and val is the dataset table.
+    """
+    # Split the command into parts
+    cmd_parts = cmd.split()  # split on any run of whitespace so extra spaces don't change the arg count
+
+    # Check if the command has the correct number of arguments
+    if len(cmd_parts) == 3:
+        # Extract key and val from the command
+        _, key, val = cmd_parts
+        # Update the dataset table mapping
+        table_dataset_map[key] = val
+    else:
+        # Print an error message for incorrect arguments
+        print("Incorrect args. Run .help .set for usage info.")
+
+
+def list_key_values(input: t.Dict[str, str]) -> None:
+    """
+    Display key-value pairs from a dictionary.
+
+    Args:
+        input (Dict[str, str]): The dictionary containing key-value pairs.
+    """
+    for cmd, desc in input.items():
+        print(f"{cmd}  =>  {desc}")
+
+
+def display_help(cmd: str) -> None:
+    """
+    Display help information for commands.
+
+    Args:
+        cmd (str): The command string.
+    """
+    cmd_parts = cmd.split(" ")
+
+    if len(cmd_parts) == 2:
+        if cmd_parts[1] in command_info:
+            print(f"{cmd_parts[1]}  =>  {command_info[cmd_parts[1]]}")
+        else:
+            list_key_values(command_info)
+    elif len(cmd_parts) == 1:
+        list_key_values(command_info)
+    else:
+        print("Incorrect usage. Run .help or .help [cmd] for usage info.")
+
+
+def display_table_dataset_map(cmd: str) -> None:
+    """
+    Display information from the table_dataset_map.
+
+    Args:
+        cmd (str): The command string.
+    """
+    cmd_parts = cmd.split(" ")
+
+    if len(cmd_parts) == 2:
+        if cmd_parts[1] in table_dataset_map:
+            print(f"{cmd_parts[1]}  =>  {table_dataset_map[cmd_parts[1]]}")
+        else:
+            list_key_values(table_dataset_map)
+    else:
+        list_key_values(table_dataset_map)
+
+
 if __name__ == "__main__":
 
     while True:
 
         query = input("xql>")
 
-        if query == "exit":
+        if query == ".exit":
             break
 
-        try:
-            result = parse_query(query)
-        except Exception:
-            result = "Something wrong with the query."
+        elif query.startswith(".help"):
+            display_help(query)
+
+        elif query.startswith(".set"):
+            set_dataset_table(query)
+
+        elif query.startswith(".show"):
+            display_table_dataset_map(query)
 
-        if isinstance(result, xr.Dataset):
-            print(result.to_dataframe())
         else:
-            print(result)
+            try:
+                result = parse_query(query)
+            except Exception as e:
+                result = f"ERROR: {type(e).__name__}: {e}."
+
+            if isinstance(result, xr.Dataset):
+                print(result.to_dataframe())
+            else:
+                print(result)
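
The meta-command handling added in `main.py` can be sketched in isolation as follows. This is a minimal, standard-library-only illustration, not the commit's implementation: it covers only `.exit`, `.set`, and `.show`, and it matches on the first whitespace-separated token so that a query whose dataset URI happens to contain `.set` is still treated as a query. It also collapses a pasted multi-line query into a single line, in the spirit of the README note. The names `aliases`, `normalize`, and `dispatch`, and the `gs://bucket/...` URIs in the usage lines, are hypothetical.

```python
import typing as t

# Session-scoped alias map, playing the role of table_dataset_map in the diff.
aliases: t.Dict[str, str] = {}


def normalize(line: str) -> str:
    """Collapse newlines and repeated whitespace into single spaces.

    Caveat: this also collapses whitespace inside quoted string literals.
    """
    return " ".join(line.split())


def dispatch(line: str) -> str:
    """Handle one input line and return a short description of the action."""
    line = normalize(line)
    if not line:
        return "empty"

    head, *args = line.split(" ")

    if head == ".exit":
        return "exit"

    if head == ".set":
        if len(args) != 2:
            return "error: usage is .set <key> <uri>"
        key, uri = args
        aliases[key] = uri
        return f"set {key}"

    if head == ".show":
        if args and args[0] in aliases:
            return f"{args[0]}  =>  {aliases[args[0]]}"
        return "\n".join(f"{k}  =>  {v}" for k, v in aliases.items())

    # Anything else is passed through as a (single-line) query.
    return f"query: {line}"


print(dispatch(".set era5 gs://bucket/data.zarr"))
print(dispatch(".show era5"))
print(dispatch("SELECT a\nFROM 'gs://bucket/x.settings.zarr'\nWHERE time > 1"))
```

Matching on the first token keeps command detection independent of the query text, at the cost of requiring commands to start the line.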