Refactor to python package and restructure project directory (#16)
* rename generate.py to example.py and mv to directory with STAC generation scripts

* mkdir tests and mv validation.py to test_validation.py

* make dir for ci scripts; add convert.py to this dir.

* etl and coclico stac configs into src/coclicodata directory that can be packaged

* refactor

* add poetry for package management

* add azure vars to env

* mv keys function to cloud utils

* snakecase convention

* mv scripts to subdirs

* setup deltares drive configs

* refactor to python package

* usage instructions and integrated readme in package dirs here

* usage instructions and integrated readme in package dirs here

* mv readme to proj root

* coastal mask with coclicodata and coastmonitor package

* do not require deltares fields

* change id to raw gh path

* test schema

* change path

* deltares props no longer as required

* fp to href

* description for all files in repo

* bash script to upload stacs to azure cloud

* change schema uri to update json schema

* make stacs without deltares properties as required

* test to avoid collection id duplication

* mv root inside test as convention

* black formatting

* pre commit config

* pre commit cleanup

* add pre commit instructions to readme

* change href extension to main branch

* coastal mask stacs without redundant frontend properties

* added coastal mask to stac catalog

* sync instead of uploading

* load catalog outside function for CI tests

* fix href to root in tests

* reset hrefs to feature branch

---------

Co-authored-by: floriscalkoen <[email protected]>
FlorisCalkoen and floriscalkoen authored Aug 31, 2023
1 parent 4210fb8 commit f88c04d
Showing 358 changed files with 73,076 additions and 72,805 deletions.
3 changes: 3 additions & 0 deletions .env.example
@@ -1 +1,4 @@
MAPBOX_ACCESS_TOKEN=""
AZURE_STORAGE_ACCOUNT=""
AZURE_STORAGE_SAS_TOKEN=""
GH_COASTMONITOR_TOKEN=""
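
These variables configure access to the Azure storage that backs the catalog. A minimal sketch of how they might be consumed, assuming `python-dotenv` (the package's actual loading mechanism is not shown in this commit):

```python
# Sketch only: reads a .env file (copied from .env.example) and exposes the
# Azure credentials to the process environment. python-dotenv is an assumption.
import os

from dotenv import load_dotenv

load_dotenv()  # looks for a .env file in the current directory or its parents
account = os.getenv("AZURE_STORAGE_ACCOUNT")
sas_token = os.getenv("AZURE_STORAGE_SAS_TOKEN")
```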
2 changes: 1 addition & 1 deletion .github/workflows/release.yml
@@ -36,7 +36,7 @@ jobs:
path: live
- name: Release
run: |
python ci/convert.py
rm -rf ../live/current
cp -rp live ../live/current
cd ../live
23 changes: 23 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,23 @@
ci:
  autofix_prs: false
  autoupdate_schedule: weekly

repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.4.0
    hooks:
      - id: trailing-whitespace
      - id: debug-statements
      - id: check-yaml
      - id: check-added-large-files

  - repo: https://github.com/psf/black
    rev: 23.7.0
    hooks:
      - id: black
        language_version: python3.11

  - repo: https://github.com/kynan/nbstripout
    rev: 0.6.1
    hooks:
      - id: nbstripout
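
Once installed (see the pre-commit instructions in the README below), these hooks run on every `git commit`; they can also be run across the whole repository with `pre-commit run --all-files`.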
133 changes: 119 additions & 14 deletions README.md
@@ -1,36 +1,141 @@
# coclicodata
STAC catalog for CoCliCo

This repository contains code to maintain the CoCliCo STAC catalog. Please note that
this is a **relative** STAC catalog for development purposes.

## Usage

Given that `coclicodata` is under active development, it's recommended to clone the repository and then install it in 'editable' mode. This way, any changes made to the package are immediately available without needing a reinstall.

Follow these steps for installation:

1. **Clone the repository**:

``` bash
git clone https://github.com/openearth/coclicodata.git
```

2. **Install the environment**:

``` bash
mamba env create -f /path/to/coclicodata/environment.yaml
```

3. **Activate the environment**:

``` bash
mamba activate coclico
```

4. **Install the package in editable mode**:

``` bash
pip install -e /path/to/coclicodata
```

After installation, you can easily import and use any module or function from the
`coclicodata` package in your Python scripts or interactive sessions:

```python
from coclicodata.coclico_stac import utils
# Further code utilizing the utils module...
```

## Use pre-commit locally

Use pre-commit to ensure consistent code formatting and to keep the repository small by stripping notebook output before it is committed.

In the root of the repository run:

```bash
pre-commit install
```

If the hooks catch issues when you commit your changes, most of them will be fixed automatically:

```bash
git commit -m "Your message"
```
Once the hooks pass, push your changes.

## Test

You can run `pytest` to check whether your current STAC collection is valid. The command
will automatically run the test scripts that are maintained in `tests/test_*.py`.
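
A minimal sketch of the kind of check such a test might perform, using `pystac`; the catalog path and the use of `pystac` are assumptions, not taken from the repository's test code:

```python
# Hypothetical validation test: load the development catalog and validate it
# and all of its children against the STAC JSON schemas.
import pystac

catalog = pystac.Catalog.from_file("current/catalog.json")  # path is an assumption
catalog.validate_all()  # raises pystac.STACValidationError if anything is invalid
```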

## Release

On successful validation of the STAC catalog in the main branch, an **absolute** version
of the catalog will be published in the `live` branch that can be used externally.
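
A hedged sketch of what such a relative-to-absolute conversion could look like with `pystac`; the paths and root URL below are placeholders, not the values used by `ci/convert.py`:

```python
# Illustrative only: rewrite the relative development catalog as an absolute,
# published catalog suitable for external use.
import pystac

catalog = pystac.Catalog.from_file("current/catalog.json")  # placeholder path
catalog.normalize_and_save(
    root_href="https://example.com/live",  # placeholder root URL
    catalog_type=pystac.CatalogType.ABSOLUTE_PUBLISHED,
)
```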

## CoCliCoData repository structure

- **ci**
- `convert.py`: CI script to convert current to live stacs.

- **current**: STAC catalog that is used for web portal development.

- **docs**: Various documentation images like flowcharts and diagrams representing data formats and workflows.

- **json-schema**
- `schema.json`: JSON schema definition for the frontend Deltares STAC extension.

- **live**: STAC catalog that is used by the web-portal to serve end users.

- **notebooks**: Jupyter notebooks used to load, explore and transform the data;
typically one per dataset, to make it CF compliant.

- **scripts**
- **bash**: Shell scripts, like `build-stacs.sh` and `upload-stacs-to-azure.sh`, for various automation tasks.
- **create_stacs**: Python scripts for creating STACs, each typically corresponding to a specific dataset or processing step.
- **utils**: Utility scripts, like `coclico_common_vocab_from_stac.py` and `upload_and_generate_geojson.py`, for various data operations.

- **src/coclicodata**
- `__init__.py`: Main package initialization.
- `drive_config.py`: Configuration settings for the drive or storage medium.
- **etl**
- `__init__.py`: Subpackage initialization.
- `cf_compliancy_checker.py`: Checks for compliancy with the Climate and Forecast (CF) conventions.
- `cloud_utils.py`: Utilities for cloud-based operations and data processing.
- `extract.py`: Data extraction and transform functionalities.

- **coclico_stac**
- `__init__.py`: Subpackage initialization.
- `datacube.py`: Functions for extracting dimension shapes and metadata from zarr stores.
- `extension.py`: CoCliCo STAC extension that is used for frontend visualization.
- `io.py`: Defines the CoCliCo JSON I/O strategy for STAC catalogs.
- `layouts.py`: Provides CoCliCo layout strategies for STAC for the data formats used.
- `templates.py`: Defines CoCliCo templates for generating STAC items, assets and collections.
- `utils.py`: Utility functions for data migration and other STAC-related operations.

- **stories**: Contains narrative data and associated images.

- **tests**: Contains test scripts to ensure code quality and functionality.

- `.pre-commit-config.yaml`: Hooks that will be run when making a commit.
- `metadata_template.json`: Template file for a STAC collection from a dataset.
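
Because the package lives under `src/`, the tree above maps directly onto import paths. A sketch, with module names taken from the tree; what each module exports is an assumption, so only module-level imports are shown:

```python
# Module names mirror the repository structure above.
from coclicodata import drive_config
from coclicodata.etl import cloud_utils, extract
from coclicodata.coclico_stac import templates, utils
```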

## Metadata

The following attributes are required at dataset level (a hypothetical example follows this list):

- title -
- title abbreviation -
- description - description that will be used as the dataset explanation in the web portal.
- short description - description which is convenient when loading the data into a
programming environment
- institution - data producer
- providers - data host (Deltares / CoCliCo)
- name
- url
- roles - e.g., providers, licensor
- description -
- history - list of institutions and people who have processed the data
- media_type - [also known as mime type](https://www.iana.org/assignments/media-types/media-types.xhtml)
- spatial extent - bbox [minx, miny, maxx, maxy]
- temporal extent - time interval in [iso 8601](https://en.wikipedia.org/wiki/ISO_8601), i.e., YYYY-MM-DDTHH:mm:ssZ
- license -
- author -
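
A hypothetical dataset-level metadata record covering the required attributes; every key and value below is invented for illustration and may not match `metadata_template.json` exactly:

```python
# All names and values are placeholders, not a real CoCliCo dataset.
metadata = {
    "title": "Example Coastal Dataset",
    "title_abbreviation": "ecd",
    "description": "Longer description shown as the dataset explanation in the web portal.",
    "short_description": "Short description for use in a programming environment.",
    "institution": "Example Institute",  # data producer
    "providers": [
        {"name": "Deltares", "url": "https://www.deltares.nl", "roles": ["host"], "description": ""}
    ],
    "history": ["2023: processed by Example Institute"],
    "media_type": "application/x-netcdf",  # IANA media type
    "spatial_extent": [-180.0, -90.0, 180.0, 90.0],  # bbox [minx, miny, maxx, maxy]
    "temporal_extent": ["2000-01-01T00:00:00Z", "2020-12-31T23:59:59Z"],  # ISO 8601
    "license": "CC-BY-4.0",
    "author": "Jane Doe",
}
```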

The following attributes are optional at dataset level:
- keywords - these can be used to search using the STAC API
@@ -39,13 +144,13 @@ The following attributes are optional at dataset level:
Publisher. (resourceTypeGeneral). Identifier format (Zenodo specification)
- doi - following [Zenodo specification](https://about.zenodo.org/principles/)
- thumbnail asset image - image that will be shown to represent the dataset
- columns - when data is tabular and has column names

The following attributes are required at variable level (a sketch follows this list):

- long_name - descriptive name
- standard_name - iff available in [CF convention standard table](https://cfconventions.org/Data/cf-standard-names/current/build/cf-standard-name-table.html)
- units - follow CF conventions where possible; leave blank when no units.
- cell_bnds
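
A small illustration of these variable-level attributes written as CF-style `xarray` attrs; the variable and its values are invented:

```python
# Placeholder variable; the standard_name is taken from the CF standard table.
import numpy as np
import xarray as xr

esl = xr.DataArray(np.zeros(3), dims=["time"], name="esl")
esl.attrs = {
    "long_name": "extreme sea level",
    "standard_name": "sea_surface_height_above_mean_sea_level",
    "units": "m",  # follow CF conventions; leave blank when the variable has no units
}
```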

The following attributes are optional at variable level:
@@ -54,7 +159,7 @@
The following coordinate labels are required:

- crs or spatial_ref
- time

### Controlled vocabulary
| **name** | **long_name** | **standard_name** | **data_structure_type** | **dtype** |
File renamed without changes.
