Merge pull request #1 from jhardenberg/devel/extend

Devel/extend
jhardenberg · Feb 13, 2023 · 419e3a2
2 parents 6706fa8 + 7c52b89
commit 419e3a2

Showing 18 changed files with 1,134 additions and 36 deletions.
diff --git a/.github/workflows/mambatest.yml b/.github/workflows/mambatest.yml
@@ -0,0 +1,51 @@
+# This workflow installs Python dependencies using micromamba, then lints and runs the tests on several versions of Python
+# For more information see: https://autobencoder.com/2020-08-24-conda-actions/
+
+name: Mamba PyTest
+
+on:
+  push:
+    branches: [ main ]
+  pull_request:
+    branches: [ main ]
+  workflow_dispatch:
+
+permissions:
+  contents: read
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    strategy:
+      fail-fast: false
+      matrix:
+        python-version: ["3.7", "3.8", "3.9", "3.10"]
+    defaults:
+      run:
+        shell: bash -el {0}
+    steps:
+    - uses: actions/checkout@v3
+    - name: provision-with-micromamba
+      uses: mamba-org/provision-with-micromamba@v14
+      with:
+        environment-file: environment.yml
+        environment-name: smmregrid
+        cache-downloads: true
+        extra-specs: |
+            python=${{ matrix.python-version }}
+    - name: Install smmregrid
+      run: |
+        # install package
+        pip install -e .
+    - name: Lint with flake8
+      run: |
+        # install flake8
+        conda install flake8
+        # stop the build if there are Python syntax errors or undefined names
+        flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
+        # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
+        flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
+    - name: Test with pytest
+      run: |
+        conda install pytest
+        python -m pytest
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,3 @@
+smmregrid.egg-info
+__pycache__
+*.idx
diff --git a/README.md b/README.md
@@ -1,11 +1,30 @@
 # smmregrid
 A compact regridder using sparse matrix multiplication
 
-This repository represents a modification of the regridding routines in [climtas](https://github.com/ScottWales/climtas) by Scott Wales, which already implements efficiently this idea and has no other significant dependencies (it does not use iris or esmf for regridding).
+This repository represents a modification of the regridding routines in [climtas](https://github.com/ScottWales/climtas) by Scott Wales, which already implements this idea efficiently and has no other significant dependencies (it does not use iris).
+The regridder makes efficient use of sparse matrix multiplication with dask, plus some manipulation of the coordinates.
 
-I only had to change a few lines of code to make it compatible with unstructured grids. The regridder uses efficiently sparse matrix multiplication with dask + some manipulation of the coordinates (which would have to be revised/checked again)
+Please note that this tool is not intended as "another interpolation tool", but rather as a method to apply pre-computed weights (from CDO, which is currently tested, or from ESMF, which is not yet supported) within the Python environment.
+The speedup is estimated to be about 1.5 to 5 times, slightly lower if the files are then written to disk. 2D and 3D data are supported on all the grids supported by CDO, and both xarray.Dataset and xarray.DataArray can be used. Masks are treated in a simple way but are correctly transferred. Attributes are kept.
+
+It is safer to run it through conda/mamba. Create the environment with:
+
+```
+conda env create -f environment.yml
+```
+
+then activate the environment:
+
+```
+conda activate smmregrid
+```
+and install smmregrid in editable mode:
 
-Install with
 ```
 pip install -e .
 ```
+
+Cautionary notes:
+- It does not work correctly if the xarray.Dataset includes fields with different land-sea masks (e.g. temperature and SST).
+- It does not support ESMF weights.
+
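Reviewer note on the README text above: the idea of applying pre-computed weights through a sparse matrix product can be illustrated with a minimal, pure-Python sketch. This is not the smmregrid implementation, which according to the README relies on dask and sparse. The function name `apply_weights` and the toy weights are hypothetical, chosen only to show that each target cell is a weighted sum of a few source cells.

```python
from typing import List, Sequence


def apply_weights(
    src: Sequence[float],
    rows: Sequence[int],
    cols: Sequence[int],
    vals: Sequence[float],
    n_dst: int,
) -> List[float]:
    """Multiply a sparse weight matrix, stored as (row, col, value) triplets,
    by a flattened source field and return the flattened target field."""
    dst = [0.0] * n_dst
    for r, c, w in zip(rows, cols, vals):
        # Only the non-zero weights are stored, so the cost scales with
        # the number of weights, not with n_dst * len(src).
        dst[r] += w * src[c]
    return dst


# Hypothetical example: four source cells averaged pairwise onto two target cells.
src_field = [1.0, 3.0, 10.0, 20.0]
rows = [0, 0, 1, 1]          # target cell index of each weight
cols = [0, 1, 2, 3]          # source cell index of each weight
vals = [0.5, 0.5, 0.5, 0.5]  # weight values

regridded = apply_weights(src_field, rows, cols, vals, n_dst=2)
print(regridded)  # [2.0, 15.0]
```

With real data the triplets would come from a weight file and the loop would be replaced by a sparse-matrix product, but the arithmetic per target cell is the same.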
diff --git a/dask_playground.ipynb b/dask_playground.ipynb
@@ -0,0 +1,206 @@
+{
+ "cells": [
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Tests for SMM (with dask) versus CDO\n",
+    "\n",
+    "These are the same speed tests, but using dask. Surprisingly, the code is much slower, so something is probably wrong."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "/work/users/paolo/miniconda3/envs/DevECmean4/lib/python3.10/site-packages/distributed/node.py:182: UserWarning: Port 8787 is already in use.\n",
+      "Perhaps you already have a cluster running?\n",
+      "Hosting the HTTP server on port 44465 instead\n",
+      "  warnings.warn(\n"
+     ]
+    }
+   ],
+   "source": [
+    "from time import time\n",
+    "import timeit\n",
+    "import os\n",
+    "import numpy as np\n",
+    "import xarray as xr\n",
+    "from smmregrid import cdo_generate_weights, Regridder\n",
+    "from smmregrid.checker import check_cdo_regrid # this is a new function introduced to verify the output\n",
+    "from cdo import Cdo\n",
+    "import pandas as pd\n",
+    "cdo = Cdo()\n",
+    "\n",
+    "# location and names of the input data files\n",
+    "indir='tests/data'\n",
+    "filelist = ['onlytos-ipsl.nc','tas-ecearth.nc', '2t-era5.nc','tos-fesom.nc']\n",
+    "tfile = os.path.join(indir, 'r360x180.nc')\n",
+    "\n",
+    "# method for remapping\n",
+    "methods = ['nn','con']\n",
+    "accesses = ['DataArray', 'Data']\n",
+    "\n",
+    "from dask.distributed import LocalCluster, Client\n",
+    "cluster = LocalCluster(ip=\"0.0.0.0\", threads_per_worker=1, n_workers=2)\n",
+    "client = Client(cluster)"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Remapping (with weights available)\n",
+    "\n",
+    "This is the real goal of smmregrid. Here we test the remapping when the weights are pre-computed. Since SMM does not have to write anything to disk, it is several times faster, between 5 and 10 times. Running with a Dataset implies a bit of overhead (20%). So far, masks do not seem to be an issue."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>CDO</th>\n",
+       "      <th>SMM (Dataset)</th>\n",
+       "      <th>SMM (DataArray)</th>\n",
+       "      <th>SMM (DataSet+NoMask)</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>onlytos-ipsl.nc</th>\n",
+       "      <td>1.0</td>\n",
+       "      <td>0.726599</td>\n",
+       "      <td>0.427539</td>\n",
+       "      <td>0.427731</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>tas-ecearth.nc</th>\n",
+       "      <td>1.0</td>\n",
+       "      <td>0.902123</td>\n",
+       "      <td>0.841263</td>\n",
+       "      <td>0.869179</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2t-era5.nc</th>\n",
+       "      <td>1.0</td>\n",
+       "      <td>0.673339</td>\n",
+       "      <td>0.694263</td>\n",
+       "      <td>0.642410</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>tos-fesom.nc</th>\n",
+       "      <td>1.0</td>\n",
+       "      <td>0.407764</td>\n",
+       "      <td>0.405918</td>\n",
+       "      <td>0.411269</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "                 CDO  SMM (Dataset)  SMM (DataArray)  SMM (DataSet+NoMask)\n",
+       "onlytos-ipsl.nc  1.0       0.726599         0.427539              0.427731\n",
+       "tas-ecearth.nc   1.0       0.902123         0.841263              0.869179\n",
+       "2t-era5.nc       1.0       0.673339         0.694263              0.642410\n",
+       "tos-fesom.nc     1.0       0.407764         0.405918              0.411269"
+      ]
+     },
+     "execution_count": 2,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "# number of repetitions for the timing check\n",
+    "nr = 10\n",
+    "\n",
+    "data =[]\n",
+    "for filein in filelist: \n",
+    "\n",
+    "    # CDO\n",
+    "    wfile = cdo.gencon(tfile, input = os.path.join(indir,filein))\n",
+    "    one = timeit.timeit(lambda: cdo.remap(tfile + ',' + wfile, input = os.path.join(indir,filein), returnXDataset = True), number = nr)\n",
+    "    #print(filein + ': Exectime CDO Remap ' + str(one/nr))\n",
+    "\n",
+    "    # SMM\n",
+    "    xfield = xr.open_mfdataset(os.path.join(indir,filein))\n",
+    "    wfield = cdo_generate_weights(os.path.join(indir,filein), tfile, method = 'con')\n",
+    "    interpolator = Regridder(weights=wfield)\n",
+    "    # select the variables that have a time dimension and no bnds dimension (could work)\n",
+    "    myvar = [var for var in xfield.data_vars \n",
+    "             if 'time' in xfield[var].dims and 'bnds' not in xfield[var].dims]\n",
+    "    two = timeit.timeit(lambda: interpolator.regrid(xfield), number = nr)\n",
+    "    three = timeit.timeit(lambda: interpolator.regrid(xfield[myvar]), number = nr)\n",
+    "    four = timeit.timeit(lambda: interpolator.regrid(xfield[myvar], masked = False), number = nr)\n",
+    "    data.append([one, two, three, four])\n",
+    "\n",
+    "    #print(filein + ': Exectime SMM Remap (DataSet) ' + str(two/nr))\n",
+    "    #print(filein + ': Exectime SMM Remap (DataArray) ' + str(three/nr))\n",
+    "    #print(filein + ': Exectime SMM Remap (DataSet+NoMask) ' + str(four/nr))\n",
+    "\n",
+    "cnames = ['CDO', 'SMM (Dataset)', 'SMM (DataArray)', 'SMM (DataSet+NoMask)']\n",
+    "df = pd.DataFrame(data, index = filelist, columns = cnames)\n",
+    "df.div(df[cnames[0]],axis =0)\n",
+    "\n",
+    "client.shutdown()\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "DevECmean4",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.8"
+  },
+  "orig_nbformat": 4,
+  "vscode": {
+   "interpreter": {
+    "hash": "d1a27f430e855354fabe9b58ad426cbc88af57f8b66247655f5de977d5b44f64"
+   }
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
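Reviewer note on the notebook above: the timing table is built by running each variant with `timeit` and dividing every column by the CDO column. A minimal sketch of that pattern, using only the standard library, might look like the following. The helper `relative_timings` and the two stand-in workloads are hypothetical and are not part of smmregrid or of the notebook.

```python
import timeit
from typing import Callable, Dict


def relative_timings(
    candidates: Dict[str, Callable[[], object]],
    baseline: str,
    number: int = 10,
) -> Dict[str, float]:
    """Time each callable `number` times and express the total elapsed
    time as a fraction of the baseline's total elapsed time."""
    raw = {name: timeit.timeit(func, number=number) for name, func in candidates.items()}
    return {name: elapsed / raw[baseline] for name, elapsed in raw.items()}


# Hypothetical stand-ins for the CDO call and the smmregrid call.
def slow_sum() -> int:
    return sum(i * i for i in range(20000))


def fast_sum() -> int:
    return sum(range(20000))


ratios = relative_timings({"baseline": slow_sum, "candidate": fast_sum}, baseline="baseline")
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.3f}")
```

A ratio below 1.0 means the candidate took less time than the baseline, which is how the values in the notebook table read relative to the CDO column.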
diff --git a/environment.yml b/environment.yml
@@ -5,11 +5,17 @@ name : smmregrid
 channels:
   - conda-forge
 dependencies:
-  - python>=3.8,<3.11
+  - python>=3.7,<3.11
   - numpy
   - netcdf4 
   - dask
   - xarray
+  - cfgrib
   - xesmf
-  - sparse
   - cfunits
+  - cdo
+  - python-cdo
+  - pytest
+  - ipykernel
+  - pip:
+    - sparse==0.13.0