Add dust emissions #140

Open
znichollscr opened this issue Oct 22, 2024 · 47 comments
Labels
enhancement New feature or request

Comments

@znichollscr
Collaborator

znichollscr commented Oct 22, 2024

@jfkok has done some great work pulling together dust emissions. This issue is for tracking their inclusion in input4MIPs.

@znichollscr znichollscr added the enhancement New feature or request label Oct 22, 2024
@znichollscr
Collaborator Author

An initial draft of the files is here, which includes a nice briefing on how to use the data.

My initial comments on this are:

  • the way the files are currently written, the scenario (historical, future increase, future decrease, future constant) and the region (global or regional) are both blended into the variable name. That isn’t how most of the data is handled. The more common pattern for this (where the variable is the same every time if I am not mistaken) would be to use the same variable name for every file (e.g. dustscalefactor) and then the region of application is identified by the grid label. It’s a bit weird, but the current practice is to identify the scenario by the source ID (although I think we could do something clearer than this, full discussion is here: Altering the DRS #64).
  • is there a reason that future starts in 2001? That will be different to how ScenarioMIP will be handling the split, but if this is for a custom experiment, it may not matter (but it will matter if we want these dust emissions included in the DECK’s historical experiment)
  • the data does present a bit of a challenge for input4MIPs (the same comments apply to Stephanie’s data and it would be great to address them for CMIP7 (in CMIP6 there unfortunately wasn’t time)). The data has this idea of regions to which the data applies. The CF conventions do have a standard for this (https://cfconventions.org/Data/cf-conventions/cf-conventions-1.11/cf-conventions.html#geographic-regions), which is probably what we would want to use; a small illustrative sketch is at the end of this list. I think that would then all just work, but it might require a bit of a tweak to how the data is written (and the grid label for regional stuff would probably then become “gr”, see the full list of options here: https://github.com/PCMDI/mip-cmor-tables/blob/main/MIP_grid_label.json).
  • Just to check my understanding: the variable is the same every time right? It’s a scaling factor and the idea is that models use this to scale their internally calculated dust to get closer to the change in dust over time?
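
For what it's worth, a minimal netCDF4 sketch (illustrative only; the region names, sizes and filename are made up, not the actual data) of what a CF-style labelled region coordinate could look like:

import numpy as np
import netCDF4

# Hypothetical region names; the real files would use the actual dust source regions
region_names = ["north_africa", "middle_east", "east_asia"]
strlen = max(len(r) for r in region_names)

ds = netCDF4.Dataset("region-coordinate-sketch.nc", "w")
ds.createDimension("region", len(region_names))
ds.createDimension("strlen", strlen)

# CF geographic regions: a string-valued coordinate holding the region names
region = ds.createVariable("region", "S1", ("region", "strlen"))
region[:] = netCDF4.stringtochar(np.array(region_names, dtype=f"S{strlen}"))
region.setncattr("standard_name", "region")
region.setncattr("long_name", "region")

ds.close()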

@znichollscr
Collaborator Author

@jfkok making sure you get a notification and can find this

@jfkok

jfkok commented Oct 24, 2024

Hi Zeb, thanks for your very helpful input. Responses follow below:

* the way the files are currently written, the scenario (historical, future increase, future decrease, future constant) and the region (global or regional) are both blended into the variable name. That isn’t how most of the data is handled. The more common pattern for this (where the variable is the same every time if I am not mistaken) would be to use the same variable name for every file (e.g. dustscalefactor) and then the region of application is identified by the grid label. It’s a bit weird, but the current practice is to identify the scenario by the source ID (although I think we could do something clearer than this, full discussion is here: [Altering the DRS #64](https://github.com/PCMDI/input4MIPs_CVs/discussions/64)).

Ah okay, that makes sense. So I'd need to create one file for the global scaling factor (currently in this file) and seven additional files for the regional scaling factors for the seven regions (currently in this file), is that correct? And then do I still keep the historical and the (three) future scaling factors in separate files? So for 4x8 = 32 files total?

* is there a reason that future starts in 2001? That will be different to how ScenarioMIP will be handling the split, but if this is for a custom experiment, it may not matter (but it will matter if we want these dust emissions included in the DECK’s historical experiment)

Yes, the reason is that the observational reconstruction ends in the year 2000 because it is based on sedimentary records of dust deposition (like ice cores), so there's much less data for the last two decades. However, my group is working on using satellite data to extend the reconstruction to 2023 and I expect that to be ready sometime next year.

* the data does present a bit of a challenge for input4MIPs (the same comments apply to Stephanie’s data and it would be great to address them for CMIP7 (in CMIP6 there unfortunately wasn’t time)). The data has this idea of regions to which the data applies. The CF conventions do have a standard for this (https://cfconventions.org/Data/cf-conventions/cf-conventions-1.11/cf-conventions.html#geographic-regions), which is probably what we would want to use. I think that would then all just work, but it might require a bit of a tweak to how the data is written (and the grid label for regional stuff would probably then become “gr”, see the full list of options here: https://github.com/PCMDI/mip-cmor-tables/blob/main/MIP_grid_label.json).

Thanks for pointing this out. The regions do have well-defined coordinates though (the link you included mentioned "complex boundaries that cannot practically be specified using longitude and latitude boundary coordinates"). Would defining those boundaries in each file be sufficient, or do I need to do something else?

* Just to check my understanding: the variable is the same every time right? It’s a scaling factor and the idea is that models use this to scale their internally calculated dust to get closer to the change in dust over time?

Yes, that's exactly right. It's only the year and the region of application (global versus one of seven major dust aerosol source regions) that changes.

Thanks so much!

Jasper

@znichollscr
Collaborator Author

And then do I still keep the historical and the (three) future scaling factors in separate files? So for 4x8 = 32 files total?

Sorry, bad explanation from me. I would suggest one file for global and one file for regional (you can put all seven regions in one file, just use the 'region' dimension or whatever it is that the CF conventions call it to differentiate them). Then yes, one file for historical and one file for each scenario. So you'll end up with 4 x 2 = 8 files.

Just to try and be a bit clearer:

  • in the global file, there should be one variable. Its only dimension should be "time". The grid label should be "gm" (for global mean).
  • in the regional file, there should be one variable. It should be two-dimensional, ("time", "region") (or whatever the CF-convention name for 'region' is). I would probably make the grid label "gn" (this is your native grid). Then there'll probably be other auxiliary coordinates in this file to help with defining the boundaries of each region, region names etc. (a rough sketch of both layouts follows this list).
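
To make the two layouts concrete, here's a rough xarray sketch (the variable name matches the suggestion above; the years, region names and values are placeholders, not the real data):

import numpy as np
import xarray as xr

years = np.arange(1850, 2001)
regions = ["region_1", "region_2", "region_3"]  # placeholder names for the source regions

# Global file: one variable, "time" dimension only, grid label "gm"
ds_global = xr.Dataset(
    {"dustscalefactor": ("time", np.ones(years.size))},
    coords={"time": years},
    attrs={"grid_label": "gm"},
)

# Regional file: one variable, ("time", "region") dimensions, grid label "gn"
ds_regional = xr.Dataset(
    {"dustscalefactor": (("time", "region"), np.ones((years.size, len(regions))))},
    coords={"time": years, "region": regions},
    attrs={"grid_label": "gn"},
)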

Got it re the historical vs. scenario split. That's fine. If we want these files to be used for DECK simulations, we'll have to do a bit of thinking. If they're just for a specific MIP experiment, they can stay as they are.

The regions do have well-defined coordinates though (the link you included mentioned "complex boundaries that cannot practically be specified using longitude and latitude boundary coordinates"). Would defining those boundaries in each file be sufficient, or do I need to do something else?

Ah ok nice. Defining those boundary files would definitely be sufficient. (If it were me, I would just give the regions names first, make sure I can write the files, then go back and add the boundary definition second, because that boundary definition could be fiddly, but you may be able to skip straight to writing the boundary conditions!)

@jfkok

jfkok commented Oct 24, 2024

Thanks, that's helpful. I'm working on implementing these changes and obtaining corrected files. In doing so, I realized that making the variable name the same for all files also means that the variables are identical for the four different global files (1 historical and 3 future scenarios) and for the four different regional files. So how would I distinguish the files if I can't put the scenario in the variable name?

The regions do have well-defined coordinates though (the link you included mentioned "complex boundaries that cannot practically be specified using longitude and latitude boundary coordinates"). Would defining those boundaries in each file be sufficient, or do I need to do something else?

Ah ok nice. Defining those boundary files would definitely be sufficient. (If it were me, I would just give the regions names first, make sure I can write the files, then go back and add the boundary definition second, because that boundary definition could be fiddly, but you may be able to skip straight to writing the boundary conditions!)

I'm not sure I quite understand what you mean by "boundary files". Would it be sufficient to specify the coordinates of the region boundaries as a "boundary coordinates" attribute of the "region" variable? Or should I do something different?

Thanks!

Jasper

@znichollscr
Collaborator Author

So how would I distinguish the files if I can't put the scenario in the variable name?

Excellent question. The answer is, at the moment, you put it in the "source_id" bit of the file name. So, for example, your filenames would become (where I've also dropped off the noleap prefix that isn't part of the DRS):

  • historical: dustscaling_input4MIPs_emissions_AerChemMIP_UCLA-1-0-1_gm_185001-200012.nc
  • constant scenario: dustscaling_input4MIPs_emissions_AerChemMIP_UCLA-1-0-1-constant_gm_200101-210012.nc
  • decreasing scenario: dustscaling_input4MIPs_emissions_AerChemMIP_UCLA-1-0-1-decreasing_gm_200101-210012.nc
  • increasing scenario: dustscaling_input4MIPs_emissions_AerChemMIP_UCLA-1-0-1-increasing_gm_200101-210012.nc

As you can tell, this doesn't make that much sense and is easy to miss, which is why we're having the discussion in #64. That may lead to a renaming in future, but for now the above is what to go for.
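
For reference, the filename is just the DRS components joined with underscores; reading them off the examples above (the slot names here are my informal labels, not official DRS terminology):

# Hypothetical helper, purely to illustrate the pattern in the filenames above
def build_filename(variable, target_mip, source_id, grid_label, time_range):
    # activity ("input4MIPs") and dataset category ("emissions") are fixed for these files
    parts = [variable, "input4MIPs", "emissions", target_mip, source_id, grid_label, time_range]
    return "_".join(parts) + ".nc"

print(build_filename("dustscaling", "AerChemMIP", "UCLA-1-0-1-constant", "gm", "200101-210012"))
# -> dustscaling_input4MIPs_emissions_AerChemMIP_UCLA-1-0-1-constant_gm_200101-210012.nc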

@jfkok

jfkok commented Oct 25, 2024

Thanks so much! I've implemented all your comments (I think) and uploaded the updated files here. Let me know if you think any further changes are needed.

One thing to note is that I added the coordinates of the region boundaries as a "boundary coordinates" attribute of the "region" variable. Let me know in case I should be doing something differently.

@znichollscr
Collaborator Author

Nice, thanks.

Let me know if you think any further changes are needed.

Underscores in the source ID have to be changed to hyphens i.e.:

  • dustscalefactor_input4MIPs_emissions_AerChemMIP_UCLA-1-0-1_constant_gm_200101-210012.nc -> dustscalefactor_input4MIPs_emissions_AerChemMIP_UCLA-1-0-1-constant_gm_200101-210012.nc
  • dustscalefactor_input4MIPs_emissions_AerChemMIP_UCLA-1-0-1_increasing_gm_200101-210012.nc -> dustscalefactor_input4MIPs_emissions_AerChemMIP_UCLA-1-0-1-increasing_gm_200101-210012.nc
  • dustscalefactor_input4MIPs_emissions_AerChemMIP_UCLA-1-0-1_decreasing_gm_200101-210012.nc -> dustscalefactor_input4MIPs_emissions_AerChemMIP_UCLA-1-0-1-decreasing_gm_200101-210012.nc
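
If it's useful, a throwaway pathlib snippet that would do those renames in bulk (assuming the files sit in the current directory; adjust the glob pattern as needed):

from pathlib import Path

for path in Path(".").glob("dustscalefactor_input4MIPs_*.nc"):
    # Replace the underscore before the scenario suffix with a hyphen,
    # e.g. UCLA-1-0-1_constant -> UCLA-1-0-1-constant
    new_name = path.name.replace("UCLA-1-0-1_", "UCLA-1-0-1-")
    if new_name != path.name:
        path.rename(path.with_name(new_name))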

Then I would suggest trying to run them through the validator and write them in the DRS: https://input4mips-validation.readthedocs.io/en/latest/how-to-guides/how-to-write-a-single-file-in-the-drs/

One thing to note is that I added the coordinates of the region boundaries as a "boundary coordinates" attribute of the "region" variable. Let me know in case I should be doing something differently.

Not sure, I haven't done this particular step before. If you run the files through the validator, the CF-checker will flag anything really wrong. In a couple of weeks I can pull the files down and have a look myself (meeting next week will take up time before then).

@jfkok

jfkok commented Oct 28, 2024

Thanks Zeb, I corrected the file names.

I tried running the validator after installing it as an application per the instructions here. However, the command import input4mips_validation.io triggers an error message, pasted below. Do you know if there is an easy solution for this? Thanks!

[screenshot of the error message]

@znichollscr
Collaborator Author

Do you know if there is an easy solution for this?

Hmm that's not very good. Let's dive down the rabbit hole here: climate-resource/input4mips_validation#78

@znichollscr
Collaborator Author

@jfkok are there any updated files or should I just use this link (https://drive.google.com/drive/u/0/folders/1Xr1A4oqPj35h43MFV1864IXSvYSVtYKs) again?

@jfkok

jfkok commented Dec 11, 2024 via email

@znichollscr
Collaborator Author

znichollscr commented Dec 11, 2024

Looking overall good. I had to make some minor tweaks to the files to get them to pass validation; that was pretty straightforward. The Python code I used is below (it runs in the same environment as input4mips-validation), but you can probably do the same with Matlab easily. Looking at the changes, I assume they'll need to be applied to all files.

Python code for tweaks
import netCDF4

ds = netCDF4.Dataset(
    "dustscalefactor_input4MIPs_emissions_AerChemMIP_UCLA-1-0-1_gn_185001-200012.nc",
    "a",
)

# Capitalisation matters according to the spec
# (doesn't make sense to me, but here we are)
ds["dustscalefactor"].delncattr("Units")
# Apparently dimensionless is represented by "1"
# (again, wouldn't be my choice of how to do this)
ds["dustscalefactor"].setncattr("units", "1")
ds["dustscalefactor"].setncattr("long_name", "dustscalefactor")

# Capitalisation matters according to the spec
# (doesn't make sense to me, but here we are)
ds["region"].delncattr("Units")
# Apparently dimensionless is represented by "1"
# (again, wouldn't be my choice of how to do this)
ds["region"].setncattr("units", "1")
# Whitespace isn't allowed in attribute names
# (not sure why, it's just the spec)
ds["region"].setncattr(
    "boundary_coordinates", ds["region"].getncattr("Boundary coordinates")
)
ds["region"].delncattr("Boundary coordinates")
ds["region"].setncattr("long_name", "region")

# We use 'yr' rather than 'annual'
ds.setncattr("frequency", "yr")
# License info
ds.setncattr("license_id", "CC BY 4.0")
ds.setncattr(
    "license",
    (
        "The input4MIPs data linked to this entry is licensed "
        "under a Creative Commons Attribution 4.0 International "
        "(https://creativecommons.org/licenses/by/4.0/). "
        "Consult https://pcmdi.llnl.gov/CMIP6/TermsOfUse "
        "for terms of use governing CMIP6Plus output, "
        "including citation requirements and proper acknowledgment. "
        "The data producers and data providers make no warranty, either express "
        "or implied, including, but not limited to, warranties of merchantability "
        "and fitness for a particular purpose. "
        "All liabilities arising from the supply of the information "
        "(including any liability arising in negligence) "
        "are excluded to the fullest extent permitted by law."
    ),
)

# If you know this, helpful.
# If not, can just leave vague like this
ds.setncattr("nominal_resolution", "250 km")

ds.close()

The only other minor change is that, because you have annual data, the end of the file name should be "YYYY-YYYY" not "YYYYMM-YYYYMM" as you have now, i.e. for your historical files, "185001-200012.nc" becomes "1850-2000.nc".
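
For completeness, a small rename snippet for that change too (again assuming the files are in the current directory):

import re
from pathlib import Path

for path in Path(".").glob("dustscalefactor_*.nc"):
    # Drop the month digits from the time range, e.g. _185001-200012.nc -> _1850-2000.nc
    new_name = re.sub(r"_(\d{4})\d{2}-(\d{4})\d{2}\.nc$", r"_\1-\2.nc", path.name)
    if new_name != path.name:
        path.rename(path.with_name(new_name))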

If you're able to make those tweaks, we should be on the home straight.

@znichollscr
Collaborator Author

The other thing we need to do is register UCLA as an institute. I've started that here if you're able to take a look: PCMDI/mip-cmor-tables#84

Thanks!

@jfkok

jfkok commented Dec 12, 2024 via email

@znichollscr
Collaborator Author

Thanks Zeb, that was very helpful. I've addressed all these issues and the file now passes the validation, including the cf checker!

Oh cool, I thought the cf-checker wasn't running on your machine, great to hear it's working!

Does that conclude the process or are there other steps we need to take?

Well, if you want to publish the data on ESGF there are a couple more steps. If you're not worried about that, we're done (although I'm also not sure that you needed to do any of these steps if you didn't want to publish on ESGF in the first place).

@znichollscr
Collaborator Author

Assuming you do want to publish on ESGF, the next steps are:

  • check that this is the right institute to associate with your data: Add UCLA mip-cmor-tables#84
  • I'll pull your data down onto our server
  • Paul and I will then push it through the publication queue
  • Good to go

@jfkok

jfkok commented Dec 13, 2024 via email

@znichollscr
Collaborator Author

znichollscr commented Dec 13, 2024

Ok perfect. There is one last step that would be very helpful: can you please run the script below? (It should be clear what it's doing; any questions, fire away. As above, it will run in your normal input4mips-validation environment.) Once you've done that, ping me and @durack1 and we can get the files in the publication queue.

Python script to run
import shutil
from pathlib import Path

import iris
import netCDF4
from input4mips_validation.cvs import load_cvs
from input4mips_validation.dataset import Input4MIPsDataset
from input4mips_validation.inference.from_data import infer_time_start_time_end
from input4mips_validation.logging import setup_logging
from input4mips_validation.upload_ftp import upload_ftp
from input4mips_validation.validation.file import get_validate_file_result
from input4mips_validation.xarray_helpers.iris import ds_from_iris_cubes


setup_logging(enable=True, logging_level="INFO_FILE")


def create_fixed_tmp_file(file: Path, fixed_folder: Path) -> Path:
    fixed_file = fixed_folder / file.name
    shutil.copy(file, fixed_file)

    source_id = file.name.split("_")[4]

    ds = netCDF4.Dataset(fixed_file, "a")

    ds.setncattr("source_id", source_id)
    ds.setncattr("realm", "atmos")
    if "gn" in fixed_file.name:
        ds["region"].setncattr("long_name", "region")

    ds.close()

    return fixed_file


def validate_file_then_rewrite_file_in_drs(
    file: Path, output_root: Path, cv_source: str
) -> None:
    get_validate_file_result(
        file,
        cv_source=cv_source,
    ).raise_if_errors()

    cvs = load_cvs(cv_source=cv_source)

    ds = ds_from_iris_cubes(
        iris.load(file),
    )
    ds.attrs["nominal_resolution"] = "250 km"
    ds.attrs["target_mip"] = "AerChemMIP2"  # To confirm with Jasper

    time_start, time_end = infer_time_start_time_end(
        ds=ds,
        frequency_metadata_key="frequency",
        no_time_axis_frequency="fx",
        time_dimension="time",
    )

    full_file_path = cvs.DRS.get_file_path(
        root_data_dir=output_root,
        available_attributes=ds.attrs,
        time_start=time_start,
        time_end=time_end,
    )

    if full_file_path.exists():
        raise FileExistsError(full_file_path)

    full_file_path.parent.mkdir(parents=True, exist_ok=True)

    if full_file_path.name != file.name:
        Input4MIPsDataset.from_ds(ds, cvs=cvs).write(
            root_data_dir=output_root,
        )

    else:
        shutil.copy(file, full_file_path)

    print(f"File written according to the DRS in {full_file_path}")


def main() -> None:
    # Obviously point these wherever makes sense for you
    DATA_FOLDER = Path("dust-files")
    TMP_REWRITE_FOLDER = Path("dust-files-pre-validation-fixes")
    OUTPUT_ROOT = Path("dust-rewritten")

    # You shouldn't need to change this
    CV_SOURCE = "gh:main"

    TMP_REWRITE_FOLDER.mkdir(exist_ok=True, parents=True)

    files_to_upload = DATA_FOLDER.rglob("*.nc")

    for file in files_to_upload:
        tmp_file = create_fixed_tmp_file(file, TMP_REWRITE_FOLDER)
        validate_file_then_rewrite_file_in_drs(tmp_file, OUTPUT_ROOT, CV_SOURCE)

    cvs = load_cvs(cv_source=CV_SOURCE)
    upload_ftp(
        tree_root=OUTPUT_ROOT,
        ftp_dir_rel_to_root="UCLA-1-0-1-upload-1",
        password="[email protected]",
        cvs=cvs,
        username="anonymous",
        ftp_server="ftp.llnl.gov",
        ftp_dir_root="/incoming",
        # You can try making this 8, but the FTP server seems to not enjoy parallel uploads
        n_threads=1,
        dry_run=False,
        continue_on_error=False,
    )


if __name__ == "__main__":
    main()

@jfkok

jfkok commented Jan 6, 2025

Happy 2025, @durack1, @znichollscr! I ran the python script and copied the output below. It seems to have run correctly, as far as I can tell.

[screenshot of the script output]

@znichollscr
Collaborator Author

znichollscr commented Jan 6, 2025

Hmm I don't think so, no files were uploaded. Did you update DATA_FOLDER in the script to point to the directory with your files?

@znichollscr
Collaborator Author

If yes, change

files_to_upload = DATA_FOLDER.rglob("*.nc")

to

files_to_upload = list(DATA_FOLDER.rglob("*.nc"))
assert files_to_upload, f"No re-written files found in {DATA_FOLDER}, please check"

because something isn't working...

@jfkok

jfkok commented Jan 6, 2025 via email

@znichollscr
Collaborator Author

Hmmm I can't see the screenshot and we don't seem to have anything on the server. Can you try sending the screenshot again please?

@jfkok

jfkok commented Jan 6, 2025

That's odd! I've added the line of code you mentioned and that warning doesn't trigger.

Here's the screenshot again, hopefully it comes across this time.
[screenshot]

@znichollscr
Collaborator Author

Super weird. Is there anything in the folder pointed to by OUTPUT_ROOT?

@znichollscr
Collaborator Author

(For what it's worth, the timestamps for the message about uploading files and for the success message are the same, so it's clear that no uploading is happening.)

@jfkok

jfkok commented Jan 27, 2025

Sorry for the 3-week hiatus - things got busy and of course we had the fires here.

Now when I run the code I get a permission error (copied below). This seems odd because the code creates a whole string of directories in the OUTPUT_ROOT directory, but it seems it somehow can't write the file in the final directory it creates. Do you have any ideas what the problem could be?

[screenshot]

[screenshot]

@znichollscr
Collaborator Author

Sorry for the 3-week hiatus - things got busy and of course we had the fires here

No worries. Hope you're ok with everything else that is going on.

On Windows it is possible to hit file path length issues. I would guess it is that, so you might have to write in e.g. C:\Users\Jasper Kok rather than in the subfolder you're currently trying to write in.

@jfkok

jfkok commented Jan 30, 2025

Ah thanks, that seems to have fixed (part of) it!

I now have output in the output_root directory for each of the 8 original data files. Do you see that on your end? I've copied a screenshot of the directory structure that was created below.

After writing those files, the script still runs into an error though (see second screenshot). It's a file-not-found error, but the path listed is that of the first file that was written out, which is present (see first screenshot), so I'm confused... Thanks for any help!

[screenshot of the directory structure]

[screenshot of the error]

@znichollscr
Collaborator Author

znichollscr commented Jan 30, 2025 via email

@jfkok

jfkok commented Jan 31, 2025

Okay, I think it finally worked now! Do you see it on your end?

[screenshot]

@znichollscr
Collaborator Author

Got the files!

So, there are still a few issues with them (see below). However, we could also just publish, pros and cons below.

Issues that I can see right now:

  • none of the files have bounds for time
  • apparently, according to CF-conventions, region should be a string, not an integer
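
For reference, a quick netCDF4 check for both points (swap in whichever of your files you want to inspect; the filename here is just the historical regional example):

import netCDF4

ds = netCDF4.Dataset(
    "dustscalefactor_input4MIPs_emissions_AerChemMIP_UCLA-1-0-1_gn_1850-2000.nc"
)

# Time bounds: the time variable should carry a "bounds" attribute pointing at
# a (time, 2) variable such as "time_bnds"
print("time bounds attribute:", getattr(ds["time"], "bounds", "MISSING"))

# Region labels: CF wants string names here, not integer codes
print("region dtype:", ds["region"].dtype)

ds.close()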

Pros and cons of publishing now:

Pros

  • data is out there, people can probably use it as is
  • you get feedback on the actual numbers now
  • the validation tools we're writing are constantly improving anyway, so even if we wait, there will probably be other things we pick up and it'd be good to get something published rather than being stuck in a seemingly endless validation cycle

Cons

  • these files won't necessarily work smoothly with tools that expect CF-compliant data (although there aren't so many of those)
  • we really should clean these things up at some point, which means users will get a slightly different update when the fixes are done

I'm happy to go either way, so I'd be interested in your thoughts/wishes and also those of @durack1

@jfkok

jfkok commented Feb 3, 2025 via email

@durack1
Contributor

durack1 commented Feb 4, 2025

My two cents: adding time bounds is an easy step (@jfkok it just means we record when a data value is valid, e.g. the start and end of the month for monthly mean data, irrespective of your time axis value).

I have yet to peek at these data, but I would definitely fix any issues with loading the data using xarray (the most common package in climate science these days); other simple nits we could also capture as the time axis bounds are being created.

@znichollscr
Collaborator Author

@jfkok an updated script that now also adds time bounds is below. The changed lines are these ones:

import cftime
...
def create_fixed_tmp_file(file: Path, fixed_folder: Path) -> Path:
    ...
    # Add time bounds
    ds.createDimension("bnds", 2)
    ds.createVariable("time_bnds", "f8", ("time", "bnds"))

    # Assumes yearly steps, but probably ok
    time_dates = cftime.num2date(
        ds["time"][:], units=ds["time"].units, calendar=ds["time"].calendar
    )
    time_years = [dt.year for dt in time_dates]
    time_bounds_lower = cftime.date2num(
        [cftime.datetime(y, 1, 1, calendar=ds["time"].calendar) for y in time_years],
        units=ds["time"].units,
    )
    time_bounds_upper = cftime.date2num(
        [
            cftime.datetime(y + 1, 1, 1, calendar=ds["time"].calendar)
            for y in time_years
        ],
        units=ds["time"].units,
    )
    ds["time_bnds"][:, 0] = time_bounds_lower
    ds["time_bnds"][:, 1] = time_bounds_upper

    ds["time"].setncattr("bounds", "time_bnds")
    ...
Full script
import shutil
from pathlib import Path

import cftime
import iris
import netCDF4
from input4mips_validation.cvs import load_cvs
from input4mips_validation.dataset import Input4MIPsDataset
from input4mips_validation.inference.from_data import infer_time_start_time_end
from input4mips_validation.logging import setup_logging
from input4mips_validation.validation.file import get_validate_file_result
from input4mips_validation.xarray_helpers.iris import ds_from_iris_cubes

setup_logging(enable=True, logging_level="INFO_FILE")


def create_fixed_tmp_file(file: Path, fixed_folder: Path) -> Path:
    fixed_file = fixed_folder / file.name
    shutil.copy(file, fixed_file)

    source_id = file.name.split("_")[4]

    ds = netCDF4.Dataset(fixed_file, "a")

    ds.setncattr("source_id", source_id)
    ds.setncattr("realm", "atmos")
    if "gn" in fixed_file.name:
        ds["region"].setncattr("long_name", "region")

    # Add time bounds
    ds.createDimension("bnds", 2)
    ds.createVariable("time_bnds", "f8", ("time", "bnds"))

    # Assumes yearly steps, but probably ok
    time_dates = cftime.num2date(
        ds["time"][:], units=ds["time"].units, calendar=ds["time"].calendar
    )
    time_years = [dt.year for dt in time_dates]
    time_bounds_lower = cftime.date2num(
        [cftime.datetime(y, 1, 1, calendar=ds["time"].calendar) for y in time_years],
        units=ds["time"].units,
    )
    time_bounds_upper = cftime.date2num(
        [
            cftime.datetime(y + 1, 1, 1, calendar=ds["time"].calendar)
            for y in time_years
        ],
        units=ds["time"].units,
    )
    ds["time_bnds"][:, 0] = time_bounds_lower
    ds["time_bnds"][:, 1] = time_bounds_upper

    ds["time"].setncattr("bounds", "time_bnds")

    ds.close()

    # Quick sanity check that the fixed file still opens cleanly with xarray
    import xarray as xr

    xr.open_dataset(fixed_file).close()

    return fixed_file


def validate_file_then_rewrite_file_in_drs(
    file: Path, output_root: Path, cv_source: str
) -> None:
    get_validate_file_result(
        file,
        cv_source=cv_source,
    ).raise_if_errors()

    cvs = load_cvs(cv_source=cv_source)

    ds = ds_from_iris_cubes(
        iris.load(file),
    )
    ds.attrs["nominal_resolution"] = "250 km"
    ds.attrs["target_mip"] = "AerChemMIP2"  # To confirm with Jasper

    time_start, time_end = infer_time_start_time_end(
        ds=ds,
        frequency_metadata_key="frequency",
        no_time_axis_frequency="fx",
        time_dimension="time",
    )

    full_file_path = cvs.DRS.get_file_path(
        root_data_dir=output_root,
        available_attributes=ds.attrs,
        time_start=time_start,
        time_end=time_end,
    )

    if full_file_path.exists():
        raise FileExistsError(full_file_path)

    full_file_path.parent.mkdir(parents=True, exist_ok=True)

    if full_file_path.name != file.name:
        Input4MIPsDataset.from_ds(ds, cvs=cvs).write(
            root_data_dir=output_root,
        )

    else:
        shutil.copy(file, full_file_path)

    print(f"File written according to the DRS in {full_file_path}")


def main() -> None:
    # Obviously point these wherever makes sense for you
    DATA_FOLDER = Path("dust-files")
    TMP_REWRITE_FOLDER = Path("dust-files-pre-validation-fixes")
    OUTPUT_ROOT = Path("dust-rewritten")

    # You shouldn't need to change this
    CV_SOURCE = "gh:main"

    TMP_REWRITE_FOLDER.mkdir(exist_ok=True, parents=True)

    files_to_upload = DATA_FOLDER.rglob("*.nc")

    for file in files_to_upload:
        tmp_file = create_fixed_tmp_file(file, TMP_REWRITE_FOLDER)
        validate_file_then_rewrite_file_in_drs(tmp_file, OUTPUT_ROOT, CV_SOURCE)

    cvs = load_cvs(cv_source=CV_SOURCE)
    # upload_ftp(
    #     tree_root=OUTPUT_ROOT,
    #     ftp_dir_rel_to_root="UCLA-1-0-1-upload-1",
    #     password="[email protected]",
    #     cvs=cvs,
    #     username="anonymous",
    #     ftp_server="ftp.llnl.gov",
    #     ftp_dir_root="/incoming",
    #     # You can try making this 8, but the FTP server seems to not enjoy parallel uploads
    #     n_threads=1,
    #     dry_run=False,
    #     continue_on_error=False,
    # )


if __name__ == "__main__":
    main()

@znichollscr
Collaborator Author

Any questions, fire away (the whole world of CF-conventions, bounds etc. is super confusing so very happy to help and there are no dumb questions)

@jfkok

jfkok commented Feb 6, 2025 via email

@znichollscr
Collaborator Author

znichollscr commented Feb 6, 2025

Do you have a suggestion or can we let that one slide?

I thought this would be easy. Turns out it's more complicated, but still doable. Have a look at the below.

Script to update region handling
import shutil
from pathlib import Path

import cftime
import iris
import netCDF4
import numpy as np
from input4mips_validation.cvs import load_cvs
from input4mips_validation.inference.from_data import infer_time_start_time_end
from input4mips_validation.io import (
    generate_creation_timestamp,
    generate_tracking_id,
)
from input4mips_validation.logging import setup_logging
from input4mips_validation.upload_ftp import upload_ftp
from input4mips_validation.validation.tree import get_validate_tree_result
from input4mips_validation.xarray_helpers.iris import ds_from_iris_cubes

setup_logging(enable=True, logging_level="INFO_FILE")


def copy_dimension(
    ds: netCDF4.Dataset, name: str, dim_to_copy: netCDF4.Dimension
) -> None:
    ds.createDimension(
        name, (len(dim_to_copy) if not dim_to_copy.isunlimited() else None)
    )


def copy_variable(
    ds: netCDF4.Dataset,
    name: str,
    variable_to_copy: netCDF4.Variable,
    dimensions: tuple[str, ...] | None = None,
) -> None:
    if dimensions is None:
        dimensions = variable_to_copy.dimensions

    ds.createVariable(name, variable_to_copy.datatype, dimensions)
    ds.variables[name][:] = variable_to_copy[:]
    for attr in variable_to_copy.ncattrs():
        ds[name].setncattr(attr, variable_to_copy.getncattr(attr))


def create_fixed_tmp_file(file: Path, fixed_folder: Path) -> Path:
    fixed_file = fixed_folder / file.name

    source_id = file.name.split("_")[4]

    ds = netCDF4.Dataset(file, "r")
    ds_new = netCDF4.Dataset(fixed_file, "w")

    # Directly copy over the things we want to keep
    # (can't delete because of netCDF4's API).
    # Attributes
    for name in ds.ncattrs():
        ds_new.setncattr(name, ds.getncattr(name))

    # Handle the dimension and variable creation more carefully
    gn_new_region_dim = "lbl"

    # Dimensions
    if "gn" in fixed_file.name:
        for name, dimension in ds.dimensions.items():
            if name == "region":
                ds_new.createDimension(gn_new_region_dim, len(dimension))

            else:
                copy_dimension(ds_new, name=name, dim_to_copy=dimension)

    else:
        for name, dimension in ds.dimensions.items():
            copy_dimension(ds_new, name=name, dim_to_copy=dimension)

    for name, variable in ds.variables.items():
        if "gn" in fixed_file.name:
            if name == "region":
                region_names = [
                    v.split("=")[-1]
                    for v in ds["region"].getncattr("description").split(";")
                ]
                region_names_max_length = max(len(v) for v in region_names)
                regions = np.array(region_names, dtype=f"S{region_names_max_length}")

                boundary_coordinates = "; ".join(
                    [
                        v.split("=")[-1].strip()
                        for v in ds["region"]
                        .getncattr("boundary_coordinates")
                        .split(";")
                    ]
                )

                ds_new.createDimension("strlen", region_names_max_length)
                ds_new.createVariable(
                    name,
                    "S1",
                    (gn_new_region_dim, "strlen"),
                )
                ds_new[name][:] = netCDF4.stringtochar(regions)

                ds_new[name].setncattr("boundary_coordinates", boundary_coordinates)
                ds_new[name].setncattr("long_name", "region")

            else:
                copy_variable(
                    ds_new,
                    name=name,
                    variable_to_copy=ds.variables[name],
                    dimensions=tuple(
                        v.replace("region", gn_new_region_dim)
                        for v in ds.variables[name].dimensions
                    ),
                )

        else:
            copy_variable(ds_new, name=name, variable_to_copy=ds.variables[name])

    ds_new.setncattr("source_id", source_id)
    ds_new.setncattr("realm", "atmos")
    # Update to make this easier to parse for users
    ds_new.setncattr("contact", "[email protected];[email protected]")
    ds_new.setncattr("nominal_resolution", "250 km")
    ds_new.setncattr("target_mip", "AerChemMIP2")  # To confirm with Jasper
    # Generate new tracking ID
    ds_new.setncattr("tracking_id", generate_tracking_id())
    ds_new.setncattr("creation_date", generate_creation_timestamp())

    # Add time bounds
    ds_new.createDimension("bnds", 2)
    ds_new.createVariable("time_bnds", "f8", ("time", "bnds"))

    # Assumes yearly steps, but probably ok
    time_dates = cftime.num2date(
        ds_new["time"][:], units=ds_new["time"].units, calendar=ds_new["time"].calendar
    )
    time_years = [dt.year for dt in time_dates]
    time_bounds_lower = cftime.date2num(
        [
            cftime.datetime(y, 1, 1, calendar=ds_new["time"].calendar)
            for y in time_years
        ],
        units=ds_new["time"].units,
    )
    time_bounds_upper = cftime.date2num(
        [
            cftime.datetime(y + 1, 1, 1, calendar=ds_new["time"].calendar)
            for y in time_years
        ],
        units=ds_new["time"].units,
    )
    ds_new["time_bnds"][:, 0] = time_bounds_lower
    ds_new["time_bnds"][:, 1] = time_bounds_upper

    ds_new["time"].setncattr("bounds", "time_bnds")

    # Could use a context manager instead, whatever
    ds.close()
    ds_new.close()

    return fixed_file


def rewrite_file_in_drs(file: Path, output_root: Path, cv_source: str) -> None:
    cvs = load_cvs(cv_source=cv_source)

    ds = ds_from_iris_cubes(
        iris.load(file),
    )

    time_start, time_end = infer_time_start_time_end(
        ds=ds,
        frequency_metadata_key="frequency",
        no_time_axis_frequency="fx",
        time_dimension="time",
    )

    full_file_path = cvs.DRS.get_file_path(
        root_data_dir=output_root,
        available_attributes=ds.attrs,
        time_start=time_start,
        time_end=time_end,
    )

    if full_file_path.exists():
        raise FileExistsError(full_file_path)

    full_file_path.parent.mkdir(exist_ok=True, parents=True)
    shutil.copy(file, full_file_path)
    print(f"File written according to the DRS in {full_file_path}")


def main() -> None:
    # Obviously point these wherever makes sense for you
    DATA_FOLDER = Path("dust-files")
    TMP_REWRITE_FOLDER = Path("dust-files-pre-validation-fixes")
    OUTPUT_ROOT = Path("dust-rewritten")

    # You shouldn't need to change this
    CV_SOURCE = "gh:main"

    TMP_REWRITE_FOLDER.mkdir(exist_ok=True, parents=True)

    files_to_upload = DATA_FOLDER.rglob("*.nc")

    for file in files_to_upload:
        tmp_file = create_fixed_tmp_file(file, TMP_REWRITE_FOLDER)
        rewrite_file_in_drs(tmp_file, OUTPUT_ROOT, CV_SOURCE)

    cvs = load_cvs(cv_source=CV_SOURCE)
    get_validate_tree_result(OUTPUT_ROOT, cv_source=None, cvs=cvs).raise_if_errors()
    upload_ftp(
        tree_root=OUTPUT_ROOT,
        ftp_dir_rel_to_root="UCLA-1-0-1-upload-1",
        password="[email protected]",
        cvs=cvs,
        username="anonymous",
        ftp_server="ftp.llnl.gov",
        ftp_dir_root="/incoming",
        # You can try making this 8, but the FTP server seems to not enjoy parallel uploads
        n_threads=1,
        dry_run=False,
        continue_on_error=False,
    )


if __name__ == "__main__":
    main()

@jfkok

jfkok commented Feb 7, 2025 via email

@znichollscr
Collaborator Author

lgtm! @durack1 they're in /incoming/UCLA-1-0-1-upload-1/input4MIPs if you want to pull them and get them in the publication queue

@znichollscr
Collaborator Author

Nice work @jfkok

@durack1
Contributor

durack1 commented Feb 10, 2025

@znichollscr @jfkok great!

I'm just peeking at the subdirs. There are several versions in that path. Is the latest, 20250207, the only one we need? Also, "CMIP6" is the MIP era, which is wrong (for next-gen prototype data, we're using CMIP6Plus). So we'll want to catch that. Looks like the only missing global attribute is nominal_resolution, which should be gm = 10000 km. For the gn this is different-sized regions, so I'm not quite sure how best to deal with that, but 5000 km might be about right for the specified regions. @jfkok you might also want to correct the licensedunder -> licensed under typo in the license entry.

One extra nit: if these are observed data (rather than modelled), then the calendar = "noleap" is invalid.
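
In case it helps, a rough netCDF4 sketch of the global-attribute tweaks above (mip_era, nominal_resolution and the license typo; the calendar question is separate). Treat it as a starting point, not the final word:

import netCDF4

# Illustrative only: apply to each file, with the appropriate nominal_resolution
# ("10000 km" for the gm files; perhaps "5000 km" for the gn files, to be confirmed)
ds = netCDF4.Dataset(
    "dustscalefactor_input4MIPs_emissions_AerChemMIP_UCLA-1-0-1_gm_1850-2000.nc", "a"
)

ds.setncattr("mip_era", "CMIP6Plus")
ds.setncattr("nominal_resolution", "10000 km")

# Fix the "licensedunder" typo in the license string
ds.setncattr("license", ds.getncattr("license").replace("licensedunder", "licensed under"))

ds.close()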

@durack1
Contributor

durack1 commented Feb 11, 2025

@jfkok please ping this thread if any of the above tweaks aren't clear. If you can upload these to a new subdir (e.g. /incoming/UCLA-1-0-1-upload-4), it'll save me sifting through old versions with exactly the same filenames.

@jfkok

jfkok commented Feb 11, 2025 via email

@jfkok

jfkok commented Feb 12, 2025 via email

@znichollscr
Collaborator Author

So do you mean I should change "mip_era" to "CMIP7"? Or something different?

"CMIP6Plus" (our halfway house)
