Add dust emissions #140

Open
znichollscr opened this issue Oct 22, 2024 · 47 comments
Labels
enhancement New feature or request

Comments

@znichollscr
Collaborator

znichollscr commented Oct 22, 2024

@jfkok has done some great work pulling together dust emissions. This issue is for tracking their inclusion in input4MIPs.

@znichollscr znichollscr added the enhancement New feature or request label Oct 22, 2024
@znichollscr
Collaborator Author

An initial draft of the files is here, which includes a nice briefing on how to use the data.

My initial comments on this are:

  • the way the files are currently written, the scenario (historical, future increase, future decrease, future constant) and the region (global or regional) are both blended into the variable name. That isn’t how most of the data is handled. The more common pattern for this (where the variable is the same every time if I am not mistaken) would be to use the same variable name for every file (e.g. dustscalefactor) and then the region of application is identified by the grid label. It’s a bit weird, but the current practice is to identify the scenario by the source ID (although I think we could do something clearer than this, full discussion is here: Altering the DRS #64).
  • is there a reason that future starts in 2001? That will be different to how ScenarioMIP will be handling the split, but if this is for a custom experiment, it may not matter (but it will matter if we want these dust emissions included in the DECK’s historical experiment)
  • the data does present a bit of a challenge for input4MIPs (the same comments apply to Stephanie’s data and it would be great to address them for CMIP7 (in CMIP6 there unfortunately wasn’t time)). The data has this idea of regions to which the data applies. The CF conventions do have a standard for this (https://cfconventions.org/Data/cf-conventions/cf-conventions-1.11/cf-conventions.html#geographic-regions), which is probably what we would want to use; a small illustrative sketch is at the end of this list. I think that would then all just work, but it might require a bit of a tweak to how the data is written (and the grid label for regional stuff would probably then become “gr”, see the full list of options here: https://github.com/PCMDI/mip-cmor-tables/blob/main/MIP_grid_label.json).
  • Just to check my understanding: the variable is the same every time right? It’s a scaling factor and the idea is that models use this to scale their internally calculated dust to get closer to the change in dust over time?
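
For what it's worth, a minimal netCDF4 sketch (illustrative only; the region names, sizes and filename are made up, not the actual data) of what a CF-style labelled region coordinate could look like:

import numpy as np
import netCDF4

# Hypothetical region names; the real files would use the actual dust source regions
region_names = ["north_africa", "middle_east", "east_asia"]
strlen = max(len(r) for r in region_names)

ds = netCDF4.Dataset("region-coordinate-sketch.nc", "w")
ds.createDimension("region", len(region_names))
ds.createDimension("strlen", strlen)

# CF geographic regions: a string-valued coordinate holding the region names
region = ds.createVariable("region", "S1", ("region", "strlen"))
region[:] = netCDF4.stringtochar(np.array(region_names, dtype=f"S{strlen}"))
region.setncattr("standard_name", "region")
region.setncattr("long_name", "region")

ds.close()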

@znichollscr
Collaborator Author

@jfkok making sure you get a notification and can find this

@jfkok

jfkok commented Oct 24, 2024

Hi Zeb, thanks for your very helpful input. Responses follow below:

* the way the files are currently written, the scenario (historical, future increase, future decrease, future constant) and the region (global or regional) are both blended into the variable name. That isn’t how most of the data is handled. The more common pattern for this (where the variable is the same every time if I am not mistaken) would be to use the same variable name for every file (e.g. dustscalefactor) and then the region of application is identified by the grid label. It’s a bit weird, but the current practice is to identify the scenario by the source ID (although I think we could do something clearer than this, full discussion is here: [Altering the DRS #64](https://github.com/PCMDI/input4MIPs_CVs/discussions/64)).

Ah okay, that makes sense. So I'd need to create one file for the global scaling factor (currently in this file) and seven additional files for the regional scaling factors for the seven regions (currently in this file), is that correct? And then do I still keep the historical and the (three) future scaling factors in separate files? So for 4x8 = 32 files total?

* is there a reason that future starts in 2001? That will be different to how ScenarioMIP will be handling the split, but if this is for a custom experiment, it may not matter (but it will matter if we want these dust emissions included in the DECK’s historical experiment)

Yes, the reason is that the observational reconstruction ends in the year 2000 because it is based on sedimentary records of dust deposition (like ice cores), so there's much less data for the last two decades. However, my group is working on using satellite data to extend the reconstruction to 2023 and I expect that to be ready sometime next year.

* the data does present a bit of a challenge for input4MIPs (the same comments apply to Stephanie’s data and it would be great to address them for CMIP7 (in CMIP6 there unfortunately wasn’t time)). The data has this idea of regions to which the data applies. The CF conventions do have a standard for this (https://cfconventions.org/Data/cf-conventions/cf-conventions-1.11/cf-conventions.html#geographic-regions), which is probably what we would want to use. I think that would then all just work, but it might require a bit of a tweak to how the data is written (and the grid label for regional stuff would probably then become “gr”, see the full list of options here: https://github.com/PCMDI/mip-cmor-tables/blob/main/MIP_grid_label.json).

Thanks for pointing this out. The regions do have well-defined coordinates though (the link you included mentioned "complex boundaries that cannot practically be specified using longitude and latitude boundary coordinates"). Would defining those boundaries in each file be sufficient, or do I need to do something else?

* Just to check my understanding: the variable is the same every time right? It’s a scaling factor and the idea is that models use this to scale their internally calculated dust to get closer to the change in dust over time?

Yes, that's exactly right. It's only the year and the region of application (global versus one of seven major dust aerosol source regions) that changes.

Thanks so much!

Jasper

@znichollscr
Collaborator Author

And then do I still keep the historical and the (three) future scaling factors in separate files? So for 4x8 = 32 files total?

Sorry, bad explanation from me. I would suggest one file for global and one file for regional (you can put all seven regions in one file, just use the 'region' dimension or whatever it is that the CF conventions call it to differentiate them). Then yes, one file for historical and one file for each scenario. So you'll end up with 4 x 2 = 8 files.

Just to try and be a bit clearer:

  • in the global file, there should be one variable. Its only dimension should be "time". The grid label should be "gm" (for global mean).
  • in the regional file, there should be one variable. It should be two-dimensional, ("time", "region") (or whatever the CF-convention name for 'region' is). I would probably make the grid label "gn" (this is your native grid). Then there'll probably be other auxiliary coordinates in this file to help with defining the boundaries of each region, region names etc. (a rough sketch of both layouts follows this list).
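
To make the two layouts concrete, here's a rough xarray sketch (the variable name matches the suggestion above; the years, region names and values are placeholders, not the real data):

import numpy as np
import xarray as xr

years = np.arange(1850, 2001)
regions = ["region_1", "region_2", "region_3"]  # placeholder names for the source regions

# Global file: one variable, "time" dimension only, grid label "gm"
ds_global = xr.Dataset(
    {"dustscalefactor": ("time", np.ones(years.size))},
    coords={"time": years},
    attrs={"grid_label": "gm"},
)

# Regional file: one variable, ("time", "region") dimensions, grid label "gn"
ds_regional = xr.Dataset(
    {"dustscalefactor": (("time", "region"), np.ones((years.size, len(regions))))},
    coords={"time": years, "region": regions},
    attrs={"grid_label": "gn"},
)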

Got it re the historical vs. scenario split. That's fine. If we want these files to be used for DECK simulations, we'll have to do a bit of thinking. If they're just for a specific MIP experiment, they can stay as they are.

The regions do have well-defined coordinates though (the link you included mentioned "complex boundaries that cannot practically be specified using longitude and latitude boundary coordinates"). Would defining those boundaries in each file be sufficient, or do I need to do something else?

Ah ok nice. Defining those boundary files would definitely be sufficient. (If it were me, I would just give the regions names first, make sure I can write the files, then go back and add the boundary definition second, because that boundary definition could be fiddly, but you may be able to skip straight to writing the boundary conditions!)

@jfkok

jfkok commented Oct 24, 2024

Thanks, that's helpful. I'm working on implementing these changes and obtaining corrected files. In doing so, I realized that making the variable name the same for all files also means that the variables are identical for the four different global files (1 historical and 3 future scenarios) and for the four different regional files. So how would I distinguish the files if I can't put the scenario in the variable name?

The regions do have well-defined coordinates though (the link you included mentioned "complex boundaries that cannot practically be specified using longitude and latitude boundary coordinates"). Would defining those boundaries in each file be sufficient, or do I need to do something else?

Ah ok nice. Defining those boundary files would definitely be sufficient. (If it were me, I would just give the regions names first, make sure I can write the files, then go back and add the boundary definition second, because that boundary definition could be fiddly, but you may be able to skip straight to writing the boundary conditions!)

I'm not sure I quite understand what you mean by "boundary files". Would it be sufficient to specify the coordinates of the region boundaries as a "boundary coordinates" attribute of the "region" variable? Or should I do something different?

Thanks!

Jasper

@znichollscr
Collaborator Author

So how would I distinguish the files if I can't put the scenario in the variable name?

Excellent question. The answer is, at the moment, you put it in the "source_id" bit of the file name. So, for example, your filenames would become (where I've also dropped off the noleap prefix that isn't part of the DRS):

  • historical: dustscaling_input4MIPs_emissions_AerChemMIP_UCLA-1-0-1_gm_185001-200012.nc
  • constant scenario: dustscaling_input4MIPs_emissions_AerChemMIP_UCLA-1-0-1-constant_gm_200101-210012.nc
  • decreasing scenario: dustscaling_input4MIPs_emissions_AerChemMIP_UCLA-1-0-1-decreasing_gm_200101-210012.nc
  • increasing scenario: dustscaling_input4MIPs_emissions_AerChemMIP_UCLA-1-0-1-increasing_gm_200101-210012.nc

As you can tell, this doesn't make that much sense and is easy to miss, which is why we're having the discussion in #64. That may lead to a renaming in future, but for now the above is what to go for.
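
For reference, the filename is just the DRS components joined with underscores; reading them off the examples above (the slot names here are my informal labels, not official DRS terminology):

# Hypothetical helper, purely to illustrate the pattern in the filenames above
def build_filename(variable, target_mip, source_id, grid_label, time_range):
    # activity ("input4MIPs") and dataset category ("emissions") are fixed for these files
    parts = [variable, "input4MIPs", "emissions", target_mip, source_id, grid_label, time_range]
    return "_".join(parts) + ".nc"

print(build_filename("dustscaling", "AerChemMIP", "UCLA-1-0-1-constant", "gm", "200101-210012"))
# -> dustscaling_input4MIPs_emissions_AerChemMIP_UCLA-1-0-1-constant_gm_200101-210012.nc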

@jfkok

jfkok commented Oct 25, 2024

Thanks so much! I've implemented all your comments (I think) and uploaded the updated files here. Let me know if you think any further changes are needed.

One thing to note is that I added the coordinates of the region boundaries as a "boundary coordinates" attribute of the "region" variable. Let me know in case I should be doing something differently.

@znichollscr
Collaborator Author

Nice, thanks.

Let me know if you think any further changes are needed.

Underscores in the source ID have to be changed to hyphens i.e.:

  • dustscalefactor_input4MIPs_emissions_AerChemMIP_UCLA-1-0-1_constant_gm_200101-210012.nc -> dustscalefactor_input4MIPs_emissions_AerChemMIP_UCLA-1-0-1-constant_gm_200101-210012.nc
  • dustscalefactor_input4MIPs_emissions_AerChemMIP_UCLA-1-0-1_increasing_gm_200101-210012.nc -> dustscalefactor_input4MIPs_emissions_AerChemMIP_UCLA-1-0-1-increasing_gm_200101-210012.nc
  • dustscalefactor_input4MIPs_emissions_AerChemMIP_UCLA-1-0-1_decreasing_gm_200101-210012.nc -> dustscalefactor_input4MIPs_emissions_AerChemMIP_UCLA-1-0-1-decreasing_gm_200101-210012.nc
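
If it's useful, a throwaway pathlib snippet that would do those renames in bulk (assuming the files sit in the current directory; adjust the glob pattern as needed):

from pathlib import Path

for path in Path(".").glob("dustscalefactor_input4MIPs_*.nc"):
    # Replace the underscore before the scenario suffix with a hyphen,
    # e.g. UCLA-1-0-1_constant -> UCLA-1-0-1-constant
    new_name = path.name.replace("UCLA-1-0-1_", "UCLA-1-0-1-")
    if new_name != path.name:
        path.rename(path.with_name(new_name))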

Then I would suggest trying to run them through the validator and write them in the DRS: https://input4mips-validation.readthedocs.io/en/latest/how-to-guides/how-to-write-a-single-file-in-the-drs/

One thing to note is that I added the coordinates of the region boundaries as a "boundary coordinates" attribute of the "region" variable. Let me know in case I should be doing something differently.

Not sure, I haven't done this particular step before. If you run the files through the validator, the CF-checker will flag anything really wrong. In a couple of weeks I can pull the files down and have a look myself (meeting next week will take up time before then).

@jfkok

jfkok commented Oct 28, 2024

Thanks Zeb, I corrected the file names.

I tried running the validator after installing it as an application per the instructions here. However, the command import input4mips_validation.io triggers an error message, pasted below. Do you know if there is an easy solution for this? Thanks!

[screenshot of the error message]

@znichollscr
Collaborator Author

Do you know if there is an easy solution for this?

Hmm that's not very good. Let's dive down the rabbit hole here: climate-resource/input4mips_validation#78

@znichollscr
Collaborator Author

@jfkok are there any updated files or should I just use this link (https://drive.google.com/drive/u/0/folders/1Xr1A4oqPj35h43MFV1864IXSvYSVtYKs) again?

@jfkok

jfkok commented Dec 11, 2024 via email

@znichollscr
Collaborator Author

znichollscr commented Dec 11, 2024

Looking overall good. I had to make some minor tweaks to the files to get them to pass validation; that was pretty straightforward. The Python code I used is below (it runs in the same environment as input4mips-validation), but you can probably do the same with Matlab easily. Looking at the changes, I assume they'll need to be applied to all files.

Python code for tweaks
import netCDF4

ds = netCDF4.Dataset(
    "dustscalefactor_input4MIPs_emissions_AerChemMIP_UCLA-1-0-1_gn_185001-200012.nc",
    "a",
)

# Capitalisation matters according to the spec
# (doesn't make sense to me, but here we are)
ds["dustscalefactor"].delncattr("Units")
# Apparently dimensionless is represented by "1"
# (again, wouldn't be my choice of how to do this)
ds["dustscalefactor"].setncattr("units", "1")
ds["dustscalefactor"].setncattr("long_name", "dustscalefactor")

# Capitalisation matters according to the spec
# (doesn't make sense to me, but here we are)
ds["region"].delncattr("Units")
# Apparently dimensionless is represented by "1"
# (again, wouldn't be my choice of how to do this)
ds["region"].setncattr("units", "1")
# Whitespace isn't allowed in attribute names
# (not sure why, it's just the spec)
ds["region"].setncattr(
    "boundary_coordinates", ds["region"].getncattr("Boundary coordinates")
)
ds["region"].delncattr("Boundary coordinates")
ds["region"].setncattr("long_name", "region")

# We use 'yr' rather than 'annual'
ds.setncattr("frequency", "yr")
# License info
ds.setncattr("license_id", "CC BY 4.0")
ds.setncattr(
    "license",
    (
        "The input4MIPs data linked to this entry is licensed "
        "under a Creative Commons Attribution 4.0 International "
        "(https://creativecommons.org/licenses/by/4.0/). "
        "Consult https://pcmdi.llnl.gov/CMIP6/TermsOfUse "
        "for terms of use governing CMIP6Plus output, "
        "including citation requirements and proper acknowledgment. "
        "The data producers and data providers make no warranty, either express "
        "or implied, including, but not limited to, warranties of merchantability "
        "and fitness for a particular purpose. "
        "All liabilities arising from the supply of the information "
        "(including any liability arising in negligence) "
        "are excluded to the fullest extent permitted by law."
    ),
)

# If you know this, helpful.
# If not, can just leave vague like this
ds.setncattr("nominal_resolution", "250 km")

ds.close()

The only other minor change is that, because you have annual data, the end of the file name should be "YYYY-YYYY" not "YYYYMM-YYYYMM" as you have now, i.e. for your historical files, "185001-200012.nc" becomes "1850-2000.nc".
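
For completeness, a small rename snippet for that change too (again assuming the files are in the current directory):

import re
from pathlib import Path

for path in Path(".").glob("dustscalefactor_*.nc"):
    # Drop the month digits from the time range, e.g. _185001-200012.nc -> _1850-2000.nc
    new_name = re.sub(r"_(\d{4})\d{2}-(\d{4})\d{2}\.nc$", r"_\1-\2.nc", path.name)
    if new_name != path.name:
        path.rename(path.with_name(new_name))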

If you're able to make those tweaks, we should be on the home straight.

@znichollscr
Collaborator Author

The other thing we need to do is register UCLA as an institute. I've started that here if you're able to take a look: PCMDI/mip-cmor-tables#84

Thanks!

@jfkok

jfkok commented Dec 12, 2024 via email

@znichollscr
Collaborator Author

Thanks Zeb, that was very helpful. I've addressed all these issues and the file now passes the validation, including the cf checker!

Oh cool, I thought the cf-checker wasn't running on your machine, great to hear it's working!

Does that conclude the process or are there other steps we need to take?

Well, if you want to publish the data on ESGF there are a couple more steps. If you're not worried about that, we're done (although I'm also not sure that you needed to do any of these steps if you didn't want to publish on ESGF in the first place).

@znichollscr
Collaborator Author

Assuming you do want to publish on ESGF, the next steps are:

  • check that this is the right institute to associate with your data: Add UCLA mip-cmor-tables#84
  • I'll pull your data down onto our server
  • Paul and I will then push it through the publication queue
  • Good to go

@jfkok

jfkok commented Dec 13, 2024 via email

@znichollscr
Collaborator Author

znichollscr commented Dec 13, 2024

Ok perfect. There is one last step that would be very helpful: can you please run the script below? (It should be clear what it's doing; any questions, fire away. As above, it will run in your normal input4mips-validation environment.) Once you've done that, ping me and @durack1 and we can get the files in the publication queue.

Python script to run
import shutil
from pathlib import Path

import iris
import netCDF4
from input4mips_validation.cvs import load_cvs
from input4mips_validation.dataset import Input4MIPsDataset
from input4mips_validation.inference.from_data import infer_time_start_time_end
from input4mips_validation.logging import setup_logging
from input4mips_validation.upload_ftp import upload_ftp
from input4mips_validation.validation.file import get_validate_file_result
from input4mips_validation.xarray_helpers.iris import ds_from_iris_cubes


setup_logging(enable=True, logging_level="INFO_FILE")


def create_fixed_tmp_file(file: Path, fixed_folder: Path) -> Path:
    fixed_file = fixed_folder / file.name
    shutil.copy(file, fixed_file)

    source_id = file.name.split("_")[4]

    ds = netCDF4.Dataset(fixed_file, "a")

    ds.setncattr("source_id", source_id)
    ds.setncattr("realm", "atmos")
    if "gn" in fixed_file.name:
        ds["region"].setncattr("long_name", "region")

    ds.close()

    return fixed_file


def validate_file_then_rewrite_file_in_drs(
    file: Path, output_root: Path, cv_source: str
) -> None:
    get_validate_file_result(
        file,
        cv_source=cv_source,
    ).raise_if_errors()

    cvs = load_cvs(cv_source=cv_source)

    ds = ds_from_iris_cubes(
        iris.load(file),
    )
    ds.attrs["nominal_resolution"] = "250 km"
    ds.attrs["target_mip"] = "AerChemMIP2"  # To confirm with Jasper

    time_start, time_end = infer_time_start_time_end(
        ds=ds,
        frequency_metadata_key="frequency",
        no_time_axis_frequency="fx",
        time_dimension="time",
    )

    full_file_path = cvs.DRS.get_file_path(
        root_data_dir=output_root,
        available_attributes=ds.attrs,
        time_start=time_start,
        time_end=time_end,
    )

    if full_file_path.exists():
        raise FileExistsError(full_file_path)

    full_file_path.parent.mkdir(parents=True, exist_ok=True)

    if full_file_path.name != file.name:
        Input4MIPsDataset.from_ds(ds, cvs=cvs).write(
            root_data_dir=output_root,
        )

    else:
        shutil.copy(file, full_file_path)

    print(f"File written according to the DRS in {full_file_path}")


def main() -> None:
    # Obviously point these wherever makes sense for you
    DATA_FOLDER = Path("dust-files")
    TMP_REWRITE_FOLDER = Path("dust-files-pre-validation-fixes")
    OUTPUT_ROOT = Path("dust-rewritten")

    # You shouldn't need to change this
    CV_SOURCE = "gh:main"

    TMP_REWRITE_FOLDER.mkdir(exist_ok=True, parents=True)

    files_to_upload = DATA_FOLDER.rglob("*.nc")

    for file in files_to_upload:
        tmp_file = create_fixed_tmp_file(file, TMP_REWRITE_FOLDER)
        validate_file_then_rewrite_file_in_drs(tmp_file, OUTPUT_ROOT, CV_SOURCE)

    cvs = load_cvs(cv_source=CV_SOURCE)
    upload_ftp(
        tree_root=OUTPUT_ROOT,
        ftp_dir_rel_to_root="UCLA-1-0-1-upload-1",
        password="[email protected]",
        cvs=cvs,
        username="anonymous",
        ftp_server="ftp.llnl.gov",
        ftp_dir_root="/incoming",
        # You can try making this 8, but the FTP server seems to not enjoy parallel uploads
        n_threads=1,
        dry_run=False,
        continue_on_error=False,
    )


if __name__ == "__main__":
    main()

@jfkok

jfkok commented Jan 6, 2025

Happy 2025, @durack1, @znichollscr! I ran the python script and copied the output below. It seems to have run correctly, as far as I can tell.

[screenshot of the script output]

@znichollscr
Collaborator Author

znichollscr commented Jan 6, 2025

Hmm I don't think so, no files were uploaded. Did you update DATA_FOLDER in the script to point to the directory with your files?

@znichollscr
Collaborator Author

If yes, change

files_to_upload = DATA_FOLDER.rglob("*.nc")

to

files_to_upload = list(DATA_FOLDER.rglob("*.nc"))
assert files_to_upload, f"No re-written files found in {DATA_FOLDER}, please check"

because something isn't working...

@jfkok

jfkok commented Jan 6, 2025 via email

@znichollscr
Collaborator Author

Hmmm I can't see the screenshot and we don't seem to have anything on the server. Can you try sending the screenshot again please?

@jfkok

jfkok commented Jan 6, 2025

That's odd! I've added the line of code you mentioned and that warning doesn't trigger.

Here's the screenshot again, hopefully it comes across this time.
[screenshot]

@znichollscr
Collaborator Author

Super weird. Is there anything in the folder pointed to by OUTPUT_ROOT?

@znichollscr
Collaborator Author

(For what it's worth, the timestamps for the message about uploading files and for the success message are the same, so it's clear that no uploading is happening.)

@jfkok

jfkok commented Jan 27, 2025

Sorry for the 3-week hiatus - things got busy and of course we had the fires here.

Now when I run the code I get a permission error (copied below). This seems odd because the code creates a whole string of directories in the OUTPUT_ROOT directory, but it seems it somehow can't write the file in the final directory it creates. Do you have any ideas what the problem could be?

[screenshot]

[screenshot]

@znichollscr
Collaborator Author

Sorry for the 3-week hiatus - things got busy and of course we had the fires here

No worries. Hope you're ok with everything else that is going on.

On Windows it is possible to hit file path length issues. I would guess it is that, so you might have to write in e.g. C:\Users\Jasper Kok rather than in the subfolder you're currently trying to write in.

@jfkok

jfkok commented Jan 30, 2025

Ah thanks, that seems to have fixed (part of) it!

I now have output in the output_root directory for each of the 8 original data files. Do you see that on your end? I've copied a screenshot of the directory structure that was created below.

After writing those files, the script still runs into an error though (see second screenshot). It's a file-not-found error, but the path listed is that of the first file that was written out, which is present (see first screenshot), so I'm confused... Thanks for any help!

[screenshot of the directory structure]

[screenshot of the error]

@znichollscr
Collaborator Author

znichollscr commented Jan 30, 2025 via email

@jfkok

jfkok commented Jan 31, 2025

Okay, I think it finally worked now! Do you see it on your end?

[screenshot]

@znichollscr
Collaborator Author

Got the files!

So, there are still a few issues with them (see below). However, we could also just publish, pros and cons below.

Issues that I can see right now:

  • none of the files have bounds for time
  • apparently, according to CF-conventions, region should be a string, not an integer
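
For reference, a quick netCDF4 check for both points (swap in whichever of your files you want to inspect; the filename here is just the historical regional example):

import netCDF4

ds = netCDF4.Dataset(
    "dustscalefactor_input4MIPs_emissions_AerChemMIP_UCLA-1-0-1_gn_1850-2000.nc"
)

# Time bounds: the time variable should carry a "bounds" attribute pointing at
# a (time, 2) variable such as "time_bnds"
print("time bounds attribute:", getattr(ds["time"], "bounds", "MISSING"))

# Region labels: CF wants string names here, not integer codes
print("region dtype:", ds["region"].dtype)

ds.close()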

Pros and cons of publishing now:

Pros

  • data is out there, people can probably use it as is
  • you get feedback on the actual numbers now
  • the validation tools we're writing are constantly improving anyway, so even if we wait, there will probably be other things we pick up and it'd be good to get something published rather than being stuck in a seemingly endless validation cycle

Cons

  • these files won't necessarily work smoothly with tools that expect CF-compliant data (although there aren't so many of those)
  • we really should clean these things up at some point, which means users will get a slightly different update when the fixes are done

I'm happy to go either way, so I'd be interested in your thoughts/wishes and also those of @durack1

@jfkok

jfkok commented Feb 3, 2025 via email

@durack1
Contributor

durack1 commented Feb 4, 2025

My two cents: adding time bounds is an easy step (@jfkok it just means we record when a data value is valid, e.g. the start and end of the month for monthly mean data, irrespective of your time axis value).

I have yet to peek at these data, but I would definitely fix any issues with loading the data using xarray (the most common package in climate science these days); other simple nits we could also capture as the time axis bounds are being created.

@znichollscr
Collaborator Author

@jfkok an updated script that now also adds time bounds is below. The changed lines are these ones:

import cftime
...
def create_fixed_tmp_file(file: Path, fixed_folder: Path) -> Path:
    ...
    # Add time bounds
    ds.createDimension("bnds", 2)
    ds.createVariable("time_bnds", "f8", ("time", "bnds"))

    # Assumes yearly steps, but probably ok
    time_dates = cftime.num2date(
        ds["time"][:], units=ds["time"].units, calendar=ds["time"].calendar
    )
    time_years = [dt.year for dt in time_dates]
    time_bounds_lower = cftime.date2num(
        [cftime.datetime(y, 1, 1, calendar=ds["time"].calendar) for y in time_years],
        units=ds["time"].units,
    )
    time_bounds_upper = cftime.date2num(
        [
            cftime.datetime(y + 1, 1, 1, calendar=ds["time"].calendar)
            for y in time_years
        ],
        units=ds["time"].units,
    )
    ds["time_bnds"][:, 0] = time_bounds_lower
    ds["time_bnds"][:, 1] = time_bounds_upper

    ds["time"].setncattr("bounds", "time_bnds")
    ...
Full script
import shutil
from pathlib import Path

import cftime
import iris
import netCDF4
from input4mips_validation.cvs import load_cvs
from input4mips_validation.dataset import Input4MIPsDataset
from input4mips_validation.inference.from_data import infer_time_start_time_end
from input4mips_validation.logging import setup_logging
from input4mips_validation.validation.file import get_validate_file_result
from input4mips_validation.xarray_helpers.iris import ds_from_iris_cubes

setup_logging(enable=True, logging_level="INFO_FILE")


def create_fixed_tmp_file(file: Path, fixed_folder: Path) -> Path:
    fixed_file = fixed_folder / file.name
    shutil.copy(file, fixed_file)

    source_id = file.name.split("_")[4]

    ds = netCDF4.Dataset(fixed_file, "a")

    ds.setncattr("source_id", source_id)
    ds.setncattr("realm", "atmos")
    if "gn" in fixed_file.name:
        ds["region"].setncattr("long_name", "region")

    # Add time bounds
    ds.createDimension("bnds", 2)
    ds.createVariable("time_bnds", "f8", ("time", "bnds"))

    # Assumes yearly steps, but probably ok
    time_dates = cftime.num2date(
        ds["time"][:], units=ds["time"].units, calendar=ds["time"].calendar
    )
    time_years = [dt.year for dt in time_dates]
    time_bounds_lower = cftime.date2num(
        [cftime.datetime(y, 1, 1, calendar=ds["time"].calendar) for y in time_years],
        units=ds["time"].units,
    )
    time_bounds_upper = cftime.date2num(
        [
            cftime.datetime(y + 1, 1, 1, calendar=ds["time"].calendar)
            for y in time_years
        ],
        units=ds["time"].units,
    )
    ds["time_bnds"][:, 0] = time_bounds_lower
    ds["time_bnds"][:, 1] = time_bounds_upper

    ds["time"].setncattr("bounds", "time_bnds")

    ds.close()

    # Quick sanity check that the fixed file still opens cleanly with xarray
    import xarray as xr

    xr.open_dataset(fixed_file).close()

    return fixed_file


def validate_file_then_rewrite_file_in_drs(
    file: Path, output_root: Path, cv_source: str
) -> None:
    get_validate_file_result(
        file,
        cv_source=cv_source,
    ).raise_if_errors()

    cvs = load_cvs(cv_source=cv_source)

    ds = ds_from_iris_cubes(
        iris.load(file),
    )
    ds.attrs["nominal_resolution"] = "250 km"
    ds.attrs["target_mip"] = "AerChemMIP2"  # To confirm with Jasper

    time_start, time_end = infer_time_start_time_end(
        ds=ds,
        frequency_metadata_key="frequency",
        no_time_axis_frequency="fx",
        time_dimension="time",
    )

    full_file_path = cvs.DRS.get_file_path(
        root_data_dir=output_root,
        available_attributes=ds.attrs,
        time_start=time_start,
        time_end=time_end,
    )

    if full_file_path.exists():
        raise FileExistsError(full_file_path)

    full_file_path.parent.mkdir(parents=True, exist_ok=True)

    if full_file_path.name != file.name:
        Input4MIPsDataset.from_ds(ds, cvs=cvs).write(
            root_data_dir=output_root,
        )

    else:
        shutil.copy(file, full_file_path)

    print(f"File written according to the DRS in {full_file_path}")


def main() -> None:
    # Obviously point these wherever makes sense for you
    DATA_FOLDER = Path("dust-files")
    TMP_REWRITE_FOLDER = Path("dust-files-pre-validation-fixes")
    OUTPUT_ROOT = Path("dust-rewritten")

    # You shouldn't need to change this
    CV_SOURCE = "gh:main"

    TMP_REWRITE_FOLDER.mkdir(exist_ok=True, parents=True)

    files_to_upload = DATA_FOLDER.rglob("*.nc")

    for file in files_to_upload:
        tmp_file = create_fixed_tmp_file(file, TMP_REWRITE_FOLDER)
        validate_file_then_rewrite_file_in_drs(tmp_file, OUTPUT_ROOT, CV_SOURCE)

    cvs = load_cvs(cv_source=CV_SOURCE)
    # upload_ftp(
    #     tree_root=OUTPUT_ROOT,
    #     ftp_dir_rel_to_root="UCLA-1-0-1-upload-1",
    #     password="[email protected]",
    #     cvs=cvs,
    #     username="anonymous",
    #     ftp_server="ftp.llnl.gov",
    #     ftp_dir_root="/incoming",
    #     # You can try making this 8, but the FTP server seems to not enjoy parallel uploads
    #     n_threads=1,
    #     dry_run=False,
    #     continue_on_error=False,
    # )


if __name__ == "__main__":
    main()

@znichollscr
Collaborator Author

Any questions, fire away (the whole world of CF-conventions, bounds etc. is super confusing so very happy to help and there are no dumb questions)

@jfkok

jfkok commented Feb 6, 2025 via email

@znichollscr
Collaborator Author

znichollscr commented Feb 6, 2025

Do you have a suggestion or can we let that one slide?

I thought this would be easy. Turns out it's more complicated, but still doable. Have a look at the below.

Script to update region handling
import shutil
from pathlib import Path

import cftime
import iris
import netCDF4
import numpy as np
from input4mips_validation.cvs import load_cvs
from input4mips_validation.inference.from_data import infer_time_start_time_end
from input4mips_validation.io import (
    generate_creation_timestamp,
    generate_tracking_id,
)
from input4mips_validation.logging import setup_logging
from input4mips_validation.upload_ftp import upload_ftp
from input4mips_validation.validation.tree import get_validate_tree_result
from input4mips_validation.xarray_helpers.iris import ds_from_iris_cubes

setup_logging(enable=True, logging_level="INFO_FILE")


def copy_dimension(
    ds: netCDF4.Dataset, name: str, dim_to_copy: netCDF4.Dimension
) -> None:
    ds.createDimension(
        name, (len(dim_to_copy) if not dim_to_copy.isunlimited() else None)
    )


def copy_variable(
    ds: netCDF4.Dataset,
    name: str,
    variable_to_copy: netCDF4.Variable,
    dimensions: tuple[str, ...] | None = None,
) -> None:
    if dimensions is None:
        dimensions = variable_to_copy.dimensions

    ds.createVariable(name, variable_to_copy.datatype, dimensions)
    ds.variables[name][:] = variable_to_copy[:]
    for attr in variable_to_copy.ncattrs():
        ds[name].setncattr(attr, variable_to_copy.getncattr(attr))


def create_fixed_tmp_file(file: Path, fixed_folder: Path) -> Path:
    fixed_file = fixed_folder / file.name

    source_id = file.name.split("_")[4]

    ds = netCDF4.Dataset(file, "r")
    ds_new = netCDF4.Dataset(fixed_file, "w")

    # Directly copy over the things we want to keep
    # (can't delete because of netCDF4's API).
    # Attributes
    for name in ds.ncattrs():
        ds_new.setncattr(name, ds.getncattr(name))

    # Handle the dimension and variable creation more carefully
    gn_new_region_dim = "lbl"

    # Dimensions
    if "gn" in fixed_file.name:
        for name, dimension in ds.dimensions.items():
            if name == "region":
                ds_new.createDimension(gn_new_region_dim, len(dimension))

            else:
                copy_dimension(ds_new, name=name, dim_to_copy=dimension)

    else:
        for name, dimension in ds.dimensions.items():
            copy_dimension(ds_new, name=name, dim_to_copy=dimension)

    for name, variable in ds.variables.items():
        if "gn" in fixed_file.name:
            if name == "region":
                region_names = [
                    v.split("=")[-1]
                    for v in ds["region"].getncattr("description").split(";")
                ]
                region_names_max_length = max(len(v) for v in region_names)
                regions = np.array(region_names, dtype=f"S{region_names_max_length}")

                boundary_coordinates = "; ".join(
                    [
                        v.split("=")[-1].strip()
                        for v in ds["region"]
                        .getncattr("boundary_coordinates")
                        .split(";")
                    ]
                )

                ds_new.createDimension("strlen", region_names_max_length)
                ds_new.createVariable(
                    name,
                    "S1",
                    (gn_new_region_dim, "strlen"),
                )
                ds_new[name][:] = netCDF4.stringtochar(regions)

                ds_new[name].setncattr("boundary_coordinates", boundary_coordinates)
                ds_new[name].setncattr("long_name", "region")

            else:
                copy_variable(
                    ds_new,
                    name=name,
                    variable_to_copy=ds.variables[name],
                    dimensions=tuple(
                        v.replace("region", gn_new_region_dim)
                        for v in ds.variables[name].dimensions
                    ),
                )

        else:
            copy_variable(ds_new, name=name, variable_to_copy=ds.variables[name])

    ds_new.setncattr("source_id", source_id)
    ds_new.setncattr("realm", "atmos")
    # Update to make this easier to parse for users
    ds_new.setncattr("contact", "[email protected];[email protected]")
    ds_new.setncattr("nominal_resolution", "250 km")
    ds_new.setncattr("target_mip", "AerChemMIP2")  # To confirm with Jasper
    # Generate new tracking ID
    ds_new.setncattr("tracking_id", generate_tracking_id())
    ds_new.setncattr("creation_date", generate_creation_timestamp())

    # Add time bounds
    ds_new.createDimension("bnds", 2)
    ds_new.createVariable("time_bnds", "f8", ("time", "bnds"))

    # Assumes yearly steps, but probably ok
    time_dates = cftime.num2date(
        ds_new["time"][:], units=ds_new["time"].units, calendar=ds_new["time"].calendar
    )
    time_years = [dt.year for dt in time_dates]
    time_bounds_lower = cftime.date2num(
        [
            cftime.datetime(y, 1, 1, calendar=ds_new["time"].calendar)
            for y in time_years
        ],
        units=ds_new["time"].units,
    )
    time_bounds_upper = cftime.date2num(
        [
            cftime.datetime(y + 1, 1, 1, calendar=ds_new["time"].calendar)
            for y in time_years
        ],
        units=ds_new["time"].units,
    )
    ds_new["time_bnds"][:, 0] = time_bounds_lower
    ds_new["time_bnds"][:, 1] = time_bounds_upper

    ds_new["time"].setncattr("bounds", "time_bnds")

    # Could use a context manager instead, whatever
    ds.close()
    ds_new.close()

    return fixed_file


def rewrite_file_in_drs(file: Path, output_root: Path, cv_source: str) -> None:
    cvs = load_cvs(cv_source=cv_source)

    ds = ds_from_iris_cubes(
        iris.load(file),
    )

    time_start, time_end = infer_time_start_time_end(
        ds=ds,
        frequency_metadata_key="frequency",
        no_time_axis_frequency="fx",
        time_dimension="time",
    )

    full_file_path = cvs.DRS.get_file_path(
        root_data_dir=output_root,
        available_attributes=ds.attrs,
        time_start=time_start,
        time_end=time_end,
    )

    if full_file_path.exists():
        raise FileExistsError(full_file_path)

    full_file_path.parent.mkdir(exist_ok=True, parents=True)
    shutil.copy(file, full_file_path)
    print(f"File written according to the DRS in {full_file_path}")


def main() -> None:
    # Obviously point these wherever makes sense for you
    DATA_FOLDER = Path("dust-files")
    TMP_REWRITE_FOLDER = Path("dust-files-pre-validation-fixes")
    OUTPUT_ROOT = Path("dust-rewritten")

    # You shouldn't need to change this
    CV_SOURCE = "gh:main"

    TMP_REWRITE_FOLDER.mkdir(exist_ok=True, parents=True)

    files_to_upload = DATA_FOLDER.rglob("*.nc")

    for file in files_to_upload:
        tmp_file = create_fixed_tmp_file(file, TMP_REWRITE_FOLDER)
        rewrite_file_in_drs(tmp_file, OUTPUT_ROOT, CV_SOURCE)

    cvs = load_cvs(cv_source=CV_SOURCE)
    get_validate_tree_result(OUTPUT_ROOT, cv_source=None, cvs=cvs).raise_if_errors()
    upload_ftp(
        tree_root=OUTPUT_ROOT,
        ftp_dir_rel_to_root="UCLA-1-0-1-upload-1",
        password="[email protected]",
        cvs=cvs,
        username="anonymous",
        ftp_server="ftp.llnl.gov",
        ftp_dir_root="/incoming",
        # You can try making this 8, but the FTP server seems to not enjoy parallel uploads
        n_threads=1,
        dry_run=False,
        continue_on_error=False,
    )


if __name__ == "__main__":
    main()

@jfkok

jfkok commented Feb 7, 2025 via email

@znichollscr
Collaborator Author

lgtm! @durack1 they're in /incoming/UCLA-1-0-1-upload-1/input4MIPs if you want to pull them and get them in the publication queue

@znichollscr
Collaborator Author

Nice work @jfkok

@durack1
Contributor

durack1 commented Feb 10, 2025

@znichollscr @jfkok great!

I'm just peeking at the subdirs. There are several versions in that path. Is the latest, 20250207, the only one we need? Also, "CMIP6" is the MIP era, which is wrong (for next-gen prototype data, we're using CMIP6Plus). So we'll want to catch that. Looks like the only missing global attribute is nominal_resolution, which should be gm = 10000 km. For the gn this is different-sized regions, so I'm not quite sure how best to deal with that, but 5000 km might be about right for the specified regions. @jfkok you might also want to correct the licensedunder -> licensed under typo in the license entry.

One extra nit: if these are observed data (rather than modelled), then the calendar = "noleap" is invalid.
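
In case it helps, a rough netCDF4 sketch of the global-attribute tweaks above (mip_era, nominal_resolution and the license typo; the calendar question is separate). Treat it as a starting point, not the final word:

import netCDF4

# Illustrative only: apply to each file, with the appropriate nominal_resolution
# ("10000 km" for the gm files; perhaps "5000 km" for the gn files, to be confirmed)
ds = netCDF4.Dataset(
    "dustscalefactor_input4MIPs_emissions_AerChemMIP_UCLA-1-0-1_gm_1850-2000.nc", "a"
)

ds.setncattr("mip_era", "CMIP6Plus")
ds.setncattr("nominal_resolution", "10000 km")

# Fix the "licensedunder" typo in the license string
ds.setncattr("license", ds.getncattr("license").replace("licensedunder", "licensed under"))

ds.close()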

@durack1
Contributor

durack1 commented Feb 11, 2025

@jfkok please ping this thread if any of the above tweaks aren't clear. If you can upload these to a new subdir (e.g. /incoming/UCLA-1-0-1-upload-4), it'll save me sifting through old versions with exactly the same filenames.

@jfkok

jfkok commented Feb 11, 2025 via email

@jfkok

jfkok commented Feb 12, 2025 via email

@znichollscr
Collaborator Author

So do you mean I should change "mip_era" to "CMIP7"? Or something different?

"CMIP6Plus" (our halfway house)
