Recast as a stand alone document #20

davidhassell · 2021-06-21T10:30:27Z

Previously the document silently assumed that it was part of the CF conventions, which is not to be the case (at this time, at least). This PR aims to make the document independent.

JonathanGregory

Dear David et al.

Thanks for the new version. I have some comments.

Since this is now a separate convention, will you put its attributes in Appendix A? Perhaps that would be a good idea, like for UGRID.
I am confused about the location variable. The data is contained in an orthogonal array of fragments. I understand that to mean that the decomposition of any given aggregated dimension into fragments is broadcast along all the other fragment dimensions. That means the decomposition needs to be specified only once for each fragment dimension, and I'd therefore expect the location variable to have dimensions (number of fragment dimensions, largest fragment dimension size, 2).
The file variable could be omitted if all fragments are in the same file as the aggregation variable.
The format variable could be omitted if every fragment is either in the same file as the aggregation variable or is a netCDF file i.e. assume by default than an external file is netCDF.
I am confused about the appearance of the heading "Example 1". This doesn't seem to begin an example.
Future-proofing would replace Gregorian with standard.

Best wishes

Jonathan

davidhassell · 2021-07-06T13:46:08Z

Dear @JonathanGregory

Thank you for reviewing this. Here are some responses to your points, which are largely meant to explain why it has been done like it has, not to necessarily imply that we won't change it! We can take from here with this background in mind.

Since this is now a separate convention, will you put its attributes in Appendix A? Perhaps that would be a good idea, like for UGRID.

We're currently assuming that the CF conventions do not recognise CFA. If/when this is the case, we'd either move the text into CF, or link to like we do for UGRID. In either case, appendix would indeed get some entries.

I am confused about the location variable. The data is contained in an orthogonal array of fragments. I understand that to mean that the decomposition of any given aggregated dimension into fragments is broadcast along all the other fragment dimensions. That means the decomposition needs to be specified only once for each fragment dimension, and I'd therefore expect the location variable to have dimensions (number of fragment dimensions, largest fragment dimension size, 2).

I see. Example 2 gives us:

    // Existing. Location has shape (2, 1, 1, 1, 4, 2)
    // Location has shape (4, 2) = shape of fragment array + (number of fragment dims, 2)
    location = 0, 5,
               0, 0,
               0, 72,
               0, 143,
               6, 11,
               0, 0,
               0, 72,
               0, 143 ;

I think you are suggesting:

    // Jonathan's proposal.
    // Location has shape (4, 2, 2) = (number of fragment dims, largest fragment dimension size, 2)
    location = 0, 5, 6, 11,
               0, 0, _, _, 
               0, 72, _, _,
               0, 143, _ , _;

I might also suggest

    // Another "dask-like" idea: record the sizes, not the ranges.
    // Location has shape (4, 2) = (number of fragment dims, largest fragment dimension size)
    location = 6, 6,
               1, _,
               73, _
               144, _ ;

The file variable could be omitted if all fragments are in the same file as the aggregation variable.

The format variable could be omitted if every fragment is either in the same file as the aggregation variable or is a netCDF file i.e. assume by default than an external file is netCDF.

This was touched on in #7. We felt that a key feature of these conventions was the ability to aggregate arbitrary file formats, so making a special case for netCDF fragments doesn't really fit in with that. Admittedly the conventions currently only specify how to address netCDF files, as that is our primary use case, but we have wondered about including UM/PP at this time (possibly even decided to do so - can't remember!).

The case when all fragments are in the parent file is certainly possible, but would be unusual - so creating a special case for address in this circumstance didn't seem worth it.

I am confused about the appearance of the heading "Example 1". This doesn't seem to begin an example.

I have assumed CFA does not need to be used with CF, and this example was meant to to show how they can play nicely together.

I now think that perhaps this is a mistake, and CFA must be used with CF, in which case the italicised text of example 1 would become main body text (like it was in previous versions). What do @bnlawrence @nmassey001 @sadielbartholomew think?

Future-proofing would replace Gregorian with standard.

Cutting edge stuff! Done.

JonathanGregory · 2021-07-06T15:16:23Z

Dear David

    // Jonathan's proposal.
    // Location has shape (4, 2, 2) = (number of fragment dims, largest fragment dimension size, 2)
    location = 0, 5, 6, 11,
               0, 0, _, _,
               0, 72, _, _,
               0, 143, _ , _;

Yes, that's right. Isn't there a lot of redundancy in your earlier version? Your dask-like version is even better, since they must be contiguous.

The format variable could be omitted if every fragment is either in the same file as the aggregation variable or is a netCDF file i.e. assume by default than an external file is netCDF.

We felt that a key feature of these conventions was the ability to aggregate arbitrary file formats, so making a special case for netCDF fragments doesn't really fit in with that.

Alternatively, you could allow the format variable to be a scalar if all the entries in it are the same (all nc or pp, for instance), which is the most likely case, I would say - it feels wasteful to have a variable all full of repetition. Even more compactly, you could specify the unique value in the aggregate_data attribute, rather than supplying a separate variable.

Cheers

Jonathan

davidhassell · 2021-07-06T15:43:02Z

Dear @JonathanGregory,

I like the idea of using a scalar format value when all fragments are either in external files of common format (as will be the common case), or are all in the parent file (in which case the scalar format value would be missing data).

I'm not so keen on encoding a common format in the aggregation_data attribute, since it would be hard to discern if the format name was a format identifier or a variable name.

@nmassey001 - what do you thank about these format suggestions, and the location suggestions. I personally am quite taken by the dask-like version, now ...

bnlawrence · 2021-07-06T16:16:49Z

(You may not be surprised to learn that my take on the UML is the text is more ambiguous and if the UML is correct, I personally don't care about the text :-) ... however, I'm in a minority of one on this, but you can bet that I care about 0.6.1!)

Before getting into the detail I think we need to address the first paragraph :-).

The existing readme states:

The CFA (Climate and Forecast Aggregation) conventions are designed for storing a netCDF file which does not contain the data of selected variables ("aggregation variables"), rather those variables contain special attributes that provide instructions on how to create their data as an aggregation of data from other sources, which may be self-describing datasets in their own right.

But I don't think that's quite right because they don't start from the problem. I'd say:

The CFA (Climate and Forecast Aggregation) conventions describe how a netCDF file can be used to describe a dataset distributed across multiple other data files. A CFA compliant aggregation can be described in NetCDF in such way that the describing file does not contain the data of selected variables ("aggregation variables"), rather it contains variables with special attributes that provide instructions on how to create the aggregated variable data as an aggregation of data from other sources, each of which may be self-describing datasets in their own right.

In doing so we allow the possibility of a CFA aggregation sitting alongside a normal variable and avoid using the phrase "master" file (which I know wasn't in your version, but I am conscious we need to avoid it.)

I believe I am otherwise content with things now and agree with using a scalar for format!

davidhassell · 2021-07-07T07:24:39Z

Thanks, @bnlawrence.

I have updated the Introduction as you suggest - much better.

Do you have an opinion on whether CFA must be used with CF? I'm now of the opinion that there is no use-case for creating a CFA dataset that is not otherwise CF-compliant, and so CFA should be cast as as "add-on" to CF, rather than an add-on to vanilla netCDF.

I'll put together a commit for allowing a scalar format variable. (edited)

davidhassell · 2021-07-07T08:01:24Z

Scalar formatvariable implementation: 1c50acc?short_path=ef074b2#diff-ef074b2373736cf397d3e2e3613197bc8190ed058f661b1d62713b6c4cbfd577

bnlawrence · 2021-07-07T09:10:19Z

I think there is a use case for using CFA conventions for the aggregation of non-CF compliant netcdf files, but they would need to have enough compliance around coordinates for it to work. Specific examples :

Could we use the CFA conventions to persist the view of an aggregation that could be constructed by other tools on the fly (e.g. xarray) ... i.e. how much coordinate information is the minimal needed to do this aggregation?
We already know how to aggregate PP files, which are by definition not CF compliant.

davidhassell · 2021-07-07T09:38:16Z

Hi Bryan, I think the CF-compliance of the fragments is not the issue, rather the CF-compliance of the of the CFA file. Does that make sense?

nmassey001 · 2021-07-07T09:49:50Z

@nmassey001 - what do you thank about these format suggestions, and the location suggestions. I personally am quite taken by the dask-like version, now ...

I don't understand how the indexing works here:

 location = 6, 6,
               1, _,
               72, _
               143, _ ;

Should the first 6 in your version of the location array be a 0?

As an aside, I don't think I've picked up on this before, but can we switch to Python / Numpy / C / every sensible programming language style slice definitions? i.e. there are five elements between 0,6 , not six elements.

davidhassell · 2021-07-07T10:03:03Z

Hi @nmassey001,

The first 6 maps to [0, 5], and the second 6 maps to [6, 11].

The 72 maps to [0, 71]

For example, the current:

        location = 0, 5,
                   0, 0,
                   0, 35,
                   0, 143,
                   0, 5,
                   0, 0,
                   36, 72,
                   0, 143,
                   6, 11,
                   0, 0,
                   0, 36,
                   0, 143,
                   6, 11,
                   0, 0,
                   36, 72,
                   0, 143 ;

would become:

        location = 6, 6
                   1, _,
                   36, 37,
                   143, _;

I'm not sure the Fortran and Julia folks would agree with you, but moving to sizes would make the slice definition go away, anyway!

nmassey001 · 2021-07-07T10:06:09Z

@davidhassell Aha, they are sizes. Got it!
I'm all for minimal and compact representations, as you probably know, so I'd be happy with this change

bnlawrence · 2021-07-07T10:39:24Z

Hi Bryan, I think the CF-compliance of the fragments is not the issue, rather the CF-compliance of the of the CFA file. Does that make sense?

Yes, that's fine, other software can ignore the CF'ness and just use aggregation goodies. Sorry for the misunderstanding.

davidhassell · 2021-07-07T14:49:16Z

Thanks @bnlawrence and @nmassey001 - I'll address the CF compliance and location issues in separate commits ...

davidhassell · 2021-07-07T16:35:36Z

New CF integration: 6b9296c?short_path=ef074b2#diff-ef074b2373736cf397d3e2e3613197bc8190ed058f661b1d62713b6c4cbfd577

New location variable: ac815cb?short_path=ef074b2#diff-ef074b2373736cf397d3e2e3613197bc8190ed058f661b1d62713b6c4cbfd577

davidhassell · 2021-07-08T09:35:47Z

Hi @bnlawrence,

(You may not be surprised to learn that my take on the UML is the text is more ambiguous and if the UML is correct, I personally don't care about the text :-) ... however, I'm in a minority of one on this, but you can bet that I care about 0.6.1!)

I look forward to developing this over at #21 ... :)

davidhassell · 2021-07-28T07:59:31Z

Following an off-line discussion on 2021-07-27, merging this now and tagging master at v0.6

davidhassell added 29 commits June 18, 2021 19:24

abstract, intro, etc.

f1a5ed7

intro

df19f7a

layout

732126f

canonical fragments

6783111

mandatory format

284da01

mandatory format

16d24cf

intro

1cd885c

intro

cc1729d

intro

a6ee19c

intro

b2303ea

revision history, refs

9896e55

revision history

41f0c88

revision history

95e1ad4

revision history

8f16ce1

revision history

67f6101

revision history

5c8bbc3

revision history

35a245b

revision history

cb91934

typo

4f652f2

typo

de46a11

typo

1b34e50

remove duplicate text

c893d3f

general

0e8b0a0

canonical form

5b563b4

tidy

080ece7

tidy

1397af9

tidy

537982f

tidy

0cd225c

date

e87897c

davidhassell added the v0.6 label Jun 21, 2021

JonathanGregory reviewed Jul 6, 2021

View reviewed changes

davidhassell added 2 commits July 6, 2021 14:07

Gregorian -> standard

2904e95

typo

2adb6e7

davidhassell added 2 commits July 7, 2021 08:13

Introduction 1st paragraph

2b9d781

1st paragraph

ab20997

scalar format variable

1c50acc

davidhassell added 2 commits July 7, 2021 16:21

Connection with CF

6b9296c

new location variable

ac815cb

example typo

6cbca6c

davidhassell requested review from sadielbartholomew and nmassey001 July 8, 2021 09:29

0.6 release date

6c94258

davidhassell merged commit 1508e0b into NCAS-CMS:master Jul 28, 2021

davidhassell mentioned this pull request Jun 6, 2022

Format of the location AggregationInstruction term #38

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Recast as a stand alone document #20

Recast as a stand alone document #20

davidhassell commented Jun 21, 2021

JonathanGregory left a comment

davidhassell commented Jul 6, 2021 •

edited

Loading

JonathanGregory commented Jul 6, 2021

davidhassell commented Jul 6, 2021

bnlawrence commented Jul 6, 2021 •

edited

Loading

davidhassell commented Jul 7, 2021 •

edited

Loading

davidhassell commented Jul 7, 2021

bnlawrence commented Jul 7, 2021

davidhassell commented Jul 7, 2021

nmassey001 commented Jul 7, 2021

davidhassell commented Jul 7, 2021

nmassey001 commented Jul 7, 2021

bnlawrence commented Jul 7, 2021

davidhassell commented Jul 7, 2021

davidhassell commented Jul 7, 2021

davidhassell commented Jul 8, 2021

davidhassell commented Jul 28, 2021

Recast as a stand alone document #20

Recast as a stand alone document #20

Conversation

davidhassell commented Jun 21, 2021

JonathanGregory left a comment

Choose a reason for hiding this comment

davidhassell commented Jul 6, 2021 • edited Loading

JonathanGregory commented Jul 6, 2021

davidhassell commented Jul 6, 2021

bnlawrence commented Jul 6, 2021 • edited Loading

davidhassell commented Jul 7, 2021 • edited Loading

davidhassell commented Jul 7, 2021

bnlawrence commented Jul 7, 2021

davidhassell commented Jul 7, 2021

nmassey001 commented Jul 7, 2021

davidhassell commented Jul 7, 2021

nmassey001 commented Jul 7, 2021

bnlawrence commented Jul 7, 2021

davidhassell commented Jul 7, 2021

davidhassell commented Jul 7, 2021

davidhassell commented Jul 8, 2021

davidhassell commented Jul 28, 2021

davidhassell commented Jul 6, 2021 •

edited

Loading

bnlawrence commented Jul 6, 2021 •

edited

Loading

davidhassell commented Jul 7, 2021 •

edited

Loading