Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend the netcdf API to support programmatic changes to the plugin search path #3024

Closed

Conversation

DennisHeimbigner
Copy link
Collaborator

re: #2753

As suggested by Ed Hartnett, This PR extends the netcdf.h API to support programmatic control over the search path used to locate plugins.

I created several different APIs, but finally settled on the following API as being the simplest possible. It has the disadvantage that it requires use of a global lock (not implemented) if used in a threaded environment.

Specifically, note that modifying the plugin paths must be done "atomically". That is, in a multi-threaded environment, it is important that the sequence of actions involved in setting up the plugin paths must be done by a single processor or in some other way as to guarantee that two or more processors are not simultaneously accessing the plugin path read/write operations.

As an example, assume there exists a mutex lock called PLUGINLOCK., then any processor accessing the plugin paths should operate as follows:

lock(PLUGINLOCK);
nc_plugin_path_read(...);
<rebuild plugin path>
nc_plugin_path_write(...);
unlock(PLUGINLOCK);

The API proposed in this PR looks like this (from netcdf-c/include/netcdf_filter.h).

  • int nc_plugin_path_read(int formatx, size_t* ndirsp, char** dirs);

    This function returns the current sequence of directories in the internal plugin path list. Since this function does not modify the plugin path, it can be called at any time.

    The arguments are as follows:

    • formatx specify which dispatch implementation to read: currently NC_FORMATX_NC_HDF5 or NC_FORMATX_NCZARR.
    • ndirsp return the number of dirs in the internal path list
    • dirs memory for storing the sequence of directies in the internal path list.

    In practice, this function needs to be called twice. The first time with npaths not NULL and pathlist set to NULL to get the size of the path list. The second time with pathlist not NULL to get the actual sequence of paths.

  • int nc_plugin_path_write(int formatx, size_t ndirs, char** const dirs);

    This function empties the current internal path sequence and replaces it with the sequence of directories argument. Using a paths argument of NULL or npaths argument of 0 will clear the set of plugin paths.

    The arguments are as follows:

    • formatx specify which dispatch implementation to write: currently NC_FORMATX_NC_HDF5 or NC_FORMATX_NCZARR or 0 (zero).
    • ndirs length of the dirs argument
    • dirs a vector of directory path string used to overwrite the current internal path list

    If the value zero is used for the formatx argument, then the value being written is applied to all implemention: currently NC_FORMATX_NC_HDF5 and NC_FORMATX_NCZARR.

In addition, two other API functions are defined.

int nc_plugin_path_initialize(void);
int nc_plugin_path_finalize(void);

As a rule, the initialize and finalize functions do not need to be explicitly called by the user because they are called as part of nc_initialize()/nc_finalize().

In addition to the above changes, add a plugin path testcase:
unit_tests/run_pluginpaths.sh+tst_pluginpaths.c.

Misc. Changes

  1. Added a version number for the formatx dispatcher.
  2. Setup a per-dispatcher global state mechanism.
  3. Add some path manipulation utilities to netcf_aux.h
  4. Fix the construction of netcdf_json.h as a BUILT_SOURCE.
  5. Fix some minor bugs in netcdf_json.h
  6. Fix the construction of netcdf_proplist.h as a BUILT_SOURCE.

DennisHeimbigner and others added 4 commits September 13, 2024 18:42
…earch path

re: Unidata#2753

As suggested by Ed Hartnett, This PR extends the netcdf.h API to support programmatic control over the search path used to locate plugins.

I created several different APIs, but finally settled on the following
API as being the simplest possible. It has the disadvantage that
it requires use of a global lock (not implemented) if used
in a threaded environment.

Specifically, note that modifying the plugin paths must be done "atomically".
That is, in a multi-threaded environment, it is important that the sequence of actions involved in setting up the plugin paths must be done by a single processor or in some other way as to guarantee that two or more processors are not simultaneously accessing the plugin path read/write operations.

As an example, assume there exists a mutex lock called PLUGINLOCK.
Then any processor accessing the plugin paths should operate as follows:
````
lock(PLUGINLOCK);
nc_plugin_path_read(...);
<rebuild plugin path>
nc_plugin_path_write(...);
unlock(PLUGINLOCK);
````

The API proposed in this PR looks like this (from netcdf-c/include/netcdf_filter.h).

* ````int nc_plugin_path_read(int formatx, size_t* ndirsp, char** dirs);````

    This function returns the current sequence of directories in the internal plugin path list. Since this function does not modify the plugin path, it can be called at any time.

    The arguments are as follows:
    - _formatx_ specify which dispatch implementation to read: currently NC_FORMATX_NC_HDF5 or NC_FORMATX_NCZARR.
    - _ndirsp_ return the number of dirs in the internal path list
    - _dirs_ memory for storing the sequence of directies in the internal path list.

    In practice, this function needs to be called twice. The first time with npaths not NULL and pathlist set to NULL to get the size of the path list. The second time with pathlist not NULL to get the actual sequence of paths.

* ````int nc_plugin_path_write(int formatx, size_t ndirs, char** const dirs);````

    This function empties the current internal path sequence and replaces it with the sequence of directories argument. Using a paths argument of NULL or npaths argument of 0 will clear the set of plugin paths.

    The arguments are as follows:
    - _formatx_ specify which dispatch implementation to write: currently NC_FORMATX_NC_HDF5 or NC_FORMATX_NCZARR or 0 (zero).
    - _ndirs_ length of the dirs argument
    - _dirs_ a vector of directory path string used to overwrite the current internal path list

    If the value zero is used for the formatx argument, then the value being written is applied to all implemention: currently NC_FORMATX_NC_HDF5 and NC_FORMATX_NCZARR.

In addition, two other API functions are defined.
````
int nc_plugin_path_initialize(void);
int nc_plugin_path_finalize(void);
````
As a rule, the initialize and finalize functions do not need to be explicitly called by the user because they are called as part of *nc_initialize()/nc_finalize()*.

In addition to the above changes, add a plugin path testcase:
unit_tests/run_pluginpaths.sh+tst_pluginpaths.c.

## Misc. Changes
1. Added a version number for the formatx dispatcher.
2. Setup a per-dispatcher global state mechanism.
3. Add some path manipulation utilities to netcf_aux.h
4. Fix the construction of netcdf_json.h as a BUILT_SOURCE.
5. Fix some minor bugs in netcdf_json.h
6. Fix the construction of netcdf_proplist.h as a BUILT_SOURCE.
@CLAassistant
Copy link

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ DennisHeimbigner
❌ Dennis Heimbigner


Dennis Heimbigner seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@edwardhartnett
Copy link
Contributor

Excellent. Some feedback:

  • I think the need for a lock in multi-threading environments is perfectly reasonable and will not be a problem. (I don't see how anything else would be possible.)
  • I suggest you use "get" instead of "read" in your function name, and "set" instead of "write". This will match the cache functions in the API, which use get/set instead of inq/put to denote that the metadata is not part of the data file, but is associated with the current run instance, as the plugin path is.
  • I don't think the initialize/finalize should be part of the public API. I can't see why the user would ever need to call them.

I really like your simplification of the API to just two functions. Elegant.

This is a good addition to help netCDF users cope with some of the complexities of plugins. It will help not just with zstandard users, but also even more advanced users who are currently using plugins. (I know the European Space Agency is doing this to get JPEG compression of some satellite data.) All these advanced users will also find it much more useful to set the plugin path in the code, instead of relying on an environment variable being set correctly.

@DennisHeimbigner
Copy link
Collaborator Author

All reasonable changes.

@DennisHeimbigner
Copy link
Collaborator Author

In thinking about this API, it occurs to me that users should not be forced to figure
out what dispatcher is being used (HDF5 vs NCZarr). The question is, perhaps, are
there use cases where the plugin path for HDF5
should ever differ from that of NCZarr.Rather there should be a single global
plugin path list seen by the user. The guarantee should be that whenever a set is initiated,
it modifies the global plugin path and changes are pushed to the underlying dispatchers in a
transparent way.

@edwardhartnett
Copy link
Contributor

I think the capability for different plugin paths is marginal. That is, I don't know if anyone would really need that. (Why would they? So a different filter of the same number could be used? That doesn't sound like a good idea anyway.)

So I think it's fine if you want to have one plugin path for every dispatch to use.

I also think it's fine to have the user specify the dispatch layer. Since this is a capability only for very advanced users, I assume they could handle it.

So either approach seems acceptable to me.

@DennisHeimbigner
Copy link
Collaborator Author

Closed in favor of PR #3033

DennisHeimbigner added a commit to DennisHeimbigner/netcdf-c that referenced this pull request Sep 30, 2024
re: Unidata#2753

As suggested by Ed Hartnett, This PR extends the netcdf.h API to support programmatic control over the search path used to locate plugins.

I created several different APIs, but finally settled on the following API as being the simplest possible. It does have the disadvantage that it requires use of a global lock (not implemented) if used in a threaded environment.

Specifically, note that modifying the plugin path must be done "atomically". That is, in a multi-threaded environment, it is important that the sequence of actions involved in setting up the plugin path must be done by a single processor or in some other way as to guarantee that two or more processors are not simultaneously accessing the plugin path get/set operations.

As an example, assume there exists a mutex lock called PLUGINLOCK. Then any processor accessing the plugin paths should operate as follows:
````
lock(PLUGINLOCK);
nc_plugin_path_get(...);
<rebuild plugin path>
nc_plugin_path_set(...);
unlock(PLUGINLOCK);
````
## Internal Architecture

It is assumed here that there only needs to be a single set of plugin path directories that is shared by all filter code and is independent of any file descriptor; it is global in other words. This means, for example, that the path list for NCZarr and for HDF5 will always be the same.

However internally, processing the set of plugin paths depends on the particular NC_FORMATX value (NC_FORMATX_NC_HDF5 and NC_FORMATX_NCZARR, currently). So the *nc_plugin_path_set* function, will take the paths it is given and propagate them to each of the NC_FORMATX dispatchers to store in a way that is appropriate to the given dispatcher.

There is a complication with respect to the *nc_plugin_path_get* function. It is possible for users to bypass the netcdf API and modify the HDF5 plugin paths directly. This can result in an inconsistent plugin path between the value used by HDF5 and the global value used by netcdf-c. Since there is no obvious fix for this, we warn the user of this possibility and otherwise ignore it.

## Test Changes
* Two new tests
    a. unit_test/run_pluginpaths.sh -- was created to test this new capability.
    b. A new test utility has been added as *unit_test/run_dfaltpluginpath.sh* to test the default plugin path list.

## Documentation
* A new file -- docs/pluginpath.md -- provides documentation of the new API.

## Misc. Changes
1. Add some path manipulation utilities to netcf_aux.h
2. Fix the construction of netcdf_json.h as a BUILT_SOURCE
3. Fix some minor bugs in netcdf_json.h
4. Convert netcdf_json.h and netcdf_proplist.h to BUILT_SOURCE.
5. Add NETCDF_ENABLE_HDF5 as synonym for USE_HDF5
6. Fix some size_t <-> int conversion warnings.
7. Encountered and fixed the Windows \r\n problem in tst_pluginpaths.c.
8. Cleanup some minor CMakeLists.txt problems.

## Addendum: Proposed API

The API makes use of a counted vector of strings representing the sequence of directories in the path. The relevant type definition is as follows.
````
typedef struct NCPluginList {size_t ndirs; char** dirs;} NCPluginList;
````

The API proposed in this PR looks like this (from netcdf-c/include/netcdf_filter.h).

* ````int nc_plugin_path_ndirs(size_t* ndirsp);````
    Arguments: *ndirsp* -- store the number of directories in this memory.

    This function returns the number of directories in the sequence if internal directories of the internal plugin path list.

* ````int nc_plugin_path_get(NCPluginList* dirs);````
    Arguments:  *dirs* -- counted vector for storing the sequence of directies in the internal path list.

    This function returns the current sequence of directories from the internal plugin path list. Since this function does not modify the plugin path, it does not need to be locked; it is only when used to get the path to be modified that locking is required.  If the value of *dirs.dirs* is NULL (the normal case), then memory is allocated to hold the vector of directories. Otherwise, use the memory of *dirs.dirs* to hold the vector of directories.

* ````int nc_plugin_path_set(const NCPluginList* dirs);````
    Arguments: *dirs* -- counted vector for providing the new sequence of directories in the internal path list.

    This function empties the current internal path sequence and replaces it with the sequence of directories argument. Using an *ndirs* argument of 0 will clear the set of plugin paths.
DennisHeimbigner added a commit to DennisHeimbigner/netcdf-c that referenced this pull request Oct 19, 2024
…earch path

Replaces PR Unidata#3024
         and PR Unidata#3033

re: Unidata#2753

As suggested by Ed Hartnett, This PR extends the netcdf.h API to support programmatic control over the search path used to locate plugins.

I created several different APIs, but finally settled on the following API as being the simplest possible. It does have the disadvantage that it requires use of a global lock (not implemented) if used in a threaded environment.

Specifically, note that modifying the plugin path must be done "atomically". That is, in a multi-threaded environment, it is important that the sequence of actions involved in setting up the plugin path must be done by a single processor or in some other way as to guarantee that two or more processors are not simultaneously accessing the plugin path get/set operations.

As an example, assume there exists a mutex lock called PLUGINLOCK. Then any processor accessing the plugin paths should operate as follows:
````
lock(PLUGINLOCK);
nc_plugin_path_get(...);
<rebuild plugin path>
nc_plugin_path_set(...);
unlock(PLUGINLOCK);
````
## Internal Architecture

It is assumed here that there only needs to be a single set of plugin path directories that is shared by all filter code and is independent of any file descriptor; it is global in other words. This means, for example, that the path list for NCZarr and for HDF5 will always be the same.

However internally, processing the set of plugin paths depends on the particular NC_FORMATX value (NC_FORMATX_NC_HDF5 and NC_FORMATX_NCZARR, currently). So the *nc_plugin_path_set* function, will take the paths it is given and propagate them to each of the NC_FORMATX dispatchers to store in a way that is appropriate to the given dispatcher.

There is a complication with respect to the *nc_plugin_path_get* function. It is possible for users to bypass the netcdf API and modify the HDF5 plugin paths directly. This can result in an inconsistent plugin path between the value used by HDF5 and the global value used by netcdf-c. Since there is no obvious fix for this, we warn the user of this possibility and otherwise ignore it.

## Test Changes
* New tests<br>
    a. unit_test/run_pluginpaths.sh -- was created to test this new capability.<br>
    b. A new test utility has been added as *unit_test/run_dfaltpluginpath.sh* to test the default plugin path list.
* New test support utilities<br>
    a. unit_test/ncpluginpath.c -- report current state of the plugin path<br>
    b. unit_test/tst_pluginpaths.c -- test program to support run_pluginpaths.sh

## Documentation
* A new file -- docs/pluginpath.md -- provides documentation of the new API. It includes some
  material taken fro filters.md.

## Other Major Changes
1. Cleanup the whole plugin path decision tree. This is described in the *docs/pluginpath.md* document and summarized in Addendum 2 below.
2. I noticed that the ncdump/testpathcvt.sh had been disabled, so fixed and re-enabled it. This necessitated some significant changes to dpathmgr.c.

## Misc. Changes
1. Add some path manipulation utilities to netcf_aux.h
2. Fix some minor bugs in netcdf_json.h
3. Convert netcdf_json.h and netcdf_proplist.h to BUILT_SOURCE.
4. Add NETCDF_ENABLE_HDF5 as synonym for USE_HDF5
5. Fix some size_t <-> int conversion warnings.
6. Encountered and fixed the Windows \r\n problem in tst_pluginpaths.c.
7. Cleanup some minor CMakeLists.txt problems.
8. Provide an implementation of echo -n since it appears to not be
   available on all platforms.
9. Add a property list mechanism to pass environmental information to filters.
10. Cleanup Doxyfile.in
11. Fixed a memory leak in libdap2; surprised that I did not find this earlier.

## Addendum 1: Proposed API

The API makes use of a counted vector of strings representing the sequence of directories in the path. The relevant type definition is as follows.
````
typedef struct NCPluginList {size_t ndirs; char** dirs;} NCPluginList;
````

The API proposed in this PR looks like this (from netcdf-c/include/netcdf_filter.h).

* ````int nc_plugin_path_ndirs(size_t* ndirsp);````
    Arguments: *ndirsp* -- store the number of directories in this memory.

    This function returns the number of directories in the sequence if internal directories of the internal plugin path list.

* ````int nc_plugin_path_get(NCPluginList* dirs);````
    Arguments:  *dirs* -- counted vector for storing the sequence of directies in the internal path list.

    This function returns the current sequence of directories from the internal plugin path list. Since this function does not modify the plugin path, it does not need to be locked; it is only when used to get the path to be modified that locking is required.  If the value of *dirs.dirs* is NULL (the normal case), then memory is allocated to hold the vector of directories. Otherwise, use the memory of *dirs.dirs* to hold the vector of directories.

* ````int nc_plugin_path_set(const NCPluginList* dirs);````
    Arguments: *dirs* -- counted vector for providing the new sequence of directories in the internal path list.

    This function empties the current internal path sequence and replaces it with the sequence of directories argument. Using an *ndirs* argument of 0 will clear the set of plugin paths.

## Addendum 2: Build-Time and Run-Time Constants.

### Build-Time Constants
<table style="border:2px solid black;border-collapse:collapse">
<tr style="outline: thin solid;" align="center"><td colspan="4">Table showing the build-time computation of NETCDF_PLUGIN_INSTALL_DIR and NETCDF_PLUGIN_SEARCH_PATH.</td>
<tr style="outline: thin solid" ><th>--with-plugin-dir<th>--prefix<th>NETCDF_PLUGIN_INSTALL_DIR<th>NETCDF_PLUGIN_SEARCH_PATH
<tr style="outline: thin solid" ><td>undefined<td>undefined<td>undefined<td>PLATFORMDEFALT
<tr style="outline: thin solid" ><td>undefined<td>&lt;abspath-prefix&gt;<td>&lt;abspath-prefix&gt;/hdf5/lib/plugin<td>&lt;abspath-prefix&gt;/hdf5/lib/plugin&lt;SEP&gt;PLATFORMDEFALT
<tr style="outline: thin solid" ><td>&lt;abspath-plugins&gt;<td>N.A.<td>&lt;abspath-plugins&gt;<td>&lt;abspath-plugins&gt;&lt;SEP&gt;PLATFORMDEFALT
</table>

<table style="border:2px solid black;border-collapse:collapse">
<tr style="outline: thin solid" align="center"><td colspan="2">Table showing the computation of the initial global plugin path</td>
<tr style="outline: thin solid"><th>HDF5_PLUGIN_PATH<th>Initial global plugin path
<tr style="outline: thin solid"><td>undefined<td>NETCDF_PLUGIN_SEARCH_PATH
<tr style="outline: thin solid"><td>&lt;path1;...pathn&gt;<td>&lt;path1;...pathn&gt;
</table>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants