diff --git a/CHANGELOG.md b/CHANGELOG.md
index df94fe3..dbbf9de 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,5 +1,15 @@
 # CHANGE LOG
 
+## 1.7.2
+
+* Introducing support for _bag idempotency_, or reproducible bags. A reproducible bag is a bag that has content-equivalence (in both payload _and_ metadata, including manifests)
+to another bag created at a different time with the same content, structure, bagging tool, and profile (if used). When this bag creation and bag archive mode is enabled,
+two separately created bags (or bag archive files) with content-equivalence will _hash equally_, whether the hash is calculated on the bytes of the resultant archive file or calculated on the equivalently ordered set of individual file hashes of the bag's contents. See the [API Guide](https://github.com/fair-research/bdbag/blob/master/doc/api.md) for additional information.
+* PR: [#59](https://github.com/fair-research/bdbag/pull/59) Only require the external package `importlib_metadata` for Python < 3.8. This module is already included as `importlib.metadata` in Python versions 3.8 and above.
+* Fix issue in the HTTP fetch handler where the bearer-token auth header that is stripped on redirects was not restored to the cached `requests` session after the redirect.
+* Remove dependency on the deprecated `distutils` module and its `distutils.util.strtobool` function.
+* The `is_bag` API function will no longer attempt to instantiate a `Bag` object on non-directories.
+
 ## 1.7.1
 
 Fix issue with `packaging.parse` throwing `InvalidVersion` in the `upgrade_config()` function when trying to parse the informational version string `VERSION` set by `bdbag` when it is running in a "frozen" (e.g., with `cx_Freeze`) environment.
diff --git a/bdbag/bdbag_cli.py b/bdbag/bdbag_cli.py index 1d18409..d557325 100644 --- a/bdbag/bdbag_cli.py +++ b/bdbag/bdbag_cli.py @@ -100,7 +100,7 @@ def parse_cli(): standard_args.add_argument( idempotent_arg, action="store_true", help="Create an idempotent (reproducible) bag directory and/or bag archive by removing timestamp attributes " - "from bag metadata (bag-info.txt) and setting fixed modification times (unix epoch) to non-payload files " + "from bag metadata (bag-info.txt) and setting fixed modification times (unix epoch) to files " "and directories contained within bag archive files.") checksum_arg = "--checksum" diff --git a/doc/api.md b/doc/api.md index 36f18eb..d6fe5f6 100644 --- a/doc/api.md +++ b/doc/api.md @@ -53,10 +53,13 @@ compliant, i.e., complies with the rules of **"Section 4: Serialization"** of th [BagIt Specification](https://datatracker.ietf.org/doc/draft-kunze-bagit/). ##### Parameters -| Param | Type | Description | -|--------------|----------|------------------------------------------------------------------------------| -| bag_path | `string` | A normalized, absolute path to a bag directory. | -| bag_archiver | `string` | One of the following case-insensitive string values: `zip`, `tar`, or `tgz`. 
| + +| Param | Type | Description | +|--------------|-----------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| bag_path | `string` | A normalized, absolute path to a bag directory. | +| bag_archiver | `string` | One of the following case-insensitive string values: `zip`, `tar`, or `tgz`. | +| config_file | `string` | A JSON file representation of configuration data that is used during bag creation and update. The format of this file is described [here](./config.md#bdbag.json). | +| idempotent | `boolean` | A boolean value indicating that idempotent (or reproducible) archiving is desired. Reproducible archive files are made by setting fixed modification times (unix epoch, `00:00:00 UTC, 1 January 1970` in the case of `tar` archives, or `00:00:00 UTC, 1 January 1980` in the case of `zip` archives) to all files and directory entries contained within bag archive files. When extracted with `bdbag`, these fixed modification times will be set to the current system time. NOTE: If an idempotently created bag archive is extracted with other software besides `bdbag`, it may be required to specify additional arguments to overwrite the fixed `mtime` in the archive file to the current system time, e.g., using `-m` with `tar`. 
| **Returns**: `string` - The normalized, absolute path of the directory of the created archive file. @@ -185,24 +188,26 @@ make_bag(bag_path, remote_file_manifest=None, config_file=None, ro_metadata=None, - ro_metadata_file=None) + ro_metadata_file=None, + idempotent=None) ``` Creates or updates the bag denoted by the `bag_path` argument. ##### Parameters -| Param | Type | Description | -|----------------------|-----------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| bag_path | `string` | A normalized, absolute path to a bag directory. | -| algs | `list` | A list of checksum algorithms to use for calculating file fixities. When creating a bag, only the checksums present in this variable will be used. When updating a bag, this function will take the union of any existing bag algorithms and what is specified by this parameter, ***except*** when the `prune_manifests` parameter is specified, in which case then only the algorithms specifed by this parameter will be used. | -| update | `boolean` | If `bag_path` represents an existing bag, update it. If this parameter is not specified when invoking this function on an existing bag, the function is essentially a NOOP and will emit a logging message to that effect. | -| save_manifests | `boolean` | Defaults to `True`. If true, saves all manifests, recalculating all checksums and regenerating `fetch.txt`. If false, only tagfile manifest checksums are recalculated. Use this flag as an optimization (to avoid recalculating payload file checksums) when only the bag metadata has been changed. 
This parameter is only meaningful during update operations, otherwise it is ignored. | -| prune_manifests | `boolean` | Removes any file and tagfile manifests for checksums that are not listed in the `algs` variable. This parameter is only meaningful during update operations, otherwise it is ignored. | -| metadata | `dict` | A dictionary of key-value pairs that will be written directly to the bag's 'bag-info.txt' file. | -| metadata_file | `string` | A JSON file representation of metadata that will be written directly to the bag's 'bag-info.txt' file. The format of this metadata is described [here](./config.md#metadata). | -| remote_file_manifest | `string` | A path to a JSON file representation of remote file entries that will be used to add remote files to the bag file manifest(s) and used to create the bag's `fetch.txt`. The format of this file is described [here](./config.md/#remote-file-manifest). | -| config_file | `string` | A JSON file representation of configuration data that is used during bag creation and update. The format of this file is described [here](./config.md#bdbag.json). | -| ro_metadata | `dict` | A dictionary that will be used to serialize data into one or more JSON files into the bag's `metadata` directory. The format of this metadata is described [here](./config.md#ro_metadata). | -| ro_metadata_file | `string` | A path to a JSON file representation of RO metadata that will be used to serialize data into one or more JSON files into the bag's `metadata` directory. The format of this metadata is described [here](./config.md#ro_metadata). 
| +| Param | Type | Description | +|----------------------|-----------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| bag_path | `string` | A normalized, absolute path to a bag directory. | +| algs | `list` | A list of checksum algorithms to use for calculating file fixities. When creating a bag, only the checksums present in this variable will be used. When updating a bag, this function will take the union of any existing bag algorithms and what is specified by this parameter, ***except*** when the `prune_manifests` parameter is specified, in which case then only the algorithms specifed by this parameter will be used. | +| update | `boolean` | If `bag_path` represents an existing bag, update it. If this parameter is not specified when invoking this function on an existing bag, the function is essentially a NOOP and will emit a logging message to that effect. | +| save_manifests | `boolean` | Defaults to `True`. If true, saves all manifests, recalculating all checksums and regenerating `fetch.txt`. If false, only tagfile manifest checksums are recalculated. Use this flag as an optimization (to avoid recalculating payload file checksums) when only the bag metadata has been changed. This parameter is only meaningful during update operations, otherwise it is ignored. | +| prune_manifests | `boolean` | Removes any file and tagfile manifests for checksums that are not listed in the `algs` variable. This parameter is only meaningful during update operations, otherwise it is ignored. 
| +| metadata | `dict` | A dictionary of key-value pairs that will be written directly to the bag's 'bag-info.txt' file. | +| metadata_file | `string` | A JSON file representation of metadata that will be written directly to the bag's 'bag-info.txt' file. The format of this metadata is described [here](./config.md#metadata). | +| remote_file_manifest | `string` | A path to a JSON file representation of remote file entries that will be used to add remote files to the bag file manifest(s) and used to create the bag's `fetch.txt`. The format of this file is described [here](./config.md/#remote-file-manifest). | +| config_file | `string` | A JSON file representation of configuration data that is used during bag creation and update. The format of this file is described [here](./config.md#bdbag.json). | +| ro_metadata | `dict` | A dictionary that will be used to serialize data into one or more JSON files into the bag's `metadata` directory. The format of this metadata is described [here](./config.md#ro_metadata). | +| ro_metadata_file | `string` | A path to a JSON file representation of RO metadata that will be used to serialize data into one or more JSON files into the bag's `metadata` directory. The format of this metadata is described [here](./config.md#ro_metadata). | +| idempotent | `boolean` | If `True`, date and time specific metadata such as `Bagging-Date` and `Bagging-Time` will be _removed_ (if present) from `bag-info.txt`. This value defaults to `False` if not passed via argument. However, a global override default value of `True` can be enabled in the [config file](./config.md). NOTE: use of `ro_metadata` and `ro_metadata_file` in conjunction with `idempotent` is not recommended at this time due to the generated RO Metadata not being compatible with bag idempotency. | **Returns**: `bag` - An instantiated [bagit-python](https://github.com/LibraryOfCongress/bagit-python/blob/master/bagit.py) `bag` compatible class object. 
diff --git a/doc/cli.md b/doc/cli.md
index b122e01..50e6474 100644
--- a/doc/cli.md
+++ b/doc/cli.md
@@ -19,7 +19,8 @@ usage: bdbag
 [--version]
 [--update]
 [--revert]
-[--archiver {zip,tar,tgz}]
+[--archiver {zip,tar,tgz,bz2,xz}]
+[--idempotent]
 [--checksum {md5,sha1,sha256,sha512,all}]
 [--skip-manifests]
 [--prune-manifests]
@@ -27,7 +28,7 @@ usage: bdbag
 [--resolve-fetch {all,missing}]
 [--fetch-filter ]
 [--validate {fast,full,structure,completeness}]
-[--validate-profile {profile-only,full}]
+[--validate-profile [{bag-only,full}]]
 [--profile-path ]
 [--config-file ]
 [--keychain-file ]
@@ -35,10 +36,10 @@ usage: bdbag
 [--ro-metadata-file ]
 [--ro-manifest-generate {overwrite, update}]
 [--remote-file-manifest ]
+[--output-path]
 [--quiet]
 [--debug]
 [--help]
-[--output-path]
 ```
@@ -88,8 +89,13 @@ Update an existing bag dir, recalculating tag-manifest checksums and regeneratin
 Revert an existing bag directory back to a normal directory, deleting all bag metadata files. Payload files in the `data` directory will be moved back to the directory root, and the `data` directory will be deleted.
 
 ----
-#### `--archiver {zip,tar,tgz}`
-Archive a bag using the specified format.
+#### `--archiver {zip,tar,tgz,bz2,xz}`
+Archive a bag using the specified format. Note that `xz` (LZMA) compression is not available on Python versions lower than `3.3`.
+
+----
+#### `--idempotent`
+Create an idempotent (reproducible) bag directory and/or bag archive by removing timestamp attributes from bag metadata (`bag-info.txt`) and setting fixed modification times (unix epoch) to files and directories contained within bag archive files.
+More information on bag idempotency can be found in the [make_bag](api.md#make_bag) and the [archive_bag](api.md#archive_bag) API functions.
---- #### `--checksum {md5,sha1,sha256,sha512,all}` diff --git a/doc/config.md b/doc/config.md index 94e6cc6..707ac3d 100644 --- a/doc/config.md +++ b/doc/config.md @@ -32,13 +32,14 @@ This is the parent object for the entire configuration. ##### Object: `bag_config` This object contains all bag-related configuration parameters. -| Parameter | Description | -|----------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| Parameter | Description | +|----------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | `bag_algorithms` | This is an array of strings representing the default checksum algorithms to use for bag manifests, if not otherwise specified. Valid values are "md5", "sha1", "sha256", and "sha512". | -| `bag_archiver` | This is a string representing the default archiving format to use if not otherwise specified. Valid values are "zip", "tar", and "tgz". | -| `bag_metadata` | This is a list of simple JSON key-value pairs that will be written as-is to bag-info.txt. | -| `bag_processes` | This is a numeric value representing the default number of concurrent processes to use when calculating checksums. | -| `bagit_spec_version` | The version of the `bagit` specification that created bags will conform to. Valid values are "0.97" or "1.0". | +| `bag_archiver` | This is a string representing the default archiving format to use if not otherwise specified. Valid values are "zip", "tar", and "tgz". | +| `bag_metadata` | This is a list of simple JSON key-value pairs that will be written as-is to bag-info.txt. | +| `bag_processes` | This is a numeric value representing the default number of concurrent processes to use when calculating checksums. 
| +| `bagit_spec_version` | The version of the `bagit` specification that created bags will conform to. Valid values are "0.97" or "1.0". | +| `bag_archive_idempotent` | A boolean value indicating that `idempotent` mode should be used by default when creating and archiving new bags. | ##### Object: `fetch_config` The `fetch_config` object contains a set of child objects each keyed by the scheme of the transport protocol that contains the transport handler configuration parameters.
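
The reproducible archiving documented in this change (fixed modification times on every archive entry, so that two archives of the same content hash equally) can be illustrated with a minimal sketch that uses only the Python standard library `tarfile` module. This is not `bdbag`'s implementation: the function name `reproducible_tar_bytes` and the sample member names are hypothetical, and the sketch additionally assumes that a sorted member order and zeroed owner fields are wanted.

```python
import hashlib
import io
import tarfile


def reproducible_tar_bytes(files):
    """Build an uncompressed tar archive in memory with deterministic metadata.

    `files` maps archive member names to their byte content.
    (Hypothetical helper for illustration; not part of bdbag.)
    """
    buffer = io.BytesIO()
    # USTAR keeps headers simple and avoids PAX extended header records.
    with tarfile.open(fileobj=buffer, mode="w", format=tarfile.USTAR_FORMAT) as archive:
        for name in sorted(files):  # fixed member order, independent of input order
            data = files[name]
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            info.mtime = 0  # unix epoch: 00:00:00 UTC, 1 January 1970
            info.uid = info.gid = 0
            info.uname = info.gname = ""
            info.mode = 0o644
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


# Two archives built from the same content, supplied in a different order.
first = reproducible_tar_bytes({"bag/bagit.txt": b"x", "bag/data/a.txt": b"hello"})
second = reproducible_tar_bytes({"bag/data/a.txt": b"hello", "bag/bagit.txt": b"x"})
same_hash = hashlib.sha256(first).hexdigest() == hashlib.sha256(second).hexdigest()
```

Because every header field that normally varies between runs is pinned, the two byte strings are identical and `same_hash` is true. Compressed formats may need extra care, since the gzip header carries its own timestamp field.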