Skip to content

Command-line utility and async Python library for checksumming FASTA files and individual contigs.

License

Notifications You must be signed in to change notification settings

bento-platform/fasta-checksum-utils

Repository files navigation

fasta-checksum-utils

Asynchronous library and command-line utility for checksumming FASTA files and individual contigs. Implements two checksumming algorithms: MD5 and GA4GH, in order to fulfill the needs of the Refget v2 API specification.

Installation

To install fasta-checksum-utils, run the following pip command:

pip install fasta-checksum-utils

CLI Usage

To generate a text report of checksums in the FASTA document, run the following command:

fasta-checksum-utils ./my-fasta.fa[.gz]

This will print output in the following tab-delimited format:

file  [file size in bytes]    md5 [file MD5 hash]           ga4gh  [file GA4GH hash]
chr1  [chr1 sequence length]  md5 [chr1 sequence MD5 hash]  ga4gh  [chr1 sequence GA4GH hash]
chr2  [chr2 sequence length]  md5 [chr2 sequence MD5 hash]  ga4gh  [chr2 sequence GA4GH hash]
...

The following example is the output generated by specifying the SARS-CoV-2 genome FASTA from NCBI:

file	    30428	md5	825ab3c54b7a67ff2db55262eb532438	ga4gh	SQ.mMg8qNej7pU84juQQWobw9JyUy09oYdd
NC_045512.2	29903	md5	105c82802b67521950854a851fc6eefd	ga4gh	SQ.SyGVJg_YRedxvsjpqNdUgyyqx7lUfu_D

If the --out-format bento-json arguments are passed, the tool will instead output the report in a JSON format, designed to be compatible with the requirements of the Bento Reference Service. The following example is the output generated by specifying the SARS-CoV-2 genome:

{
  "fasta": "sars_cov_2.fa",
  "fasta_size": 30428,
  "md5": "825ab3c54b7a67ff2db55262eb532438",
  "ga4gh": "SQ.mMg8qNej7pU84juQQWobw9JyUy09oYdd",
  "contigs": [
    {
      "name": "NC_045512.2",
      "md5": "105c82802b67521950854a851fc6eefd",
      "ga4gh": "SQ.SyGVJg_YRedxvsjpqNdUgyyqx7lUfu_D",
      "length": 29903
    }
  ]
}

If an argument like --fai [path or URL] is passed, an additional "fai": "..." property will be added to the JSON object output.

If an argument like --genome-id GRCh38 is provided, an additional "id": "GRCh38" property will be added to the JSON object output.

Library Usage

Below are some examples of how fasta-checksum-utils can be used as an asynchronous Python library:

import asyncio
import fasta_checksum_utils as fc
import pysam
from pathlib import Path


async def demo():
    covid_genome: Path = Path("./sars_cov_2.fa")
    
    # calculate an MD5 checksum for a whole file
    file_checksum: str = await fc.algorithms.AlgorithmMD5.checksum_file(covid_genome)
    print(file_checksum)
    # prints "863ee5dba1da0ca3f87783782284d489"
    
    all_algorithms = (fc.algorithms.AlgorithmMD5, fc.algorithms.AlgorithmGA4GH)
    
    # calculate multiple checksums for a whole file
    all_checksums: tuple[str, ...] = await fc.checksum_file(file=covid_genome, algorithms=all_algorithms)
    print(all_checksums)
    # prints tuple: ("863ee5dba1da0ca3f87783782284d489", "SQ.mMg8qNej7pU84juQQWobw9JyUy09oYdd")
    
    # calculate an MD5 and GA4GH checksum for a specific contig in a PySAM FASTA file:
    fh = pysam.FastaFile(str(covid_genome))
    try:
        contig_checksums: tuple[str, ...] = await fc.checksum_contig(
            fh=fh, 
            contig_name="NC_045512.2", 
            algorithms=all_algorithms,
        )
        print(contig_checksums)
        # prints tuple: ("105c82802b67521950854a851fc6eefd", "SQ.SyGVJg_YRedxvsjpqNdUgyyqx7lUfu_D")
    finally:
        fh.close()  # always close the file handle


asyncio.run(demo())

About

Command-line utility and async Python library for checksumming FASTA files and individual contigs.

Resources

License

Stars

Watchers

Forks

Packages

No packages published