Add HamronizationNormalizer #81

Vedanth-Ramji · 2025-01-30T01:41:34Z

Removed the is_hamronized property for all Normalizers.
All hamronized results now go through the HamronizationNormalizer class.
HamronizationNormalizer reads a hamronized file line by line, procures input genes, and loads all ARO mapping tables to support hamronized results that combine the outputs from multiple tools and databases.
Updated preprocessing of resfinder genes. Concatenating entries from 'gene_name' and 'reference_accession' in hamronized results to form input genes for HamronizationNormalizer. While this improves ARO mapping accuracy (previously only gene_symbol was used and several genes can have the same gene_symbol), this is intended to simplify preprocessing of resfinder inputs (if gene_symbol is used, two different preprocessing functions are required for resfinder and abricate for resfinder db).

Notes on changes to example outputs in outputs/ directory:

abricate.ncbi.tsv: Many new genes have been mapped to the ARO (e.g. dfrB3). However, there has been no change to the preprocessing of ncbi or amrfinderplus inputs. These genes are getting mapped as HamronizationNormalizer combines ARO mapping tables of all dbs, and these genes are being mapped to ARO mapping tables of other dbs. This will be corrected in a future PR asap where the gene_name of ncbi and amrfinderplus inputs will be used (this will pinpoint the exact unique gene in NCBI ARO mapping).

- Removed the `is_hamronized` property for all Normalizers. - All hamronized results now go through the `HamronizationNormalizer` class. - HamronizationNormalizer reads a hamronized file line by line, procures input genes, and loads all ARO mapping tables to support hamronized results that combine the outputs from multiple tools and databases. - Updated preprocessing of resfinder genes. Concatenating entries from 'gene_name' and 'reference_accession' in hamronized results to form input genes for HamronizationNormalizer. While this improves ARO mapping accuracy (previously only `gene_symbol` was used and several genes can have the same `gene_symbol`), this is intended to simplify preprocessing of resfinder inputs (if `gene_symbol` is used, two different preprocessing functions are required for `resfinder` and `abricate` for resfinder db).

- Removed '--hamronized' flag for CLI - Added `hamronization` as a tool option which invokes `HamronizationNormalizer`. `hamronization` doesn't required `--db` option, and is intended to be used in the following manner: "argnorm hamronization -i PATH_TO_INPUT -o PATH_TO_OUTPUT"

- hAMRonized result from https://raw.githubusercontent.com/ewissel/hAMRoaster/refs/heads/main/study_data/ham_sum.tsv - Used to test HarmonizationNormalizer on a hamronized file with the combined outputs from different ARG annotation tools (along with their respective databases)

- Updated 'run_argnorm.py' to support the 'hamronization' tool option rather than the '--hamronization' flag - Removed '@pytest.mark.parametrize('hamronized', [True, False])' in 'test_normalizers.py'. Added test_hamronization_normalizer() which normalizes and tests every file in examples/hamronized - Updated example outputs in `outputs/` directory to match changes to input_gene and reference_gene processing in HamronizationNormalizer - Notes on changes to example outputs: 1) `resfinder.resfinder.reads.tsv` and `resfinder.resfinder.orfs.tsv`: ant(6)-Ia, cat, tet(X), and tet(O/32/O) have been mapped to their exact entries in resfinder ARO mapping because their reference_accession is also being used 2) `abricate.resfinder.tsv`: aadA24_1 has been mapped to its exact entry in resfinder ARO mapping. blaCTX-M-63_1_EU660216 and blaCARB-4_1_FJ785525 don't exist in the resfinder database anymore. They were previously mapped as their gene_symbols (blaCTX-M-63 and blaCARB-4) could be mapped to other genes. Now, as their reference_accession is also used, they aren't mapped to AROs anymore. 3) `abricate.ncbi.tsv`: Many new genes have been mapped to the ARO (e.g. dfrB3). However, there has been no change to the preprocessing of ncbi or amrfinderplus inputs. These genes are getting mapped as HamronizationNormalizer combines ARO mapping tables of all dbs, and these genes are being mapped to ARO mapping tables of other dbs. This will be corrected in a future commit where the gene_name of ncbi and amrfinderplus inputs will be used (this will pinpoint the exact unique gene in NCBI ARO mapping).

luispedro · 2025-01-30T04:01:14Z

I had a few comments on the general API, but fundamentally, the implementation here is wrong and the fact that there are changes in the example outputs is proof of that.

There should be no difference between the results when you ran hAMRonization and when you didn't. This should not be left to a future PR.

Vedanth-Ramji added 5 commits January 30, 2025 06:23

DOC: add docs for HamronizationNormalizer

d95dc22

Vedanth-Ramji closed this Feb 1, 2025

Vedanth-Ramji deleted the add_hamronization_normalizer branch February 4, 2025 13:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add HamronizationNormalizer #81

Add HamronizationNormalizer #81

Vedanth-Ramji commented Jan 30, 2025

luispedro commented Jan 30, 2025

Add HamronizationNormalizer #81

Add HamronizationNormalizer #81

Conversation

Vedanth-Ramji commented Jan 30, 2025

luispedro commented Jan 30, 2025