Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add HamronizationNormalizer #81

Conversation

Vedanth-Ramji
Copy link
Member

  • Removed the is_hamronized property for all Normalizers.
  • All hamronized results now go through the HamronizationNormalizer class.
  • HamronizationNormalizer reads a hamronized file line by line, procures input genes, and loads all ARO mapping tables to support hamronized results that combine the outputs from multiple tools and databases.
  • Updated preprocessing of resfinder genes. Concatenating entries from 'gene_name' and 'reference_accession' in hamronized results to form input genes for HamronizationNormalizer. While this improves ARO mapping accuracy (previously only gene_symbol was used and several genes can have the same gene_symbol), this is intended to simplify preprocessing of resfinder inputs (if gene_symbol is used, two different preprocessing functions are required for resfinder and abricate for resfinder db).

Notes on changes to example outputs in outputs/ directory:

abricate.ncbi.tsv: Many new genes have been mapped to the ARO (e.g. dfrB3). However, there has been no change to the preprocessing of ncbi or amrfinderplus inputs. These genes are getting mapped as HamronizationNormalizer combines ARO mapping tables of all dbs, and these genes are being mapped to ARO mapping tables of other dbs. This will be corrected in a future PR asap where the gene_name of ncbi and amrfinderplus inputs will be used (this will pinpoint the exact unique gene in NCBI ARO mapping).

- Removed the `is_hamronized` property for all Normalizers.
- All hamronized results now go through the `HamronizationNormalizer` class.
- HamronizationNormalizer reads a hamronized file line by line, procures input genes, and loads all ARO mapping tables to support hamronized results that combine the outputs from multiple tools and databases.
- Updated preprocessing of resfinder genes. Concatenating entries from 'gene_name' and 'reference_accession' in hamronized results to form input genes for HamronizationNormalizer. While this improves ARO mapping accuracy (previously only `gene_symbol` was used and several genes can have the same `gene_symbol`), this is intended to simplify preprocessing of resfinder inputs (if `gene_symbol` is used, two different preprocessing functions are required for `resfinder` and `abricate` for resfinder db).
- Removed '--hamronized' flag for CLI
- Added `hamronization` as a tool option which invokes `HamronizationNormalizer`. `hamronization` doesn't required `--db` option, and is intended to be used in the following manner: "argnorm hamronization -i PATH_TO_INPUT -o PATH_TO_OUTPUT"
- hAMRonized result from https://raw.githubusercontent.com/ewissel/hAMRoaster/refs/heads/main/study_data/ham_sum.tsv
- Used to test HarmonizationNormalizer on a hamronized file with the combined outputs from different ARG annotation tools (along with their respective databases)
- Updated 'run_argnorm.py' to support the 'hamronization' tool option rather than the '--hamronization' flag
- Removed '@pytest.mark.parametrize('hamronized', [True, False])' in 'test_normalizers.py'. Added test_hamronization_normalizer() which normalizes and tests every file in examples/hamronized
- Updated example outputs in `outputs/` directory to match changes to input_gene and reference_gene processing in HamronizationNormalizer

- Notes on changes to example outputs:
1) `resfinder.resfinder.reads.tsv` and `resfinder.resfinder.orfs.tsv`: ant(6)-Ia, cat, tet(X), and tet(O/32/O) have been mapped to their exact entries in resfinder ARO mapping because their reference_accession is also being used
2) `abricate.resfinder.tsv`: aadA24_1 has been mapped to its exact entry in resfinder ARO mapping. blaCTX-M-63_1_EU660216 and blaCARB-4_1_FJ785525 don't exist in the resfinder database anymore. They were previously mapped as their gene_symbols (blaCTX-M-63 and blaCARB-4) could be mapped to other genes. Now, as their reference_accession is also used, they aren't mapped to AROs anymore.
3) `abricate.ncbi.tsv`: Many new genes have been mapped to the ARO (e.g. dfrB3). However, there has been no change to the preprocessing of ncbi or amrfinderplus inputs. These genes are getting mapped as HamronizationNormalizer combines ARO mapping tables of all dbs, and these genes are being mapped to ARO mapping tables of other dbs. This will be corrected in a future commit where the gene_name of ncbi and amrfinderplus inputs will be used (this will pinpoint the exact unique gene in NCBI ARO mapping).
@luispedro
Copy link
Member

I had a few comments on the general API, but fundamentally, the implementation here is wrong and the fact that there are changes in the example outputs is proof of that.

There should be no difference between the results when you ran hAMRonization and when you didn't. This should not be left to a future PR.

@Vedanth-Ramji Vedanth-Ramji deleted the add_hamronization_normalizer branch February 4, 2025 13:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants