v0.2.0

xinehc · Jun 27, 2024 · 2ab59e4 · 2ab59e4
1 parent 215920a
commit 2ab59e4
Show file tree

Hide file tree

Showing 5 changed files with 32 additions and 15 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,19 +1,29 @@
 # Changelog
+## [0.2.0] - 2024-06-28
+### Added
+- [***breaking***] Add decoy protein sequences (RefSeq fungi, protozoa, viral, plant, and human GRCh38/hg38) which effectively trap non-prokaryotic reads and prevent them from inflating total prokaryotic genome copy estimates if the pre-filtering module (default with `Kraken2`) is not enabled. Pre-filtering is no longer necessary even if samples are contaminated with human DNA or other common eukaryotes/viruses, unless the mean genome size of prokaryotes needs to be estimated. See [8918168](https://github.com/xinehc/melon-supplementary/commit/891816897bb3c82dcfff7ff44b45907593ba0eac) for more details. This function requires a database released on or after 2024-06-28.
+### Changed
+- Simplify filtering criteria for alignments.
+
+
 ## [0.1.6] - 2024-05-30
 ### Changed
 - Prevent `extract_sequence` from loading all marker-containing reads into memory.
 - Change `-F` to `--frameshift` and `max_iteration` to `max_iterations` for consistency.
 - Switch from figshare to zenodo for better database versioning.
 
+
 ## [0.1.5] - 2024-04-26
 ### Fixed
 - Fix a bug causing `tqdm` being disabled ([3bbd087](https://github.com/xinehc/melon/commit/3bbd087b8867e3167973a746af14f1fd797f9746)).
 
+
 ## [0.1.4] - 2024-04-26
 ### Changed
 - Use `tqdm` for logging.
 - Reduce peak memory usage by parsing PAF files on the fly.
 
+
 ## [0.1.3] - 2024-03-29
 ### Changed
 - Change alignment filtering criteria: make `AS` cutoff more stringent, drop `MS`. See [7cc6dbd](https://github.com/xinehc/melon/commit/7cc6dbd866027cf5c1adaa5c69ed7919d8630607) for details.
@@ -25,10 +35,12 @@
 - Output both gap-compressed and gap-uncompressed (BLAST-like) identity.
 - Refine output format.
 
+
 ## [0.1.2] - 2023-12-20
 ### Added
 - Add gap-compressed ANI to output.
 
+
 ## [0.1.1] - 2023-11-29
 ### Added
 - Add options to control EM early stop.
@@ -40,6 +52,7 @@
 ### Fixed
 - Fix a bug causing chimeric reads not being aggregated.
 
+
 ## [0.1.0] - 2023-10-08
 ### Added
 - Output a json file to indicate the lineage of processed reads.
@@ -54,6 +67,7 @@
 ### Fixed
 - Prevent numpy from using all logical cores.
 
+
 ## [0.0.1] - 2023-09-19
 ### Added
 - First release.
diff --git a/README.md b/README.md
@@ -10,13 +10,16 @@ conda activate melon
 ```
 
 ### Database setup
-Download either the [NCBI](https://zenodo.org/records/11100615) or the [GTDB](https://zenodo.org/records/11076519) database:
-```bash
-## GTDB
-# wget -qN --show-progress https://zenodo.org/records/11076519/files/database.tar.gz
+> [!NOTE]
+> We suggest using the GTDB database for complex metagenomes, as it features less ambiguous taxonomic labels and is more comprehensive.
 
+Download either the [NCBI](https://zenodo.org/records/12571302) or the [GTDB](https://zenodo.org/records/12571554) database:
+```bash
 ## NCBI
-wget -qN --show-progress https://zenodo.org/records/11100615/files/database.tar.gz
+wget -qN --show-progress https://zenodo.org/records/12571302/files/database.tar.gz
+
+## GTDB
+# wget -qN --show-progress https://zenodo.org/records/12571554/files/database.tar.gz
 tar -zxvf database.tar.gz
 ```
 
@@ -37,13 +40,13 @@ rm -rf database/*.fa
 ```
 
 ### Run Melon
-> [!NOTE]  
-> Melon takes **quality-controlled** and **decontaminated** long reads as input. We suggest to remove low-quality raw reads before running Melon with e.g., `nanoq -q 10 -l 1000` (minimal quality score 10; minimal read length 1,000 bp). If your sample is known to have a large proportion of human DNAs or known eukaryotes/viruses, please consider removing them via proper mapping. If the origin of contamination is unknown, or if you want to estimate the mean genome size of prokaryotes, you may consider enabling the simple pre-filtering module. See [Run Melon with pre-filtering of non-prokaryotic reads](#run-melon-with-pre-filtering-of-non-prokaryotic-reads) for more details.
+> [!NOTE]
+> Melon takes **quality-controlled** long reads as input. We suggest removing low-quality raw reads before running Melon with e.g., `nanoq -q 10 -l 1000` (min. quality score 10; min. read length 1,000 bp). If your sample is known to have a large proportion of human DNAs or other eukaryotes/viruses and you want to estimate the **mean genome size** of prokaryotes, please consider removing them via proper mapping, or enabling the simple pre-filtering module. See [Run Melon with pre-filtering of non-prokaryotic reads](#run-melon-with-pre-filtering-of-non-prokaryotic-reads) for more details.
 
-We provide an example file comprising 10,000 quality-controlled (processed with `Porechop` and `nanoq`), prokaryotic reads (fungal and other reads removed with `minimap2`) randomly selected from the R10.3 mock sample of [Loman Lab Mock Community Experiments](https://lomanlab.github.io/mockcommunity/r10.html).
+We provide an example file comprising 10,000 quality-controlled (processed with `Porechop` and `nanoq`) prokaryotic reads (fungal and other reads removed with `minimap2`), randomly selected from the R10.3 mock sample of [Loman Lab Mock Community Experiments](https://lomanlab.github.io/mockcommunity/r10.html).
 
 ```bash
-wget -q --show-progress https://figshare.com/ndownloader/files/47279572/example.fa.gz
+wget -qN --show-progress https://zenodo.org/records/12571849/files/example.fa.gz
 melon example.fa.gz -d database -o .
 ```
 
@@ -70,7 +73,7 @@ The output file `*.tsv` contains the estimated genome copies for individual spec
 ...    1613|Limosilactobacillus fermentum    5.125    1.872146e-01    0.9654/0.9574
 ```
 
-The output file `*.json` contains the lineage and remark of each processed reads.
+The output file `*.json` contains the lineage and remark of each processed read.
 ```
 {
     "002617ff-697a-4cd5-8a97-1e136a792228": {
@@ -86,7 +89,7 @@ The output file `*.json` contains the lineage and remark of each processed reads
 ```
 
 ### Run Melon with pre-filtering of non-prokaryotic reads
-To enable the pre-filtering module, you need to download a database of Kraken that includes at least human and fungi (PlusPF, PlusPFP, or their capped versions). Using the PlusPF-8 (ver. 2023-06-05, capped at 8 GB) as an example:
+To enable the pre-filtering module, you need to download a database of Kraken2 that includes at least human and fungi (PlusPF, PlusPFP, or their capped versions). Using the PlusPF-8 (ver. 2023-06-05, capped at 8 GB) as an example:
 
 ```bash
 ## https://benlangmead.github.io/aws-indexes/k2

diff --git a/src/melon/__init__.py b/src/melon/__init__.py
@@ -1,3 +1,3 @@
-__version__ = '0.1.6'
+__version__ = '0.2.0'
 
 from .melon import GenomeProfiler
diff --git a/src/melon/melon.py b/src/melon/melon.py
@@ -160,15 +160,15 @@ def parse_minimap(self):
             qcoords[hit[0]].add(tuple(hit[3:5]))
 
         alignments = []
-        scores, max_scores = defaultdict(lambda: defaultdict(lambda: -np.inf)), defaultdict(lambda: -np.inf)
+        scores, max_scores = defaultdict(dict), dict()
         with open(f'{self.outfile}.minimap.tmp') as f:
             for line in f:
                 ls = line.rstrip().split('\t')
                 qstart, qend, qseqid, sseqid = int(ls[2]), int(ls[3]), ls[0], ls[5]
                 lineage = accession2lineage[sseqid.rsplit('_', 1)[0]]
 
                 ## filter out non-overlapping alignments
-                if (AS := int(ls[14].split('AS:i:')[-1])) > (AS_MAX := scores[qseqid].get(lineage, -np.inf)):
+                if (AS := int(ls[14].split('AS:i:')[-1])) > scores[qseqid].get(lineage, -np.inf):
                     if any(compute_overlap((qstart, qend, *qcoord)) > 0 for qcoord in qcoords[qseqid]):
                         scores[qseqid][lineage] = AS
                         max_scores[qseqid] = max(max_scores.get(qseqid, -np.inf), AS)

diff --git a/src/melon/utils.py b/src/melon/utils.py
@@ -2,7 +2,7 @@
 import re
 import sys
 import subprocess
-import logging    
+import logging
 
 ## setup logging format
 if not sys.stderr.isatty():