Many additions #48

Merged · 120 commits · Sep 11, 2024
Changes from all commits
9513796
initial work on the merge modules
DLBPointon Apr 5, 2024
3d4dbc7
Updates to the create_btk_dataset process
DLBPointon Apr 5, 2024
e040d3e
Updates to the create_btk_dataset process
DLBPointon Apr 5, 2024
b3fc8f2
Adding Diamond Blast subworkflow
DLBPointon Apr 5, 2024
992fe3f
Added the UNIPROT diamond blast subworkflow
DLBPointon Apr 5, 2024
5d28b47
Typo
DLBPointon Apr 5, 2024
951c248
additions to scripts and a rewrite
DLBPointon Apr 8, 2024
c45bc2c
Adding and updating modules
DLBPointon Apr 8, 2024
d07cb6b
Adding more modules
DLBPointon Apr 8, 2024
3c5102c
Updating the modules conf for the diamond blastx
DLBPointon Apr 8, 2024
62f528e
updates to modules
DLBPointon Apr 17, 2024
ee1bbdc
Update to config
DLBPointon Apr 17, 2024
c9dab6b
Addition of the V2 python scripts
DLBPointon Apr 18, 2024
c2c76c3
Modifications for malformed channels
DLBPointon Apr 18, 2024
13eabcc
Updated
DLBPointon Apr 18, 2024
947b1c7
Updated
DLBPointon Apr 18, 2024
c5cd5f7
Addition of a V2 python script
DLBPointon Apr 18, 2024
b0903e4
updates for merge BTK datasets
DLBPointon Apr 18, 2024
c09a852
Remove ls -lh and unnecessary args
DLBPointon Apr 18, 2024
ddc627d
Addition of the trimNs subworkflow
DLBPointon Apr 19, 2024
581bb91
Addition of the trimNs subworkflow
DLBPointon Apr 19, 2024
001b2f7
Addition of the trimNs subworkflow
DLBPointon Apr 19, 2024
1c086f7
Addition of the trimNs subworkflow
DLBPointon Apr 19, 2024
3a28d35
Addition of the ascc merge tables module
DLBPointon Apr 19, 2024
645df15
module which was missing from the diamond subworkflow, required to ge…
DLBPointon Apr 19, 2024
e71c6d5
Updates to the modules
DLBPointon Apr 19, 2024
98a8bc1
Updating modules config
DLBPointon Apr 19, 2024
3554bbf
Addition of a new module to generate the hits file
DLBPointon Apr 19, 2024
dcc82ba
Addition of re-written scripts
DLBPointon Apr 19, 2024
90359a3
Addition of re-written scripts
DLBPointon Apr 19, 2024
bed9234
addition of ascc merge tables script
DLBPointon Apr 19, 2024
5b62351
Skeleton module for the sanger-tol_btk module/pipeline
DLBPointon Apr 19, 2024
8f38b5b
Adding the expected values for the sanger-tol-blobtoolkit pipeline/m…
DLBPointon Apr 19, 2024
d95a683
adding args
DLBPointon Apr 19, 2024
5be3f3f
Adding IN-DEVELOPMENT banner
DLBPointon Apr 19, 2024
0497861
Merge branch 'dev' into dp24_btk_datasets
DLBPointon Apr 19, 2024
a0d5fae
linting fix
DLBPointon Apr 19, 2024
e6ea8ad
Prettier Fix
DLBPointon Apr 19, 2024
664eaf0
Adding skeleton module
DLBPointon Apr 19, 2024
bdf0248
Formatting and EditorConfig linting
DLBPointon Apr 19, 2024
2b89257
updating container
DLBPointon Apr 19, 2024
005e600
removed
DLBPointon Apr 19, 2024
ba4e00c
formatting
DLBPointon Apr 19, 2024
39c3cd0
Updating from sanger-tol containers to public more suitable ones
DLBPointon Apr 23, 2024
1929298
spelling
DLBPointon Apr 23, 2024
79d04d2
modified: modules/local/autofiltering.nf
DLBPointon Apr 30, 2024
5335a8d
new file: bin/abnormal_contamination_check.py
DLBPointon Apr 30, 2024
6f43fa8
Updates
DLBPointon May 9, 2024
cbde411
Updates
DLBPointon May 9, 2024
010f7c9
Adding the sanger tol btk pipeline
DLBPointon May 9, 2024
1243eee
Fix the script
DLBPointon May 9, 2024
85271c0
Add the script for generating the samplesheet
DLBPointon May 9, 2024
6362894
Adding the module for abnormal checks
DLBPointon May 9, 2024
00b2595
Updates to add contam check
DLBPointon May 9, 2024
85f2166
Updates to add contam check
DLBPointon May 9, 2024
369e132
Updates to add contam check
DLBPointon May 9, 2024
58561b5
Updates to all
DLBPointon May 13, 2024
5b7db25
Updates to all
DLBPointon May 13, 2024
df53c2d
Updates to all
DLBPointon May 13, 2024
57cbc06
Updates to all
DLBPointon May 13, 2024
e5563b0
Updates
DLBPointon May 23, 2024
0116b8f
Updates
DLBPointon May 23, 2024
e4bda76
Fixing the autofilter, wrong file was being passed and booling the ou…
DLBPointon May 28, 2024
61da0ac
Updates
DLBPointon Jun 11, 2024
dd247e8
Updates
DLBPointon Jun 20, 2024
60e5ae4
Fixes that were stopping the pipeline completing
DLBPointon Jun 25, 2024
8e76bc1
Update for ea
DLBPointon Jun 27, 2024
1fadcc5
Update to use the sorted bam file for generate_samplesheet
DLBPointon Jun 27, 2024
b651966
Update to use the sorted bam file for generate_samplesheet
DLBPointon Jun 27, 2024
5f6d384
Update to use the sorted bam file for generate_samplesheet
DLBPointon Jun 27, 2024
94251bb
Input Channel lacked meta and so failed the input check
DLBPointon Jun 27, 2024
0223b94
Updates to channel paths and channels for modules
DLBPointon Jun 27, 2024
0a4fd2c
Update based on work of ea12
Jul 3, 2024
16bad9a
Update to command
DLBPointon Jul 3, 2024
d1a571a
Linting fix
DLBPointon Jul 3, 2024
c96ba68
Updates - fixed error in test.yaml
DLBPointon Jul 3, 2024
20c4701
Updates for testing
DLBPointon Jul 5, 2024
b8ce787
Updates for testing
DLBPointon Jul 5, 2024
d21fd7c
Adding https rather than http
DLBPointon Jul 5, 2024
fb9cb82
add file exist arg to linting
DLBPointon Jul 5, 2024
32ad0b3
add file exist arg to linting
DLBPointon Jul 5, 2024
09bfcb4
add file exist arg to linting
DLBPointon Jul 5, 2024
8e111e1
add file exist arg to linting
DLBPointon Jul 5, 2024
c220c3c
Correct treeval to ascc
DLBPointon Jul 8, 2024
e4f4116
correction
DLBPointon Jul 9, 2024
fd55686
Update workflow
DLBPointon Jul 10, 2024
f32be17
Update test.yaml
DLBPointon Jul 10, 2024
37c8847
Updates to Workflows
DLBPointon Jul 10, 2024
7e6eae7
Updates, changes to flags and ci to allow for turning off btk
DLBPointon Jul 12, 2024
cf544bb
Fixing reviewer comments
DLBPointon Jul 18, 2024
6277abb
Fixed Variable name for cicd
DLBPointon Jul 18, 2024
560b0d3
Adding version output and stubs
DLBPointon Jul 25, 2024
9afd040
updates and version outputs
DLBPointon Jul 26, 2024
a3b2d98
Adding Merge Tables - spelling error - tuple error
DLBPointon Jul 26, 2024
7bf7391
Updates to spelling, name corrections to match OLD_ASCC
DLBPointon Aug 6, 2024
060e997
Adding include_exclude checker
DLBPointon Aug 6, 2024
b30dddb
Removed extra script and renamed V2
DLBPointon Aug 6, 2024
71f1f26
Updates to check
DLBPointon Aug 6, 2024
39393ef
ea10 edits to dp24_btk_datasets_branch
Aug 7, 2024
afc73ed
07.08.2024 edits
Aug 7, 2024
e3b8db9
ran linting with black
Aug 7, 2024
6d6b6fa
Updating the version of BTK, spelling and bug fixes
DLBPointon Aug 7, 2024
7e7370b
Merge branch 'dp24_btk_datasets' into dp24_btk_datasets_ea10_edits2
DLBPointon Aug 7, 2024
748b3ca
Merge pull request #55 from sanger-tol/dp24_btk_datasets_ea10_edits2
DLBPointon Aug 7, 2024
b1e214b
Addition of new scripts to filter and double check data
DLBPointon Aug 8, 2024
a599125
Adding new scripts for filtering and double checking data #56
DLBPointon Aug 8, 2024
aeb7097
Updates for #56
DLBPointon Aug 8, 2024
7405a9b
Updates for #56
DLBPointon Aug 8, 2024
6008006
Updates closes #56 and additions to the output.md file
DLBPointon Aug 8, 2024
00b375a
Prettier linting
DLBPointon Aug 8, 2024
646ae91
Update to the sanger-tol module to remove the yaml flag which is now …
DLBPointon Aug 9, 2024
124a50a
closes #57 filters the output to that directly needed for analysis
DLBPointon Aug 9, 2024
1db6429
Addition of 2/3 indicator files needed for integration into the curre…
DLBPointon Aug 9, 2024
0bbd384
Addition of 2/3 indicator files needed for integration into the curre…
DLBPointon Aug 9, 2024
dfdbb8d
Addition of indicator files, these are saved to the main outdir
DLBPointon Aug 9, 2024
fe1a847
Addition of indicator files, these are saved to the main outdir
DLBPointon Aug 9, 2024
a150b47
Addition of indicator files, these are saved to the main outdir
DLBPointon Aug 9, 2024
ddf15e2
Addition of indicator files, these are saved to the main outdir
DLBPointon Aug 9, 2024
7be5861
Adding 13 subworkflow images
DLBPointon Aug 14, 2024
d4024b7
minor update
DLBPointon Aug 14, 2024
15 changes: 14 additions & 1 deletion .github/workflows/ci.yml
@@ -72,6 +72,18 @@ jobs:
run: |
curl https://tolit.cog.sanger.ac.uk/test-data/resources/ascc/asccTinyTest_V2.tar.gz | tar xzf -

- name: Temporary ASCC Diamond Data
run: |
curl https://dp24.cog.sanger.ac.uk/ascc/diamond.dmnd -o diamond.dmnd

- name: Temporary BLASTN Data
run: |
curl https://dp24.cog.sanger.ac.uk/blastn.tar.gz | tar xzf -

- name: Temporary Accession2TaxID Data
run: |
curl https://dp24.cog.sanger.ac.uk/ascc/accession2taxid.tar.gz | tar -xzf -

- name: Download the NCBI taxdump database
run: |
mkdir ncbi_taxdump
@@ -120,10 +132,11 @@ jobs:
run: |
mkdir vecscreen
curl -L https://ftp.ncbi.nlm.nih.gov/blast/db/v4/16SMicrobial_v4.tar.gz | tar -C vecscreen -xzf -
ls -lh

- name: Singularity - Run FULL pipeline with test data
# TODO nf-core: You can customise CI pipeline run tests as required
# For example: adding multiple test runs with different parameters
# Remember that you can parallelise this by using strategy.matrix
run: |
nextflow run ${GITHUB_WORKSPACE} -profile test,singularity --outdir ./results --steps ALL
nextflow run ./sanger-ascc/${{ steps.branch-names.outputs.current_branch }}/main.nf -profile test,singularity --outdir ./results --include ALL --exclude btk_busco
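
The test run now uses the include/exclude interface instead of the old --steps flag. As a rough local equivalent (a sketch only; the ./main.nf path stands in for a checkout of this branch and is not taken from the diff):

    # Run the full test profile but skip the BlobToolKit BUSCO subworkflow
    nextflow run ./main.nf \
        -profile test,singularity \
        --outdir ./results \
        --include ALL \
        --exclude btk_busco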
3 changes: 1 addition & 2 deletions .nf-core.yml
@@ -19,5 +19,4 @@ lint:
nextflow_config:
- manifest.name
- manifest.homePage
multiqc_config:
- report_comment
multiqc_config: False
17 changes: 17 additions & 0 deletions assets/btk_draft.yaml
@@ -0,0 +1,17 @@
assembly:
level: bar
settings:
foo: 0
similarity:
diamond_blastx:
foo: 0
taxon:
class: class_name
family: family_name
genus: genus_name
kingdom: kingdom_name
name: species_name
order: order_name
phylum: phylum_name
superkingdom: superkingdom_name
taxid: 0
16 changes: 9 additions & 7 deletions assets/github_testing/test.yaml
@@ -12,18 +12,20 @@ kmer_len: 7
dimensionality_reduction_methods: "pca,random_trees"
# all available methods
# "pca,umap,t-sne,isomap,lle_standard,lle_hessian,lle_modified,mds,se,random_trees,kernel_pca,pca_svd,autoencoder_sigmoid,autoencoder_linear,autoencoder_selu,autoencoder_relu,nmf"
nt_database: /home/runner/work/ascc/ascc/NT_database/
nt_database_prefix: 18S_fungal_sequences
nt_database: /home/runner/work/ascc/ascc/blastdb/
nt_database_prefix: tiny_plasmodium_blastdb.fa
nt_kraken_db_path: /home/runner/work/ascc/ascc/kraken2/kraken2
ncbi_accessionids_folder: /lustre/scratch123/tol/teams/tola/users/ea10/ascc_databases/ncbi_taxonomy/20230509_accession2taxid/
ncbi_accessionids_folder: /home/runner/work/ascc/ascc/20240709_tiny_accession2taxid/
ncbi_taxonomy_path: /home/runner/work/ascc/ascc/ncbi_taxdump/
ncbi_rankedlineage_path: /home/runner/work/ascc/ascc/ncbi_taxdump/rankedlineage.dmp
busco_lineages_folder: /home/runner/work/ascc/ascc/busco_database/lineages
busco_lineages: "diptera_odb10,insecta_odb10"
fcs_gx_database_path: /home/runner/work/ascc/ascc/FCS_gx/
diamond_uniprot_database_path: /home/runner/work/ascc/ascc/diamond/UP000000212_1234679_tax.dmnd
diamond_nr_database_path: /home/runner/work/ascc/ascc/diamond/UP000000212_1234679_tax.dmnd
diamond_uniprot_database_path: /home/runner/work/ascc/ascc/diamond.dmnd
diamond_nr_database_path: /home/runner/work/ascc/ascc/diamond.dmnd
vecscreen_database_path: /home/runner/work/ascc/ascc/vecscreen/
seqkit:
sliding: 6000
window: 100000
sliding: 100000
window: 6000
n_neighbours: 13
btk_yaml: /home/runner/work/ascc/ascc/assets/btk_draft.yaml
18 changes: 10 additions & 8 deletions assets/test.yaml
@@ -12,18 +12,20 @@ kmer_len: 7
dimensionality_reduction_methods: "pca,random_trees"
# all available methods
# "pca,umap,t-sne,isomap,lle_standard,lle_hessian,lle_modified,mds,se,random_trees,kernel_pca,pca_svd,autoencoder_sigmoid,autoencoder_linear,autoencoder_selu,autoencoder_relu,nmf"
nt_database: /data/blastdb/Supported/NT/202308/dbv4/
nt_database_prefix: nt
nt_kraken_db_path: /lustre/scratch123/tol/teams/tola/users/ea10/ascc_databases/nt/nt
ncbi_accessionids_folder: /lustre/scratch123/tol/teams/tola/users/ea10/ascc_databases/ncbi_taxonomy/20230509_accession2taxid/
ncbi_taxonomy_path: /lustre/scratch123/tol/teams/tola/users/ea10/databases/taxdump/
nt_database: /lustre/scratch123/tol/teams/tola/users/ea10/pipeline_testing/20240704_blast_tiny_testdb/blastdb/
nt_database_prefix: tiny_plasmodium_blastdb.fa
nt_kraken_db_path: /nfs/treeoflife-01/teams/tola/users/dp24/ascc/kraken2/kraken2/
ncbi_accessionids_folder: /nfs/treeoflife-01/teams/tola/users/dp24/ascc/20240709_tiny_accession2taxid/
ncbi_taxonomy_path: /lustre/scratch123/tol/resources/taxonomy/latest/new_taxdump
ncbi_rankedlineage_path: /lustre/scratch123/tol/teams/tola/users/ea10/databases/taxdump/rankedlineage.dmp
busco_lineages_folder: /lustre/scratch123/tol/resources/busco/data/v5/2021-08-27/lineages
fcs_gx_database_path: /lustre/scratch124/tol/projects/asg/sub_projects/ncbi_decon/0.4.0/gxdb
busco_lineages: "diptera_odb10,insecta_odb10"
fcs_gx_database_path: /lustre/scratch124/tol/projects/asg/sub_projects/ncbi_decon/0.4.0/gxdb/
vecscreen_database_path: /nfs/treeoflife-01/teams/tola/users/dp24/ascc/vecscreen/
diamond_uniprot_database_path: /lustre/scratch123/tol/teams/tola/users/ea10/ascc_databases/uniprot/uniprot_reference_proteomes_with_taxonnames.dmnd
diamond_nr_database_path: /lustre/scratch123/tol/resources/nr/latest/nr.dmnd
diamond_uniprot_database_path: /lustre/scratch123/tol/teams/tola/users/ea10/pipeline_testing/20240704_diamond_tiny_testdb/ascc_tinytest_diamond_db.dmnd
diamond_nr_database_path: /lustre/scratch123/tol/teams/tola/users/ea10/pipeline_testing/20240704_diamond_tiny_testdb/ascc_tinytest_diamond_db.dmnd
seqkit:
sliding: 100000
window: 6000
n_neighbours: 13
btk_yaml: /nfs/treeoflife-01/teams/tola/users/dp24/ascc/assets/btk_draft.yaml
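
Both test configs now use the same seqkit values (sliding: 100000, window: 6000). Assuming these feed the step and window sizes of seqkit's sliding subcommand (an assumption about the module wiring, which this diff does not show), the equivalent command would be roughly:

    # Hypothetical equivalent of the seqkit settings above: 6 kb windows every 100 kb
    seqkit sliding -s 100000 -W 6000 assembly.fasta > windows.fasta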
144 changes: 144 additions & 0 deletions bin/abnormal_contamination_check.py
@@ -0,0 +1,144 @@
#!/usr/bin/env python3

VERSION = "V1.0.0"

DESCRIPTION = """
-------------------------------------
Abnormal Contamination Check
Version = {VERSION}
-------------------------------------
Written by James Torrance
Modified by Eerik Aunin
Modified by Damon-Lee Pointon
-------------------------------------

Script for determining if there is
enough contamination found by FCS-GX
to warrant an abnormal contamination
report alarm. Partially based on code
written by James Torrance
-------------------------------------

"""

import general_purpose_functions as gpf
import sys
import os.path
import pathlib
import argparse
import textwrap


def parse_args():
parser = argparse.ArgumentParser(
prog="Abnormal Contamination Check",
formatter_class=argparse.RawDescriptionHelpFormatter,
description=textwrap.dedent(DESCRIPTION),
)
parser.add_argument("assembly", type=str, help="Path to the fasta assembly file")
parser.add_argument("summary_path", type=str, help="Path to the tiara summary file")
parser.add_argument("-v", "--version", action="version", version=VERSION)
return parser.parse_args()


def get_sequence_lengths(assembly_fasta_path):
"""
Gets sequence lengths of a FASTA file and returns them as a dictionary
"""
seq_lengths_dict = dict()
fasta_data = gpf.read_fasta_in_chunks(assembly_fasta_path)
for header, seq in fasta_data:
seq_len = len(seq)
seq_lengths_dict[header] = dict()
seq_lengths_dict[header]["seq_len"] = seq_len
return seq_lengths_dict


def load_fcs_gx_results(seq_dict, fcs_gx_and_tiara_summary_path):
"""
Loads FCS-GX actions from the FCS-GX and Tiara results summary file, adds them to the dictionary that contains sequence lengths
"""
fcs_gx_and_tiara_summary_data = gpf.l(fcs_gx_and_tiara_summary_path)
fcs_gx_and_tiara_summary_data = fcs_gx_and_tiara_summary_data[1 : len(fcs_gx_and_tiara_summary_data)]
for line in fcs_gx_and_tiara_summary_data:
split_line = line.split(",")
assert len(split_line) == 5
seq_name = split_line[0]
fcs_gx_action = split_line[1]
seq_dict[seq_name]["fcs_gx_action"] = fcs_gx_action
return seq_dict


def main():
args = parse_args()
if os.path.isfile(args.summary_path) is False:
sys.stderr.write(
f"The FCS-GX and Tiara results file was not found at the expected location ({args.summary_path})\n"
)
sys.exit(1)

if os.path.isfile(args.assembly) is False:
sys.stderr.write(f"The assembly FASTA file was not found at the expected location ({args.assembly})\n")
sys.exit(1)

seq_dict = get_sequence_lengths(args.assembly)
seq_dict = load_fcs_gx_results(seq_dict, args.summary_path)

total_assembly_length = 0
lengths_removed = list()
scaffolds_removed = 0
scaffold_count = len(seq_dict)

for seq_name in seq_dict:
seq_len = seq_dict[seq_name]["seq_len"]
if seq_dict[seq_name]["fcs_gx_action"] == "EXCLUDE":
lengths_removed.append(seq_len)
scaffolds_removed += 1
total_assembly_length += seq_len

alarm_threshold_for_parameter = {
"TOTAL_LENGTH_REMOVED": 1e7,
"PERCENTAGE_LENGTH_REMOVED": 3,
"LARGEST_SCAFFOLD_REMOVED": 1.8e6,
}

report_dict = {
"TOTAL_LENGTH_REMOVED": sum(lengths_removed),
"PERCENTAGE_LENGTH_REMOVED": 100 * sum(lengths_removed) / total_assembly_length,
"LARGEST_SCAFFOLD_REMOVED": max(lengths_removed, default=0),
"SCAFFOLDS_REMOVED": scaffolds_removed,
"PERCENTAGE_SCAFFOLDS_REMOVED": 100 * scaffolds_removed / scaffold_count,
}

for param in report_dict:
sys.stderr.write(f"{param}: {report_dict[param]}\n")

fcs_gx_alarm_indicator_path = "fcs-gx_alarm_indicator_file.txt"
pathlib.Path(fcs_gx_alarm_indicator_path).unlink(missing_ok=True)

alarm_list = []
stage1_decon_pass_flag = True
for param in alarm_threshold_for_parameter:
param_value = report_dict[param]
alarm_threshold = alarm_threshold_for_parameter[param]

# IF CONTAMINATING SEQ FOUND FILL FILE WITH ABNORMAL CONTAM
if param_value > alarm_threshold_for_parameter[param]:
stage1_decon_pass_flag = False
alarm_list.append(
f"YES_ABNORMAL_CONTAMINATION: Stage 1 decon alarm triggered for {param}: the value for this parameter in this assembly is {param_value} | alarm threshold is {alarm_threshold}\n"
)

# Separated out to ensure that the file is written in one go and doesn't confuse Nextflow
with open(fcs_gx_alarm_indicator_path, "a") as f:
f.write("".join(alarm_list))

# IF NO CONTAM FILL FILE WITH NO CONTAM
if stage1_decon_pass_flag is True:
alarm_message = "NO_ABNORMAL_CONTAMINATION: No scaffolds were tagged for removal by FCS-GX\n"
with open(fcs_gx_alarm_indicator_path, "a") as f:
f.write(alarm_message)


if __name__ == "__main__":
main()
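
For context, a minimal usage sketch (input file names here are hypothetical): the script takes the assembly FASTA and the combined FCS-GX/Tiara summary CSV as positional arguments, prints the per-parameter values to stderr, and records the verdict in fcs-gx_alarm_indicator_file.txt rather than in the exit code (it only exits non-zero when an input file is missing). For example, tagging 4 Mb of a 100 Mb assembly for removal is 4% of the total length, above the 3% PERCENTAGE_LENGTH_REMOVED threshold, so the alarm file would get a YES_ABNORMAL_CONTAMINATION line.

    # Hedged example; file names are placeholders
    abnormal_contamination_check.py assembly.fasta fcs_gx_and_tiara_summary.csv
    cat fcs-gx_alarm_indicator_file.txt   # YES_/NO_ABNORMAL_CONTAMINATION lines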