Bjorn1 sitrep #15 (Open)

wants to merge 135 commits into base: master

Changes from all commits (135 commits)
832fc8d
script for downloading GISAID API feed, generating fasta file and pro…
AlaaALatif Feb 21, 2021
caa3064
separate env for api data processing, due to issues installing geopan…
AlaaALatif Feb 21, 2021
bfa480b
fix strain processing in metadata, require username as input arg
AlaaALatif Feb 22, 2021
9c3d6d0
modify aggregate_replacements() and aggregate_deletions() to avoid pa…
AlaaALatif Feb 22, 2021
713703e
always add ref seq to input fasta to avoid downstream failures, teste…
AlaaALatif Feb 22, 2021
a61afa2
fix aggregate funcs to convert date to str to prevent NaN issues
AlaaALatif Feb 22, 2021
d7b2a70
minor formatting fixes
AlaaALatif Feb 22, 2021
a91940d
dev nb, transfer to local machine
AlaaALatif Feb 22, 2021
2a0df49
generate lowercased versions of location info
AlaaALatif Feb 22, 2021
75a1bdf
handle exceptions with mutation aggregation
AlaaALatif Feb 22, 2021
fd5365a
fix mutation aggregation funcs
AlaaALatif Feb 22, 2021
d085e0a
fix field names of normalized locations
AlaaALatif Feb 22, 2021
7fe5cdf
migrate to local machine
AlaaALatif Feb 23, 2021
29ca63c
fix bjorn.py to adapt to breaking changes made for outbreak, allowing…
AlaaALatif Feb 25, 2021
09e0605
updated script to tabulate mutations seen in A-lab sequences, excludi…
AlaaALatif Feb 27, 2021
1d0bb65
bash scripts used for sequencing and submission pipelines
AlaaALatif Feb 27, 2021
6bea875
not now...
AlaaALatif Feb 27, 2021
b8bd0fe
added batch_iterator() function for chunking big fasta files into sma…
AlaaALatif Feb 28, 2021
5da440b
script for splitting input fasta into smaller chunks with a user-spec…
AlaaALatif Feb 28, 2021
82aa1c3
baseline rules, still need to add chunk_fasta rule and run tests
AlaaALatif Feb 28, 2021
30fea4b
tested successfully, takes ~5mins to split file containing ~600K seqs…
AlaaALatif Feb 28, 2021
65f5b68
move current codebase to separate module
AlaaALatif Mar 1, 2021
100eaaf
tested successfully, still need to add API pull (rule_0) and gcloud u…
AlaaALatif Mar 1, 2021
f486717
initial readme for pipeline
AlaaALatif Mar 1, 2021
09cb203
replace hard-coded numbers with dynamic ones, which was used for rapi…
AlaaALatif Mar 1, 2021
fa171cd
fixed chunk_fasta to use single core only, fixed msa_2_mutations to a…
AlaaALatif Mar 1, 2021
8dcafcf
python scripts for pulling API data and merging final results
AlaaALatif Mar 1, 2021
9a83c7c
added rule_0 and rule_n for pulling API data and generating json for …
AlaaALatif Mar 1, 2021
0bc1894
changed json2fasta to function def to run inline within pipeline (fas…
AlaaALatif Mar 3, 2021
1321ada
add minimap2 to the env, which was the source of most cron errors...
AlaaALatif Mar 5, 2021
82badc7
rename bjorn_original to archive, containing legacy code
AlaaALatif Mar 6, 2021
c7d943e
added scripts for releasing alab sequences to new pipeline
AlaaALatif Mar 6, 2021
18ac2ed
run_alab_release.sh
AlaaALatif Mar 6, 2021
075776b
added pyarrow for efficient IO ops, thx 2 @parrt
AlaaALatif Mar 6, 2021
7b8df06
use abs path of gsutil to allow auto upload via cron job
AlaaALatif Mar 8, 2021
3d13d53
rectify location IDs, US counties represented as fips, deletions sepa…
AlaaALatif Mar 9, 2021
989c346
move ID getters toward end of logical statements
AlaaALatif Mar 9, 2021
d111e08
add DEL tag to deletion names
AlaaALatif Mar 9, 2021
e0b9ecd
prelim code for automated assignment of characteristic mutations for …
AlaaALatif Mar 10, 2021
f11cac8
sync local updates with online repo
AlaaALatif Mar 10, 2021
d56fe38
does not belong here, yet
AlaaALatif Mar 12, 2021
4d94603
bash script for running bjorn for outbreak.info
AlaaALatif Mar 12, 2021
3c62eaa
update bash
AlaaALatif Mar 13, 2021
af9d26b
_
AlaaALatif Mar 13, 2021
b0d68f6
minimum coverage can be user-specified as CLI argument
AlaaALatif Mar 17, 2021
0283bff
update logging to report user-specified min coverage as QC filter
AlaaALatif Mar 17, 2021
9a8db4a
update bash script to reflect metrics used in Alab sequence release
AlaaALatif Mar 17, 2021
cb98e25
consolidate col names for coverage to prevent redundant code
AlaaALatif Mar 17, 2021
4d5ada8
add documentation, include mutations in non-coding regions, refactor …
AlaaALatif Mar 19, 2021
304789e
ignore mutations in non-coding regions for outbreak sitrep
AlaaALatif Mar 19, 2021
584d533
add bash command for dev run, testing changes on internal alab releas…
AlaaALatif Mar 19, 2021
2eb9320
Merge pull request #3 from andersen-lab/dev
AlaaALatif Mar 19, 2021
0f33060
Incorporate align_fasta_viralMSA function definition, allowing sequen…
AlaaALatif Mar 19, 2021
c58a54c
playground notebooks for testing sitrep updates and alab releases
AlaaALatif Mar 19, 2021
7c1201f
_
AlaaALatif Mar 19, 2021
1adfff9
basic script to find mutations from fasta files, for Issa
AlaaALatif Mar 21, 2021
9da2339
added consistent behavior when metadata is missing
AlaaALatif Mar 21, 2021
b8b2b34
added function to identify IDs of samples with suspicious mutations
AlaaALatif Mar 23, 2021
6120b4f
added function to separate MSA into two: one for whitelisted samples …
AlaaALatif Mar 23, 2021
8c94630
whitelisted sample files are automatically separated from samples req…
AlaaALatif Mar 23, 2021
bba890e
contains source code
AlaaALatif Mar 23, 2021
67fd308
generalised code refactor
AlaaALatif Mar 23, 2021
41a6600
user-specified non-concerning mutations
AlaaALatif Mar 23, 2021
a0677ce
formatting
AlaaALatif Mar 23, 2021
35f3e87
generalised fetching filepaths to be adaptable to different file stru…
AlaaALatif Mar 23, 2021
be4d280
updated readme instructions
AlaaALatif Mar 23, 2021
e4a01aa
minor formatting
AlaaALatif Mar 23, 2021
2b9bc8b
remove test code, passes successfully
AlaaALatif Mar 23, 2021
5611f26
add demo example for ccbb call
AlaaALatif Mar 23, 2021
2589b48
added exception handling when sequences do not have any type of mutat…
AlaaALatif Mar 25, 2021
88a861a
added run with test inputs and outputs
AlaaALatif Mar 25, 2021
0a0df93
Merge pull request #4 from andersen-lab/bjorn1
AlaaALatif Mar 25, 2021
c8106f9
update readme
AlaaALatif Mar 25, 2021
ce95228
update readme
AlaaALatif Mar 25, 2021
b119ce6
populate suspicious mutations even if one or more mutation type is mi…
AlaaALatif Mar 25, 2021
b5d96fc
turn sitrep scripts into executables, appears to fix excess mem usage
AlaaALatif Mar 26, 2021
de01bdc
corrected issue with county names in Texas (Houston city vs Harris co…
AlaaALatif Mar 27, 2021
c61d5a4
turn json2fasta step into executable, runs inline to optimize memory …
AlaaALatif Mar 30, 2021
13d5ba5
inline download step carried through bash call, significantly reducin…
AlaaALatif Mar 30, 2021
21932b8
added params for json2fasta
AlaaALatif Mar 30, 2021
b3f560a
fixed normalization to differentiate between Democratic Republic of t…
AlaaALatif Mar 30, 2021
5885c31
demultiplex script now takes num cpus as 2nd arg
AlaaALatif Mar 30, 2021
68dace9
fixed absolute coord computation for insertions, more intuitive colum…
AlaaALatif Apr 2, 2021
3636c0d
Merge branch 'bjorn1' of https://github.com/andersen-lab/bjorn into b…
AlaaALatif Apr 2, 2021
37f2452
dynamic datetime specification for filenaming and data updates
AlaaALatif Apr 4, 2021
3c98959
updated sitrep scripts to accept new datetime information as user args
AlaaALatif Apr 4, 2021
b258de5
add default config for nonconcerning (whitelisted) mutations
AlaaALatif Apr 5, 2021
a7ccc24
successfully tested with nonconcerning substitutions, deletions, and …
AlaaALatif Apr 5, 2021
f820bf8
revert back to run-mode, from test-mode
AlaaALatif Apr 5, 2021
52a4ffe
case-insensitive normalization of USA, potential fix for outbreak-inf…
AlaaALatif Apr 6, 2021
058ca3a
Merge pull request #5 from andersen-lab/bjorn1
AlaaALatif Apr 6, 2021
3c4949f
Merge branch 'master' of https://github.com/andersen-lab/bjorn into b…
AlaaALatif Apr 7, 2021
b139c9e
added function to separate input sample sheet into sub-sheets prior t…
AlaaALatif Apr 7, 2021
97883f1
tests separate_samples() added to bjorn_support
AlaaALatif Apr 7, 2021
70385dc
added more user args for running demultiplex script
AlaaALatif Apr 7, 2021
6a70ef5
remove umlaut from bjorn name
AlaaALatif Apr 8, 2021
95447eb
normalize northern cape and natal division names for south africa
AlaaALatif Apr 18, 2021
f0e10fd
normalize country names for Republic of Congo, Bonaire/Sint Eustatius…
AlaaALatif Apr 18, 2021
4cf089b
changes to params for alab analysis and release
AlaaALatif Apr 27, 2021
23b8f10
define get_variant_counts() to fetch minor variant present in analysi…
AlaaALatif Apr 27, 2021
3b032c8
added depth-based QC filter step, threshold can be user-specified
AlaaALatif Apr 27, 2021
e8afc06
initial script for demultiplexing using Picard, allowing user-specifi…
AlaaALatif Apr 27, 2021
8175782
nb for supporting ancestral anomaly investigation
AlaaALatif Apr 27, 2021
1f49e25
Merge pull request #6 from andersen-lab/bjorn1
AlaaALatif Apr 27, 2021
6a5c0b7
added the analysis file structure expected for bjorn release
AlaaALatif Apr 27, 2021
9baf1ff
add fig illustrating the folder structure expected for bjorn release
AlaaALatif Apr 27, 2021
f02d92e
fix loc norm for sachsen-anhalt germany and tirol austria
AlaaALatif Apr 27, 2021
26e32f5
loc norm umlauted divisions in sweden, fixes outbreak-info/outbreak.i…
AlaaALatif Apr 27, 2021
43ea6e0
loc norm umlauted divisions in sweden, fixes outbreak-info/outbreak.i…
AlaaALatif Apr 27, 2021
00393be
fix codec error
AlaaALatif Apr 27, 2021
0bb2cfa
fix issue with del name assignment for short frame-shifting dels, e.g…
AlaaALatif May 1, 2021
4233a16
Merge pull request #9 from andersen-lab/bjorn1_sitrep
AlaaALatif May 3, 2021
c7c80b8
Fixed Swiss normalisation for GADM
corneliusroemer May 4, 2021
a1b2fab
Merge pull request #11 from corneliusroemer/swiss_test
gkarthik May 4, 2021
0cc453f
Further Swiss Normalization fixes
corneliusroemer May 5, 2021
b578e7c
Merge pull request #12 from corneliusroemer/patch-1
gkarthik May 5, 2021
699f476
fix puerto rico location normalization
AlaaALatif May 5, 2021
59c553d
Merge branch 'master' into bjorn1_sitrep
AlaaALatif May 5, 2021
a528dd4
Merge pull request #13 from andersen-lab/bjorn1_sitrep
AlaaALatif May 5, 2021
5c56b7f
loc norm testing lab
AlaaALatif May 5, 2021
3b93153
Merge branch 'bjorn1_sitrep' of https://github.com/andersen-lab/bjorn…
AlaaALatif May 5, 2021
2ee4941
code cleanup, removing old commented statements
AlaaALatif May 5, 2021
c9b994e
Merge branch 'bjorn1_sitrep' of https://github.com/andersen-lab/bjorn…
AlaaALatif May 16, 2021
ea83812
specify chunk data files as temp to prevent excess disk usage
AlaaALatif May 16, 2021
3871c33
incorporate test mode with correct filenaming and behavior
AlaaALatif May 17, 2021
69ff4ff
change to gadm_mini.tsv for quicker runtime, incorporate test mode wi…
AlaaALatif May 17, 2021
dae5518
incorporate test mode with correct behavior, does NOT upload to outbr…
AlaaALatif May 17, 2021
aedbaa7
skip api metadata file generation during test mode
AlaaALatif May 17, 2021
5db6df9
qc filters are now applied separately within a single function, seqs …
AlaaALatif May 23, 2021
056911b
apply qc filtering before identifying mutations
AlaaALatif May 23, 2021
5174b72
consolidate qc filters into a single function, correctly filter seqs …
AlaaALatif May 23, 2021
455d16f
apply qc filters prior to identifying mutations
AlaaALatif May 23, 2021
5dd5a1b
fixes outbreak-info/outbreak.info-issues-362
AlaaALatif May 23, 2021
412d7dc
fix small error in comment
AlaaALatif May 23, 2021
ced266d
separate dels into single-aa del, fixes outbreak-info/outbreak.info-i…
AlaaALatif Jun 8, 2021
52 changes: 44 additions & 8 deletions README.md
@@ -1,25 +1,61 @@
# `björn`
This is the code repository for `bjorn` - a suite of miscellaneous tools that can be used to:

* generate information for large-scale genomic surveillance of SARS-CoV-2 sequences. This functionality relies on external tools such as `datafunk`, `minimap2`, and `pandas`.

* prepare results and data files from SARS-CoV-2 sequencing analysis for release to public databases such as GISAID, Google Cloud, and GitHub

## Installation
* Install Anaconda: [instructions can be found here](https://docs.anaconda.com/anaconda/install/)
* Create the `bjorn` environment
```bash
conda env create -f envs/macos.yml -n bjorn
conda env create -f env/linux.yml -n bjorn
```
* Activate environment
```bash
conda activate bjorn
```
* Install datafunk inside the activated `bjorn` environment: [instructions](https://github.com/cov-ert/datafunk)

## Usage
### Information for Surveillance of SARS-CoV-2 Genomic Mutations
* Activate `bjorn` environment
```bash
conda activate bjorn
```
* Open `config.json` to specify your parameters (a minimal sketch is shown below), such as
  * current date
  * date appended to each filepath
  * output directory where results are saved
  * number of CPU cores available for use
* Run the `run_sitrep.sh` script to initiate the Snakemake pipeline
```bash
bash run_sitrep.sh
```
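For orientation, a minimal `config.json` sketch follows. The keys are the ones read by the Snakefile included in this PR; every value is an illustrative placeholder, not a working credential or path.
```json
{
    "username": "your-gisaid-username",
    "password": "your-gisaid-password",
    "out_dir": "/path/to/results",
    "feed_test": true,
    "gisaid_fasta": "sequences",
    "gisaid_meta": "metadata",
    "chunk_info": "chunk_info.csv",
    "chunk_size": 10000,
    "num_cpus": 8,
    "ref_fasta": "/path/to/reference.fasta",
    "patient_zero": "NC_045512.2"
}
```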

### Post-processing of SARS-CoV-2 Sequencing Results for Release to Public Databases
* Activate `bjorn` environment
```bash
conda activate bjorn
```
* Open `config.json` to specify your parameters, then save the file
  * NB: the config will run a test by default. Once it's tested, make sure to change the `is_test` value to `false` in `config.json`
* Open `run_alab_release.sh` to specify your parameters such as
  * filepath to sample sheet containing sample metadata (input)
  * filepath to updated metadata of samples that have already been uploaded
  * output directory where results are saved
  * number of CPU cores available for use
  * minimum coverage required for each sample (QC filter)
  * minimum average depth required for each sample (QC filter)
  * DEFAULT: test parameters
* Open `config.json` to specify your parameters (a sketch follows this list), such as
  * list of SARS-CoV-2 genes that are considered non-concerning
    * i.e. the occurrence of open reading frame (ORF)-altering mutations in these genes can be accepted
    * e.g. ['ORF8', 'ORF10']
  * list of SARS-CoV-2 mutations that are considered non-concerning
    * i.e. the occurrence of `ORF8:Q27_` can be accepted (it occurs in the B.1.1.7 lineage)
    * e.g. ['ORF8:Q27_']
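The README does not pin down the exact key names for these whitelists, so the sketch below uses hypothetical keys (`nonconcerning_genes`, `nonconcerning_mutations`) with the example values listed above:
```json
{
    "nonconcerning_genes": ["ORF8", "ORF10"],
    "nonconcerning_mutations": ["ORF8:Q27_"]
}
```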
* Run the `run_alab_release.sh` script to initiate the data release pipeline
```bash
bash run_alab_release.sh
```
* `bjorn` assumes the following file structure for the input sequencing data
![Release Structure](figs/alab_release_filestructure.png)
130 changes: 130 additions & 0 deletions Snakefile
@@ -0,0 +1,130 @@
import sys
sys.path.append('src/')
from path import Path
from datetime import datetime
import pandas as pd
import json
import argparse
import bjorn_support as bs
import json2fasta as bj
import chunk_fasta as bf
import msa_2_mutations as bm


# load user parameters
configfile: "config.json"

username = config['username']
password = config['password']
out_dir = config['out_dir']
is_test = config['feed_test']
if is_test:
    current_datetime = 'test'
else:
    current_datetime = datetime.now().strftime("%Y-%m-%d-%H-%M")
gisaid_sequences_filepath = out_dir + '/' + config['gisaid_fasta'] + '_' + current_datetime + '.fasta'
meta_filepath = out_dir + '/' + config['gisaid_meta'] + '_' + current_datetime + '.tsv.gz'
info_filepath = out_dir + '/' + config['chunk_info']
chunks_dir = out_dir + '/chunks'
fasta_dir = chunks_dir + '/fasta/' + current_datetime
# sam_dir = chunks_dir + '/sam/' + current_date
# msa_dir = chunks_dir + '/msa/' + current_date
# muts_dir = chunks_dir + '/muts/' + current_date
logs_dir = out_dir + '/logs'
chunk_size = int(config['chunk_size'])
num_cpus = int(config['num_cpus'])
reference_filepath = config['ref_fasta']
patient_zero = config['patient_zero']

# Download and pre-process GISAID data
download_cmd = f"src/json2fasta.py -u {username} -p {password} -s {chunk_size} -t {current_datetime}"
bs.run_command(download_cmd)
info_df = pd.read_csv(info_filepath)
# info_df = bj.download_process_data(username, password, chunk_size)


rule all:
    input:
        "{out_dir}/mutations_{current_datetime}.csv".format(out_dir = out_dir, current_datetime = current_datetime), # output data (signal)
        # expand("{chunks_dir}/muts/{current_datetime}/{sample}.mutations.csv", chunks_dir = chunks_dir, current_datetime = current_datetime, sample = info_df['chunk_names']), # bjorn
        # expand("{chunks_dir}/msa/{current_datetime}/{sample}.aligned.fasta", chunks_dir = chunks_dir, current_datetime = current_datetime, sample = info_df['chunk_names']), # data2funk -> gofasta
        # expand("{chunks_dir}/sam/{current_datetime}/{sample}.sam", chunks_dir = chunks_dir, current_datetime = current_datetime, sample = info_df['chunk_names']), # minimap2 -> mafft
        # expand("{chunks_dir}/fasta/{current_datetime}/{sample}.fasta", chunks_dir = chunks_dir, current_datetime = current_datetime, sample = info_df['chunk_names']), # chunk_fasta
        gisaid_sequences_filepath, # input data (signal)
        info_filepath
        # reference_filepath, # input data (patient zero)


# TODO: create merge_mutations.py
rule merge_results:
    input:
        expand("{chunks_dir}/muts/{current_datetime}/{sample}.mutations.csv", chunks_dir = chunks_dir, sample = info_df['chunk_names'], current_datetime = current_datetime),
        meta_filepath=meta_filepath
    threads: 1
    params:
        current_datetime=current_datetime,
    output:
        "{out_dir}/mutations_{current_datetime}.csv"
    shell:
        """
        src/merge_results.py -i {chunks_dir}/muts/{current_datetime}/ -m {input.meta_filepath} -o {output} -t {params.current_datetime}
        """


# TODO: test msa_2_mutations.py
rule run_bjorn:
    input:
        "{chunks_dir}/msa/{current_datetime}/{sample}.aligned.fasta"
    params:
        patient_zero=patient_zero,
    output:
        temp("{chunks_dir}/muts/{current_datetime}/{sample}.mutations.csv")
    shell:
        """
        src/msa_2_mutations.py -i {input} -r {params.patient_zero} -o {output}
        """
# for i, o in zip(input, output):
#     _ = bm.msa_2_mutations(i, params.patient_zero, o, config)

rule run_data2funk:
    input:
        "{chunks_dir}/sam/{current_datetime}/{sample}.sam",
    params:
        reference_filepath=reference_filepath,
    output:
        temp("{chunks_dir}/msa/{current_datetime}/{sample}.aligned.fasta"),
    shell:
        """
        datafunk sam_2_fasta -s {input} -r {params.reference_filepath} -o {output} --pad --log-inserts
        """

rule run_minimap2:
    input:
        "{chunks_dir}/fasta/{current_datetime}/{sample}.fasta"
    params:
        num_cpus=num_cpus,
        reference_filepath=reference_filepath
    output:
        temp("{chunks_dir}/sam/{current_datetime}/{sample}.sam"),
    shell:
        """
        minimap2 -a -x asm5 -t {params.num_cpus} {params.reference_filepath} {input} -o {output}
        """

rule chunk_fasta:
    input:
        gisaid_sequences_filepath,
    params:
        reference_filepath=reference_filepath,
        chunk_size=chunk_size,
        out_dir=fasta_dir
        # out_dir=lambda wildcards, output: Path(output).parent
    threads: 1
    output:
        temp(expand("{chunks_dir}/fasta/{current_datetime}/{sample}.fasta", chunks_dir = chunks_dir, current_datetime = current_datetime, sample = info_df['chunk_names']))
    shell:
        """
        src/chunk_fasta.py -f {input} -r {params.reference_filepath} -s {params.chunk_size} -o {chunks_dir}/fasta/{current_datetime}
        """
File renamed without changes.
51 changes: 51 additions & 0 deletions archive/alab_mutations.py
@@ -0,0 +1,51 @@
import subprocess
import shlex
import json
from path import Path
import pandas as pd
import bjorn_support as bs
import mutations as bm
import data as bd


with open('config.json', 'r') as f:
    config = json.load(f)

date = config['date']
out_dir = Path(config['alab_out_dir'])
ref_fp = Path(config['ref_fasta'])
patient_zero = config['patient_zero']
num_cpus = config['num_cpus']
in_alab_seqs = Path(config['alab_sequences'])
in_alab_meta = Path(config['alab_meta'])
if not Path.isdir(out_dir):
    Path.mkdir(out_dir)
    print(f"Created results directory: {out_dir}")
else:
    print(f"Results directory {out_dir} already exists...Continuing...")
# concatenate all consensus sequences
fa_fp = out_dir/'alab_seqs.fa'
if not Path.isfile(fa_fp):
    fa_fp = bs.concat_fasta(in_alab_seqs, out_dir/'alab_seqs')
    print(f"Concatenated all sequences and wrote to {fa_fp}")
# align consensus sequences
msa_fp = Path(fa_fp.split('.')[0] + '_aligned.fa')
if not Path.isfile(msa_fp):
    print(f"Aligning sequences with reference...")
    msa_fp = bs.align_fasta_reference(fa_fp, msa_fp, ref_fp=ref_fp, num_cpus=num_cpus)
    print(f"Multiple sequence alignment of A-lab samples with reference saved in {msa_fp}")
# msa2_fp = Path(fa_fp.split('.')[0] + '_aligned_absolute.fa')
# if not Path.isfile(msa2_fp):
# print(f"Aligning sequences without reference...")
# msa2_fp = bs.align_fasta(fa_fp, msa2_fp, num_cpus=num_cpus)
# print(f"Multiple sequence alignment of A-lab samples without reference saved in {msa2_fp}")
# Identify substitutions and deletions
msa_data = bs.load_fasta(msa_fp, is_aligned=True)
subs_wide = bm.identify_replacements(msa_data, in_alab_meta, data_src='alab')
subs_wide_fp = out_dir/f'alab_substitutions_wide_{date}.csv'
subs_wide.sort_values('num_samples', ascending=False).to_csv(subs_wide_fp, index=False)
print(f"Substitution-based mutations of A-lab samples saved in {subs_wide_fp}")
dels_wide = bm.identify_deletions(msa_data, in_alab_meta, data_src='alab')
dels_wide_fp = out_dir/f'alab_deletions_wide_{date}.csv'
dels_wide.sort_values('num_samples', ascending=False).to_csv(dels_wide_fp, index=False)
print(f"Deletion-based mutations of A-lab samples saved in {dels_wide_fp}")
38 changes: 23 additions & 15 deletions bjorn.py → archive/bjorn.py
@@ -13,7 +13,7 @@
from itertools import repeat
import os
from datetime import datetime as dt
from bjorn_support import concat_fasta, align_fasta, compute_tree, map_gene_to_pos, load_fasta
from mutations import identify_replacements, identify_deletions, identify_insertions
from onion_trees import load_tree, visualize_tree, get_indel2color, get_sample2color
import data as bd
@@ -392,20 +392,33 @@ def process_id(x):
colors = list(mcolors.TABLEAU_COLORS.keys())
# path to new github metadata
meta_fp = out_dir/'metadata.csv'
# load multiple sequence alignment
msa_data = load_fasta(msa_fp, is_aligned=True)
# identify insertions
insertions = identify_insertions(msa_data,
                                 meta_fp=meta_fp,
                                 patient_zero=patient_zero,
                                 min_ins_len=1,
                                 data_src='alab')
# save insertion results to file
insertions.to_csv(out_dir/'insertions.csv', index=False)
# identify substitution mutations
subs = identify_replacements(msa_data,
                             meta_fp=meta_fp,
                             data_src='alab',
                             patient_zero=patient_zero)
# save substitution results to file
subs.to_csv(out_dir/'replacements.csv', index=False)
# identify deletions
deletions = identify_deletions(msa_data,
                               meta_fp=meta_fp,
                               data_src='alab',
                               patient_zero=patient_zero,
                               min_del_len=1)
# save deletion results to file
deletions.to_csv(out_dir/'deletions.csv', index=False)
# plot Phylogenetic tree with top consensus deletions annotated
deletions = deletions.nlargest(len(colors), 'num_samples')
# del2color = get_indel2color(deletions, colors)
# sample_colors = get_sample2color(deletions, colors)
# fig2 = visualize_tree(tree, sample_colors,
@@ -415,10 +428,6 @@ def process_id(x):
# indels=deletions, colors=colors,
# isnv_info=True);
# fig3.savefig(tree_dir/'deletion_isnv_tree.pdf', dpi=300)
# plot Phylogenetic tree with top consensus deletions annotated
insertions = insertions.nlargest(len(colors), 'num_samples')
# del2color = get_indel2color(insertions, colors)
Expand All @@ -434,10 +443,9 @@ def process_id(x):
Path.mkdir(out_dir);
# Data logging
with open("{}/data_release.log".format(out_dir), 'w') as f:
    f.write(f"Prepared {final_result.shape[0]} samples for release\n")
    f.write(f'{num_samples_missing_coverage} samples are missing coverage information\n')
    f.write(f'{low_coverage_samples.shape[0]} samples were found to have coverage below 90%\n')
    f.write(f'{num_samples_missing_cons} samples were ignored because they were missing consensus sequence files\n')
    f.write(f'{num_samples_missing_bams} samples were ignored because they were missing BAM sequence files\n')
print(f"Transfer Complete. All results saved in {out_dir}")
30 changes: 30 additions & 0 deletions bjorn_support.py → archive/bjorn_support.py
@@ -9,6 +9,35 @@
from Bio import Seq, SeqIO, AlignIO, Phylo, Align


def batch_iterator(iterator, chunk_size):
    """Returns lists of length chunk_size.

    This can be used on any iterator, for example to batch up
    SeqRecord objects from Bio.SeqIO.parse(...), or to batch
    Alignment objects from Bio.AlignIO.parse(...), or simply
    lines from a file handle.

    This is a generator function, and it returns lists of the
    entries from the supplied iterator. Each list will have
    chunk_size entries, although the final list may be shorter.
    Citation: https://biopython.org/wiki/Split_large_file
    """
    record = True
    while record:
        chunk = []
        while len(chunk) < chunk_size:
            try:
                record = next(iterator)
            except StopIteration:
                record = None
            if record is None:
                # End of file
                break
            chunk.append(record)
        if chunk:
            yield chunk


def dict2fasta(seqs: dict, fasta_fp: str, wrap=80):
    with open(fasta_fp, 'w') as f:
        for gid, gseq in seqs.items():
@@ -148,6 +177,7 @@ def concat_fasta_2(in_filepaths: list, out_filepath):
"""Concatenate fasta sequences into single fasta file.
Takes a list of fasta filepaths and an output filename for saving"""
cat_cmd = f"cat {' '.join(in_filepaths)} > {out_filepath}"
print(cat_cmd)
run_command(cat_cmd)
return out_filepath

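As a usage note for the `batch_iterator()` helper added in the `bjorn_support.py` diff above, here is a minimal sketch of how it can chunk a large FASTA file with Biopython. File names, the output directory, and the chunk size are illustrative:
```python
from Bio import SeqIO

# stream records rather than loading the whole file into memory
records = SeqIO.parse("gisaid.fasta", "fasta")

# write each 10,000-record chunk to its own fasta file (chunks/ must exist)
for i, chunk in enumerate(batch_iterator(records, 10000)):
    out_fp = f"chunks/chunk_{i}.fasta"
    n = SeqIO.write(chunk, out_fp, "fasta")
    print(f"wrote {n} records to {out_fp}")
```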
File renamed without changes.