the order to use the tools #4

XiaofeiSunUCSF · 2020-08-31T19:49:44Z

Hi,

Thank you for the wonderful tools.
Do you have any recommended order to use those three tools (filter.bam, filter.vcf, sort.vcf)?

Thanks
Xiaofei

ghuls · 2020-09-01T08:44:24Z

It depends a lot on your input VCF file.

Sort VCF file in the same order as the BAM file is not always needed, but only if popscle dsc-pileup gives this error" [E:%s] Your VCF/BCF files and SAM/BAM/CRAM files have different ordering of chromosomes. SAM/BAM/CRAM file has %s before %s, but VCF/BCF file has %s after %s". Altough it will never hurt to run this as first step and use the output VCF file for the Filter BAM file for usage with popscle dsc-pileup step.

The filter VCF file steps are only needed if you want to create your own specific VCF file (e.g. in case you can't use the 1000 genome VCF file as you have a different species or you have the mutations for all the cell lines you are interested in).

XiaofeiSunUCSF · 2020-09-01T19:20:30Z

It depends a lot on your input VCF file.

Sort VCF file in the same order as the BAM file is not always needed, but only if popscle dsc-pileup gives this error" [E:%s] Your VCF/BCF files and SAM/BAM/CRAM files have different ordering of chromosomes. SAM/BAM/CRAM file has %s before %s, but VCF/BCF file has %s after %s". Altough it will never hurt to run this as first step and use the output VCF file for the Filter BAM file for usage with popscle dsc-pileup step.

The filter VCF file steps are only needed if you want to create your own specific VCF file (e.g. in case you can't use the 1000 genome VCF file as you have a different species or you have the mutations for all the cell lines you are interested in).

Hi,

Thank you for your very clear answer.
I Filter BAM by using your "filter_bam"method . Then run into sort issue when run popscle dsc-pileup. I come back and using your "sort vcf" method.

One quick question regarding sort vcf. Do you have any idea how long it usually takes? I have one 1000g vcf that is roughly 16G that has been running for 48h (I only get around 4Mb output till now). I can start a new job with a longer time before HPC abort this job. Do you have idea/suggestion to run it a little bit faster?

Cheers,
Xiaofei

cflerin · 2020-09-04T10:53:25Z

Hi @XiaofeiSunUCSF , maybe you're using the wrong 1000 Genomes VCF? You can use the VCF with the full set of sites, but without genotypes (assuming this is for freemuxlet). For GRCh38 I used:

http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20181203_biallelic_SNV/ALL.wgs.shapeit2_integrated_v1a.GRCh38.20181129.sites.vcf.gz

and it's only ~850 MB. Filtering to remove rare variants reduced it to <100 MB. For me this ran quite fast.

ghuls · 2020-09-04T12:30:07Z

@XiaofeiSunUCSF Sorting the VCF file shouldn't take that long even for such a big file. What is the exact command you run?

What is the output of grep '^#CHROM' -m1 -A 10 ${vcf_filename} and samtools view -H ${bam_file} | grep '^@SQ'? It could be that the chromosome names in your VCF file and BAM file are different.

XiaofeiSunUCSF · 2020-09-04T17:42:56Z

ALL.wgs.shapeit2_integrated_v1a.GRCh38.20181129

Hi cflerin @cflerin ,

Yes! My ultimate target is to use freemuxlet to get the genotyping of human samples.

Thank you for this good information. I will try your VCF.
I actually downloaded 24 vcf files from the below link and merge all of them, and finally get such big 16G VCF. http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/.

Could you help me check if my procedure is correct

I will use your recommended VCF to filter my Bam file by filter_bam_file_for_popscle_dsc_pileup.sh
I am pretty new to VCF world. I guess the second step is to filter VCF. Should I use the example code here to filter 1000g VCF? Or this tutorial code is for who sequenced their own sample VCF, not 1000g VCF.
`only_keep_snps input.vcf
| compute_AF_AC_AN_values_based_on_genotype_info
| filter_out_mutations_missing_genotype_for_one_or_more_samples
| filter_out_mutations_homozygous_reference_in_all_samples
| filter_out_mutations_homozygous_in_all_samples \

output.vcf`
use sort_vcf_same_as_bam.sh to sort 1000g vcf (or the one from step 2) if they do not match bam order.
use bam from step1) and vcf from step 3) to run popscle dsc-pileup.

Thank you in advance!
Xiaofei

XiaofeiSunUCSF · 2020-09-04T22:45:52Z

@XiaofeiSunUCSF Sorting the VCF file shouldn't take that long even for such a big file. What is the exact command you run?

What is the output of grep '^#CHROM' -m1 -A 10 ${vcf_filename} and samtools view -H ${bam_file} | grep '^@SQ'? It could be that the chromosome names in your VCF file and BAM file are different.

Hi ghuls @ghuls
My code is here

module load Sali cmake
module load CBI htslib
module load CBI vcftools
module load CBI bcftools
module load CBI samtools/1.10
samtools --version
/wynton/home/malab/xiaofeisun/sort_vcf_same_as_bam.sh \
/wynton/home/malab/xiaofeisun/tmp1/cite5.bam \
/wynton/home/malab/xiaofeisun/chr/ALL.wgs.shapeit2_integrated_v1a.GRCh38.20181129.sites.vcf \
> /wynton/home/malab/xiaofeisun/tmp/CITE5.vcf
qstat -j $JOB_ID`

I checked the VCF file

#CHROM POS ID REF ALT QUAL FILTER INFO
chr1 16103 . T G . PASS AC=126;AN=5096;DP=29994;AF=0.02;EAS_AF=0;EUR_AF=0.04;AFR_AF=0.03;AMR_AF=0.03;SAS_AF=0.02;VT=SNP;NS=2548
chr1 51479 . T A . PASS AC=536;AN=5096;DP=17461;AF=0.11;EAS_AF=0;EUR_AF=0.2;AFR_AF=0.02;AMR_AF=0.12;SAS_AF=0.22;VT=SNP;NS=2548
chr1 51898 . C A . PASS AC=399;AN=5096;DP=15331;AF=0.08;EAS_AF=0.05;EUR_AF=0.12;AFR_AF=0.05;AMR_AF=0.06;SAS_AF=0.1;VT=SNP;NS=2548
chr1 51928 . G A . PASS AC=359;AN=5096;DP=17022;AF=0.07;EAS_AF=0.01;EUR_AF=0.14;AFR_AF=0.06;AMR_AF=0.07;SAS_AF=0.09;VT=SNP;NS=2548
chr1 51954 . G C . PASS AC=1;AN=5096;DP=18469;AF=0;EAS_AF=0;EUR_AF=0;AFR_AF=0;AMR_AF=0;SAS_AF=0;VT=SNP;NS=2548
chr1 54490 . G A . PASS AC=487;AN=5096;DP=24006;AF=0.1;EAS_AF=0;EUR_AF=0.18;AFR_AF=0.02;AMR_AF=0.1;SAS_AF=0.21;VT=SNP;NS=2548
chr1 54669 . C T . PASS AC=1;AN=5096;DP=23532;AF=0;EAS_AF=0;EUR_AF=0;AFR_AF=0;AMR_AF=0;SAS_AF=0;VT=SNP;NS=2548
chr1 54708 . G C . PASS AC=1167;AN=5096;DP=23694;AF=0.23;EAS_AF=0.08;EUR_AF=0.38;AFR_AF=0.18;AMR_AF=0.25;SAS_AF=0.28;VT=SNP;NS=2548
chr1 54716 . C T . PASS AC=1079;AN=5096;DP=24540;AF=0.21;EAS_AF=0.07;EUR_AF=0.34;AFR_AF=0.16;AMR_AF=0.24;SAS_AF=0.27;VT=SNP;NS=2548
chr1 54725 . T G . PASS AC=1;AN=5096;DP=25901;AF=0;EAS_AF=0;EUR_AF=0;AFR_AF=0;AMR_AF=0;SAS_AF=0;VT=SNP;NS=2548

and the BAM file

@sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq @sq SN:chr1 LN:248956422
SN:chr10 LN:133797422
SN:chr11 LN:135086622
SN:chr12 LN:133275309
SN:chr13 LN:114364328
SN:chr14 LN:107043718
SN:chr15 LN:101991189
SN:chr16 LN:90338345
SN:chr17 LN:83257441
SN:chr18 LN:80373285
SN:chr19 LN:58617616
SN:chr2 LN:242193529
SN:chr20 LN:64444167
SN:chr21 LN:46709983
SN:chr22 LN:50818468
SN:chr3 LN:198295559
SN:chr4 LN:190214555
SN:chr5 LN:181538259
SN:chr6 LN:170805979
SN:chr7 LN:159345973
SN:chr8 LN:145138636
SN:chr9 LN:138394717
SN:chrM LN:16569
SN:chrX LN:156040895
SN:chrY LN:57227415
SN:KI270728.1 LN:1872759
SN:KI270727.1 LN:448248
SN:KI270442.1 LN:392061
SN:KI270729.1 LN:280839
SN:GL000225.1 LN:211173
SN:KI270743.1 LN:210658
SN:GL000008.2 LN:209709
SN:GL000009.2 LN:201709
SN:KI270747.1 LN:198735
SN:KI270722.1 LN:194050
SN:GL000194.1 LN:191469
SN:KI270742.1 LN:186739
SN:GL000205.2 LN:185591
SN:GL000195.1 LN:182896
SN:KI270736.1 LN:181920
SN:KI270733.1 LN:179772
SN:GL000224.1 LN:179693
SN:GL000219.1 LN:179198
SN:KI270719.1 LN:176845
SN:GL000216.2 LN:176608
SN:KI270712.1 LN:176043
SN:KI270706.1 LN:175055
SN:KI270725.1 LN:172810
SN:KI270744.1 LN:168472
SN:KI270734.1 LN:165050
SN:GL000213.1 LN:164239
SN:GL000220.1 LN:161802
SN:KI270715.1 LN:161471
SN:GL000218.1 LN:161147
SN:KI270749.1 LN:158759
SN:KI270741.1 LN:157432
SN:GL000221.1 LN:155397
SN:KI270716.1 LN:153799
SN:KI270731.1 LN:150754
SN:KI270751.1 LN:150742
SN:KI270750.1 LN:148850
SN:KI270519.1 LN:138126
SN:GL000214.1 LN:137718
SN:KI270708.1 LN:127682
SN:KI270730.1 LN:112551
SN:KI270438.1 LN:112505
SN:KI270737.1 LN:103838
SN:KI270721.1 LN:100316

XiaofeiSunUCSF · 2020-09-04T23:35:24Z

@ghuls
I read one comment from statgen/popscle#25

Can I just use zcat ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz | sed '/^##contig/d' | gzip -c > ref_stripped.vcf.gz
to suppress the header and skip sort_vcf_same_as_bam.sh step?

ghuls · 2020-09-06T09:12:31Z

It looks like your BAM file and VCF file have the same chromosome names convention (like "chr" in the beginning).

Which tool generated your BAM file?

The SAM header looks a bit weird.
As far as I know the SAMtools header for chromosome information should be @SQ SN:chr1 LN:248956422 instead of @sq SN:chr1 LN:248956422. I never saw the lowercase version. sort_vcf_same_as_bam.sh explicitly checks for @SQ at line 91.

ghuls · 2020-09-07T13:27:48Z

I added a check that will print an error when '@sq' header lines are not found in the BAM header.

According to the SAM specification, lowercase header tags are reserved for local use:
https://github.com/samtools/hts-specs/blob/81d95f9e97e8b67d0d635c3015e156415c3aaf6b/SAMv1.tex#L221-L236

I think your BAM file can not be indexed or sorted by samtools as it does not have those '@sq' header lines.

cflerin · 2020-09-08T14:56:00Z

@XiaofeiSunUCSF, for freemuxlet you don't need to merge and process the full genotypes files. But I would still filter the sites VCF to remove rare variants. This is what I did:

bcftools view -r \
    chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX \
    -i 'INFO/AF[0] < 0.90 && INFO/AF[0] > 0.10' \
    ALL.wgs.shapeit2_integrated_v1a.GRCh38.20181129.sites.vcf.gz \
    -O z -o 1000Genomes.wgs.GRCH38.sites.minAF-0.1.freemuxlet.vcf.gz

I didn't need to use the sort VCF script since my bam was already in the proper sort order.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

the order to use the tools #4

the order to use the tools #4

XiaofeiSunUCSF commented Aug 31, 2020

ghuls commented Sep 1, 2020

XiaofeiSunUCSF commented Sep 1, 2020

cflerin commented Sep 4, 2020

ghuls commented Sep 4, 2020

XiaofeiSunUCSF commented Sep 4, 2020 •

edited

Loading

XiaofeiSunUCSF commented Sep 4, 2020 •

edited

Loading

XiaofeiSunUCSF commented Sep 4, 2020

ghuls commented Sep 6, 2020

ghuls commented Sep 7, 2020

cflerin commented Sep 8, 2020

the order to use the tools #4

the order to use the tools #4

Comments

XiaofeiSunUCSF commented Aug 31, 2020

ghuls commented Sep 1, 2020

XiaofeiSunUCSF commented Sep 1, 2020

cflerin commented Sep 4, 2020

ghuls commented Sep 4, 2020

XiaofeiSunUCSF commented Sep 4, 2020 • edited Loading

XiaofeiSunUCSF commented Sep 4, 2020 • edited Loading

XiaofeiSunUCSF commented Sep 4, 2020

ghuls commented Sep 6, 2020

ghuls commented Sep 7, 2020

cflerin commented Sep 8, 2020

XiaofeiSunUCSF commented Sep 4, 2020 •

edited

Loading

XiaofeiSunUCSF commented Sep 4, 2020 •

edited

Loading