Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

the order to use the tools #4

Open
XiaofeiSunUCSF opened this issue Aug 31, 2020 · 10 comments
Open

the order to use the tools #4

XiaofeiSunUCSF opened this issue Aug 31, 2020 · 10 comments

Comments

@XiaofeiSunUCSF
Copy link

Hi,

Thank you for the wonderful tools.
Do you have any recommended order to use those three tools (filter.bam, filter.vcf, sort.vcf)?

Thanks
Xiaofei

@ghuls
Copy link
Member

ghuls commented Sep 1, 2020

It depends a lot on your input VCF file.

Sort VCF file in the same order as the BAM file is not always needed, but only if popscle dsc-pileup gives this error" [E:%s] Your VCF/BCF files and SAM/BAM/CRAM files have different ordering of chromosomes. SAM/BAM/CRAM file has %s before %s, but VCF/BCF file has %s after %s". Altough it will never hurt to run this as first step and use the output VCF file for the Filter BAM file for usage with popscle dsc-pileup step.

The filter VCF file steps are only needed if you want to create your own specific VCF file (e.g. in case you can't use the 1000 genome VCF file as you have a different species or you have the mutations for all the cell lines you are interested in).

@XiaofeiSunUCSF
Copy link
Author

It depends a lot on your input VCF file.

Sort VCF file in the same order as the BAM file is not always needed, but only if popscle dsc-pileup gives this error" [E:%s] Your VCF/BCF files and SAM/BAM/CRAM files have different ordering of chromosomes. SAM/BAM/CRAM file has %s before %s, but VCF/BCF file has %s after %s". Altough it will never hurt to run this as first step and use the output VCF file for the Filter BAM file for usage with popscle dsc-pileup step.

The filter VCF file steps are only needed if you want to create your own specific VCF file (e.g. in case you can't use the 1000 genome VCF file as you have a different species or you have the mutations for all the cell lines you are interested in).

Hi,

Thank you for your very clear answer.
I Filter BAM by using your "filter_bam"method . Then run into sort issue when run popscle dsc-pileup. I come back and using your "sort vcf" method.

One quick question regarding sort vcf. Do you have any idea how long it usually takes? I have one 1000g vcf that is roughly 16G that has been running for 48h (I only get around 4Mb output till now). I can start a new job with a longer time before HPC abort this job. Do you have idea/suggestion to run it a little bit faster?

Cheers,
Xiaofei

@cflerin
Copy link

cflerin commented Sep 4, 2020

Hi @XiaofeiSunUCSF , maybe you're using the wrong 1000 Genomes VCF? You can use the VCF with the full set of sites, but without genotypes (assuming this is for freemuxlet). For GRCh38 I used:

http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20181203_biallelic_SNV/ALL.wgs.shapeit2_integrated_v1a.GRCh38.20181129.sites.vcf.gz

and it's only ~850 MB. Filtering to remove rare variants reduced it to <100 MB. For me this ran quite fast.

@ghuls
Copy link
Member

ghuls commented Sep 4, 2020

@XiaofeiSunUCSF Sorting the VCF file shouldn't take that long even for such a big file. What is the exact command you run?

What is the output of grep '^#CHROM' -m1 -A 10 ${vcf_filename} and samtools view -H ${bam_file} | grep '^@SQ'? It could be that the chromosome names in your VCF file and BAM file are different.

@XiaofeiSunUCSF
Copy link
Author

XiaofeiSunUCSF commented Sep 4, 2020

ALL.wgs.shapeit2_integrated_v1a.GRCh38.20181129

Hi cflerin @cflerin ,

Yes! My ultimate target is to use freemuxlet to get the genotyping of human samples.

Thank you for this good information. I will try your VCF.
I actually downloaded 24 vcf files from the below link and merge all of them, and finally get such big 16G VCF. http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/.

Could you help me check if my procedure is correct

  1. I will use your recommended VCF to filter my Bam file by filter_bam_file_for_popscle_dsc_pileup.sh
  2. I am pretty new to VCF world. I guess the second step is to filter VCF. Should I use the example code here to filter 1000g VCF? Or this tutorial code is for who sequenced their own sample VCF, not 1000g VCF.
    `only_keep_snps input.vcf
    | compute_AF_AC_AN_values_based_on_genotype_info
    | filter_out_mutations_missing_genotype_for_one_or_more_samples
    | filter_out_mutations_homozygous_reference_in_all_samples
    | filter_out_mutations_homozygous_in_all_samples \

    output.vcf`

  3. use sort_vcf_same_as_bam.sh to sort 1000g vcf (or the one from step 2) if they do not match bam order.
  4. use bam from step1) and vcf from step 3) to run popscle dsc-pileup.

Thank you in advance!
Xiaofei

@XiaofeiSunUCSF
Copy link
Author

XiaofeiSunUCSF commented Sep 4, 2020

@XiaofeiSunUCSF Sorting the VCF file shouldn't take that long even for such a big file. What is the exact command you run?

What is the output of grep '^#CHROM' -m1 -A 10 ${vcf_filename} and samtools view -H ${bam_file} | grep '^@SQ'? It could be that the chromosome names in your VCF file and BAM file are different.

Hi ghuls @ghuls
My code is here

module load Sali cmake
module load CBI htslib
module load CBI vcftools
module load CBI bcftools
module load CBI samtools/1.10
samtools --version
/wynton/home/malab/xiaofeisun/sort_vcf_same_as_bam.sh \
/wynton/home/malab/xiaofeisun/tmp1/cite5.bam \
/wynton/home/malab/xiaofeisun/chr/ALL.wgs.shapeit2_integrated_v1a.GRCh38.20181129.sites.vcf \
> /wynton/home/malab/xiaofeisun/tmp/CITE5.vcf
qstat -j $JOB_ID`

I checked the VCF file

#CHROM POS ID REF ALT QUAL FILTER INFO
chr1 16103 . T G . PASS AC=126;AN=5096;DP=29994;AF=0.02;EAS_AF=0;EUR_AF=0.04;AFR_AF=0.03;AMR_AF=0.03;SAS_AF=0.02;VT=SNP;NS=2548
chr1 51479 . T A . PASS AC=536;AN=5096;DP=17461;AF=0.11;EAS_AF=0;EUR_AF=0.2;AFR_AF=0.02;AMR_AF=0.12;SAS_AF=0.22;VT=SNP;NS=2548
chr1 51898 . C A . PASS AC=399;AN=5096;DP=15331;AF=0.08;EAS_AF=0.05;EUR_AF=0.12;AFR_AF=0.05;AMR_AF=0.06;SAS_AF=0.1;VT=SNP;NS=2548
chr1 51928 . G A . PASS AC=359;AN=5096;DP=17022;AF=0.07;EAS_AF=0.01;EUR_AF=0.14;AFR_AF=0.06;AMR_AF=0.07;SAS_AF=0.09;VT=SNP;NS=2548
chr1 51954 . G C . PASS AC=1;AN=5096;DP=18469;AF=0;EAS_AF=0;EUR_AF=0;AFR_AF=0;AMR_AF=0;SAS_AF=0;VT=SNP;NS=2548
chr1 54490 . G A . PASS AC=487;AN=5096;DP=24006;AF=0.1;EAS_AF=0;EUR_AF=0.18;AFR_AF=0.02;AMR_AF=0.1;SAS_AF=0.21;VT=SNP;NS=2548
chr1 54669 . C T . PASS AC=1;AN=5096;DP=23532;AF=0;EAS_AF=0;EUR_AF=0;AFR_AF=0;AMR_AF=0;SAS_AF=0;VT=SNP;NS=2548
chr1 54708 . G C . PASS AC=1167;AN=5096;DP=23694;AF=0.23;EAS_AF=0.08;EUR_AF=0.38;AFR_AF=0.18;AMR_AF=0.25;SAS_AF=0.28;VT=SNP;NS=2548
chr1 54716 . C T . PASS AC=1079;AN=5096;DP=24540;AF=0.21;EAS_AF=0.07;EUR_AF=0.34;AFR_AF=0.16;AMR_AF=0.24;SAS_AF=0.27;VT=SNP;NS=2548
chr1 54725 . T G . PASS AC=1;AN=5096;DP=25901;AF=0;EAS_AF=0;EUR_AF=0;AFR_AF=0;AMR_AF=0;SAS_AF=0;VT=SNP;NS=2548

and the BAM file

@sq SN:chr1 LN:248956422
@sq SN:chr10 LN:133797422
@sq SN:chr11 LN:135086622
@sq SN:chr12 LN:133275309
@sq SN:chr13 LN:114364328
@sq SN:chr14 LN:107043718
@sq SN:chr15 LN:101991189
@sq SN:chr16 LN:90338345
@sq SN:chr17 LN:83257441
@sq SN:chr18 LN:80373285
@sq SN:chr19 LN:58617616
@sq SN:chr2 LN:242193529
@sq SN:chr20 LN:64444167
@sq SN:chr21 LN:46709983
@sq SN:chr22 LN:50818468
@sq SN:chr3 LN:198295559
@sq SN:chr4 LN:190214555
@sq SN:chr5 LN:181538259
@sq SN:chr6 LN:170805979
@sq SN:chr7 LN:159345973
@sq SN:chr8 LN:145138636
@sq SN:chr9 LN:138394717
@sq SN:chrM LN:16569
@sq SN:chrX LN:156040895
@sq SN:chrY LN:57227415
@sq SN:KI270728.1 LN:1872759
@sq SN:KI270727.1 LN:448248
@sq SN:KI270442.1 LN:392061
@sq SN:KI270729.1 LN:280839
@sq SN:GL000225.1 LN:211173
@sq SN:KI270743.1 LN:210658
@sq SN:GL000008.2 LN:209709
@sq SN:GL000009.2 LN:201709
@sq SN:KI270747.1 LN:198735
@sq SN:KI270722.1 LN:194050
@sq SN:GL000194.1 LN:191469
@sq SN:KI270742.1 LN:186739
@sq SN:GL000205.2 LN:185591
@sq SN:GL000195.1 LN:182896
@sq SN:KI270736.1 LN:181920
@sq SN:KI270733.1 LN:179772
@sq SN:GL000224.1 LN:179693
@sq SN:GL000219.1 LN:179198
@sq SN:KI270719.1 LN:176845
@sq SN:GL000216.2 LN:176608
@sq SN:KI270712.1 LN:176043
@sq SN:KI270706.1 LN:175055
@sq SN:KI270725.1 LN:172810
@sq SN:KI270744.1 LN:168472
@sq SN:KI270734.1 LN:165050
@sq SN:GL000213.1 LN:164239
@sq SN:GL000220.1 LN:161802
@sq SN:KI270715.1 LN:161471
@sq SN:GL000218.1 LN:161147
@sq SN:KI270749.1 LN:158759
@sq SN:KI270741.1 LN:157432
@sq SN:GL000221.1 LN:155397
@sq SN:KI270716.1 LN:153799
@sq SN:KI270731.1 LN:150754
@sq SN:KI270751.1 LN:150742
@sq SN:KI270750.1 LN:148850
@sq SN:KI270519.1 LN:138126
@sq SN:GL000214.1 LN:137718
@sq SN:KI270708.1 LN:127682
@sq SN:KI270730.1 LN:112551
@sq SN:KI270438.1 LN:112505
@sq SN:KI270737.1 LN:103838
@sq SN:KI270721.1 LN:100316

@XiaofeiSunUCSF
Copy link
Author

@ghuls
I read one comment from statgen/popscle#25

Can I just use zcat ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz | sed '/^##contig/d' | gzip -c > ref_stripped.vcf.gz
to suppress the header and skip sort_vcf_same_as_bam.sh step?

@ghuls
Copy link
Member

ghuls commented Sep 6, 2020

It looks like your BAM file and VCF file have the same chromosome names convention (like "chr" in the beginning).

Which tool generated your BAM file?

The SAM header looks a bit weird.
As far as I know the SAMtools header for chromosome information should be @SQ SN:chr1 LN:248956422 instead of @sq SN:chr1 LN:248956422. I never saw the lowercase version. sort_vcf_same_as_bam.sh explicitly checks for @SQ at line 91.

@ghuls
Copy link
Member

ghuls commented Sep 7, 2020

I added a check that will print an error when '@sq' header lines are not found in the BAM header.

According to the SAM specification, lowercase header tags are reserved for local use:
https://github.com/samtools/hts-specs/blob/81d95f9e97e8b67d0d635c3015e156415c3aaf6b/SAMv1.tex#L221-L236

I think your BAM file can not be indexed or sorted by samtools as it does not have those '@sq' header lines.

@cflerin
Copy link

cflerin commented Sep 8, 2020

@XiaofeiSunUCSF, for freemuxlet you don't need to merge and process the full genotypes files. But I would still filter the sites VCF to remove rare variants. This is what I did:

bcftools view -r \
    chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX \
    -i 'INFO/AF[0] < 0.90 && INFO/AF[0] > 0.10' \
    ALL.wgs.shapeit2_integrated_v1a.GRCh38.20181129.sites.vcf.gz \
    -O z -o 1000Genomes.wgs.GRCH38.sites.minAF-0.1.freemuxlet.vcf.gz

I didn't need to use the sort VCF script since my bam was already in the proper sort order.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants