Commit
Showing 9 changed files with 315 additions and 1 deletion.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,50 @@ | ||
# File Formats | ||
|
||
Different file formats in bioinformatics and what they're used for | ||
|
||
## FASTA (.fa) | ||
|
||
Plain-text nucleotide or protein sequences. Each record is a `>` header line followed by one or more lines of sequence.
|
||
[Example](../examples/mygene.fasta) | ||
|
||
## .fai | ||
|
||
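FASTA index file generated by `samtools faidx` for fast random access to specific regions. Each row has five tab-separated columns: sequence name, sequence length, byte offset of the first base, bases per line, and bytes per line (including the newline).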
|
||
[Example: index entries for the hg19 reference sequences](../examples/human_g1k_v37.fasta.fai)
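
To make the index columns concrete, here is a minimal sketch of the arithmetic a tool can do with one `.fai` row to find where a base lives in the FASTA file. The row is the chromosome 20 entry from the example index; the function name `fai_offset` is made up for illustration and is not part of `samtools`.

```bash
# Chromosome 20 row from ../examples/human_g1k_v37.fasta.fai:
# name, length, byte offset of first base, bases per line, bytes per line.
fai_row='20 63025520 2763883926 60 61'

# Byte offset of a 1-based position (hypothetical helper, for illustration).
fai_offset() {
  pos=$1
  # Split the row into $1..$5.
  set -- $fai_row
  zero=$(( pos - 1 ))
  # Start offset + full lines skipped * bytes per line + remainder in the line.
  echo $(( $3 + zero / $4 * $5 + zero % $4 ))
}

fai_offset 1    # prints 2763883926 (first base of chromosome 20)
fai_offset 61   # prints 2763883987 (first base on the second line)
```

The extra byte per line (61 versus 60) is the newline, which is why position 61 sits 61 bytes after position 1 rather than 60.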
|
||
## FASTQ (.fq) | ||
|
||
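FASTA with a quality score attached to every base. The quality line encodes one [Phred quality score](https://en.wikipedia.org/wiki/Phred_quality_score) per base as an ASCII character; current Sanger and Illumina (1.8+) files use an offset of 33, while some older Illumina files used 64.

A Phred score Q represents the probability P that the base was called incorrectly by the sequencer: Q = -10 log10(P), so Q = 20 means a 1 in 100 chance of error.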
|
||
```fastq | ||
@SEQ_ID | ||
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT | ||
+ | ||
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65 | ||
``` | ||
|
||
[Example](../examples/HI.4019.002.index_7.ANN0831_R1.fastq) | ||
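
As a rough illustration of how the quality line maps to error probabilities, the sketch below decodes the quality string from the record above. It assumes the Phred+33 encoding; the function name `phred_scores` is made up for this example.

```bash
# Quality line copied from the FASTQ record above; the quoted here-doc keeps
# the shell from interpreting any of its characters.
read -r qual <<'EOF'
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
EOF

# Print one Phred score per base, assuming the Phred+33 encoding.
phred_scores() {
  # od prints each byte as an unsigned decimal; subtracting 33 gives the score.
  printf '%s' "$1" | od -An -v -tu1 |
    awk '{ for (i = 1; i <= NF; i++) print $i - 33 }'
}

# Show each score next to its error probability, P = 10^(-Q/10).
phred_scores "$qual" |
  awk '{ printf "Q=%d  P(error)=%.4f\n", $1, 10 ^ (-$1 / 10) }'
```

The first character `!` has ASCII code 33 and so decodes to Q = 0, the lowest possible score; the final `5` decodes to Q = 20.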
|
||
## Alignment Formats | ||
|
||
### SAM / BAM (Sequence/Binary Alignment Map) | ||
|
||
SAM is a tab-delimited text format that records how each read aligns to a reference sequence. BAM is the compressed binary equivalent of SAM.
|
||
[example sam](../examples/toy.sam) | ||
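
Two SAM columns are hard to read by eye: FLAG (a bit field) and CIGAR (the alignment operations). The sketch below decodes both for read `r001` from the example file. The function names are made up for illustration, and only a subset of the flag bits is reported.

```bash
# Report some of the FLAG bits defined by the SAM specification.
describe_flag() {
  flag=$1
  if [ $(( flag & 1 )) -ne 0 ]; then echo "paired"; fi
  if [ $(( flag & 2 )) -ne 0 ]; then echo "proper pair"; fi
  if [ $(( flag & 4 )) -ne 0 ]; then echo "unmapped"; fi
  if [ $(( flag & 16 )) -ne 0 ]; then echo "reverse strand"; fi
  if [ $(( flag & 32 )) -ne 0 ]; then echo "mate on reverse strand"; fi
  if [ $(( flag & 64 )) -ne 0 ]; then echo "first in pair"; fi
  if [ $(( flag & 128 )) -ne 0 ]; then echo "second in pair"; fi
}

# Print how many reference bases and how many read bases a CIGAR covers.
cigar_lengths() {
  echo "$1" | awk '{
    cigar = $0; ref = 0; query = 0
    while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
      n  = substr(cigar, 1, RLENGTH - 1) + 0
      op = substr(cigar, RLENGTH, 1)
      if (op ~ /[MDN=X]/) ref += n      # operations that consume the reference
      if (op ~ /[MIS=X]/) query += n    # operations that consume the read
      cigar = substr(cigar, RLENGTH + 1)
    }
    print ref, query
  }'
}

# r001 in ../examples/toy.sam has FLAG 163 and CIGAR 8M4I4M1D3M.
describe_flag 163
cigar_lengths 8M4I4M1D3M   # prints "16 19"
```

The read length of 19 matches the 19 bases in the SEQ column for `r001`, which is a quick consistency check on any SAM record.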
|
||
### CRAM (Compressed Reference-oriented Alignment Map) | ||
|
||
Reference-based compressed alignment format: reads are stored as differences from the reference, so files are usually smaller than the equivalent BAM.
|
||
## VCF (Variant Call Format) / BCF (binary format) | ||
|
||
Represents sequence variation relative to a reference (e.g. SNPs, indels, structural variants). BCF is the compressed binary equivalent.
|
||
## PDB (protein data bank format) | ||
|
||
Represents 3D structures of molecules / proteins
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,55 @@ | ||
# Steps for manipulating a genome and producing output files | ||
|
||
https://www.biostars.org/p/150010/ (thanks jdimatteo) | ||
|
||
https://web.archive.org/web/20161010092833/http://biobits.org/samtools_primer.html | ||
|
||
https://www.melbournebioinformatics.org.au/tutorials/ | ||
|
||
1. Download hg19 Reference Genome | ||
|
||
```bash | ||
curl -O http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.fai | ||
curl -O http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz | ||
gunzip human_g1k_v37.fasta.gz | ||
``` | ||
|
||
2. Extract a single chromosome with `samtools` (a suite of tools for working with these file formats) and build an aligner index for it

`samtools faidx` uses the `.fai` index file to pull out just chromosome 20.

[Bowtie2](https://github.com/BenLangmead/bowtie2) is a read aligner for aligning sequencing reads to a reference; `bowtie2-build` creates the index it aligns against.
`minimap2` is a more recent alternative aligner, designed mainly for long reads but also able to map short reads.
|
||
```bash | ||
samtools faidx human_g1k_v37.fasta 20 > human_g1k_v37_chr20.fasta | ||
bowtie2-build human_g1k_v37_chr20.fasta homo_chr20 | ||
``` | ||
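
For intuition about what extracting one sequence means, here is a small sketch that pulls a single record out of a FASTA stream with a linear scan. This is not how `samtools faidx` works internally (it seeks straight to the byte offset stored in the `.fai`); the function names are made up, and the data is the two-record toy FASTA included in this commit.

```bash
# Two-record toy FASTA from the examples in this commit.
toy_fasta() {
  cat <<'EOF'
>ref
AGCATGTTAGATAAGATAGCTGTGCTAGTAGGCAGTCAGCGCCAT
>ref2
aggttttataaaacaattaagtctacagagcaactacgcg
EOF
}

# Print only the record whose header names the wanted sequence (reads stdin).
extract_record() {
  awk -v want="$1" '
    /^>/ { name = substr($1, 2); keep = (name == want) }
    keep
  '
}

toy_fasta | extract_record ref2
```

A scan like this has to read the whole file, which is slow for a 3 GB genome and is the reason the index exists.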
|
||
3. Simulate a read sample | ||
|
||
[wgsim](https://github.com/lh3/wgsim) simulates sequence reads from a reference genome. | ||
It simulates diploid genomes with SNPs and indel polymorphisms; `-N 1` below asks for a single read pair.
|
||
```bash | ||
wgsim -N 1 human_g1k_v37_chr20.fasta single.read1.fq single.read2.fq > wgsim.out | ||
``` | ||
|
||
4. Align the reads against the chromosome 20 index to generate the SAM
|
||
```bash | ||
bowtie2 homo_chr20 -1 single.read1.fq -2 single.read2.fq -S single_pair.sam | ||
``` | ||
|
||
5. Convert the SAM to BAM
|
||
```bash | ||
samtools view -b -S -o single_pair.bam single_pair.sam | ||
``` | ||
|
||
6. Sort and index it | ||
|
||
```bash | ||
# samtools 1.3 and later take the output file via -o; older releases took an output prefix instead
samtools sort -o single_pair.sorted.bam single_pair.bam
samtools index single_pair.sorted.bam
``` |
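
By default `samtools sort` orders records by reference sequence and then by leftmost position, which is what `samtools index` expects. The sketch below checks that property on SAM text. It is a simplified check under stated assumptions: the function names are made up, the records are the first four columns of a few lines from `../examples/toy.sam`, and it does not verify that references appear in the same order as the `@SQ` header lines.

```bash
# QNAME, FLAG, RNAME, POS of some records from ../examples/toy.sam, in file order.
toy_records() {
  cat <<'EOF'
r001 163 ref 7
r002 0 ref 9
r003 0 ref 9
r004 0 ref 16
r003 16 ref 29
r001 83 ref 37
x1 0 ref2 1
x2 0 ref2 2
EOF
}

# Succeed only if records are grouped by reference and positions never
# decrease within a reference (reads SAM text on stdin).
is_coordinate_sorted() {
  awk '
    /^@/ { next }                                  # skip header lines
    $3 != cur { if ($3 in seen) bad = 1; seen[$3] = 1; cur = $3; last = 0 }
    ($4 + 0) < last { bad = 1 }
    { last = $4 + 0 }
    END { exit bad + 0 }
  '
}

if toy_records | is_coordinate_sorted; then echo "coordinate-sorted"; fi
```

With a real BAM, the same check could be fed from `samtools view single_pair.sorted.bam`.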
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,56 @@ | ||
>seq1 | ||
CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGTCCATGGCCCAGCATTAGGGAGCT | ||
GTGGACCCTGCAGCCTGGCTGTGGGGGCCGCAGTGGCTGAGGGGTGCAGAGCCGAGTCAC | ||
GGGGTTGCCAGCACAGGGGCTTAACCTCTGGTGACTGCCAGAGCTGCTGGCAAGCTAGAG | ||
TCCCATTTGGAGCCCCTCTAAGCCGTTCTATTTGTAATGAAAACTATATTTATGCTATTC | ||
AGTTCTAAATATAGAAATTGAAACAGCTGTGTTTAGTGCCTTTGTTCAACCCCCTTGCAA | ||
CAACCTTGAGAACCCCAGGGAATTTGTCAATGTCAGGGAAGGAGCATTTTGTCAGTTACC | ||
AAATGTGTTTATTACCAGAGGGATGGAGGGAAGAGGGACGCTGAAGAACTTTGATGCCCT | ||
CTTCTTCCAAAGATGAAACGCGTAACTGCGCTCTCATTCACTCCAGCTCCCTGTCACCCA | ||
ATGGACCTGTGATATCTGGATTCTGGGAAATTCTTCATCCTGGACCCTGAGAGATTCTGC | ||
AGCCCAGCTCCAGATTGCTTGTGGTCTGACAGGCTGCAACTGTGAGCCATCACAATGAAC | ||
AACAGGAAGAAAAGGTCTTTCAAAAGGTGATGTGTGTTCTCATCAACCTCATACACACAC | ||
ATGGTTTAGGGGTATAATACCTCTACATGGCTGATTATGAAAACAATGTTCCCCAGATAC | ||
CATCCCTGTCTTACTTCCAGCTCCCCAGAGGGAAAGCTTTCAACGCTTCTAGCCATTTCT | ||
TTTGGCATTTGCCTTCAGACCCTACACGAATGCGTCTCTACCACAGGGGGCTGCGCGGTT | ||
TCCCATCATGAAGCACTGAACTTCCACGTCTCATCTAGGGGAACAGGGAGGTGCACTAAT | ||
GCGCTCCACGCCCAAGCCCTTCTCACAGTTTCTGCCCCCAGCATGGTTGTACTGGGCAAT | ||
ACATGAGATTATTAGGAAATGCTTTACTGTCATAACTATGAAGAGACTATTGCCAGATGA | ||
ACCACACATTAATACTATGTTTCTTATCTGCACATTACTACCCTGCAATTAATATAATTG | ||
TGTCCATGTACACACGCTGTCCTATGTACTTATCATGACTCTATCCCAAATTCCCAATTA | ||
CGTCCTATCTTCTTCTTAGGGAAGAACAGCTTAGGTATCAATTTGGTGTTCTGTGTAAAG | ||
TCTCAGGGAGCCGTCCGTGTCCTCCCATCTGGCCTCGTCCACACTGGTTCTCTTGAAAGC | ||
TTGGGCTGTAATGATGCCCCTTGGCCATCACCCAGTCCCTGCCCCATCTCTTGTAATCTC | ||
TCTCCTTTTTGCTGCATCCCTGTCTTCCTCTGTCTTGATTTACTTGTTGTTGGTTTTCTG | ||
TTTCTTTGTTTGATTTGGTGGAAGACATAATCCCACGCTTCCTATGGAAAGGTTGTTGGG | ||
AGATTTTTAATGATTCCTCAATGTTAAAATGTCTATTTTTGTCTTGACACCCAACTAATA | ||
TTTGTCTGAGCAAAACAGTCTAGATGAGAGAGAACTTCCCTGGAGGTCTGATGGCGTTTC | ||
TCCCTCGTCTTCTTA | ||
>seq2 | ||
TTCAAATGAACTTCTGTAATTGAAAAATTCATTTAAGAAATTACAAAATATAGTTGAAAG | ||
CTCTAACAATAGACTAAACCAAGCAGAAGAAAGAGGTTCAGAACTTGAAGACAAGTCTCT | ||
TATGAATTAACCCAGTCAGACAAAAATAAAGAAAAAAATTTTAAAAATGAACAGAGCTTT | ||
CAAGAAGTATGAGATTATGTAAAGTAACTGAACCTATGAGTCACAGGTATTCCTGAGGAA | ||
AAAGAAAAAGTGAGAAGTTTGGAAAAACTATTTGAGGAAGTAATTGGGGAAAACCTCTTT | ||
AGTCTTGCTAGAGATTTAGACATCTAAATGAAAGAGGCTCAAAGAATGCCAGGAAGATAC | ||
ATTGCAAGACAGACTTCATCAAGATATGTAGTCATCAGACTATCTAAAGTCAACATGAAG | ||
GAAAAAAATTCTAAAATCAGCAAGAGAAAAGCATACAGTCATCTATAAAGGAAATCCCAT | ||
CAGAATAACAATGGGCTTCTCAGCAGAAACCTTACAAGCCAGAAGAGATTGGATCTAATT | ||
TTTGGACTTCTTAAAGAAAAAAAAACCTGTCAAACACGAATGTTATGCCCTGCTAAACTA | ||
AGCATCATAAATGAAGGGGAAATAAAGTCAAGTCTTTCCTGACAAGCAAATGCTAAGATA | ||
ATTCATCATCACTAAACCAGTCCTATAAGAAATGCTCAAAAGAATTGTAAAAGTCAAAAT | ||
TAAAGTTCAATACTCACCATCATAAATACACACAAAAGTACAAAACTCACAGGTTTTATA | ||
AAACAATTGAGACTACAGAGCAACTAGGTAAAAAATTAACATTACAACAGGAACAAAACC | ||
TCATATATCAATATTAACTTTGAATAAAAAGGGATTAAATTCCCCCACTTAAGAGATATA | ||
GATTGGCAGAACAGATTTAAAAACATGAACTAACTATATGCTGTTTACAAGAAACTCATT | ||
AATAAAGACATGAGTTCAGGTAAAGGGGTGGAAAAAGATGTTCTACGCAAACAGAAACCA | ||
AATGAGAGAAGGAGTAGCTATACTTATATCAGATAAAGCACACTTTAAATCAACAACAGT | ||
AAAATAAAACAAAGGAGGTCATCATACAATGATAAAAAGATCAATTCAGCAAGAAGATAT | ||
AACCATCCTACTAAATACATATGCACCTAACACAAGACTACCCAGATTCATAAAACAAAT | ||
ACTACTAGACCTAAGAGGGATGAGAAATTACCTAATTGGTACAATGTACAATATTCTGAT | ||
GATGGTTACACTAAAAGCCCATACTTTACTGCTACTCAATATATCCATGTAACAAATCTG | ||
CGCTTGTACTTCTAAATCTATAAAAAAATTAAAATTTAACAAAAGTAAATAAAACACATA | ||
GCTAAAACTAAAAAAGCAAAAACAAAAACTATGCTAAGTATTGGTAAAGATGTGGGGAAA | ||
AAAGTAAACTCTCAAATATTGCTAGTGGGAGTATAAATTGTTTTCCACTTTGGAAAACAA | ||
TTTGGTAATTTCGTTTTTTTTTTTTTCTTTTCTCTTTTTTTTTTTTTTTTTTTTGCATGC | ||
CAGAAAAAAATATTTACAGTAACT |
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,43 @@ | ||
Comes from https://github.com/samtools/samtools/blob/develop/examples | ||
|
||
File ex1.fa contains two sequences cut from the human genome | ||
build36. They were extracted with command: | ||
|
||
samtools faidx human_b36.fa 2:2043966-2045540 20:67967-69550 | ||
|
||
Sequence names were changed manually for simplicity. File ex1.sam.gz | ||
contains MAQ alignments extracted with: | ||
|
||
(samtools view NA18507_maq.bam 2:2044001-2045500; | ||
samtools view NA18507_maq.bam 20:68001-69500) | ||
|
||
and processed with `samtools fixmate' to make it self-consistent as a | ||
standalone alignment. | ||
|
||
To try samtools, you may run the following commands. | ||
|
||
Index the reference FASTA. | ||
samtools faidx ex1.fa | ||
|
||
Convert the (headerless) SAM file to BAM. Note if we had used | ||
"samtools view -h" above to create the ex1.sam.gz then we could omit the | ||
"-t ex1.fa.fai" option here. | ||
samtools view -S -b -t ex1.fa.fai -o ex1.bam ex1.sam.gz | ||
|
||
Build an index for the BAM file: | ||
samtools index ex1.bam | ||
|
||
View a portion of the BAM file: | ||
samtools view ex1.bam seq2:450-550 | ||
|
||
Visually inspect the alignments at the same location: | ||
samtools tview -p seq2:450 ex1.bam ex1.fa | ||
|
||
View the data in pileup format: | ||
samtools mpileup -f ex1.fa ex1.bam | ||
|
||
Generate an uncompressed VCF file of variants: | ||
samtools mpileup -vu -f ex1.fa ex1.bam > ex1.vcf | ||
|
||
Generate a BCF file (the compressed binary form of VCF) of variants:
samtools mpileup -g -f ex1.fa ex1.bam > ex1.bcf |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,84 @@ | ||
1 249250621 52 60 61 | ||
2 243199373 253404903 60 61 | ||
3 198022430 500657651 60 61 | ||
4 191154276 701980507 60 61 | ||
5 180915260 896320740 60 61 | ||
6 171115067 1080251307 60 61 | ||
7 159138663 1254218344 60 61 | ||
8 146364022 1416009371 60 61 | ||
9 141213431 1564812846 60 61 | ||
10 135534747 1708379889 60 61 | ||
11 135006516 1846173603 60 61 | ||
12 133851895 1983430282 60 61 | ||
13 115169878 2119513096 60 61 | ||
14 107349540 2236602526 60 61 | ||
15 102531392 2345741279 60 61 | ||
16 90354753 2449981581 60 61 | ||
17 81195210 2541842300 60 61 | ||
18 78077248 2624390817 60 61 | ||
19 59128983 2703769406 60 61 | ||
20 63025520 2763883926 60 61 | ||
21 48129895 2827959925 60 61 | ||
22 51304566 2876892038 60 61 | ||
X 155270560 2929051733 60 61 | ||
Y 59373566 3086910193 60 61 | ||
MT 16569 3147273397 70 71 | ||
GL000207.1 4262 3147290265 60 61 | ||
GL000226.1 15008 3147294661 60 61 | ||
GL000229.1 19913 3147309982 60 61 | ||
GL000231.1 27386 3147330289 60 61 | ||
GL000210.1 27682 3147358194 60 61 | ||
GL000239.1 33824 3147386400 60 61 | ||
GL000235.1 34474 3147420850 60 61 | ||
GL000201.1 36148 3147455961 60 61 | ||
GL000247.1 36422 3147492774 60 61 | ||
GL000245.1 36651 3147529866 60 61 | ||
GL000197.1 37175 3147567190 60 61 | ||
GL000203.1 37498 3147605047 60 61 | ||
GL000246.1 38154 3147643232 60 61 | ||
GL000249.1 38502 3147682084 60 61 | ||
GL000196.1 38914 3147721290 60 61 | ||
GL000248.1 39786 3147760915 60 61 | ||
GL000244.1 39929 3147801427 60 61 | ||
GL000238.1 39939 3147842084 60 61 | ||
GL000202.1 40103 3147882751 60 61 | ||
GL000234.1 40531 3147923585 60 61 | ||
GL000232.1 40652 3147964854 60 61 | ||
GL000206.1 41001 3148006246 60 61 | ||
GL000240.1 41933 3148047993 60 61 | ||
GL000236.1 41934 3148090687 60 61 | ||
GL000241.1 42152 3148133382 60 61 | ||
GL000243.1 43341 3148176299 60 61 | ||
GL000242.1 43523 3148220425 60 61 | ||
GL000230.1 43691 3148264736 60 61 | ||
GL000237.1 45867 3148309218 60 61 | ||
GL000233.1 45941 3148355912 60 61 | ||
GL000204.1 81310 3148402681 60 61 | ||
GL000198.1 90085 3148485409 60 61 | ||
GL000208.1 92689 3148577058 60 61 | ||
GL000191.1 106433 3148671355 60 61 | ||
GL000227.1 128374 3148779625 60 61 | ||
GL000228.1 129120 3148910202 60 61 | ||
GL000214.1 137718 3149041537 60 61 | ||
GL000221.1 155397 3149181614 60 61 | ||
GL000209.1 159169 3149339664 60 61 | ||
GL000218.1 161147 3149501549 60 61 | ||
GL000220.1 161802 3149665445 60 61 | ||
GL000213.1 164239 3149830007 60 61 | ||
GL000211.1 166566 3149997047 60 61 | ||
GL000199.1 169874 3150166453 60 61 | ||
GL000217.1 172149 3150339222 60 61 | ||
GL000216.1 172294 3150514304 60 61 | ||
GL000215.1 172545 3150689533 60 61 | ||
GL000205.1 174588 3150865017 60 61 | ||
GL000219.1 179198 3151042578 60 61 | ||
GL000224.1 179693 3151224826 60 61 | ||
GL000223.1 180455 3151407577 60 61 | ||
GL000195.1 182896 3151591103 60 61 | ||
GL000212.1 186858 3151777111 60 61 | ||
GL000222.1 186861 3151967147 60 61 | ||
GL000200.1 187035 3152157186 60 61 | ||
GL000193.1 189789 3152347402 60 61 | ||
GL000194.1 191469 3152540418 60 61 | ||
GL000225.1 211173 3152735142 60 61 | ||
GL000192.1 547496 3152949898 60 61 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
>ref | ||
AGCATGTTAGATAAGATAGCTGTGCTAGTAGGCAGTCAGCGCCAT | ||
>ref2 | ||
aggttttataaaacaattaagtctacagagcaactacgcg |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
@SQ SN:ref LN:45 | ||
@SQ SN:ref2 LN:40 | ||
r001 163 ref 7 30 8M4I4M1D3M = 37 39 TTAGATAAAGAGGATACTG * XX:B:S,12561,2,20,112 | ||
r002 0 ref 9 30 1S2I6M1P1I1P1I4M2I * 0 0 AAAAGATAAGGGATAAA * | ||
r003 0 ref 9 30 5H6M * 0 0 AGCTAA * | ||
r004 0 ref 16 30 6M14N1I5M * 0 0 ATAGCTCTCAGC * | ||
r003 16 ref 29 30 6H5M * 0 0 TAGGC * | ||
r001 83 ref 37 30 9M = 7 -39 CAGCGCCAT * | ||
x1 0 ref2 1 30 20M * 0 0 aggttttataaaacaaataa ???????????????????? | ||
x2 0 ref2 2 30 21M * 0 0 ggttttataaaacaaataatt ????????????????????? | ||
x3 0 ref2 6 30 9M4I13M * 0 0 ttataaaacAAATaattaagtctaca ?????????????????????????? | ||
x4 0 ref2 10 30 25M * 0 0 CaaaTaattaagtctacagagcaac ????????????????????????? | ||
x5 0 ref2 12 30 24M * 0 0 aaTaattaagtctacagagcaact ???????????????????????? | ||
x6 0 ref2 14 30 23M * 0 0 Taattaagtctacagagcaacta ??????????????????????? |