Example formats
hill committed Nov 11, 2024
1 parent cc3d5ea commit 697767f
Showing 9 changed files with 315 additions and 1 deletion.
10 changes: 9 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
@@ -3,10 +3,18 @@
DNAvigate is Tom's experimental genome browser, primarily so he can learn about
different file formats / methods in bioinformatics-land.

...but it's also because no one has named a genome browser DNAvigate yet? And that seems
like the obvious naming choice???

Please don't expect this to be functional or useful! :)

## Existing Genome Browsers

- UCSC Genome Browser - https://genome.ucsc.edu/
- Gviz (Bioconductor) - https://bioconductor.org/packages/release/bioc/vignettes/Gviz/inst/doc/Gviz.html
- IGV - https://igv.org/doc/desktop/


## Stuff I'm Learning about as I build this

- [file formats](./docs/formats.md)
50 changes: 50 additions & 0 deletions docs/formats.md
@@ -0,0 +1,50 @@
# File Formats

Different file formats in bioinformatics and what they're used for

## FASTA (.fa)

Plain-text nucleotide or protein sequences. Each record is a `>` header line followed by one or more lines of sequence.

[Example](../examples/mygene.fasta)
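
To make the record layout concrete, here is a minimal sketch of a FASTA reader. The function name `parse_fasta` is made up for this example, and it assumes well-formed input that fits in memory; it is not how DNAvigate or `samtools` reads FASTA. The input is the content of `examples/toy.fa`.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Collect sequences keyed by the first word of each '>' header line."""
    chunks: Dict[str, List[str]] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: the record name is the first word after '>'
            name = line[1:].split()[0]
            chunks[name] = []
        elif name is not None:
            # Sequence line: may be one of many for the current record
            chunks[name].append(line)
    return {key: "".join(parts) for key, parts in chunks.items()}


toy = """>ref
AGCATGTTAGATAAGATAGCTGTGCTAGTAGGCAGTCAGCGCCAT
>ref2
aggttttataaaacaattaagtctacagagcaactacgcg
""".splitlines()

sequences = parse_fasta(toy)
print({name: len(seq) for name, seq in sequences.items()})
```

The two lengths it reports (45 and 40) agree with the `LN:` values in the `@SQ` header lines of `examples/toy.sam`.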

## .fai

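FASTA index file generated by `samtools faidx` for fast random access to specific regions. Each line has five columns: sequence name, length in bases, byte offset of the first base, bases per line, and bytes per line (including the newline).

[Example: index of the hg19 (human_g1k_v37) reference, one line per chromosome / contig](../examples/human_g1k_v37.fasta.fai)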
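
As a rough illustration of why the index gives fast access, the sketch below turns a 1-based position into a byte offset in the FASTA file using the five `.fai` columns. `FaiEntry`, `parse_fai_line` and `byte_offset` are hypothetical names for this example, not part of `samtools`; the input line is the chromosome 20 entry from the example index.

```python
from typing import NamedTuple


class FaiEntry(NamedTuple):
    name: str        # sequence name
    length: int      # number of bases in the sequence
    offset: int      # byte offset of the first base in the FASTA file
    line_bases: int  # bases per line
    line_width: int  # bytes per line, including the newline


def parse_fai_line(line: str) -> FaiEntry:
    """Parse one whitespace-separated .fai line."""
    name, length, offset, line_bases, line_width = line.split()
    return FaiEntry(name, int(length), int(offset), int(line_bases), int(line_width))


def byte_offset(entry: FaiEntry, pos: int) -> int:
    """Byte offset in the FASTA file of the base at 1-based position `pos`."""
    if not 1 <= pos <= entry.length:
        raise ValueError(f"position {pos} is outside 1..{entry.length}")
    # Whole lines before the target base each cost line_width bytes;
    # the remainder is the column within the target line.
    full_lines, column = divmod(pos - 1, entry.line_bases)
    return entry.offset + full_lines * entry.line_width + column


chr20 = parse_fai_line("20 63025520 2763883926 60 61")
print(byte_offset(chr20, 1))   # first base of chromosome 20
print(byte_offset(chr20, 61))  # first base on the second line
```

With that offset, a reader can `seek()` straight to the region instead of scanning about 2.7 GB of preceding sequence.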

## FASTQ (.fq)

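Like FASTA, but every base also carries a quality score, in practice a [Phred quality score](https://en.wikipedia.org/wiki/Phred_quality_score) stored as one ASCII character per base.

The quality score Q encodes the probability P that the base was called incorrectly by the sequencer: Q = -10 * log10(P), so Q20 means a 1 in 100 chance of error.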

```fastq
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
```

[Example](../examples/HI.4019.002.index_7.ANN0831_R1.fastq)
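
A small sketch of decoding the quality line from the record above. It assumes the common Phred+33 encoding (ASCII code minus 33); some old Illumina files used an offset of 64, so the `offset` parameter is an assumption to check against your data. The function names are made up for this example.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert a FASTQ quality string to integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q: int) -> float:
    """Probability that a base with Phred score q was called incorrectly."""
    return 10 ** (-q / 10)


quality = "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65"
scores = phred_scores(quality)

# First base: '!' is the lowest possible score under Phred+33
print(scores[0], error_probability(scores[0]))
# Last base: '5'
print(scores[-1], error_probability(scores[-1]))
```

There is one quality character per base, so the quality string has the same length as the sequence line (60 here).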

## Alignment Formats

### SAM / BAM (Sequence/Binary Alignment Map)

Stores alignments of sequencing reads against a reference sequence. SAM is the plain-text format; BAM is the compressed binary equivalent.

[Example SAM](../examples/toy.sam)
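
Two SAM columns that are hard to read by eye are FLAG (a bit field) and CIGAR (how the read lines up with the reference). The sketch below decodes both, using values from `examples/toy.sam`. The bit meanings and the operators that consume reference or query bases follow the SAM specification; the names `decode_flag` and `cigar_lengths` are made up for this example.

```python
import re
from typing import List, Tuple

# FLAG bits as defined in the SAM specification
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")    # operators that advance along the reference
QUERY_CONSUMING = set("MIS=X")  # operators that advance along the read


def decode_flag(flag: int) -> List[str]:
    """Names of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def cigar_lengths(cigar: str) -> Tuple[int, int]:
    """Return (reference bases covered, read bases used) for a CIGAR string."""
    ref_len = query_len = 0
    for count, op in CIGAR_OP.findall(cigar):
        n = int(count)
        if op in REF_CONSUMING:
            ref_len += n
        if op in QUERY_CONSUMING:
            query_len += n
    return ref_len, query_len


# First r001 record in toy.sam: FLAG 163, CIGAR 8M4I4M1D3M
print(decode_flag(163))
print(cigar_lengths("8M4I4M1D3M"))
```

For that record the read length from the CIGAR (19) matches the length of its SEQ field, `TTAGATAAAGAGGATACTG`.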

### CRAM (Compressed Reference-oriented Alignment Map)

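Reference-based compressed alignment format. Files are typically smaller than the equivalent BAM because bases that match the reference do not need to be stored.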

## VCF (Variant Call Format) / BCF (binary format)

Represents sequence variation (e.g. SNPs, indels, structural variants, alternative alleles)
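
As a sketch of what a VCF data line holds, the code below splits the eight fixed tab-separated columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) and makes a rough SNP / indel call from allele lengths. The two records are hypothetical, made up for this example, and the length-based classification is a simplification that does not handle symbolic alleles such as `<DEL>`.

```python
from typing import Dict, List


def classify_allele(ref: str, alt: str) -> str:
    """Rough classification by allele length (symbolic alleles are not handled)."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(alt) > len(ref):
        return "insertion"
    if len(alt) < len(ref):
        return "deletion"
    return "MNP"  # same length, more than one base


def parse_vcf_record(line: str) -> Dict[str, object]:
    """Split the eight fixed, tab-separated columns of one VCF data line."""
    chrom, pos, ident, ref, alt, qual, filt, info = line.rstrip("\n").split("\t")[:8]
    alts: List[str] = alt.split(",")  # several ALT alleles are comma-separated
    return {
        "chrom": chrom,
        "pos": int(pos),  # 1-based position of the first REF base
        "id": ident,
        "ref": ref,
        "alts": alts,
        "qual": qual,
        "filter": filt,
        "info": info,
        "kinds": [classify_allele(ref, a) for a in alts],
    }


# Hypothetical records, made up for this example
snp = parse_vcf_record("20\t14370\t.\tG\tA\t29\tPASS\tDP=14")
indel = parse_vcf_record("20\t1234567\t.\tGTC\tG,GTCT\t50\tPASS\tDP=9")
print(snp["kinds"], indel["kinds"])
```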

## PDB (Protein Data Bank format)

Represents 3D structures of molecules / proteins
55 changes: 55 additions & 0 deletions docs/manipulating-genome.md
@@ -0,0 +1,55 @@
# Steps for manipulating a genome and producing output files

https://www.biostars.org/p/150010/ (thanks jdimatteo)

https://web.archive.org/web/20161010092833/http://biobits.org/samtools_primer.html

https://www.melbournebioinformatics.org.au/tutorials/

1. Download the hg19 (GRCh37) reference genome and its `.fai` index

```bash
curl -O http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.fai
curl -O http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz
gunzip human_g1k_v37.fasta.gz
```

2. Extract a single chromosome and build an aligner index for it

`samtools` is a suite of tools for working with these file formats. `samtools faidx` extracts the single chromosome (here `20`) using the `.fai` index file.

[Bowtie2](https://github.com/BenLangmead/bowtie2) is a read aligner for aligning sequencing reads to a reference; `bowtie2-build` builds the index it aligns against.
I think `minimap2` is a more modern alternative (it is a separate aligner, not a newer version of Bowtie2).

```bash
samtools faidx human_g1k_v37.fasta 20 > human_g1k_v37_chr20.fasta
bowtie2-build human_g1k_v37_chr20.fasta homo_chr20
```

3. Simulate a read sample

[wgsim](https://github.com/lh3/wgsim) simulates sequence reads from a reference genome.
It simulates diploid genomes with SNPs and insertion/deletion (indel) polymorphisms. `-N 1` asks for a single read pair.

```bash
wgsim -N 1 human_g1k_v37_chr20.fasta single.read1.fq single.read2.fq > wgsim.out
```

4. Generate the SAM by aligning the simulated read pair to the chromosome 20 index

```bash
bowtie2 homo_chr20 -1 single.read1.fq -2 single.read2.fq -S single_pair.sam
```

5. Convert the SAM to BAM

```bash
samtools view -b -S -o single_pair.bam single_pair.sam
```

6. Sort and index it

```bash
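# Current samtools takes the output file via -o; very old releases took an output prefix instead
samtools sort -o single_pair.sorted.bam single_pair.bam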
samtools index single_pair.sorted.bam
```
56 changes: 56 additions & 0 deletions examples/ex1.fa
@@ -0,0 +1,56 @@
>seq1
CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGTCCATGGCCCAGCATTAGGGAGCT
GTGGACCCTGCAGCCTGGCTGTGGGGGCCGCAGTGGCTGAGGGGTGCAGAGCCGAGTCAC
GGGGTTGCCAGCACAGGGGCTTAACCTCTGGTGACTGCCAGAGCTGCTGGCAAGCTAGAG
TCCCATTTGGAGCCCCTCTAAGCCGTTCTATTTGTAATGAAAACTATATTTATGCTATTC
AGTTCTAAATATAGAAATTGAAACAGCTGTGTTTAGTGCCTTTGTTCAACCCCCTTGCAA
CAACCTTGAGAACCCCAGGGAATTTGTCAATGTCAGGGAAGGAGCATTTTGTCAGTTACC
AAATGTGTTTATTACCAGAGGGATGGAGGGAAGAGGGACGCTGAAGAACTTTGATGCCCT
CTTCTTCCAAAGATGAAACGCGTAACTGCGCTCTCATTCACTCCAGCTCCCTGTCACCCA
ATGGACCTGTGATATCTGGATTCTGGGAAATTCTTCATCCTGGACCCTGAGAGATTCTGC
AGCCCAGCTCCAGATTGCTTGTGGTCTGACAGGCTGCAACTGTGAGCCATCACAATGAAC
AACAGGAAGAAAAGGTCTTTCAAAAGGTGATGTGTGTTCTCATCAACCTCATACACACAC
ATGGTTTAGGGGTATAATACCTCTACATGGCTGATTATGAAAACAATGTTCCCCAGATAC
CATCCCTGTCTTACTTCCAGCTCCCCAGAGGGAAAGCTTTCAACGCTTCTAGCCATTTCT
TTTGGCATTTGCCTTCAGACCCTACACGAATGCGTCTCTACCACAGGGGGCTGCGCGGTT
TCCCATCATGAAGCACTGAACTTCCACGTCTCATCTAGGGGAACAGGGAGGTGCACTAAT
GCGCTCCACGCCCAAGCCCTTCTCACAGTTTCTGCCCCCAGCATGGTTGTACTGGGCAAT
ACATGAGATTATTAGGAAATGCTTTACTGTCATAACTATGAAGAGACTATTGCCAGATGA
ACCACACATTAATACTATGTTTCTTATCTGCACATTACTACCCTGCAATTAATATAATTG
TGTCCATGTACACACGCTGTCCTATGTACTTATCATGACTCTATCCCAAATTCCCAATTA
CGTCCTATCTTCTTCTTAGGGAAGAACAGCTTAGGTATCAATTTGGTGTTCTGTGTAAAG
TCTCAGGGAGCCGTCCGTGTCCTCCCATCTGGCCTCGTCCACACTGGTTCTCTTGAAAGC
TTGGGCTGTAATGATGCCCCTTGGCCATCACCCAGTCCCTGCCCCATCTCTTGTAATCTC
TCTCCTTTTTGCTGCATCCCTGTCTTCCTCTGTCTTGATTTACTTGTTGTTGGTTTTCTG
TTTCTTTGTTTGATTTGGTGGAAGACATAATCCCACGCTTCCTATGGAAAGGTTGTTGGG
AGATTTTTAATGATTCCTCAATGTTAAAATGTCTATTTTTGTCTTGACACCCAACTAATA
TTTGTCTGAGCAAAACAGTCTAGATGAGAGAGAACTTCCCTGGAGGTCTGATGGCGTTTC
TCCCTCGTCTTCTTA
>seq2
TTCAAATGAACTTCTGTAATTGAAAAATTCATTTAAGAAATTACAAAATATAGTTGAAAG
CTCTAACAATAGACTAAACCAAGCAGAAGAAAGAGGTTCAGAACTTGAAGACAAGTCTCT
TATGAATTAACCCAGTCAGACAAAAATAAAGAAAAAAATTTTAAAAATGAACAGAGCTTT
CAAGAAGTATGAGATTATGTAAAGTAACTGAACCTATGAGTCACAGGTATTCCTGAGGAA
AAAGAAAAAGTGAGAAGTTTGGAAAAACTATTTGAGGAAGTAATTGGGGAAAACCTCTTT
AGTCTTGCTAGAGATTTAGACATCTAAATGAAAGAGGCTCAAAGAATGCCAGGAAGATAC
ATTGCAAGACAGACTTCATCAAGATATGTAGTCATCAGACTATCTAAAGTCAACATGAAG
GAAAAAAATTCTAAAATCAGCAAGAGAAAAGCATACAGTCATCTATAAAGGAAATCCCAT
CAGAATAACAATGGGCTTCTCAGCAGAAACCTTACAAGCCAGAAGAGATTGGATCTAATT
TTTGGACTTCTTAAAGAAAAAAAAACCTGTCAAACACGAATGTTATGCCCTGCTAAACTA
AGCATCATAAATGAAGGGGAAATAAAGTCAAGTCTTTCCTGACAAGCAAATGCTAAGATA
ATTCATCATCACTAAACCAGTCCTATAAGAAATGCTCAAAAGAATTGTAAAAGTCAAAAT
TAAAGTTCAATACTCACCATCATAAATACACACAAAAGTACAAAACTCACAGGTTTTATA
AAACAATTGAGACTACAGAGCAACTAGGTAAAAAATTAACATTACAACAGGAACAAAACC
TCATATATCAATATTAACTTTGAATAAAAAGGGATTAAATTCCCCCACTTAAGAGATATA
GATTGGCAGAACAGATTTAAAAACATGAACTAACTATATGCTGTTTACAAGAAACTCATT
AATAAAGACATGAGTTCAGGTAAAGGGGTGGAAAAAGATGTTCTACGCAAACAGAAACCA
AATGAGAGAAGGAGTAGCTATACTTATATCAGATAAAGCACACTTTAAATCAACAACAGT
AAAATAAAACAAAGGAGGTCATCATACAATGATAAAAAGATCAATTCAGCAAGAAGATAT
AACCATCCTACTAAATACATATGCACCTAACACAAGACTACCCAGATTCATAAAACAAAT
ACTACTAGACCTAAGAGGGATGAGAAATTACCTAATTGGTACAATGTACAATATTCTGAT
GATGGTTACACTAAAAGCCCATACTTTACTGCTACTCAATATATCCATGTAACAAATCTG
CGCTTGTACTTCTAAATCTATAAAAAAATTAAAATTTAACAAAAGTAAATAAAACACATA
GCTAAAACTAAAAAAGCAAAAACAAAAACTATGCTAAGTATTGGTAAAGATGTGGGGAAA
AAAGTAAACTCTCAAATATTGCTAGTGGGAGTATAAATTGTTTTCCACTTTGGAAAACAA
TTTGGTAATTTCGTTTTTTTTTTTTTCTTTTCTCTTTTTTTTTTTTTTTTTTTTGCATGC
CAGAAAAAAATATTTACAGTAACT
Binary file added examples/ex1.sam.gz
Binary file not shown.
43 changes: 43 additions & 0 deletions examples/ex1_readme.md
@@ -0,0 +1,43 @@
Comes from https://github.com/samtools/samtools/blob/develop/examples

File ex1.fa contains two sequences cut from the human genome
build36. They were extracted with command:

samtools faidx human_b36.fa 2:2043966-2045540 20:67967-69550

Sequence names were changed manually for simplicity. File ex1.sam.gz
contains MAQ alignments extracted with:

(samtools view NA18507_maq.bam 2:2044001-2045500;
samtools view NA18507_maq.bam 20:68001-69500)

and processed with `samtools fixmate` to make it self-consistent as a
standalone alignment.

To try samtools, you may run the following commands.

Index the reference FASTA.
samtools faidx ex1.fa

Convert the (headerless) SAM file to BAM. Note if we had used
"samtools view -h" above to create the ex1.sam.gz then we could omit the
"-t ex1.fa.fai" option here.
samtools view -S -b -t ex1.fa.fai -o ex1.bam ex1.sam.gz

Build an index for the BAM file:
samtools index ex1.bam

View a portion of the BAM file:
samtools view ex1.bam seq2:450-550

Visually inspect the alignments at the same location:
samtools tview -p seq2:450 ex1.bam ex1.fa

View the data in pileup format:
samtools mpileup -f ex1.fa ex1.bam

Generate an uncompressed VCF file of variants:
samtools mpileup -vu -f ex1.fa ex1.bam > ex1.vcf

Generate a compressed VCF file of variants:
samtools mpileup -g -f ex1.fa ex1.bam > ex1.bcf
84 changes: 84 additions & 0 deletions examples/human_g1k_v37.fasta.fai
@@ -0,0 +1,84 @@
1 249250621 52 60 61
2 243199373 253404903 60 61
3 198022430 500657651 60 61
4 191154276 701980507 60 61
5 180915260 896320740 60 61
6 171115067 1080251307 60 61
7 159138663 1254218344 60 61
8 146364022 1416009371 60 61
9 141213431 1564812846 60 61
10 135534747 1708379889 60 61
11 135006516 1846173603 60 61
12 133851895 1983430282 60 61
13 115169878 2119513096 60 61
14 107349540 2236602526 60 61
15 102531392 2345741279 60 61
16 90354753 2449981581 60 61
17 81195210 2541842300 60 61
18 78077248 2624390817 60 61
19 59128983 2703769406 60 61
20 63025520 2763883926 60 61
21 48129895 2827959925 60 61
22 51304566 2876892038 60 61
X 155270560 2929051733 60 61
Y 59373566 3086910193 60 61
MT 16569 3147273397 70 71
GL000207.1 4262 3147290265 60 61
GL000226.1 15008 3147294661 60 61
GL000229.1 19913 3147309982 60 61
GL000231.1 27386 3147330289 60 61
GL000210.1 27682 3147358194 60 61
GL000239.1 33824 3147386400 60 61
GL000235.1 34474 3147420850 60 61
GL000201.1 36148 3147455961 60 61
GL000247.1 36422 3147492774 60 61
GL000245.1 36651 3147529866 60 61
GL000197.1 37175 3147567190 60 61
GL000203.1 37498 3147605047 60 61
GL000246.1 38154 3147643232 60 61
GL000249.1 38502 3147682084 60 61
GL000196.1 38914 3147721290 60 61
GL000248.1 39786 3147760915 60 61
GL000244.1 39929 3147801427 60 61
GL000238.1 39939 3147842084 60 61
GL000202.1 40103 3147882751 60 61
GL000234.1 40531 3147923585 60 61
GL000232.1 40652 3147964854 60 61
GL000206.1 41001 3148006246 60 61
GL000240.1 41933 3148047993 60 61
GL000236.1 41934 3148090687 60 61
GL000241.1 42152 3148133382 60 61
GL000243.1 43341 3148176299 60 61
GL000242.1 43523 3148220425 60 61
GL000230.1 43691 3148264736 60 61
GL000237.1 45867 3148309218 60 61
GL000233.1 45941 3148355912 60 61
GL000204.1 81310 3148402681 60 61
GL000198.1 90085 3148485409 60 61
GL000208.1 92689 3148577058 60 61
GL000191.1 106433 3148671355 60 61
GL000227.1 128374 3148779625 60 61
GL000228.1 129120 3148910202 60 61
GL000214.1 137718 3149041537 60 61
GL000221.1 155397 3149181614 60 61
GL000209.1 159169 3149339664 60 61
GL000218.1 161147 3149501549 60 61
GL000220.1 161802 3149665445 60 61
GL000213.1 164239 3149830007 60 61
GL000211.1 166566 3149997047 60 61
GL000199.1 169874 3150166453 60 61
GL000217.1 172149 3150339222 60 61
GL000216.1 172294 3150514304 60 61
GL000215.1 172545 3150689533 60 61
GL000205.1 174588 3150865017 60 61
GL000219.1 179198 3151042578 60 61
GL000224.1 179693 3151224826 60 61
GL000223.1 180455 3151407577 60 61
GL000195.1 182896 3151591103 60 61
GL000212.1 186858 3151777111 60 61
GL000222.1 186861 3151967147 60 61
GL000200.1 187035 3152157186 60 61
GL000193.1 189789 3152347402 60 61
GL000194.1 191469 3152540418 60 61
GL000225.1 211173 3152735142 60 61
GL000192.1 547496 3152949898 60 61
4 changes: 4 additions & 0 deletions examples/toy.fa
@@ -0,0 +1,4 @@
>ref
AGCATGTTAGATAAGATAGCTGTGCTAGTAGGCAGTCAGCGCCAT
>ref2
aggttttataaaacaattaagtctacagagcaactacgcg
14 changes: 14 additions & 0 deletions examples/toy.sam
@@ -0,0 +1,14 @@
@SQ SN:ref LN:45
@SQ SN:ref2 LN:40
r001 163 ref 7 30 8M4I4M1D3M = 37 39 TTAGATAAAGAGGATACTG * XX:B:S,12561,2,20,112
r002 0 ref 9 30 1S2I6M1P1I1P1I4M2I * 0 0 AAAAGATAAGGGATAAA *
r003 0 ref 9 30 5H6M * 0 0 AGCTAA *
r004 0 ref 16 30 6M14N1I5M * 0 0 ATAGCTCTCAGC *
r003 16 ref 29 30 6H5M * 0 0 TAGGC *
r001 83 ref 37 30 9M = 7 -39 CAGCGCCAT *
x1 0 ref2 1 30 20M * 0 0 aggttttataaaacaaataa ????????????????????
x2 0 ref2 2 30 21M * 0 0 ggttttataaaacaaataatt ?????????????????????
x3 0 ref2 6 30 9M4I13M * 0 0 ttataaaacAAATaattaagtctaca ??????????????????????????
x4 0 ref2 10 30 25M * 0 0 CaaaTaattaagtctacagagcaac ?????????????????????????
x5 0 ref2 12 30 24M * 0 0 aaTaattaagtctacagagcaact ????????????????????????
x6 0 ref2 14 30 23M * 0 0 Taattaagtctacagagcaacta ???????????????????????
