==[p.33~]=========================
Retrieving data from BioMart
<BioMartとは>
説明:Kazusa Wiki(http://wiki.annotation.jp/KazusaMart)より
BioMartとは
EBI(European Bioinformatics Institute)と CSHL(the Cold Spring Harbor Laboratory)が共同開発したクエリー指向型のデータ管理システムです。 このシステムを利用して公開し、ポータルサイトに登録すると、http://biomart.org/ からもアクセス可能になります。また、多数のアプリケーションからBiomartのデータを利用できます。
===========================================================
http://www.biomart.org/ (BioMartのホームページ)
[46データベースを統合]
http://central.biomart.org/ (検索ポータル)
キーワード「分散型データ共有」
<biomart利用の為のライブラリ準備>
> source("http://bioconductor.org/biocLite.R")
Installing package into ‘C:/Users/ekaminuma/Documents/R/win-library/3.1’
(as ‘lib’ is unspecified)
trying URL 'http://www.bioconductor.org/packages/3.0/bioc/bin/windows/contrib/3.1/BiocInstaller_1.16.5.zip'
Content type 'application/zip' length 54033 bytes (52 KB)
opened URL
downloaded 52 KB
The downloaded binary packages are in
C:\Users\ekaminuma\AppData\Local\Temp\Rtmpeq6kwC\downloaded_packages
Bioconductor version 3.0 (BiocInstaller 1.16.5), ?biocLite for
help
A new version of Bioconductor is available after installing
the most recent version of R; see
http://bioconductor.org/install
> biocLite("biomaRt") 10分くらいかかる
BioC_mirror: http://bioconductor.org
Using Bioconductor version 3.0 (BiocInstaller 1.16.5), R
version 3.1.3.
Installing package(s) 'biomaRt'
also installing the dependencies ‘IRanges’, ‘bitops’, ‘BiocGenerics’, ‘Biobase’, ‘GenomeInfoDb’, ‘DBI’, ‘RSQLite’, ‘S4Vectors’, ‘XML’, ‘RCurl’, ‘AnnotationDbi’
trying URL 'http://bioconductor.org/packages/3.0/bioc/bin/windows/contrib/3.1/IRanges_2.0.1.zip'
Content type 'application/zip' length 3273003 bytes (3.1 MB)
opened URL
downloaded 3.1 MB
trying URL 'http://cran.rstudio.com/bin/windows/contrib/3.1/bitops_1.0-6.zip'
Content type 'application/zip' length 36018 bytes (35 KB)
opened URL
downloaded 35 KB
trying URL 'http://bioconductor.org/packages/3.0/bioc/bin/windows/contrib/3.1/BiocGenerics_0.12.1.zip'
Content type 'application/zip' length 855784 bytes (835 KB)
opened URL
downloaded 835 KB
trying URL 'http://bioconductor.org/packages/3.0/bioc/bin/windows/contrib/3.1/Biobase_2.26.0.zip'
Content type 'application/zip' length 4268946 bytes (4.1 MB)
opened URL
downloaded 4.1 MB
trying URL 'http://bioconductor.org/packages/3.0/bioc/bin/windows/contrib/3.1/GenomeInfoDb_1.2.5.zip'
Content type 'application/zip' length 828291 bytes (808 KB)
opened URL
downloaded 808 KB
trying URL 'http://cran.rstudio.com/bin/windows/contrib/3.1/DBI_0.3.1.zip'
Content type 'application/zip' length 154184 bytes (150 KB)
opened URL
downloaded 150 KB
trying URL 'http://cran.rstudio.com/bin/windows/contrib/3.1/RSQLite_1.0.0.zip'
Content type 'application/zip' length 1211110 bytes (1.2 MB)
opened URL
downloaded 1.2 MB
trying URL 'http://bioconductor.org/packages/3.0/bioc/bin/windows/contrib/3.1/S4Vectors_0.4.0.zip'
Content type 'application/zip' length 1411503 bytes (1.3 MB)
opened URL
downloaded 1.3 MB
trying URL 'http://cran.rstudio.com/bin/windows/contrib/3.1/XML_3.98-1.2.zip'
Content type 'application/zip' length 4293059 bytes (4.1 MB)
opened URL
downloaded 4.1 MB
trying URL 'http://cran.rstudio.com/bin/windows/contrib/3.1/RCurl_1.95-4.6.zip'
Content type 'application/zip' length 2703689 bytes (2.6 MB)
opened URL
downloaded 2.6 MB
trying URL 'http://bioconductor.org/packages/3.0/bioc/bin/windows/contrib/3.1/AnnotationDbi_1.28.2.zip'
Content type 'application/zip' length 9489077 bytes (9.0 MB)
opened URL
downloaded 9.0 MB
trying URL 'http://bioconductor.org/packages/3.0/bioc/bin/windows/contrib/3.1/biomaRt_2.22.0.zip'
Content type 'application/zip' length 749475 bytes (731 KB)
opened URL
downloaded 731 KB
package ‘IRanges’ successfully unpacked and MD5 sums checked
package ‘bitops’ successfully unpacked and MD5 sums checked
package ‘BiocGenerics’ successfully unpacked and MD5 sums checked
package ‘Biobase’ successfully unpacked and MD5 sums checked
package ‘GenomeInfoDb’ successfully unpacked and MD5 sums checked
package ‘DBI’ successfully unpacked and MD5 sums checked
package ‘RSQLite’ successfully unpacked and MD5 sums checked
package ‘S4Vectors’ successfully unpacked and MD5 sums checked
package ‘XML’ successfully unpacked and MD5 sums checked
package ‘RCurl’ successfully unpacked and MD5 sums checked
package ‘AnnotationDbi’ successfully unpacked and MD5 sums checked
package ‘biomaRt’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\ekaminuma\AppData\Local\Temp\Rtmpeq6kwC\downloaded_packages
Old packages: 'gdata', 'manipulate', 'codetools', 'lattice',
'MASS', 'Matrix', 'mgcv'
Update all/some/none? [a/s/n]:
n
> library(biomaRt)
> listMarts() 選択できるbiomartのデータベース表示
biomart
1 ensembl
2 snp
3 regulation
4 vega
5 fungi_mart_26
:
> mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
> mart
Object of class 'Mart':
Using the ensembl BioMart database
Using the hsapiens_gene_ensembl dataset
> myres <- getBM(attributes = c("hgnc_symbol"), mart = mart)
> head(myres)
hgnc_symbol
1 MT-TF
2 MT-RNR1
3 MT-TV
4 MT-RNR2
5 MT-TL1
6 MT-ND1
> tmpsample <- sample(myres$hgnc_symbol,50)
> head(tmpsample)
[1] "BAHCC1" "SNORD115-23" "ACTG1P4" "GNAI2P2" "MIR941-5"
[6] "PRUNEP1
> seq <- getSequence(id="BRCA1",type="hgnc_symbol",seqType="peptide",mart= mart)
> show(seq)
peptide
1 MDLSALRVEEVQNVINAMQKILECPICLELIKEPVSTKCDHIFCKFCMLKLLNQKKGPSQCPLCKNDITKRSLQESTRFSQLVEELLKIICAFQLDTGLEYANSYNFAKKENNSPEHLKDEVSIIQSMGYRNRAKRLLQSEPENPSLQETSLSVQLSNLGTVRTLRTKQRIQPQKTSVYIELGSDSSEDTVNKATYCSVGDQELLQITPQGTRDEISLDSAKKAACEFSETDVTNTEHHQPSNNDLNTTEKRAAERHPEKYQGSSVSNLHVEPCGTNTHASSLQHENSSLLLTKDRMNVEKAEFCNKSKQPGLARSQHNRWAGSKETCNDRRTPSTEKKVDLNADPLCERKEWNKQKLPCSENPRDTEDVPWITLNSSIQKVNEWFSRSDELLGSDDSHDGESESNAKVADVLDVLNEVDEYSGSSEKIDLLASDPHEALICKSERVHSKSVESNIEDKIFGKTYRKKASLPNLSHVTENLIIGAFVTEPQIIQERPLTNKLKRKRRPTSGLHPEDFIKKADLAVQKTPEMINQGTNQTEQNGQVMNITNSGHENKTKGDSIQNEKNPNPIESLEKESAFKTKAEPISSSISNMELELNIHNSKAPKKNRLRRKSSTRHIHALELVVSRNLSPPNCTELQIDSCSSSEE
:
> seq2 <- getSequence(id="ENST00000520540", type='ensembl_transcript_id',seqType ='gene_flank',upstream=30,mart=mart)
> show(seq2)
gene_flank ensembl_transcript_id
1 AATGAAAAGAGGTCTGCCCGAGCGTGCGAC ENST00000520540
> variation = useMart(biomart="snp", dataset="hsapiens_snp")
> listFilters(variation)
name
1 chr_name
2 start
3 end
4 band_start
5 band_end
6 marker_start
7 marker_end
8 chromosomal_region
:
> listAttributes(variation)
listAttributes(variation)
name description
1 refsnp_id Variation Name
2 refsnp_source Variation source
3 refsnp_source_description Variation source description
4 chr_name Chromosome name
5 chrom_start Chromosome position start (bp)
6 chrom_end Chromosome position end (bp)
7 chrom_strand Strand
8 allele
> rs1333049 <- getBM(attributes=c('refsnp_id','refsnp_source','chr_name','chrom_start','chrom_end','minor_allele','minor_allele_freq','minor_allele_count','consequence_allele_string','ensembl_gene_stable_id','ensembl_transcript_stable_id'), filters = 'snp_filter', values ="rs1333049", mart = variation)
次回は9章のクラスタリングに入ります。
一人一人、自分でデータを用意してもらいます。
bioinformaticsのデータではなく、一般的で面白い結果が出そうなデータを探してきてください。
行列数値の形式でTSVもしくはCSVファイルでデータを準備します。
Rでデータファイルを読み込める所まで、確認準備してください(宿題)。