150608.md

==[p.33~]=========================
Retrieving data from BioMart

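<What is BioMart?>

Description, from the Kazusa Wiki (http://wiki.annotation.jp/KazusaMart):
What is BioMart

BioMart is a query-oriented data management system developed jointly by EBI (European Bioinformatics Institute) and CSHL (the Cold Spring Harbor Laboratory). Data published with this system and registered with the portal site also becomes accessible from http://biomart.org/. BioMart data can also be used from many applications.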

===========================================================

http://www.biomart.org/ (BioMart home page)

[Integrates 46 databases]

http://central.biomart.org/ (search portal)


Keyword: "distributed data sharing"

<Preparing the libraries needed to use biomaRt>


> source("http://bioconductor.org/biocLite.R")

Installing package into ‘C:/Users/ekaminuma/Documents/R/win-library/3.1’
(as ‘lib’ is unspecified)
trying URL 'http://www.bioconductor.org/packages/3.0/bioc/bin/windows/contrib/3.1/BiocInstaller_1.16.5.zip'
Content type 'application/zip' length 54033 bytes (52 KB)
opened URL
downloaded 52 KB


The downloaded binary packages are in
 C:\Users\ekaminuma\AppData\Local\Temp\Rtmpeq6kwC\downloaded_packages
Bioconductor version 3.0 (BiocInstaller 1.16.5), ?biocLite for
  help
A new version of Bioconductor is available after installing
  the most recent version of R; see
  http://bioconductor.org/install

> biocLite("biomaRt")   # takes about 10 minutes

BioC_mirror: http://bioconductor.org
Using Bioconductor version 3.0 (BiocInstaller 1.16.5), R
  version 3.1.3.
Installing package(s) 'biomaRt'
also installing the dependencies ‘IRanges’, ‘bitops’, ‘BiocGenerics’, ‘Biobase’, ‘GenomeInfoDb’, ‘DBI’, ‘RSQLite’, ‘S4Vectors’, ‘XML’, ‘RCurl’, ‘AnnotationDbi’

trying URL 'http://bioconductor.org/packages/3.0/bioc/bin/windows/contrib/3.1/IRanges_2.0.1.zip'
Content type 'application/zip' length 3273003 bytes (3.1 MB)
opened URL
downloaded 3.1 MB

trying URL 'http://cran.rstudio.com/bin/windows/contrib/3.1/bitops_1.0-6.zip'
Content type 'application/zip' length 36018 bytes (35 KB)
opened URL
downloaded 35 KB

trying URL 'http://bioconductor.org/packages/3.0/bioc/bin/windows/contrib/3.1/BiocGenerics_0.12.1.zip'
Content type 'application/zip' length 855784 bytes (835 KB)
opened URL
downloaded 835 KB

trying URL 'http://bioconductor.org/packages/3.0/bioc/bin/windows/contrib/3.1/Biobase_2.26.0.zip'
Content type 'application/zip' length 4268946 bytes (4.1 MB)
opened URL
downloaded 4.1 MB

trying URL 'http://bioconductor.org/packages/3.0/bioc/bin/windows/contrib/3.1/GenomeInfoDb_1.2.5.zip'
Content type 'application/zip' length 828291 bytes (808 KB)
opened URL
downloaded 808 KB

trying URL 'http://cran.rstudio.com/bin/windows/contrib/3.1/DBI_0.3.1.zip'
Content type 'application/zip' length 154184 bytes (150 KB)
opened URL
downloaded 150 KB

trying URL 'http://cran.rstudio.com/bin/windows/contrib/3.1/RSQLite_1.0.0.zip'
Content type 'application/zip' length 1211110 bytes (1.2 MB)
opened URL
downloaded 1.2 MB

trying URL 'http://bioconductor.org/packages/3.0/bioc/bin/windows/contrib/3.1/S4Vectors_0.4.0.zip'
Content type 'application/zip' length 1411503 bytes (1.3 MB)
opened URL
downloaded 1.3 MB

trying URL 'http://cran.rstudio.com/bin/windows/contrib/3.1/XML_3.98-1.2.zip'
Content type 'application/zip' length 4293059 bytes (4.1 MB)
opened URL
downloaded 4.1 MB

trying URL 'http://cran.rstudio.com/bin/windows/contrib/3.1/RCurl_1.95-4.6.zip'
Content type 'application/zip' length 2703689 bytes (2.6 MB)
opened URL
downloaded 2.6 MB

trying URL 'http://bioconductor.org/packages/3.0/bioc/bin/windows/contrib/3.1/AnnotationDbi_1.28.2.zip'
Content type 'application/zip' length 9489077 bytes (9.0 MB)
opened URL
downloaded 9.0 MB

trying URL 'http://bioconductor.org/packages/3.0/bioc/bin/windows/contrib/3.1/biomaRt_2.22.0.zip'
Content type 'application/zip' length 749475 bytes (731 KB)
opened URL
downloaded 731 KB

package ‘IRanges’ successfully unpacked and MD5 sums checked
package ‘bitops’ successfully unpacked and MD5 sums checked
package ‘BiocGenerics’ successfully unpacked and MD5 sums checked
package ‘Biobase’ successfully unpacked and MD5 sums checked
package ‘GenomeInfoDb’ successfully unpacked and MD5 sums checked
package ‘DBI’ successfully unpacked and MD5 sums checked
package ‘RSQLite’ successfully unpacked and MD5 sums checked
package ‘S4Vectors’ successfully unpacked and MD5 sums checked
package ‘XML’ successfully unpacked and MD5 sums checked
package ‘RCurl’ successfully unpacked and MD5 sums checked
package ‘AnnotationDbi’ successfully unpacked and MD5 sums checked
package ‘biomaRt’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
 C:\Users\ekaminuma\AppData\Local\Temp\Rtmpeq6kwC\downloaded_packages
Old packages: 'gdata', 'manipulate', 'codetools', 'lattice',
  'MASS', 'Matrix', 'mgcv'
Update all/some/none? [a/s/n]: 
n
> library(biomaRt)
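
The transcript above installs biomaRt with biocLite() under Bioconductor 3.0. On recent R releases the same install goes through the BiocManager package instead. The following is a minimal sketch, assuming a current R installation and network access; it is not part of the original session.

```r
# Assumption: a recent R release, where biocLite() is no longer supported
# and BiocManager (from CRAN) is the installer for Bioconductor packages.
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

# Install biomaRt only if it is not already present.
if (!requireNamespace("biomaRt", quietly = TRUE)) {
  BiocManager::install("biomaRt")
}

library(biomaRt)
```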

> listMarts()    # list the BioMart databases available for selection
                                 biomart
1                                ensembl
2                                    snp
3                             regulation
4                                   vega
5                          fungi_mart_26
:

> mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
> mart
Object of class 'Mart':
 Using the ensembl BioMart database
 Using the hsapiens_gene_ensembl dataset

> myres <- getBM(attributes = c("hgnc_symbol"), mart = mart)
> head(myres)
  hgnc_symbol
1       MT-TF
2     MT-RNR1
3       MT-TV
4     MT-RNR2
5      MT-TL1
6      MT-ND1

> tmpsample <- sample(myres$hgnc_symbol,50)
> head(tmpsample)
[1] "BAHCC1"      "SNORD115-23" "ACTG1P4"     "GNAI2P2"     "MIR941-5"   
[6] "PRUNEP1"
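
Because sample() draws at random, the six symbols shown above will differ from run to run. The sketch below shows one way to make such a draw repeatable and to drop blank entries first. The `symbols` vector is a hypothetical stand-in for `myres$hgnc_symbol`, and the assumption that the real column may contain empty strings should be checked against your own result.

```r
# Hypothetical stand-in for myres$hgnc_symbol; the real vector comes from getBM().
symbols <- c("MT-TF", "", "MT-RNR1", "MT-TV", "", "MT-RNR2", "MT-TL1", "MT-ND1")

# Drop NA and empty strings (entries without a symbol), then remove duplicates.
symbols <- unique(symbols[!is.na(symbols) & symbols != ""])

# Fixing the seed makes sample() return the same draw on every run.
set.seed(1)
draw1 <- sample(symbols, 3)
set.seed(1)
draw2 <- sample(symbols, 3)

identical(draw1, draw2)
```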

> seq <-  getSequence(id="BRCA1",type="hgnc_symbol",seqType="peptide",mart= mart)
> show(seq)
                                                                           peptide
1                                                                                                                                            MDLSALRVEEVQNVINAMQKILECPICLELIKEPVSTKCDHIFCKFCMLKLLNQKKGPSQCPLCKNDITKRSLQESTRFSQLVEELLKIICAFQLDTGLEYANSYNFAKKENNSPEHLKDEVSIIQSMGYRNRAKRLLQSEPENPSLQETSLSVQLSNLGTVRTLRTKQRIQPQKTSVYIELGSDSSEDTVNKATYCSVGDQELLQITPQGTRDEISLDSAKKAACEFSETDVTNTEHHQPSNNDLNTTEKRAAERHPEKYQGSSVSNLHVEPCGTNTHASSLQHENSSLLLTKDRMNVEKAEFCNKSKQPGLARSQHNRWAGSKETCNDRRTPSTEKKVDLNADPLCERKEWNKQKLPCSENPRDTEDVPWITLNSSIQKVNEWFSRSDELLGSDDSHDGESESNAKVADVLDVLNEVDEYSGSSEKIDLLASDPHEALICKSERVHSKSVESNIEDKIFGKTYRKKASLPNLSHVTENLIIGAFVTEPQIIQERPLTNKLKRKRRPTSGLHPEDFIKKADLAVQKTPEMINQGTNQTEQNGQVMNITNSGHENKTKGDSIQNEKNPNPIESLEKESAFKTKAEPISSSISNMELELNIHNSKAPKKNRLRRKSSTRHIHALELVVSRNLSPPNCTELQIDSCSSSEE
:

> seq2 <- getSequence(id="ENST00000520540", type='ensembl_transcript_id',seqType ='gene_flank',upstream=30,mart=mart)
> show(seq2)
                      gene_flank ensembl_transcript_id
1 AATGAAAAGAGGTCTGCCCGAGCGTGCGAC       ENST00000520540


> variation = useMart(biomart="snp", dataset="hsapiens_snp")

> listFilters(variation)
                       name
1                  chr_name
2                     start
3                       end
4                band_start
5                  band_end
6              marker_start
7                marker_end
8        chromosomal_region
:

> listAttributes(variation)


                              name                                      description
1                        refsnp_id                                   Variation Name
2                    refsnp_source                                 Variation source
3        refsnp_source_description                     Variation source description
4                         chr_name                                  Chromosome name
5                      chrom_start                   Chromosome position start (bp)
6                        chrom_end                     Chromosome position end (bp)
7                     chrom_strand                                           Strand
8                           allele           


> rs1333049 <- getBM(attributes=c('refsnp_id','refsnp_source','chr_name','chrom_start','chrom_end','minor_allele','minor_allele_freq','minor_allele_count','consequence_allele_string','ensembl_gene_stable_id','ensembl_transcript_stable_id'), filters = 'snp_filter', values ="rs1333049", mart = variation)
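
The getBM() call above requests many attributes by name, and a misspelled or unavailable name makes the query fail. One way to check the names first is to compare them with the `name` column that listAttributes() and listFilters() return. The sketch below uses a hypothetical helper, `missing_names`, and a hand-built stand-in table containing only the eight attribute rows printed earlier, so it runs without a network connection.

```r
# Hypothetical helper: report requested names that are absent from a
# listAttributes()/listFilters() style table (a data frame with a 'name' column).
missing_names <- function(requested, available) {
  setdiff(requested, available$name)
}

# Stand-in for listAttributes(variation), limited to the rows shown in the transcript.
attrs <- data.frame(
  name = c("refsnp_id", "refsnp_source", "refsnp_source_description",
           "chr_name", "chrom_start", "chrom_end", "chrom_strand", "allele"),
  stringsAsFactors = FALSE
)

wanted <- c("refsnp_id", "chr_name", "chrom_start", "minor_allele")

# Names in 'wanted' that the stand-in table does not list.
missing_names(wanted, attrs)
```

With a live connection the same check would be `missing_names(wanted, listAttributes(variation))`; the full table is longer than the stand-in, so the result may differ.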


                                                                                                                                                                                                                                                                                        

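Next time we move on to Chapter 9, clustering.
Each of you will prepare your own data.
Rather than bioinformatics data, please find a general dataset that looks likely to give interesting results.
Prepare the data as a numeric matrix in a TSV or CSV file.
Before the next session, confirm that you can read the data file into R (homework).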
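
A minimal sketch of the homework check, assuming a CSV file whose first column holds row names and whose remaining columns are numeric. The file contents and the column names (`height`, `weight`) are hypothetical example data written to a temporary file.

```r
# Hypothetical example data: a small numeric table written to a temporary CSV file.
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(height = c(150.2, 162.5, 171.0),
             weight = c(48.1, 55.3, 68.9),
             row.names = c("a", "b", "c")),
  csv_path
)

# The first column holds the row names; for a TSV file use read.delim() instead.
dat <- read.csv(csv_path, row.names = 1)
mat <- as.matrix(dat)

# Confirm the file loaded as a numeric matrix with no missing values.
stopifnot(is.numeric(mat), !anyNA(mat))
dim(mat)
```

If `is.numeric(mat)` is FALSE, a column probably contains text or a stray header row, which is worth fixing before clustering.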