Skip to content

Latest commit

 

History

History
314 lines (236 loc) · 14.4 KB

160530.md

File metadata and controls

314 lines (236 loc) · 14.4 KB
--------[p.233]-------------------------
Chapter 8 Analyzng NGS Data

NGS data

<FASTQ形式とは>
--------------[FASTQ形式]-----------------------------
@ID
塩基配列
+
クオリティスコア
-------------------------------------------------------------
4行単位の配列記載形式

*https://ja.wikipedia.org/wiki/Fastq
*https://en.wikipedia.org/wiki/FASTQ_format

<クオリティスコア>
- Sanger sequencingのPhred scoreに同じ
- SangerのQuality Score範囲0~40が標準
  (https://en.wikipedia.org/wiki/FASTQ_formatのrange参照)
- NGS配列クオリティは、通常20以上、厳しめで30以上を閾値とする場合が多い。

QS =-10 log10 P_err

QS: Quality score
P_err: Estimated error probability

P_err(QS=20)=10^-2=0.01 (99%精度)
P_err(QS=30)=10^-3=0.001 (99.9%精度)

*https://ja.wikipedia.org/wiki/Phred%E3%82%AF%E3%82%AA%E3%83%AA%E3%83%86%E3%82%A3%E3%82%B9%E3%82%B3%E3%82%A2
 
------[p.235]----------------------------

Querying the SRA database

SRA database
次世代シークエンサの出力配列データベース(NCBI/EBI/DDBJが登録・検索を提供している)


参考 
*DDBJ SRA
https://trace.ddbj.nig.ac.jp/dra/index.html
*SRAのデータモデル PRJD/SAMD/DRR/DRX)
https://trace.ddbj.nig.ac.jp/dra/submission.html


> source("http://bioconductor.org/biocLite.R")
Bioconductor version 3.1 (BiocInstaller 1.18.5), ?biocLite for help
A newer version of Bioconductor is available for this version of R, ?BiocUpgrade for help
> 
> biocLite("SRAdb")
BioC_mirror: http://bioconductor.org
Using Bioconductor version 3.1 (BiocInstaller 1.18.5), R version 3.2.5.
Installing package(s) ‘SRAdb’
also installing the dependency ‘GEOquery’
> library(SRAdb)
 要求されたパッケージ RSQLite をロード中です 
 要求されたパッケージ DBI をロード中です 
 要求されたパッケージ graph をロード中です 
 要求されたパッケージ RCurl をロード中です 
 要求されたパッケージ bitops をロード中です 
Setting options('d


> sfile <- getSRAdbFile() #時間がかかる
trying URL 'http://gbnci.abcc.ncifcrf.gov/backup/SRAmetadb.sqlite.gz'
Content type 'application/x-gzip' length 1324339570 bytes (1263.0 MB)

downloaded 1263.0 MB

Unzipping...

Metadata associate with downloaded file:

c("schema version", "creation timestamp")c("1.0", "2016-04-29 05:53:15")



※ダウンロードができない場合は、ブラウザから直接ダウンロードして下さい。
ダウンロード後のコマンドは次の赤文字に変更して実行
sracon <- dbConnect(SQLite(),dbname="SRAmetadb.sqlite") 


> sracon <- dbConnect(SQLite(),sfile)
> sratable <- dbListTables(sracon)
> dbListFields(sracon,"study")
 [1] "study_ID"             "study_alias"          "study_accession"     
 [4] "study_title"          "study_type"           "study_abstract"      
 [7] "broker_name"          "center_name"          "center_project_name" 
[10] "study_description"    "related_studies"      "primary_study"       
[13] "sra_link"             "study_url_link"       "xref_link"           
[16] "study_entrez_link"    "ddbj_link"            "ena_link"            
[19] "study_attribute"      "submission_accession" "sradb_updated"       
> 



> myhit <- dbGetQuery(sracon,paste("select study_accession, study_title from study where","study_description like'%embryo'",sep=" "))


> myhit
  study_accession                                                  study_title
1       SRP001353 GSE17621: RNA-Seq of Drosophila developmental stage 16-18 hr
> myhit2 <- getSRA(search_terms = "brain", out_types =
 c('run','study'),sracon)






> head(myhit2 )
  run_alias       run   run_date updated_date   spots     bases run_center experiment_name
1 DRR000062 DRR000062 2008-09-25   2015-02-06 6095377 219433572     UT-MGS       DRX000023
2 DRR000061 DRR000061 2008-09-25   2015-02-06 6082236 218960496     UT-MGS       DRX000023
3 DRR000063 DRR000063 2008-09-25   2015-02-06 6062563 218252268     UT-MGS       DRX000023
4 DRR000060 DRR000060 2008-09-25   2015-02-06 6219174 223890264     UT-MGS       DRX000023
5 DRR000072 DRR000072 2009-02-02   2015-02-06 9192512 330930432     UT-MGS       DRX000026
6 DRR000073 DRR000073 2009-02-02   2015-02-06 9306929 335049444     UT-MGS       DRX000026
  run_url_link run_entrez_link                            run_attribute study_alias     study
1         <NA>            <NA> notes: repackaged seq prb sig2 int nse:    DRP000023 DRP000023
2         <NA>            <NA> notes: repackaged seq prb sig2 int nse:    DRP000023 DRP000023
3         <NA>            <NA> notes: repackaged seq prb sig2 int nse:    DRP000023 DRP000023
4         <NA>            <NA> notes: repackaged seq prb sig2 int nse:    DRP000023 DRP000023
5         <NA>            <NA> notes: repackaged seq prb sig2 int nse:    DRP000026 DRP000026
6         <NA>            <NA> notes: repackaged seq prb sig2 int nse:    DRP000026 DRP000026
                                                                             study_title
1     Massive transcriptional start site mapping of human fetal brain             cells.
2     Massive transcriptional start site mapping of human fetal brain             cells.
3     Massive transcriptional start site mapping of human fetal brain             cells.
4     Massive transcriptional start site mapping of human fetal brain             cells.
5 Massive transcriptional start site mapping of human clonetech brain             cells.
6 Massive transcriptional start site mapping of human clonetech brain             cells.
              study_type
1 Transcriptome Analysis
2 Transcrip

> tail(myhit2)
                                    run_alias        run run_date updated_date    spots
50686 Cinel_E1tRNA_ACAGTGAT_L001_R2_001.fastq SRR3406053     <NA>   2016-04-25 28361074
50687 Cinel_E4tRNA_ACTTGAAT_L001_R1_001.fastq SRR3406059     <NA>   2016-04-25 30660093
50688 Cinel_C4tRNA_TGACCAAT_L001_R2_001.fastq SRR3406052     <NA>   2016-04-25 29811718
50689 Cinel_C3tRNA_TTAGGCAT_L001_R2_001.fastq SRR3406036     <NA>   2016-04-25 29273530
50690 Cinel_E2tRNA_GCCAATAT_L001_R1_001.fastq SRR3406054     <NA>   2016-04-25 30121607
50691 Cinel_E3tRNA_CAGATCAT_L001_R1_001.fastq SRR3406055     <NA>   2016-04-25 31079186
           bases run_center experiment_name run_url_link run_entrez_link run_attribute
50686 5623923171       <NA>            <NA>         <NA>            <NA>          <NA>
50687 6075801870       <NA>            <NA>         <NA>            <NA>          <NA>
50688 5891620039       <NA>            <NA>         <NA>            <NA>          <NA>
50689 5808738827       <NA>            <NA>         <NA>            <NA>          <NA>
50690 5969233028       <NA>            <NA>         <NA>            <NA>          <NA>
50691 6155388947       <NA>            <NA>         <NA>            <NA>          <NA>
      study_alias     study                                                     study_title
50686 PRJNA318819 SRP073619 Spodoptera frugiperda indirect predator exposure-induced stress
50687 PRJNA318819 SRP073619 Spodoptera frugiperda indirect predator exposure-induced stress
50688 PRJNA318819 SRP073619 Spodoptera frugiperda indirect predator exposure-induced stress
50689 PRJNA318819 SRP073619 Spodoptera frugiperda indirect predator exposure-induced stress
50690 PRJNA318819 SRP073619 Spodoptera frugiperda indirect predator exposure-induced stress
50691 PRJNA318819 SRP073619 Spodoptera frugiperda indirect predator exposure-induced stress
                   study_type
50686 Whole Genome Sequencing
50687 Whole Genome Sequencing
50688 Whole Genome Sequencing
50689 Whole Genome Sequencing
50690 Whole Genome Sequencing
50691 Whole Genome Sequencing



> summary(myhit2)
  run_alias             run              run_date         updated_date      
 Length:50691       Length:50691       Length:50691       Length:50691      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                 
     spots               bases            run_center        experiment_name   
 Min.   :0.000e+00   Min.   :0.000e+00   Length:50691       Length:50691      
 1st Qu.:2.722e+06   1st Qu.:2.308e+08   Class :character   Class :character  
 Median :1.036e+07   Median :9.148e+08   Mode  :character   Mode  :character  
 Mean   :2.939e+07   Mean   :3.948e+09                                        
 3rd Qu.:3.409e+07   3rd Qu.:3.634e+09                                        
 Max.   :4.320e+09   Max.   :4.090e+11                                        
 run_url_link       run_entrez_link    run_attribute      study_alias       
 Length:50691       Length:50691       Length:50691       Length:50691      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
    study           study_title         study_type        study_abstract    
 Length:50691       Length:50691       Length:50691       Length:50691      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
 center_project_name study_description  study_url_link     study_entrez_link 
 Length:50691        Length:50691       Length:50691       Length:50691      
 Class :character    Class :character   Class :character   Class :character  
 Mode  :character    Mode  :character   Mode  :character   Mode  :character  
                                                                          
 study_attribute    related_studies    primary_study     
 Length:50691       Length:50691       Length:50691      
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character  
                                                         


> myhit3 <- getSRA(search_terms = '"Alzheimers" OR "EPILEPSY"', out_types = c('sample'),sracon)
> 
> head(myhit3)
                     sample_alias    sample taxon_id common_name anonymized_name individual_name
1                          cec411 ERS354366     9606       human            <NA>            <NA>
2 qiime_ppdid_586:10288.000027209 ERS877305   408170        <NA>            <NA>            <NA>
3 qiime_ppdid_586:10288.000027805 ERS877375   408170        <NA>            <NA>            <NA>
4 qiime_ppdid_586:10288.000028250 ERS877431   408170        <NA>            <NA>            <NA>
5 qiime_ppdid_586:10288.000027210 ERS877306   408170        <NA>            <NA>            <NA>
6 qiime_ppdid_586:10288.000027803 ERS877374   408170        <NA>            <NA>            <NA>
                   description sample_url_link sample_entrez_link
1                         <NA>            <NA>               <NA>
2 American Gut Round 20 sample            <NA>               <NA>
3 American Gut Round 20 sample            <NA>               <NA>
4 American Gut Round 20 sample            <NA>               <NA>
5 American Gut Round 20 sample            <NA>               <NA>
6 American Gut Round 20 sample            <NA>               <NA>
                                                            
------[p.237]--------------------------

Downloading data from the SRA database


> conversion <- sraConvert(c('ERS354366','SRS266589'),sra_con=sracon)
>

> conversion
     sample submission     study experiment       run
1 ERS354366  ERA252779 ERP003987  ERX324104 ERR351268
2 SRS266589  SRA046961 SRP008797  SRX100465 SRR351672
3 SRS266589  SRA046961 SRP008797  SRX100465 SRR351673
> 
> 

> rs <- getSRAinfo(c("SRX100465"),sracon,sraType="sra")
> rs
                                                                                                            ftp
1 ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX100/SRX100465/SRR351672/SRR351672.sra
2 ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX100/SRX100465/SRR351673/SRR351673.sra
  experiment     study    sample       run size(KB) date
1  SRX100465 SRP008797 SRS266589 SRR351672     <NA> <NA>
2  SRX100465 SRP008797 SRS266589 SRR351673     <NA> <NA>
> 
> 
> 

> getSRAfile(c("SRR351672","SRR351673"),sracon,fileType='fastq')
Files are saved to: 
'C:/Users/ekaminuma/Documents'

trying URL 'ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR351/SRR351672/SRR351672.fastq.gz'

<ファイルサイズが大きい為、Sessionメニュの「Interrupt R」でダウンロード中止しておく>

==================================================
------------[p.239]-----------------------
Reading FASTQ in R

> install.packages("R.utils")
Installing package into ‘C:/Users/ekaminuma/Documents/R/win-library/3.2’ 
> library(R.utils)
 要求されたパッケージ R.oo をロード中です 
 要求されたパッケージ R.methodsS3 をロード中です 
R.methodsS3 v1.7.1 (2016-02-15) successfully loaded. See ?R.methodsS3 for help.
R.oo v1.20.0 (2016-02-17) successfully loaded. See ?R.oo for help.


> download.file(url="ftp://ftp.ddbj.nig.
ac.jp/ddbj_database/dra/fastq/SRA000/SRA
000241/SRX000122/SRR000648.fastq.bz2"
,destfile="SRR000648.fastq.bz2") 
trying URL 'ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/SRA000/SRA000241/SRX000122'
downloaded 1772 bytes



> 


> bunzip2(list.files(pattern=".fastq.bz2$")) 圧縮を解凍






> 


> biocLite("ShortRead")
BioC_mirror: http://bioconductor.org
Using Bioconductor version 3.1 (BiocInstaller 1.18.5), R version 3.2.5.
> library(ShortRead)
 要求されたパッケージ BiocGenerics をロード中です 
 要求されたパッケージ parallel をロード中です 

> mydat <- readFastq(getwd(),pattern=".fastq")  関数がサポートされていない
Error: could not find function "readFastq"


> readLines("SRR000648.fastq",4)
[1] "@SRR000648.1 EUEMSW405C5XBJ length=218"                                                                                     
[2] "TTGCGTGGGAGCATTTTATTTCATTTACAAGGCCGTGGGAAGTCATAAAATTAATTTCGGCATAACCATGTGTTCAGTGTGTATACAGTGAGAGTAGCTCCCGCTGCCCATTGTTTCAAGACTTCTGCTTCTTGTGGGTTGTGGCTGGAATGAAGACACCGGTCTTCACTGCAATAAGGTTGTTGTTTCTGATTTCCAACCCCAAATCCGTCCAGCTT"
[3] "+SRR000648.1 EUEMSW405C5XBJ length=218"                                                  
[4] "C==<==FB1==<<HD8+<FB1==FB1<=B;C<C<<<EA/A9=;<<=FC6'C=C=FB1<C<<<=C=C<<=<==C=<<<=<<=<====<===<=<==<<;<FB0==<<GB1<C=<FB0=C<=:=C=<===C=;B;;=B=)C<8<?6<=B;B;<<B;=8:;C<C=;;C=<<<===C;;A9C<C=8B;;FB0<<=<GC3C=C<FB4$A=(=C=:<A9=;<B<"