In line with the topic, this project also implements local and global sequence alignment.
However, this is done with two programs rather than one.
- The program semi_interval first calculates the best score under the semi-global settings for x and y, then generates the corresponding semi-global intervals of x and y.
- The program alignment aligns the sequences x and y within that interval. The alignment itself is global, but because it is restricted to the semi-global interval, the result is the same as a semi-global sequence alignment.
You can edit the configuration in myconfig.h. This header is compiled into the program, so the preprocessor specializes the code for the chosen settings, for example by removing branches and class members ("optimized" here is my own term, not compiler optimization).
In this file you can set the number of CUDA threads per block, whether the start and end of sequences x and y are free or fixed, the data type of the score matrix, and so on.
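To illustrate what "free" and "fixed" ends mean for the score, here is a minimal Python sketch. It is not the CUDA implementation of this project: the function name `best_score`, the `(start_free, end_free)` flag tuples, and the linear gap penalty (which ignores the separate extension cost) are all assumptions made for the example.

```python
def best_score(x, y, sub, gap, x_free=(False, False), y_free=(False, False)):
    """Best alignment score when sequence ends may be free or fixed.

    x_free / y_free are (start_free, end_free) flags (hypothetical names).
    sub(a, b) returns the substitution score of two bases.
    gap is a linear per-gap penalty, a simplification of gap + extension.
    """
    n, m = len(x), len(y)
    neg_inf = float("-inf")
    best = neg_inf
    prev = []
    for i in range(n + 1):
        cur = [neg_inf] * (m + 1)
        for j in range(m + 1):
            cands = []
            # The alignment may start at (i, j) only if every skipped prefix is free.
            if (i == 0 or x_free[0]) and (j == 0 or y_free[0]):
                cands.append(0)
            if i > 0 and j > 0:
                cands.append(prev[j - 1] + sub(x[i - 1], y[j - 1]))
            if i > 0:
                cands.append(prev[j] + gap)      # gap in y
            if j > 0:
                cands.append(cur[j - 1] + gap)   # gap in x
            cur[j] = max(cands)
            # The alignment may end at (i, j) only if every skipped suffix is free.
            if (i == n or x_free[1]) and (j == m or y_free[1]):
                best = max(best, cur[j])
        prev = cur
    return best
```

With all ends fixed this reduces to a global alignment score; with all ends free it behaves like a local alignment score. For example, with match 1, mismatch -1 and gap -2, aligning "TTACGTT" against "ACG" gives -5 when everything is fixed and 3 when both ends of x are free.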
make
Build the programs.
make clean
Remove the built programs.
./semi_interval.out <x.txt> <y.txt> <best interval.txt> <score.txt>
Finds the best intervals. The semi-global settings and the data type of the score matrix must be configured in myconfig.h.
- <x.txt> <y.txt>: the files containing the input sequences.
- <best interval.txt>: the best intervals are stored in this file.
- <score.txt>: the score matrix is read from this file.
./alignment.out <x.txt> <y.txt> <best interval.txt> <score.txt> <alignment.txt>
Generates the alignment from the <best interval.txt> produced by semi_interval.out. This is a global alignment restricted to the interval in <best interval.txt>.
- <x.txt> <y.txt>: the files containing the input sequences.
- <best interval.txt>: the best interval is read from this file (only the first line, i.e. the first interval, is used).
- <score.txt>: the score matrix is read from this file.
- <alignment.txt>: the alignment is stored in this file.
./cpu.out <x.txt> <y.txt> <score.txt>
For speed testing only: calculates and prints the global sequence alignment score.
- <x.txt> <y.txt>: the files containing the input sequences.
- <score.txt>: the score matrix is read from this file.
The file <score.txt> stores the score matrix used by the programs alignment.out and semi_interval.out.
input format
<number base>
<base> ...
<score matrix> ...
.
.
.
<gap>
<extension>
<base> can be any character (including a space), but cannot be a newline.
input example
4
ATGC
1 -5 -5 -1
-5 1 -1 -5
-5 -1 1 -5
-1 -5 -5 1
-2
-1
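The format above can be read with a few lines of Python. This is only a sketch of a reader, not the parser used by the programs. The function name `parse_score_txt` is made up for the example, and it assumes that all bases sit on one line without separators (as in the example) and that numeric values can be read as floats.

```python
def parse_score_txt(text):
    """Parse the score.txt layout: base count, bases, matrix rows, gap, extension."""
    lines = text.split("\n")
    n = int(lines[0])
    # Take the bases as raw characters so that a space can be a base
    # (assumption: no separators between bases, as in the example file).
    bases = lines[1][:n]
    if len(bases) != n:
        raise ValueError("expected %d bases, got %d" % (n, len(bases)))
    matrix = []
    for r in range(n):
        row = [float(v) for v in lines[2 + r].split()]
        if len(row) != n:
            raise ValueError("row %d does not have %d entries" % (r, n))
        matrix.append(row)
    gap = float(lines[2 + n])
    extension = float(lines[3 + n])
    return {"bases": bases, "matrix": matrix, "gap": gap, "extension": extension}
```

Calling it on the example above yields the bases "ATGC", a 4x4 matrix, a gap of -2 and an extension of -1.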
A sequence file (<x.txt>, <y.txt>) contains only <base> characters and newlines. The program ignores newlines when it reads the sequence file.
input example
AATTCCGAT
AATTCGTT
TGGAAT
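Because newlines are ignored, the three lines above form one sequence. A minimal Python sketch of that rule follows; the helper names `strip_newlines` and `read_sequence` are made up here, and only "\n" is treated as a newline, which is an assumption.

```python
def strip_newlines(text):
    """Return the sequence with every newline removed; all other chars are bases."""
    return text.replace("\n", "")


def read_sequence(path):
    """Read a sequence file and drop its newlines (hypothetical helper)."""
    with open(path) as handle:
        return strip_newlines(handle.read())
```

Applied to the example, this gives the single sequence AATTCCGATAATTCGTTTGGAAT.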
output format by semi_interval.out
<best score> <x start> <x end> <y start> <y end>
.
.
.
input format by alignment.out
<best score> <x start> <x end> <y start> <y end>
If there are multiple lines, only the first line is consumed by alignment.out.
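Reading the first interval from such a file could look like the sketch below. The `Interval` type and the function `first_interval` are hypothetical names, and the sketch assumes the five fields are separated by whitespace.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    score: float
    x_start: int
    x_end: int
    y_start: int
    y_end: int


def first_interval(text):
    """Parse only the first line: <best score> <x start> <x end> <y start> <y end>."""
    first = text.splitlines()[0]
    score, xs, xe, ys, ye = first.split()  # assumption: whitespace-separated fields
    return Interval(float(score), int(xs), int(xe), int(ys), int(ye))
```

Any further lines are simply not looked at, which mirrors the rule that only the first line is consumed.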
output example
- C
A A
T T
G -
C C
- C
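The vertical layout is easier to compare by eye when it is turned into two horizontal rows. The sketch below shows one way to do that in Python. It assumes each line is exactly one character, one space, and one character, and the function name `vertical_to_horizontal` is made up; it is not necessarily how the scripts in this repository do it.

```python
def vertical_to_horizontal(text):
    """Turn lines of '<char> <char>' into two strings, one per column."""
    first_col, second_col = [], []
    for line in text.split("\n"):
        if not line:
            continue  # skip the empty piece after a trailing newline
        if len(line) != 3 or line[1] != " ":
            raise ValueError("unexpected line: %r" % line)
        first_col.append(line[0])
        second_col.append(line[2])
    return "".join(first_col), "".join(second_col)
```

For the output example above, the two rows are -ATGC- and CAT-CC.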
You can use my Python scripts to calculate alignments automatically for tasks laid out in a specific file structure. This is useful when you have many alignments to run.
├───score.json
├───tasks
│ ├───100K-100K
│ │ └───x.txt
│ │ └───y.txt
│ ├───100K-10K
│ │ └───x.txt
│ │ └───y.txt
│ ├───10K-100K
│ │ └───x.txt
│ │ └───y.txt
│ ├───10K-10K
│ │ └───x.txt
│ │ └───y.txt
│ └───1K-1K
│ │ └───x.txt
│ │ └───y.txt
after command ./gpu_test.sh -a
├───tasks
│ ├───100K-100K
│ │ └───out
│ │ └───best.txt
│ │ └───alm
│ │ └───...
│ ├───100K-10K
│ │ └───out
│ │ └───best.txt
│ │ └───alm
│ │ └───...
│ ├───10K-100K
│ │ └───out
│ │ └───best.txt
│ │ └───alm
│ │ └───...
│ ├───10K-10K
│ │ └───out
│ │ └───best.txt
│ │ └───alm
│ │ └───...
│ └───1K-1K
│ │ └───out
│ │ └───best.txt
│ │ └───alm
│ │ └───...
The folder alm/ contains the alignments (<alignment.txt>) generated by alignment.out. If you don't want to generate them, omit the -a argument.
The file best.txt stores the best intervals generated by semi_interval.out.
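As an illustration of how the tasks layout maps to program calls, the following Python sketch walks the tasks folder and builds one semi_interval.out command per task. It only builds the argument lists and runs nothing. The function name `plan_semi_interval_commands` and the default score path are assumptions, and this is not the code of the actual scripts.

```python
from pathlib import Path


def plan_semi_interval_commands(root, score_txt="temp/score.txt"):
    """Build one semi_interval.out argument list per task folder under root/tasks."""
    commands = []
    for task in sorted(Path(root, "tasks").iterdir()):
        x_file, y_file = task / "x.txt", task / "y.txt"
        if not (x_file.is_file() and y_file.is_file()):
            continue  # not a complete task folder
        best_file = task / "out" / "best.txt"
        commands.append(
            ["./semi_interval.out", str(x_file), str(y_file), str(best_file), score_txt]
        )
    return commands
```

Each returned list could then be passed to something like `subprocess.run` once the out folder exists.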
make cpu_test
Calculates the global alignment score of each task on the CPU only. There is no semi-global function here; it exists so that you can compare CPU performance with GPU performance. The global alignment score should be the same as that of the CUDA program when the start and end of x and y are set to fixed.
make gpu_test
Calculates the best scores and their intervals with semi_interval.out, then runs alignment.out to generate the alignments for those intervals.
make clean_tasks
Removes the alignments in tasks.
This is very important: the Python scripts accept only score.json, not score.txt. I think score.json is easier to edit.
example for DNA
{
"chars":["A","T","G","C"],
"matrix":[
[1,-1,-1,-1],
[-1,1,-1,-1],
[-1,-1,1,-1],
[-1,-1,-1,1]
],
"gap":-2,
"extension":-1
}
example for a-z, A-Z and space
{
"chars":[
{
"l":"a",
"r":"z"
},
{
"l":"A",
"r":"Z"
},
" "
],
"matrix":{
"match":1,
"miss":-1
},
"gap":-2,
"extension":-1
}
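The two examples suggest how a score.json could be expanded into the score.txt layout. The Python sketch below is a guess at that mapping, not the code behind matrix_transform.sh. It assumes that an "l"/"r" pair is an inclusive character range, and that "match" fills the diagonal while "miss" fills every other cell. All function names are made up.

```python
import json


def expand_chars(chars):
    """Expand single characters and {"l": ..., "r": ...} ranges into a flat list."""
    out = []
    for item in chars:
        if isinstance(item, dict):
            # Assumption: the range is inclusive on both ends.
            out.extend(chr(c) for c in range(ord(item["l"]), ord(item["r"]) + 1))
        else:
            out.append(item)
    return out


def expand_matrix(matrix, n):
    """Return an explicit n x n matrix; a match/miss dict becomes a diagonal matrix."""
    if isinstance(matrix, dict):
        return [
            [matrix["match"] if i == j else matrix["miss"] for j in range(n)]
            for i in range(n)
        ]
    return matrix


def score_json_to_txt(json_text):
    """Render score.json content in the score.txt layout."""
    cfg = json.loads(json_text)
    chars = expand_chars(cfg["chars"])
    matrix = expand_matrix(cfg["matrix"], len(chars))
    lines = [str(len(chars)), "".join(chars)]
    lines += [" ".join(str(v) for v in row) for row in matrix]
    lines += [str(cfg["gap"]), str(cfg["extension"])]
    return "\n".join(lines) + "\n"
```

Under these assumptions the second example expands to 53 bases: a-z, A-Z and the space.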
- GCC: 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1)
- CUDA: 11.5
- NVIDIA-SMI: 495.29.05
- Operating System: Ubuntu 20.04.3 LTS (Focal Fossa)
- Make: GNU Make 4.2.1
- Python3: 3.8.10
Run the following command to get score.txt:
$ ./matrix_transform.sh score.json temp/score.txt
This transforms score.json into temp/score.txt.
Then run the following command.
$ ./semi_interval.out "tasks/1K-1K/x.txt" "tasks/1K-1K/y.txt" "tasks/1K-1K/out/best.txt" temp/score.txt
semi-global-setting: src/headers/myconfig.h
- x: [fixed, fixed]
- y: [fixed, fixed]
score matrix: temp/score.txt
sequence X: tasks/1K-1K/x.txt
- size: 972
sequence Y: tasks/1K-1K/y.txt
- size: 979
time taken: 0.01s
[OUTPUT]
best intervals: tasks/1K-1K/out/best.txt
best score: -281.20000
inteval: X=[1, 972] Y=[1, 979]
- score: -281.20000
$ ./cpu.out "tasks/1K-1K/x.txt" "tasks/1K-1K/y.txt" temp/score.txt
score matrix: temp/score.txt
sequence X: tasks/1K-1K/x.txt
- size: 972
sequence Y: tasks/1K-1K/y.txt
- size: 979
time taken: 0.02s
[OUTPUT]
best score: -281.2
Run the Python script gpu_test:
$ ./gpu_test.sh -a
You will see it run the following command.
$ ./alignment.out "tasks/1K-1K/x.txt" "tasks/1K-1K/y.txt" temp/best.txt temp/score.txt "tasks/1K-1K/out/alm/-281.20000-1-972-1-979.txt"
score matrix: temp/score.txt
interval: temp/best.txt
- index: 0
- score: -281.2
- sequence X: tasks/1K-1K/x.txt
- - interval: [1, 972]
- sequence Y: tasks/1K-1K/y.txt
- - interval: [1, 979]
time taken: 1.22s
[OUTPUT]
best score: -281.2
alignment: tasks/1K-1K/out/alm/-281.20000-1-972-1-979.txt
- score: -281.2
The resulting alignment is stored in "tasks/1K-1K/out/alm/-281.20000-1-972-1-979.txt". Run the following command to show it.
$ ./v2h.sh tasks/1K-1K/out/alm/-281.20000-1-972-1-979.txt
ATGCTAAAAACCCTCAATAAACTAGGTACTGATGGAACATATCTCAAAAT
--G---ACATCCAT---T----TTTGTTGTTATCCAACATCTGCCCACCG
AATAATACCTATTTATGAAAAACCCACAGCCAATACTGAATGGTGAAAAA
A-TATT-CCTTTTGAAGACTA-CCC-CATT-AATCTTGA-GAGTGG----
CTGGAAGCATTCCCTTTGAAAACCAGCACAAG--ACAAGGATGCCCTATC
CTGGTA-C--TCCCTCT-AAGAC-ATCGAAAGGGACTAGCTTTCCAAA-C
...