REFERENCE INPUT FILES
**************************
a) .leb36 file (can be generated by TRDB database download tool or trf2proclu program)
Copyright (c) Gary Benson and Alfredo Rodriguez, 2004
Last Revision Date Jan 23, 2013 by Yevgeniy Gelfand
PROFILE STORAGE FORMAT EXPLANATION
The file format used to store profiles is text-based.
Each profile is stored on its own line, in the following format:
KEY PATLEN COPYNUMBER PROFLEN LEB36PROFILE LEB36PROFILERC Na Nc Ng Nt LEFTFLANK|RIGHTFLANK
For example, a profile record might look like:
175344010 7 4.43 7 7 X7X7MIT1X7T1X7 Q1T1T1T1X7T1X7 1 22 8 0 CCGGGGACAGCCAAGGAGGAACGCGAGGAGCCTGAGAACGCGAGGCCCTAGGGGCAGCCA|AGCCGTGCTGCCTGCCCTCAGGGACCTATAAAGCCCACTTTGCTACAAACACAGT
Here the key of the profile in an existing database is 175344010,
the size of the consensus pattern is 7, there are 4.43 copies
of the pattern, the length of the profile is 7, and there are
14 LEB36 digits representing the seven compositions in the
profile, exactly two LEB36 digits per standard composition.
These are followed by the counts of As, Cs, Gs and Ts in the repeat
sequence, and then by 50 characters (or fewer) of the flanks,
separated by the '|' character.
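The record layout above can be read with a short Python sketch like the one below. It is only an illustration, not part of the TRDB or trf2proclu tools; the names `Leb36Record` and `parse_leb36_line` are hypothetical. Note that the example record carries one more numeric field (a second 7) than the format line names, so this sketch makes an assumption: it reads KEY, PATLEN, COPYNUMBER and PROFLEN from the left, and the two profiles, the four counts and the flanks from the right, which works for either layout.

```python
from dataclasses import dataclass


@dataclass
class Leb36Record:
    """One line of a .leb36 file (field names follow the format line)."""
    key: str
    patlen: int
    copynumber: float
    proflen: int
    profile: str       # LEB36PROFILE, two LEB36 digits per composition
    profile_rc: str    # LEB36PROFILERC
    counts: dict       # {"A": Na, "C": Nc, "G": Ng, "T": Nt}
    left_flank: str
    right_flank: str


def parse_leb36_line(line: str) -> Leb36Record:
    """Parse one whitespace-separated .leb36 record.

    Assumption: the leading four fields are read from the left and the
    trailing seven from the right, so any extra field in the middle
    (as in the example record) is ignored.
    """
    fields = line.split()
    if len(fields) < 11:
        raise ValueError(f"expected at least 11 fields, got {len(fields)}")

    # The last field holds both flanks, separated by '|'.
    left, sep, right = fields[-1].partition("|")
    if not sep:
        raise ValueError("flank field has no '|' separator")

    na, nc, ng, nt = (int(x) for x in fields[-5:-1])
    return Leb36Record(
        key=fields[0],
        patlen=int(fields[1]),
        copynumber=float(fields[2]),
        proflen=int(fields[3]),
        profile=fields[-7],
        profile_rc=fields[-6],
        counts={"A": na, "C": nc, "G": ng, "T": nt},
        left_flank=left,
        right_flank=right,
    )
```

Reading from both ends keeps the sketch usable even if the exact number of length fields differs between tool versions, which this document does not settle.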
LEB36 stands for little-endian base-36 (alphanumeric); the least
significant of the two base-36 digits is written first:

Decimal    Base36 (2-digit)    LEB36
15         0F                  F0
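The conversion in the table can be sketched in Python as follows. This is a minimal illustration that assumes the digit alphabet is 0-9 followed by uppercase A-Z, as the example record suggests; the function names are hypothetical. The sketch only converts digits to integers and does not interpret what each integer means as a composition.

```python
# Assumed digit alphabet: 0-9 then A-Z, so 'F' is 15 and 'Z' is 35.
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def leb36_encode(value: int) -> str:
    """Encode 0..1295 as two base-36 digits, low-order digit first."""
    if not 0 <= value < 36 * 36:
        raise ValueError("value does not fit in two base-36 digits")
    high, low = divmod(value, 36)
    return DIGITS[low] + DIGITS[high]


def leb36_decode(pair: str) -> int:
    """Decode a two-character LEB36 pair back to an integer."""
    if len(pair) != 2:
        raise ValueError("a LEB36 pair has exactly two characters")
    # str.index raises ValueError for characters outside the alphabet.
    low = DIGITS.index(pair[0])
    high = DIGITS.index(pair[1])
    return low + 36 * high


def leb36_decode_profile(profile: str) -> list:
    """Split a profile string into pairs and decode each pair."""
    if len(profile) % 2 != 0:
        raise ValueError("profile length must be even")
    return [leb36_decode(profile[i:i + 2]) for i in range(0, len(profile), 2)]
```

With these definitions, `leb36_encode(15)` returns `"F0"` and `leb36_decode("F0")` returns 15, matching the table row.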
b) .seq file (can be generated by TRDB database download tool)
Each tandem repeat is stored on its own line, in the following format:
Repeatid,FirstIndex,LastIndex,CopyNumber,FastaHeader,FlankingLeft1000,Pattern,ArraySequence,FlankingRight1000,Conserved
Repeatid - key of the tandem repeat (TR) in an existing database
FirstIndex - first index of the TR in chromosome
LastIndex - last index of the TR in chromosome
CopyNumber - number of copies of the pattern
FastaHeader - FASTA header
FlankingLeft1000 - up to 1000 characters on the left of the TR
Pattern - consensus pattern of the TR
ArraySequence - nucleotide sequence of the TR
FlankingRight1000 - up to 1000 characters on the right of the TR
Conserved - ratio of matches in the TR alignment over the sum of matches, mismatches and indels
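A .seq line with the ten fields listed above could be split as in this Python sketch. It is an illustration under stated assumptions, not the TRDB tool's own reader: the names `SeqRecord` and `parse_seq_line` are hypothetical, and the sketch assumes that only the FASTA header may contain commas, so it takes four fields from the left and five from the right and treats whatever remains as the header.

```python
from typing import NamedTuple


class SeqRecord(NamedTuple):
    """One line of a .seq file (names mirror the field list)."""
    repeat_id: str
    first_index: int
    last_index: int
    copy_number: float
    fasta_header: str
    flanking_left: str
    pattern: str
    array_sequence: str
    flanking_right: str
    conserved: float


def parse_seq_line(line: str) -> SeqRecord:
    """Parse one comma-separated .seq record.

    Assumption: sequence fields and numbers never contain commas,
    while the FASTA header might.
    """
    head = line.rstrip("\r\n").split(",", 4)
    if len(head) != 5:
        raise ValueError("too few fields before the FASTA header")
    repeat_id, first, last, copies, rest = head

    # Peel the five trailing fields off the right; the remainder is the header.
    tail = rest.rsplit(",", 5)
    if len(tail) != 6:
        raise ValueError("too few fields after the FASTA header")
    header, left, pattern, array_seq, right, conserved = tail

    return SeqRecord(
        repeat_id=repeat_id,
        first_index=int(first),
        last_index=int(last),
        copy_number=float(copies),
        fasta_header=header,
        flanking_left=left,
        pattern=pattern,
        array_sequence=array_seq,
        flanking_right=right,
        conserved=float(conserved),
    )
```

Splitting from both ends is a design choice that avoids needing a quoting convention for the header, which this document does not describe.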