This repo contains a Python script, bin/seq-gen.py, to generate genetic sequences based on directives given in a JSON file. Python 3.6 and above should all work.
I wrote this because I wanted to be able to easily generate alignments of FASTA files with known properties, and to then examine the output of phylogenetic inference or recombination analysis programs run on the sequences. There are various other things you can do with it.
$ pip install seqgen
You can also get the source from PyPI or on Github.
In summary: put a specification for a set of sequences into a
JSON-formatted file. Pass this to bin/seq-gen.py
,
either using the --specification
option or on standard input.
Run with --help
or -h
to see all options:
$ seq-gen.py --help
usage: seq-gen.py [-h] [--specification FILENAME] [--defaultIdPrefix PREFIX]
[--defaultLength N]
Create genetic sequences according to a JSON specification file and write them
to stdout.
optional arguments:
-h, --help show this help message and exit
--specification FILENAME
The name of the JSON sequence specification file.
Standard input will be read if no file name is given.
(default: stdin)
--defaultIdPrefix PREFIX
The default prefix that sequence ids should have (for
those that are not named individually in the
specification file) in the resulting FASTA. Numbers
will be appended to this value. (default: seq-id-)
--defaultLength N
The default length that sequences should have (for
those that do not have their length given in the
specification file) in the resulting FASTA. (default:
100)
--format {fasta,fastq}
Set the default output format. The output format can be
set (via the specification file) for each set of reads, if
desired. This option just sets the default. If the format is
'fastq', the quality for each nucleotide will be a constant,
according to the value given to --quality (or 30 if --quality
is not used. (default: fasta)
--quality N
The quality value to use. This will result in FASTQ output.
The value will be converted to a single character, according
to the Phred scale. The numeric value you give will be added
to the value for '!' to get the character that will be used
for all quality scores. So use 0 for the lowest quality or,
e.g., 30 for a reasonably high quality. If --fastq is used
but --quality is not, a value of 30 will be used. (default: None)
Your JSON specifies what sequences you want created.
As an example, the JSON shown below specifies the following:
- Create a random sequence
A
of length 100 nucelotides. - Make 19 (approximately) 1% mutants from
A
. - Make a new sequence
B
approximately 15% different fromA
. - Make 19 (approximately) 1% mutants from
B
. - Make a recombinant sequence whose first 30 nucleotides are from
A
and last 70 nucleotides fromB
.
{
"variables": {
"length": 100
},
"sequences": [
{
"id": "seq-A",
"length": "%(length)d"
},
{
"from id": "seq-A",
"id prefix": "seq-A-mutant-",
"length": "%(length)d",
"count": 19,
"mutation rate": 0.01
},
{
"id": "seq-B",
"from id": "A",
"length": "%(length)d",
"mutation rate": 0.15
},
{
"from id": "seq-B",
"id prefix": "seq-B-mutant-",
"length": "%(length)d",
"count": 19,
"mutation rate": 0.01
},
{
"id": "recombinant",
"sections": [
{
"from id": "seq-A",
"start": 1,
"length": 30
},
{
"from id": "seq-B",
"start": 31,
"length": 70
}
]
}
]
}
Note: in JSON parlance, "object" is what Python programmers call a "dict". In the text below I am describing the JSON specification, so I'm using "object". If you're more comfortable thinking of the JSON as having "dicts", go right ahead.
The variables
section is optional. For convenience, if no variables
section is given, the JSON may simply be a list
of sequence objects.
There are many more small examples of usage in test/testSequences.py.
The sequence specification must be a list of objects. The full list of specification keys you can put in a object in the JSON is as follows:
alphabet
: A string of characters from which sequences will be drawn (seerandom nt
andrandom aa
below).count
: The number of sequences to generate from this object.description
: The sequence description. This will be appended to the FASTA id (separated by a space). Note that if a sequence has a description and you want to refer to it usingfrom id
(see below), you must include its id and description, separated by a single space.id
: The FASTA id to give this sequence.id prefix
: The prefix of the FASTA ids to give a set of sequences. A count will be appended to this prefix. This is useful when you specify acount
value.filename
: The file into which to write the sequences. The first time a file is mentioned, it is truncated. Subsequent output to the same file will be appended. This allows the use of the same file more than once in a specification.format
: Either "fasta" or "fastq". If the latter, the quality string will be set to the--quality
option passed toseq-gen.py
or the default value (30).from id
: The sequence should be based on another (already named) sequence in the JSON file. The value given should either be the exactid
of another sequence or else be theid prefix
of another sequence followed by a number. E.g., if you specify"count": 10
and"id prefix": "my-id"
for a set of ten sequences, and want to refer to the 3rd of them when specifying another sequence, you would use"from id": "my-id-3"
. The default id prefixes for sequences that are not given an id explicity isseq-id-
.length
: The sequence length.mutation rate
: A mutation rate to apply to the sequence.rc
(orreverse complement
) the sequence will be reverse complemented. Note that this happens before any mutations are applied.random aa
: The sequence should be made of random amino acids.random nt
: The sequence should be made of random nucleotides. This is the default.ratchet
: Iftrue
and a count and mutation rate are given, sequences after the first will be mutants of the previous sequence. This can be used to build a series of sequences that are successive mutants of one another. For the name derivation, see Muller's Ratchet.reverse complement
(orrc
). Reverse complement the sequence. If specified, This is done as the penultimate step, just before the sequence is mutated (if mutation has been requested).sections
: Gives a list of sequence sections used to build up another sequence (see example above). Each section is a JSON object that may contain the keys present in a regular sequence object, excludingcount
,description
,id
,id prefix
,name
, andsections
. The main idea here is to allow you to easily specify sequences that are recombinants.sequence
: Give the exact sequence of nucleotides or amino acids.sequence file
: Specify a FASTA file to get the sequence from. Only the first sequence in the file is used. The resulting sequence will have the id of the sequence in the file, unless anid
key is given.skip
: Iftrue
the sequence will not be output. This is useful either for temporarily omitting a sequence or for just giving a sequence (e.g., one read from a file) an id so it can be used in the construction of another sequence.start
: The (1-based) position at which a substring should start when a sequence refers to another (viafrom id
).
All specification keys are optional. A completely empty specification object will get you a sequence of the default length, with a default id, composed of random nucleotides. Hence:
$ echo '[{},{}]' | seq-gen.py
>seq-id-1
ACTCGTGCTATAGGGCGAATATCGCAAAATGCTCACATACCCAATAGCTTAGGAATAGTTCCTGTCGGGGCGCTCGTTGATTTAAGTCAATGAGCATCCT
>seq-id-2
CTTGAGATGATTCGGCAACGTTAGCCGATAGATCATGGAAGGAATACGGCTAAAATATTCAGGTAATTAATGGATACGTCCTAGATAAGTAGAATCGAAT
(Note that this example takes advantage of the convenience mentioned above).
Although the code will complain about unknown keys, it does not detect cases where you specify a sequence in two different ways. You'll have to play around and/or read the code in seqgen/sequences.py to see the order in which the various sequence specification keys are acted on.
To run the tests:
$ pytest
You can also use
$ tox
to run tests for various versions of Python.