Text files
- cat, head, tail, wc
- grep
- cut, sort, uniq
- advanced: sed, awk
Using nano, create a text file called ~/myself.tsv containing the following data, separated by tabs:
- Surname
- Name
- Research area
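Tabs and spaces look the same on screen, so it can help to check that the file really is tab-separated. The lines below are a minimal sketch of one way to do that: the file is written to a temporary directory with printf (a stand-in for the nano step), and the name and research area are invented placeholders.

```shell
# Write a one-line TSV to a temporary directory (placeholder values, not real data)
tmpdir=$(mktemp -d)
printf 'Doe\tJane\tPhage genomics\n' > "$tmpdir/myself.tsv"

# Count the tab-separated columns on each line: we expect 3
n_cols=$(awk -F '\t' '{ print NF }' "$tmpdir/myself.tsv")
echo "Columns: $n_cols"

# Print only the third column; the space inside it does not split the field
area=$(cut -f 3 "$tmpdir/myself.tsv")
echo "Research area: $area"
```

If the column count is 1, the fields were most likely separated with spaces instead of tabs.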
There is no single file extension for FASTA files. The most common and generic are “.fasta” and “.fa”; more specific ones include “.faa” for protein (amino acid) sequences and “.fna” for nucleic acids.
Let's start by listing all the files ending with .fna in our home directory:
find ~ -name "*.fna"
How many sequences are in those files? Last time we counted them by first selecting the lines containing '>':
grep '>' ~/learn_bash/phage/vir_cds_from_genomic.fna
We then passed the output of grep to the wc command, using a pipe. We used this trick mainly to practise with pipes, but grep has a switch (-c, for count) for this task:
# Counting sequences in a file:
grep '>' ~/learn_bash/phage/vir_cds_from_genomic.fna | wc -l
# Counting sequences in multiple files using wildcards:
grep -c '>' ~/learn_bash/phage/*.fna
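The two counting approaches can be tried on a throwaway file. This is a minimal sketch with made-up records in a temporary directory; it also anchors the pattern with ^ so that only lines starting with '>' are counted, a small precaution that is an addition to the commands above.

```shell
# Build a tiny, made-up FASTA file in a temporary directory (illustrative data only)
tmpdir=$(mktemp -d)
cat > "$tmpdir/toy.fna" <<'EOF'
>seq1 first toy record
ACGTACGTAC
>seq2 second toy record
GGGTTTAAAC
CCCGGG
>seq3
TTTT
EOF

# Count header lines directly with grep -c; ^ matches only at the start of a line
n_seqs=$(grep -c '^>' "$tmpdir/toy.fna")
echo "Sequences (grep -c): $n_seqs"

# Same count with the pipe approach; tr removes the padding some wc versions add
n_piped=$(grep '^>' "$tmpdir/toy.fna" | wc -l | tr -d ' ')
echo "Sequences (grep | wc -l): $n_piped"
```

Both variables should hold the same number, since grep -c counts matching lines just as wc -l counts the lines grep prints.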
The FASTQ format devotes four lines to each sequence, the last being an encoded version of the quality score of each nucleotide. There are some FASTQ files in a shared directory called /homes/qib/shared/reads/. Let's have a look:
# List the (compressed) FASTQ files in a specific directory of this repository
ls -l ~/learn_bash/files/*.fastq.gz
# Decompress them
gunzip ~/learn_bash/files/*.fastq.gz
# Display the first two reads
head -n 8 ~/learn_bash/files/Sample1_R1.fastq
How many reads? We can count the lines and then divide by 4!
wc -l ~/learn_bash/files/*.fastq
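The division can be left to the shell. The sketch below uses a toy FASTQ file with two invented reads, written to a temporary directory; the variable names are arbitrary.

```shell
# Create a toy FASTQ file with two made-up reads (4 lines each)
tmpdir=$(mktemp -d)
cat > "$tmpdir/toy.fastq" <<'EOF'
@read1
ACGTACGT
+
IIIIIIII
@read2
TTGGCCAA
+
IIIIHHHH
EOF

# Redirecting the file into wc prints only the number, without the file name
n_lines=$(wc -l < "$tmpdir/toy.fastq")

# Shell arithmetic: four lines per read
n_reads=$(( n_lines / 4 ))
echo "Reads: $n_reads"
```

This assumes a well-formed file where every record takes exactly four lines, which is the usual layout for sequencing reads.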
Or we can use a specific bioinformatics tool: seqkit. If we don't have it installed, we can install it with Miniconda:
# Install seqkit, if it's not installed
conda install -y -c bioconda seqkit
We can use the stats subcommand to count reads (and get more details on their lengths):
# Count reads
seqkit stats ~/learn_bash/files/*.fastq
The GFF (General Feature Format) is used to store annotations. An alternative format, called GTF, is more focused on gene annotations, while GFF is more generic. Both are TSV (tab-separated values) files, that is, tables where the boundaries between cells are marked by a single tab character.
The first lines optionally specify some metadata, and they are preceded by a #.
Let's see an example:
less -S ~/learn_bash/phage/vir_genomic.gff
# If we want to remove the header lines:
grep -v '^#' ~/learn_bash/phage/vir_genomic.gff | less -S
# If we want to widen the tab stops:
grep -v '^#' ~/learn_bash/phage/vir_genomic.gff | less -S -x 15
If we want to extract all the lines with CDSs, and then, among those, the lines containing the word capsid:
grep -w CDS ~/learn_bash/phage/vir_genomic.gff
grep -w CDS ~/learn_bash/phage/vir_genomic.gff | grep -i capsid
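Note that grep -w CDS matches the word CDS in any column, including the free-text attributes. Since the feature type is stored in the third column of a GFF file, a stricter filter can test that column with awk (one of the advanced tools listed at the top). The lines below are a sketch on three invented GFF records, not on the phage annotation.

```shell
# Three made-up, tab-separated GFF records in a temporary directory
tmpdir=$(mktemp -d)
printf 'chr1\ttoy\tgene\t1\t900\t.\t+\t.\tID=gene1;Note=contains CDS\n' > "$tmpdir/toy.gff"
printf 'chr1\ttoy\tCDS\t1\t900\t.\t+\t0\tID=cds1;product=major capsid protein\n' >> "$tmpdir/toy.gff"
printf 'chr1\ttoy\tCDS\t1000\t1500\t.\t-\t0\tID=cds2;product=tail fiber\n' >> "$tmpdir/toy.gff"

# grep -w finds the word CDS anywhere on the line, so the gene record matches too
n_grep=$(grep -w -c CDS "$tmpdir/toy.gff")
echo "Lines matched by grep -w: $n_grep"

# awk compares the third tab-separated column exactly
n_awk=$(awk -F '\t' '$3 == "CDS"' "$tmpdir/toy.gff" | wc -l | tr -d ' ')
echo "Lines with CDS in column 3: $n_awk"

# The capsid search can be chained after the column filter, as before
n_capsid=$(awk -F '\t' '$3 == "CDS"' "$tmpdir/toy.gff" | grep -i -c capsid)
echo "CDS lines mentioning capsid: $n_capsid"
```

On real annotations the two filters often agree; the column test only matters when the word CDS also appears in another field.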
A useful command to extract some columns from a text file is cut:
cut -f 1,3-5 ~/examples/phage/vir_genomic.gff
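cut combines well with sort and uniq, the other commands listed at the top, for example to count how many times each value occurs in a column. The following is a minimal sketch on an invented five-line table; the file name and its contents are placeholders.

```shell
# A made-up tab-separated table with the feature type in column 3
tmpdir=$(mktemp -d)
printf 'chr1\ttoy\tgene\t1\t900\n' > "$tmpdir/toy.tsv"
printf 'chr1\ttoy\tCDS\t1\t900\n' >> "$tmpdir/toy.tsv"
printf 'chr1\ttoy\tgene\t1000\t1500\n' >> "$tmpdir/toy.tsv"
printf 'chr1\ttoy\tCDS\t1000\t1500\n' >> "$tmpdir/toy.tsv"
printf 'chr1\ttoy\tCDS\t1600\t1900\n' >> "$tmpdir/toy.tsv"

# Keep column 3, sort so that identical values become adjacent, then count them
cut -f 3 "$tmpdir/toy.tsv" | sort | uniq -c

# Number of distinct values in column 3
n_types=$(cut -f 3 "$tmpdir/toy.tsv" | sort -u | wc -l | tr -d ' ')
echo "Distinct feature types: $n_types"

# Number of rows whose third column is exactly CDS (-x matches the whole line)
n_cds=$(cut -f 3 "$tmpdir/toy.tsv" | grep -c -x CDS)
echo "CDS rows: $n_cds"
```

The sort step is needed because uniq only collapses identical lines that are next to each other.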
GFF, GTF, SAM and VCF are all examples of tabular text files with tab-separated values. A smaller example will be easier to deal with:
# Try using a relative path!
cat ~/learn_bash/files/wine.tsv
If we want to sort by username, that is the third column of the file:
sort -k 3 ~/learn_bash/files/wine.tsv
Sometimes we need to increase the space used by tabs to have a clearer view:
sort -k 3 /homes/2020/binf/data/people.tsv | less -S -x 20
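By default sort splits fields on blanks, and -k 3 uses everything from the third field to the end of the line as the key. When cells may contain spaces, it is safer to declare the tab as separator and to restrict the key to one column. The sketch below assumes an invented three-column table (surname, name, username); it is not the content of the files used above.

```shell
# A made-up table where some cells contain spaces
tmpdir=$(mktemp -d)
printf 'De Luca\tAnna\tzanna\n' > "$tmpdir/toy_people.tsv"
printf 'Rossi\tMario\tmrossi\n' >> "$tmpdir/toy_people.tsv"
printf 'Smith\tJohn Paul\tajsmith\n' >> "$tmpdir/toy_people.tsv"

# Store a literal tab character in a variable to pass it to sort -t
TAB=$(printf '\t')

# -t sets the field separator; -k 3,3 starts and ends the sort key at column 3
sorted=$(sort -t "$TAB" -k 3,3 "$tmpdir/toy_people.tsv")
echo "$sorted"

# The first row after sorting holds the alphabetically smallest username
first_user=$(echo "$sorted" | head -n 1 | cut -f 3)
echo "First username: $first_user"
```

Adding -n to the key (for example -k 2,2n) would sort a numeric column by value rather than alphabetically.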
Create a text file in your home directory called ~/reads.fasta, where you should put some substrings taken from ~/learn_bash/phage/vir_genomic.fna.
You can extract a substring of at least 20 chars from anywhere. You can add errors (i.e. change some letters), small deletions or small insertions…
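One possible way to pick a substring from the command line, instead of copying and pasting, is to drop the header, join the sequence lines and cut a range of characters. This is only a sketch: the toy genome, the positions 11-30 and the read name are arbitrary choices, and the output goes to a temporary directory rather than to ~/reads.fasta.

```shell
# A made-up two-line sequence standing in for the real genome file
tmpdir=$(mktemp -d)
cat > "$tmpdir/toy_genome.fna" <<'EOF'
>toy_genome made-up sequence
ACGTTGCAAGGCTTAACCGGTTAACCGGATCG
TTGACCATGGCATGCAAGCTTGGCGTAATCAT
EOF

# Remove the header, join the sequence lines, keep characters 11 to 30
fragment=$(grep -v '^>' "$tmpdir/toy_genome.fna" | tr -d '\n' | cut -c 11-30)
echo "Fragment: $fragment (${#fragment} characters)"

# Write the fragment as a FASTA record with a hypothetical read name
printf '>read1\n%s\n' "$fragment" > "$tmpdir/reads.fasta"
cat "$tmpdir/reads.fasta"
```

Errors, deletions or insertions can then be introduced by editing the resulting file with nano.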
In your home there should be a couple of archives, in two very popular formats: “zip” and “tar.gz”. They are in your examples/archives/ directory.
Unzipping is done by:
unzip FILENAME
while for tar archives:
tar xvfz FILENAME
The “switches” here are:
- x, to eXtract
- v, for Verbose reporting (print files as they are extracted). Don't add it if you are not interested in the list
- f, extract from a File (sounds crazy)
- z, the tar archive is also compressed with gzip. Don't add it if the archive is .tar and not .tar.gz
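To experiment safely with these switches, an archive can be built from scratch in a temporary directory. The sketch below also uses c (Create) and t (lisT), two tar switches not covered above; the directory and file names are placeholders.

```shell
# Prepare a small directory to archive (placeholder names and content)
tmpdir=$(mktemp -d)
mkdir "$tmpdir/project"
echo "hello" > "$tmpdir/project/notes.txt"

# c = Create, z = compress with gzip, f = archive File name
# -C makes tar change to that directory before adding "project"
tar -czf "$tmpdir/project.tar.gz" -C "$tmpdir" project

# t = lisT the contents without extracting anything
listing=$(tar -tzf "$tmpdir/project.tar.gz")
echo "$listing"

# x = eXtract, here into a separate output directory
mkdir "$tmpdir/out"
tar -xzf "$tmpdir/project.tar.gz" -C "$tmpdir/out"
content=$(cat "$tmpdir/out/project/notes.txt")
echo "Extracted content: $content"
```

Listing an archive before extracting it is a cheap way to see where its files will end up.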
· Bioinformatics at the Command Line - Andrea Telatin, 2017-2020