This repository contains scripts for:
- Downloading multiple versions of
.contig
files in the svn repositorycuration.pombase.org/var/svn-repos/pombe-embl
or PomBase FTP server. - Summarise the differences between subsequent versions of the contig files, namely:
- Coordinates of removed/added features.
- Changes in coordinates of features that are present in both versions.
- Changes in qualifiers of features that are present in both versions.
- It also contains a script to perform the diff for any two embl files (
two_genomes_diff.py
). This might be useful outside of PomBase.
All the steps below that need to be re-ran are in update_file.sh.
NOTE: if
poetry install
gives you errorFailed to open keyring
, check #47
# install dependencies
poetry install
# activate venv
poetry shell
# run this script (The comments explain what it does)
bash update_file.sh
# Commit changes
To install the dependencies, we used poetry (see poetry installation instructions).
In the source directory run:
poetry install
This should create a folder .venv
with the python virtual environment. To activate the virtual environment, then run:
poetry shell
Now when you call python
, it will be the one from the .venv
.
Most of the repository was made for a pombe genome analysis, in which many diffs of pombe genome were compared. If you are here only to quickly compare two genomes, you can run the script:
python two_genomes_diff.py --new_genome data/chromosome1/8485.contig --old_genome data/chromosome1/8338.contig --output_locations_file 'a.tsv' --output_qualifiers_file 'b.tsv'
Arguments:
--new_genome
and--old_genome
: files to be compared, in embl format.--revision_string
: a string with 3 space-separated revision-related values (revision number, user, date). If not provided, it is not printed.--output_locations_file
: the file where the diff in locations will be stored.--output_qualifiers_file
: the file where the diff in qualifiers will be stored.
Before using this, read the next section.
In order to assess whether a feature has changed or not, the script two_genomes_diff.py needs to use some unique identifier of a feature to compare it between
two genome versions. For PomBase, that is the feature qualifier \systematic_id
. Old genomes did not have this qualifier, and used \gene
instead. However, the \gene
qualifier
does not have to be unique, and there are some revisions were the usage of both is mixed.
This is handled for the PomBase case in the function read_pombe_genome
in genome_functions.py.
The data used for this was generated with the pre-svn pombe genomes and with the script old_scripts/find_missing_synonyms.sh and is in the directory
valid_ids_data.
The same unique identifier can refer to multiple features sometimes, such as introns. In that case, we report the ones that are removed or added.
WARNING: Downloading all revisions and generating the full diffs will require ~100GB of space.
Necessary no matter what you do:
bash get_data.sh
# If you haven't, activate the local python environment
poetry shell
# Create the basic folder structure, and download the information about revisions
# If you want to download last revisions (since last revision mentioned in either all_qualifier_changes_file or all_coordinate_changes_file.tsv)
bash get_revisions_where_contigs_changed.sh last
# If you want to download ALL revisions (~100 GB)
bash get_revisions_where_contigs_changed.sh all
This will create a directory structure:
data
├── chromosome1
│ ├── change_log
│ │ ├── locations
│ │ ├── qualifiers
│ │ └── diff
│ └── revisions.txt
├── chromosome2
│ ├── change_log
│ │ ├── locations
│ │ ├── qualifiers
│ │ └── diff
│ └── revisions.txt
├── chromosome3
│ ├── change_log
│ │ ├── locations
│ │ ├── qualifiers
│ │ └── diff
│ └── revisions.txt
├── mating_type_region
│ ├── change_log
│ │ ├── locations
│ │ ├── qualifiers
│ │ └── diff
│ └── revisions.txt
└── pMIT
├── change_log
│ ├── locations
│ ├── qualifiers
│ └── diff
└── revisions.txt
In each revisions.txt
file there is information about the revisions that affect a given chromosome space separated (revision user date), e.g.:
8485 vw253 2022-10-08
After this, you can download all versions each contig file where changes were made by running (note that you can run each chromosome in parallel to speed up):
python get_revisions_files.py
The generated folder tree looks like this:
data
├── chromosome1
│ ├── 10.contig
│ ├── 1000.contig
│ ├── 1002.contig
│ ├── 1007.contig
...
Where data/chromosome1/10.contig
is the file chromosome1.contig
at revision 10, etc.
You can also download the svn diffs by running python get_svn_diff.py
There are known erros in the old contig files, some of which need fixing if you are to run the analysis pipeline. See known_errors.md
.
In addition, some previous inconsistencies are fixed using data from the following files:
- valid_ids_data/genes_in_wrong_chromosomes.tsv: Some features were sometimes added to the wrong chromosome (see issue 22), and then move to the right chromosome. This may mess up the generation of results/genome_changes_summary.tsv, so we remove them from the table when running get_info_from_changes.py.
- valid_ids_data/known_exceptions.tsv: Some features have been associated with two systematic_ids at the same time in certain revisions. This file allows the function
read_pombe_genome
(in genome_functions.py) to choose one of the two as unique identifier of the feature. - valid_ids_data/missing_synonyms.tsv: Some of the synonyms are not included in the current genome, and they have been extracted from old genomes (pre_svn), by co-occurence as gene qualifiers, or as synonym qualifiers that are not longer included (see type column). This was generated using old_scripts/find_missing_synonyms.sh, but it can be updated manually if something is missing.
- valid_ids_data/obsoleted_ids.tsv: Generated by gather_obsoleted_names.py. Obsoleted ids along with the gene they were merged into. Used to create a synonym dictionary that was used for old data as well as for get_info_from_changes.py.
- valid_ids_data/systematic_ids_associated_with_two_CDS.tsv: Some systematic_ids have been previously associated with two CDS features at the same time (this was sometimes used to represent multiple transcripts for a given gene, but not always). This messes up the count in get_info_from_changes.py. The ids in the file are used to solve the problem.
For running the analysis:
# If you haven't activated the environment
poetry shell
python pombe_svn_diff.py
This will generate the following output:
data
├── chromosome1
│ ├── change_log
│ │ ├── locations
│ │ │ ├── 10.tsv
│ │ │ ├── 1000.tsv
..........................
│ │ └── qualifiers
│ │ ├── 10.tsv
│ │ ├── 1000.tsv
..........................
Where data/chromosome1/change_log/locations/xxx.tsv
contains changes in location that were introduced in revision xxx
. The file might be empty if no changes where made in that revision. Otherwise it contains:
- Coordinates of removed/added features.
- Changes in coordinates of features that are present in both versions.
The output looks like this:
revision user date systematic_id primary_name feature_type added_or_removed value
8462 vw253 2022-09-27 SPNCRNA.145 ncRNA removed 239730..240571
8462 vw253 2022-09-27 SPNCRNA.18 ncRNA removed complement(3699381..3700010)
8462 vw253 2022-09-27 SPNCRNA.193 ncRNA removed 2404684..2405924
8462 vw253 2022-09-27 SPNCRNA.884 ncRNA removed complement(3084619..3086087)
8462 vw253 2022-09-27 SPNCRNA.951 ncRNA removed 3951825..3952588
To combine all the changes in a single file, you can then run:
python create_single_coordinate_changes_file.py --output_file the_file.tsv
data/chromosome1/change_log/qualifiers/xxx.tsv
contains changes introduced in revision xxx
to qualifiers of features that existed in revision xxx
and the previous one. The output looks like this:
revision user date chromosome systematic_id primary_name feature_type qualifier_type added_or_removed value
8659 vw253 2023-01-16 I SPAC15A10.04c zpr1 CDS product added EF-1 alpha folding chaperone, zinc finger protein Zpr1
8659 vw253 2023-01-16 I SPAC15A10.04c zpr1 CDS product removed EF-1 alpha binding zinc finger protein Zpr1 (predicted)
8656 vw253 2023-01-15 I SPAC589.05c qng1 CDS controlled_curation added term=complementation, functionally complemented by human QNG1; db_xref=PMID:24911101; date=20140610
8656 vw253 2023-01-15 I SPAC589.05c qng1 CDS controlled_curation added term=human QNG1 ortholog; date=19700101
8656 vw253 2023-01-15 I SPAC589.05c qng1 CDS controlled_curation removed term=complementation, functionally complemented by human C9orf64; db_xref=PMID:24911101; date=20140610
To combine all the changes in a single file, you can then run:
python create_single_qualifier_changes_file.py --output_file the_file.tsv
get_info_from_changes.py
generates only_modified_coordinates.tsv, a table listing only the changes in gene locations (addition and removal in the same revision) for the main types of features (CDS
,ncRNA
,snRNA
,repeat_region
,rRNA
,tRNA
,snoRNA
,misc_RNA
). It combines the data from the svn server and the pre-svn data.
The script takes several inputs that should be there if you are using update_file.sh
, but some might be missing.
This can be particularly useful for alleles that refer to previous gene coordinates. This is used in the repository https://github.com/pombase/allele_qc.
Pombase-specific. Links changes in genome coordinates in the file only_modified_coordinates.tsv
to either:
- Comments from PomBase website. The file from PomBase curation repository lists some (not all) of the gene feature changes, and the reasons that led to them (a publication, personal communication, etc.).
- Changes in
dbxref
that occurred in the same revision as a change recorded inonly_modified_coordinates.tsv
.
To run this:
# Make the associations
python associate_comments_with_genome_changes.py
From all genomes stored in data/*, see if in any of them the genome SEQUENCE changed (not the features). This is normally output into results/genome_sequence_changes.tsv
python get_revisions_where_genome_sequence_changes.py --data data/* --output_file output.tsv
rm pre_svn_data/*/change_log/*/*.tsv
rm data/*/change_log/*/*.tsv
Some of the contig files pre-date the use of SVN, to download them and calculate the differences, they are in the ftp server of PomBase: https://www.pombase.org/data/genome_sequence_and_features/artemis_files/OLD/. The full list of those that pre-date svn are in the file . The output coordinates file can be found in results/pre_svn_coordinate_changes_file.tsv. The qualifier changes file is too big, and can be found in the latest release (gene_changes_comments_and_pmids/pre_svn_qualifier_changes_file.tsv). The file is used by update_file.sh
(see above), and for that it is downloaded from the latest release.
# Download files from first revision of svn and prepare directory structure (pre_svn_data)
bash prepare_pre_svn_folder.sh
# Download the contig files from ftp site and produce an equivalent to the revisions.txt described above, also fix known errors
python get_ftp_site_files.py
# Run the diffs on the pre_svn_data directory
python pombe_svn_diff.py --data_folders pre_svn_data/*
# Combine in single files
python create_single_coordinate_changes_file.py --data_folder pre_svn_data/ --output_file coordinate_changes_file.tsv
python create_single_qualifier_changes_file.py --data_folder pre_svn_data/ --output_file qualifier_changes_file.tsv
get_all_entries_for_systematic_id.sh
, takes one argument, a systematic_id.