This repository has been archived by the owner on Aug 11, 2021. It is now read-only.

What is the actual sorting required for tabix? #29

Open
Shians opened this issue May 28, 2020 · 2 comments

Comments

@Shians

Shians commented May 28, 2020

The tabix documentation at http://www.htslib.org/doc/tabix.html states that the input file should be position sorted:

> The input data file must be position sorted and compressed by bgzip which has a gzip(1) like interface.

However, in many usages I see that the files are in fact sorted first by seqname and THEN by position. The tabix paper (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3042176/) also seems to indicate this:

> Before being indexed, the data file needs to be sorted first by sequence name and then by leftmost coordinate

So does the documentation need to be updated, or has tabix been updated since to allow the seqname to be out of order?

@Shians
Author

Shians commented May 28, 2020

I think I now understand that "position sorted" means the same as "sorted by sequence name and then by leftmost coordinate", since the sequence is considered part of an entry's position. Still, since the information lives in two separate columns, it might be beneficial to state this explicitly in the documentation so as to avoid mistakes by users.

@winni2k

winni2k commented May 28, 2020

Position sorted means sorted by chromosome and then sorted by position within each chromosome. The position entry in the VCF specification only alludes to that. I agree that the documentation is unclear. This is probably because "position sorted" has become a de facto technical term in bioinformatics, but of course that's not very helpful to novices.
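
To make that definition concrete, here is a minimal Python sketch. It assumes tab-separated records with the sequence name in the first column and an integer position in the second (as in VCF). The function name `position_sort` and the sample records are made up for illustration and are not part of tabix or htslib.

```python
def position_sort(lines, seq_col=0, pos_col=1):
    """Sort tab-separated records by sequence name, then by numeric position.

    Column indices are assumptions: column 0 holds the sequence name and
    column 1 holds an integer position, as in VCF.
    """
    def key(line):
        fields = line.rstrip("\n").split("\t")
        # Compare positions as integers so that 900 sorts after 100,
        # not by their string representation.
        return (fields[seq_col], int(fields[pos_col]))

    return sorted(lines, key=key)


# Hypothetical records, deliberately out of order.
records = [
    "chr2\t150\tA\tG",
    "chr1\t900\tC\tT",
    "chr2\t20\tG\tA",
    "chr1\t100\tT\tC",
]

sorted_records = position_sort(records)
for line in sorted_records:
    print(line)
```

Sequence names are compared as plain strings here, so `chr10` would sort before `chr2`. Whether a given downstream tool accepts that chromosome order is a separate question.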

Also, some tools such as the GATK even distinguish between different chromosome orderings, and require them to match across all input files.
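
When a tool expects a particular chromosome ordering, one option is to sort against an explicit list of sequence names instead of relying on string order. The sketch below only illustrates that idea and does not reproduce how GATK validates its inputs. The function `sort_by_contig_order`, the `contig_order` list, and the records are all hypothetical.

```python
def sort_by_contig_order(lines, contig_order):
    """Sort records by a caller-supplied chromosome order, then by position.

    Assumes tab-separated records with the sequence name in column 0 and an
    integer position in column 1. Raises KeyError if a record names a
    sequence that is missing from contig_order.
    """
    rank = {name: i for i, name in enumerate(contig_order)}

    def key(line):
        fields = line.rstrip("\n").split("\t")
        return (rank[fields[0]], int(fields[1]))

    return sorted(lines, key=key)


# Hypothetical ordering, e.g. taken from a reference's list of sequences.
contig_order = ["chr1", "chr2", "chr10"]
records = ["chr10\t5", "chr2\t300", "chr1\t42", "chr2\t7"]

ordered = sort_by_contig_order(records, contig_order)
for line in ordered:
    print(line)
```

Using the same `contig_order` list for every input file keeps the chromosome ordering consistent across files.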
