implement weight and boost properties to improve gene search #29

Open
mhuyck opened this issue Jan 10, 2020 · 3 comments

Comments

@mhuyck
Collaborator

mhuyck commented Jan 10, 2020

While porting django-genes code to Python 3 for this project, a question came up about the Gene.weight property (see py3-adage-backend/adage/genes/models.py) and how it was used.

After reviewing the original django-genes codebase, it's clear that .weight supports an important search feature. Although it's not initially needed for Adage, it will likely be useful in the future and it will certainly be needed when it comes time to factor the py3 version of django-genes back out into a separate component. Details of my code review are below:

Weight is a search tuning parameter. Although it is not particularly useful for the Pseudomonas data we currently use in Adage, a fair amount of work was done in django-genes (for Tribe, I assume) to add this in because it was needed.

When searching for genes across many data sources, the same gene name is often used to refer to different genetic locations, even within the same organism. So, when a user is searching for a gene to add to a list, there needs to be a way to sort through the duplicates. From the comments in django-genes/genes/search_indexes.py (lines 34-59) and django-genes/genes/management/commands/genes_load_geneinfo.py (lines 213-271), it appears that the weighting is done in such a way that the “more popular” gene hits rise to the top of the search results. genes_load_geneinfo.py has logic that counts the number of cross-references and aliases a gene has and gives the gene a higher weight if it has many of them. search_indexes.py then turns those weights into a boost parameter, which appears to be what actually changes where that gene ranks in the search results.
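To make the idea concrete, here is a rough sketch of how a weight derived from alias and cross-reference counts could feed a boost that reorders duplicate hits. This is not the django-genes code: `GeneRecord`, `compute_weight`, `weight_to_boost`, `rank_hits`, the "one point per alias or cross-reference" rule, and the log-scaled boost are all assumptions made up for illustration. The real formulas live in the two files cited above.

```python
import math
from dataclasses import dataclass, field


@dataclass
class GeneRecord:
    # Hypothetical stand-in for a gene row; NOT the django-genes Gene model.
    symbol: str
    location: str
    aliases: list = field(default_factory=list)
    xrefs: list = field(default_factory=list)
    weight: float = 1.0


def compute_weight(gene):
    # Assumed rule: a base of 1 plus one point per alias and cross-reference.
    return 1.0 + len(gene.aliases) + len(gene.xrefs)


def weight_to_boost(weight):
    # Assumed damping: log-scale the weight so that heavily annotated genes
    # are favored without completely swamping the text-match score.
    return 1.0 + math.log10(max(weight, 1.0))


def rank_hits(hits):
    # hits is a list of (gene, text_score) pairs; highest boosted score first.
    return sorted(
        hits,
        key=lambda hit: hit[1] * weight_to_boost(hit[0].weight),
        reverse=True,
    )


if __name__ == "__main__":
    # Two made-up genes sharing one symbol at different locations.
    well_known = GeneRecord("ABC1", "locus-1",
                            aliases=["a1", "a2", "a3"],
                            xrefs=["x1", "x2", "x3", "x4", "x5", "x6"])
    obscure = GeneRecord("ABC1", "locus-2")
    for gene in (well_known, obscure):
        gene.weight = compute_weight(gene)
    for gene, score in rank_hits([(obscure, 1.0), (well_known, 1.0)]):
        print(gene.symbol, gene.location, gene.weight, score)
```

With equal text-match scores, the gene with more aliases and cross-references sorts first, which is the "more popular hit rises to the top" behavior described above.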

So I take from this that we will need the weight parameter and the boost logic to return, somehow, before this code is folded back into django-genes.

How soon will we need this sort of logic for Adage? I guess that’s really a question for @cgreene, but I think it’s safe to say that if we expand to other organisms we will hit this duplicate gene name issue eventually.

@cgreene
Member

cgreene commented Jan 10, 2020

I don't know how soon we'll need it. I think it was designed to address issues with search that showed up in Tribe's human genes.

@dongbohu
Contributor

Let's look into this issue more closely when we port Tribe to Python 3.

@mhuyck
Collaborator Author

mhuyck commented Jan 10, 2020

Sounds good to me.
