Whitelist filtering #784

Open
bamueh opened this issue Mar 18, 2024 · 6 comments

Comments

@bamueh
Contributor

bamueh commented Mar 18, 2024

I was under the impression that passing a whitelist to filter-fasta.py would remove all sequences except those that are in the whitelist. Instead, it seems like nothing gets filtered at all. But maybe I'm misunderstanding something?

@terrycojones
Member

How did you pass a whitelist? There are two options, --whitelist and --whitelistFile. If you look in test/test_filter.py you'll see quite a few whitelist tests. I'm guessing you did --whitelist FILE instead of --whitelistFile FILE? (We could easily make a change to detect this and give a warning.)

@bamueh
Contributor Author

bamueh commented Mar 19, 2024

No, I tried both --whitelist somesequencename and --whitelistFile FILE, and neither worked.

@bamueh
Contributor Author

bamueh commented Mar 19, 2024

For example:

$ filter-fasta.py --whitelist OP866294_C_C2.2_Ethiopia_Camelus_dromedarius_2019-11-23 < data/sequences.fasta > sss.fasta
Read 640 sequences, kept 640 (100.00%).

where OP866294_C_C2.2_Ethiopia_Camelus_dromedarius_2019-11-23 is a name of a sequence in data/sequences.fasta.

I would have expected that only one sequence would be kept (OP866294_C_C2.2_Ethiopia_Camelus_dromedarius_2019-11-23) and not 640, but maybe I'm misunderstanding something?

@terrycojones
Member

terrycojones commented Mar 19, 2024

OK, got it. A whitelist is used to guarantee that some things don't get thrown away. A blacklist ensures that some things do. The whitelist has priority, so with no other filter in play nothing is removed. What you should do is:

$ filter-fasta.py --whitelist OP866294_C_C2.2_Ethiopia_Camelus_dromedarius_2019-11-23 \
    --negativeTitleRegex . < data/sequences.fasta > sss.fasta

I.e., you ask that everything be thrown away, except the ones on the whitelist.
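To make the priority order concrete, here is a minimal sketch of the semantics described above. It is not the actual filter-fasta.py code: the function `keep_title` and its parameters are hypothetical names chosen for illustration, and the real implementation may check things in a different way.

```python
import re


def keep_title(title, whitelist=None, blacklist=None, negative_regex=None):
    """Hypothetical illustration of the filtering rules, not the library's code.

    Returns True if a sequence with this title would be kept.
    """
    # The whitelist has priority: a whitelisted title is always kept.
    if whitelist and title in whitelist:
        return True
    # A blacklisted title is always dropped (unless whitelisted above).
    if blacklist and title in blacklist:
        return False
    # A title matching the negative regex is dropped.
    if negative_regex is not None and re.search(negative_regex, title):
        return False
    # With no filter asking for removal, everything is kept.
    return True


titles = ["seqA", "seqB", "seqC"]

# A whitelist alone removes nothing, because nothing asks for removal.
kept_whitelist_only = [t for t in titles if keep_title(t, whitelist={"seqB"})]

# Adding a negative regex of "." drops every title except whitelisted ones.
kept_with_negative = [
    t for t in titles if keep_title(t, whitelist={"seqB"}, negative_regex=".")
]
```

Under these assumptions the first list still holds all three titles, while the second holds only `seqB`, which mirrors the "kept 640 (100.00%)" result seen with --whitelist alone.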

@terrycojones
Member

BTW, these terms are now considered racist.

@terrycojones
Member

If you look at the tests, you'll see several that have negativeRegex="."

2 participants