Interplay of max_subproblem_frac and max_subproblem_size #8

Open
smirarab opened this issue Dec 14, 2011 · 0 comments
In the current implementation, if both max_subproblem_frac and max_subproblem_size are provided, the larger of the two is used. I think this is counter-intuitive. When two numbers each specify a maximum for the same quantity, it makes more sense to me to take the smaller of the two.

Basically, the two options are "OR"ed in the current implementation, but I think "AND"ing them makes more sense. Here is an example. If max_subproblem_frac is set to 20% and max_subproblem_size to 200, I think the user means she wants her subproblems to be no larger than 20% of the original problem AND no larger than 200 taxa. That is, every subproblem should satisfy both limits at once. This is not what the current code (see below) does.

        configuration = self.configuration()
        # Here we check if the max_subproblem_frac is more stringent than max_subproblem_size
        frac_max = int(math.ceil(self.max_subproblem_frac*self.tree.n_leaves))
        if frac_max > self.max_subproblem_size:
            configuration['max_subproblem_size'] = frac_max

And here is one more motivation for the approach I am suggesting.

In the current implementation of SATe, if a subproblem is larger than 200 taxa, the fast and inaccurate version of MAFFT is used on it. So if your subproblem turns out to be 210 taxa, you get much worse results than with 200 taxa. For that reason, I always like to limit the size of my alignment subsets to 200 taxa. But if the number of sequences is less than 1000, the 200 limit amounts to more than 20% of the input, which is not good either. This means I would need to create separate config files for different inputs, based on their size. If the conditions were interpreted as I am suggesting, I could simply set max_subproblem_size to 200 and max_subproblem_frac to 0.2. The subsets would then never be larger than 200 taxa, and at the same time they would never be too large a fraction of the input (i.e., when the input alignment is small, say 400 taxa).
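
To make the difference concrete, here is a minimal standalone sketch of the two interpretations. The function names are hypothetical and are not part of SATe; only the `frac_max` computation mirrors the snippet quoted above, and the "proposed" function is just my reading of the "AND" rule, not an actual patch.

```python
import math


def current_cap(n_leaves, max_subproblem_size, max_subproblem_frac):
    """Hypothetical helper: the "OR" rule, i.e. the larger limit wins."""
    frac_max = int(math.ceil(max_subproblem_frac * n_leaves))
    return max(frac_max, max_subproblem_size)


def proposed_cap(n_leaves, max_subproblem_size, max_subproblem_frac):
    """Hypothetical helper: the "AND" rule, i.e. the smaller limit wins."""
    frac_max = int(math.ceil(max_subproblem_frac * n_leaves))
    return min(frac_max, max_subproblem_size)


if __name__ == "__main__":
    # Same settings for both inputs: size limit 200, fraction limit 0.2.
    for n_leaves in (400, 5000):
        print(n_leaves,
              current_cap(n_leaves, 200, 0.2),
              proposed_cap(n_leaves, 200, 0.2))
```

With a single config (200 and 0.2), the "AND" rule caps a 400-taxon input at 80 taxa and a 5000-taxon input at 200 taxa, whereas the "OR" rule gives 200 and 1000 respectively.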

Also note that the comment in the code is very confusing. In my mind, "more stringent" means lower here, not higher.
