Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trim a multiple sequence alignment based on a BED #477

Open
mbhall88 opened this issue Jul 5, 2024 · 4 comments
Open

Trim a multiple sequence alignment based on a BED #477

mbhall88 opened this issue Jul 5, 2024 · 4 comments

Comments

@mbhall88
Copy link

mbhall88 commented Jul 5, 2024

As always, I assume there is a way to do this seqkit, but my fu is not strong enough to find it.

I have a BED file of positions/columns I would like to keep from a multiple sequence alignment.

I've tried seqkit subset -w 0 -U --bed keep.bed msa.fa > msa.trim.fa but I get the warning [WARN] sequence () not found in file: msa.fa and the resulting output is empty. I assume this is because the chromosome name in the BED does not match the names of the sequences in the MSA.

Esentially I want to ignore the chromosome name and just extract the positions in the BED from each sequence in the MSA.

@shenwei356
Copy link
Owner

How about trim via a region -r ?

@mbhall88
Copy link
Author

mbhall88 commented Jul 6, 2024

But the region option only takes a single region. I’d like to do multiple regions at once.
maybe a flag to ignore chromosome name in a bed, or pass a regions file?

@shenwei356
Copy link
Owner

Calling seqkit subseq -r multiple times.

Even if using the bed with chr, the results would be strange.

@mbhall88
Copy link
Author

if I have thousands of regions though this would be quite cumbersome no?

For example, this is a small python script I used to achieve the functionality I am asking about

import sys


def trim_sequence(sequence: str, positions: list[range]) -> str:
    return "".join([sequence[iv] for iv in positions])


def main():
    bed_file = sys.argv[1]
    msa_file = sys.argv[2]
    # Read the bed file and extract the positions
    positions = set()
    with open(bed_file, "r") as bed:
        for line in bed:
            if not line.strip():
                continue

            fields = line.strip().split("\t")
            start, end = int(fields[1]), int(fields[2])
            positions.update(range(start, end))

    # Trim each sequence in the MSA
    with open(msa_file, "r") as msa:
        seq = ""
        for line in msa:
            if not line.strip():
                continue

            if line.startswith(">"):
                if seq:
                    trimmed_sequence = trim_sequence(seq, positions)
                    print(trimmed_sequence, file=sys.stdout)
                seq = ""
                sys.stdout.write(line)
            else:
                seq += line.strip()

        if seq:
            trimmed_sequence = trim_sequence(seq, positions)
            print(trimmed_sequence, file=sys.stdout)


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: python trim_msa.py <bed_file> <msa_file>", file=sys.stderr)
        sys.exit(1)
    main()

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants