-
Notifications
You must be signed in to change notification settings - Fork 160
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Trim a multiple sequence alignment based on a BED #477
Comments
How about trim via a region -r ? |
But the region option only takes a single region. I’d like to do multiple regions at once. |
Calling Even if using the bed with chr, the results would be strange. |
if I have thousands of regions though this would be quite cumbersome no? For example, this is a small python script I used to achieve the functionality I am asking about import sys
def trim_sequence(sequence: str, positions: list[range]) -> str:
return "".join([sequence[iv] for iv in positions])
def main():
bed_file = sys.argv[1]
msa_file = sys.argv[2]
# Read the bed file and extract the positions
positions = set()
with open(bed_file, "r") as bed:
for line in bed:
if not line.strip():
continue
fields = line.strip().split("\t")
start, end = int(fields[1]), int(fields[2])
positions.update(range(start, end))
# Trim each sequence in the MSA
with open(msa_file, "r") as msa:
seq = ""
for line in msa:
if not line.strip():
continue
if line.startswith(">"):
if seq:
trimmed_sequence = trim_sequence(seq, positions)
print(trimmed_sequence, file=sys.stdout)
seq = ""
sys.stdout.write(line)
else:
seq += line.strip()
if seq:
trimmed_sequence = trim_sequence(seq, positions)
print(trimmed_sequence, file=sys.stdout)
if __name__ == "__main__":
if len(sys.argv) != 3:
print("Usage: python trim_msa.py <bed_file> <msa_file>", file=sys.stderr)
sys.exit(1)
main() |
As always, I assume there is a way to do this seqkit, but my fu is not strong enough to find it.
I have a BED file of positions/columns I would like to keep from a multiple sequence alignment.
I've tried
seqkit subset -w 0 -U --bed keep.bed msa.fa > msa.trim.fa
but I get the warning[WARN] sequence () not found in file: msa.fa
and the resulting output is empty. I assume this is because the chromosome name in the BED does not match the names of the sequences in the MSA.Esentially I want to ignore the chromosome name and just extract the positions in the BED from each sequence in the MSA.
The text was updated successfully, but these errors were encountered: