Working with very large datasets (millions of unique amplicons)

Frédéric Mahé edited this page Nov 27, 2022 · 3 revisions

Operating systems often buffer a program's output before writing it to a file. The more slowly a buffer fills, the less often it is flushed to the file. The stats file is small, while the swarm file receives much more data. Consequently, the stats file might lag behind the swarm file by several clusters.

If you run swarm with the option -o output.swarms to name the output file for swarms, your OS might buffer the entire swarm file and only flush it to the output file at the end of the clustering process. If you don't want the OS to buffer your results, redirect the standard output with > instead:

swarm input.fasta > output.swarms
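
To check how far the results on disk have progressed during a long run, one option is to count the lines already written to the swarm file. The sketch below assumes swarm's default output format, with one cluster per line. The helper name count_clusters is hypothetical and not part of swarm.

```shell
#!/bin/sh
# Hypothetical helper (not part of swarm): print the number of clusters
# that have reached the given swarm output file so far.
# Assumption: default swarm output format, one cluster per line.
count_clusters() {
    # wc -l counts the newline-terminated lines already on disk;
    # the arithmetic expansion strips any padding some wc versions add
    echo $(( $(wc -l < "$1") ))
}

# Example usage, while swarm is running in another terminal:
#   swarm input.fasta > output.swarms
#   count_clusters output.swarms
```

A cluster that is still sitting in a buffer is not counted, so the number reported can only be lower than or equal to the number of clusters swarm has actually produced.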