In this lab, we will use the latest and greatest sequencing technology (PacBio HiFi) to assemble the genome of Plodia interpunctella, the pantry moth. We will also assess the sequencing coverage, the contiguity of the resulting assembly, and the gene completeness using BUSCO.
You will have an opportunity to explore GenBank assemblies of other moths and relate how this assembly compares to other moths in terms of contiguity.
Step 1: Login to the supercomputer, create a new folder for your analysis, and copy over the raw data
Open the command prompt or the terminal application and ssh
into the supercomputer.
Change directories into your compute directory and create a new directory called lab5
.
cd ~/compute
mkdir lab5
cd lab5
Copy the raw PacBio read FASTQ
and FASTA
files from our shared PWS 472 directory. We will use the FASTQ
file for the assembly, but we will use the FASTA
file to estimate the average read length and the total number of base pairs in the raw data.
cp ~/fsl_groups/fslg_pws472/compute/lab5/data/m64140_201011_172335.Q20.fast* ./
Notice that, in the command above, we use a wildcard *
to copy both the FASTA
file and the FASTQ
file.
Copy the assembly_stats
script to your folder.
cp -r ~/fsl_groups/fslg_pws472/compute/lab5/data/assembly_stats ./
Make sure conda is loaded (module load miniconda3/4.12-pws-472
) then run assembly_stats
on your FASTA
raw reads. Note that this script is primarily used to assess the completeness of a genome assembly not necessarily raw data. However, we will use it here to figure out the average read length and the total number of base pairs in the raw data. You will later use these to estimate the coverage in the raw data.
python assembly_stats/assembly_stats.py m64140_201011_172335.Q20.fasta
You should see two different JSON
objects entitled "Contig Stats:" and "Scaffold Stats:". They will be the same values and, for this part of the exercise, what we are actually measuring are "Read Stats:". Take note of the total number of base pairs and the average read lengths. You'll use these later.
Create a new job script using the Job Script Generator. Make sure that the job is limited to one node, select 24 processor cores, a wall-time of 2 hours, and 1 GB per processor. Add a name for your job and select the boxes to receive emails when your job begins, ends, and/or aborts. Copy the script from below Job Script on the webpage and paste it into a new file called hifiasm.job
in your folder. Use your favorite text editor to do this.
Open the job file and add the following commands to the bottom, separated by line breaks:
module load miniconda3/4.12-pws-472
conda activate hifiasm
hifiasm -o plodia.asm -t 24 m64140_201011_172335.Q20.fastq
Close the job file and submit it using sbatch
sbatch hifiasm.job
You can check on your job using squeue -u <username>
. It should complete relatively quickly and you will receive an email when it does. In the meantime, you can move to the next step to install BUSCO
and get the job file ready for job submission.
Create a new job script using the Job Script Generator. Make sure that the job is limited to one node, select 4 processor cores, 4 GB of RAM per processor, and a wall-time of 12 hours. Copy the resulting script into a new job file called busco.job
.
Copy the busco_downloads
folder from the pws472
class folder to your working folder
cp -r ~/fsl_groups/fslg_pws472/compute/lab5/data/busco_downloads ./
d. Open the job file and add the following commands to the bottom, separated by line breaks:
module load miniconda3/4.12-pws-472
conda activate busco
busco -o plodia_busco -i plodia.asm.bp.p_ctg.fa -l insecta_odb10 -c 4 -m genome --offline
Wait to submit this job until you complete step 5.
Wait until you get the email that your hifiasm
job has completed. Once it has, you will have many new files in your folder. You can read the hifiasm
documentation to understand what each file is. For these next steps, however, we want to convert the final assembly from a Graphical Fragment Assembly (.gfa
) file to a FASTA
file. This is simply for the convenience of using downstream tools. You can do this with the following command (you can run this interactively):
awk '$1 ~/S/ {print ">"$2"\n"$3}' plodia.asm.bp.p_ctg.gfa > plodia.asm.bp.p_ctg.fa
Now assess the contiguity of your assembly using the assembly stats script.
python assembly_stats/assembly_stats.py plodia.asm.bp.p_ctg.fa
You'll want to make sure that you keep this information handy. I'd like to see it in a table in your lab write-up. You will also use it to compare contiguity with an existing moth genome assembly. You will use the total number of base pairs in your assembly to estimate coverage from the total number of base pairs in your reads.
Since you already created the job file, you can simply submit the new file with sbatch
.
sbatch busco.job
Once busco
is complete (this may take quite some time), you'll be able to view how many expected single copy orthologs are in your assembly. You can do this by viewing the short_summary.specific.insecta_odb10.plodia_busco.txt
file.
cat plodia_busco/short_summary.specific.insecta_odb10.plodia_busco.txt
Complete BUSCOs include both single-copy and duplicated BUSCOs. Fragmented BUSCOs are where part of the gene was found, but the whole gene could not be confirmed. Missing BUSCOs are single copy orthologs that were not found in the assembly. More details about these output and more details on how to further analyze them can be found here.
Items to ensure that you include in your lab write-up:
- Full details of the methods that you used, including commands to each piece of software.
- The amount of coverage of HiFi PacBio data that was generated (Hint: you can estimate this by dividing the total number of base pairs in the reads by the total number of base pairs in the assembly).
- The contiguity stats of the genome that you assembled. How does this compare to another moth genome that you compared it to on GenBank? One that you can compare it to is the Diamondback moth, Plutella xylostella. Examine the differences in sequencing technology, the differences in coverage, and the differences in contiguity.