Metagenomics pipeline to retrieve info from MGnify using IDs.
- About - Overview of the project's purpose and goals
- Getting Started - Instructions on how to begin with this project
- Prerequisites and installing - Required software and installation steps
- Step by step - Detailed guide to each stage of the project
- Repository structure - A layout of the repository's architecture, describing the purpose of each file or directory
- References - Tools used in the project
- Authors - List of contributors to the project
- Acknowledgments - Credits and thanks to those who helped with the project
This repository is designed to efficiently fetch and analyze data from MGnify studies, focusing on specific biomes and study types. It extends the work initiated by sayalaruano, which uses the MGnify API for effective data retrieval. The project leverages Nextflow, an advanced and flexible platform for constructing computational pipelines. Moreover, the workflow is designed to be compatible with another pipeline developed by apalleja, highlighting its adaptability and relevance in a variety of research scenarios.
The following instructions are designed to guide users in extracting information and downloading FASTQ files for a given list of IDs. Originally, the pipeline was implemented using Python scripts from the Retrieve_info_MGnifyAPI repository (by Sebastian). It is currently being re-implemented as a Nextflow workflow. This update aims to enhance the reproducibility and efficiency of the analysis process.
This workflow is configured to be executed through Azure Batch and Docker, leveraging cloud computing resources and containerized environments. It is recommended to follow these instructions to set up Azure. Remember also to change the name of the container, which is not specified in this guide. Note that the Azure guide already covers downloading Java and Nextflow, so if you follow it you do not need to repeat the installation instructions below. The setup goes through these steps (a minimal configuration sketch follows the list):
- Generating a batch account
- Generating a storage account
- Generating a container (an Azure storage container is a different concept from a Docker container)
- Generating a Virtual Machine
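As a minimal illustration, a `nextflow.config` set up for Azure Batch might look like the sketch below. All account names, keys, the container name, and the region are placeholders to replace with your own values; the repository's actual `nextflow.config` may differ.

```groovy
// Minimal sketch of an Azure Batch configuration (placeholder values)
process.executor = 'azurebatch'         // run pipeline tasks on Azure Batch
workDir = 'az://<your-container>/work'  // work directory in Azure Blob Storage

azure {
    storage {
        accountName = '<storage-account-name>'
        accountKey  = '<storage-account-key>'
    }
    batch {
        location     = '<region>'        // e.g. westeurope
        accountName  = '<batch-account-name>'
        accountKey   = '<batch-account-key>'
        autoPoolMode = true              // let Nextflow create compute pools on demand
    }
}
```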
Using the SSH protocol, you can connect and authenticate to remote servers. For more details, please have a look at this page. The setup goes through these steps (a command sketch follows the list):
- Generating a new SSH key and adding it to the ssh-agent
- Checking for existing SSH keys
- Adding a new SSH key to your GitHub account
- About commit signature verification
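For reference, the key-generation step typically looks like the commands below, taken from the GitHub documentation; the email address is a placeholder. Note that the troubleshooting commands further down assume an RSA key at `~/.ssh/id_rsa`, so adjust the file name to whichever key type you create.

```bash
# Generate a new SSH key pair (replace the email with the one used for your GitHub account)
ssh-keygen -t ed25519 -C "your_email@example.com"

# Start the ssh-agent in the background
eval "$(ssh-agent -s)"

# Add the new private key to the agent
ssh-add ~/.ssh/id_ed25519
```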
If you get the error message `Permission denied (publickey)`:
- Copy and paste your private key into the VM
- Modify its permissions:
```bash
chmod 600 ~/.ssh/id_rsa
```
- Add your private key and enter your passphrase:
```bash
ssh-add ~/.ssh/id_rsa
```
Once you have completed all these steps you can try to clone a repository from GitHub.
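A quick way to verify the connection before cloning; the repository path is a placeholder:

```bash
# Test SSH authentication to GitHub; a greeting message confirms it works
ssh -T git@github.com

# Clone a repository over SSH (replace <user>/<repo> with the actual path)
git clone git@github.com:<user>/<repo>.git
```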
Uploading samples into the VM
- Using the SCP command (see the example after this list)
- Using the `azcopy` command: follow this guide; the command should look like this:
```bash
azureuser@marcorVM:~/test-meta$ azcopy copy 'https://[profile].blob.core.windows.net/metanfnewsample/ERR9777403?sp=rwd&st=2024-01-23T12:54:38Z&se=2024-01-23T20:54:38Z&sv=2022-11-02&sr=b&sig=yebbX8kGGVBwSJCSMuGmVtF1IYFs8TzTkeRMVbYzO4A%3D' "path/"
```
- Using Azure Blob Storage
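For the SCP option mentioned above, the command might look like this example; the key path, VM address, and destination directory are placeholders:

```bash
# Copy a local FASTQ file to the VM over SSH (placeholder host and paths)
scp -i ~/.ssh/id_rsa ERR9777403.fastq.gz azureuser@<vm-public-ip>:~/test-meta/
```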
Make sure Docker is installed and properly set up, configure your Azure Blob Storage and Azure Batch accounts, and install Nextflow following the Nextflow installation guide if you haven't done so yet. Once these prerequisites are in place, you can clone the repository and run the analysis.
The following instructions are for macOS systems.
- Download Java if it is not already installed on your laptop.
- Double-click the downloaded `.dmg` file.
- Nextflow may not work if the Java version is not between 11 and 21, so download an updated version of Java in that case.
- Check your Java version with the command `java -version` in your terminal.
- Download Docker if it is not already installed on your laptop.
- Double-click the downloaded `.dmg` file.
- Check your Docker version with the command `docker --version` in your terminal.
- Install Nextflow using this simple command in your terminal:
```bash
curl -s https://get.nextflow.io | bash
```
- Move Nextflow to a specific directory such as `/usr/local/bin`:
```bash
sudo mv nextflow /usr/local/bin
```
- Check your Nextflow version with the command `nextflow -version` in your terminal.
This pipeline uses functions from the repository called Retrieve_info_MGnifyAPI. The functions have been modified to make them simpler and more readable for a workflow. A file named `mgnify_functions.py` contains all the reworked functions to get info from MGnify and also download FASTQ files (an illustrative API call follows the list):

- `Functions_getInfo_MGnify_studies_analyses.py` retrieves a summary of MGnify studies and analyses for a given biome and data type (amplicon, shotgun metagenomics, metatranscriptomic, or assembly).
- `Functions_get_results_from_MGnifystudy.py` obtains abundance and functional tables, as well as other results, for a MGnify study.
- `Functions_get_samplesMetadata_from_MGnifystudy.py` obtains metadata for the samples of a MGnify study.
- `get_fastq_from_list_ids.py` obtains FASTQ files from MGnify studies.
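As a rough illustration of what these functions query, the MGnify REST API can also be explored directly with `curl`. The filter parameter names below (`biome_name`, `experiment_type`) are assumptions mirroring the pipeline's own parameters, so check the MGnify API documentation for the exact syntax:

```bash
# List studies for a given biome (filter name assumed from the pipeline parameters)
curl -s -H "Accept: application/json" \
  "https://www.ebi.ac.uk/metagenomics/api/v1/studies?biome_name=root:Environmental:Aquatic:Marine"

# List analyses of a given experiment type (e.g. amplicon)
curl -s -H "Accept: application/json" \
  "https://www.ebi.ac.uk/metagenomics/api/v1/analyses?experiment_type=amplicon"
```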
To run this Nextflow script, use the following command:

```bash
nextflow run main.nf --url_studies "https://www.ebi.ac.uk/metagenomics/api/v1/studies" --url_analyses "https://www.ebi.ac.uk/metagenomics/api/v1/analyses" --biome_name "example_biome" --experiment_type "example_experiment"
```
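For orientation, a stripped-down `main.nf` wrapping one of the Python helpers might look like the sketch below. This is only an assumed structure, not the repository's actual workflow: the process name, the output file, and the script's command-line flags are all hypothetical.

```groovy
// Minimal DSL2 sketch (assumed structure, not the repository's actual main.nf)
nextflow.enable.dsl = 2

params.url_studies     = 'https://www.ebi.ac.uk/metagenomics/api/v1/studies'
params.url_analyses    = 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses'
params.biome_name      = 'example_biome'
params.experiment_type = 'example_experiment'

process GET_INFO {
    // Assumes the Python helper is on the PATH (e.g. in the pipeline's bin/
    // directory) and accepts these hypothetical command-line flags.
    output:
    path 'studies_summary.tsv'

    script:
    """
    python Functions_getInfo_MGnify_studies_analyses.py \
        --url_studies ${params.url_studies} \
        --url_analyses ${params.url_analyses} \
        --biome_name ${params.biome_name} \
        --experiment_type ${params.experiment_type} > studies_summary.tsv
    """
}

workflow {
    GET_INFO()
}
```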
The table below provides an overview of the key files and directories in this repository, along with a brief description of each.
| File | Description |
|---|---|
| nextflow.config | Configuration file containing the Nextflow configuration for running the bioinformatics workflow, including parameters for processing genomic data on the Azure cloud service |
| nextflow_config_full_draft.txt | Text file containing a draft Nextflow configuration specifying the resource requirements for each program used |
| Dockerfile | Docker file containing the necessary commands to assemble a Docker image |
| requirements.txt | Text file containing all the dependencies needed to run the analysis |
| mgnify_functions.py | Python script containing all the functions to retrieve info from MGnify and get FASTQ files to run the second pipeline |
We would like to extend our heartfelt gratitude to DTU Biosustain and the Novo Nordisk Foundation Center for Biosustainability for providing the essential resources and support that have been fundamental in the development and success of the DSP (Data Science Platform) and MoNA (Multi-omics Network Analysis) projects.