Metagenomics pipeline to retrieve info from MGnify using IDs.
- About - Overview of the project's purpose and goals
- Getting Started - Instructions on how to begin with this project
- Prerequisites and installing - Required software and installation steps
- Step by step - Detailed guide to each stage of the project
- Repository structure - A layout of the repository's architecture, describing the purpose of each file or directory
- References - Tools used in the project
- Authors - List of contributors to the project
- Acknowledgments - Credits and thanks to those who helped with the project
This repository is designed to efficiently fetch and analyze data from MGnify studies, focusing on specific biomes and study types. It extends the work initiated by sayalaruano, which uses the MGnify API for effective data retrieval. The project leverages Nextflow, an advanced and flexible platform for constructing computational pipelines. Moreover, the workflow is designed to be compatible with another pipeline developed by apalleja, highlighting its adaptability and relevance in a variety of research scenarios.
The following instructions are designed to guide users in extracting information and downloading FASTQ files for a given list of IDs. Originally, the pipeline was implemented using Python scripts from the Retrieve_info_MGnifyAPI repository (by Sebastian). It is currently being re-implemented as a Nextflow workflow. This update aims to enhance the reproducibility and efficiency of the analysis process.
This workflow is configured to be executed through Azure Batch and Docker, leveraging cloud computing resources and containerized environments. It is recommended to follow these instructions to set up Azure. Remember also to change the name of the container, which is not specified in this guide. Note that the Azure guide already covers downloading Java and Nextflow, so if you follow it you do not need to repeat the installation instructions below. The setup goes through these steps (a minimal configuration sketch follows the list):
- Generating a batch account
- Generating a storage account
- Generating a container (an Azure storage container is a different concept from a Docker container)
- Generating a Virtual Machine
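As a minimal illustration, a `nextflow.config` set up for Azure Batch might look like the sketch below. All account names, keys, the container name, and the region are placeholders to replace with your own values; the repository's actual `nextflow.config` may differ.

```groovy
// Minimal sketch of an Azure Batch configuration (placeholder values)
process.executor = 'azurebatch'         // run pipeline tasks on Azure Batch
workDir = 'az://<your-container>/work'  // work directory in Azure Blob Storage

azure {
    storage {
        accountName = '<storage-account-name>'
        accountKey  = '<storage-account-key>'
    }
    batch {
        location     = '<region>'        // e.g. westeurope
        accountName  = '<batch-account-name>'
        accountKey   = '<batch-account-key>'
        autoPoolMode = true              // let Nextflow create compute pools on demand
    }
}
```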
Using the SSH protocol, you can connect and authenticate to remote servers. For more details, please have a look at this page. The setup goes through these steps (a command sketch follows the list):
- Generating a new SSH key and adding it to the ssh-agent
- Checking for existing SSH keys
- Adding a new SSH key to your GitHub account
- About commit signature verification
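For reference, the key-generation step typically looks like the commands below, taken from the GitHub documentation; the email address is a placeholder. Note that the troubleshooting commands further down assume an RSA key at `~/.ssh/id_rsa`, so adjust the file name to whichever key type you create.

```bash
# Generate a new SSH key pair (replace the email with the one used for your GitHub account)
ssh-keygen -t ed25519 -C "your_email@example.com"

# Start the ssh-agent in the background
eval "$(ssh-agent -s)"

# Add the new private key to the agent
ssh-add ~/.ssh/id_ed25519
```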
If you get the error message `Permission denied (publickey)`:
- Copy and paste your private key into the VM
- Modify its permissions:
```bash
chmod 600 ~/.ssh/id_rsa
```
- Add your private key and enter your passphrase:
```bash
ssh-add ~/.ssh/id_rsa
```
Once you have completed all these steps you can try to clone a repository from GitHub.
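A quick way to verify the connection before cloning; the repository path is a placeholder:

```bash
# Test SSH authentication to GitHub; a greeting message confirms it works
ssh -T git@github.com

# Clone a repository over SSH (replace <user>/<repo> with the actual path)
git clone git@github.com:<user>/<repo>.git
```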
Uploading samples into the VM
- Using the SCP command (see the example after this list)
- Using the `azcopy` command: follow this guide; the command should look like this:
```bash
azureuser@marcorVM:~/test-meta$ azcopy copy 'https://[profile].blob.core.windows.net/metanfnewsample/ERR9777403?sp=rwd&st=2024-01-23T12:54:38Z&se=2024-01-23T20:54:38Z&sv=2022-11-02&sr=b&sig=yebbX8kGGVBwSJCSMuGmVtF1IYFs8TzTkeRMVbYzO4A%3D' "path/"
```
- Using Azure Blob Storage
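For the SCP option mentioned above, the command might look like this example; the key path, VM address, and destination directory are placeholders:

```bash
# Copy a local FASTQ file to the VM over SSH (placeholder host and paths)
scp -i ~/.ssh/id_rsa ERR9777403.fastq.gz azureuser@<vm-public-ip>:~/test-meta/
```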
Make sure Docker is installed and properly set up, configure your Azure Blob Storage and Azure Batch accounts, and install Nextflow following the Nextflow installation guide if you haven't done so yet. Once these prerequisites are in place, you can clone the repository and run the analysis.
The following instructions are for macOS systems.
- Download Java if it is not already installed on your laptop.
- Double-click the downloaded `.dmg` file.
- Nextflow may not work if the Java version is not between 11 and 21, so download an updated version of Java in that case.
- Check your Java version with the command `java -version` in your terminal.
- Download Docker if it is not already installed on your laptop.
- Double-click the downloaded `.dmg` file.
- Check your Docker version with the command `docker --version` in your terminal.
- Install Nextflow using this simple command in your terminal:
```bash
curl -s https://get.nextflow.io | bash
```
- Move Nextflow to a specific directory such as `/usr/local/bin`:
```bash
sudo mv nextflow /usr/local/bin
```
- Check your Nextflow version with the command `nextflow -version` in your terminal.
This pipeline uses functions from the repository called Retrieve_info_MGnifyAPI. The functions have been modified to make them simpler and more readable for a workflow. A file named `mgnify_functions.py` contains all the reworked functions to get info from MGnify and also download FASTQ files (an illustrative API call follows the list):

- `Functions_getInfo_MGnify_studies_analyses.py` retrieves a summary of MGnify studies and analyses for a given biome and data type (amplicon, shotgun metagenomics, metatranscriptomic, or assembly).
- `Functions_get_results_from_MGnifystudy.py` obtains abundance and functional tables, as well as other results, for a MGnify study.
- `Functions_get_samplesMetadata_from_MGnifystudy.py` obtains metadata for the samples of a MGnify study.
- `get_fastq_from_list_ids.py` obtains FASTQ files from MGnify studies.
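As a rough illustration of what these functions query, the MGnify REST API can also be explored directly with `curl`. The filter parameter names below (`biome_name`, `experiment_type`) are assumptions mirroring the pipeline's own parameters, so check the MGnify API documentation for the exact syntax:

```bash
# List studies for a given biome (filter name assumed from the pipeline parameters)
curl -s -H "Accept: application/json" \
  "https://www.ebi.ac.uk/metagenomics/api/v1/studies?biome_name=root:Environmental:Aquatic:Marine"

# List analyses of a given experiment type (e.g. amplicon)
curl -s -H "Accept: application/json" \
  "https://www.ebi.ac.uk/metagenomics/api/v1/analyses?experiment_type=amplicon"
```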
To run this Nextflow script, use the following command:

```bash
nextflow run main.nf --url_studies "https://www.ebi.ac.uk/metagenomics/api/v1/studies" --url_analyses "https://www.ebi.ac.uk/metagenomics/api/v1/analyses" --biome_name "example_biome" --experiment_type "example_experiment"
```
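For orientation, a stripped-down `main.nf` wrapping one of the Python helpers might look like the sketch below. This is only an assumed structure, not the repository's actual workflow: the process name, the output file, and the script's command-line flags are all hypothetical.

```groovy
// Minimal DSL2 sketch (assumed structure, not the repository's actual main.nf)
nextflow.enable.dsl = 2

params.url_studies     = 'https://www.ebi.ac.uk/metagenomics/api/v1/studies'
params.url_analyses    = 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses'
params.biome_name      = 'example_biome'
params.experiment_type = 'example_experiment'

process GET_INFO {
    // Assumes the Python helper is on the PATH (e.g. in the pipeline's bin/
    // directory) and accepts these hypothetical command-line flags.
    output:
    path 'studies_summary.tsv'

    script:
    """
    python Functions_getInfo_MGnify_studies_analyses.py \
        --url_studies ${params.url_studies} \
        --url_analyses ${params.url_analyses} \
        --biome_name ${params.biome_name} \
        --experiment_type ${params.experiment_type} > studies_summary.tsv
    """
}

workflow {
    GET_INFO()
}
```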
The table below provides an overview of the key files and directories in this repository, along with a brief description of each.
| File | Description |
|---|---|
| nextflow.config | Configuration file containing the Nextflow configuration for running the bioinformatics workflow, including parameters for processing genomic data on the Azure cloud service |
| nextflow_config_full_draft.txt | Text file containing a draft Nextflow configuration specifying the resource requirements for each program used |
| Dockerfile | Docker file containing the necessary commands to assemble a Docker image |
| requirements.txt | Text file containing all the dependencies needed to run the analysis |
| mgnify_functions.py | Python script containing all the functions to retrieve info from MGnify and get FASTQ files to run the second pipeline |
We would like to extend our heartfelt gratitude to DTU Biosustain and the Novo Nordisk Foundation Center for Biosustainability for providing the essential resources and support that have been fundamental in the development and success of the DSP (Data Science Platform) and MoNA (Multi-omics Network Analysis) projects.