Blog Post 2

6 minute read

Published:

Amrnanopro: A Nextflow Pipeline for Comprehensive Bacterial Genome Analysis from Quality Control to Antimicrobial Resistance Profiling

Introduction

Antimicrobial Resistance (AMR) poses a significant threat to global health, undermining modern healthcare systems and efforts to control infectious diseases. Rapid and accurate AMR detection is crucial for timely intervention and preventing transmission. Amrnanopro is a Nextflow pipeline that bridges genomic data generation and public health applications by offering a scalable, modular, and user-friendly solution for bacterial whole genome sequencing data analysis.

Team Members

Name: Albert Rock Anasthase Gangbadja

Role: Lead Developer

Albert Rock oversees the project development, including setting up Nextflow processing, input handling, and workflow development. He also integrates MultiQC to ensure comprehensive and user-friendly reports.

Project Motivation

Driven by a passion for bioinformatics and aspiring to pursue a Ph.D., this project aims to create a robust tool to showcase during interviews for bioinformatics engineering positions. Undertaking this project independently challenges and enhances my capabilities as a bioinformatician.

Project Aim

Amrnanopro delivers a Nextflow pipeline for processing bacterial whole genome sequencing data, including:

  • Quality Control (QC)
  • AMR Profiling
  • Phylogenetic Analysis

Utilizing MultiQC, the pipeline generates comprehensive HTML reports, making results accessible to non-experts.

Technologies

Programming Languages

  • Apache Groovy
  • Nextflow DSL2
  • Python 3
  • Bash

Platforms

  • Ubuntu 22.04
  • Docker
  • Singularity

Bioinformatics Tools

  • NanoPlot: Quality control for Oxford Nanopore Technologies (ONT) reads
  • Chopper: Filtering of ONT reads
  • Flye: Long-read assembly
  • Medaka: Polishing ONT reads
  • Abricate: AMR profiling
  • MultiQC: Aggregated reporting tool for analysis summaries

Alternatives Considered

  • Snakemake: Evaluated but not selected due to limitations in modularity and portability compared to Nextflow DSL2.

Why Nextflow DSL2?

Nextflow DSL2 was chosen for its:

  • Scalability: Efficiently handles large-scale genomic data.
  • Modularity: Enables reusable and maintainable workflow components.
  • Portability: Executes across various computational environments (local, HPC, cloud).
  • Community Support: Benefits from the nf-core project’s resources and active user base.

Project Challenge

Context

AMR undermines antibiotic effectiveness, making infections harder to treat and increasing disease spread, severe illness, and mortality. Genomic surveillance is essential for detecting resistance markers, enabling timely interventions. Despite rapid genomic data generation, accessible tools for non-experts are needed to connect genome science with public health. Effective surveillance tools must identify high-risk genetic strains, categorize isolates into lineages, and recognize genetic components linked to AMR and virulence.

User Stories

  • As a Bioinformatician, I want to perform quality control on raw sequencing data to ensure the reliability and accuracy of downstream analyses.
    • The pipeline runs NanoPlot to assess ONT read quality.
    • MultiQC generates detailed quality metrics and visualizations.
    • Users can easily view QC results through the HTML report.
    • Results are reproducible across different environments (Docker, Singularity, HPC).

Features

  • Comprehensive Workflow: From raw reads to AMR profiling and phylogenetic analysis.
  • Quality Control: Uses NanoPlot for assessing read quality before and after filtering.
  • Read Filtering: Implements Chopper to filter low-quality reads.
  • Genome Assembly: Employs Flye for assembling long-read sequences.
  • Polishing: Utilizes Medaka to enhance assembly accuracy.
  • AMR Profiling: Integrates Abricate for identifying antimicrobial resistance genes.
  • Aggregated Reporting: Generates HTML reports with MultiQC.
  • Modularity: Built with Nextflow DSL2 modules for flexibility and reusability.
  • Scalability: Handles large datasets efficiently across various computing environments.

Architecture

Amrnanopro Workflow

Figure 1: Overview of the Amrnanopro pipeline architecture.

Infrastructure

  • Nf-core Tests: Ensures pipeline reliability and consistency by adhering to community standards.
  • Feature Branching: Organizes development and collaboration through Git branching strategies.
  • Continuous Integration with Quality Gates: Maintains high code and workflow standards via automated testing and validation.

Implementation

Workflow Components

  1. NanoPlot Pre-Filtering Analysis: Assesses raw ONT sequencing data quality.
  2. Chopper Filtering: Removes low-quality reads to improve downstream analyses.
  3. NanoPlot Post-Filtering Analysis: Evaluates read quality after filtering.
  4. Genome Assembly with Flye: Assembles filtered reads into contigs.
  5. Polishing with Medaka: Corrects assembly errors to enhance accuracy.
  6. AMR Profiling with Abricate: Identifies antimicrobial resistance genes in the assembled genome.
  7. MultiQC Reporting: Aggregates all analysis results into a comprehensive HTML report.

Execution Profiles

  • Standard: Default execution settings.
  • Docker: Runs the pipeline within Docker containers for reproducibility.
  • Singularity: Utilizes Singularity containers, suitable for HPC environments.

Usage

Installation

Clone the Repository

git clone https://github.com/AlbertRock/amrnanopro.git
cd amrnanopro

Install Dependencies

  • Using Docker:

    Ensure Docker is installed and running on your system.

Running the Pipeline

Execute the pipeline with your input FASTQ file:

nextflow run AlbertRock/amrnanopro \
            --input_fastq path/to/your_data.fastq.gz \
            -profile docker \
            --outdir path/to/your_output_directory

Parameters

  • --input_fastq: Path to the input FASTQ file(s).
  • --skip_chopper: Set to true to skip the Chopper filtering step (default: false).

Output

Results are generated in the results/ directory, including:

  • Quality Control Reports: Pre- and post-filtering reports from NanoPlot.
  • Filtered Reads: FASTQ files after Chopper filtering.
  • Assembly Files: Contigs and scaffolds from Flye.
  • Polished Genome: Improved assembly from Medaka.
  • AMR Profiling Results: Resistance genes identified by Abricate.
  • Aggregated Report: Comprehensive MultiQC report summarizing all analyses.

Challenges Faced

  • Sample Name Conflicts in MultiQC: Resolved by assigning unique identifiers to each sample stage.
  • Integrating Multiple Tools: Managed input/output formats and parameter configurations for seamless communication between NanoPlot, Chopper, Flye, Medaka, Abricate, and MultiQC.
  • Automated Testing and Continuous Integration: Established unit tests and GitHub Actions for maintaining code quality, despite a learning curve.

Lessons Learned

  • Modularity Enhances Maintainability: Building with modular Nextflow DSL2 components improves reusability and maintenance.
  • Comprehensive Documentation is Crucial: Detailed README.md and code comments facilitate collaboration and user adoption.
  • Community Support is Invaluable: Engaging with the Nextflow and nf-core communities provided essential assistance and insights.

Conclusion

Amrnanopro offers valuable tools for genomic surveillance and AMR monitoring, supporting efforts to combat antimicrobial resistance. By focusing on comprehensive data processing, insightful reporting, and user-friendly design, the pipeline makes advanced genomic analyses accessible to researchers and healthcare professionals.

Future Work

  • Assembly, Polishing and AMR Detection: Only the MVP is done, the assembly, polishing and AMR detection are yet to be integrated.
  • Extend Compatibility: Integrate additional sequencing platforms to broaden applicability.
  • User Interface Development: Develop a graphical interface for users without command-line experience.
  • Cloud Deployment: Enable deployment on cloud platforms for scalable and distributed processing.

Screenshots

MultiQC Report Overview

MultiQC Report Screenshot

Figure 2: MultiQC report summarizing software versions and reads length.

MultiQC Report Screenshot

Figure 3: MultiQC report summarizing reads quality.

Acknowledgements

  • Nextflow Community: For providing a robust workflow management system and support.
  • nf-core Project: For guidelines and best practices in pipeline development.
  • Open-Source Contributors: To all developers of the bioinformatics tools integrated into AMRNanoPro.

Thank you for exploring Amrnanopro! If you have any questions or suggestions, please reach out via GitHub Issues or connect on LinkedIn.


References

Leave a Comment