Long-Read 16S Sequencing (PacBio)
Long-read sequencing has certain advantages over our in-lab 16S sequencing with Illumina, like its high accuracy and species-level resolution, and the lab plans to switch to long-read sequencing in the future with the help of the Duke Microbiome Core. One drawback, however, is runtime: the one cohort I've run through so far took about five days to complete and was often preempted due to the sheer resources the run requires, so budget enough time for the run and be patient with it.
To get the data from the sequencing core, create a Globus account by following the instructions here. Next, ask your contact at the sequencing core (for us, it was Graham Alexander) to add you to the data folder. To transfer this folder to Isilon, click the link in the share email, which should open the Globus Web App with the source Collection and Path fields already filled in. Click 'Transfer or Sync to...' in the panel on the right, and fill in the Collection and Path fields on the right with the destination. Click Start on the left, and you should get a green "Transfer request submitted successfully" notification. You can then check the status of the transfer on the Activity page, accessible from the menu on the left.
Tip
You may need to configure a folder connection with Isilon in order to see it as an allowed collection to transfer to. To do so, follow the instructions here.
Running the Pipeline
The pipeline consists of two files: PacBio_16S_pipeline.R, the R script that processes raw PacBio HiFi reads into a finished phyloseq object, and run_r_in_container.sbatch, the Slurm batch script that runs the R script inside the lab's Apptainer container on the DCC. The R script uses DADA2 for all processing steps and GreenGenes2 for taxonomy assignment.
Requirements
The pipeline requires:
- The lab's Apptainer container (metabarcoding.sif), which has R, DADA2, phyloseq, and all other dependencies pre-installed
- A GreenGenes2 reference (gg2_2024_09_toSpecies_trainset.fa.gz)
- A metadata CSV (user_biosamples.csv) with sample information; the first column should contain sample identifiers matching the biosample names in the FASTQ filenames
- A folder of PacBio HiFi .fastq.gz files
File Structure
Set up a project folder on the DCC with the following structure:
[project-folder]/
├── PacBio_16S_pipeline.R
├── run_r_in_container.sbatch
├── gg2_2024_09_toSpecies_trainset.fa.gz
├── user_biosamples.csv
└── fastx_files/
    ├── [sample1].hifi_reads.[biosample1].hifi_reads.fastq.gz
    ├── [sample2].hifi_reads.[biosample2].hifi_reads.fastq.gz
    └── ...
Warning
The pipeline extracts sample names from the FASTQ filenames using the pattern *.hifi_reads.[BIOSAMPLE].hifi_reads.fastq.gz. If your filenames follow a different convention, you will need to update the regex on line 114 of PacBio_16S_pipeline.R.
Configuring the Batch Script
Before running, update the two path variables at the top of run_r_in_container.sbatch:
IMG points to the Apptainer container and WORK points to your project folder. The batch script binds WORK to /work inside the container, so the R script sees all your input files under /work/. It also binds /scratch for fast temporary I/O during processing.
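The two variables sit at the top of the batch script and look roughly like this; the paths below are placeholders, not the lab's real locations, so point them at your actual container image and project folder:

```shell
# Hypothetical paths -- adjust both to your own locations on the DCC
IMG="/hpc/group/yourlab/containers/metabarcoding.sif"  # Apptainer image
WORK="/hpc/group/yourlab/pacbio_16s_project"           # project folder with inputs
```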
The rest of the batch script creates the necessary output directories, sets thread control variables to match the Slurm allocation (32 CPUs, 256 GB memory, 5-day time limit), and runs the R script:
mkdir -p "$WORK/logs" "$WORK/output"
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-32}
export OPENBLAS_NUM_THREADS=$OMP_NUM_THREADS
export MKL_NUM_THREADS=$OMP_NUM_THREADS
export TMPDIR=/scratch/$SLURM_JOB_ID
mkdir -p $TMPDIR
apptainer exec --bind "$WORK:/work,/scratch:/scratch" "$IMG" \
    Rscript /work/PacBio_16S_pipeline.R
To submit the job, run the following from your project folder:
sbatch run_r_in_container.sbatch
You can optionally add --mail-user=[NetID]@duke.edu after sbatch to receive an email when the job finishes or fails (if the batch script does not already set a --mail-type, you may also need to add something like --mail-type=END,FAIL).
Walking Through the R Script
The R script runs end-to-end without user interaction. The walkthrough below describes each step for reference and troubleshooting.
Setup
The script begins by enabling parallel processing and loading packages:
options(mc.cores = parallel::detectCores())
library(dada2)
library(Biostrings)
library(ShortRead)
library(ggplot2)
library(phyloseq)
library(magrittr)
library(fs)
library(tidyverse)
It then sets the three input paths, all relative to the /work mount inside the container:
data.dir <- "/work/fastx_files"
greengenes2 <- "/work/gg2_2024_09_toSpecies_trainset.fa.gz"
user_biosamples <- "/work/user_biosamples.csv"
Intermediate RDS files are written to scratch space (or /work/temp as a fallback) for faster I/O:
path.out <- ifelse(Sys.getenv("TMPDIR") != "",
                   Sys.getenv("TMPDIR"),
                   "/work/temp")
path.rds <- file.path(path.out, "RDS")
dir.create(path.rds, recursive = TRUE, showWarnings = FALSE)
Primer Removal and Filtering
The script defines the 27F and 1492R primer sequences used for full-length 16S amplification.
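The sequences themselves aren't reproduced above; the standard degenerate 27F/1492R pair used in the DADA2 PacBio workflow is shown below for reference — confirm it matches the definitions in your copy of the script:

```r
F27   <- "AGRGTTYGATYMTGGCTCAG"  # 27F forward primer (degenerate)
R1492 <- "RGYTACCTTGTTACGACTT"   # 1492R reverse primer (degenerate)
```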
Primers are removed with removePrimers(), using the reverse complement of 1492R and allowing for either orientation:
nops <- file.path(noprimers_dir, basename(fns))
prim <- removePrimers(fns, nops, primer.fwd = F27,
                      primer.rev = dada2:::rc(R1492),
                      orient = TRUE, verbose = TRUE)
The primer-trimmed reads are then filtered for quality. The length bounds (1000–1600 bp) reflect the expected range for full-length 16S amplicons:
filts <- file.path(filtered_dir, basename(fns))
track <- filterAndTrim(nops, filts, minQ = 3, minLen = 1000, maxLen = 1600,
                       maxN = 0, rm.phix = FALSE, maxEE = 2,
                       compress = TRUE, multithread = TRUE)
Dereplication, Error Learning, and Denoising
Filtered reads are dereplicated, then the PacBio-specific error model is learned and used for denoising. Note the BAND_SIZE = 32 parameter, which is larger than the default to accommodate PacBio's longer reads:
drp <- derepFastq(filts, verbose = TRUE)
err <- learnErrors(drp, errorEstimationFunction = PacBioErrfun,
                   BAND_SIZE = 32, multithread = TRUE,
                   randomize = TRUE)
dd <- dada(drp, err = err, BAND_SIZE = 32, multithread = TRUE,
           verbose = 1)
An error model plot is saved to error_plot.pdf for quality control. Each of these steps also saves an intermediate RDS file, so if the pipeline fails partway through you can inspect the last successful output.
Read Tracking
A read tracking table is written to read_tracking.csv, recording how many reads survived each step:
tracking <- cbind(ccs = prim[,1], primers = prim[,2],
                  filtered = track[,2],
                  denoised = sapply(dd, function(x) sum(x$denoised)))
write.csv(tracking, file.path(path.out, "read_tracking.csv"))
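To quickly eyeball per-sample retention from that file, an awk one-liner works; this is a sketch that assumes the five-column layout shown above (sample name as the CSV row name, then ccs, primers, filtered, denoised). The heredoc just fakes a tiny example file — point the awk at your real output/read_tracking.csv instead:

```shell
# Toy read_tracking.csv with the pipeline's column layout
cat > read_tracking.csv <<'EOF'
"",ccs,primers,filtered,denoised
sampleA,10000,9500,9000,8800
EOF
# Percentage of raw CCS reads surviving through denoising, per sample
awk -F, 'NR > 1 { printf "%s %.0f%%\n", $1, 100 * $5 / $2 }' read_tracking.csv
```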
Sequence Table, Taxonomy, and Chimera Removal
The denoised reads are assembled into a sequence table, then taxonomy is assigned using the GreenGenes2 reference with a minimum bootstrap confidence of 50:
st <- makeSequenceTable(dd)
tax <- assignTaxonomy(st, greengenes2, multithread = TRUE,
                      verbose = TRUE, minBoot = 50)
Escherichia/Shigella assignments are simplified to Escherichia, since the two genera are genomically indistinguishable by 16S.
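The relabeling itself is a one-line substitution; a minimal sketch, assuming the combined label appears in the Genus column of the taxonomy matrix as "Escherichia-Shigella" (check the exact label your GreenGenes2 reference produces):

```r
# `tax` is the taxonomy matrix returned by assignTaxonomy() above;
# collapse the combined genus label (label name is an assumption)
tax[, "Genus"] <- sub("Escherichia-Shigella", "Escherichia", tax[, "Genus"])
```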
Chimeric sequences are then identified and removed from both the sequence table and the taxonomy table:
bim <- isBimeraDenovo(st, minFoldParentOverAbundance = 3.5,
                      multithread = TRUE, verbose = TRUE)
st_nochim <- st[,!bim]
tax_nochim <- tax[!bim,]
Sample Name Extraction
Sample names are extracted from the FASTQ filenames by pulling the biosample identifier from the *.hifi_reads.[BIOSAMPLE].hifi_reads.fastq.gz pattern:
file.names <- basename(fns)
sample.names <- sub(".*\\.hifi_reads\\.(.*?)\\.hifi_reads\\.fastq\\.gz",
                    "\\1", file.names)
rownames(st_nochim) <- sample.names
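Before launching a full five-day run, it's worth previewing what this pattern will extract from your own filenames. An equivalent shell check (POSIX sed has no lazy (.*?), so this greedy version behaves the same only for filenames with exactly two .hifi_reads. segments, which is the expected convention):

```shell
# Hypothetical PacBio filename -- substitute one of your own
f="m84123_240101_000000_s1.hifi_reads.SAMPLE01.hifi_reads.fastq.gz"
echo "$f" | sed -E 's/.*\.hifi_reads\.(.*)\.hifi_reads\.fastq\.gz/\1/'
```

If the output is not the biosample identifier you expect, the regex in the R script needs adjusting before you submit the job.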
Phyloseq Creation
The metadata CSV is read in, sample names are verified against the sequence table, and the phyloseq object is assembled from the chimera-free sequence table, taxonomy, metadata, and reference sequences:
df <- read.csv(user_biosamples, header = TRUE, row.names = 1)
ps <- phyloseq(
  otu_table(st_nochim, taxa_are_rows = FALSE),
  sample_data(df[rownames(st_nochim), , drop = FALSE]),
  tax_table(tax_nochim)
)
dna <- Biostrings::DNAStringSet(colnames(st_nochim))
names(dna) <- colnames(st_nochim)
ps <- merge_phyloseq(ps, dna)
Output
The finished phyloseq and all intermediate objects are copied to output/ in your project folder:
[project-folder]/
└── output/
├── ps_PacBio_16S.rds # final phyloseq
├── Sequence_Table_16S.rds # chimera-free sequence table
├── Taxonomy_GreenGenes.rds # chimera-free taxonomy
├── read_tracking.csv # reads per sample per step
└── error_plot.pdf # DADA2 error model plot
If the job used scratch space for temporary files, those are also copied back to temp/RDS/ in your project folder before cleanup.
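Once the copy-back finishes, the final object can be loaded in any R session that has phyloseq installed (path shown relative to your project folder):

```r
library(phyloseq)
ps <- readRDS("output/ps_PacBio_16S.rds")
ps  # printing the object gives a summary of taxa, samples, and metadata variables
```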