Long-Read 16S Sequencing (PacBio)
Long-read sequencing has certain advantages over our in-lab 16S sequencing with Illumina, like its high accuracy and species-level resolution, and the lab plans to switch to long-read sequencing in the future with the help of the Duke Microbiome Core. One drawback, however, is runtime: the one cohort I've run through so far took about five days to complete and was often preempted due to the sheer resources the run requires, so budget enough time for the run and be patient with it.
To get the data from the sequencing core, create a Globus account by following the instructions here. Next, ask your contact at the sequencing core (for us, it was Graham Alexander) to add you to the data folder. To transfer this folder to Isilon, click the link in the share email, which should open the Globus Web App with the source Collection and Path fields already filled in. Click 'Transfer or Sync to...' in the panel on the right, and fill in the Collection and Path fields on the right with the destination. Click Start on the left, and you should get a green "Transfer request submitted successfully" notification. You can then check the status of the transfer on the Activity page, accessible from the menu on the left.
Tip
You may need to configure a folder connection with Isilon in order to see it as an allowed collection to transfer to. To do so, follow the instructions here.
Running the Pipeline
The pipeline consists of two files: PacBio_16S_pipeline.R, the R script that processes raw PacBio HiFi reads into a finished phyloseq object, and run_r_in_container.sbatch, the Slurm batch script that runs the R script inside the lab's Apptainer container on the DCC. The R script uses DADA2 for all processing steps and GreenGenes2 for taxonomy assignment.
Requirements
The pipeline requires:
- The lab's Apptainer container (metabarcoding.sif), which has R, DADA2, phyloseq, and all other dependencies pre-installed
- A GreenGenes2 reference (gg2_2024_09_toSpecies_trainset.fa.gz)
- A metadata CSV (user_biosamples.csv) with sample information; the first column should contain sample identifiers matching the biosample names in the FASTQ filenames
- A folder of PacBio HiFi .fastq.gz files
File Structure
Set up a project folder on the DCC with the following structure:
[project-folder]/
├── PacBio_16S_pipeline.R
├── run_r_in_container.sbatch
├── gg2_2024_09_toSpecies_trainset.fa.gz
├── user_biosamples.csv
└── fastx_files/
    ├── [sample1].hifi_reads.[biosample1].hifi_reads.fastq.gz
    ├── [sample2].hifi_reads.[biosample2].hifi_reads.fastq.gz
    └── ...
Warning
The pipeline extracts sample names from the FASTQ filenames using the pattern *.hifi_reads.[BIOSAMPLE].hifi_reads.fastq.gz. If your filenames follow a different convention, you will need to update the regex on line 114 of PacBio_16S_pipeline.R.
Configuring the Batch Script
Before running, update the two path variables at the top of run_r_in_container.sbatch:
IMG points to the Apptainer container and WORK points to your project folder. The batch script binds WORK to /work inside the container, so the R script sees all your input files under /work/. It also binds /scratch for fast temporary I/O during processing.
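The two variables sit at the top of the batch script and look roughly like this; the paths below are placeholders, not the lab's real locations, so point them at your actual container image and project folder:

```shell
# Hypothetical paths -- adjust both to your own locations on the DCC
IMG="/hpc/group/yourlab/containers/metabarcoding.sif"  # Apptainer image
WORK="/hpc/group/yourlab/pacbio_16s_project"           # project folder with inputs
```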
The rest of the batch script creates the necessary output directories, sets thread control variables to match the Slurm allocation (32 CPUs, 256 GB memory, 5-day time limit), and runs the R script:
mkdir -p "$WORK/logs" "$WORK/output"
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-32}
export OPENBLAS_NUM_THREADS=$OMP_NUM_THREADS
export MKL_NUM_THREADS=$OMP_NUM_THREADS
export TMPDIR=/scratch/$SLURM_JOB_ID
mkdir -p $TMPDIR
apptainer exec --bind "$WORK:/work,/scratch:/scratch" "$IMG" \
    Rscript /work/PacBio_16S_pipeline.R
To submit the job, run the following from your project folder:
sbatch run_r_in_container.sbatch
You can optionally add --mail-user=[NetID]@duke.edu after sbatch to receive an email when the job finishes or fails (if the batch script does not already set a --mail-type, you may also need to add something like --mail-type=END,FAIL).
Walking Through the R Script
The R script runs end-to-end without user interaction. The walkthrough below describes each step for reference and troubleshooting.
Setup
The script begins by enabling parallel processing and loading packages:
options(mc.cores = parallel::detectCores())
library(dada2)
library(Biostrings)
library(ShortRead)
library(ggplot2)
library(phyloseq)
library(magrittr)
library(fs)
library(tidyverse)
It then sets the three input paths, all relative to the /work mount inside the container:
data.dir <- "/work/fastx_files"
greengenes2 <- "/work/gg2_2024_09_toSpecies_trainset.fa.gz"
user_biosamples <- "/work/user_biosamples.csv"
Intermediate RDS files are written to scratch space (or /work/temp as a fallback) for faster I/O:
path.out <- ifelse(Sys.getenv("TMPDIR") != "",
                   Sys.getenv("TMPDIR"),
                   "/work/temp")
path.rds <- file.path(path.out, "RDS")
dir.create(path.rds, recursive = TRUE, showWarnings = FALSE)
Primer Removal and Filtering
The script defines the 27F and 1492R primer sequences used for full-length 16S amplification.
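The sequences themselves aren't reproduced above; the standard degenerate 27F/1492R pair used in the DADA2 PacBio workflow is shown below for reference — confirm it matches the definitions in your copy of the script:

```r
F27   <- "AGRGTTYGATYMTGGCTCAG"  # 27F forward primer (degenerate)
R1492 <- "RGYTACCTTGTTACGACTT"   # 1492R reverse primer (degenerate)
```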
Primers are removed with removePrimers(), using the reverse complement of 1492R and allowing for either orientation:
nops <- file.path(noprimers_dir, basename(fns))
prim <- removePrimers(fns, nops, primer.fwd = F27,
                      primer.rev = dada2:::rc(R1492),
                      orient = TRUE, verbose = TRUE)
The primer-trimmed reads are then filtered for quality. The length bounds (1000–1600 bp) reflect the expected range for full-length 16S amplicons:
filts <- file.path(filtered_dir, basename(fns))
track <- filterAndTrim(nops, filts, minQ = 3, minLen = 1000, maxLen = 1600,
                       maxN = 0, rm.phix = FALSE, maxEE = 2,
                       compress = TRUE, multithread = TRUE)
Dereplication, Error Learning, and Denoising
Filtered reads are dereplicated, then the PacBio-specific error model is learned and used for denoising. Note the BAND_SIZE = 32 parameter, which is larger than the default to accommodate PacBio's longer reads:
drp <- derepFastq(filts, verbose = TRUE)
err <- learnErrors(drp, errorEstimationFunction = PacBioErrfun,
                   BAND_SIZE = 32, multithread = TRUE,
                   randomize = TRUE)
dd <- dada(drp, err = err, BAND_SIZE = 32, multithread = TRUE,
           verbose = 1)
An error model plot is saved to error_plot.pdf for quality control. Each of these steps also saves an intermediate RDS file, so if the pipeline fails partway through you can inspect the last successful output.
Read Tracking
A read tracking table is written to read_tracking.csv, recording how many reads survived each step:
tracking <- cbind(ccs = prim[,1], primers = prim[,2],
                  filtered = track[,2],
                  denoised = sapply(dd, function(x) sum(x$denoised)))
write.csv(tracking, file.path(path.out, "read_tracking.csv"))
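To quickly eyeball per-sample retention from that file, an awk one-liner works; this is a sketch that assumes the five-column layout shown above (sample name as the CSV row name, then ccs, primers, filtered, denoised). The heredoc just fakes a tiny example file — point the awk at your real output/read_tracking.csv instead:

```shell
# Toy read_tracking.csv with the pipeline's column layout
cat > read_tracking.csv <<'EOF'
"",ccs,primers,filtered,denoised
sampleA,10000,9500,9000,8800
EOF
# Percentage of raw CCS reads surviving through denoising, per sample
awk -F, 'NR > 1 { printf "%s %.0f%%\n", $1, 100 * $5 / $2 }' read_tracking.csv
```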
Sequence Table, Taxonomy, and Chimera Removal
The denoised reads are assembled into a sequence table, then taxonomy is assigned using the GreenGenes2 reference with a minimum bootstrap confidence of 50:
st <- makeSequenceTable(dd)
tax <- assignTaxonomy(st, greengenes2, multithread = TRUE,
                      verbose = TRUE, minBoot = 50)
Escherichia/Shigella assignments are simplified to Escherichia, since the two genera are genomically indistinguishable by 16S.
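The relabeling itself is a one-line substitution; a minimal sketch, assuming the combined label appears in the Genus column of the taxonomy matrix as "Escherichia-Shigella" (check the exact label your GreenGenes2 reference produces):

```r
# `tax` is the taxonomy matrix returned by assignTaxonomy() above;
# collapse the combined genus label (label name is an assumption)
tax[, "Genus"] <- sub("Escherichia-Shigella", "Escherichia", tax[, "Genus"])
```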
Chimeric sequences are then identified and removed from both the sequence table and the taxonomy table:
bim <- isBimeraDenovo(st, minFoldParentOverAbundance = 3.5,
                      multithread = TRUE, verbose = TRUE)
st_nochim <- st[,!bim]
tax_nochim <- tax[!bim,]
Sample Name Extraction
Sample names are extracted from the FASTQ filenames by pulling the biosample identifier from the *.hifi_reads.[BIOSAMPLE].hifi_reads.fastq.gz pattern:
file.names <- basename(fns)
sample.names <- sub(".*\\.hifi_reads\\.(.*?)\\.hifi_reads\\.fastq\\.gz",
                    "\\1", file.names)
rownames(st_nochim) <- sample.names
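Before launching a full five-day run, it's worth previewing what this pattern will extract from your own filenames. An equivalent shell check (POSIX sed has no lazy (.*?), so this greedy version behaves the same only for filenames with exactly two .hifi_reads. segments, which is the expected convention):

```shell
# Hypothetical PacBio filename -- substitute one of your own
f="m84123_240101_000000_s1.hifi_reads.SAMPLE01.hifi_reads.fastq.gz"
echo "$f" | sed -E 's/.*\.hifi_reads\.(.*)\.hifi_reads\.fastq\.gz/\1/'
```

If the output is not the biosample identifier you expect, the regex in the R script needs adjusting before you submit the job.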
Phyloseq Creation
The metadata CSV is read in, sample names are verified against the sequence table, and the phyloseq object is assembled from the chimera-free sequence table, taxonomy, metadata, and reference sequences:
df <- read.csv(user_biosamples, header = TRUE, row.names = 1)
ps <- phyloseq(
  otu_table(st_nochim, taxa_are_rows = FALSE),
  sample_data(df[rownames(st_nochim), , drop = FALSE]),
  tax_table(tax_nochim)
)
dna <- Biostrings::DNAStringSet(colnames(st_nochim))
names(dna) <- colnames(st_nochim)
ps <- merge_phyloseq(ps, dna)
Output
The finished phyloseq and all intermediate objects are copied to output/ in your project folder:
[project-folder]/
└── output/
├── ps_PacBio_16S.rds # final phyloseq
├── Sequence_Table_16S.rds # chimera-free sequence table
├── Taxonomy_GreenGenes.rds # chimera-free taxonomy
├── read_tracking.csv # reads per sample per step
└── error_plot.pdf # DADA2 error model plot
If the job used scratch space for temporary files, those are also copied back to temp/RDS/ in your project folder before cleanup.
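Once the copy-back finishes, the final object can be loaded in any R session that has phyloseq installed (path shown relative to your project folder):

```r
library(phyloseq)
ps <- readRDS("output/ps_PacBio_16S.rds")
ps  # printing the object gives a summary of taxa, samples, and metadata variables
```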