Creating the SQL File
These instructions will help you create a taxonomizr-prepared SQLite database for use in creating the references and phyloseq objects.
Download Scripts
Note
As of 01/15/26, all scripts except conda.sh are available at https://github.com/LAD-LAB/mb-pipeline/tree/main/reference/sql-creation.
You will need the following four scripts in a single folder on your computing cluster. In that same folder, also create a subfolder called tempdir.
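The folder layout can be prepared ahead of time. A minimal sketch, assuming a hypothetical folder named sql-creation under the current directory (any location on the cluster works; the name is only an example):

```shell
# SCRIPT_DIR is a hypothetical location; point it at wherever you keep the scripts.
SCRIPT_DIR="${SCRIPT_DIR:-$PWD/sql-creation}"

# Create the script folder and its tempdir subfolder in one step.
mkdir -p "$SCRIPT_DIR/tempdir"

# List the subfolder to confirm that it exists.
ls -d "$SCRIPT_DIR/tempdir"
```

The four scripts described below then go directly into the folder that SCRIPT_DIR points to.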
The first script, conda.sh, initializes conda; update the two paths at the top to match your conda installation:
export CONDA_EXE='/hpc/group/ldavidlab/users/[NetID]/miniconda3/bin/conda'
export _CE_M=''
export _CE_CONDA=''
export CONDA_PYTHON_EXE='/hpc/group/ldavidlab/users/[NetID]/miniconda3/bin/python'
# Copyright (C) 2012 Anaconda, Inc
# SPDX-License-Identifier: BSD-3-Clause
__conda_exe() (
    "$CONDA_EXE" $_CE_M $_CE_CONDA "$@"
)

__conda_hashr() {
    if [ -n "${ZSH_VERSION:+x}" ]; then
        \rehash
    elif [ -n "${POSH_VERSION:+x}" ]; then
        : # pass
    else
        \hash -r
    fi
}

__conda_activate() {
    if [ -n "${CONDA_PS1_BACKUP:+x}" ]; then
        PS1="$CONDA_PS1_BACKUP"
        \unset CONDA_PS1_BACKUP
    fi
    \local ask_conda
    ask_conda="$(PS1="${PS1:-}" __conda_exe shell.posix "$@")" || \return
    \eval "$ask_conda"
    __conda_hashr
}

conda() {
    \local cmd="${1-__missing__}"
    case "$cmd" in
        activate|deactivate)
            __conda_activate "$@"
            ;;
        install|update|upgrade|remove|uninstall)
            __conda_exe "$@" || \return
            __conda_activate reactivate
            ;;
        *)
            __conda_exe "$@"
            ;;
    esac
}

if [ -z "${CONDA_SHLVL+x}" ]; then
    \export CONDA_SHLVL=0
    if [ -n "${_CE_CONDA:+x}" ] && [ -n "${WINDIR+x}" ]; then
        PATH="$(\dirname "$CONDA_EXE")/condabin${PATH:+":${PATH}"}"
    else
        PATH="$(\dirname "$(\dirname "$CONDA_EXE")")/condabin${PATH:+":${PATH}"}"
    fi
    \export PATH
    if [ -z "${PS1+x}" ]; then
        PS1=
    fi
fi
On other systems, the script is the same except for the two path lines at the top, which should point to your own Miniconda installation:
export CONDA_EXE='[/path/to/miniconda3]/bin/conda'
export CONDA_PYTHON_EXE='[/path/to/miniconda3]/bin/python'
The second script, taxonomizr.sh, is the Slurm batch script that sources conda.sh, activates the conda environment, and runs the R script:
#!/bin/bash
#SBATCH --job-name=taxonomizr
#SBATCH --partition common-old,scavenger
#SBATCH --mem=64000
#SBATCH -n 2 # Number of tasks (one core each by default)
#SBATCH --out=taxonomizr-%j.out
#SBATCH --error=taxonomizr-%j.err
#SBATCH --mail-user=[NetID]@duke.edu
#SBATCH --mail-type=END,FAIL

# Usage: taxonomizr.sh [/path/to/SQL/directory]

# source QIIME2 environment
source [/path/to/conda.sh]
conda activate [qiime2-2022.8]

# run the taxonomizr script through Rscript-echo.R, which also writes an .Rout log
Rscript Rscript-echo.R taxonomizr.R "$1"
On other clusters, use the same script with the cluster-specific values replaced by your own:
#SBATCH --partition [your-partition]
#SBATCH --mail-user=[username]@[your-email]
conda activate [your-conda-env]
Note
You must create a conda environment for this script to activate. The section Setting Up a Conda Environment explains how; don't forget to add the name of the environment you create back into taxonomizr.sh!
The third script, taxonomizr.R, is the R script that builds the SQL database:
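# Prepare NCBI taxonomy SQL database on cluster (runs out of memory locally)

# Setup -----------------------------------------------------------------------
# Called through Rscript-echo.R, so args[1] is this script's name and args[2]
# is the target directory.
args <- commandArgs(trailingOnly=TRUE)
print(args)
setwd(args[2]) # Set the working directory
library(taxonomizr); packageVersion('taxonomizr') # Load library and report its version

# Format SQL database ---------------------------------------------------------
# Build the PRAGMA string with paste0() so the directory path is substituted;
# written inside a plain string literal, args[2] would reach SQLite verbatim.
prepareDatabase('accessionTaxa.sql',
                extraSqlCommand=paste0("PRAGMA temp_store_directory = '", args[2], "';"))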
The fourth script, Rscript-echo.R, copies Rscript output to an .Rout file:
# Using a combination of source() and sink(), get Rscript to produce an .Rout file like that
# produced by R CMD BATCH.
# Command-line usage: Rscript Rscript-echo.R [Primary script name] [Primary script args]
# Remember to adjust args indices of receiving script accordingly!
args <- commandArgs(TRUE)
srcfile <- args[1]
outfile <- file.path(args[2], paste0(make.names(date()), '.Rout'))
sink(outfile, split=TRUE)
source(srcfile, echo=TRUE)
Setting Up a Conda Environment
Next, you must set up a conda environment. Log into your computing cluster and install Miniconda3, following the installer's instructions when prompted. When asked for the install location, give /hpc/group/ldavidlab/users/[NetID]/miniconda3.
Next, create a conda environment and follow the prompts. Install R into this environment, again following the prompts. Then activate the environment, start R, and install the necessary packages (the scripts above load taxonomizr). After the packages are installed, run q() to exit R.
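The setup steps above can be sketched as the command sequence below. Several details are assumptions rather than part of these instructions: the installer URL (the standard Miniconda download location), the environment name taxonomizr-env, installing R from the conda-forge r-base package, and installing taxonomizr from a CRAN mirror. The run helper and the DRY_RUN switch are illustrative only; by default the block just prints the commands so you can review them first.

```shell
# Assumptions: installer URL, environment name, conda-forge r-base, CRAN mirror.
INSTALLER_URL="https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh"
ENV_NAME="${ENV_NAME:-taxonomizr-env}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands; 0 = execute them

# Print each command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

setup_env() {
    # 1. Download and run the Miniconda3 installer (it prompts for the install location).
    run wget "$INSTALLER_URL" -O miniconda.sh
    run bash miniconda.sh
    # 2. Create the environment and install R into it.
    #    conda must be on PATH here, e.g. after sourcing conda.sh in a new shell.
    run conda create -n "$ENV_NAME"
    run conda install -n "$ENV_NAME" -c conda-forge r-base
    # 3. Install taxonomizr using the environment's R.
    run conda run -n "$ENV_NAME" Rscript -e 'install.packages("taxonomizr", repos="https://cloud.r-project.org")'
}

setup_env
```

Whatever name you choose for ENV_NAME is the name that goes after conda activate in taxonomizr.sh.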
Running the Scripts
Navigate to the folder containing the scripts above. Before running anything, update taxonomizr.sh with the name of your conda environment and the path to conda.sh, and update conda.sh with the paths to your conda installation.
Next, set the tempdir subfolder as your temporary directory:
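One way to do this, assuming the temporary directory is read from the TMPDIR environment variable (R consults it at startup) and that you are in the script folder:

```shell
# Assumption: TMPDIR is the variable intended here; run this from the script folder.
mkdir -p "$PWD/tempdir"        # make sure the subfolder exists
export TMPDIR="$PWD/tempdir"   # sbatch passes exported variables to the job by default
echo "Temporary directory: $TMPDIR"
```

The export only lasts for the current shell session, so run it in the same session from which you submit the job.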
And now you can run the scripts:
sbatch --mail-user=[NetID]@duke.edu /hpc/group/ldavidlab/users/[NetID]/[script-folder]/taxonomizr.sh /hpc/group/ldavidlab/users/[NetID]/[script-folder]
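Slurm writes the job's output to taxonomizr-[jobid].out and taxonomizr-[jobid].err in the folder you submitted from, as set by the #SBATCH lines in taxonomizr.sh. A small sketch for finding the most recent of these logs; the helper name latest_log is made up for illustration:

```shell
# Print the newest file matching taxonomizr-*.<ext> in the current folder
# (<ext> is "out" or "err"); prints nothing if no such file exists yet.
latest_log() {
    ls -t taxonomizr-*."$1" 2>/dev/null | head -n 1
}

# Show the last lines of the newest error log, if there is one.
log="$(latest_log err)"
if [ -n "$log" ]; then
    tail -n 20 "$log"
else
    echo "No taxonomizr-*.err file found yet."
fi
```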
Last, make sure to upload the SQL file to Isilon! Given its size, this will take a while.
On other clusters, the same steps apply with your own paths and email address. Set the tempdir subfolder as your temporary directory, then submit the job:
sbatch --mail-user=[username]@[your-email] [/hpc/path/to/lab/directory]/users/[username]/[script-folder]/taxonomizr.sh [/hpc/path/to/lab/directory]/users/[username]/[script-folder]
Last, make sure to copy the SQL file to your shared storage so it can be accessed from R. Given its size, this will take a while.
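Because the file is large, it can be worth confirming that the copy finished intact. A sketch using cp and cksum; the helper name copy_and_verify and the destination path are placeholders, and accessionTaxa.sql is the file name used in taxonomizr.R:

```shell
# Copy a file into a destination folder and compare checksums of both copies.
copy_and_verify() {
    src="$1"
    dest_dir="$2"
    cp "$src" "$dest_dir/" || return 1
    # Reading from stdin makes cksum print only the checksum and byte count.
    src_sum="$(cksum < "$src")"
    dest_sum="$(cksum < "$dest_dir/$(basename "$src")")"
    if [ "$src_sum" = "$dest_sum" ]; then
        echo "Copy verified: $dest_dir/$(basename "$src")"
    else
        echo "Checksum mismatch for $src" >&2
        return 1
    fi
}

# Example call (the destination is a placeholder for your shared storage):
# copy_and_verify accessionTaxa.sql [/path/to/shared/storage]
```

Reading the file twice adds time for a database of this size, so treat the check as optional.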