vignettes/v10-Aligning_and_Quantifying_scRNA-Seq_Data.Rmd
In this vignette, we suggest a workflow for processing single-cell RNA-Seq data to produce an SCtkExperiment object that can be used in the single cell toolkit.
Some of the steps in this workflow are performed outside of R and are optional, but they produce useful metrics that can be explored using the single cell toolkit. Please see the individual tool websites for installation instructions.
Quality control of the FASTQ files can identify low-quality or failed samples that you may want to exclude from downstream analysis. FastQC is a widely used tool for quality control of FASTQ files. Run FastQC on the command line on each individual file:
fastqc input_read_1.fastq.gz
This command creates a FastQC HTML report and a zip file of FastQC result files. To combine the individual sample reports into a single report, run MultiQC in the directory that contains them:
multiqc .
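When there are many FASTQ files, the two commands above can be wrapped in a small shell function. The following is a minimal sketch, not part of the toolkit: the function name `run_fastq_qc` and the `FASTQC_BIN` and `MULTIQC_BIN` override variables are assumptions made for illustration, and the function assumes that all files end in `.fastq.gz` and live in one directory.

```shell
#!/usr/bin/env bash
# Sketch: run FastQC on every .fastq.gz file in a directory, then
# summarize the reports with MultiQC.
# ASSUMPTION: FASTQC_BIN and MULTIQC_BIN are hypothetical override
# variables for this sketch; they are not options of either tool.

run_fastq_qc() {
    local fastq_dir="$1"
    local out_dir="$2"
    local fastqc_bin="${FASTQC_BIN:-fastqc}"
    local multiqc_bin="${MULTIQC_BIN:-multiqc}"
    local fq
    local found=0

    # FastQC expects the output directory to exist already.
    mkdir -p "$out_dir" || return 1

    for fq in "$fastq_dir"/*.fastq.gz; do
        [ -e "$fq" ] || continue    # the glob matched nothing
        found=1
        "$fastqc_bin" --outdir "$out_dir" "$fq" || return 1
    done

    if [ "$found" -eq 0 ]; then
        echo "No .fastq.gz files found in $fastq_dir" >&2
        return 1
    fi

    # Collect the per-file FastQC results into a single MultiQC report.
    "$multiqc_bin" "$out_dir" --outdir "$out_dir"
}
```

For example, `run_fastq_qc /path/to/fastq qc_reports` would write the per-file FastQC reports and the combined MultiQC report to `qc_reports`.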
The single cell object contains raw read counts, normalized expression data, sample annotations, and feature annotations, along with the results of downstream analyses. You can make a single cell object from just a count matrix, but sample annotations and feature annotations help you take advantage of some of the analyses available in the single cell toolkit.
For basic alignment and feature counting, you can use the alignSingleCellData() function in the single cell toolkit to align fastq data to a reference genome, count the number of reads per gene, and create a single cell object that contains annotation information. After running the alignSingleCellData() function, you can take the single cell object directly into the shiny app for downstream analysis. Detailed information about the options available in the alignSingleCellData() function can be found in the function help, but here is an example command:
NOTE: Alignment to large genomes (for example, the human genome) can take more than 8 GB of memory. Make sure to run the alignSingleCellData() function on a computer with sufficient memory.
singlecellobject <- alignSingleCellData(
  inputfile1 = c("/path/to/sample1_1.fastq.gz",
                 "/path/to/sample2_1.fastq.gz"),
  inputfile2 = c("/path/to/sample1_2.fastq.gz",
                 "/path/to/sample2_2.fastq.gz"),
  indexPath = "/path/to/genome/index",
  gtfAnnotation = "/path/to/gene/annotations.gtf",
  sampleAnnotations = sample.annotation.df,
  threads = 4)
While setting the number of threads in the alignSingleCellData() function will increase the speed of each individual alignment, the function will align each sample sequentially. To save time, you may want to perform alignment and feature counting on each file in a parallel computing environment and combine the samples to create an SCtkExperiment object. The alignSingleCellData() function can be run on an individual file. Once individual sample objects have been created, combine them into a single SCtkExperiment object.
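One way to run the per-sample step in parallel on a single machine is to start one R process per sample from the shell. The sketch below relies on several assumptions that are not part of the toolkit: `align_one_sample.R` is a hypothetical script that would call alignSingleCellData() on one pair of files and save the resulting object (for example with saveRDS()), the `RSCRIPT_BIN` and `ALIGN_SCRIPT` variables are hypothetical overrides, and the files follow the `_1.fastq.gz` / `_2.fastq.gz` naming used in the example above.

```shell
#!/usr/bin/env bash
# Sketch: start one alignment job per sample, running at most
# max_jobs jobs at a time.
# ASSUMPTION: align_one_sample.R is a hypothetical script that takes
# <read1> <read2> <output.rds> and runs alignSingleCellData() on them.
# ASSUMPTION: RSCRIPT_BIN and ALIGN_SCRIPT are hypothetical overrides.

align_samples_in_parallel() {
    local fastq_dir="$1"
    local out_dir="$2"
    local max_jobs="${3:-2}"
    local rscript_bin="${RSCRIPT_BIN:-Rscript}"
    local align_script="${ALIGN_SCRIPT:-align_one_sample.R}"
    local pids=""
    local status=0
    local running=0
    local r1 r2 sample pid

    mkdir -p "$out_dir" || return 1

    for r1 in "$fastq_dir"/*_1.fastq.gz; do
        [ -e "$r1" ] || continue                 # the glob matched nothing
        r2="${r1%_1.fastq.gz}_2.fastq.gz"        # matching read-2 file
        sample="$(basename "$r1" _1.fastq.gz)"   # e.g. sample1

        # Start this sample's job in the background.
        "$rscript_bin" "$align_script" "$r1" "$r2" "$out_dir/$sample.rds" &
        pids="$pids $!"
        running=$((running + 1))

        # Once a batch of max_jobs is running, wait for all of them.
        if [ "$running" -ge "$max_jobs" ]; then
            for pid in $pids; do wait "$pid" || status=1; done
            pids=""
            running=0
        fi
    done

    # Wait for the final, possibly partial, batch.
    for pid in $pids; do wait "$pid" || status=1; done
    return "$status"
}
```

Because each alignment of a large genome can need several gigabytes of memory, `max_jobs` should be chosen with the available memory in mind. The function returns a non-zero status if any job failed; combining the saved per-sample objects is a separate step done in R.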
If your data requires additional processing steps (UMI normalization, etc.), you can use the alignSingleCellData() function to perform feature counting and create a single cell object without the alignment step by providing '.bam' files in the inputfile1 parameter. Note that if your data was sequenced using paired-end reads, you should set the isPairedEnd parameter to TRUE.
If you choose to align and quantify your data using alternative tools, you can use the createSCE() function to create a single cell object that can be used in the single cell toolkit by providing a count data frame and optionally providing sample and feature annotations. See the createSCE() function help page for details.