In this vignette, we will suggest a workflow for processing single cell RNA-Seq data to produce an SCtkExperiment object that can be used in the single cell toolkit.
Some of the steps in this workflow are performed outside of R and are optional, but they produce useful metrics that can be explored using the single cell toolkit. Please see the individual tool websites for installation instructions.
Quality control performed on the FASTQ files can identify low quality or failed samples that you may want to exclude from downstream analysis. FastQC is the standard tool used for quality control of fastq files. Run FastQC on the command line on each individual file:
This command will create a fastqc HTML report and a zip file of fastqc result files. To combine individual sample reports into a single report, use multiqc:
The single cell object contains raw read counts, normalized expression data, sample annotations, and feature annotations along with downstream analysis. While you can make a single cell object with just a count matrix, to take advantage of some of the analyses available in the single cell toolkit, sample annotation information and feature information may be helpful.
The sample annotation data frame should contain columns of annotation information and row names that match the sample names in the count matrix. The output table
multiqc_fastqc.txt can be used as a sample annotation data frame.
For basic alignment and feature counting you can use the
alignSingleCellData() function in the single cell toolkit to align fastq data to a reference genome, count the number of reads per gene, and create a single cell object that contains annotation information. After running the
alignSingleCellData() function you can take the single cell object directly into the shiny app for downstream analysis. Detailed information about the options available in the
alignSingleCellData() function can be found in the function help, but here is an example command:
NOTE: Alignment to large genomes (the human genome) can take more than 8GB of memory. Make sure to run the
alignSingleCellData()function from a computer with sufficient memory.
singlecellobject <- alignSingleCellData(inputfile1 = c( "/path/to/sample1_1.fastq.gz","/path/to/sample2_1.fastq.gz"), inputfile2 = c("/path/to/sample1_2.fastq.gz", "/path/to/sample2_2.fastq.gz"), indexPath = "/path/to/genome/index", gtfAnnotation = "/path/to/gene/annotations.gtf", sampleAnnotations = sample.annotation.df, threads=4)
While setting the number of threads in the
alignSingleCellData() function will increase the speed of each individual alignment, the function will align each sample sequentially. To save time, you may want to perform alignment and feature counting on each file in a parallel computing environment and combine the samples to create an SCtkExperiment object. The
alignSingleCellData() function can be run on an individual file. Once individual sample objects have been created, combine them into a single SCtkExperiment object.
If your data requires additional processing steps (UMI normalization, etc), you can use the
alignSingleCellData() function to perform feature counting and create a single cell object without the alignment step if ‘.bam’ files are provided in the
inputfile1 parameter. Note that if your read data was sequenced using paired-end reads, set the
If you choose to align and quantify your data using alternative tools, you can use the
createSCE() function to create a single cell object that can be used in the single cell toolkit by providing a count data frame and optionally providing sample and feature annotations. See the
createSCE() function help page for details.