How is my Gene Expression nCounter data normalized in ROSALIND? How does ROSALIND calculate differential expression for my Gene Expression nCounter data?

Short answer: we use the same methods as nSolver Advanced Analysis.

Gene Expression RCC Normalization

ROSALIND® follows the NanoString nCounter® Advanced Analysis protocol for data normalization of Gene Expression nCounter RCC Analysis.

Normalization for run-to-run and sample-to-sample variability is done by dividing counts within a lane by the geometric mean of the normalizer probes from the same lane. Normalizer probes are selected by the geNorm algorithm as implemented in the Bioconductor package NormqPCR. While the expression of a good housekeeping gene may vary between samples in non-normalized data, the ratio between the housekeepers should be stable. geNorm relies on the behavior of housekeepers rising and falling together to iteratively remove candidate housekeepers with the least stable expression relative to other candidates.
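To illustrate the idea behind geNorm, here is a minimal Python sketch of its stability measure and its iterative removal step. It is not the NormqPCR implementation that ROSALIND uses. The function names, the `min_genes` stopping rule and the input layout (a dictionary mapping each candidate housekeeper to its counts across samples) are assumptions made for this example, and all counts are assumed to be positive.

```python
import math
import statistics


def stability_m(counts, genes):
    """Return a geNorm-style stability value M for each candidate gene.

    M for a gene is the average, over every other candidate, of the standard
    deviation of their log2 count ratio across samples. Lower M means the
    gene rises and falls together with the other candidates.
    """
    m_values = {}
    for gene in genes:
        pairwise_sd = []
        for other in genes:
            if other == gene:
                continue
            log_ratios = [
                math.log2(a / b) for a, b in zip(counts[gene], counts[other])
            ]
            pairwise_sd.append(statistics.stdev(log_ratios))
        m_values[gene] = sum(pairwise_sd) / len(pairwise_sd)
    return m_values


def rank_housekeepers(counts, min_genes=2):
    """Iteratively drop the least stable candidate (highest M).

    min_genes is a hypothetical stopping rule chosen for this sketch.
    Returns (retained genes, removed genes in order of removal).
    """
    genes = list(counts)
    removed = []
    while len(genes) > min_genes:
        m_values = stability_m(counts, genes)
        worst = max(genes, key=lambda g: m_values[g])
        genes.remove(worst)
        removed.append(worst)
    return genes, removed


# Made-up counts: HK1-HK3 keep constant ratios, BAD does not.
example_counts = {
    "HK1": [100, 200, 400],
    "HK2": [50, 100, 200],
    "HK3": [80, 160, 320],
    "BAD": [100, 50, 900],
}
retained, removed = rank_housekeepers(example_counts, min_genes=3)
```

In this made-up example the three candidates whose ratios stay constant across samples are retained, and the candidate that moves independently is removed first.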

When a user has more than one lot of the same panel, or is working with PlexSet data, reference or calibration samples can be defined in ROSALIND during experiment setup. These samples are used to quantify and adjust for variability in probe efficiency across batches or lanes. Calibration factors are calculated on a per-lot basis and are multiplied across all probes in that lot.

How does ROSALIND calculate differential expression for my data?

Differential gene expression analysis in ROSALIND is currently run with one of the two methods defined by NanoString and available in nSolver Advanced Analysis: the “Fast” method or the “Optimal” method. To maintain consistency with the nSolver Advanced Analysis default settings, which have been in use for many years, we have changed the default in ROSALIND to the “Fast” method.

The Optimal method for calculating differential expression was initially designed to handle special cases of low count data by adding a component to the model that estimates noise dispersion. After extensive use, the Fast method has been shown to successfully analyze a wide array of datasets, including low count data, and it has been adopted as the default and more widely used method. Given this history of use, we have concluded that the noise dispersion estimate in the Optimal method is an unnecessary addition for the analysis of low count data.

For Gene Expression nCounter data, ROSALIND follows the nCounter® Advanced Analysis protocol to identify the targets that show significantly increased or decreased expression. Differential expression is calculated based on user-specified groups. In ROSALIND, users can set up comparisons based on sample attributes or by selecting specific samples for each comparison of interest. ROSALIND implements the generalized linear model (GLM) that was developed by the NanoString Biostatistics team for analysis of count data, assuming a negative binomial distribution. The GLM is used to calculate a fold change and p-value for each gene. Adjusted p-values are calculated using the Benjamini-Hochberg False Discovery Rate (FDR) methodology.
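As an illustration of the Benjamini-Hochberg adjustment mentioned above, the following Python sketch shows the standard step-up procedure applied to a list of raw p-values. It is a generic textbook version, not ROSALIND's code. The function name and the example p-values are made up for this example.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is the
    # minimum of p * m / rank over this rank and every higher rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw_p = [0.01, 0.04, 0.03, 0.005]  # made-up p-values, one per gene
adj_p = benjamini_hochberg(raw_p)
# adj_p is approximately [0.02, 0.04, 0.04, 0.02]
```

Each adjusted value is at least as large as its raw p-value, and genes keep the same significance ordering after adjustment.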

We use the same procedure as nSolver Advanced Analysis to generate normalization factors:

1. For each lane, calculate the geometric mean of the selected normalizer probes.
2. Calculate the geometric mean of these per-lane geometric means across all sample lanes.
3. Divide this overall geometric mean by the geometric mean of each lane to generate a lane-specific normalization factor.
4. Multiply the counts for every probe in a lane by that lane's normalization factor.
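
The four steps above can be sketched in Python as follows. This is an illustration of the arithmetic only, not ROSALIND's code. The function names, lane labels, probe names and counts are made up for this example, and the normalizer probe counts are assumed to be greater than zero so that the geometric mean is defined.

```python
import math


def geometric_mean(values):
    """Geometric mean of strictly positive numbers."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def normalize_lanes(counts, normalizer_probes):
    """counts maps lane -> {probe: raw count}.

    Returns (lane-specific factors, normalized counts).
    """
    # Step 1: geometric mean of the normalizer probes within each lane.
    lane_means = {
        lane: geometric_mean([probes[p] for p in normalizer_probes])
        for lane, probes in counts.items()
    }
    # Step 2: geometric mean of the per-lane geometric means.
    overall_mean = geometric_mean(list(lane_means.values()))
    # Step 3: lane-specific factor = overall mean / lane mean.
    factors = {lane: overall_mean / gm for lane, gm in lane_means.items()}
    # Step 4: scale every probe in a lane by that lane's factor.
    normalized = {
        lane: {probe: count * factors[lane] for probe, count in probes.items()}
        for lane, probes in counts.items()
    }
    return factors, normalized


raw_counts = {
    "lane_A": {"HK1": 100, "HK2": 400, "GENE_X": 50},
    "lane_B": {"HK1": 200, "HK2": 800, "GENE_X": 100},
}
factors, normalized = normalize_lanes(raw_counts, ["HK1", "HK2"])
```

Here lane_B has twice the normalizer signal of lane_A, so lane_A is scaled up by about 1.41 and lane_B is scaled down by about 0.71. After normalization, GENE_X has the same value in both lanes.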