A review of statistical analysis software programs used in biomedical research, based on a Labome survey of randomly selected, formal publications.
Biological research involves living organisms, either manipulated in vitro or observed as wild types in nature, and these organisms are highly variable within a sample population. Mathematical and statistical methods and tools are therefore needed to extrapolate reliably from sample observations to the total population. Problems with statistical application and interpretation have been identified as one of the leading contributors to current research irreproducibility [1], and many suggestions have been put forth [2, 3]. The tools matter because not all biologists are statisticians: the software bridges the gap between data generation and data analysis so that meaningful conclusions can be drawn, and it empowers researchers who lack a deep understanding of statistics. Table 1 lists the major suppliers and their main software for data processing and statistical analysis cited in formal articles.
Supplier | Main brand | Num | Sample References |
---|---|---|---|
GraphPad | Prism, InStat | 101 | [4], Ver 9 [5, 6], Ver 8 [7, 8], Ver 7 [9, 10], Ver 5 [11] |
Microsoft Corporation | Excel | 17 | [12] |
OriginLab Corporation | Origin | 15 | [12, 13] |
IBM SPSS | SPSS | 13 | [14, 15] |
R | R | 6 | [10, 16] |
StataCorp LP | Stata | 5 | [17] |
MathWorks | Matlab | 3 | [13, 18, 19] |
The graphing and statistical analysis tool by GraphPad is one of the most popular tools in this review. Originally developed for biologists in both academia and industry, it has built its reputation on strengths such as understandable statistics and the ability to retrace every analysis. The software helps the user perform the basic statistical methods needed by laboratory researchers and clinicians, such as t tests, nonparametric comparisons, one- and two-way ANOVA, analysis of contingency tables, and survival analysis.
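To make one of the basic methods listed above concrete, the following is a minimal Python sketch of an unequal-variance (Welch) t test statistic for two independent groups. It is not how Prism computes its results; the function name `welch_t` and the sample values are assumptions made up for illustration, and the sketch stops at the test statistic and degrees of freedom because a p value would additionally require the t distribution.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom) for Welch's t test.

    Hypothetical helper for illustration; both samples need at least two values.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # Sample variances (n - 1 in the denominator).
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    # Squared standard error contributed by each group.
    se2_a, se2_b = var_a / n_a, var_b / n_b
    t_stat = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, df


# Made-up measurements for two groups (illustrative only).
control = [4.1, 3.9, 4.3, 4.0, 4.2]
treated = [5.0, 5.4, 4.9, 5.2, 5.1]

t_stat, df = welch_t(treated, control)
print(f"t = {t_stat:.2f}, df = {df:.1f}")
```

A large absolute t value relative to the degrees of freedom suggests that the group means differ; dedicated packages then convert this to a p value and report it alongside the assumptions made.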
The best part of the software is the interpreted results page prepared at the end of the analysis. The language is simple and straightforward, free of the statistical jargon that a researcher focused on biological interpretation is not much concerned with. The descriptive statistics and the assumptions made during the analysis are explained in detail afterwards. The steps that led to the findings can be clearly traced back, with their order retained. Data that have been lost or left unused in the analysis remain in the same table used for the analysis, marked with an asterisk and shown in blue italics. The data can be annotated and then shared through LabArchives, a cloud-based laboratory notebook.
The software allows automation without programming: graphs or entire analyses can be reused simply by cloning the existing file or graph. It also re-analyzes the data automatically whenever a data point is altered, at runtime, with no need to redo the analysis or redraw the graph. Live updating and graph cloning save a great deal of time by avoiding repetition of the routine each time the same analysis is performed.
Prism is not a full-fledged statistical program, but for applications in biology it is close to complete. Given its ease of use and the level of statistical knowledge among biologists, the software complements the community it serves very well, and it thereby claims the majority share in both data analysis and statistical data analysis. Its graphing capability matches its statistical analysis and is a further reason for its dominance.
The software can also import raw files in basic formats such as csv and txt, the more popular xls, and the more modern XML standard, in addition to its own pzf format. Clairfeuille T et al. analyzed electrophysiological data through Prism 7, in addition to Excel and Origin Pro, to investigate the fast inactivation of voltage-gated sodium channels [12]. GraphPad Prism was used to perform statistical analysis in studies of the effect of sleep deprivation on tau and amyloid beta accumulation [20], the cognitive deficits of Alzheimer's disease [21], the BMP-dependent cardiac contraction [22], the role of the mGluR5-Erk pathway in tuberous sclerosis complex [23], the importance of TCR signaling to natural killer T cell development [24], the role of Lsd1 during blood cell maturation [25], the role of miR-146a in mouse hematopoietic stem cells [26], mitochondrial division in yeast [27], the contributions of mast cells to dengue virus-induced vascular leakage [28], the synaptic transmission and cognitive function impairment caused by reduction of the cholesterol sensor SCAP in the brains of mice [29], and the modulation effect of USF1 in molecular and behavioral circadian rhythms in mammals [30], among others.
Microsoft Excel needs no introduction and, per the dataset taken for this review, is widely used in statistical analysis. The program has a wide reach, and knowledge of how to use it is so widespread that its ease of use is the highest among the reviewed software. It provides functions that perform simple and complex mathematical and statistical operations one at a time. Writing any function is supported by user-friendly, highlighted help shown alongside the function being used. It is intuitive, having been among the first data analysis programs with a graphical user interface to become widely used. The results are not accompanied by descriptive interpretations, so users must draw inferences according to their own level of statistical knowledge. Its data representation is not as comprehensive as that of dedicated graphing software; a simple example is that it cannot readily plot a graph with two y-axes.
Excel is likely not to be cited explicitly in articles, so its actual number of citations is probably much higher. It is cited, for example, in studies investigating the functional property of the CK2 kinase in Drosophila [31], the molecular mechanism of organismal death in nematodes [32], the effect of JAK2 mutations on hematopoietic stem cells [33], mitochondrial division in yeast [27], the contributions of mast cells to dengue virus-induced vascular leakage [28], the role of cMyBP-C in the process of cardiac muscle contraction [34], the effect of colitis on tumorigenesis [35], the presence of Lgr5+ stem cells in mouse intestinal adenomas [36], the regulation of fine motor coordination by AMPAR in BG cells [37], and the regulation of meiotic non-crossover recombination by FANCM ortholog [38].
SAS is the largest purveyor of advanced analytics, and its statistical software is used in a diverse array of scientific and engineering enterprises and organizations. SAS software was used to perform statistical analyses to study the effect of CO2 enhancement on organic carbon decomposition [39], the influence of genetic polymorphisms on complex traits and fitness in plants [40], the regulation of genetic traits by IGF pathway [41], the regulation of oogenic processes by MARF1 [42], the molecular mechanism of novelty seeking in honey bees [43], the effect of nonrandom pollinator movement on reinforcing selection [44], the promotion of beneficial heart growth by fatty acids from Burmese python [45], the regulation of wing pattern evolution in butterfly by optix gene [46], and the coevolution with pathogen in C. elegans [47].
Wavemetrics Igor, a scientific and technical data analysis program, has multiple capabilities, such as data processing, statistical analysis, image processing and analysis, graphing, and 3D and volume visualization. The program has been cited in studies including electrophysiology [19, 48-50], analytical ultracentrifugation [51, 52], high-speed atomic force microscopic observations [53], imaging [50, 54], calcium imaging [55], and X-ray emission spectroscopy [56].
OriginLab publishes Origin, a scientific graphing and data analysis program. It has been cited in studies including fluorescence intensity analysis [57], FRET-FLIM measurement [58], electrophysiological analysis [59], biochemical assays [60], and other data analyses [61-68].
The most comprehensive of the statistical tools listed in this review, SPSS is a cross-disciplinary tool (biology, statistics, social sciences, etc.) with equal depth in each field. Prior knowledge of statistics is needed to use the software, because in most cases the input must be defined appropriately. The software sometimes infers the method to be used for analysis, but in most cases the method must be specified explicitly, which is not straightforward for a novice in statistics.
This business analytics software handles large-scale data with ease, which sets it apart from the other software reviewed here. It can help a researcher at various stages of the analytical process, such as planning, data collection, data access, data management and preparation, analysis, reporting, and deployment.
Each step offers a very large set of possibilities. In data preparation, for example, sorting a very large dataset by unique properties, which cannot easily be identified at a glance and can be a nightmare in other software, is straightforward. SPSS can not only sort the dataset by unique features but also obtain characteristics such as the number of samples with each feature and tabulate the information easily. The potential of SPSS is felt most when the dataset is very large, as computation time remains short even for data on a huge scale. SPSS also has many modules that assist with specific sets of calculations, such as bootstrapping to test the stability and reliability of models, data categorisation for complex and high-dimensional data, exact tests for categorical and non-parametric data problems on small and large datasets, forecasting to predict trends quickly and easily, neural-network-based methods that can relate complex relationships in the data, and regression methods to predict categorical outcomes and apply non-linear regression procedures.
The greatest practical benefit of IBM SPSS is its extensive coverage of a broad spectrum of statistics, combined with its capacity to handle very large datasets. Its bottleneck is that the user needs a medium level of statistical knowledge to unlock its full potential as statistical software.
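The tabulation step described above, counting how many samples carry each unique feature, can be sketched in a few lines of Python. This is only an illustration of the idea and not SPSS syntax or its internal method; the records, the `genotype` labels, and the helper `tabulate_by_feature` are hypothetical.

```python
from collections import Counter

# Hypothetical records: (sample identifier, genotype label).
records = [
    ("s1", "WT"), ("s2", "KO"), ("s3", "WT"),
    ("s4", "HET"), ("s5", "KO"), ("s6", "WT"),
]


def tabulate_by_feature(rows, feature_index):
    """Count rows per unique value of one column and sort the result.

    Returns (value, count) pairs ordered by descending count, then by value.
    """
    counts = Counter(row[feature_index] for row in rows)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


table = tabulate_by_feature(records, 1)
for label, count in table:
    print(f"{label}\t{count}")
```

On a dataset with thousands of rows the same approach still applies, which is the situation in which a frequency table becomes far more informative than visual inspection.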
SPSS software has been cited in multiple studies for statistical analysis [69, 70].
The plotting software SigmaPlot from Systat Software Inc. is a full-fledged graphical representation program. It can import data ready for graphical representation in numerous formats, including plain ASCII files, txt, csv, Excel, MS Access, SigmaScan, ODBC-compliant databases (with SQL queries run on tables), and TableCurve 2D and 3D, as well as graphical formats such as BMP, JPEG, GIF, PNG, HTML, TIFF, PDF, PSD, and EPS.
SigmaPlot can save templates, so that the formatting and other customizations made to a graph need not be repeated when the same type of graph is created from another dataset. The built-in macro language interface allows Visual Basic compatible programming, and a macro recorder can save and play back actions. Graphs can be exported directly to PowerPoint, ready for presentation, and inserted into a Word file in progress.
The program allows all sorts of symbols relevant to scientific representation to be incorporated in graphs, and its user-friendly Report Editor makes it easy to bring reports into documents such as Word and Excel files via cut, copy, and paste. Reports can be transferred directly to an HTML page and shared online without any knowledge of web programming. SigmaPlot has plugins compatible with Excel, thereby complementing and enhancing the Excel graphing tools. Tools such as curve fitting and other mathematical functions from the built-in library enhance the quality of data representation in SigmaPlot.
In short, SigmaPlot is, as its name suggests, an excellent plotting and graphing tool, but its built-in functions add little value beyond that.
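For readers unfamiliar with the curve fitting mentioned above, the simplest case is an ordinary least-squares straight line. The sketch below is a generic Python illustration under that assumption and does not reproduce SigmaPlot's fitting library; the function `fit_line` and the calibration-style data are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares.

    Hypothetical helper for illustration; requires at least two points
    and x values that are not all identical.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have equal length of at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross products of deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up calibration data lying on y = 2x + 0.1 (illustrative only).
concentrations = [0.0, 1.0, 2.0, 3.0, 4.0]
readings = [0.1, 2.1, 4.1, 6.1, 8.1]

slope, intercept = fit_line(concentrations, readings)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

Graphing packages typically overlay the fitted line on the scatter plot and extend the same idea to non-linear models.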
R, a free statistics and visualization programming language, has been gaining popularity among researchers in recent years. It has over 6000 packages, contributed by volunteers, covering a wide range of disciplines from molecular biology and phylogeny to stock markets. PZ Wu et al used the profileR and glmnet packages in R 3.6.1 to analyze histological data in order to understand the relative contributions of auditory nerve fibers and inner and outer hair cells to age-related hearing loss [16]. Chopra S et al generated volcano plots of prostaglandin lipidomics analysis in RStudio with the Bioconductor limma package [71]. Litke JL et al used R 3.4.1 and RStudio 1.0.157 for statistical analysis and data plotting [72].
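A volcano plot, as mentioned above, conventionally places the log2 fold change on the x-axis and the negative log10 of the p value on the y-axis. The Python sketch below shows only that coordinate transformation; it is not the limma workflow used in the cited study, and the function names and the cut-offs (a two-fold change and p of 0.05) are assumptions chosen for illustration.

```python
import math


def volcano_point(fold_change, p_value):
    """Return (log2 fold change, -log10 p value) for one feature."""
    if fold_change <= 0 or not (0 < p_value <= 1):
        raise ValueError("fold_change must be > 0 and p_value in (0, 1]")
    return math.log2(fold_change), -math.log10(p_value)


def is_hit(fold_change, p_value, min_abs_log2_fc=1.0, max_p=0.05):
    """Flag features passing both cut-offs (assumed thresholds, not from the source)."""
    x, y = volcano_point(fold_change, p_value)
    return abs(x) >= min_abs_log2_fc and y >= -math.log10(max_p)


# Made-up (fold change, p value) pairs for three hypothetical lipids.
features = {"lipid_A": (4.0, 0.01), "lipid_B": (1.2, 0.01), "lipid_C": (0.25, 0.2)}

for name, (fc, p) in features.items():
    x, y = volcano_point(fc, p)
    print(f"{name}: x = {x:.2f}, y = {y:.2f}, hit = {is_hit(fc, p)}")
```

Features in the upper left and upper right corners of the resulting plot are those with both a large change and a small p value.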
A significant portion of this article appeared as a section in another Labome article Software Programs in Biomedical Research.
- Materials and Methods [ISSN : 2329-5139] is a unique online journal with regularly updated review articles on laboratory materials and methods.