What are the most used statistical methods in WM research besides (generalized) linear modeling?

What are the most used statistical methods in WM research besides (generalized) linear modeling?

I mean, by cognitive psychologists. I skimmed over a few articles from Journal of Cognitive Neuroscience and Trends in Cognitive Science and concluded that ANOVAs and t-tests are mostly used (without effect sizes). I'm sure that there are more scientific journals of cognitive psychology than these two, and I guess, that there are also more advanced statistical tools than comparing groups. For example, in social psychology, structural equations modelling and mediation analysis are in fashion.

When asking this question, I had working memory in mind. Common methods to measure working memory empirically are, among others: dual task paradigms, n-back tasks and composites of simple memory measures. The main statistical tools that are applied to these measures seem to be mainly ANOVAs, t-tests, and other variations on traditional linear modeling, but what are the alternatives?

What are the most used statistical methods in WM research besides (generalized) linear modeling?

It is difficult to overstate the extent to which analysis of variance-based linear modeling based for different groups dominates the cognitive sciences. A recent methodological review of the psychology literature suggesting that these analyses are used to test hypotheses in as much as 95% of studies (citation pending me recovering it). There are alternatives in working memory, but they often have more limited empirical support due to the dominance of the aforementioned analyses.

Hierarchical models

One alternative that doesn't veer too far from established practices is to use a hierarchical modeling approach, as you suggest. For example, Lee and Webb (2005) proposed a Bayesian hierarchical modeling of working memory which accounted for both individual differences and aggregated group data. The main benefit here is to take advantage of both individual- and group-level variability, but we ultimately end up with a linear model.

Many evaluations of cognitive models rely on data that have been averaged or aggregated across all experimental subjects, and so fail to consider the possibility of important individual differences between subjects. Other evaluations are done at the single-subject level, and so fail to benefit from the reduction of noise that data averaging or aggregation potentially provides. To overcome these weaknesses, we have developed a general approach to modeling individual differences using families of cognitive models in which different groups of subjects are identified as having different psychological behavior. Separate models with separate parameterizations are applied to each group of subjects, and Bayesian model selection is used to determine the appropriate number of groups.


An emerging, but currently more niche approach is to use topology. Topology is making some exciting inroads in the psychological methodology literature (e.g., Butner et al., 2014), but so far has not managed to see substantial use in the empirical WM literature. However, there are a few early birds who've made some attempts to capture WM within a topological context, namely Glassman's work on a "theory of relativity" for working memory (Glassman, 1999; Glassman, 2003).

In the succeeding excerpt, he explains why human verbal memory capacity is 3-4 items from a topological perspective:

Thus, a moment's WM is hypothesized here to reside in myriad activated cortical planar “patches,” each subdivided into up to four amoeboid “subpatches.” Two different lines of topological reasoning suggest orderly associations of such representations. (1) The four-color principle of map topology, and the related K4 is planar theorem of graph theory, imply that if a small cortical area is dynamically subdivided into no more than four, discretely bounded planar subareas, then each such segment has ample free access to each of the others. (2) A hypothetical alternative to such associative adjacency of simultaneously active cortical representations of chunk-attributes is associative overlap, whereby, in dense cortical neuropil, activated subpatches behave like Venn diagrams of intersecting sets. As the number of Venn-like coactive subpatches within a patch increases, maintaining ad hoc associativity among all combinations requires exponentially proliferating intersections. Beyond four, serpentine subpatch shapes are required, which could easily lead to pathologies of omission or commission.


I am tangentially aware of related attempts to develop a theory of (working) memory as a form of foraging behavior, but not sufficiently to give a summary of it. Hopefully, the link is instructive.


  • Butner, J. E., Gagnon, K. T., Geuss, M. N., Lessard, D. A., & Story, T. N. (2014). Utilizing Topology to Generate and Test Theories of Change. Psychological methods.
  • Lee, M. D., & Webb, M. R. (2005). Modeling individual differences in cognition. Psychonomic Bulletin & Review, 12(4), 605-621. doi:10.3758/BF03196751
  • Glassman, R. B. (1999). A working memory “theory of relativity”: elasticity in temporal, spatial, and modality dimensions conserves item capacity in radial maze, verbal tasks, and other cognition. Brain Research Bulletin, 48(5), 475-489. doi:10.1016/S0361-9230(99)00026-X
  • Glassman, R. B. (2003). Topology and graph theory applied to cortical anatomy may help explain working memory capacity for three or four simultaneous items. Brain Research Bulletin, 60(1-2), 25-42. doi:10.1016/S0361-9230(03)00030-3

A GLM is a more general version of a linear model: the linear model is a special case of a Gaussian GLM with the identity link. So the question is then: why do we use other link functions or other mean-variance relationships? We fit GLMs because they answer a specific question that we are interested in.

There is, for instance, nothing inherently wrong with fitting a binary response in a linear regression model if you are interested in the association between these variables. Indeed if a higher proportion of negative outcomes tends to be observed in the lower 50th percentile of an exposure and a higher proportion of positive outcomes is observed in the upper 50th percentile, this will yield a positively sloped line which correctly describes a positive association between these two variables.

Alternately, you might be interested in modeling the aforementioned association using an S shaped curve. The slope and the intercept of such a curve account for a tendency of extreme risk to tend toward 0/1 probability. Also the slope of a logit curve is interpreted as a log-odds ratio. That motivates use of a logit link function. Similarly, fitted probabilities very close to 1 or 0 may tend to be less variable under replications of study design, and so could be accounted for by a binomial mean-variance relationship saying that $se(hat) = hat(1-hat)$ which motivates logistic regression. Along those lines, a more modern approach to this problem would suggest fitting a relative risk model which utilizes a log link, such that the slope of the exponential trend line is interpreted as a log-relative risk, a more practical value than a log-odds-ratio.

Using GLMM to Avoid the Need for Transformation of Skewed RT Data

To illustrate the application of GLMM to address the problems with transformation outlined earlier, we re-analyzed the three experiments that Balota et al. (2013) recently demonstrated to yield contradictory outcomes in analyses conducted on raw and transformed data. They report that LMM analyses of the inverse RT transformed data that best satisfied criteria for normality yielded underadditive interactions rather than the additive effects of frequency and stimulus quality found with raw RT.

We report the results of six analyses of each of the three experiments. Two of the analyses parallel Balota et al.'s (2013), by using LMMs on raw RT (DV = RT) and inverse RT (DV = �/RT). By default, these analyses assume a Gaussian distribution and identity link function. The remaining four analyses are GLMMs on raw RT which assume either a Gamma or Inverse Gaussian distribution of the DV, and a linear (identity link function RT = μ ^ ) or inverse relationship (inverse link function RT = �/ μ ^ ) between the predictors and RT. We chose �/ μ ^ as the specific form of the inverse link function to parallel the inverse transformation applied to RTs in Balota et al.'s (2013) LMM analyses (i.e., �/RT). These GLMM analyses are therefore analogous to the LMM analyses conducted on inverse RT.

The primary interest is in the results from the properly specified GLMM based on the decisions described in the previous section, but we also aim to clarify how differences in the specification of the dependent variable and link function relate to the conflicting findings between raw and inverse transformed RT reported by Balota et al. (2013).

The analyses were conducted on RT data for correct word responses for Yap and Balota (2007) and Yap et al. (2008 Experiments 1 and 2) using version 1.0-5 of the lme4 package (Bates et al., 2013) in the R program for statistical computing (R Core Team, 2013) following the same trimming procedures described in Balota et al. (2013). Since there is continuing debate as to how p-values should be generated for LMMs because of computational issues regarding degrees of freedom, we follow the current practice of considering effects greater than two standard errors (i.e., |t|> 2) to be significant at the 0.05 level for datasets involving a large number of observations (Kliegl et al., 2010 Masson and Kliegl, 2013). The R syntax used to generate these models along with the full model output and predicted mean RT for each condition can be found in the Supplementary Materials.

Figure 2 summarizes the predictions of the models assuming a linear relationship between the predictors and RT for the three experiments. The corresponding results for models assuming an inverse relationship between the predictors and RT are presented in Figure 4. Each column of Figures 2𠄵 corresponds to a different experiment, while the rows of the figures present estimates from the LMM models (top row), and GLMM models assuming Gamma (middle row), and Inverse Gaussian (bottom row) distributions, respectively, of the DV.

Figure 2. Prediction plots illustrating the estimated frequency effect and statistical results (t- or z- value and corresponding p-value) of the word frequency by stimulus quality interaction (shaded region on x-axis) based on models assuming a linear relationship between the predictors and RT (identity link function). Note that the back-transformed estimates (shaded region on y-axis) are identical because of the identity link function. Each column of plots represents the results from a different experiment (from left to right: Yap and Balota, 2007 Yap et al., 2008: Experiment 1 and Yap et al., 2008: Experiment 2), and each row of plots represents a different assumption for the distribution of RTs (from top to bottom: Gaussian, Gamma, and Inverse Gaussian). Note that precise p-values are produced in GLMM for the Wald Z-statistic in R, while approximate p-values can only be inferred based on the magnitude of the t-value produced in LMM.

Figure 3. Plots of the residuals over predicted RT from models assuming a linear relationship between the predictors and RT (identity link function). Each column of plots represents the results from a different experiment (from left to right: Yap and Balota, 2007 Yap et al., 2008: Experiment 1 and Yap et al., 2008: Experiment 2), and each row of plots represents a different assumption for the distribution of RTs (from top to bottom: Gaussian, Gamma, and Inverse Gaussian).

Figure 4. Prediction plots illustrating the estimated frequency effect and statistical results (t- or z-value and corresponding p-value) of the word frequency by stimulus quality interaction (shaded region on x-axis) based on models assuming an inverse relationship between the predictors and RT (inverse link function). The plots also present the back-transformed estimates (shaded region on y-axis) on the original RT metric. Each column of plots represents the results from a different experiment (from left to right: Yap and Balota, 2007 Yap et al., 2008: Experiment 1 and Yap et al., 2008: Experiment 2), and each row of plots represents a different assumption for the distribution of RTs (from top to bottom: Gaussian, Gamma, and Inverse Gaussian). Note that precise p-values are produced in GLMM for the Wald Z-statistic in R, while approximate p-values can only be inferred based on the magnitude of the t-value produced in LMM.

Figure 5. Plots of the residuals over predicted RT (or inverse RT) from models assuming an inverse relationship between the predictors and RT (inverse link function). Each column of plots represents the results from a different experiment (from left to right: Yap and Balota, 2007 Yap et al., 2008: Experiment 1 and Yap et al., 2008: Experiment 2), and each row of plots represents a different assumption for the distribution of RTs (from top to bottom: Gaussian, Gamma, and Inverse Gaussian).

For each model summarized in Figures 2, 4, the shaded region of the prediction plot depicts the estimated effect of word frequency (difference between high and low frequency conditions) based on the fitted values for each of the four frequency by stimulus quality conditions as plotted on the model transformed scale (x-axis), while the y-axis plots the same difference after the mean estimates have been back-transformed via the link function on the original RT scale. The estimates are identical on the model and back-transformed RT scales in Figure 2 because the identity link assumes that the scale of the latent construct assessed by the model (x-axis) is synonymous with RT. The form of the link function itself is depicted by the solid black line connecting the diagonals of the plot.

Although an identity link function (DV = μ ^ ) was also specified for the LMM analysis on inverse transformed RTs (DV = �/RT), we depict a non-linear function in Figure 4 to illustrate the back-transformation from inverse to raw RT (RT = �/ μ ^ ) that researchers typically apply to interpret their data. The p-value corresponding to the critical interaction effect, which is presented in the bottom-right corner of each plot only assesses whether there is a significant difference in the linear effect of frequency on the model transformed scale (x-axis), and does not assess whether significant (linear) differences exist on the original RT scale (y-axis) unless the identity link was specified (Berry et al., 2010).

Selecting the Right Model

Each of the individual analyses in Figures 2, 4 produced subtle differences in the magnitude, direction or statistical significance of the word frequency by stimulus quality interaction. A decision must therefore be made about the best-fitting correctly specified model. There are a number of ways to address this question.

Throughout the previous sections, we have argued that, from a theoretical perspective, the dependent variable of theoretical interest in mental chronometric research like this is raw RT, and that additive factors logic assumes a linear relationship between the experimentally manipulated variables and RT itself. From this perspective, only the analyses using raw RT as the dependent variable and specifying an identity link function provide meaningfully interpretable results for this experiment (Figure 2).

To further discriminate between the analyses, we can identify the statistical model that provides predictions which best fits the observed RTs. Figure 3 allows a visual inspection of model fit, by plotting the residuals against predicted RT. The LMM analyses (top row of plots), which assume a Gaussian distribution of raw RT, clearly exhibit a heteroscedastic (fan-shaped) pattern that is not evident in the GLMM analyses assuming a Gamma or Inverse Gaussian distribution (middle and bottom row of plots). Therefore, these plots suggest that the Gamma or Inverse Gaussian distributions provide a better fit to the data because they explicitly account for the hetereoscedastic pattern of increasing variability with slower responses and therefore yield more normally distributed residuals.

A similar conclusion derives from AIC and BIC summary fit indices presented in Table 1, and the estimated Gaussian, Gamma, and Inverse Gaussian distribution fits to the observed RT density in Figure 1. Across the three experiments, the Inverse Gaussian distribution (followed by the Gamma and Gaussian distributions) produce parameters that best approximate the shape of the observed RT distribution, and yield fit values that are consistently lower than the Gamma or Gaussian distributions. Thus, on both these graphical and empirical indices, the Inverse Gaussian distribution provides the best fitting model.

Table 1. AIC and BIC indices of model fit comparing LMMs and GLMMs of different distribution and link assumptions for each of the three experiments.

Having identified the most appropriate statistical model, we can consider its results. Consistent with the ANOVA analyses reported in the original published papers, none of the three experiments yielded a significant interaction between word frequency and stimulus quality in the Inverse Gaussian GLMM with identity link function (bottom row of plots in Figure 2). This model predicted effects of frequency that were 5, 8, and 5 ms greater for the degraded than clear condition in the Yap and Balota (2007), Yap et al. (2008, Experiment 1), and Yap et al. (2008, Experiment 2) data, respectively. The magnitude and direction of these effects are essentially identical to the 6, 7, and 5 ms overadditive effect reported in original ANOVA analyses. Although these estimated effects are similar to those predicted in the poorer fitting Gamma and Gaussian GLMM with identity link (top and middle row of plots in Figure 2), the test statistic (t- or z-value) is larger and corresponding p-value lower for the better fitting models, suggesting that the standard errors have been more precisely estimated. Better fitting models provide more powerful adjustment to extreme values, particularly in the slowest condition of degraded low frequency words, where calculation of the average would be most affected, thus allowing greater power as well as reliability with which to assess individual differences between subjects and items (see Appendix in Supplementary Material for mean RT predicted for each condition by the six models).

Different conclusions about the relationship between word frequency and stimulus quality are suggested by the results of models using transformed RTs or link functions that assume a non-linear relationship between the predictors and RT. From the perspective of model fit alone, the analysis on inverse transformed RT produces residuals that offer the least amount of hetereoscedasticity (Figure 5), suggesting that the fit is at least as good, if not better, than the Inverse Gaussian GLMM with identity link described above 4 . This is the expected outcome of applying the Box𠄼ox procedure to estimate a power transformation that stabilizes variance in order to create normally distributed data. However, although these models meet the mathematical assumptions of normality required by LMM, as Balota et al. (2013) report, relying on the transformed DV in LMM put the researcher in the unhappy situation of developing an ad-hoc explanation of why the estimated effect of frequency is now underadditive (Figure 4), as opposed to the additive or slightly overadditive effects observed on raw RT.

These contradictions arise because interval differences in the dependent variable are distorted when non-linear transformations are applied. For each of the prediction plots based on an inverse transformation or inverse link function in Figure 4, almost all of the back-transformed estimates suggest no difference, or a slightly larger numerical effect of frequency for degraded words (a small overadditive effect) on the RT scale (y-axis). However, on the model estimate scale (x-axis), these differences are distorted by the non-linear inverse link function into a numerically larger effect of frequency for clear words (underadditive effect). For the Yap and Balota (2007) experiment, the distortion caused by the non-linear transformation was severe enough to push the underadditive effect to statistical significance in the LMM analysis (top left panel of Figure 4). The underadditive interactions in this dataset were also marginally significant in the GLMM analyses using the inverse link function.

To meaningfully interpret this underadditive effect, and effects assessed on the inverse RT scale more generally, the researcher must assume that the predictors are inversely related to RT. This view is consistent with recent attempts to map effects assessed on the reciprocal scale to differences in processing rate or processing speed (Kliegl et al., 2010). For example, processing rate or speed of evidence accumulation is assumed to be slower for visually degraded as opposed to clearly presented words in activation models (e.g., McClelland and Rumelhart, 1981), thus yielding the slower RT typically observed for these conditions. However, a core assumption within all of these models is that rate of evidence accumulation is linear over time (e.g., Borowsky and Besner, 1993 Ratcliff and Rouder's, 2000, diffusion model Brown and Heathcote's, 2008, linear ballistic accumulator)—in direct contrast to the non-linear relationship implied by the inverse scale. So while there may be physiological reasons to expect non-linearity at the level of neural spike rates (e.g., Carpenter and Williams, 1995), the implications associated with the reciprocal nature of this transformation on RT appears to be limited because psychological models assuming linearity are able to closely predict responses in observed data (Ratcliff, 1978 Brown and Heathcote, 2008).

Thus, the GLMM procedure allows researchers to select the DV most appropriate to their research question rather than use a transformed DV simply to meet mathematical assumptions. If raw RT is the most appropriate metric, as we have argued to be the case for most mental chronometric research, an Inverse Gaussian or Gamma distribution can be assumed to achieve more normal homoscedastic residuals, while retaining raw RT as the DV. As Figure 2 shows, this produces more power than LMMs conducted on raw RT. Alternatively, if the researcher's predictions are for a transformed scale, such as inverse RT, specifying a non-linear link function of the same form as the inverse transformation applied to RTs (inverse link function �/ μ ^ ) produces an identical distortion of frequency effects toward underadditivity (see middle and bottom row of prediction plots in Figure 4). Moreover, there appears to be no loss in model fit relative to the matching models using an identity link according to both a visual inspection of the residuals (Figures 3, 5) and empirical fit statistics (Table 1).

In summary, GLMMs allow assumptions regarding the relationship between the predictors and the dependent variable to be tested independently of assumptions regarding the distribution of dependent variable. In LMM, the two are confounded because the relationship between the predictors and the dependent variable is dictated by the transformation selected to normalize the distribution of the dependent variable. By contrast, GLMM allows the form of the link function to be determined by the theoretical issues under consideration.

Biostatistics and Genetics Edit

Biostatistical modeling forms an important part of numerous modern biological theories. Genetics studies, since its beginning, used statistical concepts to understand observed experimental results. Some genetics scientists even contributed with statistical advances with the development of methods and tools. Gregor Mendel started the genetics studies investigating genetics segregation patterns in families of peas and used statistics to explain the collected data. In the early 1900s, after the rediscovery of Mendel's Mendelian inheritance work, there were gaps in understanding between genetics and evolutionary Darwinism. Francis Galton tried to expand Mendel's discoveries with human data and proposed a different model with fractions of the heredity coming from each ancestral composing an infinite series. He called this the theory of "Law of Ancestral Heredity". His ideas were strongly disagreed by William Bateson, who followed Mendel's conclusions, that genetic inheritance were exclusively from the parents, half from each of them. This led to a vigorous debate between the biometricians, who supported Galton's ideas, as Walter Weldon, Arthur Dukinfield Darbishire and Karl Pearson, and Mendelians, who supported Bateson's (and Mendel's) ideas, such as Charles Davenport and Wilhelm Johannsen. Later, biometricians could not reproduce Galton conclusions in different experiments, and Mendel's ideas prevailed. By the 1930s, models built on statistical reasoning had helped to resolve these differences and to produce the neo-Darwinian modern evolutionary synthesis.

Solving these differences also allowed to define the concept of population genetics and brought together genetics and evolution. The three leading figures in the establishment of population genetics and this synthesis all relied on statistics and developed its use in biology.

    developed several basic statistical methods in support of his work studying the crop experiments at Rothamsted Research, including in his books Statistical Methods for Research Workers (1925) end The Genetical Theory of Natural Selection (1930). He gave many contributions to genetics and statistics. Some of them include the ANOVA, p-value concepts, Fisher's exact test and Fisher's equation for population dynamics. He is credited for the sentence “Natural selection is a mechanism for generating an exceedingly high degree of improbability”. [1] developed F-statistics and methods of computing them and defined inbreeding coefficient. 's book, The Causes of Evolution, reestablished natural selection as the premier mechanism of evolution by explaining it in terms of the mathematical consequences of Mendelian genetics. Also developed the theory of primordial soup.

These and other biostatisticians, mathematical biologists, and statistically inclined geneticists helped bring together evolutionary biology and genetics into a consistent, coherent whole that could begin to be quantitatively modeled.

In parallel to this overall development, the pioneering work of D'Arcy Thompson in On Growth and Form also helped to add quantitative discipline to biological study.

Despite the fundamental importance and frequent necessity of statistical reasoning, there may nonetheless have been a tendency among biologists to distrust or deprecate results which are not qualitatively apparent. One anecdote describes Thomas Hunt Morgan banning the Friden calculator from his department at Caltech, saying "Well, I am like a guy who is prospecting for gold along the banks of the Sacramento River in 1849. With a little intelligence, I can reach down and pick up big nuggets of gold. And as long as I can do that, I'm not going to let any people in my department waste scarce resources in placer mining." [2]

Any research in life sciences is proposed to answer a scientific question we might have. To answer this question with a high certainty, we need accurate results. The correct definition of the main hypothesis and the research plan will reduce errors while taking a decision in understanding a phenomenon. The research plan might include the research question, the hypothesis to be tested, the experimental design, data collection methods, data analysis perspectives and costs evolved. It is essential to carry the study based on the three basic principles of experimental statistics: randomization, replication, and local control.

Research question Edit

The research question will define the objective of a study. The research will be headed by the question, so it needs to be concise, at the same time it is focused on interesting and novel topics that may improve science and knowledge and that field. To define the way to ask the scientific question, an exhaustive literature review might be necessary. So, the research can be useful to add value to the scientific community. [3]

Hypothesis definition Edit

Once the aim of the study is defined, the possible answers to the research question can be proposed, transforming this question into a hypothesis. The main propose is called null hypothesis (H0) and is usually based on a permanent knowledge about the topic or an obvious occurrence of the phenomena, sustained by a deep literature review. We can say it is the standard expected answer for the data under the situation in test. In general, HO assumes no association between treatments. On the other hand, the alternative hypothesis is the denial of HO. It assumes some degree of association between the treatment and the outcome. Although, the hypothesis is sustained by question research and its expected and unexpected answers. [3]

As an example, consider groups of similar animals (mice, for example) under two different diet systems. The research question would be: what is the best diet? In this case, H0 would be that there is no difference between the two diets in mice metabolism (H0: μ1 = μ2) and the alternative hypothesis would be that the diets have different effects over animals metabolism (H1: μ1 ≠ μ2).

The hypothesis is defined by the researcher, according to his/her interests in answering the main question. Besides that, the alternative hypothesis can be more than one hypothesis. It can assume not only differences across observed parameters, but their degree of differences (i.e. higher or shorter).

Sampling Edit

Usually, a study aims to understand an effect of a phenomenon over a population. In biology, a population is defined as all the individuals of a given species, in a specific area at a given time. In biostatistics, this concept is extended to a variety of collections possible of study. Although, in biostatistics, a population is not only the individuals, but the total of one specific component of their organisms, as the whole genome, or all the sperm cells, for animals, or the total leaf area, for a plant, for example.

It is not possible to take the measures from all the elements of a population. Because of that, the sampling process is very important for statistical inference. Sampling is defined as to randomly get a representative part of the entire population, to make posterior inferences about the population. So, the sample might catch the most variability across a population. [4] The sample size is determined by several things, since the scope of the research to the resources available. In clinical research, the trial type, as inferiority, equivalence, and superiority is a key in determining sample size. [3]

Experimental design Edit

Experimental designs sustain those basic principles of experimental statistics. There are three basic experimental designs to randomly allocate treatments in all plots of the experiment. They are completely randomized design, randomized block design, and factorial designs. Treatments can be arranged in many ways inside the experiment. In agriculture, the correct experimental design is the root of a good study and the arrangement of treatments within the study is essential because environment largely affects the plots (plants, livestock, microorganisms). These main arrangements can be found in the literature under the names of “lattices”, “incomplete blocks”, “split plot”, “augmented blocks”, and many others. All of the designs might include control plots, determined by the researcher, to provide an error estimation during inference.

In clinical studies, the samples are usually smaller than in other biological studies, and in most cases, the environment effect can be controlled or measured. It is common to use randomized controlled clinical trials, where results are usually compared with observational study designs such as case–control or cohort. [5]

Data collection Edit

Data collection methods must be considered in research planning, because it highly influences the sample size and experimental design.

Data collection varies according to type of data. For qualitative data, collection can be done with structured questionnaires or by observation, considering presence or intensity of disease, using score criterion to categorize levels of occurrence. [6] For quantitative data, collection is done by measuring numerical information using instruments.

In agriculture and biology studies, yield data and its components can be obtained by metric measures. However, pest and disease injuries in plats are obtained by observation, considering score scales for levels of damage. Especially, in genetic studies, modern methods for data collection in field and laboratory should be considered, as high-throughput platforms for phenotyping and genotyping. These tools allow bigger experiments, while turn possible evaluate many plots in lower time than a human-based only method for data collection. Finally, all data collected of interest must be stored in an organized data frame for further analysis.

Descriptive Tools Edit

Data can be represented through tables or graphical representation, such as line charts, bar charts, histograms, scatter plot. Also, measures of central tendency and variability can be very useful to describe an overview of the data. Follow some examples:

One type of tables are the frequency table, which consists of data arranged in rows and columns, where the frequency is the number of occurrences or repetitions of data. Frequency can be: [7]

Absolute: represents the number of times that a determined value appear

Relative: obtained by the division of the absolute frequency by the total number

In the next example, we have the number of genes in ten operons of the same organism.

G e n e s = 2 , 3 , 3 , 4 , 5 , 3 , 3 , 3 , 3 , 4

Line graphs represent the variation of a value over another metric, such as time. In general, values are represented in the vertical axis, while the time variation is represented in the horizontal axis. [9]

A bar chart is a graph that shows categorical data as bars presenting heights (vertical bar) or widths (horizontal bar) proportional to represent values. Bar charts provide an image that could also be represented in a tabular format. [9]

In the bar chart example, we have the birth rate in Brazil for the December months from 2010 to 2016. [8] The sharp fall in December 2016 reflects the outbreak of Zika virus in the birth rate in Brazil.

The histogram (or frequency distribution) is a graphical representation of a dataset tabulated and divided into uniform or non-uniform classes. It was first introduced by Karl Pearson. [10]

A scatter plot is a mathematical diagram that uses Cartesian coordinates to display values of a dataset. A scatter plot shows the data as a set of points, each one presenting the value of one variable determining the position on the horizontal axis and another variable on the vertical axis. [11] They are also called scatter graph, scatter chart, scattergram, or scatter diagram. [12]

The median is the value in the middle of a dataset.

The mode is the value of a set of data that appears most often. [13]

Box plot is a method for graphically depicting groups of numerical data. The maximum and minimum values are represented by the lines, and the interquartile range (IQR) represent 25–75% of the data. Outliers may be plotted as circles.

Although correlations between two different kinds of data could be inferred by graphs, such as scatter plot, it is necessary validate this though numerical information. For this reason, correlation coefficients are required. They provide a numerical value that reflects the strength of an association. [9]

Pearson correlation coefficient is a measure of association between two variables, X and Y. This coefficient, usually represented by ρ (rho) for the population and r for the sample, assumes values between −1 and 1, where ρ = 1 represents a perfect positive correlation, ρ = -1 represents a perfect negative correlation, and ρ = 0 is no linear correlation. [9]

Inferential Statistics Edit

It is used to make inferences [14] about an unknown population, by estimation and/or hypothesis testing. In other words, it is desirable to obtain parameters to describe the population of interest, but since the data is limited, it is necessary to make use of a representative sample in order to estimate them. With that, it is possible to test previously defined hypotheses and apply the conclusions to the entire population. The standard error of the mean is a measure of variability that is crucial to do inferences. [4]

Hypothesis testing is essential to make inferences about populations aiming to answer research questions, as settled in "Research planning" section. Authors defined four steps to be set: [4]

  1. The hypothesis to be tested: as stated earlier, we have to work with the definition of a null hypothesis (H0), that is going to be tested, and an alternative hypothesis. But they must be defined before the experiment implementation.
  2. Significance level and decision rule: A decision rule depends on the level of significance, or in other words, the acceptable error rate (α). It is easier to think that we define a critical value that determines the statistical significance when a test statistic is compared with it. So, α also has to be predefined before the experiment.
  3. Experiment and statistical analysis: This is when the experiment is really implemented following the appropriate experimental design, data is collected and the more suitable statistical tests are evaluated.
  4. Inference: Is made when the null hypothesis is rejected or not rejected, based on the evidence that the comparison of p-values and α brings. It is pointed that the failure to reject H0 just means that there is not enough evidence to support its rejection, but not that this hypothesis is true.

A confidence interval is a range of values that can contain the true real parameter value in given a certain level of confidence. The first step is to estimate the best-unbiased estimate of the population parameter. The upper value of the interval is obtained by the sum of this estimate with the multiplication between the standard error of the mean and the confidence level. The calculation of lower value is similar, but instead of a sum, a subtraction must be applied. [4]

Power and statistical error Edit

When testing a hypothesis, there are two types of statistic errors possible: Type I error and Type II error. The type I error or false positive is the incorrect rejection of a true null hypothesis and the type II error or false negative is the failure to reject a false null hypothesis. The significance level denoted by α is the type I error rate and should be chosen before performing the test. The type II error rate is denoted by β and statistical power of the test is 1 − β.

P-value Edit

The p-value is the probability of obtaining results as extreme as or more extreme than those observed, assuming the null hypothesis (H0) is true. It is also called the calculated probability. It is common to confuse the p-value with the significance level (α), but, the α is a predefined threshold for calling significant results. If p is less than α, the null hypothesis (H0) is rejected. [15]

Multiple testing Edit

In multiple tests of the same hypothesis, the probability of the occurrence of falses positives (familywise error rate) increase and some strategy are used to control this occurrence. This is commonly achieved by using a more stringent threshold to reject null hypotheses. The Bonferroni correction defines an acceptable global significance level, denoted by α* and each test is individually compared with a value of α = α*/m. This ensures that the familywise error rate in all m tests, is less than or equal to α*. When m is large, the Bonferroni correction may be overly conservative. An alternative to the Bonferroni correction is to control the false discovery rate (FDR). The FDR controls the expected proportion of the rejected null hypotheses (the so-called discoveries) that are false (incorrect rejections). This procedure ensures that, for independent tests, the false discovery rate is at most q*. Thus, the FDR is less conservative than the Bonferroni correction and have more power, at the cost of more false positives. [16]

Mis-specification and robustness checks Edit

The main hypothesis being tested (e.g., no association between treatments and outcomes) is often accompanied by other technical assumptions (e.g., about the form of the probability distribution of the outcomes) that are also part of the null hypothesis. When the technical assumptions are violated in practice, then the null may be frequently rejected even if the main hypothesis is true. Such rejections are said to be due to model mis-specification. [17] Verifying whether the outcome of a statistical test does not change when the technical assumptions are slightly altered (so-called robustness checks) is the main way of combating mis-specification.

Model selection criteria Edit

Model criteria selection will select or model that more approximate true model. The Akaike's Information Criterion (AIC) and The Bayesian Information Criterion (BIC) are examples of asymptotically efficient criteria.

Recent developments have made a large impact on biostatistics. Two important changes have been the ability to collect data on a high-throughput scale, and the ability to perform much more complex analysis using computational techniques. This comes from the development in areas as sequencing technologies, Bioinformatics and Machine learning (Machine learning in bioinformatics).

Use in high-throughput data Edit

New biomedical technologies like microarrays, next-generation sequencers (for genomics) and mass spectrometry (for proteomics) generate enormous amounts of data, allowing many tests to be performed simultaneously. [18] Careful analysis with biostatistical methods is required to separate the signal from the noise. For example, a microarray could be used to measure many thousands of genes simultaneously, determining which of them have different expression in diseased cells compared to normal cells. However, only a fraction of genes will be differentially expressed. [19]

Multicollinearity often occurs in high-throughput biostatistical settings. Due to high intercorrelation between the predictors (such as gene expression levels), the information of one predictor might be contained in another one. It could be that only 5% of the predictors are responsible for 90% of the variability of the response. In such a case, one could apply the biostatistical technique of dimension reduction (for example via principal component analysis). Classical statistical techniques like linear or logistic regression and linear discriminant analysis do not work well for high dimensional data (i.e. when the number of observations n is smaller than the number of features or predictors p: n < p). As a matter of fact, one can get quite high R 2 -values despite very low predictive power of the statistical model. These classical statistical techniques (esp. least squares linear regression) were developed for low dimensional data (i.e. where the number of observations n is much larger than the number of predictors p: n >> p). In cases of high dimensionality, one should always consider an independent validation test set and the corresponding residual sum of squares (RSS) and R 2 of the validation test set, not those of the training set.

Often, it is useful to pool information from multiple predictors together. For example, Gene Set Enrichment Analysis (GSEA) considers the perturbation of whole (functionally related) gene sets rather than of single genes. [20] These gene sets might be known biochemical pathways or otherwise functionally related genes. The advantage of this approach is that it is more robust: It is more likely that a single gene is found to be falsely perturbed than it is that a whole pathway is falsely perturbed. Furthermore, one can integrate the accumulated knowledge about biochemical pathways (like the JAK-STAT signaling pathway) using this approach.

Bioinformatics advances in databases, data mining, and biological interpretation Edit

The development of biological databases enables storage and management of biological data with the possibility of ensuring access for users around the world. They are useful for researchers depositing data, retrieve information and files (raw or processed) originated from other experiments or indexing scientific articles, as PubMed. Another possibility is search for the desired term (a gene, a protein, a disease, an organism, and so on) and check all results related to this search. There are databases dedicated to SNPs (dbSNP), the knowledge on genes characterization and their pathways (KEGG) and the description of gene function classifying it by cellular component, molecular function and biological process (Gene Ontology). [21] In addition to databases that contain specific molecular information, there are others that are ample in the sense that they store information about an organism or group of organisms. As an example of a database directed towards just one organism, but that contains much data about it, is the Arabidopsis thaliana genetic and molecular database – TAIR. [22] Phytozome, [23] in turn, stores the assemblies and annotation files of dozen of plant genomes, also containing visualization and analysis tools. Moreover, there is an interconnection between some databases in the information exchange/sharing and a major initiative was the International Nucleotide Sequence Database Collaboration (INSDC) [24] which relates data from DDBJ, [25] EMBL-EBI, [26] and NCBI. [27]

Nowadays, increase in size and complexity of molecular datasets leads to use of powerful statistical methods provided by computer science algorithms which are developed by machine learning area. Therefore, data mining and machine learning allow detection of patterns in data with a complex structure, as biological ones, by using methods of supervised and unsupervised learning, regression, detection of clusters and association rule mining, among others. [21] To indicate some of them, self-organizing maps and k-means are examples of cluster algorithms neural networks implementation and support vector machines models are examples of common machine learning algorithms.

Collaborative work among molecular biologists, bioinformaticians, statisticians and computer scientists is important to perform an experiment correctly, going from planning, passing through data generation and analysis, and ending with biological interpretation of the results. [21]

Use of computationally intensive methods Edit

On the other hand, the advent of modern computer technology and relatively cheap computing resources have enabled computer-intensive biostatistical methods like bootstrapping and re-sampling methods.

In recent times, random forests have gained popularity as a method for performing statistical classification. Random forest techniques generate a panel of decision trees. Decision trees have the advantage that you can draw them and interpret them (even with a basic understanding of mathematics and statistics). Random Forests have thus been used for clinical decision support systems. [ citation needed ]

Public health Edit

Public health, including epidemiology, health services research, nutrition, environmental health and health care policy & management. In these medicine contents, it's important to consider the design and analysis of the clinical trials. As one example, there is the assessment of severity state of a patient with a prognosis of an outcome of a disease.

With new technologies and genetics knowledge, biostatistics are now also used for Systems medicine, which consists in a more personalized medicine. For this, is made an integration of data from different sources, including conventional patient data, clinico-pathological parameters, molecular and genetic data as well as data generated by additional new-omics technologies. [28]

Quantitative genetics Edit

The study of Population genetics and Statistical genetics in order to link variation in genotype with a variation in phenotype. In other words, it is desirable to discover the genetic basis of a measurable trait, a quantitative trait, that is under polygenic control. A genome region that is responsible for a continuous trait is called Quantitative trait locus (QTL). The study of QTLs become feasible by using molecular markers and measuring traits in populations, but their mapping needs the obtaining of a population from an experimental crossing, like an F2 or Recombinant inbred strains/lines (RILs). To scan for QTLs regions in a genome, a gene map based on linkage have to be built. Some of the best-known QTL mapping algorithms are Interval Mapping, Composite Interval Mapping, and Multiple Interval Mapping. [29]

However, QTL mapping resolution is impaired by the amount of recombination assayed, a problem for species in which it is difficult to obtain large offspring. Furthermore, allele diversity is restricted to individuals originated from contrasting parents, which limit studies of allele diversity when we have a panel of individuals representing a natural population. [30] For this reason, the Genome-wide association study was proposed in order to identify QTLs based on linkage disequilibrium, that is the non-random association between traits and molecular markers. It was leveraged by the development of high-throughput SNP genotyping. [31]

In animal and plant breeding, the use of markers in selection aiming for breeding, mainly the molecular ones, collaborated to the development of marker-assisted selection. While QTL mapping is limited due resolution, GWAS does not have enough power when rare variants of small effect that are also influenced by environment. So, the concept of Genomic Selection (GS) arises in order to use all molecular markers in the selection and allow the prediction of the performance of candidates in this selection. The proposal is to genotype and phenotype a training population, develop a model that can obtain the genomic estimated breeding values (GEBVs) of individuals belonging to a genotyped and but not phenotyped population, called testing population. [32] This kind of study could also include a validation population, thinking in the concept of cross-validation, in which the real phenotype results measured in this population are compared with the phenotype results based on the prediction, what used to check the accuracy of the model.

As a summary, some points about the application of quantitative genetics are:

  • This has been used in agriculture to improve crops (Plant breeding) and livestock (Animal breeding).
  • In biomedical research, this work can assist in finding candidates genealleles that can cause or influence predisposition to diseases in human genetics

Expression data Edit

Studies for differential expression of genes from RNA-Seq data, as for RT-qPCR and microarrays, demands comparison of conditions. The goal is to identify genes which have a significant change in abundance between different conditions. Then, experiments are designed appropriately, with replicates for each condition/treatment, randomization and blocking, when necessary. In RNA-Seq, the quantification of expression uses the information of mapped reads that are summarized in some genetic unit, as exons that are part of a gene sequence. As microarray results can be approximated by a normal distribution, RNA-Seq counts data are better explained by other distributions. The first used distribution was the Poisson one, but it underestimate the sample error, leading to false positives. Currently, biological variation is considered by methods that estimate a dispersion parameter of a negative binomial distribution. Generalized linear models are used to perform the tests for statistical significance and as the number of genes is high, multiple tests correction have to be considered. [33] Some examples of other analysis on genomics data comes from microarray or proteomics experiments. [34] [35] Often concerning diseases or disease stages. [36]

Other studies Edit

    , ecological forecasting
  • Biological sequence analysis[37] for gene network inference or pathways analysis. [38] , especially in regards to fisheries science. and evolution

There are a lot of tools that can be used to do statistical analysis in biological data. Most of them are useful in other areas of knowledge, covering a large number of applications (alphabetical). Here are brief descriptions of some of them:

    : Another software developed by VSNi [39] that can be used also in R environment as a package. It is developed to estimate variance components under a general linear mixed model using restricted maximum likelihood (REML). Models with fixed effects and random effects and nested or crossed ones are allowed. Gives the possibility to investigate different variance-covariance matrix structures.
  • CycDesigN: [40] A computer package developed by VSNi [39] that helps the researchers create experimental designs and analyze data coming from a design present in one of three classes handled by CycDesigN. These classes are resolvable, non-resolvable, partially replicated and crossover designs. It includes less used designs the Latinized ones, as t-Latinized design. [41] : A programming interface for high-level data processing, data mining and data visualization. Include tools for gene expression and genomics. [21] : An open source environment and programming language dedicated to statistical computing and graphics. It is an implementation of S language maintained by CRAN. [42] In addition to its functions to read data tables, take descriptive statistics, develop and evaluate models, its repository contains packages developed by researchers around the world. This allows the development of functions written to deal with the statistical analysis of data that comes from specific applications. In the case of Bioinformatics, for example, there are packages located in the main repository (CRAN) and in others, as Bioconductor. It is also possible to use packages under development that are shared in hosting-services as GitHub. : A data analysis software widely used, going through universities, services and industry. Developed by a company with the same name (SAS Institute), it uses SAS language for programming.
  • PLA 3.0: [43] Is a biostatistical analysis software for regulated environments (e.g. drug testing) which supports Quantitative Response Assays (Parallel-Line, Parallel-Logistics, Slope-Ratio) and Dichotomous Assays (Quantal Response, Binary Assays). It also supports weighting methods for combination calculations and the automatic data aggregation of independent assay data. : A Java software for machine learning and data mining, including tools and methods for visualization, clustering, regression, association rule, and classification. There are tools for cross-validation, bootstrapping and a module of algorithm comparison. Weka also can be run in other programming languages as Perl or R. [21]

Almost all educational programmes in biostatistics are at postgraduate level. They are most often found in schools of public health, affiliated with schools of medicine, forestry, or agriculture, or as a focus of application in departments of statistics.

In the United States, where several universities have dedicated biostatistics departments, many other top-tier universities integrate biostatistics faculty into statistics or other departments, such as epidemiology. Thus, departments carrying the name "biostatistics" may exist under quite different structures. For instance, relatively new biostatistics departments have been founded with a focus on bioinformatics and computational biology, whereas older departments, typically affiliated with schools of public health, will have more traditional lines of research involving epidemiological studies and clinical trials as well as bioinformatics. In larger universities around the world, where both a statistics and a biostatistics department exist, the degree of integration between the two departments may range from the bare minimum to very close collaboration. In general, the difference between a statistics program and a biostatistics program is twofold: (i) statistics departments will often host theoretical/methodological research which are less common in biostatistics programs and (ii) statistics departments have lines of research that may include biomedical applications but also other areas such as industry (quality control), business and economics and biological areas other than medicine.

4. Stepwise Regression

This form of regression is used when we deal with multiple independent variables. In this technique, the selection of independent variables is done with the help of an automatic process, which involves no human intervention.

This feat is achieved by observing statistical values like R-square, t-stats and AIC metric to discern significant variables. Stepwise regression basically fits the regression model by adding/dropping co-variates one at a time based on a specified criterion. Some of the most commonly used Stepwise regression methods are listed below:

  • Standard stepwise regression does two things. It adds and removes predictors as needed for each step.
  • Forward selection starts with most significant predictor in the model and adds variable for each step.
  • Backward elimination starts with all predictors in the model and removes the least significant variable for each step.

The aim of this modeling technique is to maximize the prediction power with minimum number of predictor variables. It is one of the method to handle higher dimensionality of data set.


Regional amyloid burden is associated with disruption of the WM microstructure in cognitively unimpaired subjects. This association was driven mostly by amyloid burden in the precuneus and it could be most robustly measured in the body of the CC, where a non-linear biphasic relationship between DTI metrics and amyloid burden was apparent. Our results might reflect initial amyloid-associated compensatory mechanisms of the WM, followed by axonal degeneration patterns in the earliest stages of AD. Taken together, DTI measures of the body of the CC are promising early AD biomarkers and could, together with amyloid burden, support risk-profiling efforts in preclinical AD subjects.


Upon realizing that statistical tests are usually misinterpreted, one may wonder what if anything these tests do for science. They were originally intended to account for random variability as a source of error, thereby sounding a note of caution against overinterpretation of observed associations as true effects or as stronger evidence against null hypotheses than was warranted. But before long that use was turned on its head to provide fallacious support for null hypotheses in the form of �ilure to achieve” or �ilure to attain” statistical significance.

We have no doubt that the founders of modern statistical testing would be horrified by common treatments of their invention. In their first paper describing their binary approach to statistical testing, Neyman and Pearson [108] wrote that “it is doubtful whether the knowledge that [a P value] was really 0.03 (or 0.06), rather than 0.05…would in fact ever modify our judgment” and that “The tests themselves give no final verdict, but as tools help the worker who is using them to form his final decision.” Pearson [109] later added, “No doubt we could more aptly have said, ‘his final or provisional decision.’” Fisher [110] went further, saying “No scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses he rather gives his mind to each particular case in the light of his evidence and his ideas.” Yet fallacious and ritualistic use of tests continued to spread, including beliefs that whether P was above or below 0.05 was a universal arbiter of discovery. Thus by 1965, Hill [111] lamented that “too often we weaken our capacity to interpret data and to take reasonable decisions whatever the value of P. And far too often we deduce ‘no difference’ from ‘no significant difference.’”

In response, it has been argued that some misinterpretations are harmless in tightly controlled experiments on well-understood systems, where the test hypothesis may have special support from established theories (e.g., Mendelian genetics) and in which every other assumption (such as random allocation) is forced to hold by careful design and execution of the study. But it has long been asserted that the harms of statistical testing in more uncontrollable and amorphous research settings (such as social-science, health, and medical fields) have far outweighed its benefits, leading to calls for banning such tests in research reports𠅊gain with one journal banning P values as well as confidence intervals [2].

Given, however, the deep entrenchment of statistical testing, as well as the absence of generally accepted alternative methods, there have been many attempts to salvage P values by detaching them from their use in significance tests. One approach is to focus on P values as continuous measures of compatibility, as described earlier. Although this approach has its own limitations (as described in points 1, 2, 5, 9, 15, 18, 19), it avoids comparison of P values with arbitrary cutoffs such as 0.05, (as described in 3, 4, 6𠄸, 10�, 15, 16, 21 and 23�). Another approach is to teach and use correct relations of P values to hypothesis probabilities. For example, under common statistical models, one-sided P values can provide lower bounds on probabilities for hypotheses about effect directions [45, 46, 112, 113]. Whether such reinterpretations can eventually replace common misinterpretations to good effect remains to be seen.

A shift in emphasis from hypothesis testing to estimation has been promoted as a simple and relatively safe way to improve practice [5, 61, 63, 114, 115] resulting in increasing use of confidence intervals and editorial demands for them nonetheless, this shift has brought to the fore misinterpretations of intervals such as 19� above [116]. Other approaches combine tests of the null with further calculations involving both null and alternative hypotheses [117, 118] such calculations may, however, may bring with them further misinterpretations similar to those described above for power, as well as greater complexity.

Meanwhile, in the hopes of minimizing harms of current practice, we can offer several guidelines for users and readers of statistics, and re-emphasize some key warnings from our list of misinterpretations:

Correct and careful interpretation of statistical tests demands examining the sizes of effect estimates and confidence limits, as well as precise P values (not just whether P values are above or below 0.05 or some other threshold).

Careful interpretation also demands critical examination of the assumptions and conventions used for the statistical analysis—not just the usual statistical assumptions, but also the hidden assumptions about how results were generated and chosen for presentation.

It is simply false to claim that statistically nonsignificant results support a test hypothesis, because the same results may be even more compatible with alternative hypotheses𠅎ven if the power of the test is high for those alternatives.

Interval estimates aid in evaluating whether the data are capable of discriminating among various hypotheses about effect sizes, or whether statistical results have been misrepresented as supporting one hypothesis when those results are better explained by other hypotheses (see points 4𠄶). We caution however that confidence intervals are often only a first step in these tasks. To compare hypotheses in light of the data and the statistical model it may be necessary to calculate the P value (or relative likelihood) of each hypothesis. We further caution that confidence intervals provide only a best-case measure of the uncertainty or ambiguity left by the data, insofar as they depend on an uncertain statistical model.

Correct statistical evaluation of multiple studies requires a pooled analysis or meta-analysis that deals correctly with study biases [68, 119�]. Even when this is done, however, all the earlier cautions apply. Furthermore, the outcome of any statistical procedure is but one of many considerations that must be evaluated when examining the totality of evidence. In particular, statistical significance is neither necessary nor sufficient for determining the scientific or practical significance of a set of observations. This view was affirmed unanimously by the U.S. Supreme Court, (Matrixx Initiatives, Inc., et al. v. Siracusano et al. No. 09�. Argued January 10, 2011, Decided March 22, 2011), and can be seen in our earlier quotes from Neyman and Pearson.

Any opinion offered about the probability, likelihood, certainty, or similar property for a hypothesis cannot be derived from statistical methods alone. In particular, significance tests and confidence intervals do not by themselves provide a logically sound basis for concluding an effect is present or absent with certainty or a given probability. This point should be borne in mind whenever one sees a conclusion framed as a statement of probability, likelihood, or certainty about a hypothesis. Information about the hypothesis beyond that contained in the analyzed data and in conventional statistical models (which give only data probabilities) must be used to reach such a conclusion that information should be explicitly acknowledged and described by those offering the conclusion. Bayesian statistics offers methods that attempt to incorporate the needed information directly into the statistical model they have not, however, achieved the popularity of P values and confidence intervals, in part because of philosophical objections and in part because no conventions have become established for their use.

All statistical methods (whether frequentist or Bayesian, or for testing or estimation, or for inference or decision) make extensive assumptions about the sequence of events that led to the results presented—not only in the data generation, but in the analysis choices. Thus, to allow critical evaluation, research reports (including meta-analyses) should describe in detail the full sequence of events that led to the statistics presented, including the motivation for the study, its design, the original analysis plan, the criteria used to include and exclude subjects (or studies) and data, and a thorough description of all the analyses that were conducted.

In closing, we note that no statistical method is immune to misinterpretation and misuse, but prudent users of statistics will avoid approaches especially prone to serious abuse. In this regard, we join others in singling out the degradation of P values into “significant” and “nonsignificant” as an especially pernicious statistical practice [126].


Clinical and demographic data for PD longitudinal study

A PD cohort with sample size of 76 was followed longitudinally and completed comprehensive study visits at baseline, 18-, and 36-months (Huang R01 NS060722) [19]. PD patients were from a tertiary movement disorders clinic and were free of major or acute medical issues other than PD. The diagnosis of PD was made by movement disorder specialists according to published criteria [20, 21]. Multi-domain clinical scales were obtained at each visit, including the Movement Disorders Society Unified PD Rating Scale (MDS-UPDRS) I-II, Hamilton Depression Rating Scale (HAMD) measuring depression, the Montreal Cognitive Assessment (MoCA) measuring global cognitive function, and the Hoehn and Yahr (HY) Scale. Levodopa equivalent daily dose (LEDD) reflecting dopaminergic drug usage was calculated at each study visit based on published conversion factors [22], and duration of illness (DOI) was defined as the time since PD diagnosis by a medical professional [18]. Besides these clinical measures, important demographic information also was collected at the baseline visit for confounding adjustment. This was a retrospective study, and no power analysis was performed.

Coding data and defining outcomes for PD longitudinal study

First, the predictor variables including time (months) since baseline visit, age at baseline (years), gender (1 = Male 0 = Female), education (years), DOI (years), HAMD, MoCA, and LEDD were included to build up a prediction model after controlling for medication quantified by LEDD. Of note is that one point was added to the MoCA raw score if the subject’s education was less than 13 years, and HAMD and LEDD were standardized before analysis [23].

We considered three major clinical outcomes: 1) the MDS-UPDRS-II score that quantified motor aspects of daily living activities using a 5-point scale (0–4). The total sum score from 13 items was treated as a continuous variable with integer scores ranging from 0 (normal) to 52 (the most severe) 2) MDS-UPDRS-I that quantified non-motor aspects of daily living activities. This also was treated as a continuous variable and used the same scale and range of scores as the MDS-UPDRS-II 3) imbalance, which is known to mark an important functional disability regarding balance and walking. This was a binary outcome with a value of 1 if the HY scale or item 2.12 from the MDS-UPDRS-II (issues with walking and balance) was > = 3, otherwise the term was scored 0. The reason we chose the MDS-UPDRS-I and II subscales as outcome measures is that they are less rater- and drug-state dependent. In particular, the MDS-UPDRS-II was used to assess PD motor symptoms unlike other studies that focused on the MDS-UPDRS-III because of the recent finding that MDS-UPDRS-II was a better predictor for quality of life than the traditional MDS-UPDRS-III motor scale, which depends heavily on the timing of the exam, the medication status of the subject (“on,” “off,” or “transitional”), and the rater [24].

Statistical methods

Demographic information was summarized using the mean ± SD for continuous variables and the frequency for categorical variables. The normality assumption was investigated for continuous variables (i.e., MDS-UPDRS-I or II) based on graphical checking (e.g., histogram, QQ-plot) and the Anderson-Darling (AD) test. The p-values for repeated measures comparisons were obtained from mixed-effect models with random intercepts and a significance level of 0.05.

We randomly split the complete dataset into two sets, the training data and the testing data. Joint modeling of multivariate longitudinal scales was performed using multivariate GLMM because of two main advantages: 1) both continuous and discrete types of outcomes can be analyzed jointly and simultaneously 2) the correlation among repeated measures and multiple response outcomes can be incorporated into the model. The parameter estimation and inference were obtained from the Bayesian approach by utilizing the package “mixAK” in R software [17]. Then, the individual-level prediction of longitudinal trajectories for each outcome was performed. Note that the prediction methods within the training and testing dataset were different. The details of the statistical model and inference are shown next.

Let D train and D test denote a training dataset with sample size N and a testing dataset with sample size N ∗ , where N may or may not be equal to N ∗ and the two datasets are independent of each other. D train is used to build the prediction model and D test is used to evaluate the prediction for new subjects. For observation i in D train,we denote Y ikj as the j th measure of the k th type of scale for the i th subject at time point t ij, which could be continuous or discrete, i = 1 , 2 , ⋯ N k = 1 , 2 , ⋯ K j = 1 , 2 , ⋯ n i. Given Y ikj following a distribution from the exponential family with the dispersion parameter ϕ k, we have the following mean model

where X ikj is the p-dimentional covariate for fixed effects with β k as the associated p × 1 vector of parameters, Z ikj is the q-dimentional covariate for random effects with γ ik as the associated q × 1 vector of parameters, g is a monotone link function depending on the type of outcomes (e.g., identity function for continuous outcomes, logit function for binary outcomes, and log function for count outcomes), and the random effects ( <oldsymbol>_i=>_^T,<oldsymbol>_^T,cdots <oldsymbol>_^T ight)>^Tsim MVNleft(oldsymbol, Sigma ight) ) . Note that Σ not only takes into account the correlation of repeated measures of each outcome, but also incorporates the association between multiple outcomes. In addition, this is a hierarchically centered GLMM [25], thus X ikj and Z ikj may not contain the same variables due to the identifiability problem. Of note, Y iY j , γ iγ j for ij and given γ ik,the random variables ( _,_,cdots _<>_i> ) are independent of each other for the i th subject. Given that the parameter vector is ( Theta =>_k^T,oldsymbol, varSigma, _k ight>>_^K ) , the likelihood is shown as below

Here, we utilize the Bayesian approach based on MCMC for parameter estimation, inference, and prediction. The vague priors are used for all elements in Θ, and M (i.e., M = 2000 after burn in) posterior samples are obtained for the parameters and random effects denoted by ( left<^<(m)>,<oldsymbol>_^<(m)>,m=1,2,cdots M ight>. ) The procedures for prediction are performed as follows:

Internal prediction for PD subjects using the training dataset

For the l th PD subject in the training set D train (l = 1, 2, ⋯ N), we aim to predict the measures at a future time point t ′ given the outcome history ( >_lleft(^ ight)=left<_,0le _<^ ight> ) and the covariate history ( >_lleft(^ ight)= ) <X lkj , Z lkj , 0 ≤ t ijt ′ >. The prediction can be achieved by plugging in the estimates of the parameters and random effects from the posterior samples. For a continuous outcome, the predicted value based on the m th posterior sample is

where ( _^<(m)>left(^ ight)sim Nleft(0,_^<2(m)> ight) ) . Thus, the predicted estimate is shown as below

Similar procedures can be followed for the other types of outcomes.

External prediction of PD subjects using the testing dataset

For the l th new PD subject in the testing dataset D test (l = 1, 2, ⋯ N ∗ ), we also aim to predict the measures at a future time point t ′ given the outcome history ( >_lleft(^ ight) ) and the covariate history ( >_lleft(^ ight) ) defined the same as above, and the expectation can be calculated with respect to the posterior distribution of the parameters f(Θ| D train). The key issue is to obtain the estimate of random effects [26] as shown below

The sample of random effects can be drawn from the above posterior distribution after replacing Θ by Θ (m) , denoted by ( <oldsymbol>_<oldsymbol>^ <(m)>) for m = 1 , 2 , ⋯ M. The procedures to obtain the predictive measure ( >_left(^ ight) ) will be the same as above.

In addition, univariate longitudinal analyses were conducted with a different specification in Σ, where the correlation among outcomes was assumed to be 0.

Finally, we evaluated the predictive accuracy for continuous outcomes by using the statistical criteria, the root mean square error (RMSE) and the absolute bias (AB), with less values indicating better goodness-of-fit [27, 28] and defined by

where N is the sample size, and n i is the number of repeated observations for the i th subject. With regards to binary outcomes, we evaluated the predictive accuracy by assessing the discrimination ability using the receiver operating characteristic (ROC) curve with varied threshold on the predictive probabilities. The area under the curve (AUC) can be calculated, and the comparison between the AUCs can be tested based on the DeLong’s method [29].

R package: ClusterBootstrap


The latest stable version of ClusterBootstrap can be installed from the CRAN repository. The package can be loaded using

Input and exploration

Data needs to be arranged in a long format: every observation is represented in a single row. A unique identifier distinguishes the clusters (e.g., a subject that has multiple measurement occasions) from one another. This format is also appropriate for GLMM and GEE. The current version of ClusterBootstrap uses the glm function that is part of the base install of R. This makes available the binomial, Gaussian, gamma, inverse Gaussian, Poisson, quasibinomial and quasi-Poisson distributions, as well as the quasi option for a user-defined variance function. The distributions that have been tested intensely thus far are the Gaussian and the binomial. Our example data is included in the package and can be loaded using

To get an idea of what the data look like, we can look at the first five measurement occasions of participants 1 and 10:

> medication[c(1:5,21:25),] id treattimepos 111 0.0000 106.7 211 0.3333 100.0 311 0.6667 100.0 411 1.0000 100.0 511 1.3333 100.0 21100 0.0000 243.3 22100 0.3333 226.7 23100 0.6667 236.7 24100 1.0000 183.3 25100 1.3333 166.7 showing the cluster identifier ( id ), a between-subjects variable ( treat ), a variable varying within subjects ( time ), and a variable pos , which is the dependent variable in our analysis.


The main analysis can be carried out using the clusbootglm function in the following way:

> set.seed(1) > model.1 <- clusbootglm(pos ∼ treat⋆time, data = medication, clusterid = id) Other arguments that can be specified are B for the number of bootstrap samples, family for the error distribution, confint.level for the level of the confidence interval, and n.cores for the number of CPU cores to be used in parallel to speed up the computations.

Parallel computing

For parallel computing, ClusterBootstrap depends on the parallel package, using the random number generator of L’Ecuyer (1999) without a predefined seed as a subsequent to the seed that was initially set by the user. This gives certainty to the reproducibility of the findings when the user sets the seed prior to calling the clusbootglm function. If one wishes to use multiple CPU cores, it is advised (especially for Windows and Sparc Solaris operating systems) to leave at least one processor unused. The number of available processors can be requested by parallel::detectCores() . By not making use of forking, which is not available for Windows, the implementation of parallel processing is identical for all operating systems, as is the generated output given a certain seed.

Investigating the output

The function summary can be used to get an overview of the parameter estimates and their dispersion characteristics.

> options(digits= 3) > summary(model.1) Call: clusbootglm(model = pos ∼ treat ⋆ time, data = medication, clusterid = id)

Estimate Std.error CI 2.5% CI 97.5% (Intercept) 167.259.09150.48186.52 treat-6.3312.27-31.5016.73 time-2.051.46-4.601.29 treat:time5.682.211.5210.26 --- 95% confidence interval using bias corrected and accelerated cluster bootstrap intervals

The summary function returns parameter estimates, the bootstrap standard deviation, and, by default, the confidence interval at the level that was specified in the analysis. The standard interval method is BCa, though this can be altered using the interval.type argument in the summary function.

The confint function lets the user change the level of the confidence interval post hoc (i.e., the bootstrap procedure need not to be performed again). For example, to get a 90% parametric confidence interval level of the time and the treat⋆time parameters, one can use

> confint(model.1, level=.90, parm=c("treat","treat:time"), interval.type="parametric") 5%95% treat-26.59 13.77 treat:time2.039.32

To extract the parameter estimates from the model, the function coef can be used, with the option to choose either the bootstrap coefficient means (which is the default) or the coefficients from the GLM that was fitted on the original data:

> coef(model.1, estimate.type="GLM") GLM (Intercept) 167.26 treat -6.41 time -2.04 treat:time 5.68 Based on the regression parameters and their confidence intervals, our conclusion would be that although there are no overall differences between the treatment conditions regarding their positive mood and there is no main effect for the time variable, there is a difference between the two treatment groups regarding their effects over time. Assuming the nonsignificant main effects are zero and assuming the treatment group is coded 1 and the placebo group is coded 0, the significant estimate of 5.68 exclusively for the treatment group would lead one to conclude that the treatment group gains positive mood over time, where the placebo group does not.

The bootstrapped covariance matrix of the parameter estimates can be obtained using the estimates from the individual bootstrap samples:

> cov(model.1$coefficients) (Intercept)treattime treat: time (Intercept)82.69 -82.98 -7.887.81 treat-82.98 150.518.06-12.27 time-7.888.062.15-2.13 treat:time7.81 -12.27-2.134.90

The covariance matrix can be interpreted easily in the light of the bootstrap procedure. For example: within the 5000 bootstrap samples, there seems to be a positive relation between the estimated values of treatment and time ( (r approx -7.88/sqrt <150.51 imes 2.15>approx .44) ) and a negative association between the estimated coefficients of treatment and the interaction term (r ≈−.45).

Checking bootstrap samples with issues

An issue that might evolve in any bootstrap procedure is that the statistics of interest cannot be computed in some of the bootstrap samples. In the context of GLM, this might occur when there is complete or quasi-complete separation. For example, complete separation occurs in logistic regression when a hyperplane can pass through the explanatory variable space in such a way that all cases with yi = 0 are on one side of the hyperplane and all cases with yi = 1 are on the other side (Agresti, 2013, p. 234). Quasi-complete separation refers to a weaker form of this situation (i.e., there is an almost perfect discrimination of the outcome variable by the explanatory variable space). Another potential issue is when there is no variation in the outcome variable. In logistic regression, for example, the chance of the absence of variation in the outcome variable in any of the bootstrap samples increases when the count of either one of the outcome categories decreases. To simulate such a situation, we can split the pos variable from the medication data at the 99th percentile, and use the dichotomous resultant as an outcome in a logistic regression with the cluster bootstrap:

> medication$pos_dich <- with(medication, ifelse(pos>quantile(pos,.99),1,0)) > set.seed(1) > model.2 <- clusbootglm(pos_dich ∼ treat⋆time, data = medication, clusterid = id, family = binomial)

Now, when the summary function is invoked, there is an extra line, indicating a problem in 30 bootstrap samples: > summary(model.2) Call: clusbootglm(model = pos_dich ∼ treat ⋆ time, data = medication, clusterid = id, family = binomial)

Estimate Std.error CI 2.5% CI 97.5% (Intercept)-5.3573.851-21.57-2.812 treat-2.5887.161-20.234.791 time-0.2910.648-2.160.733 treat:time0.3480.993-1.082.983 --- 95% confidence interval using bias corrected and accelerated cluster bootstrap intervals There were 30 bootstrap samples which returned at least one NA

We can investigate which bootstrap samples are having issues:

> model.2$samples.with.NA.coef [1] 13 431 517 622 704 1009 [7] 1334 2244 2249 2277 2302 2328 [13] 2388 2406 2519 2579 2662 2935 [19] 3180 3675 3927 4023 4143 4458 [25] 4484 4562 4593 4656 4777 4887

If we wish to further investigate any of these bootstrap samples (e.g., the first one, being bootstrap sample 13), we can obtain the corresponding dataset:

> clusbootsample(model.2,13) id treattime pos pos_dich 100 2810.000 1070 101 2810.333 1200 102 2810.667 1270 103 2811.333 1000 104 2811.667 1470 105 2812.000 1270 . << 1254rows omitted>>. 60914115.001770 61014115.332800 61114115.671670 61214116.002300 61314116.331870 61414116.672800

Summing the fifth column of this data frame tells us that all the values on the dichotomous outcome are zero, indicating no variation in the outcome variable. In any case, the resulting data frame could subsequently be used in a regular application of the glm() function to obtain relevant information about the issue at hand or, for example, to obtain the parameter estimates:

> glm(pos_dich ∼ treat⋆time, data = clusbootsample(model.2,13), family = binomial) Call:glm(formula = pos_dich ∼ treat⋆time, family = binomial, data = clusbootsample(model.2,13)) Coefficients: (Intercept)treattimetreat:time -2.66e + 012.52e-13-1.24e-27-5.59e-14 Degrees of Freedom: 1265 Total (i.e. Null) 1262 Residual Null Deviance:0 Residual Deviance: 7.34e-09AIC: 8 Warning message: algorithm did not converge

For each of the coefficients, we can also obtain the amount of NA s in our bootstrap samples:

> model.2$failed.bootstrap.samples (Intercept)treattimetreat:time 30303030

In this example, the number of NA s is equal for all coefficients, which might indicate 30 bootstrap samples have some overall convergence problems, e.g., no variance in the outcome variable. However, when the analysis involves a categorical independent variable, and there is a small cell count in one of the categories, the occurrence of NA s might also be indicative of one of the categories not appearing in some of the bootstrap samples, leaving it out of the samples’ GLMs. The failed.bootstrap.samples element would then show the presence of NA s for that particular category.

To our knowledge, the possibility to easily investigate problematic bootstrap samples is not implemented in other software with bootstrapping procedures. This functionality makes the ClusterBootstrap package useful when applying the bootstrap to GLMs in general, even when there is no clustering in the data. For these applications, one could set clusterid to a unique identifier for each observation (i.e., each row in the data).


Atkinson, Q. D. (2011a). Linking spatial patterns of language variation to ancient demography and population migrations. Linguist. Typol. 15, 321�. doi: 10.1515/LITY.2011.022

Atkinson, Q. D. (2011b). Phonemic diversity supports a serial founder effect model of language expansion from Africa. Science 332, 346�. doi: 10.1126/science.1199295

Baayen, R. H., Davidson, D. J., and Bates, D. M. (2008). Mixed-effects modeling with crossed random effects for subjects and items. J. Mem. Lang. 59, 390�. doi: 10.1016/j.jml.2007.12.005

Baayen, R. H., and Milin, P. (2010). Analyzing reaction times. Int. J. Psychol. Res. 3, 12�. doi: 10.1287/mksc.12.4.395

Baker, M. (2016). Statisticians issue warning over misuse of P values. Nature 531:151. doi: 10.1038/nature.2016.19503

Barr, D. J. (2013). Random effects structure for testing interactions in linear mixed-effects models. Front. Psychol. 4:328. doi: 10.3389/fpsyg.2013.00328

Bates, D., Maechler, M., Bolker, B. M., and Walker, S. (2015). Fitting linear mixed-effects models using lme4. J. Stat. Softw. 67, 1�. doi: 10.18637/jss.v067.i01

Bolker, B. M. (2015). “Linear and generalized linear mixed models,” in Ecological Statistics: Contemporary Theory and Application, eds G. A. Fox, S. Negrete-Yankelevich, and V. J. Sosa (Oxford: Oxford University Press), 309�. doi: 10.1093/acprof:oso/9780199672547.003.0014

Box, G., and Cox, D. (1964). An analysis of transformations. J. R. Stat. Soc. Ser. B 26, 211�.

Burnham, K. P., Anderson, D. R., and Huyvaert, K. P. (2011). AIC model selection and multimodel inference in behavioral ecology: some background, observations, and comparisons. Behav. Ecol. Sociobiol. 65, 23�. doi: 10.1007/s00265-010-1029-6

Bybee, J. (2011). How plausible is the hypothesis that population size and dispersal are related to phoneme inventory size? Introducing and commenting on a debate. Linguist. Typol. 15, 147�. doi: 10.1515/LITY.2011.009

Clements, G. N. (2003). �ture economy as a phonological universal,” in Proceedings of the 15th International Congress of Phonetic Sciences, (Barcelona: Futurgraphic), 371�.

Cook, B. L., and Manning, W. G. (2013). Thinking beyond the mean: a practical guide for using quantile regression methods for health services research. Biostat. Psychiatry 25, 55�. doi: 10.3969/j.issn.1002-0829.2013.01.011

Coupé, C., Oh, Y. M., Pellegrino, F., and Marsico, E. (2014). 𠇌ross-linguistic investigations of oral and silent reading,” in Proceedings of the 15th Interspeech Conference, ಜully, 514�.

Cysouw, M., Dediu, D., and Moran, S. (2012). Comment on “Phonemic diversity supports a serial founder effect model of language expansion from Africa”. Science 335:657. doi: 10.1126/science.1208841

de Vlaming, R., and Groenen, P. J. F. (2015). The current and future use of ridge regression for prediction in quantitative genetics. Biomed. Res. Int. 2015:143712. doi: 10.1155/2015/143712

Donohue, M., and Nichols, J. (2011). Does phoneme inventory size correlate with population size? Linguist. Typol. 15, 161�. doi: 10.1515/LITY.2011.011

Drager, K., and Hay, J. (2012). Exploiting random intercepts: two case studies in sociophonetics. Lang. Var. Change 24, 59�. doi: 10.1017/S0954394512000014

Dryer, M. S., and Haspelmath, M. (2013). The World Atlas of Language Structures. Available at:

Earp, B. D., and Trafimow, D. (2015). Replication, falsification, and the crisis of confidence in social psychology. Front. Psychol. 6:621. doi: 10.3389/fpsyg.2015.00621

Gelman, A., and Hill, J. (2007). Data Analysis Using Regression and Multilevelhierarchical Models, Vol. 1. Cambridge: Cambridge University Press.

Gigerenzer, G., Krauss, S., and Vitouch, O. (2004). “The null ritual. What you always wanted to know about significance testing but were afraid to ask,” in The Sage Handbook of Quantitative Methodology for the Social Sciences, ed. D. Kaplan (Thousans Oaks, CA: Sage), 391�. doi: 10.4135/9781412986311.n21

Global Mapping International SIL International (2012). World Language Mapping System. Language Area and Point Data for Geographic Information Systems (GIS). Available at:

Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., et al. (2016). Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. Eur. J. Epidemiol. 31, 337�. doi: 10.1007/s10654-016-0149-3

Hastie, T., and Tibshirani, R. (1986). Generalized additive models. Stat. Sci. 1, 297�. doi: 10.1016/j.csda.2010.05.004

Hay, J., and Bauer, L. (2007). Phoneme inventory size and population size. Language 83, 388�. doi: 10.1353/lan.2007.0071

Hurlbert, S. H. (1984). Pseudoreplication and the design of ecological field experiments. Ecol. Monogr. 54, 187�. doi: 10.2307/1942661

Ives, A. R. (2015). For testing the significance of regression coefficients, go ahead and log-transform count data. Methods Ecol. Evol. 6, 828�. doi: 10.1111/2041-210X.12386

Jaeger, T. F., Graff, P., Croft, W., and Pontillo, D. (2011). Mixed effect models for genetic and areal dependencies in linguistic typology. Linguist. Typol. 15, 281�. doi: 10.1515/LITY.2011.021

Johnson, D. E. (2014). Progress in Regression: Why Natural Language Data Calls for Mixed-Effects Models. Available at:

Johnson, K. (2008). Quantitative Methods in Linguistics. Oxford: Blackwell.

Jones, R. H. (2011). Bayesian information criterion for longitudinal and clustered data. Stat. Med. 30, 3050�. doi: 10.1002/sim.4323

Kirby, K. R., Gray, R. D., Greenhill, S. J., Jordan, F. M., Gomes-Ng, S., Bibiko, H.-J., et al. (2016). D-PLACE: a global database of cultural, linguistic and environmental diversity. PLoS One 11:e0158391. doi: 10.1371/journal.pone.0158391

Korner-Nievergelt, F., Roth, T., von Felten, S., Guélat, J., Almasi, B., and Korner-Nievergelt, P. (2015). Bayesian Data Analysis in Ecology Using Linear Models with R, BUGS, and Stan. Cambridge, MA: Academic Press.

Kuha, J. (2004). AIC and BIC: comparisons of assumptions and performance. Sociol. Methods Res. 33, 188�. doi: 10.1177/0049124103262065

Lo, S., and Andrews, S. (2015). To transform or not to transform: Using generalized linear mixed models to analyse reaction time data. Front. Psychol. 6:1171. doi: 10.3389/fpsyg.2015.01171

Luke, S. G. (2017). Evaluating significance in linear mixed-effects models in R. Behav. Res. Methods 49, 1494�. doi: 10.3758/s13428-016-0809-y

Maddieson, I. (1984). Patterns of Sounds. Cambridge: Cambridge University Press. doi: 10.1017/CBO9780511753459

Maddieson, I., Flavier, S., Marsico, E., Coupé, C., and Pellegrino, F. (2013). “LAPSyD: Lyon - albuquerque phonological systems database,” in Proceedings of the 14th Interspeech Conference, Albuquerque, NM, 3022�.

Moran, S., McCloy, D., and Wright, R. (2012). Revisiting population size vs. phoneme inventory size. Language 88, 877�. doi: 10.1353/lan.2012.0087

Moran, S., McCloy, D., and Wright, R. (2014). PHOIBLE Online. Available at:

Morrison, G. S., and Kondaurova, M. V. (2009). Analysis of categorical response data: use logistic regression rather than endpoint-difference scores or discriminant analysis (L). J. Acoust. Soc. Am. 126, 2159�. doi: 10.1121/1.3216917

Moscoso Del Prado Martín, F. (2009). A Theory of Reaction Time Distributions. Available at:

O’Hara, R. B., and Kotze, D. J. (2010). Do not log-transform count data. Methods Ecol. Evol. 1, 118�. doi: 10.1111/j.2041-210X.2010.00021.x

Pellegrino, F., Coupé, C., and Marsico, E. (2011). A cross-language perspective on speech information rate. Language 87, 539�. doi: 10.1353/lan.2011.0057

Pericliev, V. (2004). There is no correlation between the size of a community speaking a language and the size of the phonological inventory of that language. Linguist. Typol. 8, 376�. doi: 10.1515/lity.2004.8.3.376

Quantum GIS Development Team (2017). Quantum GIS Geographic Information System. Open Source Geospatial Foundation Project. Available at:

R Development Core Team (2017). R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing.

Rigby, R. A., and Stasinopoulos, D. M. (2005). Generalized additive models for location, scale and shape (with discussion). Appl. Stat. 54, 507�. doi: 10.1016/j.placenta.2014.02.001

Rigby, R. A., Stasinopoulos, D. M., and Akantziliotou, C. (2008). A framework for modelling overdispersed count data, including the Poisson-shifted generalized inverse Gaussian distribution. Comput. Stat. Data Anal. 53, 381�. doi: 10.1016/j.csda.2008.07.043

Rigby, R. A., Stasinopoulos, D. M., and Lane, P. W. (2007). Generalized additive models for location, scale and shape. J. R. Stat. Soc. Ser. C Appl. Stat. 54, 507�. doi: 10.1111/j.1467-9876.2005.00510.x

Sóskuthy, M. (2017). Generalised Additive Mixed Model for Dynamic Analysis in Linguistics: A Practical Introduction. Available at:

Stasinopoulos, M., Rigby, R. A., Heller, G. Z., Voudouris, V., and De Bastiani, F. (2017). Flexible Regression and Smoothing Using GAMLSS in R. Boca Raton, FL: CRC Press.

Symonds, M. R. E., and Blomberg, S. P. (2014). 𠇊 primer on phylogenetic generalised least squares,” in Modern Phylogenetic Comparative Methods and Their Application in Evolutionary Biology, ed. L. Z. Garamszegi (Berlin: Springer), 105�.

Warton, D. I., Lyons, M., Stoklosa, J., and Ives, A. R. (2016). Three points to consider when choosing a LM or GLM test for count data. Methods Ecol. Evol. 7, 882�. doi: 10.1111/2041-210X.12552

Wasserstein, R. L., and Lazar, N. A. (2016). The ASA’s statement on p-values: context, Process, and Purpose. Am. Stat. 70, 129�. doi: 10.1080/00031305.2016.1154108

Wichmann, S., Holman, E. W., and Brown, C. H. (2016). The ASJP Database. Available at:

Wichmann, S., Rama, T., and Holman, E. W. (2011). Phonological diversity, word length, and population sizes across languages: the ASJP evidence. Linguist. Typol. 15, 177�. doi: 10.1515/LITY.2011.013

Winter, B., and Wieling, M. (2016). How to analyze linguistic change using mixed models, growth curve analysis and generalized additive modeling. J. Lang. Evol. 1, 7�. doi: 10.1093/jole/lzv003

Wood, S. N. (2011). Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. J. R. Stat. Soc. 73, 3�. doi: 10.1111/j.1467-9868.2010.00749.x

Wood, S. N., Pya, N., and S๏ken, B. (2016). Smoothing parameter and model selection for general smooth models. J. Am. Stat. Assoc. 111, 1548�. doi: 10.1080/01621459.2016.1180986

Zha, L., Lord, D., and Zou, Y. (2016). The Poisson inverse Gaussian (PIG) generalized linear regression model for analyzing motor vehicle crash data. J. Transp. Saf. Secur. 8, 18�. doi: 10.1080/19439962.2014.977502

Zuur, A. F., Ieno, E. N., and Elphick, C. S. (2010). A protocol for data exploration to avoid common statistical problems. Methods Ecol. Evol. 1, 3�. doi: 10.1111/j.2041-210X.2009.00001.x

Keywords : mixed-effects models, generalized linear models, generalized additive models, smooth terms, phonemic inventory size, Delaporte distribution, Box-Cox t distribution, GAMLSS

Citation: Coupé C (2018) Modeling Linguistic Variables With Regression Models: Addressing Non-Gaussian Distributions, Non-independent Observations, and Non-linear Predictors With Random Effects and Generalized Additive Models for Location, Scale, and Shape. Front. Psychol. 9:513. doi: 10.3389/fpsyg.2018.00513

Received: 15 November 2017 Accepted: 27 March 2018
Published: 16 April 2018.

Mario Cortina-Borja, University College London, United Kingdom
Erich Round, The University of Queensland, Australia

Copyright © 2018 Coupé. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.