In human research, we define our universe and estimate the effect size we expect to observe; then we can draw a sample of a given size.

How is sample size determined for experiments in the animal research field?

Determining sample size for an animal experiment is no different than in research with human subjects. What you need to know is the effect size, the significance level and the power (which is the probability that the test detects a significant effect assuming that there is one). The tricky part is getting the effect size (for an interesting discussion have a look at this thread). The other two things are chosen by you as the experimenter. An overview paper on power analysis was written by Cohen (1992).
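
As a rough illustration of how the three inputs combine, the sketch below uses the widely taught normal-approximation formula for a two-sided comparison of two group means, with the effect size expressed as a standardized difference (Cohen's *d*). The function name `n_per_group` and the default values are assumptions made for this example; an exact *t*-based calculation gives slightly larger numbers.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided, two-sample
    comparison of means, using the normal approximation.

    effect_size is the standardized difference (Cohen's d).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = z.inv_cdf(power)           # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


# Smaller effects need more animals per group.
print(n_per_group(0.8))
print(n_per_group(0.5))
```

The significance level and power are choices; the effect size is the input that has to be justified from prior data or from what would be biologically meaningful.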

References

Cohen, J. (1992). A power primer. *Psychological Bulletin*, 112(1), 155–159.

## Research Design Review

Qualitative and quantitative research designs require the researcher to think carefully about how and how many to sample within the population segment(s) of interest related to the research objectives. In doing so, the researcher considers demographic and cultural diversity, as well as other distinguishing characteristics (e.g., usage of a particular service or product) and pragmatic issues (e.g., access and resources). In qualitative research, the number of events (i.e., the number of in-depth interviews, focus group discussions, or observations) and participants is often considered at the early design stage of the research and then again during the field stage (i.e., when the interviews, discussions, or observations are being conducted). This two-stage approach, however, can be problematic. One reason is that giving an accurate sample size prior to data collection can be difficult, particularly when the researcher expects the number to change as the result of in-the-field decisions.

Another potential problem arises when researchers rely solely on the concept of saturation to assess sample size when in the field. In grounded theory, theoretical saturation

“refers to the point at which gathering more data about a theoretical category reveals no new properties nor yields any further theoretical insights about the emerging grounded theory.” (Charmaz, 2014, p. 345)

In the broader sense, Morse (1995) defines saturation as “‘data adequacy’ [or] collecting data until no new information is obtained” (p. 147).

Reliance on the concept of saturation presents two overarching concerns: 1) as discussed in two earlier articles in *Research Design Review* – Beyond Saturation: Using Data Quality Indicators to Determine the Number of Focus Groups to Conduct and Designing a Quality In-depth Interview Study: How Many Interviews Are Enough? – the emphasis on saturation has the potential to obscure other important considerations in qualitative research design, such as data quality; and 2) saturation as an assessment tool potentially leads the researcher to focus on the obvious “new information” obtained by each interview, group discussion, or observation rather than gaining a deeper sense of participants’ contextual meaning and a more profound understanding of the research question. As Morse (1995) states,

“Richness of data is derived from detailed description, not the number of times something is stated…It is often the infrequent gem that puts other data into perspective, that becomes the central key to understanding the data and for developing the model. It is the *implicit* that is interesting.” (p. 148)

With this as a backdrop, a couple of recent articles on saturation come to mind. In “A Simple Method to Assess and Report Thematic Saturation in Qualitative Research” (Guest, Namey, & Chen, 2020), the authors present a novel approach to assessing sample size in the in-depth interview method that can be applied during or after data collection. This approach is born from quantitative research design and indeed the authors reference concepts such as “power calculations,” p-values, and odds ratios. When used during data collection, the qualitative researcher applies the assessment tool by calculating the “saturation ratio,” i.e., the number of new themes derived from a specified “run” of interviews (e.g., two) divided by the “base” number of “unique themes,” i.e., themes identified at the initial stage of interviewing. Importantly, the rationale for this approach is lodged in the idea that “most novel information in a qualitative dataset is generated early in the process” (p. 6) and indeed “the most prevalent, high-level themes are identified very early on in data collection, within about six interviews” (p. 10).
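
The arithmetic behind the “saturation ratio” described above can be sketched in a few lines. This is a simplified illustration, not the authors' published procedure: it assumes non-overlapping runs, and the function name, the theme codes, and the interview data are all hypothetical.

```python
def saturation_ratios(themes_per_interview, base_size, run_length):
    """Return the saturation ratio for each run of interviews.

    themes_per_interview: list of sets of theme codes, in interview order.
    base_size: number of initial interviews that define the "base".
    run_length: number of interviews in each subsequent "run".
    """
    base = set().union(*themes_per_interview[:base_size])
    seen = set(base)
    ratios = []
    last_start = len(themes_per_interview) - run_length
    for start in range(base_size, last_start + 1, run_length):
        run = themes_per_interview[start:start + run_length]
        run_themes = set().union(*run)
        new_themes = run_themes - seen        # themes not seen in any earlier interview
        ratios.append(len(new_themes) / len(base))
        seen |= run_themes
    return ratios


# Hypothetical coded interviews: each set holds the themes identified in one interview.
interviews = [
    {"a", "b", "c"}, {"b", "d"}, {"a", "e"}, {"c", "f"},  # base (first four)
    {"g", "a"}, {"b"},                                    # run 1
    {"c", "d"}, {"a"},                                    # run 2
]
print(saturation_ratios(interviews, base_size=4, run_length=2))
```

A falling ratio indicates that later interviews add few themes beyond the base, which is the quantity the approach tracks; it says nothing about the depth or meaning of those themes.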

This perspective on saturation assessment is balanced by two other recent articles – “To Saturate or Not to Saturate? Questioning Data Saturation as a Useful Concept for Thematic Analysis and Sample-size Rationales” (Braun & Clarke, 2019) and “The Changing Face of Qualitative Inquiry” (Morse, 2020). In these articles, the authors express similar viewpoints on at least two considerations pertaining to sample size and the use of saturation in qualitative research. The first has to do with the importance of meaning¹ and the idea that finding meaning requires the researcher to actively look for contextual understandings and to have good analytical skills. For Braun and Clarke, “meaning is not inherent or self-evident in data” but rather “meaning requires interpretation” (p. 10). In this way, themes do not simply pop up during data collection but rather are the result of actively conducting an analysis to construct an interpretation.

Morse talks about the importance of meaning from the perspective that saturation hampers meaningful insights by restricting the researcher’s exploration of “new data.” Instead of using “redundancy as an indication for broadening the sample, or wondering why this replication occurs,” the researcher stops collecting data leading to a “more shallow” analysis and “trivial” results (p. 5).

The second consideration related to saturation discussed in both the Braun and Clarke and Morse articles is the idea that sample size determination requires a nuanced approach, with careful attention given to many factors related to each project. For researchers using reflexive thematic analysis, Braun and Clarke mention 10 “intersecting aspects,” including “the breadth and focus of the research question,” population diversity, “scope and purpose of the project,” and “pragmatic constraints” (p. 11). In a similar manner, Morse includes on her list of eight “criteria” such items as “the complexity of the questions/phenomenon being studied,” “the scope of inquiry,” and “variation of participants” (p. 5).

The potential danger of relying on saturation to establish sample size in qualitative research is multifold. The articles discussed here highlight the underlying concern that a reliance on saturation: 1) ignores the purpose and unique attributes of qualitative research as well as each study, along with a variety of quality considerations during data collection, which 2) misguides the researcher towards prioritizing manifest content over the pursuit of contextual understanding derived from latent, less obvious data, which 3) leads to superficial interpretations and 4) ultimately results in less useful research.

¹ Sally Thorne (2020) shares this perspective on the importance of meaning in her discussion of pattern recognition in qualitative analysis – “…qualitative research is meant to add value to a field rather than simply reporting what we can detect about it that has the qualities of a pattern… it should clearly add to our body of understanding in some meaningful manner.” (p. 2)

Braun, V., & Clarke, V. (2019). To saturate or not to saturate? Questioning data saturation as a useful concept for thematic analysis and sample-size rationales. *Qualitative Research in Sport, Exercise and Health*. https://doi.org/10.1080/2159676X.2019.1704846

Charmaz, K. (2014). *Constructing Grounded Theory* (2nd ed.). Sage Publications.

## ORIGINAL RESEARCH article

Effect sizes are the currency of psychological research. They quantify the results of a study to answer the research question and are used to calculate statistical power. The interpretation of effect sizes—when is an effect small, medium, or large?—has been guided by the recommendations Jacob Cohen gave in his pioneering writings starting in 1962: Either compare an effect with the effects found in past research or use certain conventional benchmarks. The present analysis shows that neither of these recommendations is currently applicable. From past publications without pre-registration, 900 effects were randomly drawn and compared with 93 effects from publications with pre-registration, revealing a large difference: Effects from the former (median *r* = 0.36) were much larger than effects from the latter (median *r* = 0.16). That is, certain biases, such as publication bias or questionable research practices, have caused a dramatic inflation in published effects, making it difficult to compare an actual effect with the real population effects (as these are unknown). In addition, there were very large differences in the mean effects between psychological sub-disciplines and between different study designs, making it impossible to apply any global benchmarks. Many more pre-registered studies are needed in the future to derive a reliable picture of real population effects.
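
To illustrate why the gap between the two medians matters for power calculations, the sketch below estimates the sample size needed to detect a correlation of a given size, using the standard Fisher *z* approximation. The function name and the choice of a two-sided test at alpha = .05 with 80% power are assumptions made for this example and are not taken from the article.

```python
from math import atanh
from statistics import NormalDist


def n_for_correlation(r, alpha=0.05, power=0.80):
    """Approximate sample size needed to detect a population correlation r
    with a two-sided test, using the Fisher z transformation."""
    z = NormalDist()
    z_total = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return (z_total / atanh(r)) ** 2 + 3


for r in (0.36, 0.16):
    print(r, round(n_for_correlation(r)))
```

Under these assumptions, planning a study around the larger median effect would call for roughly a fifth of the participants required if the true effect were closer to the smaller one.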

## Role of Sample Size and Relationship Strength

Recall that null hypothesis testing involves answering the question, “If the null hypothesis were true, what is the probability of a sample result as extreme as this one?” In other words, “What is the *p* value?” It can be helpful to see that the answer to this question depends on just two considerations: the strength of the relationship and the size of the sample. Specifically, the stronger the sample relationship and the larger the sample, the less likely the result would be if the null hypothesis were true. That is, the lower the *p* value. This should make sense. Imagine a study in which a sample of 500 women is compared with a sample of 500 men in terms of some psychological characteristic, and Cohen’s *d* is a strong 0.50. If there were really no sex difference in the population, then a result this strong based on such a large sample should seem highly unlikely. Now imagine a similar study in which a sample of three women is compared with a sample of three men, and Cohen’s *d* is a weak 0.10. If there were no sex difference in the population, then a relationship this weak based on such a small sample should seem likely. And this is precisely why the null hypothesis would be rejected in the first example and retained in the second.
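
The two examples above can be checked with a rough calculation. The sketch below converts Cohen's *d* and the per-group sample size into an approximate two-sided *p* value using a normal approximation; the function name is hypothetical, and for very small samples a *t* test would be the more appropriate tool, so the second figure is only indicative.

```python
from math import sqrt
from statistics import NormalDist


def approx_p_value(d, n_per_group):
    """Approximate two-sided p value for a two-group comparison, given
    Cohen's d and equal group sizes (normal approximation)."""
    z = d * sqrt(n_per_group / 2)  # test statistic: effect divided by its standard error
    return 2 * (1 - NormalDist().cdf(abs(z)))


print(approx_p_value(0.50, 500))  # strong relationship, large sample
print(approx_p_value(0.10, 3))    # weak relationship, tiny sample
```

The first *p* value is far below .05 and the second is far above it, matching the decisions to reject and retain the null hypothesis described in the text.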

Of course, sometimes the result can be weak and the sample large, or the result can be strong and the sample small. In these cases, the two considerations trade off against each other so that a weak result can be statistically significant if the sample is large enough and a strong relationship can be statistically significant even if the sample is small. Table 13.1 “How Relationship Strength and Sample Size Combine to Determine Whether a Result Is Statistically Significant” shows roughly how relationship strength and sample size combine to determine whether a sample result is statistically significant. The columns of the table represent the three levels of relationship strength: weak, medium, and strong. The rows represent four sample sizes that can be considered small, medium, large, and extra large in the context of psychological research. Thus each cell in the table represents a combination of relationship strength and sample size. If a cell contains the word *Yes*, then this combination would be statistically significant for both Cohen’s *d* and Pearson’s *r*. If it contains the word *No*, then it would not be statistically significant for either. There is one cell where the decision for *d* and *r* would be different and another where it might be different depending on some additional considerations, which are discussed in Section 13.2 “Some Basic Null Hypothesis Tests”.

Table 13.1 How Relationship Strength and Sample Size Combine to Determine Whether a Result Is Statistically Significant

Although Table 13.1 “How Relationship Strength and Sample Size Combine to Determine Whether a Result Is Statistically Significant” provides only a rough guideline, it shows very clearly that weak relationships based on medium or small samples are never statistically significant and that strong relationships based on medium or larger samples are always statistically significant. If you keep this in mind, you will often know whether a result is statistically significant based on the descriptive statistics alone. It is extremely useful to be able to develop this kind of intuitive judgment. One reason is that it allows you to develop expectations about how your formal null hypothesis tests are going to come out, which in turn allows you to detect problems in your analyses. For example, if your sample relationship is strong and your sample is medium, then you would expect to reject the null hypothesis. If for some reason your formal null hypothesis test indicates otherwise, then you need to double-check your computations and interpretations. A second reason is that the ability to make this kind of intuitive judgment is an indication that you understand the basic logic of this approach in addition to being able to do the computations.

## SHOULD SAMPLE SIZE CALCULATIONS BE PERFORMED BEFORE OR AFTER THE STUDY?

The answer is definitely before, occasionally during, and sometimes after.

In designing a study we want to make sure that the work that we do is worthwhile, so that we get the correct answer and we get it in the most efficient way. This means recruiting enough patients to give our results adequate power, but not so many that we waste time collecting more data than we need. Unfortunately, when designing the study we may have to make assumptions about desired effect size and variance within the data.

Interim power calculations are occasionally used when the data used in the original calculation are known to be suspect. They must be used with caution, as repeated analysis may lead a researcher to stop a study as soon as statistical significance is obtained (which may occur by chance at several points during subject recruitment). Once the study is underway, analysis of the interim results may be used to perform further power calculations, with the sample size adjusted accordingly. This may be done to avoid ending a study prematurely or, in the case of life-saving or hazardous therapies, to avoid prolonging it. Interim sample size calculations should only be used when stated in the a priori research method.
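
The caution about repeated analysis can be made concrete with a small simulation. The sketch below generates data with no true effect, tests the accumulating data at each interim look with an unadjusted 5% criterion, and counts how often a study would be stopped as “significant” at any look. The function name, the number of looks, and the batch sizes are hypothetical choices for this example.

```python
import random
from math import sqrt


def false_positive_rate(looks, n_per_look, n_trials=2000, z_crit=1.96, seed=1):
    """Share of simulated null studies declared significant at any look.

    Each observation is drawn from a standard normal distribution (true
    mean 0), so every rejection is a false positive.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        total, n = 0.0, 0
        for _ in range(looks):
            for _ in range(n_per_look):
                total += rng.gauss(0.0, 1.0)
            n += n_per_look
            if abs(total / sqrt(n)) > z_crit:  # z test on all data collected so far
                hits += 1
                break                          # the study stops at the first "significant" look
    return hits / n_trials


print(false_positive_rate(looks=1, n_per_look=100))  # single final analysis
print(false_positive_rate(looks=5, n_per_look=20))   # five unadjusted looks
```

With a single analysis the false-positive rate stays near the nominal 5%, whereas five unadjusted looks push it well above that, which is why interim analyses need to be planned in advance.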

When we are assessing trials with negative results, it is particularly important to question the sample size of the study. It may well be that the study was underpowered and that we have incorrectly accepted the null hypothesis, a type II error. If the study had had more subjects, then a difference may well have been detected. In an ideal world this should never happen, because a sample size calculation should appear in the methods section of all papers; reality shows us that this is not the case. As consumers of research, we should be able to estimate the power of a study from the given results.
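
As a hedged illustration of how a reader might gauge whether a negative study was underpowered, the sketch below approximates the power of a two-group comparison from the reported group size and a standardized difference the reader considers worth detecting. The function name is hypothetical, and the normal approximation ignores the negligible chance of rejecting in the wrong direction.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample comparison of means
    for a standardized difference d and equal group sizes."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    return z.cdf(d * sqrt(n_per_group / 2) - z_crit)


# A study with 20 subjects per group looking for a medium-sized difference.
print(round(approx_power(0.5, 20), 2))
# The same difference with 64 subjects per group.
print(round(approx_power(0.5, 64), 2))
```

If the power for a clinically relevant difference is low, a non-significant result says little about whether that difference exists.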

Retrospective sample size calculations are not covered in this article. Several calculators for retrospective sample size are available on the internet: UCLA power calculators (http://calculators.stat.ucla.edu/powercalc/) and Interactive statistical pages (http://www.statistics.com/content/javastat.html).

## Summary

Determining the appropriate design of a study is more important than the statistical analysis: a poorly designed study can never be salvaged, whereas a poorly analyzed study can be re-analyzed. A critical component in study design is the determination of the appropriate sample size. The sample size must be large enough to adequately answer the research question, yet not so large as to involve too many patients when fewer would have sufficed. The determination of the appropriate sample size involves statistical criteria as well as clinical or practical considerations. Sample size determination involves teamwork: biostatisticians must work closely with clinical investigators to determine the sample size that will address the research question of interest with adequate precision or power to produce results that are clinically meaningful.

The following table summarizes the sample size formulas for each scenario described here. The formulas are organized by the proposed analysis, a confidence interval estimate or a test of hypothesis.

## 5 Steps for Calculating Sample Size

Nearly all granting agencies require an estimate of an adequate sample size to detect the effects hypothesized in the study. But all studies are well served by estimates of sample size, as it can save a great deal on resources.

Why? Undersized studies can’t find real results, and oversized studies find even insubstantial ones. Both undersized and oversized studies waste time, energy, and money: the former by using resources without finding results, and the latter by using more resources than necessary. Both expose an unnecessary number of participants to experimental risks.

The trick is to size a study so that it is *just* large enough to detect an effect of scientific importance. If your effect turns out to be bigger, so much the better. But first you need to gather some information on which to base the estimates.

Once you’ve gathered that information, you can calculate by hand using a formula found in many textbooks, use one of many specialized software packages, or hand it over to a statistician, depending on the complexity of the analysis. But regardless of which way you or your statistician calculates it, you need to first do the following 5 steps:

**Step 1. Specify a hypothesis test.**

Most studies have many hypotheses, but for sample size calculations, choose one to three main hypotheses. Make them explicit in terms of a null and alternative hypothesis.

**Step 2. Specify the significance level of the test.**

It is usually alpha = .05, but it doesn’t have to be.

**Step 3. Specify the smallest effect size that is of scientific interest.**

This is often the hardest step. The point here is *not* to specify the effect size that you *expect* to find or that others have found, but the *smallest effect size of scientific interest*.

What does that mean? Any effect size can be statistically significant with a large enough sample. Your job is to figure out at what point your colleagues will say, “So what if it is significant? It doesn’t affect anything!”

For some outcome variables, the right value is obvious; for others, not at all.

- If your therapy lowered anxiety by 3%, would it actually improve a patient’s life? How big would the drop have to be?
- If response times to the stimulus in the experimental condition were 40 ms faster than in the control condition, does that mean anything? Is a 40 ms difference meaningful? Is 20? 100?
- If 4 fewer beetles were found per plant with the treatment than with the control, would that really affect the plant? Can 4 more beetles destroy, or even stunt a plant, or does it require 10? 20?

**Step 4. Estimate the values of other parameters necessary to compute the power function.**

Most statistical tests have the format of effect/standard error. We’ve chosen a value for the effect in step 3. Standard error is generally the standard deviation/√n. To solve for n, which is the point of all this, we need a value for standard deviation. There are only two ways to get it.

1. The best way is to use data from a pilot study to compute standard deviation.

2. The other way is to use historical data–another study that used the same dependent variable. If you have more than one study, even better. Average their standard deviations for a more reliable estimate.

Sometimes both sources of information can be hard to come by, but if you want sample sizes that are even remotely accurate, you need one or the other.
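
Steps 2 through 4 can be tied together in a short calculation. The sketch below assumes the simplest case, a one-sample or paired design in which the test statistic is effect/(standard deviation/√n), and solves for n with a normal approximation. The function name, the 40 ms effect, the historical standard deviations, and the default alpha and power (the subject of step 5) are all hypothetical values chosen for illustration.

```python
from math import ceil
from statistics import NormalDist, mean


def n_for_effect(smallest_effect, sd, alpha=0.05, power=0.80):
    """Approximate n for a one-sample or paired test where the statistic is
    effect / (sd / sqrt(n)), using the normal approximation."""
    z = NormalDist()
    z_total = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return ceil((z_total * sd / smallest_effect) ** 2)


# Hypothetical standard deviations (in ms) from two earlier studies, averaged.
historical_sds = [95.0, 105.0]
sd_estimate = mean(historical_sds)

# Smallest response-time difference judged to be of scientific interest: 40 ms.
print(n_for_effect(smallest_effect=40.0, sd=sd_estimate))
```

Halving the smallest effect of interest roughly quadruples n, which is why the choice made in step 3 dominates the result.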

**Step 5. Specify the intended power of the test.**

The power of a test is the probability of finding significance if the alternative hypothesis is true.

A power of .8 is the minimum. If it will be difficult to rerun the study or add a few more participants, a power of .9 is better. If you are applying for a grant, a power of .9 is always better.

This section uses the concept of variables to help you calculate the number of animals required for an experiment.

When building your experimental paradigm, describe how the experiment will be conducted and the various branches or variations that will be presented to your animal population. These variations will be added within this section as variables.

Remember that variables are always multipliers. Considerations for approximating numbers should include, but are not limited to, repetitions, animal availability, strains, dependent variables, numbers/group, statistical significance, experimental time points/endpoints.

Example:

- Repetitions: in order to identify and estimate variability, each experiment will be repeated 3 times.
- Drugs: in order to evaluate the effects of immunoglobulins, 4 different antibodies will be tested (see below).
- Study Endpoints: in order to evaluate the effects of other variables over time, mice will be euthanized at day 1, 7, 30 and 90.
- Animals/Group: based on a power analysis, each group will have five mice.
- Number of strains/genotypes: in order to evaluate the effects of genotype, 4 different mouse strains will be used (see below).

In this example, the experiment has 5 key variables that affect the experimental paradigm.
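
Because each variable acts as a multiplier, the total for the example can be tallied as a simple product. The sketch below assumes a fully crossed design in which every combination of the five variables is run; the dictionary keys are labels invented for this illustration.

```python
from math import prod

# Levels for each variable in the example above.
variables = {
    "repetitions": 3,
    "antibodies": 4,
    "endpoints": 4,         # day 1, 7, 30 and 90
    "animals_per_group": 5,
    "strains": 4,
}

total_animals = prod(variables.values())
print(total_animals)
```

Dropping or adding a single level of any variable changes the total proportionally, so each variable should be justified on its own.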

## Sample Size for Time to an Event

### Simple Approaches

The statistical analysis of time to an event involves complicated statistical models; however, there are two simple approaches to estimating sample size for this type of variable. The first approach is to estimate sample size using the proportions in the two experimental groups exhibiting the event by a certain time. This method converts time to an event into a dichotomous variable, and sample size is estimated by Appendix Equation 1. This approach generally yields sizes that are somewhat larger than more precise calculations based on assumptions about the equation that describes the curve of outcome versus time.

The second approach is to treat time to occurrence as a continuous variable. This approach is applicable only if all animals are followed to event occurrence (e.g., until death or time to exhibit a disease such as cancer), but it cannot be used if some animals do not reach the event during the study. Time to event is a continuous variable, and sample size may be computed using Appendix Equation 2.

### Unequal Number of Animals in Different Groups

Studies of transgenic mice often involve crossing heterozygous mice to produce homozygous and heterozygous litter-mates, which are then compared. Typically, there will be twice as many heterozygotes in a litter as homozygotes, although the proportions may be different in more complicated crosses. In such experiments, the researcher wishes to estimate the number of animals with the expected ratio between the experimental groups. The equations provided in the Appendix become considerably more complex. The reader is directed to our website for unequal sample size calculations (the expected ratio of group sizes is entered in place of the 1.0 provided on the chi-squared test on proportions web page): < http://www.biomath.info >.
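
For intuition only, the sketch below shows how an equal-group sample size can be redistributed when the groups occur in a fixed ratio. It assumes a comparison of means with equal variance in both groups, which is a simpler case than the chi-squared test on proportions handled by the web page mentioned above, and it is not the Appendix equation. The function name and the starting value of 16 per group are hypothetical.

```python
from math import ceil


def unequal_group_sizes(n_equal, ratio):
    """Convert an equal-allocation sample size per group into sizes for two
    groups with n2 = ratio * n1, keeping the same precision for a difference
    in means (equal variances assumed).

    Matches 1/n1 + 1/n2 to 2/n_equal.
    """
    n1 = n_equal * (1 + ratio) / (2 * ratio)
    n2 = ratio * n1
    return ceil(n1), ceil(n2)


# Twice as many heterozygotes as homozygotes, starting from 16 per group.
print(unequal_group_sizes(16, 2))
```

The smaller group shrinks somewhat but the larger one grows by more, so the total number of animals rises relative to equal allocation.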

This policy covers all animals on BU premises used for research, teaching, training, breeding, and related activities, hereinafter referred to collectively as “activities”, and is applicable to all persons responsible for conducting activities involving live vertebrate animals at or under the auspices of Boston University.

### Sample-Size Calculations

Estimation of the number of subjects required to answer an experimental question is an important step in planning a study. On one hand, an excessive sample size can result in waste of animal life and other resources, including time and money, because equally valid information could have been gleaned from a smaller number of subjects. However, underestimates of sample size are also wasteful, since an insufficient sample size has a low probability of detecting a statistically significant difference between groups, even if a difference really exists. Consequently, an investigator might wrongly conclude that groups do not differ, when in fact they do.

### What Is Involved In Sample Size Calculations

While the need to arrive at appropriate estimates of sample size is clear, many scientists are unfamiliar with the factors that influence determination of sample size and with the techniques for calculating estimated sample size. A quick look at how most textbooks of statistics treat this subject indicates why many investigators regard sample-size calculations with fear and confusion.

While sample-size calculations can become extremely complicated, it is important to emphasize, first, that all of these techniques produce estimates, and, second, that there are just a few major factors influencing these estimates. As a result, it is possible to obtain very reasonable estimates from some relatively simple formulae.

When comparing two groups, the major factors that influence sample size are:

- How large a difference you need to be able to detect.
- How much variability there is in the factor of interest.
- What “p” value you plan to use as a criterion for statistical “significance.”
- How confident you want to be that you will detect a “statistically significant” difference, assuming that a difference does exist.

### An Intuitive Look at a Simple Example

Suppose you are studying subjects with renal hypertension, and you want to test the effectiveness of a drug that is said to reduce blood pressure. You plan to compare systolic blood pressure in two groups, one which is treated with a placebo injection, and a second group which is treated with the drug being tested. While you don’t yet know what the blood pressures will be in each of these groups, just suppose that if you were to test a ridiculously large number of subjects (say 100,000) treated with either placebo or drug, their systolic blood pressures would follow two clearly distinct frequency distributions, as shown in Figure 1.

As you would expect, both groups show some variability in blood pressure, and the frequency distribution of observed pressures conforms to a bell-shaped curve. As shown here, the two groups overlap, but they are clearly different: systolic pressures in the treated group are an average of 20 mm Hg less than in the untreated controls.

Since there were 100,000 in each group, we can be confident that the groups differ. Now suppose that although we treated 100,000 of each, we obtained pressure measurements from only three in each group, because the pressure measuring apparatus broke. In other words, we have a random sample of N=3 from each group, and their systolic pressures are as follows:

Pressures are lower in the treated group, but we cannot be confident that the treatment was successful. There is a distinct possibility that the difference we see is just due to chance, since we took a small random sample. So the question is: how many would we have to measure (sample) in each group to be confident that any observed differences were not simply the result of chance?

How large a sample is needed depends on the four factors listed above. To illustrate this intuitively, suppose that the blood pressures in the treated and untreated subjects were distributed as shown in Figure 2 or in Figure 3.

In Figure 2 the amount of variability is the same, but the difference between the groups is smaller. It makes sense that you will need a larger sample to be confident that differences in your sample are real.

In Figure 3 the difference in pressures is about the same as it was in Figure 1, but there is less variability in pressure readings within each group. Here it seems obvious that a smaller sample would be required to confidently determine a difference.

The size of the sample you need also depends on the “p value” that you use. A “p value” of less than 0.05 is frequently used as the criterion for deciding whether observed differences are likely to be due to chance. If p<0.05, it means that, if there were really no difference between the groups, a difference at least as large as the one you observed would occur by chance less than 5% of the time. If you want to use a more rigid criterion (say, p<0.01) you will need a larger sample. Finally, the size of the sample you will need also depends on “power,” that is, the probability that you will observe a statistically significant difference, assuming that a difference really exists.

To summarize, in order to calculate a sample-size estimate you need some estimate of how different the groups might be or how large a difference you need to be able to detect, and you also need an estimate of how much variability there will be within groups. In addition, your calculations must take into account what you want to use as a “p value” and how much “power” you want.

### The Information You Need To Do Sample Size Calculations

Since you haven’t actually done the experiment yet, you won’t know how different the groups will be or what the variability (as measured by the standard deviation) will be. But you can usually make reasonable guesses. Perhaps from your experience (or from previously published information) you anticipate that the untreated hypertensive subjects will have a mean systolic blood pressure of about 160 mm Hg with a standard deviation of about ±10 mm Hg. You decide that a reduction in systolic blood pressure to a mean of 150 mm Hg would represent a clinically meaningful reduction. Since no one has ever done this experiment before, you don’t know how much variability there will be in response, so you will have to assume that the standard deviation for the test group is at least as large as that in the untreated controls. From these estimates you can calculate an estimate of the sample size you need in each group.

### Sample Size Calculations For A Difference In Means

The actual calculations can get a little bit cumbersome, and most people don’t even want to see equations. Consequently, I have put together a spreadsheet (Lamorte’s Power Calculations) which does all the calculations automatically. All you have to do is enter the estimated means and standard deviations for each group. In the example shown here, I assumed that my control group (group 1) would have a mean of 160 and a standard deviation of 10. I wanted to know how many subjects I would need in each group to detect a significant difference of 10 mm Hg. So, I plugged in a mean of 150 for group 2 and assumed that the standard deviation for this group would be the same as for group 1.

The spreadsheet actually generates a table which shows estimated sample sizes for different “p values” and different power levels. Many people arbitrarily use p=0.05 and a power level of 80%. With these parameters you would need about 16 subjects in each group. If you want 90% power, you would need about 21 subjects in each group.
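
For readers who do want to see a calculation, the sketch below applies a common normal-approximation formula to the blood pressure example. The spreadsheet's internal formula is not given here, so this is an assumption rather than a reproduction of it; the function name is hypothetical, and the values are left unrounded so they can be compared with the figures quoted above.

```python
from statistics import NormalDist


def n_per_group_means(mean1, mean2, sd, alpha=0.05, power=0.80):
    """Approximate subjects per group to detect a difference between two
    means with a common standard deviation (two-sided test, normal
    approximation)."""
    z = NormalDist()
    z_total = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return 2 * (z_total * sd / (mean1 - mean2)) ** 2


# Control mean 160, treated mean 150, standard deviation 10 in both groups.
print(round(n_per_group_means(160, 150, 10, power=0.80), 1))
print(round(n_per_group_means(160, 150, 10, power=0.90), 1))
```

Under these assumptions the results land close to the 16 and 21 subjects per group quoted above.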

The format in this spreadsheet makes it easy to play “what if.” If you want to get a feel for how many subjects you might need if the treatment reduces pressures by 20 mm Hg, just change the mean for group 2 to 140, and all the calculations will automatically be redone for you.

### Sample Size Calculations For A Difference In Proportions

The bottom part of the same spreadsheet generates sample-size calculations for comparing differences in frequency of an event. Suppose, for example, that a given treatment was successful 50% of the time and you wanted to test a new treatment with the hope that it would be successful 90% of the time. All you have to do is plug these (as fractions) into the spreadsheet, and the estimated sample sizes will be calculated automatically as shown here:

The illustration from the spreadsheet below shows that to have a 90% probability of showing a statistically significant difference (using p< 0.05) in proportions this great, you would need about 22 subjects in each group.
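
The same kind of check can be made for proportions. The sketch below uses a simple unpooled normal-approximation formula; whether the spreadsheet uses this exact formula is an assumption, and the function name is hypothetical. Other common formulas (for example, with a pooled variance or a continuity correction) give somewhat larger numbers.

```python
from statistics import NormalDist


def n_per_group_proportions(p1, p2, alpha=0.05, power=0.90):
    """Approximate subjects per group to detect a difference between two
    proportions (two-sided test, unpooled normal approximation)."""
    z = NormalDist()
    z_total = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return z_total ** 2 * variance_sum / (p1 - p2) ** 2


# Standard treatment succeeds 50% of the time; the new one is hoped to reach 90%.
print(round(n_per_group_proportions(0.50, 0.90), 1))
```

With these inputs the estimate is close to the 22 subjects per group reported above.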

### Spreadsheet

The Statistical Explanation Sample Spreadsheet described above can be found here.

### Responsible Parties

Principal Investigators are responsible for: preparing and submitting applications; making modifications in applications in order to secure IACUC approval; ensuring adherence to approved protocols; and ensuring humane care and use of animals. It is the responsibility of the IACUC to assure that the number of animals to be used in an animal use protocol is appropriate. The Animal Welfare Program and the Institutional Animal Care and Use Committee are responsible for overseeing implementation of and ensuring compliance with this policy.