How to Conduct a Fisher’s Exact Test in Small Sample Psychological Studies

In psychological research, particularly when working with small sample sizes, selecting the appropriate statistical test is fundamental to drawing valid conclusions. Fisher's exact test is a statistical hypothesis test used to assess the association between two binary variables in a contingency table, and it is particularly useful with small samples. This guide covers how to conduct Fisher's Exact Test in small sample psychological studies, from the theoretical foundations to practical implementation and interpretation.

What is Fisher's Exact Test?

Fisher's exact test is one of a class of exact tests, so called because the significance of the deviation from a null hypothesis (e.g., the p-value) can be calculated exactly, rather than relying on an approximation that becomes exact only in the limit as the sample size grows to infinity, as with many statistical tests. The test is named after its inventor, the statistician Ronald Fisher, who is said to have devised it following a comment from Muriel Bristol, who claimed to be able to detect whether the tea or the milk was added first to her cup. He tested her claim in the "lady tasting tea" experiment.

Fisher's exact test determines whether a statistically significant association exists between two categorical variables. You can also use it for a 2-sample proportion test when you have a small sample size. Unlike many statistical tests that rely on approximations, Fisher's exact test provides precise probability calculations, making it invaluable when dealing with limited data.

The Purpose and Applications in Psychology

The test is useful for categorical data that result from classifying objects in two different ways; it is used to examine the significance of the association (contingency) between the two kinds of classification. In psychological research, this might include examining relationships between treatment conditions and outcomes, demographic variables and behavioral responses, or diagnostic categories and intervention success rates.

Psychological researchers frequently encounter situations where sample sizes are necessarily small due to practical constraints such as limited participant availability, rare clinical populations, specialized experimental conditions, or resource limitations. In these scenarios, Fisher's Exact Test becomes an essential analytical tool.

When to Use Fisher's Exact Test

Sample Size Considerations

Fisher's Exact Test is typically chosen when one or more expected cell counts in the contingency table are small (less than 5). When all expected counts are 5 or more, a chi-squared test is usually adequate. This is the primary criterion for choosing Fisher's Exact Test over alternative methods. One common recommendation is to use Fisher's exact test when the total sample size is less than 1000 and the chi-square or G-test for larger sample sizes.

Although in practice the test is mostly used with small samples, it is valid for all sample sizes. It is computationally intensive for large samples, however, which is why it is usually reserved for small ones.

Data Type Requirements

For Fisher's Exact Test to be appropriate, your data must meet specific criteria:

Categorical Variables: Both variables should be categorical and binary, meaning each can take one of two values, so that a 2 x 2 contingency table can be populated.
Independence of Observations: Observations should not be paired or related. Each observation must be independent of all others.
Mutually Exclusive Groups: An individual cannot belong to more than one cell in the contingency table.
Small Expected Frequencies: The usual rule for deciding whether the χ² approximation is good enough is that the Chi-square test is not appropriate when the expected value in any cell of the contingency table is less than 5. In that case Fisher's exact test is preferred.

Comparison with Chi-Square Test

Fisher's exact test and the chi-square test of independence serve the same purpose: assessing a relationship between categorical variables. However, differences in the underlying methodology affect when you should use each method. Understanding these differences is crucial for selecting the appropriate test.

The Chi-Square Test of Independence is a more traditional hypothesis test that uses a test statistic (chi-square) and its sampling distribution to calculate the p-value. However, the chi-square sampling distribution only approximates the correct distribution, providing better p-values as the cell values in the table increase. Consequently, chi-square p-values can be inaccurate when you have small cell counts.

Fisher's exact test calculates an exact probability and is ideal for small sample sizes or low expected cell counts, while the chi-squared test uses a large-sample approximation that can become inaccurate when expected counts fall below 5. This fundamental difference makes Fisher's Exact Test the gold standard for small sample research.

Understanding the Statistical Theory

The Hypergeometric Distribution

Fisher's exact test works somewhat differently from the chi-square test (and from most other hypothesis tests) insofar as it doesn't have a test statistic; it calculates the p-value "directly". The test relies on the hypergeometric distribution to calculate exact probabilities.

The Fisher's exact test is performed by calculating the probability of the data that is observed if the null hypothesis (no association) is true, by using all possible 2 x 2 tables that hypothetically could have been observed, for the same row and column totals as those that are observed in the data (these are sometimes referred to as the marginal totals). In other words, we are assessing how extreme our table of frequencies is in relation to all possible versions of it that could have occurred under the marginal totals and from this, making an inference about the association between the two variables.
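To make the idea of "all possible tables with the same marginal totals" concrete, here is a minimal Python sketch that lists every such table and its probability under the null hypothesis. The function name `table_probabilities` and the cell layout [[a, b], [c, d]] are illustrative choices, and the counts 8, 2, 3, 7 are the ones from the therapy example later in this guide.

```python
from math import comb

def table_probabilities(a, b, c, d):
    """Probability of every 2x2 table that shares the observed margins.

    Cells are laid out as [[a, b], [c, d]]. With the margins held fixed,
    the top-left count x follows a hypergeometric distribution:
        P(x) = C(r1, x) * C(r2, c1 - x) / C(n, c1)
    """
    r1, r2 = a + b, c + d            # row totals
    c1 = a + c                       # first column total
    n = r1 + r2                      # grand total
    lo, hi = max(0, c1 - r2), min(r1, c1)   # feasible top-left counts
    denom = comb(n, c1)
    return {x: comb(r1, x) * comb(r2, c1 - x) / denom
            for x in range(lo, hi + 1)}

probs = table_probabilities(8, 2, 3, 7)
for x, p in probs.items():
    marker = "  <- observed" if x == 8 else ""
    print(f"top-left count = {x:2d}   probability = {p:.5f}{marker}")
```

Once the margins are fixed, a single cell determines the whole table, which is why the sketch only needs to loop over the top-left count.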

Hypotheses in Fisher's Exact Test

The hypotheses of Fisher's exact test are the same as for the Chi-square test:

H₀: the variables are independent; there is no relationship between the two categorical variables. Knowing the value of one variable does not help to predict the value of the other variable.
H₁: the variables are dependent; there is a relationship between the two categorical variables.

When conducting the test, you're essentially asking: "If there truly is no association between these variables, what is the probability of observing data as extreme as, or more extreme than, what we actually observed?"

Assumptions and Limitations

The test assumes that all row and column sums of the contingency table were fixed by design and tends to be conservative and underpowered outside of this setting. This is an important consideration when interpreting results, as the test may be somewhat conservative in certain research designs.

When one or both of the row or column totals are unconditioned, the Fisher's exact test is not, strictly speaking, exact. Instead, it is somewhat conservative, meaning that if the null hypothesis is true, you will get a significant (P<0.05) P value less than 5% of the time. This makes it a little less powerful (harder to detect a real difference from the null, when there is one).
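One way to see where the conservativeness comes from is that, for a given set of margins, the p-value can only take a handful of distinct values. The sketch below is an illustration rather than a formal argument: it enumerates those values for margins of 10, 10, 11 and 9 (the margins of the therapy example later in this guide) and adds up the null probability of the tables whose p-value is at or below 0.05. The function name and the two-sided convention (summing tables no more probable than the observed one) are assumptions of this sketch.

```python
from math import comb

def attainable_p_values(r1, r2, c1):
    """Map each feasible top-left count to (null probability, two-sided p)."""
    denom = comb(r1 + r2, c1)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    probs = {x: comb(r1, x) * comb(r2, c1 - x) / denom
             for x in range(lo, hi + 1)}
    result = {}
    for x, p_x in probs.items():
        # Two-sided p: total probability of tables no more likely than this one
        # (the small tolerance guards against floating-point ties).
        p_value = sum(p for p in probs.values() if p <= p_x * (1 + 1e-7))
        result[x] = (p_x, min(p_value, 1.0))
    return result

tables = attainable_p_values(10, 10, 11)
actual_alpha = sum(prob for prob, p_value in tables.values() if p_value <= 0.05)

print(sorted({round(p_value, 4) for _, p_value in tables.values()}))
print(f"chance of p <= 0.05 when the null is true: {actual_alpha:.4f}")
```

Because no attainable p-value sits just under 0.05 for these margins, the chance of a significant result under the null is well below the nominal 5%.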

Step-by-Step Guide to Conducting Fisher's Exact Test

Step 1: Organize Your Data into a Contingency Table

A 2 x 2 contingency table, sometimes referred to as a cross-tabulation or a two-way table, is a useful tool in statistics that displays in its cell values the frequencies (counts) of each combination of two categorical variables with row and column totals included. They are powerful tools that help us to understand the relationship between two variables in a sample of data.

Your first step is to arrange your data in a clear, organized format. A 2×2 contingency table has four cells representing all possible combinations of your two binary variables. For example, if you're studying the relationship between therapy type (Treatment A vs. Treatment B) and outcome (Improved vs. Not Improved), your table would show the count of participants in each combination.

The table should include:

Row totals (marginal totals for each row)
Column totals (marginal totals for each column)
Grand total (total number of observations)
Clear labels for each variable and category

Step 2: Check Assumptions

Before proceeding with the test, verify that your data meets the necessary assumptions:

Both variables are categorical with two categories each
Observations are independent (no participant appears in multiple cells)
Categories are mutually exclusive
At least one cell has an expected frequency less than 5, or total sample size is small

Although it is good practice to check the expected frequencies before deciding between the Chi-square and the Fisher test, it is not a big issue if you forget. When you run the Chi-square test in R (with chisq.test()) on a table with small expected counts, a warning such as "Chi-squared approximation may be incorrect" appears. This warning means that the smallest expected frequency is lower than 5, and it is your cue to use Fisher's exact test instead of the Chi-square test.
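The expected frequencies are also easy to check yourself: each one is the row total times the column total, divided by the grand total. A minimal Python sketch follows, using the counts from the therapy example later in this guide; the helper name `expected_counts` is an illustrative choice.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

observed = [[8, 2],   # New CBT: improved, not improved
            [3, 7]]   # Standard treatment: improved, not improved
expected = expected_counts(observed)
smallest = min(min(row) for row in expected)
use_fisher = smallest < 5   # the usual rule of thumb

print(expected)
print("smallest expected count:", smallest, "-> prefer Fisher's test:", use_fisher)
```

Note that the rule concerns the expected counts, not the observed ones: here every expected count is 5.5 or 4.5, even though the smallest observed count is 2.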

Step 3: Determine Your Significance Level

Before conducting the test, establish your alpha level (significance level). In psychological research, the conventional significance level is α = 0.05, though you may choose a more stringent level (e.g., 0.01) depending on your research context and the consequences of Type I errors.

Step 4: Choose Between One-Tailed and Two-Tailed Tests

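The hypotheses can be written in terms of the two population proportions p₁ and p₂ and a hypothesized ratio φ₀ (for a test of no association, φ₀ = 1):

Two-tailed test: H₀: p₁ / p₂ = φ₀ versus H₁: p₁ / p₂ ≠ φ₀.
Upper-tailed test: H₀: p₁ / p₂ ≤ φ₀ versus H₁: p₁ / p₂ > φ₀.
Lower-tailed test: H₀: p₁ / p₂ ≥ φ₀ versus H₁: p₁ / p₂ < φ₀.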

Ruxton and Neuhauser (2010) surveyed articles in the journal Behavioral Ecology and Sociobiology and found several that reported the results of one-tailed Fisher's exact tests, even though two-tailed would have been more appropriate. Apparently some statistics textbooks and programs perpetuate confusion about one-tailed vs. two-tailed Fisher's tests. You should almost always use a two-tailed test, unless you have a very good reason.

A two-tailed test is appropriate when you're interested in detecting any association between variables, regardless of direction. A one-tailed test should only be used when you have a strong theoretical reason to predict the direction of the relationship before collecting data.
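The difference between the tails is easy to see numerically. The sketch below computes lower-tailed, upper-tailed and two-tailed p-values for the counts used in the therapy example later in this guide. The function name is illustrative, and the two-tailed value uses one common convention (summing the probabilities of all tables no more probable than the observed one); other definitions exist, so results can differ slightly across software.

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Return (lower-tailed, upper-tailed, two-tailed) p-values for [[a, b], [c, d]]."""
    r1, r2, c1 = a + b, c + d, a + c
    denom = comb(r1 + r2, c1)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    probs = {x: comb(r1, x) * comb(r2, c1 - x) / denom
             for x in range(lo, hi + 1)}
    p_lower = sum(p for x, p in probs.items() if x <= a)   # as few or fewer in cell a
    p_upper = sum(p for x, p in probs.items() if x >= a)   # as many or more in cell a
    cutoff = probs[a] * (1 + 1e-7)                         # tolerance for float ties
    p_two = sum(p for p in probs.values() if p <= cutoff)
    return min(p_lower, 1.0), min(p_upper, 1.0), min(p_two, 1.0)

p_lower, p_upper, p_two = fisher_exact_2x2(8, 2, 3, 7)
print(f"upper-tailed p = {p_upper:.4f}")
print(f"two-tailed   p = {p_two:.4f}")
```

For these counts the upper-tailed p-value is about 0.035 and the two-tailed p-value is about 0.070, so the conclusion at α = 0.05 depends entirely on which test was chosen. This is exactly why the choice has to be made before looking at the data.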

Step 5: Calculate the Exact Probability

While Fisher's Exact Test can be calculated by hand for simple 2×2 tables, this is rarely practical. The calculation involves computing factorials and can become extremely tedious. Statistical software packages usually do not evaluate the factorials directly, because their very large values can cause numerical difficulties. A simple, somewhat better computational approach relies on a gamma function or log-gamma function, but accurate computation of hypergeometric and binomial probabilities remains an active research area.
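As a rough illustration of the log-gamma approach, the sketch below computes the probability of a single table by adding and subtracting logarithms of factorials instead of multiplying the factorials themselves. The function names are illustrative. Python's integers do not overflow, so the trick matters less here than in languages with fixed-width numbers, but the idea is the same.

```python
from math import lgamma, exp

def log_comb(n, k):
    """log of the binomial coefficient C(n, k), using log-gamma: log(n!) = lgamma(n + 1)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def table_probability(a, b, c, d):
    """Hypergeometric probability of the table [[a, b], [c, d]] given its margins."""
    r1, r2, c1 = a + b, c + d, a + c
    log_p = log_comb(r1, a) + log_comb(r2, c) - log_comb(r1 + r2, c1)
    return exp(log_p)

small = table_probability(8, 2, 3, 7)
large = table_probability(4000, 6000, 5000, 5000)   # factorials here are astronomically large
print(f"{small:.5f}")
print(f"{large:.3e}")
```

Working on the log scale keeps every intermediate quantity modest in size, even when the factorials involved would have tens of thousands of digits.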

Modern statistical software handles these calculations efficiently and accurately. The software will compute the probability of observing your specific table configuration, as well as all more extreme configurations, given the marginal totals.

Step 6: Interpret the Results

As with any other statistical test, if the p-value reported by the software is less than the chosen significance level (for example, 5%), we can reject the null hypothesis.

When interpreting your results:

If p-value ≤ α: Reject the null hypothesis. There is sufficient evidence to conclude that a statistically significant association exists between the two variables.
If p-value > α: Fail to reject the null hypothesis. There is insufficient evidence to conclude that an association exists between the variables.

When the null hypothesis is rejected, the sample data are strong enough to conclude that a relationship between the categorical variables exists in the population: knowing the value of one variable provides information about the value of the other variable.

Detailed Example: Therapy Effectiveness Study

Let's work through a comprehensive example relevant to psychological research. Suppose a clinical psychologist is investigating whether a new cognitive-behavioral therapy (CBT) intervention is more effective than standard treatment for reducing social anxiety in adolescents. Due to the specialized nature of the population and resource constraints, the study includes only 20 participants.

Study Design

Independent Variable: Treatment type (New CBT vs. Standard Treatment)
Dependent Variable: Clinical outcome (Clinically Significant Improvement vs. No Clinically Significant Improvement)
Sample Size: 20 participants (10 in each treatment group)

Data Collection and Organization

After the intervention period, the psychologist assesses each participant and categorizes them based on whether they showed clinically significant improvement (defined as a reduction of at least 30% on a standardized social anxiety measure). The data are organized in a 2×2 contingency table:

Contingency Table:

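New CBT - Improved: 8 participants
New CBT - Not Improved: 2 participants
Standard Treatment - Improved: 3 participants
Standard Treatment - Not Improved: 7 participants

Row totals: New CBT = 10, Standard Treatment = 10
Column totals: Improved = 11, Not Improved = 9
Grand total: 20 participants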

Conducting the Analysis

The psychologist notes that with only 20 total participants and expected counts of 4.5 in two of the four cells (below the usual threshold of 5), Fisher's Exact Test is the appropriate choice. Using statistical software, they enter the contingency table and specify a two-tailed test with α = 0.05.

Interpreting the Output

The software provides several pieces of information:

P-value: The exact probability of observing data this extreme or more extreme under the null hypothesis
Odds Ratio: A measure of effect size indicating the strength of association
Confidence Interval: The range within which the true odds ratio likely falls

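For this table the two-tailed p-value is approximately 0.070, which is greater than 0.05, so the psychologist would fail to reject the null hypothesis. Although 80% of the new CBT group improved compared with 30% of the standard treatment group, a sample of 20 does not provide sufficient evidence of an association at the 5% level. (A one-tailed test in the predicted direction gives p ≈ 0.035, which shows why the choice of tails must be made before seeing the data.) Had the p-value been below 0.05, the psychologist would reject the null hypothesis and conclude that there is a statistically significant association between treatment type and clinical outcome. The odds ratio indicates how much greater the odds of improvement are in one group than in the other; here the sample odds ratio is (8 × 7) / (2 × 3) ≈ 9.33.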
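For orientation, the sample odds ratio and a rough confidence interval can be computed directly from the four counts. The sketch below uses the Wald (log odds ratio) interval, which is a large-sample approximation and an assumption of this sketch. It is not the exact conditional interval that software such as R's fisher.test() reports, and with counts this small the two can differ noticeably, so prefer the interval your software gives you.

```python
from math import exp, log, sqrt

def odds_ratio_wald_ci(a, b, c, d, z=1.96):
    """Sample odds ratio and approximate 95% Wald interval for [[a, b], [c, d]].

    Requires all four counts to be non-zero; a zero cell makes the
    odds ratio or its standard error undefined.
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # standard error of log(OR)
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

odds_ratio, lower, upper = odds_ratio_wald_ci(8, 2, 3, 7)
print(f"sample OR = {odds_ratio:.2f}, approximate 95% CI [{lower:.2f}, {upper:.2f}]")
```

The very wide interval is typical of small samples: the point estimate looks large, but the data are compatible with a broad range of effect sizes.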

Software and Tools for Fisher's Exact Test

Statistical Software Packages

Most modern statistical packages will calculate the significance of Fisher tests, in some cases even where the chi-squared approximation would also be acceptable. Here are the primary options for conducting Fisher's Exact Test:

R Statistical Software

R is a free, open-source statistical computing environment widely used in psychological research. To perform the Fisher's exact test in R, use the fisher.test() function as you would do for the Chi-square test. The basic syntax is straightforward, and R provides comprehensive output including the p-value, odds ratio, and confidence intervals.

R is particularly advantageous because it's free, has extensive documentation, and is widely supported by the research community. Many psychology departments teach R as part of their statistics curriculum.

SPSS

SPSS reports Fisher's Exact Test by default whenever a chi-square test is requested for a 2 × 2 contingency table. This makes SPSS particularly user-friendly for researchers who may not be comfortable with command-line interfaces. For tables other than 2 × 2, you need to click the Exact button in the Crosstabs dialog box and then select Exact.

Python

For researchers comfortable with programming, Python offers Fisher's Exact Test through the SciPy library. Python is increasingly popular in psychological research, particularly for researchers who also conduct computational modeling or machine learning analyses.

Online Calculators

For quick analyses or when statistical software isn't readily available, several reliable online calculators can perform Fisher's Exact Test:

GraphPad QuickCalcs - User-friendly interface with clear explanations
MedCalc Fisher's Exact Calculator - Provides detailed output including odds ratios
Stats Kingdom - Offers visualization options alongside calculations

These online tools are particularly useful for teaching purposes, quick verification of results, or when working on computers without statistical software installed.

Reporting Fisher's Exact Test Results

Essential Elements to Report

When interpreting and reporting Fisher's exact test results, it is important to include the exact p-value, provide the odds ratio along with a 95% confidence interval, and clearly state whether a one- or two-sided test was performed. Results should also be supported with a practical context to help readers understand the real-world significance of the findings.

A complete report of Fisher's Exact Test results should include:

The exact p-value (not just "p < 0.05")
The test name (Fisher's Exact Test)
Whether the test was one-tailed or two-tailed
The odds ratio and its confidence interval
Sample sizes for each group
A clear statement of the conclusion in context

Example Reporting Format

Here's an example of how to report Fisher's Exact Test results in APA style, using the therapy example above:

"A Fisher's Exact Test was conducted to examine the association between treatment type and clinical outcome. The association between treatment type and improvement status was not statistically significant at the .05 level (p = .070, two-tailed). A larger proportion of participants receiving the new CBT intervention showed clinically significant improvement (80%, 8 of 10) than of those receiving standard treatment (30%, 3 of 10), sample odds ratio = 9.33, 95% CI [LL, UL]." Here LL and UL stand for the lower and upper confidence limits reported by your software.

Visual Presentation

While the statistical results are crucial, visual presentation enhances understanding. Consider including:

A clearly labeled contingency table with frequencies and percentages
A bar chart or mosaic plot showing the distribution across categories
Effect size visualizations when appropriate

Common Pitfalls and How to Avoid Them

Misunderstanding Independence

One of the most common errors is violating the independence assumption. Each observation must be independent of all others. This means:

No participant can appear in multiple cells
Repeated measures from the same participants violate independence
Matched pairs or related samples require different tests (such as McNemar's test)

Inappropriate Use of One-Tailed Tests

Researchers sometimes inappropriately use one-tailed tests to achieve statistical significance. Remember that the decision to use a one-tailed test must be made before data collection, based on strong theoretical grounds, not after seeing the data.

Ignoring Effect Size

Statistical significance doesn't necessarily mean practical significance. Always report and interpret effect sizes (such as odds ratios) alongside p-values. A statistically significant result with a very small effect size may not be clinically or practically meaningful.

Multiple Comparisons Without Correction

When conducting multiple Fisher's Exact Tests on the same dataset, the risk of Type I errors increases. Consider applying appropriate corrections (such as Bonferroni correction) when conducting multiple tests.
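A Bonferroni correction is simple enough to apply by hand or in a few lines of code: multiply each p-value by the number of tests (capping at 1) and compare the result with α. In the sketch below the three p-values are hypothetical and stand in for the results of three separate Fisher's Exact Tests.

```python
def bonferroni(p_values, alpha=0.05):
    """Bonferroni-adjusted p-values and the resulting reject decisions."""
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    reject = [p_adj <= alpha for p_adj in adjusted]
    return adjusted, reject

# Hypothetical p-values from three Fisher's Exact Tests on the same dataset
p_values = [0.012, 0.030, 0.070]
adjusted, reject = bonferroni(p_values)

for p, p_adj, r in zip(p_values, adjusted, reject):
    print(f"raw p = {p:.3f}   adjusted p = {p_adj:.3f}   reject H0: {r}")
```

In this illustration only the first test survives the correction, even though two of the raw p-values are below 0.05.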

Confusing Statistical and Clinical Significance

In psychological and clinical research, statistical significance must be interpreted within the context of clinical or practical significance. A statistically significant finding may not translate to meaningful real-world impact.

Extensions and Alternatives

Larger Contingency Tables

Fisher's Exact Test is most commonly used for 2×2 tables, but the principle of the test can be extended to the general case of an m × n table, and some statistical packages provide a calculation for the more general case.

When working with tables larger than 2×2, be aware that computational demands increase substantially. For this reason, software packages often use Monte Carlo simulation to approximate the exact p-value for larger tables.
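The Monte Carlo idea can be sketched in a few lines: shuffle the column labels across participants (which keeps both sets of marginal totals fixed), rebuild the table, and count how often the shuffled table is no more probable than the observed one. This is a minimal sketch under stated assumptions, not the algorithm of any particular package. The function names, the number of simulations, the seed, and the 3×3 table are all hypothetical.

```python
import random
from math import lgamma

def log_table_weight(table):
    """Log-probability of a table given its margins, up to a constant
    that depends only on the margins: -sum(log(cell!))."""
    return -sum(lgamma(x + 1) for row in table for x in row)

def monte_carlo_fisher(table, n_sim=5000, seed=0):
    """Approximate the exact p-value of an r x c table by random shuffling."""
    rng = random.Random(seed)
    n_rows, n_cols = len(table), len(table[0])
    # One (row label, column label) pair per participant
    rows = [i for i, row in enumerate(table) for x in row for _ in range(x)]
    cols = [j for row in table for j, x in enumerate(row) for _ in range(x)]
    observed = log_table_weight(table)
    hits = 0
    for _ in range(n_sim):
        rng.shuffle(cols)                       # break any association
        sim = [[0] * n_cols for _ in range(n_rows)]
        for i, j in zip(rows, cols):
            sim[i][j] += 1
        if log_table_weight(sim) <= observed + 1e-9:
            hits += 1                           # as extreme as the observed table
    return (hits + 1) / (n_sim + 1)             # add-one keeps the estimate above 0

# Hypothetical 3 x 3 table (e.g., three conditions by three response categories)
p_3x3 = monte_carlo_fisher([[3, 1, 0], [1, 4, 2], [0, 2, 5]])
# Sanity check on a 2 x 2 table, where the exact answer is easy to compute
p_2x2 = monte_carlo_fisher([[8, 2], [3, 7]])
print(f"3x3 Monte Carlo p = {p_3x3:.4f}")
print(f"2x2 Monte Carlo p = {p_2x2:.4f}")
```

Because the result is a simulation estimate, it varies slightly with the seed and the number of simulations. Reporting both alongside the p-value makes the analysis reproducible.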

Alternative Exact Tests

An alternative exact test, Barnard's exact test, has been developed and proponents of it suggest that this method is more powerful, particularly in 2×2 tables. Furthermore, Boschloo's test is an exact test that is uniformly more powerful than Fisher's exact test by construction.

Statisticians continue to argue about alternatives to Fisher's exact test, but the improvements seem fairly small for reasonable sample sizes, and they come with the considerable cost of explaining to your readers why you are using an obscure statistical test instead of the familiar Fisher's exact test. Many readers who saw a significant result from Barnard's test, Boschloo's test, Santner and Snell's test, Suissa and Shuster's test, or any of the many other alternatives would likely rerun the numbers through Fisher's exact test.

Power and Sample Size Considerations

Understanding Statistical Power

Statistical power is the probability of detecting a true effect when it exists. With small sample sizes, statistical power is often limited. Be aware that while very small sample sizes are valid for this test, they do reduce statistical power, meaning that it can be hard to obtain significant results.

When planning a study that will use Fisher's Exact Test, consider:

The expected effect size based on previous research or pilot data
The desired power level (typically 0.80 or 80%)
The significance level (typically 0.05)
Practical constraints on sample size

Sample Size Planning

Fisher's exact test is applicable when the sample size is small and one or more cells may have small expected counts (< 5). Sample size calculations for such studies should therefore be based on the exact test itself rather than on the large-sample formulas used for the chi-square test.

Specialized software or online calculators can help determine the minimum sample size needed to achieve adequate power for detecting an effect of a given size. However, remember that in many psychological research contexts, sample size is constrained by practical factors such as participant availability or resource limitations.
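For two equal-sized groups, power can also be computed exactly by enumerating every possible outcome, weighting it by its binomial probability, and adding up the outcomes in which Fisher's test rejects. The sketch below assumes independent binomial sampling in each group and a two-tailed test. The function names and the planning values of 80% and 30% improvement are hypothetical.

```python
from math import comb

def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher p-value for the table [[a, b], [c, d]]."""
    r1, r2, c1 = a + b, c + d, a + c
    denom = comb(r1 + r2, c1)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    probs = [comb(r1, x) * comb(r2, c1 - x) / denom for x in range(lo, hi + 1)]
    cutoff = probs[a - lo] * (1 + 1e-7)         # tolerance for float ties
    return min(1.0, sum(p for p in probs if p <= cutoff))

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def power(n, p1, p2, alpha=0.05):
    """Chance that Fisher's test rejects with n participants per group."""
    total = 0.0
    for x1 in range(n + 1):                     # successes in group 1
        for x2 in range(n + 1):                 # successes in group 2
            if fisher_two_sided(x1, n - x1, x2, n - x2) <= alpha:
                total += binom_pmf(x1, n, p1) * binom_pmf(x2, n, p2)
    return total

def smallest_n(p1, p2, target=0.80, alpha=0.05, max_n=60):
    """First per-group sample size whose power reaches the target, if any."""
    for n in range(2, max_n + 1):
        if power(n, p1, p2, alpha) >= target:
            return n
    return None                                 # target not reached within max_n

print(f"power with 10 per group: {power(10, 0.80, 0.30):.3f}")
n_needed = smallest_n(0.80, 0.30)
print("smallest n per group reaching 80% power:", n_needed)
```

Because the test is discrete, power does not always increase smoothly with sample size, so it is worth checking a few values of n around the one returned rather than relying on a single number.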

Real-World Applications in Psychological Research

Clinical Psychology

Fisher's Exact Test is frequently used in clinical psychology research to evaluate treatment outcomes with small samples. Examples include:

Comparing response rates between treatment and control groups in pilot studies
Examining the relationship between diagnostic categories and treatment adherence
Analyzing adverse event rates in small clinical trials
Investigating associations between comorbid conditions and treatment outcomes

Developmental Psychology

Developmental researchers often work with specialized or hard-to-recruit populations, making Fisher's Exact Test particularly valuable:

Examining developmental milestone achievement across different groups
Analyzing categorical outcomes in longitudinal studies with attrition
Investigating relationships between early risk factors and later outcomes
Studying rare developmental disorders or conditions

Social Psychology

Social psychologists may use Fisher's Exact Test when studying:

Group decision-making processes in small groups
Categorical responses to experimental manipulations
Relationships between demographic variables and behavioral choices
Pilot testing of experimental paradigms

Neuropsychology

In neuropsychological research, small sample sizes are common due to the specialized nature of patient populations:

Comparing performance categories (impaired vs. unimpaired) across patient groups
Examining relationships between lesion location and functional outcomes
Analyzing categorical neuropsychological test results
Investigating associations between neurological conditions and behavioral symptoms

Advanced Considerations

Dealing with Tied P-Values

Because Fisher's Exact Test uses discrete probability distributions, tied p-values can occur. Different software packages may handle ties slightly differently, which can occasionally lead to minor discrepancies in results across platforms. This is generally not a concern for interpretation but is worth noting when comparing results across different software.

Mid-P Correction

Some statisticians advocate for using a mid-p correction, which can make Fisher's Exact Test less conservative. The mid-p value is calculated by subtracting half the probability of the observed table from the standard Fisher's exact p-value. This approach is controversial, and you should clearly state if you use it.
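As a small illustration, the sketch below computes the one-tailed (upper) Fisher p-value and its mid-p counterpart for the counts from the therapy example earlier in this guide. The function name is illustrative, and the one-tailed form is used because it matches the definition just given most directly. Two-tailed mid-p values are defined in more than one way, so check what your software does.

```python
from math import comb

def upper_tail_and_mid_p(a, b, c, d):
    """Upper-tailed Fisher p-value and its mid-p version for [[a, b], [c, d]]."""
    r1, r2, c1 = a + b, c + d, a + c
    denom = comb(r1 + r2, c1)
    hi = min(r1, c1)
    # Probability of exactly the observed table
    p_observed = comb(r1, a) * comb(r2, c1 - a) / denom
    # Probability of the observed table or one more extreme in the same direction
    p_upper = sum(comb(r1, x) * comb(r2, c1 - x) for x in range(a, hi + 1)) / denom
    mid_p = p_upper - 0.5 * p_observed
    return p_upper, mid_p

p_upper, mid_p = upper_tail_and_mid_p(8, 2, 3, 7)
print(f"standard upper-tailed p = {p_upper:.4f}")
print(f"mid-p                   = {mid_p:.4f}")
```

The mid-p value is always smaller than the standard p-value, which is how it offsets some of the conservativeness of the exact test.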

Confidence Intervals for Odds Ratios

Different methods exist for calculating confidence intervals for odds ratios in Fisher's Exact Test. Most software packages use methods that maintain the exact nature of the test, but the specific algorithm may vary. This rarely affects interpretation but can lead to slight differences in confidence interval bounds across software packages.

Ethical Considerations in Small Sample Research

Transparency in Reporting

When conducting research with small samples, transparency is crucial. Researchers should:

Clearly justify the small sample size
Acknowledge limitations related to statistical power
Report all analyses conducted, not just significant findings
Discuss the generalizability of findings
Consider pre-registration of analyses when possible

Avoiding P-Hacking

With small samples, the temptation to conduct multiple analyses until finding significance can be strong. Researchers must resist this temptation and:

Plan analyses before data collection
Report all analyses conducted
Apply appropriate corrections for multiple comparisons
Distinguish between confirmatory and exploratory analyses

Replication and Validation

Findings from small sample studies should be interpreted cautiously and ideally replicated in independent samples. Researchers should:

Frame findings as preliminary when appropriate
Encourage replication efforts
Consider meta-analytic approaches to combine findings across small studies
Be transparent about the exploratory nature of small sample research

Practical Tips for Success

Before Data Collection

Determine whether Fisher's Exact Test is appropriate for your research question
Conduct a power analysis to understand the limitations of your sample size
Pre-register your analysis plan when possible
Ensure your variables are truly categorical and binary
Plan for how you'll handle missing data

During Data Collection

Maintain strict independence of observations
Document any deviations from your planned protocol
Ensure consistent categorization of outcomes
Keep detailed records of all data collection procedures

During Analysis

Verify that assumptions are met before conducting the test
Use reliable, well-documented statistical software
Double-check data entry and table construction
Report exact p-values rather than just "p < 0.05"
Calculate and report effect sizes
Consider sensitivity analyses when appropriate

When Writing Up Results

Provide complete descriptive statistics
Include a clear contingency table
Report all relevant test statistics
Interpret findings in practical, not just statistical, terms
Acknowledge limitations related to sample size
Discuss implications for future research

Learning Resources and Further Reading

For researchers wanting to deepen their understanding of Fisher's Exact Test and small sample statistics, several excellent resources are available:

Textbooks and Academic Resources

Statistical textbooks specifically addressing categorical data analysis
Research methods textbooks with sections on small sample statistics
Online statistics courses focusing on non-parametric methods
University statistics department websites with tutorial materials

Online Communities and Support

Statistics forums where researchers can ask questions
R user groups and mailing lists
Academic social media communities focused on research methods
YouTube channels with statistics tutorials

Software Documentation

Official R documentation for the fisher.test() function
SPSS user guides and tutorials
Python SciPy documentation
Software-specific user forums and communities

For additional guidance on statistical analysis in psychology, consider exploring resources from the American Psychological Association and the Association for Psychological Science.

Conclusion

Fisher's Exact Test is an indispensable tool for psychological researchers working with small sample sizes and categorical data. Unlike approximate tests that rely on large sample assumptions, Fisher's Exact Test computes exact p-values, making it the gold standard for small datasets. For instance, in clinical trials or genetics, where sample sizes can be limited, this test provides a reliable method to determine if two variables are related.

By understanding when to use Fisher's Exact Test, how to properly conduct and interpret it, and how to report results transparently, researchers can draw valid conclusions from limited data. While small sample sizes present inherent challenges, Fisher's Exact Test provides a rigorous statistical framework for analyzing categorical associations when larger samples are not feasible.

The key to successful application of Fisher's Exact Test lies in careful planning, rigorous adherence to assumptions, transparent reporting, and appropriate interpretation of results within the broader context of psychological theory and practice. As psychological research continues to grapple with questions involving specialized populations and rare phenomena, Fisher's Exact Test will remain a critical tool in the researcher's statistical toolkit.

Remember that statistical significance is just one piece of the puzzle. Effect sizes, confidence intervals, replication, and theoretical coherence all contribute to building a robust body of psychological knowledge. By combining rigorous statistical methods like Fisher's Exact Test with thoughtful research design and transparent reporting practices, researchers can make meaningful contributions to psychological science even when working with small samples.