Hardy-Weinberg Simulation in R

EduSTEM Bioinformatics: Hardy-Weinberg Simulation in R

Ayush Noori | EduSTEM Bioinformatics


Adapted from the publicly-available AP Biology Investigative Labs: An Inquiry-Based Approach, Investigation 2.

“Mathematics is biology’s next microscope, only better …” (Cohen 2004)

It is not hard to understand the value of microscope technology to biology and how this technology opened up entire new worlds of biological understanding. However, for some, it is not as easy to see the value of mathematics to the study of biology. Like the microscope, math and computers provide tools to explore the complexity of biology and biological systems, offering deeper insights to and understanding of what makes living systems work. Even the incredible complexity of evolution in populations is illuminated by relatively simple mathematical equations, several of which are based on the HardyWeinberg (H-W) equilibrium. Students (and their teachers) have traditionally found the topic of population genetics in an introductory biology class to be challenging, due in part to the fact that for the last couple of generations biology has been thought of as the science with only a minimal mathematics foundation — particularly in comparison to chemistry or physics. Modern biology, however, is vastly different.

One of the specific difficulties of the H-W null hypothesis is that it is the null hypothesis — it is what would happen to allele frequencies in the absence of any evolutionary parameter. This is counterintuitive for most students. H-W is the standard by which evolution can be measured. To that end, most simulations that try to create a population manipulated by students to model H-W are flawed from the beginning. Student classroom populations, by definition, are so small that genetic drift will swamp any other factors that the simulation is trying to model. By starting with a population that is modeled on a computer spreadsheet with explicit randomness, students are able to build their knowledge toward an inference on allele inheritance patterns in a theoretically infinite population. They do this by creating larger and larger populations to minimize fluctuations from the expected probabilities.

Having students build their own model is a requisite component to this investigation. It is during the model building, testing, and corrective phase of construction that the reflective and analytical skills of model building are learned. These skills, by their very nature, have a broad application to learning. In this investigation, the students will build use R to model how a hypothetical gene pool changes from one generation to the next. This model will allow for the exploration of parameters that affect allele frequencies, such as selection, mutation, and migration.

This investigation also provides an opportunity for students to review concepts they might have studied previously, including natural selection as the major mechanism of evolution; the relationship among genotype, phenotype, and natural selection; and fundamentals of classic Mendelian genetics.


Load the requisite libraries.



Set the working directory to where you would like your output data to result.

dir = "<insert appropriate directory address here>"

Set Initial Values

Set the initial p value.

p_initial = 0.99
q_initial = 1 - p_initial

n is the size of the population.

n = 5000

f is the number of generations.

f = 20

Assign fitness values for each genotype. Experiment with these values as you wish. AA, AB, and BB are homozygous dominant, heterozygous, and homozygous recessive respectively (i.e. bi-allelic markers).

fitnessAA = 1
fitnessAB = 7
fitnessBB = 11

Normalize fitness values.

fitness = c(fitnessAA, fitnessAB, fitnessBB)
if(max(fitness) != min(fitness)) {
  fitnessAA = fitnessAA/max(fitness)
  fitnessAB = fitnessAB/max(fitness)
  fitnessBB = fitnessBB/max(fitness)

Create the data frames to process and store the data.

genotypes = data.frame(matrix(nrow=n, ncol=3))
colnames(genotypes) = c("Sperm", "Egg", "Zygote")

results = data.frame(matrix(nrow=f, ncol=5))
colnames(results) = c("AA", "AB", "BB", "P of Next Generation", "Q of Next Generation")

p_new = p_initial
q_new = q_initial

Hardy-Weinberg Simulation

Finally, run the simulation. Study this code carefully to fully understand how this simulation operates.

for (i in 1:f) {
  for (k in 1:n) {
    if(runif(1) <= p_new) { genotypes[k, "Sperm"] = "A" } else { genotypes[k, "Sperm"] = "B"}
    if(runif(1) <= p_new) { genotypes[k, "Egg"] = "A" } else { genotypes[k, "Egg"] = "B"}
    genotypes[k, "Zygote"] = paste(genotypes[k, "Sperm"], genotypes[k, "Egg"], sep="")
  totalAA = fitnessAA * sum(genotypes[, "Zygote"] == "AA")
  totalAB = fitnessAB * (sum(genotypes[, "Zygote"] == "AB") + sum(genotypes[, "Zygote"] == "BA"))
  totalBB = fitnessBB * sum(genotypes[, "Zygote"] == "BB")
  results[i, "AA"] = totalAA
  results[i, "AB"] = totalAB
  results[i, "BB"] = totalBB
  allelesA = 2*totalAA + totalAB
  allelesB = 2*totalBB + totalAB
  p_new = allelesA/(allelesA + allelesB)
  q_new = allelesB/(allelesA + allelesB)
  results[i, "P of Next Generation"] = p_new
  results[i, "Q of Next Generation"] = q_new

Ternary Plot

Make a ternary plot to visualize your results.

A ternary plot is a barycentric plot on three variables which sum to a constant. It graphically depicts the ratios of the three variables as positions in an equilateral triangle. In population genetics, it is often called a de Finetti diagram.

In this ternary plot, the center parabola represents the ratio p^2+2pq+p^2, or the Hardy-Weinberg Equilibrium.

Create a gradient from green to blue as generations progress.

colfunc = colorRampPalette(c("springgreen","royalblue"))
colorvector = colfunc(f)

Create the ternary plot! The acceptance region corresponds to a Chi-square test. alpha is the significance level, if the p-value is less than alpha then data is significant. signifcolour colors the markers based on a significance test.

ternary_plot = HWTernaryPlot(results[, 1:3], f, vbounds = TRUE, alpha = 0.05, region = 1, vertex.cex = 2, markercol = colorvector, signifcolour = FALSE, curvecols = c("black", "red"))

With these courses, we hope to further our mission to make high-quality STEMX education accessible for all. For questions or support, please feel free to reach out to me at anooristudent@outlook.com.

Best Regards,

Ayush Noori

EduSTEM Boston Chapter Founder


  1. NCBI PubMed

The premier source of past and present medical literature. Most supplemental information in Extensions is available via PubMed. When searching PubMed, be sure to use the “Free full text,” and “Sort by: Best Match” filters to find relevant and accessible results.

  1. RCSB Protein Data Bank (PDB)

A large database of useful 3D structures of large biological molecules, including proteins and nucleic acids. Use the search bar to find a molecule of interest, which can then be examined using the Web-based 3D viewer.