Statistics in R: Binomial Distribution

2018, Mar 23 alyosha

Let’s use the R and RStudio to learn more about the binomial distribution!

Downloading the Project
Introduction
Generating binomial data
Probability Density
Cumulative Density
Quantiles
Quantile-Quantile Plots
Further Reading

Downloading the Project

We will be using the ProbabilityDistributions project, which you can download in the form of a ZIP archive ProbabilityDistributions.zip. After you’ve downloaded the archive, proceed by extracting it, go into a newly created ProbabilityDistributions folder and double-click on the ProbabilityDistributions.Rproj icon. This opens up the ProbabilityDistributions project in the RStudio.

Introduction

This entire paragraph was taken over from the Introduction to Statistical Thought (written by Michael Lavine; published on June 24, 2010).

Statisticians often have to consider observations of the following type:

A repeatable event results in either a success or a failure.
Many repetitions are observed
Successes and failures are counted
The number of successes helps us learn about the probability of success.

Because binomial experiments are so prevalent there is specialized language to describe them. Each repetition is called a trial; the number of trials is usually denoted N; the unknown probability of success is usually denoted either p or θ; the number of successes is usually denoted X. We write _X ∼ Bin(N, p) _. The symbol “∼” is read is distributed as; we would say “X is distributed as Binomial N p” or “X has the Binomial N, p distribution”.

Some important assumptions about binomial experiments are that N is fixed in advance, θ is the same for every trial, and the outcome of any trial does not influence the outcome of any other trial.

Generating binomial data

Let’s create a function summarize.binomial.observations, which takes the following parameters:

trials - a number of trials (N)
theta - a probability of success (θ)
observations - a number of observations to be made

library(tidyverse)

summarize.binomial.observations <- function(trials, theta, observations) {
  sample.space <- c(1,0)
  results <- 1:observations %>%
  map_int(function(x)
    as.integer(
      sum(
        sample(sample.space, size = trials, replace = TRUE, prob = c(theta, 1 - theta))
      )
    )
  )
  return(results)
}

Now we can use the function to generate 3000 observations of X ∼ Bin(80, 0.6):

summary <- summarize.binomial.observations(80, 0.6, 3000)
resulting.df <- data.frame(flips <- summary) # ggplot only works with data frames
names(resulting.df) <- c("flips")
#binPDF <- data.frame(x<-1:80, y<-dbinom(1:80, size=80, prob = 0.6))
#names(binPDF) <- c("flips", "prob")

ggplot(resulting.df) + 
  geom_histogram(breaks=seq(1, 80, 2), aes(x=flips, y=..density..), position="identity") + 
  geom_density(aes(x=flips, y = ..density..), colour="red")

binomial probability distribution function

It turns out that R already contains a function which produces the same data as summarize.binomial.observations. The function is called rbinom and it repeatedly performs a random draw from a binomial distribution.

drawsV1 <- summarize.binomial.observations(80, 0.6, 3000)
drawsV2 <- rbinom(3000, 80, 0.6)
mean(drawsV1)
mean(drawsV2)
length(drawsV1)
length(drawsV2)

produces:

[1] 48.01467
[1] 48.052
[1] 3000
[1] 3000

Probability Density

The probability of observing exactly 5 heads out of 10 flips can be estimated as

mean(rbinom(1000, 10, .5) == 5)

result:

[1] 0.232

There is a special function in R which calculates the probability density - it is called dbinom:

dbinom(5, 10, .5)

yields:

[1] 0.2460938

Cumulative Density

The probability of observing exactly at most 4 heads out of 10 flips can be estimated as

mean(rbinom(1000, 10, .5) <= 4)

result:

[1] 0.392

There is a special function in R which calculates the cumulative density - it is called pbinom:

pbinom(4, 10, .5)

output:

[1] 0.3769531

Quantiles

Quantile works as an inverse of the cumulative density function, as illustrated below:

qbinom(pbinom(4, 10, .5), 10, .5)

[1] 4

Visit Basic Probability Distributions in R for more information.

Quantile-Quantile Plots

As described in the Q-Q Plot Tutorial, The quantile-quantile (q-q) plot is a graphical technique for determining if two data sets come from populations with a common distribution.

The following code compares observations stored in x.sample with those in y.sample:

x.sample <- rbinom(1000, 19, .5)
y.sample <- rbinom(1000, 19, .5)
qqplot(x.sample, y.sample)
abline(0,1)

qq-plot of binomial data

The same test using ggplot2:

nq <- 31
p <- (1 : nq) / nq - 0.5 / nq
ggplot() + geom_point(aes(x = quantile(x.sample, p), y = quantile(y.sample, p))) +
  geom_abline(intercept = 0, slope = 1)

qq-plot of binomial data created with the ggplot2

Do the following to compare the x.sample with a binomial distribution Bin(19, 0.5):

xdf <-data.frame(x = x.sample)
ggplot(xdf, aes(sample = x)) + 
  stat_qq(distribution = stats::qbinom, dparams = list(size = 19, prob=0.5)) + 
  geom_abline(intercept = 0, slope = 1)

qq-plot of a data sample compared to data generated by qbinom

Pastinak