The birthday paradox goes… in a room of 23 people there is a 50-50 chance that two of them share a birthday.
OK, so the first step in introducing a paradox is to explain why it is a paradox in the first place. One might think that for each person, there is a 1/365 chance of another person having the same birthday as them. Indeed, I can think of only one other person I’ve met who has the same birthday as me, and he is my twin brother! Since I’ve met far more than 23 people, how can this be true?
This reasoning is flawed for several reasons, the first of which is that the question wasn’t asking whether another person in the room has one specific birthday. Any pair of people (or more!) can share a birthday, and every extra pair increases the chances of the statement being true.
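To put rough numbers on the difference, here is a small sketch that counts the pairs in a room of 23 and compares that with the chance that somebody matches one specific birthday. It assumes 365 equally likely birthdays and ignores leap days; the variable names are mine, chosen for illustration.

```python
from math import comb

people = 23
days = 365  # assumption: no leap days, every day equally likely

# Number of distinct pairs of people who could share a birthday.
pairs = comb(people, 2)

# Chance that at least one of the other 22 people matches *my* birthday.
p_match_me = 1 - ((days - 1) / days) ** (people - 1)

print(pairs)                 # 253
print(round(p_match_me, 3))  # 0.059
```

So a match with me in particular is still unlikely (about 6%), but there are 253 different pairs that could produce a match with each other.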
The complete answer gets heavy into the math, but I want to show you how to convince yourself it is true by simulating the experiment. Simulation is programming a computer or model to act as if the real thing were happening. Usually, you set this up so that the cost of simulation is much less than doing the actual thing. For example, putting a model airplane wing in a wind tunnel is a simulation. I’ve simulated the birthday paradox in a programming language called Python, and this post is available in notebook style here. Indeed, this is much easier than being in a room with 23 people.
Below I will not present the code (again, that’s over here), but I will describe how the simulation works and present the results.
Call the number of people we need to ask before we get a repeated birthday n. This is what is called a random variable because its value is not known and may change due to conditions we have no control over (like who happens to be in the room).
Now we simulate an experiment realising a value for n as follows.
1. Pick a random person and ask their birthday.
2. Check to see if someone else has already given you that answer.
3. Repeat steps 1 and 2 until a birthday is said twice.
4. Count the number of people that were asked and call that n.
Getting to step 4 constitutes a single experiment. The number that comes out may be n = 2 or n = 100. It all depends on who is in the room. So we repeat all the steps many many times and look at how the numbers fall. The more times we repeat, the more data we obtain and the better our understanding of what’s happening.
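For readers who want something to run, here is a minimal sketch of the steps above. It is not the code from the notebook: the function names, the seed, and the smaller trial count are my own choices for illustration, and it assumes 365 equally likely birthdays.

```python
import random
from collections import Counter

DAYS = 365  # assumption: ignore leap days, all days equally likely


def ask_until_repeat(rng):
    """Run one experiment and return n, the number of people asked."""
    seen = set()
    while True:
        birthday = rng.randrange(DAYS)  # pick a random person, ask their birthday
        if birthday in seen:            # has someone already given that answer?
            return len(seen) + 1        # count everyone asked, including this one
        seen.add(birthday)              # no repeat yet, so keep asking


def run_experiments(trials, seed=0):
    """Repeat the experiment and tally how often each value of n occurs."""
    rng = random.Random(seed)
    return Counter(ask_until_repeat(rng) for _ in range(trials))


counts = run_experiments(10_000)
print(counts.most_common(5))
```

Each experiment must stop by the 366th person at the latest, since there are only 365 distinct days to use up.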
Here is what it looks like when we run the experiment one million times.
So what do all those numbers mean? Well, let’s look at how many times n = 2 occurred, for example. In these one million trials, the result 2 occurred 2679 times, which is a relative frequency of 0.2679%. Note that this is close to 1/365 ≈ 0.274%, which is expected since the probability that the second person has the same birthday as the first is exactly 1/365. So each number of occurrences divided by one million is approximately the probability that we would see that number in a single experiment.
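The arithmetic in that paragraph can be checked in a couple of lines. This sketch only uses the figures quoted above (2679 occurrences in one million trials); the variable names are illustrative.

```python
trials = 1_000_000
count_n2 = 2679  # occurrences of n = 2 quoted above

relative_frequency = count_n2 / trials
exact = 1 / 365  # probability that the second person matches the first

print(f"{relative_frequency:.4%}")  # 0.2679%
print(f"{exact:.4%}")               # 0.2740%
```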
We can then plot the same data considering the vertical axis the probability of needing to see n people before a repeated birthday.
Adding up the heights of all these bars gives 100%. This is because one of the values must occur when we do an experiment. OK, so now we can just add up these probabilities starting at n = 2 and increasing until we get to 50%. Visually, it is the number which splits the coloured area above into two equal parts. That number will be the number of people we need to meet to have a 50-50 shot at getting a repeated birthday. Can you guess what it will be?
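Here is a sketch of that adding-up procedure. To keep it deterministic, it uses the exact probabilities under the uniform assumption (365 equally likely days) rather than simulated frequencies. That is a swap on my part, not what the simulation does, and the function names are hypothetical.

```python
DAYS = 365  # assumption: ignore leap days, all days equally likely


def first_repeat_distribution(days=DAYS):
    """Exact P(n = k) for k = 2 .. days + 1 under uniform birthdays."""
    probs = {}
    p_all_distinct = 1.0  # probability that the first k - 1 people all differ
    for k in range(2, days + 2):
        # Person k must repeat one of the k - 1 distinct birthdays so far.
        probs[k] = p_all_distinct * (k - 1) / days
        p_all_distinct *= (days - (k - 1)) / days
    return probs


def threshold(probs, target=0.5):
    """Add probabilities from the smallest n upward until reaching target."""
    total = 0.0
    for k in sorted(probs):
        total += probs[k]
        if total >= target:
            return k
    return None  # only reached if the probabilities sum to less than target


probs = first_repeat_distribution()
print(threshold(probs))
```

The same `threshold` function should work on simulated results too, if you pass it each count divided by the number of trials.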
Drum roll… 23! Tada! The birthday paradox simulated and solved by simulation!
But, wait! There’s more.
What about those leap year babies? In fact, isn’t the assumption that birthdays are equally distributed wrong? If we actually tried this experiment out in real life, would we get 23 or some other number?
Happily, we can test this hypothesis with real data! At least for US births, you can find the data over at FiveThirtyEight’s GitHub page. Here is what the actual distribution looks like.
Perhaps by eye it doesn’t look too uniform. You can clearly see 25 Dec and 31 Dec have massive dips. Much has been written about this and many beautiful visualizations are out there. But, our question is whether this has an effect on the birthday paradox. Perhaps the fact that not many people are born on 25 Dec means it is easy to find a shared birthday on the remaining days, for example. Let’s test this hypothesis by simulating the experiment with the real distribution of birthdays.
To do this, we perform the same 4 steps as above, but randomly sampling answers from the actual distribution of birthdays. The result of another one million experiments is plotted below.
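The only change to the experiment is how a birthday gets drawn. Here is a sketch of weighted sampling using Python's `random.choices`. The weights below are made up for illustration (every day equal, except one holiday with half the usual births); they are not the real birth counts, which you would load from the data set instead.

```python
import random
from itertools import accumulate

# Hypothetical weights, NOT the real data: all days equal except index 358
# (25 Dec when counting from 0 in a non-leap year), which gets half the births.
weights = [1.0] * 365
weights[358] = 0.5

days = range(len(weights))
cum_weights = list(accumulate(weights))  # precomputed so each draw is fast


def ask_until_repeat_weighted(rng):
    """One experiment, with birthdays drawn according to the weights."""
    seen = set()
    while True:
        birthday = rng.choices(days, cum_weights=cum_weights, k=1)[0]
        if birthday in seen:
            return len(seen) + 1
        seen.add(birthday)


rng = random.Random(0)
results = sorted(ask_until_repeat_weighted(rng) for _ in range(10_000))
median = results[len(results) // 2]
print(median)
```

With only ten thousand trials the median is a noisy estimate, so expect it to land near the uniform answer rather than to settle the question.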
And the answer is the same! The birthday paradox persists with the actual distribution of birthdays.
The above discussion is very good evidence that the birthday paradox is robust to the actual distribution of births. However, it does not constitute a mathematical proof. An experiment can only provide evidence. So I will end this with a technical question for those mathematical curiosos out there. (What I am about to do is also called Nerd Sniping.)
Here is the broad problem: quantify the above observation. I think there is more than one question here. For example, it should be possible to bound the 50-50 threshold as a function of the deviation from a uniform distribution.
(Cover image credit: Ed g2s, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=303792)