Probability of winning the National Lottery with Python (Monte Carlo)
Calculate the real probability of winning the National Lottery with Python: cumulative formula, Monte Carlo and why historical data doesn't predict.

Probability of Winning Spain’s National Lottery: Python Analysis, Historical Data, and Monte Carlo
In my family, a strange tradition has popped up over the last few years: we go to bingo once a year—specifically on Three Kings’ Day. I’m not really into gambling at all, so this habit sparked a few interesting conversations over the holidays. And that led to the inevitable question:
- How likely is it to actually win the lottery?
- If we all know the odds are tiny, can we optimize anything at all?
Before we go any further, one important warning: always gamble responsibly. Statistics are not on our side here. Still, this is a great excuse to learn applied statistics properly and debunk a bunch of common beliefs.
In this article, I’ll analyze typical behaviors around Spain’s National Lottery (a 5‑digit number):
- Does it make sense to check historical results to pick numbers?
- If I play for years, what is my real probability of hitting the big prize?
- If I can’t increase the probability of winning, can I reduce the chance of sharing the prize?
Scope note: I’m talking about the Spanish National Lottery (00000–99999). This is not Primitiva/Bonoloto/EuroMillions, which have different rules and odds.
What does “winning” actually mean?
Before touching code, we need to define the goal precisely. Depending on what you mean by “winning”, the analysis changes:
- Match the exact number (top prize associated with that number)
- Win something (any prize)
- Win more than you spend (profitability)
In this post I’ll focus on the most common meaning when people talk about “the jackpot”: matching the exact number.
Getting the data (and why I work locally)
First things first: where do we get the data?
The official site shows results, but I didn’t find a convenient bulk export for analysis (at least not at first glance). So I used a web-accessible dataset provided by the lottery retailer Eduardo Losilla, which exposes a historical record as JSON.
- Source:
https://api.eduardolosilla.es/botes/actuales?uts
Important for reproducibility: I download the JSON as a snapshot and work locally (for example, data/loteria.json). That way, the analysis does not depend on the API being available when someone else runs it.
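A minimal sketch of that download step (an assumption on tooling: I use requests here, but any HTTP client works; the snapshot path matches the one above):
import json
import requests

URL = "https://api.eduardolosilla.es/botes/actuales?uts"

response = requests.get(URL, timeout=30)
response.raise_for_status()

# Save a local snapshot so the analysis doesn't depend on the API being up
with open("data/loteria.json", "w", encoding="utf-8") as f:
    json.dump(response.json(), f, ensure_ascii=False)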
This gives us a JSON containing draws and several precomputed fields. For example:
{
"numero": 6703,
"fecha_sorteo": 1767654000,
"nombre": "Extraordinario del Niño",
"enlace_pdf_premios": null,
"temporada": 2026,
"numero_texto": "06703",
"suma": 16,
"pares": 3,
"impares": 2,
"bajos": 3,
"altos": 2,
"diferentes": 4,
"reduccion1d": 7
}
Here’s the first “small detail” that breaks analyses: numbers can start with 0. 06703 is not the same as 6703 if you treat it as an integer. We fix this by normalizing everything to 5 digits.
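A quick illustration of the pitfall, using the example number from the JSON above (a minimal sketch):
n = int("06703")        # parsing as an integer silently drops the leading zero
print(n)                 # 6703
print(str(n).zfill(5))   # "06703": zero-padding restores the 5-digit form
print(f"{n:05d}")        # same idea with an f-string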
The JSON also includes a leyenda (legend) explaining calculated fields (digit sum, evens/odds, low/high digits, etc.) and tipos_sorteo if you want to filter by draw type.
Loading into Pandas and normalizing the winning number
import json
import pandas as pd
path = "data/loteria.json" # adjust to your local snapshot
with open(path, "r", encoding="utf-8") as f:
    raw = json.load(f)
df = pd.json_normalize(raw["sorteos"])
# Date: epoch seconds -> datetime with timezone
df["fecha"] = (
pd.to_datetime(df["fecha_sorteo"], unit="s", utc=True)
.dt.tz_convert("Europe/Madrid")
)
# Winning number as 5 digits (zero-padded)
df["numero_5"] = df["numero_texto"].astype(str).str.zfill(5)
# Chronological order: useful for streaks and time analysis
df = df.sort_values("fecha").reset_index(drop=True)
df.head()
Minimum validations (not flashy, very necessary)
Before drawing conclusions, I check the basics: nulls, duplicates, lengths, and ranges.
print("Rows, columns:", df.shape)
null_rate = df.isna().mean().sort_values(ascending=False)
print(null_rate.head(12))
print(
"Duplicates:",
df.duplicated(subset=["fecha_sorteo", "numero_5", "nombre"]).sum()
)
print(df["numero_5"].str.len().value_counts())
print(df[["suma","pares","impares","bajos","altos","diferentes"]].describe())A common shortcut is: “drop rows with nulls and move on.”
- Why don’t I do that by default? Because sometimes what’s missing is an auxiliary column, and deleting rows can introduce bias. I’d rather understand what’s missing and why.
Theoretical probability: the baseline (and what history can’t change)
If the game is a 5‑digit number between 00000 and 99999, there are 100,000 possible combinations.
If we’re talking about matching an exact number (the top prize associated with that number), the probability per draw is:
- p = 1 / 100000
The useful question is:
“If I play many times, how much does my probability increase?”
Playing many times: cumulative probability without fooling yourself
The probability of winning at least once after n independent draws is not a simple sum. It’s:
- P(at least one) = 1 − (1 − p)ⁿ
p = 1/100_000
def prob_at_least_one(n, p=p):
    return 1 - (1 - p)**n
for n in [10, 100, 1_000, 10_000]:
    print(n, prob_at_least_one(n))
Honest interpretation:
- Playing more always increases your cumulative probability.
- But when p is 1/100,000, it increases painfully slowly.
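To make “painfully slowly” concrete, you can invert the formula and ask how many draws are needed to reach a given cumulative probability (a small sketch, same p as above):
import math

p = 1 / 100_000

def draws_needed(target, p=p):
    # Solve 1 - (1 - p)**n >= target for n
    return math.ceil(math.log(1 - target) / math.log(1 - p))

print("Draws for a 50% chance:", draws_needed(0.50))  # ~69,315 draws
print("Draws for a 90% chance:", draws_needed(0.90))  # ~230,258 draws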
What historical data is good for: auditing “randomness” with simple tests
Here we change goals. Instead of trying to predict, we use history to check whether outcomes look compatible with a reasonable random process.
Last digit: a quick sanity check
Over a long history, the last digit should be roughly uniform from 0 to 9.
import numpy as np
df["last_digit"] = df["numero_5"].str[-1].astype(int)
counts = df["last_digit"].value_counts().sort_index()
freq = counts / counts.sum()
print(freq.round(4))
Chi-square uniformity test (no jargon)
What it measures: whether the observed distribution deviates “too much” from a uniform one.
from scipy.stats import chisquare
observed = counts.values
expected = np.full_like(observed, observed.sum()/10, dtype=float)
chi2, pvalue = chisquare(observed, f_exp=expected)
print("chi2=", chi2, "pvalue=", pvalue)How to read it honestly:
- A high p-value does not prove randomness; it just means you don’t see strong evidence of deviation.
- A low p-value could mean a real bias… or a dataset artifact (missing draws, mixed draw types, rule changes).
Tip: repeat the test by nombre (draw type) and by time periods to reduce false positives.
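For example, a minimal sketch of repeating the test per draw type (assuming nombre identifies the draw type, as in the snapshot above):
from scipy.stats import chisquare

for name, group in df.groupby("nombre"):
    counts = group["last_digit"].value_counts().reindex(range(10), fill_value=0)
    if counts.sum() < 100:  # skip groups too small for a meaningful test
        continue
    expected = [counts.sum() / 10] * 10
    chi2, pvalue = chisquare(counts.values, f_exp=expected)
    print(f"{name}: n={counts.sum()}, p-value={pvalue:.3f}")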
When it “doesn’t look random”: patterns that trick intuition
Your dataset has perfect features to teach a key idea: many truly random results don’t feel random.
Digit sum
import matplotlib.pyplot as plt
df["suma"].plot(kind="hist", bins=range(0, 46), title="Distribution of digit sums")
plt.xlabel("Sum")
plt.show()
- Many people avoid 12345 because it “looks too obvious.”
- But if the event is “match this exact number,” 12345 and 80417 are equally likely.
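If you want to compare the observed histogram with the exact theoretical distribution, you can enumerate all 100,000 possible numbers (a minimal sketch; column names match the ones used above):
import pandas as pd

# Digit sum for every possible number 00000-99999
all_numbers = [f"{i:05d}" for i in range(100_000)]
theoretical_sums = pd.Series([sum(int(d) for d in s) for s in all_numbers])

theoretical = theoretical_sums.value_counts(normalize=True).sort_index()
observed = df["suma"].value_counts(normalize=True).sort_index()

comparison = pd.DataFrame({"theoretical": theoretical, "observed": observed}).fillna(0)
print(comparison.head(10))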
Evens and odds
even_freq = df["pares"].value_counts().sort_index() / len(df)
even_freq.plot(kind="bar", title="How many even digits in the winning number")
plt.show()
Repeated digits (how many unique digits)
uniq_freq = df["diferentes"].value_counts().sort_index() / len(df)
uniq_freq.plot(kind="bar", title="Unique digits in the winning number")
plt.show()
11111 looks unusual, but it’s not “less random.” It just conflicts with our expectation of variety.
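The same point can be checked combinatorially: repdigits feel special simply because there are very few of them (a quick sketch):
from math import perm

total = 100_000

# Numbers whose 5 digits are all different: 10 * 9 * 8 * 7 * 6
p_all_distinct = perm(10, 5) / total
# Numbers whose 5 digits are all identical: 00000, 11111, ..., 99999
p_all_same = 10 / total

print(f"All digits distinct: {p_all_distinct:.2%}")    # ~30.24%
print(f"All digits identical: {p_all_same:.4%}")       # 0.0100%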
“Hot numbers” and streaks: why they happen without magic
Two common biases:
- “It hasn’t appeared in a while, so it’s due” (gambler’s fallacy).
- “It’s appearing a lot, so it will keep appearing” (hot-hand fallacy).
We can look at streaks of the last digit to see why our brains love these stories.
last = df["last_digit"].to_numpy()
runs = []
current_digit = last[0]
length = 1
for x in last[1:]:
    if x == current_digit:
        length += 1
    else:
        runs.append((current_digit, length))
        current_digit = x
        length = 1
runs.append((current_digit, length))
run_lengths = pd.Series([l for _, l in runs])
print(run_lengths.value_counts().sort_index().head(12))
In long sequences, seeing a few eye-catching streaks is normal. Subjective “rarity” isn’t evidence of bias.
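For context, with a uniform last digit the run lengths should follow a geometric distribution: each extra repeat has probability 1/10, so P(length = k) = (9/10) · (1/10)^(k−1). A quick comparison, reusing run_lengths from above (a minimal sketch):
observed_runs = run_lengths.value_counts(normalize=True).sort_index()
theoretical_runs = pd.Series({k: 0.9 * 0.1 ** (k - 1) for k in observed_runs.index})
print(pd.DataFrame({"observed": observed_runs, "theoretical": theoretical_runs}))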
Monte Carlo simulation: another way to understand “time until you win”
Simulation doesn’t predict the next number. Its value is educational: it helps you visualize variability and typical waiting times.
import numpy as np
p = 1/100_000
N = 200_000
# Draws until the first success
times = np.random.geometric(p, size=N)
print("Median draws:", np.median(times))
print("Mean draws:", np.mean(times))
print("90th percentile:", np.percentile(times, 90))Technical choice: using geometric directly.
- Alternative: simulate draw-by-draw (repeated Bernoulli trials).
- Why I skip it here: it’s slower and doesn’t add intuition for the “typical waiting time” question (see the sketch below).
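For reference, this is roughly what that draw-by-draw version would look like (a minimal sketch, not the approach used above; it answers the same question, just more slowly):
rng = np.random.default_rng()

def draws_until_win(p=1/100_000, max_draws=1_000_000):
    # One Bernoulli trial per draw until the first exact match
    for n in range(1, max_draws + 1):
        if rng.random() < p:
            return n
    return max_draws  # truncated: no win within the simulated horizon

# Noticeably slower than sampling from the geometric distribution directly
sample = [draws_until_win() for _ in range(200)]
print("Median draws (draw-by-draw):", np.median(sample))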
The only real “optimization”: reduce the chance of sharing the prize
Here’s the nuance that’s usually missed:
- You cannot increase the probability of matching the exact number.
- But you can pick less common choices to reduce the chance of splitting the prize if you ever do win.
Examples of numbers many people buy (so they’re more likely to be shared):
- Dates (01024 for 01/01/24, etc.)
- Patterns (12345, 11111, 00000)
- “Pretty” or symmetric combinations
This doesn’t improve your probability of winning, but it can improve your outcome conditional on winning.
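There is no public data here on what other players actually buy, so any filter is a guess. As a purely illustrative heuristic, you could flag the obvious candidates yourself (a hypothetical sketch; the rules below are assumptions, not measured popularity):
def looks_popular(numero_5: str) -> bool:
    digits = [int(d) for d in numero_5]
    diffs = [b - a for a, b in zip(digits, digits[1:])]
    is_repdigit = len(set(digits)) == 1           # 00000, 11111, ...
    is_straight = all(d == 1 for d in diffs) or all(d == -1 for d in diffs)  # 12345, 54321
    is_palindrome = numero_5 == numero_5[::-1]    # symmetric, “pretty” numbers
    return is_repdigit or is_straight or is_palindrome

print(looks_popular("12345"))  # True
print(looks_popular("80417"))  # False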
Expected value (EV): connecting analysis to a real decision
If your goal is deciding whether it’s “worth playing,” the useful concept is expected value.
- EV = Σ(probability of each prize × prize amount) − cost
Important limitation: with only historical winning numbers, you can’t compute full EV unless you also include the official prize table and its probabilities.
Python template for when you have the prize table:
prizes = [
{"name": "Top prize", "p": 1/100_000, "amount": 300_000},
{"name": "Refund", "p": 0.3, "amount": 20},
]
cost = 20
ev = sum(x["p"] * x["amount"] for x in prizes) - cost
print("EV per ticket:", ev)What you can learn—and what you can’t “optimize”
- If your goal is the top prize, the base probability dominates. History doesn’t change it.
- If your goal is understanding how probability grows over time, cumulative probability and Monte Carlo explain it without tricks.
- If your goal is “picking better,” the honest conclusion is:
- you can’t increase the probability of matching the number,
- but you can avoid popular choices to reduce the chance of sharing the prize if you ever win.
Repository
Code and example dataset on GitHub:
FAQ
Does checking historical results help you pick numbers?
Not to increase the probability of matching the exact number. It does help you understand distributions and debunk biases.
If I play for years, do I get a “meaningful” probability?
It increases, but very slowly. The correct formula is: 1 − (1 − p)^n.
Do “hot numbers” exist?
Streaks happen naturally in random processes; seeing them isn’t evidence that the system is biased.
Is there any useful optimization?
Not to win more often. The practical optimization is reducing the risk of splitting the prize by avoiding popular choices (dates/patterns).