Probability of winning the National Lottery with Python (Monte Carlo)
Calculate the real probability of winning the National Lottery with Python: cumulative formula, Monte Carlo and why historical data doesn't predict.

Probability of Winning Spain’s National Lottery: Python Analysis, Historical Data, and Monte Carlo
In my family, a strange tradition has popped up over the last few years: we go to bingo once a year—specifically on Three Kings’ Day. I’m not really into gambling at all, so this habit sparked a few interesting conversations over the holidays. And that led to the inevitable question:
- How likely is it to actually win the lottery?
- If we all know the odds are tiny, can we optimize anything at all?
Before we go any further, one important warning: always gamble responsibly. Statistics are not on our side here. Still, this is a great excuse to learn applied statistics properly and debunk a bunch of common beliefs.
In this article, I’ll analyze typical behaviors around Spain’s National Lottery (a 5‑digit number):
- Does it make sense to check historical results to pick numbers?
- If I play for years, what is my real probability of hitting the big prize?
- If I can’t increase the probability of winning, can I reduce the chance of sharing the prize?
Scope note: I’m talking about the Spanish National Lottery (00000–99999). This is not Primitiva/Bonoloto/EuroMillions, which have different rules and odds.
What does “winning” actually mean?
Before touching code, we need to define the goal precisely. Depending on what you mean by “winning”, the analysis changes:
- Match the exact number (top prize associated with that number)
- Win something (any prize)
- Win more than you spend (profitability)
In this post I’ll focus on the most common meaning when people talk about “the jackpot”: matching the exact number.
Getting the data (and why I work locally)
First things first: where do we get the data?
The official site shows results, but I didn’t find a convenient bulk export for analysis (at least not at first glance). So I used a web-accessible dataset provided by the lottery retailer Eduardo Losilla, which exposes a historical record as JSON.
- Source:
https://api.eduardolosilla.es/botes/actuales?uts
Important for reproducibility: I download the JSON as a snapshot and work locally (for example, data/loteria.json). That way, the analysis does not depend on the API being available when someone else runs it.
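A minimal sketch of that download step (an assumption on tooling: I use requests here, but any HTTP client works; the snapshot path matches the one above):
import json
import requests

URL = "https://api.eduardolosilla.es/botes/actuales?uts"

response = requests.get(URL, timeout=30)
response.raise_for_status()

# Save a local snapshot so the analysis doesn't depend on the API being up
with open("data/loteria.json", "w", encoding="utf-8") as f:
    json.dump(response.json(), f, ensure_ascii=False)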
This gives us a JSON containing draws and several precomputed fields. For example:
{
"numero": 6703,
"fecha_sorteo": 1767654000,
"nombre": "Extraordinario del Niño",
"enlace_pdf_premios": null,
"temporada": 2026,
"numero_texto": "06703",
"suma": 16,
"pares": 3,
"impares": 2,
"bajos": 3,
"altos": 2,
"diferentes": 4,
"reduccion1d": 7
}
Here’s the first “small detail” that breaks analyses: numbers can start with 0. 06703 is not the same as 6703 if you treat it as an integer. We fix this by normalizing everything to 5 digits.
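A quick illustration of the pitfall, using the example number from the JSON above (a minimal sketch):
n = int("06703")        # parsing as an integer silently drops the leading zero
print(n)                 # 6703
print(str(n).zfill(5))   # "06703": zero-padding restores the 5-digit form
print(f"{n:05d}")        # same idea with an f-string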
The JSON also includes a leyenda (legend) explaining calculated fields (digit sum, evens/odds, low/high digits, etc.) and tipos_sorteo if you want to filter by draw type.
Loading into Pandas and normalizing the winning number
import json
import pandas as pd
path = "data/loteria.json" # adjust to your local snapshot
with open(path, "r", encoding="utf-8") as f:
    raw = json.load(f)
df = pd.json_normalize(raw["sorteos"])
# Date: epoch seconds -> datetime with timezone
df["fecha"] = (
pd.to_datetime(df["fecha_sorteo"], unit="s", utc=True)
.dt.tz_convert("Europe/Madrid")
)
# Winning number as 5 digits (zero-padded)
df["numero_5"] = df["numero_texto"].astype(str).str.zfill(5)
# Chronological order: useful for streaks and time analysis
df = df.sort_values("fecha").reset_index(drop=True)
df.head()
Minimum validations (not flashy, very necessary)
Before drawing conclusions, I check the basics: nulls, duplicates, lengths, and ranges.
print("Rows, columns:", df.shape)
null_rate = df.isna().mean().sort_values(ascending=False)
print(null_rate.head(12))
print(
"Duplicates:",
df.duplicated(subset=["fecha_sorteo", "numero_5", "nombre"]).sum()
)
print(df["numero_5"].str.len().value_counts())
print(df[["suma","pares","impares","bajos","altos","diferentes"]].describe())A common shortcut is: “drop rows with nulls and move on.”
- Why don’t I do that by default? Because sometimes what’s missing is an auxiliary column, and deleting rows can introduce bias. I’d rather understand what’s missing and why.
Theoretical probability: the baseline (and what history can’t change)
If the game is a 5‑digit number between 00000 and 99999, there are 100,000 possible combinations.
If we’re talking about matching an exact number (the top prize associated with that number), the probability per draw is:
- p = 1 / 100000
The useful question is:
“If I play many times, how much does my probability increase?”
Playing many times: cumulative probability without fooling yourself
The probability of winning at least once after n independent draws is not a simple sum. It’s:
- P(at least one) = 1 − (1 − p)ⁿ
p = 1/100_000
def prob_at_least_one(n, p=p):
    return 1 - (1 - p)**n
for n in [10, 100, 1_000, 10_000]:
    print(n, prob_at_least_one(n))
Honest interpretation:
- Playing more always increases your cumulative probability.
- But when p is 1/100,000, it increases painfully slowly.
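To make “painfully slowly” concrete, you can invert the formula and ask how many draws are needed to reach a given cumulative probability (a small sketch, same p as above):
import math

p = 1 / 100_000

def draws_needed(target, p=p):
    # Solve 1 - (1 - p)**n >= target for n
    return math.ceil(math.log(1 - target) / math.log(1 - p))

print("Draws for a 50% chance:", draws_needed(0.50))  # ~69,315 draws
print("Draws for a 90% chance:", draws_needed(0.90))  # ~230,258 draws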
What historical data is good for: auditing “randomness” with simple tests
Here we change goals. Instead of trying to predict, we use history to check whether outcomes look compatible with a reasonable random process.
Last digit: a quick sanity check
Over a long history, the last digit should be roughly uniform from 0 to 9.
import numpy as np
df["last_digit"] = df["numero_5"].str[-1].astype(int)
counts = df["last_digit"].value_counts().sort_index()
freq = counts / counts.sum()
print(freq.round(4))
Chi-square uniformity test (no jargon)
What it measures: whether the observed distribution deviates “too much” from a uniform one.
from scipy.stats import chisquare
observed = counts.values
expected = np.full_like(observed, observed.sum()/10, dtype=float)
chi2, pvalue = chisquare(observed, f_exp=expected)
print("chi2=", chi2, "pvalue=", pvalue)How to read it honestly:
- A high p-value does not prove randomness; it just means you don’t see strong evidence of deviation.
- A low p-value could mean a real bias… or a dataset artifact (missing draws, mixed draw types, rule changes).
Tip: repeat the test by nombre (draw type) and by time periods to reduce false positives.
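For example, a minimal sketch of repeating the test per draw type (assuming nombre identifies the draw type, as in the snapshot above):
from scipy.stats import chisquare

for name, group in df.groupby("nombre"):
    counts = group["last_digit"].value_counts().reindex(range(10), fill_value=0)
    if counts.sum() < 100:  # skip groups too small for a meaningful test
        continue
    expected = [counts.sum() / 10] * 10
    chi2, pvalue = chisquare(counts.values, f_exp=expected)
    print(f"{name}: n={counts.sum()}, p-value={pvalue:.3f}")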
When it “doesn’t look random”: patterns that trick intuition
Your dataset has perfect features to teach a key idea: many truly random results don’t feel random.
Digit sum
import matplotlib.pyplot as plt
df["suma"].plot(kind="hist", bins=range(0, 46), title="Distribution of digit sums")
plt.xlabel("Sum")
plt.show()
- Many people avoid 12345 because it “looks too obvious.”
- But if the event is “match this exact number,” 12345 and 80417 are equally likely.
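If you want to compare the observed histogram with the exact theoretical distribution, you can enumerate all 100,000 possible numbers (a minimal sketch; column names match the ones used above):
import pandas as pd

# Digit sum for every possible number 00000-99999
all_numbers = [f"{i:05d}" for i in range(100_000)]
theoretical_sums = pd.Series([sum(int(d) for d in s) for s in all_numbers])

theoretical = theoretical_sums.value_counts(normalize=True).sort_index()
observed = df["suma"].value_counts(normalize=True).sort_index()

comparison = pd.DataFrame({"theoretical": theoretical, "observed": observed}).fillna(0)
print(comparison.head(10))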
Evens and odds
even_freq = df["pares"].value_counts().sort_index() / len(df)
even_freq.plot(kind="bar", title="How many even digits in the winning number")
plt.show()
Repeated digits (how many unique digits)
uniq_freq = df["diferentes"].value_counts().sort_index() / len(df)
uniq_freq.plot(kind="bar", title="Unique digits in the winning number")
plt.show()
11111 looks unusual, but it’s not “less random.” It just conflicts with our expectation of variety.
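The same point can be checked combinatorially: repdigits feel special simply because there are very few of them (a quick sketch):
from math import perm

total = 100_000

# Numbers whose 5 digits are all different: 10 * 9 * 8 * 7 * 6
p_all_distinct = perm(10, 5) / total
# Numbers whose 5 digits are all identical: 00000, 11111, ..., 99999
p_all_same = 10 / total

print(f"All digits distinct: {p_all_distinct:.2%}")    # ~30.24%
print(f"All digits identical: {p_all_same:.4%}")       # 0.0100%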
“Hot numbers” and streaks: why they happen without magic
Two common biases:
- “It hasn’t appeared in a while, so it’s due” (gambler’s fallacy).
- “It’s appearing a lot, so it will keep appearing” (hot-hand fallacy).
We can look at streaks of the last digit to see why our brains love these stories.
last = df["last_digit"].to_numpy()
runs = []
current_digit = last[0]
length = 1
for x in last[1:]:
    if x == current_digit:
        length += 1
    else:
        runs.append((current_digit, length))
        current_digit = x
        length = 1
runs.append((current_digit, length))
run_lengths = pd.Series([l for _, l in runs])
print(run_lengths.value_counts().sort_index().head(12))
In long sequences, seeing a few eye-catching streaks is normal. Subjective “rarity” isn’t evidence of bias.
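For context, with a uniform last digit the run lengths should follow a geometric distribution: each extra repeat has probability 1/10, so P(length = k) = (9/10) · (1/10)^(k−1). A quick comparison, reusing run_lengths from above (a minimal sketch):
observed_runs = run_lengths.value_counts(normalize=True).sort_index()
theoretical_runs = pd.Series({k: 0.9 * 0.1 ** (k - 1) for k in observed_runs.index})
print(pd.DataFrame({"observed": observed_runs, "theoretical": theoretical_runs}))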
Monte Carlo simulation: another way to understand “time until you win”
Simulation doesn’t predict the next number. Its value is educational: it helps you visualize variability and typical waiting times.
import numpy as np
p = 1/100_000
N = 200_000
# Draws until the first success
times = np.random.geometric(p, size=N)
print("Median draws:", np.median(times))
print("Mean draws:", np.mean(times))
print("90th percentile:", np.percentile(times, 90))Technical choice: using geometric directly.
- Alternative: simulate draw-by-draw (repeated Bernoulli trials).
- Why I skip it here: it’s slower and doesn’t add intuition for the “typical waiting time” question (see the sketch below).
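For reference, this is roughly what that draw-by-draw version would look like (a minimal sketch, not the approach used above; it answers the same question, just more slowly):
rng = np.random.default_rng()

def draws_until_win(p=1/100_000, max_draws=1_000_000):
    # One Bernoulli trial per draw until the first exact match
    for n in range(1, max_draws + 1):
        if rng.random() < p:
            return n
    return max_draws  # truncated: no win within the simulated horizon

# Noticeably slower than sampling from the geometric distribution directly
sample = [draws_until_win() for _ in range(200)]
print("Median draws (draw-by-draw):", np.median(sample))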
The only real “optimization”: reduce the chance of sharing the prize
Here’s the nuance that’s usually missed:
- You cannot increase the probability of matching the exact number.
- But you can pick less common choices to reduce the chance of splitting the prize if you ever do win.
Examples of numbers many people buy (so they’re more likely to be shared):
- Dates (01024 for 01/01/24, etc.)
- Patterns (12345, 11111, 00000)
- “Pretty” or symmetric combinations
This doesn’t improve your probability of winning, but it can improve your outcome conditional on winning.
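There is no public data here on what other players actually buy, so any filter is a guess. As a purely illustrative heuristic, you could flag the obvious candidates yourself (a hypothetical sketch; the rules below are assumptions, not measured popularity):
def looks_popular(numero_5: str) -> bool:
    digits = [int(d) for d in numero_5]
    diffs = [b - a for a, b in zip(digits, digits[1:])]
    is_repdigit = len(set(digits)) == 1           # 00000, 11111, ...
    is_straight = all(d == 1 for d in diffs) or all(d == -1 for d in diffs)  # 12345, 54321
    is_palindrome = numero_5 == numero_5[::-1]    # symmetric, “pretty” numbers
    return is_repdigit or is_straight or is_palindrome

print(looks_popular("12345"))  # True
print(looks_popular("80417"))  # False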
Expected value (EV): connecting analysis to a real decision
If your goal is deciding whether it’s “worth playing,” the useful concept is expected value.
- EV = Σ(probability of each prize × prize amount) − cost
Important limitation: with only historical winning numbers, you can’t compute full EV unless you also include the official prize table and its probabilities.
Python template for when you have the prize table:
prizes = [
{"name": "Top prize", "p": 1/100_000, "amount": 300_000},
{"name": "Refund", "p": 0.3, "amount": 20},
]
cost = 20
ev = sum(x["p"] * x["amount"] for x in prizes) - cost
print("EV per ticket:", ev)What you can learn—and what you can’t “optimize”
- If your goal is the top prize, the base probability dominates. History doesn’t change it.
- If your goal is understanding how probability grows over time, cumulative probability and Monte Carlo explain it without tricks.
- If your goal is “picking better,” the honest conclusion is:
- you can’t increase the probability of matching the number,
- but you can avoid popular choices to reduce the chance of sharing the prize if you ever win.
Repository
Code and example dataset on GitHub:
FAQ
Does checking historical results help you pick numbers?
Not to increase the probability of matching the exact number. It does help you understand distributions and debunk biases.
If I play for years, do I get a “meaningful” probability?
It increases, but very slowly. The correct formula is: 1 − (1 − p)^n.
Do “hot numbers” exist?
Streaks happen naturally in random processes; seeing them isn’t evidence that the system is biased.
Is there any useful optimization?
Not to win more often. The practical optimization is reducing the risk of splitting the prize by avoiding popular choices (dates/patterns).