How to separate an AI PoC from a system you can actually maintain

What changes when an AI demo becomes a real product: logs, costs, evaluation, permissions, fallback and maintenance.

Roger Bosch May 18, 2026

Last week I saw a demo of an AI product that did something impressive: it took a 40-page legal document, analyzed it and generated an executive summary with the key points. In the demo it worked perfectly. Five minutes later, the founder told me they had been trying to put it into production for three months and could not. The summaries were inconsistent, costs had multiplied by six compared to the initial estimate and they had no way to know if the LLM was generating correct summaries without a lawyer reviewing them one by one.

I have come across that story more times than I would like. And every time the pattern is the same: someone builds a PoC with an LLM, it works well under controlled conditions, the company gets excited and suddenly you have to turn that into a real product. That is where everything gets complicated.

Because what separates a PoC from a production system is not the prompt quality or the model’s power. It is everything surrounding the model: logs, costs, evaluation, fallback, permissions, versioning and the ability to keep the system running when nobody is watching.

PoC vs production: what really changes

The difference between a PoC and a production system is not a matter of degree. It is a difference in nature. In a PoC you demonstrate that something is possible. In production you demonstrate that something is reliable, sustainable and maintainable over time.

Aspect	PoC	Production
Input data	Controlled, hand-picked	Whatever comes in
Volume	Tens of requests	Thousands or millions
Costs	“We will see”	Real budget line item
Model errors	“It fails sometimes, but it is cool”	Every error is a support ticket
Latency	Acceptable if under 30s	The user leaves if it exceeds 3s
Evaluation	Manual glance, “looks good”	Automated metrics, benchmarks
Logs	print() in the console	Structured observability
Prompts	A string in the code	Versioned, tested, documented
Fallback	Does not exist	Mandatory
Permissions and security	Localhost with your API key	Multi-tenant, RBAC, data isolation
Provider dependency	“We use OpenAI and that is it”	Plan B if the API goes down or raises prices
Human-in-the-loop	The developer themselves	Flow designed for non-technical users

This table is not a theoretical exercise. It is the list of things I have encountered when taking AI features in Rolsfera from “it works on my machine” to “it works every day without me keeping an eye on it.”

Real costs: the conversation nobody wants to have

In a PoC, API costs are a small number nobody questions. In production, that number becomes a budget line item that someone has to justify.

Let us do real math. Suppose a service that processes documents with an LLM:

# Cost estimate for LLM document processing
# Model: GPT-4o (example prices as of May 2026)

# Input: ~$5.00 / 1M tokens
# Output: ~$15.00 / 1M tokens

tokens_por_documento_input = 4000   # document of ~3000 words
tokens_por_documento_output = 800   # summary + classification
documentos_por_dia = 500

coste_input_diario = (tokens_por_documento_input * documentos_por_dia / 1_000_000) * 5.00
coste_output_diario = (tokens_por_documento_output * documentos_por_dia / 1_000_000) * 15.00
coste_diario = coste_input_diario + coste_output_diario
coste_mensual = coste_diario * 30

print(f"Daily cost:   ${coste_diario:.2f}")    # ~$16.00
print(f"Monthly cost: ${coste_mensual:.2f}")   # ~$480.00

480 dollars a month for 500 daily documents. Seems manageable. But now add:

Retries for errors. 5% of requests fail and get retried. That is 5% more cost.
Evaluation and testing. Every time you change a prompt, you need to test it against a reference dataset. Those are additional requests.
Growth. If the product succeeds, volume multiplies. 500 documents become 5000 and costs jump to $4800/month.
More powerful model. The client asks for higher accuracy. You upgrade the model and costs double.

The PoC cost multiplied by real volume does not give you the production cost. It gives you an optimistic estimate. Real costs always include retries, evaluation, traffic spikes and the inevitable temptation to use a more expensive model.
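
To make that gap concrete, here is a minimal sketch that extends the estimate above with two of those overheads. The function name `estimate_monthly_cost` and the evaluation volume (20 prompt test runs a month against 50 reference documents of the same size) are illustrative assumptions, not measurements; the prices are the example figures already used.

```python
def estimate_monthly_cost(
    docs_per_day: int,
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
    retry_rate: float = 0.0,
    eval_docs_per_month: int = 0,
    days: int = 30,
) -> dict:
    """Monthly cost in dollars, split into baseline, retries and evaluation."""
    cost_per_doc = (
        input_tokens / 1_000_000 * input_price_per_m
        + output_tokens / 1_000_000 * output_price_per_m
    )
    baseline = cost_per_doc * docs_per_day * days
    retries = baseline * retry_rate  # assumes each failed request is retried once
    evaluation = cost_per_doc * eval_docs_per_month
    return {
        "baseline": round(baseline, 2),
        "retries": round(retries, 2),
        "evaluation": round(evaluation, 2),
        "total": round(baseline + retries + evaluation, 2),
    }


estimate = estimate_monthly_cost(
    docs_per_day=500,
    input_tokens=4000,
    output_tokens=800,
    input_price_per_m=5.00,
    output_price_per_m=15.00,
    retry_rate=0.05,
    eval_docs_per_month=20 * 50,  # assumption: 20 prompt test runs x 50 reference documents
)
print(estimate)
```

Under these assumptions the $480 baseline grows to $536 a month before any traffic growth or model upgrade is counted.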

In Rolsfera I learned this quickly. My first version processed all articles with the most powerful model available. When I saw the first month’s bill, I restructured the pipeline to use small models for classification and reserve the large ones only for summaries of already-approved articles. Costs dropped by 60%.

# Cost strategy: pick the model according to the task
TASK_MODELS = {
    "classification": "gpt-4o-mini",      # cheap, good enough for categorizing
    "relevance_scoring": "gpt-4o-mini",   # does not need the large model
    "summary": "gpt-4o",                   # quality matters here
    "entity_extraction": "gpt-4o-mini",   # structured task, a small model is enough
}

def get_model_for_task(task: str) -> str:
    return TASK_MODELS.get(task, "gpt-4o-mini")  # default to the cheap model

Evaluation: how to know if the LLM is working well

This is probably the hardest problem when moving from PoC to production. In a PoC, evaluation is “I look at it and it seems correct.” In production, you need something more rigorous.

The problem of evaluating LLM outputs

An LLM is not deterministic (or at least not in practice with temperature greater than 0). The same input can generate slightly different outputs. And “slightly different” sometimes means “subtly incorrect.”
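
One way to put a number on that variability is to run the same input several times and measure how often the answers agree. The sketch below is only an illustration: `classify` stands for any function that calls the LLM and returns a category, and `fake_classify` is a hypothetical stub that simulates a model changing its answer once in five runs.

```python
from collections import Counter
from typing import Callable


def agreement_rate(classify: Callable[[str], str], text: str, runs: int = 5) -> float:
    """Share of runs that return the most common answer (1.0 = fully consistent)."""
    answers = [classify(text) for _ in range(runs)]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / runs


# Hypothetical stand-in for an LLM call: answers "tech" four times out of five.
_fake_answers = iter(["tech", "tech", "science", "tech", "tech"])


def fake_classify(text: str) -> str:
    return next(_fake_answers)


rate = agreement_rate(fake_classify, "Some article text", runs=5)
print(rate)  # 0.8
```

Inputs with a low agreement rate are good candidates for the reference dataset, since they are the ones most likely to flip when a prompt changes.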

For a classification task, evaluation is relatively easy: you compare the assigned category with the correct one and calculate accuracy. For a text generation task (summaries, responses), evaluation is much more complex.

How I do it in Rolsfera

My evaluation system has three levels:

Level 1: Structural validation. The output has the expected format. If I asked for JSON, is it valid JSON? If I asked for a 2-3 sentence summary, does it have between 1 and 5 sentences? If I asked for a score from 1 to 10, is it in that range?

def validate_llm_output(output: dict, task: str) -> list[str]:
    """Return a list of structural problems found in the output (empty if none)."""
    issues = []

    if task == "classification":
        # Keep in sync with the categories listed in the classification prompt
        valid_categories = ["tech", "politics", "economy", "science", "devops", "other"]
        if output.get("category") not in valid_categories:
            issues.append(f"Invalid category: {output.get('category')}")

    if task == "summary":
        summary = output.get("summary", "")
        # Naive sentence split; drop the empty piece left after the final period
        sentences = [s for s in summary.split(".") if s.strip()]
        if not 1 <= len(sentences) <= 5:
            issues.append(f"Summary has {len(sentences)} sentences (expected: 1-5)")
        if len(summary) < 50:
            issues.append("Summary too short")
        if len(summary) > 1000:
            issues.append("Summary too long")

    if task == "relevance_scoring":
        score = output.get("relevance_score", 0)
        if not isinstance(score, (int, float)) or score < 1 or score > 10:
            issues.append(f"Score out of range: {score}")

    return issues

Level 2: Evaluation against a reference dataset. I maintain a set of 50 manually classified and summarized articles. When I change a prompt, I run the pipeline against that dataset and compare results.

def evaluate_against_reference(
    pipeline_fn,
    reference_data: list[dict],
) -> dict:
    correct = 0
    total = len(reference_data)

    for item in reference_data:
        result = pipeline_fn(item["article"])
        if result["category"] == item["expected_category"]:
            correct += 1

    accuracy = correct / total if total > 0 else 0
    return {
        "accuracy": accuracy,
        "correct": correct,
        "total": total,
        "threshold": 0.85,  # minimum acceptable accuracy
        "passed": accuracy >= 0.85,
    }

Level 3: Editorial process feedback. When I review articles in Rolsfera and reject one that the AI had scored as relevant, or when I edit a summary because it was imprecise, that gets recorded. Over time, that data gives me a real measure of the AI pipeline’s quality.
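
A minimal sketch of how that feedback could be recorded and summarized. The `ReviewDecision` structure, its field names and the cutoff of 8 for "scored as relevant" are hypothetical, not Rolsfera's actual schema.

```python
from dataclasses import dataclass


@dataclass
class ReviewDecision:
    article_id: str
    ai_relevance_score: int  # score the pipeline assigned (1-10)
    approved: bool           # editor's final decision
    summary_edited: bool     # editor had to correct the summary


def feedback_metrics(decisions: list[ReviewDecision], relevant_from: int = 8) -> dict:
    """Summarize how often the editor disagreed with the pipeline."""
    flagged_relevant = [d for d in decisions if d.ai_relevance_score >= relevant_from]
    rejected_relevant = [d for d in flagged_relevant if not d.approved]
    approved = [d for d in decisions if d.approved]
    edited = [d for d in approved if d.summary_edited]
    return {
        # Articles the AI scored as relevant but the editor rejected
        "false_relevance_rate": (
            len(rejected_relevant) / len(flagged_relevant) if flagged_relevant else 0.0
        ),
        # Approved articles whose summary needed manual correction
        "summary_edit_rate": len(edited) / len(approved) if approved else 0.0,
    }


decisions = [
    ReviewDecision("a1", 9, approved=True, summary_edited=False),
    ReviewDecision("a2", 8, approved=False, summary_edited=False),
    ReviewDecision("a3", 5, approved=True, summary_edited=True),
    ReviewDecision("a4", 9, approved=True, summary_edited=True),
]
metrics = feedback_metrics(decisions)
print(metrics)
```

In this toy sample one of three "relevant" articles was rejected and two of three approved summaries were edited; tracked over weeks, a rise in either rate is the signal worth investigating.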

Prompt versioning

Prompts are code. Or they should be treated as such. In a PoC, the prompt is a string you edit whenever you want. In production, you need to know which prompt generated which output, when it changed and why.

# prompts/classification_v3.py
PROMPT_VERSION = "3.2.1"
PROMPT_DATE = "2026-05-15"
PROMPT_CHANGELOG = """
3.2.1 - Added 'devops' category, adjusted relevance threshold
3.2.0 - Simplified output format, removed 'confidence' field
3.1.0 - Added few-shot example to improve accuracy on the 'science' category
3.0.0 - Complete rewrite of the classification prompt
"""

CLASSIFICATION_PROMPT = """Classify the following article into one of these categories:
- tech: technology, programming, software, hardware
- politics: national or international politics
- economy: economy, finance, markets
- science: science, research, environment
- devops: infrastructure, CI/CD, cloud, monitoring
- other: does not fit any of the above

Article:
Title: {title}
Content: {content}

Respond ONLY with a JSON object:
{{"category": "...", "relevance_score": 1-10}}

Examples:
- "Kubernetes 1.30 introduces sidecar containers" → {{"category": "devops", "relevance_score": 8}}
- "New study links sleep to cognitive performance" → {{"category": "science", "relevance_score": 4}}
"""

Every time the prompt changes, the version gets incremented. Generated outputs are stored with the prompt version that produced them. That way, if I detect a quality degradation, I can trace exactly which prompt change caused it.

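import json
from datetime import datetime, timezone

# call_llm and count_tokens are pipeline helpers that are not shown here

def process_with_tracking(article: dict, prompt_template: str, prompt_version: str) -> dict:
    filled_prompt = prompt_template.format(
        title=article["title"],
        content=article["content"][:2500],  # truncate long articles to cap input size
    )

    response = call_llm(filled_prompt)
    result = json.loads(response)

    # Store traceability metadata alongside the output
    result["_meta"] = {
        "prompt_version": prompt_version,
        "model": "gpt-4o-mini",
        # Timezone-aware timestamp; datetime.utcnow() is deprecated since Python 3.12
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "input_tokens": count_tokens(filled_prompt),
        "output_tokens": count_tokens(response),
    }

    return result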
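
With that metadata stored, comparing prompt versions becomes a simple grouping. The sketch below assumes each stored record also carries a boolean `approved` field set during editorial review; that field and the function name are illustrative assumptions, not part of the code above.

```python
from collections import defaultdict


def approval_rate_by_prompt_version(records: list[dict]) -> dict[str, float]:
    """Group stored outputs by prompt version and compute each version's approval rate."""
    totals: dict[str, int] = defaultdict(int)
    approved: dict[str, int] = defaultdict(int)
    for record in records:
        version = record["_meta"]["prompt_version"]
        totals[version] += 1
        if record["approved"]:  # hypothetical field set during editorial review
            approved[version] += 1
    return {version: approved[version] / totals[version] for version in totals}


records = [
    {"_meta": {"prompt_version": "3.2.0"}, "approved": True},
    {"_meta": {"prompt_version": "3.2.0"}, "approved": True},
    {"_meta": {"prompt_version": "3.2.1"}, "approved": True},
    {"_meta": {"prompt_version": "3.2.1"}, "approved": False},
]
rates = approval_rate_by_prompt_version(records)
print(rates)  # {'3.2.0': 1.0, '3.2.1': 0.5}
```

A drop that starts exactly at one version points at the prompt change; a drop across all versions points at the sources or the model instead.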

Technical risks that do not show up in the PoC

Hallucinations

In a PoC, if the LLM invents a fact, you catch it because you are looking at the output. In production, hallucinations blend in with thousands of correct outputs and go unnoticed.

My strategy for mitigating hallucinations in Rolsfera:

Bounded tasks. I do not ask the LLM to generate new information. I ask it to classify, summarize or extract information from the text I give it. That reduces the space for invention.
Cross-validation. If the summary mentions a fact that does not appear in the original text, it is suspicious. I have a basic validation that compares entities in the summary with entities in the source text.
Structured outputs. Asking for JSON with specific fields is better than asking for free-form text. The constrained format reduces the hallucination surface.

import re

def check_hallucination_risk(original_text: str, summary: str) -> float:
    """Simple heuristic: share of entities in the summary that do NOT
    appear in the original text (0.0 = all found, 1.0 = none found)."""
    # Use capitalized words as a rough proxy for entities. This also picks up
    # sentence-initial words and misses acronyms, so treat the score as a hint.
    summary_entities = set(re.findall(r'\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*\b', summary))
    original_lower = original_text.lower()

    if not summary_entities:
        return 0.0

    found = sum(1 for e in summary_entities if e.lower() in original_lower)
    return 1.0 - (found / len(summary_entities))
    # 0.0 = every entity is in the original (good)
    # 1.0 = no entity is in the original (likely hallucination)

It is not a perfect system. But it catches the most flagrant cases.

Latency

An LLM takes between 1 and 15 seconds to respond, depending on the model, input size and server load. In a PoC, you wait. In production, if the user waits 10 seconds without feedback, they leave.

Solutions I apply:

Asynchronous processing. The user does not wait for the LLM. They submit the request and receive the result when it is ready (notification, polling, websocket).
Result caching. If the same input was already processed, I return the cached result. In Rolsfera, if an article was already classified, I do not run it through the LLM again.
Strict timeouts. If the LLM API does not respond within 20 seconds, the request fails and gets queued for retry. I never block the system waiting for a response that may not come.
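
The caching and timeout points can be sketched together. This is a minimal illustration under assumptions: the in-memory dictionary stands in for whatever store is actually used, `fake_llm_call` is a hypothetical placeholder for the real API call, and the timeout is enforced with a worker thread from the standard library.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable, Optional

_cache: dict[str, str] = {}  # stand-in for a persistent cache
_executor = ThreadPoolExecutor(max_workers=4)


def cached_call(text: str, llm_call: Callable[[str], str],
                timeout_s: float = 20.0) -> Optional[str]:
    """Return a cached result when available; otherwise call the LLM with a hard timeout.

    Returns None on timeout so the caller can queue the request for retry.
    """
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key in _cache:
        return _cache[key]

    future = _executor.submit(llm_call, text)
    try:
        result = future.result(timeout=timeout_s)
    except FutureTimeout:
        return None  # the caller stops waiting; the worker thread itself is not killed
    _cache[key] = result
    return result


call_log: list[str] = []


def fake_llm_call(text: str) -> str:
    """Hypothetical stand-in for the real API call."""
    call_log.append(text)
    return "tech"


first = cached_call("Kubernetes 1.30 released", fake_llm_call)
second = cached_call("Kubernetes 1.30 released", fake_llm_call)
print(first, second, len(call_log))  # tech tech 1
```

Hashing the input text means an article is only paid for once, and returning `None` on timeout keeps the decision about retrying in the caller rather than inside the wrapper.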

Provider dependency

If your system depends on a single LLM API, you have a single point of failure. The API can go down, raise prices, change its terms of service or deprecate the model you use.

# Multi-provider pattern with fallback
import logging

logger = logging.getLogger(__name__)

class AllProvidersFailedError(Exception):
    """Raised when every configured provider has failed."""

LLM_PROVIDERS = [
    {
        "name": "openai",
        "model": "gpt-4o-mini",
        "endpoint": "https://api.openai.com/v1/chat/completions",
        "priority": 1,
    },
    {
        "name": "anthropic",
        "model": "claude-sonnet",
        "endpoint": "https://api.anthropic.com/v1/messages",
        "priority": 2,
    },
]

def call_llm_with_fallback(prompt: str) -> str:
    # Try providers in priority order; the first successful response wins
    providers = sorted(LLM_PROVIDERS, key=lambda p: p["priority"])

    for provider in providers:
        try:
            # call_provider and APIError come from the provider client code (not shown)
            return call_provider(provider, prompt)
        except (APIError, TimeoutError) as e:
            logger.warning(
                f"Provider {provider['name']} failed: {e}. "
                f"Trying the next one..."
            )

    raise AllProvidersFailedError("All LLM providers have failed")

Human-in-the-loop: the pattern nobody wants to implement

In the PoC, the human is implicit: it is you looking at the results. In production, the human-in-the-loop has to be a designed flow, not an accident.

In Rolsfera, human review is part of the system’s design, not a patch. The LLM processes and suggests; I review and decide. That has architectural implications:

Review queue. By default, articles processed by AI are not published automatically. They arrive in a queue where I approve, edit or discard them. The queue is prioritized: articles from high-trust sources with high relevance scores appear first.

Feedback loop. My approval/rejection decisions are stored and used to evaluate the pipeline’s quality. If I start rejecting more articles than usual, something has changed (in the sources, the prompt or the model).

Escalation. If the LLM produces an output that does not pass structural validation, the article gets flagged for mandatory review instead of being discarded. Sometimes the problem is not the article, but the prompt.
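
The prioritization of the review queue can be as simple as a sort key. A minimal sketch, assuming each queued item already carries a `source_trust` value between 0 and 1 and the `relevance_score` from the pipeline; the field names and the choice to rank trust before relevance are illustrative assumptions.

```python
def prioritize_review_queue(queue: list[dict]) -> list[dict]:
    """Order the queue so high-trust, high-relevance articles come first."""
    return sorted(
        queue,
        key=lambda item: (item["source_trust"], item["relevance_score"]),
        reverse=True,
    )


queue = [
    {"title": "A", "source_trust": 0.6, "relevance_score": 9},
    {"title": "B", "source_trust": 0.95, "relevance_score": 7},
    {"title": "C", "source_trust": 0.95, "relevance_score": 9},
]
ordered = [item["title"] for item in prioritize_review_queue(queue)]
print(ordered)  # ['C', 'B', 'A']
```

Because the key is a tuple, source trust dominates and relevance only breaks ties; a weighted score would be the alternative if both should count at once.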

# Decision logic: when to auto-publish vs. request review
def decide_workflow(article: dict) -> str:
    ai_meta = article.get("ai_metadata", {})
    source_trust = get_source_trust_score(article["source_name"])
    relevance = ai_meta.get("relevance_score", 0)
    validation_issues = ai_meta.get("validation_issues", [])

    # Validation problems always go to human review
    if validation_issues:
        return "manual_review"

    # High-trust source + high relevance = eligible for auto-publishing
    if source_trust > 0.9 and relevance >= 8:
        return "auto_publish"

    # Everything else goes to the review queue
    return "manual_review"

In practice, fewer than 10% of articles go through auto-publishing. And that is fine. I would rather review more than necessary than publish something that should not have gone out.

Observability: logs that actually serve a purpose

The print() statements from the PoC have to become structured logs you can query, filter and aggregate.

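import time

import structlog

logger = structlog.get_logger()

def process_article_with_logging(article: dict) -> dict:
    # Bind the context once so every event for this article carries it
    log = logger.bind(article_url=article["url"], source=article["source_name"])
    log.info("processing_started")
    started = time.perf_counter()
    try:
        # process_with_tracking, CLASSIFICATION_PROMPT and PROMPT_VERSION
        # are the ones from the prompt versioning section
        result = process_with_tracking(article, CLASSIFICATION_PROMPT, PROMPT_VERSION)
        meta = result.get("_meta", {})
        log.info("processing_completed",
            category=result.get("category"),
            tokens_used=meta.get("input_tokens", 0) + meta.get("output_tokens", 0),
            latency_ms=round((time.perf_counter() - started) * 1000))
        return result
    except TimeoutError:
        log.error("llm_timeout", timeout_seconds=20)
        raise
    except Exception as e:
        log.error("processing_failed", error=str(e), exc_info=True)
        raise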

What I care about in production logs:

Per-request latency. If LLM requests start taking longer than usual, I want to know before users complain.
Tokens consumed. To control costs in real time, not at the end of the month.
Error rate. If the percentage of failed requests goes above 5%, something is wrong.
Category distribution. If suddenly 80% of articles are classified as “other,” the prompt probably needs adjustments.
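
The error rate and category distribution can be derived from the structured events themselves. The sketch below assumes a window of events has been collected as dictionaries with the `event` and `category` keys used in the logging example; the thresholds mirror the 5% and 80% figures above, and the function name is hypothetical.

```python
from collections import Counter


def check_pipeline_health(events: list[dict], max_error_rate: float = 0.05,
                          max_other_share: float = 0.8) -> list[str]:
    """Return alert messages derived from a window of log events."""
    completed = [e for e in events if e["event"] == "processing_completed"]
    failed = [e for e in events if e["event"] in ("processing_failed", "llm_timeout")]
    alerts = []

    finished = len(completed) + len(failed)
    if finished and len(failed) / finished > max_error_rate:
        alerts.append(f"error rate {len(failed) / finished:.0%} above {max_error_rate:.0%}")

    categories = Counter(e.get("category") for e in completed)
    if completed and categories["other"] / len(completed) >= max_other_share:
        alerts.append("most articles classified as 'other': review the prompt")

    return alerts


events = (
    [{"event": "processing_completed", "category": "other"}] * 8
    + [{"event": "processing_completed", "category": "tech"}]
    + [{"event": "llm_timeout"}]
)
alerts = check_pipeline_health(events)
print(alerts)
```

In this sample window one request in ten failed and eight of nine completed articles landed in "other", so both alerts fire.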

The checklist I use before taking AI to production

After several iterations, I have settled on this verification list. It is not exhaustive, but it covers what has caused me the most problems:

Area	What to verify
Costs	Monthly estimate with real volume. Models per task. Spend alerts. Plan B if prices go up.
Evaluation	Reference dataset. Structural validation. Quality metrics. Update process.
Prompts	Versioning. Changelog. Tests when changing prompts. Output-to-prompt traceability.
Resilience	Timeouts. Retries with backoff. Multi-provider fallback. Behavior without AI.
Observability	Structured logs. Metrics (latency, tokens, errors). Alerts. Dashboard.
Human-in-the-loop	Review flow. Feedback loop. Escalation. Docs for non-technical reviewers.
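
Two of the resilience items in the table, retries with backoff and behavior without AI, fit in a few lines. This is an illustration under assumptions: `flaky_call` simulates a failing API, the delays are shortened for the example, and returning a `needs_manual_review` placeholder is one possible degraded behavior, not a prescribed one.

```python
import time
from typing import Callable


def call_with_backoff(fn: Callable[[], dict], max_attempts: int = 3,
                      base_delay_s: float = 1.0) -> dict:
    """Retry fn with exponential backoff; degrade to a non-AI result if every attempt fails."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except (TimeoutError, ConnectionError):
            if attempt < max_attempts - 1:
                time.sleep(base_delay_s * 2 ** attempt)  # delay doubles after each failure
    # Behavior without AI: hand the item to a human instead of failing the request
    return {"category": None, "status": "needs_manual_review"}


attempts = []


def flaky_call() -> dict:
    """Hypothetical API call that fails twice before succeeding."""
    attempts.append(1)
    if len(attempts) < 3:
        raise TimeoutError("simulated timeout")
    return {"category": "tech", "status": "ok"}


result = call_with_backoff(flaky_call, base_delay_s=0.01)
print(result, len(attempts))  # {'category': 'tech', 'status': 'ok'} 3
```

Only transient errors are retried here; anything else propagates immediately, so a bug in the calling code does not get hidden behind three silent retries.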

Final thoughts

The PoC demonstrates what is possible. Production demonstrates what is sustainable. And between the two there is a chasm of work that is not visible in demos or Twitter threads.

I am not writing this to discourage anyone. AI is genuinely useful, and LLMs open possibilities that two years ago were science fiction. But the honest conversation is this: building a demo with an LLM takes a day. Taking that demo to production with reliability, controlled costs and reasonable maintenance takes months.

What I have learned with Rolsfera is that the success of an AI system does not depend on choosing the right model. It depends on building everything around the model: the evaluation that tells you if it works, the logs that tell you when it fails, the costs that tell you if it is sustainable and the human-in-the-loop that tells you if the output makes sense in the real world.

If you are in the PoC phase and being asked to move to production, this checklist is a good starting point. And if someone tells you it is just a matter of “putting the prompt behind an endpoint,” now you know why that is not enough.

Tags: #ai #production #llm #observability #costs #evaluation #architecture #maintainability
