How to separate an AI PoC from a system you can actually maintain
What changes when an AI demo becomes a real product: logs, costs, evaluation, permissions, fallback and maintenance.

Last week I saw a demo of an AI product that did something impressive: it took a 40-page legal document, analyzed it and generated an executive summary with the key points. In the demo it worked perfectly. Five minutes later, the founder told me they had been trying to put it into production for three months and could not. The summaries were inconsistent, costs had multiplied by six compared to the initial estimate and they had no way to know if the LLM was generating correct summaries without a lawyer reviewing them one by one.
I have come across that story more times than I would like. And every time the pattern is the same: someone builds a PoC with an LLM, it works well under controlled conditions, the company gets excited and suddenly you have to turn that into a real product. That is where everything gets complicated.
Because what separates a PoC from a production system is not the prompt quality or the model’s power. It is everything surrounding the model: logs, costs, evaluation, fallback, permissions, versioning and the ability to keep the system running when nobody is watching.
PoC vs production: what really changes
The difference between a PoC and a production system is not a matter of degree. It is a difference in nature. In a PoC you demonstrate that something is possible. In production you demonstrate that something is reliable, sustainable and maintainable over time.
| Aspect | PoC | Production |
|---|---|---|
| Input data | Controlled, hand-picked | Whatever comes in |
| Volume | Tens of requests | Thousands or millions |
| Costs | “We will see” | Real budget line item |
| Model errors | “It fails sometimes, but it is cool” | Every error is a support ticket |
| Latency | Acceptable if under 30s | The user leaves if it exceeds 3s |
| Evaluation | Manual glance, “looks good” | Automated metrics, benchmarks |
| Logs | print() in the console | Structured observability |
| Prompts | A string in the code | Versioned, tested, documented |
| Fallback | Does not exist | Mandatory |
| Permissions and security | Localhost with your API key | Multi-tenant, RBAC, data isolation |
| Provider dependency | “We use OpenAI and that is it” | Plan B if the API goes down or raises prices |
| Human-in-the-loop | The developer themselves | Flow designed for non-technical users |
This table is not a theoretical exercise. It is the list of things I have encountered when taking AI features in Rolsfera from “it works on my machine” to “it works every day without me keeping an eye on it.”
Real costs: the conversation nobody wants to have
In a PoC, API costs are a small number nobody questions. In production, that number becomes a budget line item that someone has to justify.
Let us do real math. Suppose a service that processes documents with an LLM:
```python
# Cost estimate for LLM document processing
# Model: GPT-4o (example prices as of May 2026)
# Input:  ~$5.00 / 1M tokens
# Output: ~$15.00 / 1M tokens
tokens_por_documento_input = 4000   # document of ~3000 words
tokens_por_documento_output = 800   # summary + classification
documentos_por_dia = 500

coste_input_diario = (tokens_por_documento_input * documentos_por_dia / 1_000_000) * 5.00
coste_output_diario = (tokens_por_documento_output * documentos_por_dia / 1_000_000) * 15.00
coste_diario = coste_input_diario + coste_output_diario
coste_mensual = coste_diario * 30

print(f"Daily cost: ${coste_diario:.2f}")      # ~$16.00
print(f"Monthly cost: ${coste_mensual:.2f}")   # ~$480.00
```
480 dollars a month for 500 daily documents. Seems manageable. But now add:
- Retries for errors. 5% of requests fail and get retried. That is 5% more cost.
- Evaluation and testing. Every time you change a prompt, you need to test it against a reference dataset. Those are additional requests.
- Growth. If the product succeeds, volume multiplies. 500 documents become 5000 and costs jump to $4800/month.
- More powerful model. The client asks for higher accuracy. You upgrade the model and costs double.
The PoC cost multiplied by real volume does not give you the production cost. It gives you an optimistic estimate. Real costs always include retries, evaluation, traffic spikes and the inevitable temptation to use a more expensive model.
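A minimal sketch of how those extra factors could be folded into the same estimate. The token counts, prices and 5% retry rate come from the example above; the function name `estimate_monthly_cost`, the `eval_overhead` parameter and its 10% value are illustrative assumptions, not figures from Rolsfera.

```python
def estimate_monthly_cost(
    docs_per_day: int,
    input_tokens: int = 4000,
    output_tokens: int = 800,
    input_price: float = 5.00,    # $ per 1M input tokens (example price)
    output_price: float = 15.00,  # $ per 1M output tokens (example price)
    retry_rate: float = 0.05,     # share of requests that are retried once
    eval_overhead: float = 0.0,   # extra share of requests spent on prompt testing (assumption)
    days: int = 30,
) -> float:
    """Monthly API cost in dollars, including retries and evaluation runs."""
    cost_per_doc = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    effective_docs_per_day = docs_per_day * (1 + retry_rate + eval_overhead)
    return cost_per_doc * effective_docs_per_day * days


baseline = estimate_monthly_cost(500, retry_rate=0.0)
with_overheads = estimate_monthly_cost(500, retry_rate=0.05, eval_overhead=0.10)
after_growth = estimate_monthly_cost(5000, retry_rate=0.05, eval_overhead=0.10)
print(f"baseline={baseline:.2f} with_overheads={with_overheads:.2f} after_growth={after_growth:.2f}")
```

With these assumed overheads the $480 baseline becomes about $552, and about $5,520 at ten times the volume, before any model upgrade.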
In Rolsfera I learned this quickly. My first version processed all articles with the most powerful model available. When I saw the first month’s bill, I restructured the pipeline to use small models for classification and reserve the large ones only for summaries of already-approved articles. Costs dropped by 60%.
```python
# Cost strategy: pick the model according to the task
TASK_MODELS = {
    "classification": "gpt-4o-mini",     # cheap, good enough for categorizing
    "relevance_scoring": "gpt-4o-mini",  # does not need the large model
    "summary": "gpt-4o",                 # this is where quality matters
    "entity_extraction": "gpt-4o-mini",  # structured task, a small model is enough
}


def get_model_for_task(task: str) -> str:
    return TASK_MODELS.get(task, "gpt-4o-mini")  # default to the cheap model
```
Evaluation: how to know if the LLM is working well
This is probably the hardest problem when moving from PoC to production. In a PoC, evaluation is “I look at it and it seems correct.” In production, you need something more rigorous.
The problem of evaluating LLM outputs
An LLM is not deterministic (or at least not in practice with temperature greater than 0). The same input can generate slightly different outputs. And “slightly different” sometimes means “subtly incorrect.”
For a classification task, evaluation is relatively easy: you compare the assigned category with the correct one and calculate accuracy. For a text generation task (summaries, responses), evaluation is much more complex.
How I do it in Rolsfera
My evaluation system has three levels:
Level 1: Structural validation. The output has the expected format. If I asked for JSON, is it valid JSON? If I asked for a 2-3 sentence summary, does it have between 1 and 5 sentences? If I asked for a score from 1 to 10, is it in that range?
```python
def validate_llm_output(output: dict, task: str) -> list[str]:
    """Return a list of structural issues; an empty list means the output passed."""
    issues = []
    if task == "classification":
        # Keep in sync with the categories listed in the classification prompt
        valid_categories = ["tech", "politics", "economy", "science", "devops", "other"]
        if output.get("category") not in valid_categories:
            issues.append(f"Invalid category: {output.get('category')}")
    if task == "summary":
        summary = output.get("summary", "")
        # Drop empty fragments: "A. B.".split(".") yields a trailing ""
        sentences = [s for s in summary.split(".") if s.strip()]
        if len(sentences) < 1 or len(sentences) > 5:
            issues.append(f"Summary has {len(sentences)} sentences (expected: 1-5)")
        if len(summary) < 50:
            issues.append("Summary too short")
        if len(summary) > 1000:
            issues.append("Summary too long")
    if task == "relevance_scoring":
        score = output.get("relevance_score", 0)
        if not isinstance(score, (int, float)) or score < 1 or score > 10:
            issues.append(f"Score out of range: {score}")
    return issues
```
Level 2: Evaluation against a reference dataset. I maintain a set of 50 manually classified and summarized articles. When I change a prompt, I run the pipeline against that dataset and compare results.
```python
def evaluate_against_reference(
    pipeline_fn,
    reference_data: list[dict],
    threshold: float = 0.85,  # minimum acceptable accuracy
) -> dict:
    """Run the pipeline over the reference set and report classification accuracy."""
    correct = 0
    total = len(reference_data)
    for item in reference_data:
        result = pipeline_fn(item["article"])
        if result["category"] == item["expected_category"]:
            correct += 1
    accuracy = correct / total if total > 0 else 0
    return {
        "accuracy": accuracy,
        "correct": correct,
        "total": total,
        "threshold": threshold,
        "passed": accuracy >= threshold,
    }
```
Level 3: Editorial process feedback. When I review articles in Rolsfera and reject one that the AI had scored as relevant, or when I edit a summary because it was imprecise, that gets recorded. Over time, that data gives me a real measure of the AI pipeline’s quality.
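As a rough illustration of Level 3, recorded review decisions could be reduced to a rejection rate and an edit rate. The record shape (`{"action": ...}`) and the function name `editorial_feedback_rates` are assumptions made for this sketch, not the way Rolsfera stores its data.

```python
from collections import Counter


def editorial_feedback_rates(decisions: list[dict]) -> dict:
    """Summarize review decisions.

    Each decision is assumed to look like
    {"action": "approved" | "edited" | "rejected"}.
    """
    total = len(decisions)
    if total == 0:
        return {"total": 0, "rejection_rate": 0.0, "edit_rate": 0.0}
    counts = Counter(d["action"] for d in decisions)
    return {
        "total": total,
        "rejection_rate": counts["rejected"] / total,
        "edit_rate": counts["edited"] / total,
    }


sample = [
    {"action": "approved"},
    {"action": "approved"},
    {"action": "edited"},
    {"action": "rejected"},
]
print(editorial_feedback_rates(sample))
# {'total': 4, 'rejection_rate': 0.25, 'edit_rate': 0.25}
```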
Prompt versioning
Prompts are code. Or they should be treated as such. In a PoC, the prompt is a string you edit whenever you want. In production, you need to know which prompt generated which output, when it changed and why.
```python
# prompts/classification_v3.py
PROMPT_VERSION = "3.2.1"
PROMPT_DATE = "2026-05-15"
PROMPT_CHANGELOG = """
3.2.1 - Added 'devops' category, adjusted relevance threshold
3.2.0 - Simplified output format, removed 'confidence' field
3.1.0 - Added few-shot example to improve accuracy on the 'science' category
3.0.0 - Complete rewrite of the classification prompt
"""

# Double braces are literal braces; {title} and {content} are filled by str.format()
CLASSIFICATION_PROMPT = """Classify the following article into one of these categories:
- tech: technology, programming, software, hardware
- politics: national or international politics
- economy: economy, finance, markets
- science: science, research, environment
- devops: infrastructure, CI/CD, cloud, monitoring
- other: does not fit any of the above
Article:
Title: {title}
Content: {content}
Respond ONLY with a JSON object:
{{"category": "...", "relevance_score": 1-10}}
Examples:
- "Kubernetes 1.30 introduces sidecar containers" → {{"category": "devops", "relevance_score": 8}}
- "New study links sleep to cognitive performance" → {{"category": "science", "relevance_score": 4}}
"""
```
Every time the prompt changes, the version gets incremented. Generated outputs are stored with the prompt version that produced them. That way, if I detect a quality degradation, I can trace exactly which prompt change caused it.
```python
import json
from datetime import datetime, timezone

# call_llm and count_tokens are helpers defined elsewhere in the pipeline


def process_with_tracking(article: dict, prompt_template: str, prompt_version: str) -> dict:
    filled_prompt = prompt_template.format(
        title=article["title"],
        content=article["content"][:2500],  # first 2500 characters only
    )
    response = call_llm(filled_prompt)
    result = json.loads(response)
    # Store traceability metadata alongside the output
    result["_meta"] = {
        "prompt_version": prompt_version,
        "model": "gpt-4o-mini",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "input_tokens": count_tokens(filled_prompt),
        "output_tokens": count_tokens(response),
    }
    return result
```
Technical risks that do not show up in the PoC
Hallucinations
In a PoC, if the LLM invents a fact, you catch it because you are looking at the output. In production, hallucinations blend in with thousands of correct outputs and go unnoticed.
My strategy for mitigating hallucinations in Rolsfera:
- Bounded tasks. I do not ask the LLM to generate new information. I ask it to classify, summarize or extract information from the text I give it. That reduces the space for invention.
- Cross-validation. If the summary mentions a fact that does not appear in the original text, it is suspicious. I have a basic validation that compares entities in the summary with entities in the source text.
- Structured outputs. Asking for JSON with specific fields is better than asking for free-form text. The constrained format reduces the hallucination surface.
```python
import re


def check_hallucination_risk(original_text: str, summary: str) -> float:
    """Simple heuristic: the share of entities in the summary
    that do NOT appear in the original text."""
    # Use capitalized words as a proxy for entities
    summary_entities = set(re.findall(r'\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*\b', summary))
    original_lower = original_text.lower()
    if not summary_entities:
        return 0.0
    found = sum(1 for e in summary_entities if e.lower() in original_lower)
    return 1.0 - (found / len(summary_entities))

# 0.0 = every entity is in the original (good)
# 1.0 = no entity is in the original (likely hallucination)
```
It is not a perfect system. But it catches the most flagrant cases.
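The structured-outputs point can be made concrete with a strict parser that rejects anything that is not the expected JSON object. This is a sketch under assumptions: the field names follow the classification prompt in this article, while `parse_structured_output` and `REQUIRED_FIELDS` are hypothetical names introduced here.

```python
import json
from typing import Optional

# Expected fields and their types (taken from the classification prompt's JSON format)
REQUIRED_FIELDS = {"category": str, "relevance_score": (int, float)}


def parse_structured_output(raw: str) -> tuple[Optional[dict], list[str]]:
    """Parse an LLM response that should be a single JSON object.

    Returns (data, issues). data is None when the response is not a JSON object.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"Invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["Top-level JSON value is not an object"]
    issues = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            issues.append(f"Missing field: {field}")
        elif not isinstance(data[field], expected_type):
            issues.append(f"Wrong type for field: {field}")
    return data, issues
```

Anything that comes back with a non-empty issue list can be routed to review instead of being silently accepted.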
Latency
An LLM call typically takes between 1 and 15 seconds to respond, depending on the model, input size and server load. In a PoC, you wait. In production, if the user waits 10 seconds without feedback, they leave.
Solutions I apply:
- Asynchronous processing. The user does not wait for the LLM. They submit the request and receive the result when it is ready (notification, polling, websocket).
- Result caching. If the same input was already processed, I return the cached result. In Rolsfera, if an article was already classified, I do not run it through the LLM again.
- Strict timeouts. If the LLM API does not respond within 20 seconds, the request fails and gets queued for retry. I never block the system waiting for a response that may not come.
Provider dependency
If your system depends on a single LLM API, you have a single point of failure. The API can go down, raise prices, change its terms of service or deprecate the model you use.
```python
import logging

logger = logging.getLogger(__name__)


class APIError(Exception):
    """Placeholder for the error type raised by the provider client."""


class AllProvidersFailedError(Exception):
    """Raised when every configured provider has failed."""


# Multi-provider pattern with fallback
LLM_PROVIDERS = [
    {
        "name": "openai",
        "model": "gpt-4o-mini",
        "endpoint": "https://api.openai.com/v1/chat/completions",
        "priority": 1,
    },
    {
        "name": "anthropic",
        "model": "claude-sonnet",
        "endpoint": "https://api.anthropic.com/v1/messages",
        "priority": 2,
    },
]


def call_llm_with_fallback(prompt: str) -> str:
    # call_provider (defined elsewhere) performs the actual request
    providers = sorted(LLM_PROVIDERS, key=lambda p: p["priority"])
    for provider in providers:
        try:
            return call_provider(provider, prompt)
        except (APIError, TimeoutError) as e:
            logger.warning("Provider %s failed: %s. Trying next...", provider["name"], e)
    raise AllProvidersFailedError("All LLM providers have failed")
```
Human-in-the-loop: the pattern nobody wants to implement
In the PoC, the human is implicit: it is you looking at the results. In production, the human-in-the-loop has to be a designed flow, not an accident.
In Rolsfera, human review is part of the system’s design, not a patch. The LLM processes and suggests; I review and decide. That has architectural implications:
Review queue. Articles processed by AI are not published automatically. They arrive in a queue where I approve, edit or discard them. The queue is prioritized: articles from high-trust sources with high relevance scores appear first.
Feedback loop. My approval/rejection decisions are stored and used to evaluate the pipeline’s quality. If I start rejecting more articles than usual, something has changed (in the sources, the prompt or the model).
Escalation. If the LLM produces an output that does not pass structural validation, the article gets flagged for mandatory review instead of being discarded. Sometimes the problem is not the article, but the prompt.
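A minimal sketch of the first two ideas: ordering the review queue and noticing when the rejection rate moves. The exact ordering rule (trust first, then relevance), the field names and the 10-point tolerance are assumptions for illustration; the article only states that high-trust, high-relevance articles appear first.

```python
def review_queue(articles: list[dict]) -> list[dict]:
    """Order articles for review: highest source trust first, then highest relevance."""
    return sorted(
        articles,
        key=lambda a: (-a.get("source_trust", 0.0), -a.get("relevance_score", 0)),
    )


def rejection_rate_drifted(recent_rate: float, baseline_rate: float, tolerance: float = 0.10) -> bool:
    """True when recent rejections exceed the usual rate by more than the tolerance."""
    return recent_rate - baseline_rate > tolerance


pending = [
    {"title": "B", "source_trust": 0.95, "relevance_score": 6},
    {"title": "C", "source_trust": 0.50, "relevance_score": 10},
    {"title": "A", "source_trust": 0.95, "relevance_score": 9},
]
print([a["title"] for a in review_queue(pending)])  # ['A', 'B', 'C']
```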
```python
# Decision logic: when to publish automatically vs. ask for review
def decide_workflow(article: dict) -> str:
    ai_meta = article.get("ai_metadata", {})
    # get_source_trust_score is defined elsewhere in the pipeline
    source_trust = get_source_trust_score(article["source_name"])
    relevance = ai_meta.get("relevance_score", 0)
    validation_issues = ai_meta.get("validation_issues", [])
    # Any validation problem always means human review
    if validation_issues:
        return "manual_review"
    # High-trust source + high relevance = can go out automatically
    if source_trust > 0.9 and relevance >= 8:
        return "auto_publish"
    # Everything else goes to the review queue
    return "manual_review"
```
In practice, fewer than 10% of articles go through auto-publishing. And that is fine. I would rather review more than necessary than publish something that should not have gone out.
Observability: logs that actually serve a purpose
The print() statements from the PoC have to become structured logs you can query, filter and aggregate.
```python
import time

import structlog

logger = structlog.get_logger()


def process_article_with_logging(article: dict) -> dict:
    log = logger.bind(article_url=article["url"], source=article["source_name"])
    log.info("processing_started")
    started = time.perf_counter()
    try:
        # process_with_tracking attaches the _meta block read below
        result = process_with_tracking(article, CLASSIFICATION_PROMPT, PROMPT_VERSION)
        meta = result.get("_meta", {})
        log.info(
            "processing_completed",
            category=result.get("category"),
            tokens_used=meta.get("input_tokens", 0) + meta.get("output_tokens", 0),
            latency_ms=round((time.perf_counter() - started) * 1000),
        )
        return result
    except TimeoutError:
        log.error("llm_timeout", timeout_seconds=20)
        raise
    except Exception as e:
        log.error("processing_failed", error=str(e), exc_info=True)
        raise
```
What I care about in production logs:
- Per-request latency. If LLM requests start taking longer than usual, I want to know before users complain.
- Tokens consumed. To control costs in real time, not at the end of the month.
- Error rate. If the percentage of failed requests goes above 5%, something is wrong.
- Category distribution. If suddenly 80% of articles are classified as “other,” the prompt probably needs adjustments.
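Two of these signals, error rate and category distribution, can be checked over a rolling window of recent requests. The 5% and 80% thresholds come from the list above; the `PipelineMonitor` class, the window size and the alert names are assumptions made for this sketch.

```python
from collections import Counter, deque
from typing import Optional


class PipelineMonitor:
    """Keeps the last `window` requests and reports simple threshold alerts."""

    def __init__(self, window: int = 200, max_error_rate: float = 0.05,
                 max_other_share: float = 0.80):
        self.events: deque = deque(maxlen=window)  # oldest events drop off automatically
        self.max_error_rate = max_error_rate
        self.max_other_share = max_other_share

    def record(self, ok: bool, category: Optional[str] = None) -> None:
        self.events.append((ok, category))

    def alerts(self) -> list[str]:
        if not self.events:
            return []
        found = []
        errors = sum(1 for ok, _ in self.events if not ok)
        if errors / len(self.events) > self.max_error_rate:
            found.append("error_rate_high")
        # Category share is computed over successful requests only
        categories = [c for ok, c in self.events if ok and c is not None]
        if categories:
            other_share = Counter(categories)["other"] / len(categories)
            if other_share >= self.max_other_share:
                found.append("category_other_dominant")
        return found
```

Calling `record()` once per request and `alerts()` on a schedule is enough to turn the structured logs into something that warns you.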
The checklist I use before taking AI to production
After several iterations, I have settled on this verification list. It is not exhaustive, but it covers what has caused me the most problems:
| Area | Key questions |
|---|---|
| Costs | Monthly estimate with real volume. Models per task. Spend alerts. Plan B if prices go up. |
| Evaluation | Reference dataset. Structural validation. Quality metrics. Update process. |
| Prompts | Versioning. Changelog. Tests when changing prompts. Output-to-prompt traceability. |
| Resilience | Timeouts. Retries with backoff. Multi-provider fallback. Behavior without AI. |
| Observability | Structured logs. Metrics (latency, tokens, errors). Alerts. Dashboard. |
| Human-in-the-loop | Review flow. Feedback loop. Escalation. Docs for non-technical reviewers. |
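One Resilience item that the earlier examples do not show is retries with backoff. A minimal sketch, assuming exponential delays with a small random jitter; the function name, the default delays and the set of retryable exceptions are illustrative choices, not values from Rolsfera.

```python
import random
import time


def call_with_backoff(
    fn,
    max_attempts: int = 4,
    base_delay: float = 1.0,   # seconds before the first retry
    max_delay: float = 30.0,   # cap on the exponential delay
    retry_on: tuple = (TimeoutError, ConnectionError),
    sleep=time.sleep,          # injectable so the function can be tested without waiting
):
    """Call fn(); on a retryable error wait 1s, 2s, 4s... (plus jitter) and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller handle the failure
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Up to 10% jitter so that many clients do not retry in lockstep
            sleep(delay + random.uniform(0, delay * 0.1))
```

Wrapping a provider call looks like `call_with_backoff(lambda: call_llm_with_fallback(prompt))`. Errors outside `retry_on` propagate immediately.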
Final thoughts
The PoC demonstrates what is possible. Production demonstrates what is sustainable. And between the two there is a chasm of work that is not visible in demos or Twitter threads.
I am not writing this to discourage anyone. AI is genuinely useful, and LLMs open possibilities that two years ago were science fiction. But the honest conversation is this: building a demo with an LLM takes a day. Taking that demo to production with reliability, controlled costs and reasonable maintenance takes months.
What I have learned with Rolsfera is that the success of an AI system does not depend on choosing the right model. It depends on building everything around the model: the evaluation that tells you if it works, the logs that tell you when it fails, the costs that tell you if it is sustainable and the human-in-the-loop that tells you if the output makes sense in the real world.
If you are in the PoC phase and being asked to move to production, this checklist is a good starting point. And if someone tells you it is just a matter of “putting the prompt behind an endpoint,” now you know why that is not enough.


