Energy Trading Assistant
An intraday trading team wanted faster, trustworthy explanations of price moves and position changes during volatile windows. Their LLM-powered assistant pulled context from market notes, EFA calendars, strategy wikis, and trade logs, but responses sometimes “sounded right” without being grounded. We introduced a focused reliability program to raise faithfulness, cut latency at peaks, and keep costs predictable.
Objectives
-
Increase grounded-answer rate on trade and market queries.
-
Reduce unsupported claims and policy risks in operator-facing answers.
-
Hit a p95 latency budget suited to auction/renomination bursts.
-
Establish continuous oversight so quality doesn’t drift after releases.
Our Solution
We split the problem into dialogue and RAG layers. For RAG, we measured retrieval recall/precision, context sufficiency, and document freshness; for generation, we scored faithfulness and clarity, and tracked hallucination rate with targeted human spot-checks. We tuned chunking and metadata, swapped embeddings, introduced a re-ranker for time-sensitive content, and tightened prompts to enforce citations and claim boundaries. For the assistant’s behaviour, we added domain guardrails (PII, compliance phrasing, market-manipulation disclaimers) and created route-level scorecards with release thresholds.
Finally, we wired offline → canary → production evaluations into CI/CD with weekly “quality digests” for traders and product leads.
Implementation Highlights
1. RAG debugger: Side-by-side traces showing when retrieval missed vs. when generation drifted; instant links to source docs.
2. Time-aware retrieval: EFA window tags and recency re-ranking so current day notes out-weigh archived commentary.
3. Prompt refinements: Structured answer template with mandatory citations, uncertainty language for missing data, and escalation cues.
4. Performance envelope: Concurrency tests with replayed peak traffic; caching for stable snippets; intent-based model routing.
5. Safety & policy: Targeted red-team packs for jailbreaks and prompt injection; calibrated thresholds to avoid over-blocking operators.
Before → After
Before: “Gas price jump due to ‘capacity constraints’,” no source cited; 3.4s p95; two follow-ups needed.
After: “Within-day gas up 4.1% mainly from NCS maintenance (source: Ops-Alert-#412) and LDZ demand (source: Daily-Demand-Sheet). Confidence: medium; capacity data missing for Zone-B.” 2.3s p95; no follow-ups.
Collaboration
We ran short, instrumented cycles: define a reliability target, deploy a small change (retrieval, prompt, or route), and measure the movement on the scorecard. Traders reviewed weekly digests; engineering owned CI gates; product owned release thresholds.
Next Steps
Expand multilingual coverage for EU desks, add feature-flagged model mixtures for extreme volatility days, and integrate counterfactual data checks so new market notes can’t silently degrade grounding.