FILE 07.03 / SCENARIO
SIMULATED · FOR CANDIDATE EVALUATION · NOT REPRESENTATIVE OF PRODUCTION SYSTEMS
Improve the RAG
A retrieval prompt with imperfect groundedness. Improve it, then propose the metric that would catch a regression in your rewrite.
FILE 07.03.1 / SETUP
Our RAG layer powers Lena's grounded reasoning. When an operator asks Lena a question — "why is conf room 7 dropping calls" — the retrieval layer pulls the most relevant chunks from operational telemetry, structured device state, and product documentation, and the model answers with citations.
In this scenario's sample evaluation snapshot — visible in the telemetry below — the prompt clears the groundedness bar 60% of the time. That number is curated for the exercise, not a live production claim: a deliberately mid-range result so the candidate has something real to push against.
The non-grounded responses break down into a mix (see the telemetry): retrieval misses the relevant chunk, retrieval returns distracting chunks, the model ignores what it retrieved, or the model confabulates from a partial doc. The mix is informative because it is not all one failure class. That's the shape of the work an AI engineer does on retrieval: read the failure decomposition, decide which lever moves the most, change one thing, measure again.
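Reading a failure decomposition can be as simple as tallying labeled responses. The sketch below assumes each evaluated response has been hand-labeled with either `None` (grounded) or one of four label strings; the label names, the function name, and the sample data are all hypothetical and are not taken from the scenario's telemetry.

```python
from collections import Counter

# Hypothetical label names, one per failure class named in the text.
FAILURE_CLASSES = (
    "retrieval_miss",
    "retrieval_distract",
    "ignored_context",
    "partial_doc_confabulation",
)


def decompose_failures(labels):
    """Return (failure_class, share) pairs, largest share first.

    `labels` holds one entry per evaluated response: None for a grounded
    response, otherwise one of FAILURE_CLASSES. Shares are fractions of
    the non-grounded responses only.
    """
    failures = [label for label in labels if label is not None]
    unknown = set(failures) - set(FAILURE_CLASSES)
    if unknown:
        raise ValueError(f"unknown failure labels: {sorted(unknown)}")
    if not failures:
        return []
    counts = Counter(failures)
    return [(name, count / len(failures)) for name, count in counts.most_common()]


# Toy data for illustration only: 6 grounded responses and 4 failures.
sample = [None] * 6 + [
    "retrieval_miss",
    "ignored_context",
    "retrieval_miss",
    "retrieval_distract",
]
for name, share in decompose_failures(sample):
    print(f"{name}: {share:.0%}")
```

The first entry in the result is the largest failure class, which is a reasonable first candidate for the one thing to change before measuring again.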
We're inviting candidates to take a swing at this. Two scopes:
- Improve the retrieval prompt itself. The current prompt is simple: system instructions, the question, the retrieved context, an "I don't know" branch. There's a lot of room to do better: how you constrain the model, how you handle refusal, whether you make the model cite back to specific chunks, how you handle the case where retrieval is borderline.
- Propose the eval metric that would catch a regression in your rewrite. Groundedness rate is too blunt: it catches the worst-case failures but misses partial-citation failures and over-refusal. Pick one new metric, define it precisely, and tell us the threshold you'd alarm on.
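As one possible shape for the first scope, the sketch below assembles a prompt that tags every retrieved chunk with an ID, asks for citations back to those IDs, and keeps the "I don't know" branch. It is a minimal illustration, not the current prompt: the instruction wording, the chunk ID format, the function names, and the sample chunks are all assumptions.

```python
import re


def build_prompt(question, chunks):
    """Assemble a grounded-answer prompt.

    `chunks` is a list of (chunk_id, text) pairs. The ID scheme is
    hypothetical; any stable identifier would do.
    """
    context = "\n".join(f"[{chunk_id}] {text}" for chunk_id, text in chunks)
    return (
        "Answer using only the context below.\n"
        "After every claim, cite the supporting chunk ID in square brackets.\n"
        "If the context does not support an answer, reply exactly: I don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
    )


# Matches a bracketed citation such as [tel-1].
CITATION = re.compile(r"\[([^\[\]]+)\]")


def invalid_citations(answer, chunks):
    """Return cited IDs that do not belong to any retrieved chunk."""
    known = {chunk_id for chunk_id, _ in chunks}
    return [cid for cid in CITATION.findall(answer) if cid not in known]


# Made-up chunks for illustration only.
chunks = [
    ("tel-1", "Packet loss on the room 7 codec rose during the call."),
    ("doc-2", "Calls drop when packet loss stays high."),
]
print(build_prompt("why is conf room 7 dropping calls", chunks))
print(invalid_citations("Loss rose [tel-1]. Firmware is stale [doc-9].", chunks))
```

Checking citations against the retrieved IDs only shows that a cited chunk exists. Whether the chunk actually supports the claim still needs a separate judgment, human or model-based.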
This is the kind of evaluation-and-iteration loop we run. Walk us through what you'd do.
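To make the loop concrete, here is a hedged sketch of one candidate metric, an over-refusal rate, together with a regression check against a baseline. The record keys, the refusal-matching rule, and the 0.05 alarm margin are placeholders chosen for illustration; the scenario does not prescribe any of them.

```python
REFUSAL_PREFIX = "i don't know"


def over_refusal_rate(records):
    """Share of answerable questions that the model refused.

    Each record is a dict with two hypothetical keys:
      "answerable": bool, a human label that the retrieved context suffices
      "answer": str, the model output
    """
    answerable = [r for r in records if r["answerable"]]
    if not answerable:
        return 0.0
    refused = sum(
        1 for r in answerable
        if r["answer"].strip().lower().startswith(REFUSAL_PREFIX)
    )
    return refused / len(answerable)


def regressed(baseline_rate, candidate_rate, max_increase=0.05):
    """Flag a regression when the rate rises by more than `max_increase`.

    The 0.05 default is a placeholder, not a recommended threshold.
    """
    return candidate_rate - baseline_rate > max_increase
```

A rewrite that tightens the refusal branch can raise groundedness while quietly refusing questions it could have answered. Tracking this rate next to groundedness is one way to surface that trade-off.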
FILE 07.03.2 / TELEMETRY
What is on the operator's screen.
Real-shaped operational data. Anonymized device IDs, real-shaped timing. The same view an on-call engineer would see in the moment.
FILE 07.03.4 / YOUR RESPONSE
Show us how you would design it.
Short and specific beats long and vague. The next step is the application form — we save what you have written here so you do not lose it.