NetSpeek

FILE 07.04 / SCENARIO

SIMULATED · FOR CANDIDATE EVALUATION · NOT REPRESENTATIVE OF PRODUCTION SYSTEMS

Stop the Regression

A bug that keeps coming back across releases. Design a CI gate and the telemetry that would have caught it earlier.

ALIGNED ROLES / QA / DevOps

FILE 07.04.1 / SETUP

Same bug. Three releases. Three customer reports. One internal monitoring catch — eventually.

The bug: when an operator triggers the "mute room" workflow, the orchestration API returns 200 ("success"), but the underlying device state remains unmuted. The operator sees green; the room is still hot. Awkward in a routine meeting. Embarrassing in a board meeting.

It first hit production in 2025.10.4 (one customer, one vendor). The fix shipped. Then 2025.11.2 (different customer, different vendor — but the same root cause class). The fix shipped again. Then 2026.01.1 — three customers, two vendors, same root cause.

Each time the symptom is the same: the API claims the workflow completed but the device-state verification at the end never actually fired (or fired but didn't fail when the state didn't match). The CI suite never caught it. Production telemetry never raised it. The customer raised it.
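The failure mode can be sketched in a few lines. This is a minimal illustration, not the actual orchestration code: `Device`, `StuckDevice`, `MuteResult`, and `mute_room` are hypothetical names, and the 502 status is an arbitrary stand-in for "any non-200 response". The idea it shows is that the reported status is derived from a read-back of device state, not from the command being acknowledged.

```python
from dataclasses import dataclass


class Device:
    """Hypothetical in-memory device: accepts a mute command and applies it."""

    def __init__(self) -> None:
        self.muted = False

    def send_mute(self) -> bool:
        self.muted = True
        return True  # command acknowledged

    def read_muted(self) -> bool:
        return self.muted


class StuckDevice(Device):
    """Hypothetical faulty device: acknowledges the command, never changes state."""

    def send_mute(self) -> bool:
        return True  # acknowledged, state untouched


@dataclass
class MuteResult:
    status: int             # HTTP-style status the API would return (assumed)
    verification_ran: bool  # did the read-back actually execute?
    state_matched: bool     # did the read-back confirm the device is muted?


def mute_room(device: Device) -> MuteResult:
    """Report 200 only when the read-back confirms the device is muted."""
    if not device.send_mute():
        return MuteResult(status=502, verification_ran=False, state_matched=False)
    matched = device.read_muted() is True  # verification always runs after an ack
    return MuteResult(status=200 if matched else 502,
                      verification_ran=True, state_matched=matched)
```

Running `mute_room(StuckDevice())` covers the "acknowledged but unchanged" case: under these assumptions it returns a non-200 status with `verification_ran` set and `state_matched` cleared, which is the outcome the three incidents lacked.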

The team has deliberately avoided patching only the immediate cause. We want a CI gate that catches this class of bug, not just the next instance of this specific one.

Below is the snapshot of the current CI surface, the current production telemetry coverage, and the three incident records. Your job is to design two things:

  1. The CI gate that would have caught this before release.
  2. The single missing production telemetry signal that would have cut time-to-detect well below the 9 to 36 hours recorded in the incident log.

The hard part isn't naming the signals. The hard part is keeping CI fast and trustworthy while you do it.
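One possible shape for a class-level gate is to assert on the events a workflow run emits, not only on its return code. The sketch below is an assumption-heavy illustration: it supposes each run yields a list of event dicts with hypothetical `type`, `run_id`, `expected`, and `observed` fields, and the event names `device_state_verified` and `workflow_completed` are invented for this example.

```python
from typing import Dict, List

Event = Dict[str, object]


def gate_violations(events: List[Event]) -> List[str]:
    """Return one message per completed run that lacks a passing verification."""
    verified_ok = set()
    verified_bad = set()
    for e in events:
        if e["type"] == "device_state_verified":
            if e["expected"] == e["observed"]:
                verified_ok.add(e["run_id"])
            else:
                verified_bad.add(e["run_id"])

    problems = []
    for e in events:
        if e["type"] != "workflow_completed":
            continue
        run = e["run_id"]
        if run in verified_bad:
            # Verification fired but the workflow completed anyway.
            problems.append(f"{run}: completed despite state mismatch")
        elif run not in verified_ok:
            # Verification never fired for this run.
            problems.append(f"{run}: completed without verification")
    return problems
```

A CI job could fail whenever this list is non-empty. Because the check only needs an event trace, it can run against mock devices and should add little to pipeline duration, though that depends on how traces are captured.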

FILE 07.04.2 / TELEMETRY

What is on the operator's screen.

Real-shaped operational data. Anonymized device IDs, real-shaped timing. The same view an on-call engineer would see in the moment.

SOURCE · 01 / incident_log · LIVE

incidents · 3

  release: 2025.10.4
  date: 2025-10-12
  symptom: Operator action 'mute room' returns 200 but device state remains unmuted
  detection_source: customer report
  time_to_detect_hours: 36

  release: 2025.11.2
  date: 2025-11-19
  symptom: Same: 'mute room' returns 200 but device state remains unmuted (different vendor category)
  detection_source: customer report
  time_to_detect_hours: 22

  release: 2026.01.1
  date: 2026-01-08
  symptom: Same root cause, now affecting 3 customers across 2 vendor categories
  detection_source: internal monitoring
  time_to_detect_hours: 9

the same incident, three times

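For a baseline to measure any new signal against, the three records can be summarized with simple arithmetic. The values below are copied from the incident log; the field name `ttd_hours` is shortened for this sketch.

```python
from statistics import mean

# Values copied from the incident log; nothing here is new data.
incidents = [
    {"release": "2025.10.4", "detection_source": "customer report", "ttd_hours": 36},
    {"release": "2025.11.2", "detection_source": "customer report", "ttd_hours": 22},
    {"release": "2026.01.1", "detection_source": "internal monitoring", "ttd_hours": 9},
]

# Mean time-to-detect across the three incidents, in hours: (36 + 22 + 9) / 3.
mean_ttd = mean(i["ttd_hours"] for i in incidents)

# How many incidents were first detected by a customer rather than by monitoring.
customer_detected = sum(
    1 for i in incidents if i["detection_source"] == "customer report"
)
```

The mean works out to about 22.3 hours, and two of the three detections came from customers.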
SOURCE · 02 / ci_current_state · LIVE

unit_tests
  count: 47
  coverage_pct: 78

integration_tests
  count: 12
  uses_real_device: false
  uses_mock_device: true

e2e_tests
  count: 3
  uses_real_device: true
  avg_duration_min: 14
  stability_pct: 71

telemetry_assertions_in_ci: 0

current CI gates on the mute_room workflow

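The e2e numbers show why leaning on real-device tests alone trades speed for trust. The arithmetic below is a rough sketch under two assumptions that the snapshot does not confirm: that `stability_pct` is the chance a healthy build passes a single e2e run, and that reruns are independent. It also treats `avg_duration_min` as the cost of one full run.

```python
def pass_within(attempts: int, stability: float) -> float:
    """Chance that a healthy build goes green within `attempts` independent runs."""
    return 1.0 - (1.0 - stability) ** attempts


STABILITY = 0.71   # stability_pct from the CI snapshot, as a fraction
RUN_MINUTES = 14   # avg_duration_min from the CI snapshot

# For each retry budget: probability of a green result and worst-case minutes.
retry_table = [
    (n, pass_within(n, STABILITY), n * RUN_MINUTES) for n in (1, 2, 3)
]

for attempts, p_green, worst_case_minutes in retry_table:
    print(attempts, round(p_green, 3), worst_case_minutes)
```

Under these assumptions, one attempt goes green about 71% of the time, two attempts about 92%, and three about 98%, at a worst-case cost of 14, 28, and 42 minutes. Retries buy back false failures but also give a genuinely broken build more chances to pass by luck, so they are not a substitute for a deterministic check.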
SOURCE · 03 / production_telemetry · LIVE

  workflow_completion_event: present
  device_state_verification_post_action: absent
  per_vendor_drift_metric: absent
  operator_complaint_signal: absent

production telemetry coverage on mute_room
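To make the absent signals concrete, here is a hedged sketch of what a post-action verification metric could compute if each verification emitted a (vendor category, matched) pair. The function names, the pair format, and the zero threshold are all assumptions for illustration; the snapshot says nothing about how such events would be shaped.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def mismatch_rate_by_vendor(checks: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Aggregate (vendor_category, state_matched) pairs into a mismatch rate per vendor."""
    total: Dict[str, int] = defaultdict(int)
    bad: Dict[str, int] = defaultdict(int)
    for vendor, matched in checks:
        total[vendor] += 1
        if not matched:
            bad[vendor] += 1
    return {vendor: bad[vendor] / total[vendor] for vendor in total}


def vendors_to_alert(rates: Dict[str, float], threshold: float = 0.0) -> List[str]:
    """Vendors whose mismatch rate exceeds the threshold (assumed default: any mismatch)."""
    return sorted(v for v, rate in rates.items() if rate > threshold)
```

For example, feeding it two matched checks for a hypothetical `vendor_a` and one matched plus one mismatched check for `vendor_b` yields rates of 0.0 and 0.5, and only `vendor_b` is flagged. Splitting by vendor matters here because the second and third incidents spread across vendor categories.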

FILE 07.04.4 / YOUR RESPONSE

Show us how you would design it.

Short and specific beats long and vague. The next step is the application form — we save what you have written here so you do not lose it.
