The first 15 minutes: building a breach alarm for your credential vault

You get a Slack message at 11pm. Someone thinks their laptop was compromised. Or a user reports their credentials are being used from a location they don't recognise. Or a phishing simulation caught three people in finance.

Whatever the trigger, the question is always the same: how quickly can you contain the blast radius?

For most teams using a password manager, the answer involves a frustrating sequence of manual steps — revoking access one member at a time, checking which credentials were shared with whom, and hoping the attacker hasn't already exported what they needed.

This post covers the technical architecture behind a breach alarm that collapses that response from hours to seconds.

The problem isn't that teams don't care about incident response. It's that they've never built the mechanism to act on it quickly enough for it to matter.

Anatomy of a credential breach

Before building a response mechanism, it's worth being precise about what you're responding to.

In a zero-knowledge credential vault, a breach has a specific shape. The attacker has one of three things:

An authenticated session — they've logged in as a valid user, either by compromising credentials + 2FA, or by stealing a session cookie.
A copy of vault ciphertext they can't yet decrypt — they've exfiltrated an encrypted backup but don't have the key. Time-gated threat.
Already-decrypted credentials — they copied passwords while the vault was unlocked. These are now valid regardless of what you do to the vault.

The first category is the one you can stop. The second you can slow down significantly — rotate everything before they crack it. The third is already done; your job there is damage assessment and rotation.

A breach alarm addresses all three categories, but it's most decisive against the first.

What a breach alarm actually does

The core operation is simple in concept and harder to implement correctly: instantly terminate every active session for every member of the organisation, simultaneously, without leaving gaps.

This sounds straightforward until you consider what "active session" means in a distributed system. A user might have:

A session cookie in their browser (Flask session)
A valid JWT in the browser extension
A long-lived API token in a CI/CD pipeline
A service account token with separate auth

Killing one doesn't kill the others. A real breach alarm has to reach all of them.

Implementation note

In HexVault's implementation, session termination works on two layers. First, the user_sessions table is cleared for all affected org members, which kills JWT-based API sessions. Second, session_invalidated_at is set to NOW() on each affected user record. The require_auth middleware checks this value on every request and rejects any session cookie issued before that timestamp. Run inside a single transaction, these two operations close the browser-cookie and JWT vectors together. Long-lived API tokens and service account tokens are not covered by them and have to be revoked explicitly.
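The cookie check described in this note can be sketched as a pure function. This is a minimal illustration, not HexVault's actual middleware: the name `cookie_is_still_valid` and the assumption that each cookie records an `issued_at` timestamp are invented for the example.

```python
from datetime import datetime, timezone
from typing import Optional


def cookie_is_still_valid(issued_at: datetime,
                          session_invalidated_at: Optional[datetime]) -> bool:
    """Return True if a session cookie survives the invalidation timestamp.

    Hypothetical helper: assumes the cookie records when it was issued and
    that both timestamps are timezone-aware UTC values.
    """
    if session_invalidated_at is None:
        # No alarm has ever fired for this user, so there is nothing to reject.
        return True
    # Reject cookies issued before, or exactly at, the invalidation moment.
    return issued_at > session_invalidated_at


alarm = datetime(2026, 4, 30, 23, 14, 7, tzinfo=timezone.utc)
old_cookie = datetime(2026, 4, 30, 9, 0, 0, tzinfo=timezone.utc)
new_cookie = datetime(2026, 4, 30, 23, 30, 0, tzinfo=timezone.utc)

print(cookie_is_still_valid(old_cookie, alarm))  # False
print(cookie_is_still_valid(new_cookie, alarm))  # True
```

Treating a cookie issued at exactly the invalidation instant as invalid is a conservative choice made in this sketch. A cookie issued after the alarm, for example when a member logs in again once the incident is handled, passes the check.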

Killing sessions atomically

The implementation matters. A naive implementation kills sessions one at a time in a loop — if the loop fails halfway through, some sessions survive. The correct approach is a single bulk operation:

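Python
# Get all active member IDs in one query.
# NOTE: as written this selects every member; leaving admin sessions
# intact (see below) needs an extra filter on the schema's role column.
member_ids = db.execute(
    'SELECT user_id FROM org_members WHERE org_id=%s',
    (org_id,), fetch='all'
)
uid_list = [m['user_id'] for m in member_ids or []]

# Guard: an empty list would produce invalid SQL ("IN ()")
if uid_list:
    # One placeholder per ID; the values themselves are bound as parameters
    ph = ','.join(['%s'] * len(uid_list))

    # Single bulk DELETE: one statement, so no half-finished loop
    db.execute(
        f'DELETE FROM user_sessions WHERE user_id IN ({ph})',
        uid_list
    )

    # Invalidate Flask session cookies issued before this moment
    db.execute(
        f'UPDATE users SET session_invalidated_at=NOW() WHERE id IN ({ph})',
        uid_list
    )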

Note that placeholder lists (IN (%s, %s, %s)) are constructed from the count of IDs, not from user input — there's no SQL injection vector here. The actual values are passed as parameters to psycopg2.

Admin sessions are left intact by design. You need the admin to be able to manage the incident response. Killing admin sessions too would lock you out at the worst possible moment.
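To make the atomicity and the admin exclusion concrete, here is a self-contained sketch that uses Python's built-in sqlite3 module as a stand-in for PostgreSQL. Several things are assumptions for illustration only: the `role` column on org_members, the function name `kill_member_sessions`, and the simplified table definitions. SQLite uses `?` placeholders and CURRENT_TIMESTAMP where the production code uses `%s` and NOW().

```python
import sqlite3


def kill_member_sessions(conn: sqlite3.Connection, org_id: int) -> int:
    """Delete sessions and stamp invalidation for non-admin members.

    Both statements run in one transaction: either both apply or neither.
    Returns the number of session rows deleted.
    """
    rows = conn.execute(
        "SELECT user_id FROM org_members WHERE org_id = ? AND role != 'admin'",
        (org_id,),
    ).fetchall()
    uid_list = [r[0] for r in rows]
    if not uid_list:
        return 0  # nothing to do, and "IN ()" would be invalid SQL

    ph = ",".join(["?"] * len(uid_list))
    with conn:  # commits on success, rolls back if either statement raises
        deleted = conn.execute(
            f"DELETE FROM user_sessions WHERE user_id IN ({ph})", uid_list
        ).rowcount
        conn.execute(
            "UPDATE users SET session_invalidated_at = CURRENT_TIMESTAMP "
            f"WHERE id IN ({ph})",
            uid_list,
        )
    return deleted


# Toy data: user 1 is the admin, users 2 and 3 are ordinary members.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, session_invalidated_at TEXT);
    CREATE TABLE org_members (org_id INTEGER, user_id INTEGER, role TEXT);
    CREATE TABLE user_sessions (id INTEGER PRIMARY KEY, user_id INTEGER);
    INSERT INTO users (id) VALUES (1), (2), (3);
    INSERT INTO org_members VALUES
        (42, 1, 'admin'), (42, 2, 'member'), (42, 3, 'member');
    INSERT INTO user_sessions (user_id) VALUES (1), (2), (2), (3);
""")

killed = kill_member_sessions(conn, 42)
remaining = conn.execute("SELECT user_id FROM user_sessions").fetchall()
print(killed, remaining)  # 3 [(1,)]
```

Only the admin's session row survives, and if the UPDATE were to fail, the DELETE would be rolled back with it rather than leaving the two layers out of step.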

Canary credentials and automatic triggering

A manual alarm requires someone to notice something is wrong and decide to trigger it. That's a slow, fallible process. A canary system makes some breaches self-detecting.

The concept is simple: create a credential that nobody in the organisation should ever access under normal operations. It exists only as a trap. If anyone accesses it, the alarm triggers automatically.

Common examples:

A fake AWS root key for a non-existent account
A "prod database root password" that's not actually used anywhere
An "emergency admin" credential for a system that doesn't have one

In a zero-knowledge vault, reading a credential requires decrypting it client-side. But the access event — the fact that a credential was requested — can be logged server-side without ever seeing the plaintext. This is the hook for canary detection.

Python
def _check_canary_trip(org_id, credential_id, accessor_email):
    """Called on every credential access. If canary, trigger alarm."""
    cred = db.execute(
        'SELECT is_canary, canary_label FROM org_passwords '
        'WHERE id=%s AND org_id=%s',
        (credential_id, org_id), fetch='one'
    )
    if not cred or not cred['is_canary']:
        return

    # Check alarm not already active
    active = db.execute(
        'SELECT breach_alarm_active FROM organisations WHERE id=%s',
        (org_id,), fetch='one'
    )
    if active and active['breach_alarm_active']:
        return

    # Trip the alarm
    reason = (
        f"Canary trip wire: '{cred['canary_label']}' "
        f"was accessed by {accessor_email}"
    )
    _trigger_alarm(org_id, reason=reason)

Security note

The canary check must happen server-side on every credential access request, not client-side. If you check client-side, the check can be bypassed. The server-side credential access log is the authoritative record — the canary detection runs against that log in real time.

The dead man's switch

There's a class of threat that a manual alarm and even a canary can miss: the abandoned org.

Imagine an organisation where the admin left and nobody took over. The vault still exists. Credentials are still valid. There's no active admin to notice anything. If an attacker gains access, nobody triggers the alarm because nobody is watching.

A dead man's switch inverts this: instead of requiring action to trigger the alarm, it requires action to prevent a warning. If no admin logs in for N days, the system escalates — displaying a warning banner and, in more aggressive configurations, initiating lockdown automatically.

The implementation is minimal:

SQL
-- Check last admin login on every alarm status request
SELECT MAX(created_at) AS last_login
FROM admin_audit_log
WHERE org_id = %s AND action = 'admin_login'

If last_login is older than the configured threshold, surface the warning. The threshold should be conservative — seven days catches abandoned orgs without generating noise for normal holiday periods.
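The threshold comparison itself might look like the sketch below. The function name `dead_mans_switch_tripped` and the decision to treat an org with no recorded admin login as tripped are assumptions for illustration; the `last_login` argument corresponds to the value returned by the query above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def dead_mans_switch_tripped(last_login: Optional[datetime],
                             now: datetime,
                             threshold_days: int = 7) -> bool:
    """Return True when no admin login falls inside the threshold window.

    last_login is None when MAX(created_at) finds no admin_login rows at
    all, which this sketch treats as tripped.
    """
    if last_login is None:
        return True
    return now - last_login > timedelta(days=threshold_days)


now = datetime(2026, 4, 30, 12, 0, tzinfo=timezone.utc)
print(dead_mans_switch_tripped(now - timedelta(days=3), now))   # False
print(dead_mans_switch_tripped(now - timedelta(days=10), now))  # True
print(dead_mans_switch_tripped(None, now))                      # True
```

Passing `now` in as a parameter, rather than reading the clock inside the function, keeps the check easy to test against fixed dates.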

Guided recovery

Triggering the alarm is the easy part. What happens next is where most incident responses fall apart.

A breach alarm that just kills sessions and leaves you with a blank screen is incomplete. The admin needs a structured, dynamic recovery checklist that reflects the actual state of the organisation — not a generic template.

What "dynamic" means here:

The checklist queries live database state, not a static template
Items are marked complete automatically when the underlying condition changes
Each item links directly to the admin section where the remediation happens

A static checklist says "kill all sessions". A dynamic one says "2 active sessions remain — click here to terminate them".
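One way to get that behaviour is to give each checklist item a probe that reads live state every time the item is rendered. The sketch below is illustrative only: `ChecklistItem`, the `/admin/sessions` path, and the in-memory `live_state` dictionary (standing in for a real database count) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ChecklistItem:
    label: str
    count_remaining: Callable[[], int]  # probes live state on every render
    link: str                           # hypothetical admin URL path

    def render(self) -> str:
        remaining = self.count_remaining()
        if remaining == 0:
            # The item completes itself once the underlying condition clears.
            return f"[x] {self.label}"
        return f"[ ] {self.label}: {remaining} remaining ({self.link})"


# Stand-in for a live query such as SELECT COUNT(*) FROM user_sessions ...
live_state = {"active_sessions": 2}

items: List[ChecklistItem] = [
    ChecklistItem(
        label="Terminate active sessions",
        count_remaining=lambda: live_state["active_sessions"],
        link="/admin/sessions",
    ),
]

print(items[0].render())  # [ ] Terminate active sessions: 2 remaining (/admin/sessions)
live_state["active_sessions"] = 0
print(items[0].render())  # [x] Terminate active sessions
```

Nothing stores a "done" flag here: completion is derived from the current count, so the checklist cannot drift out of sync with the system it describes.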

The items matter too. Beyond session termination, a complete recovery involves:

Force password resets for all members — even if their session is gone, their credentials are compromised if the attacker had enough time
Review break-glass access logs — did any emergency credentials get accessed in the window before detection?
Rotate canary credentials — if a canary tripped, the credential is now known to an attacker and must be rotated even if it's fake
Check SoD violations — did the attacker use the window to create self-approving access grants?
GDPR Article 33 notification — if personal data was accessible, you have 72 hours to notify your supervisory authority
Rotate webhook secret — if the attacker had admin access, they may have read the webhook config

Webhook integration

A breach alarm that lives only inside the password manager is a closed loop. The moment you suspect a breach, you need your entire security stack to know.

The webhook fires with a breach.alarm event containing:

JSON
{
  "event": "breach.alarm",
  "org_id": 42,
  "org_name": "Acme Corp",
  "actor": "[email protected]",
  "detail": {
    "reason": "Canary trip wire: 'prod-db-root' accessed",
    "sessions_killed": 14,
    "triggered_at": "2026-04-30T23:14:07Z"
  },
  "timestamp": "2026-04-30T23:14:07Z",
  "delivery_id": "uuid"
}

This single event can trigger:

A PagerDuty incident at P1 severity
A Slack message to #security-incidents
Automatic SIEM alert creation
Runbook execution in an automation platform

The webhook is signed with HMAC-SHA256 using a per-org secret. Receiving systems should verify this signature before processing.
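On the receiving side, verification could look like the following sketch. The post does not specify how the signature is encoded or transmitted, so this assumes a hex-encoded HMAC-SHA256 digest computed over the raw request body; the function names and the secret value are placeholders.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, raw_body: bytes) -> str:
    """Hex-encoded HMAC-SHA256 of the raw request body (assumed format)."""
    return hmac.new(secret, raw_body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, raw_body: bytes, received_sig: str) -> bool:
    """Compare expected and received signatures in constant time."""
    expected = sign_payload(secret, raw_body)
    # Encode to bytes so a malformed, non-ASCII header cannot raise TypeError.
    return hmac.compare_digest(expected.encode(), received_sig.encode())


secret = b"per-org-webhook-secret"  # placeholder value
body = b'{"event": "breach.alarm", "org_id": 42}'
sig = sign_payload(secret, body)

print(verify_signature(secret, body, sig))           # True
print(verify_signature(secret, body + b" ", sig))    # False
print(verify_signature(b"wrong-secret", body, sig))  # False
```

The signature should be checked against the exact bytes received, before any JSON parsing, because re-serialising the payload can change whitespace or key order and break the comparison.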

The full incident checklist

Here's the minimum viable incident response for a credential vault breach, in order of priority:

Trigger alarm immediately — don't investigate first. Kill sessions first, investigate with the blast radius contained.
Identify the entry vector — check login risk assessments, anomaly alerts, and the access log for the compromised credential if known.
Force password reset for all members — not just the suspected account. If the attacker had admin access, any member could be compromised.
Rotate every credential that was accessible — use the blast radius simulator to identify scope. Every credential the compromised account could see needs rotation.
Review break-glass access — these are the highest-value targets and the most likely to have been specifically targeted.
Rotate service account tokens — CI/CD tokens and API keys are persistent and survive session termination if not explicitly revoked.
Create a vault snapshot — before any rotation, snapshot the vault state for forensics. This is your pre-rotation record.
Document and notify — create an incident record, notify legal/compliance, and determine GDPR notification obligations.

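The order reflects priority, with one caveat: the snapshot in step 7 has to be taken before the rotations in steps 4 and 6 begin, otherwise it no longer records the pre-rotation state. Steps 1 and 2 happen in the first 5 minutes. Steps 3–6 happen in the first hour. The forensic review of the snapshot and step 8 happen within 24 hours.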

A breach response plan that exists only in a document is no plan at all. The mechanism has to be built, tested, and ready before the incident.

The technical barrier to building this is lower than most teams assume. The database operations are straightforward. The webhook pattern is well-understood. The recovery checklist is a handful of database queries. What's missing in most organisations isn't capability — it's the decision to build it before it's needed.