security: add Slack, JWT, Azure, and Discord patterns to sanitize.js #414

Closed
shaun0927 wants to merge 2 commits into EvoMap:main from shaun0927:security/sanitize-additional-token-patterns
@shaun0927

Problem

src/gep/sanitize.js::redactString is the last line of defense
before session log excerpts (up to 2,000 chars) and per-event
reason strings are posted to config.repo by
src/gep/issueReporter.js. It already covers OpenAI sk-, GitHub
gh[pousr]_, AWS AKIA…, npm npm_, generic Bearer, and a few
others. It was missing four common formats.

See #409 for reproduction and the full rationale.

Format                            Example
Slack bot/user tokens             xoxb-…, xoxp-…
JWT (header.payload.signature)    eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiIxIn0.sflKxw…
Azure storage connection strings  DefaultEndpointsProtocol=https;AccountName=…;AccountKey=…
Discord bot tokens                ODY4…YPc-6Q…

These formats show up routinely in agent runtime logs (Slack
webhooks, OIDC id_token responses, Azure Storage SDK debug logs,
bot frameworks), so without these patterns any of them landing in a
gene's outcome.reason or in a session log reaches GitHub in
cleartext.

Fix

  • Add four regexes to REDACT_PATTERNS (destructive replace).
  • Add four matching entries to LEAK_SCANNERS (non-destructive
    detection) so the existing scanner reports the same categories.

Scope: src/gep/sanitize.js, +12 lines, no behaviour change outside
the matched substrings.
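
As a rough illustration of the two lists described above, the sketch below derives both the destructive and the non-destructive path from a single rule table. Everything in it is an assumption made for illustration: the names TOKEN_RULES, redactSketch and scanSketch are hypothetical, the actual shape of REDACT_PATTERNS and LEAK_SCANNERS in src/gep/sanitize.js is not shown in this PR, and only the Discord expression is quoted in the follow-up commit. The Slack, JWT and Azure expressions are plausible guesses, not the patterns that were submitted.

```javascript
// Hypothetical rule table (illustration only, not the contents of sanitize.js).
// Only the Discord expression is quoted in this PR; the other three are guesses.
const TOKEN_RULES = [
  { name: 'slack_token', re: /\bxox[bp]-[A-Za-z0-9-]{10,}/, replacement: '[REDACTED]' },
  { name: 'jwt', re: /\beyJ[A-Za-z0-9_-]+\.eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+/, replacement: '[REDACTED]' },
  // Redact only the key field so the rest of the connection string stays readable.
  { name: 'azure_account_key', re: /AccountKey=[A-Za-z0-9+\/=]+/, replacement: 'AccountKey=[REDACTED]' },
  { name: 'discord_token', re: /\b[MNO][A-Za-z0-9_-]{23,}\.[A-Za-z0-9_-]{6}\.[A-Za-z0-9_-]{27,}\b/, replacement: '[REDACTED]' },
];

// Destructive path: replace every match of every rule.
function redactSketch(text) {
  return TOKEN_RULES.reduce(
    (out, rule) => out.replace(new RegExp(rule.re.source, 'g'), rule.replacement),
    text
  );
}

// Non-destructive path: report which categories are present.
// The stored regexes carry no `g` flag, so test() keeps no state between calls.
function scanSketch(text) {
  return TOKEN_RULES.filter((rule) => rule.re.test(text)).map((rule) => rule.name);
}

const line =
  'upload failed: AccountName=acct;AccountKey=ABC123DEF456ghi789JKL==;EndpointSuffix=core.windows.net';
console.log(scanSketch(line));   // [ 'azure_account_key' ]
console.log(redactSketch(line)); // upload failed: AccountName=acct;AccountKey=[REDACTED];EndpointSuffix=core.windows.net
```

Deriving both paths from one table is a design choice of this sketch, not something the PR does; it is one way to keep the redaction and detection paths reporting the same categories.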

Testing

$ node test/sanitize.test.js
All sanitize tests passed (34 assertions)

Reproduction (uses 4 synthetic secrets in the public test):

const { redactString } = require('./src/gep/sanitize');
const samples = [
  'xoxb-1234567890-1234567890123-AbCdEfGhIjKlMnOpQrStUvWx',
  'eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiIxIn0.sflKxwRJSMeKKF2QT4fwpMeJf36POk6yJV_adQssw5c',
  'DefaultEndpointsProtocol=https;AccountName=acct;AccountKey=ABC123DEF456ghi789JKL==;EndpointSuffix=core.windows.net',
  'ODY4MzE2NTg4ODIwMTEyMzQ1.YPc-6Q.K3n1tfY9q9f4k_5vZl3Mw2X1AbCdefGhIjK',
];
for (const s of samples) console.log(redactString(s) === s ? 'LEAKED' : 'REDACTED');

Before this PR:

LEAKED
LEAKED
LEAKED
LEAKED

After this PR:

REDACTED
REDACTED
REDACTED
REDACTED

Prior art

PR #107 (fix: harden sanitize patterns for token leakage prevention) attempted a similar hardening and was closed without
merge on 2026-02-26. This PR keeps the scope minimal — four additive
regexes, no refactor, no reorder of existing patterns — so reviewers
can approve or deny each pattern independently. Happy to split
further if that helps.

Closes #409.

…voMap#409)

sanitize.js::redactString is the last line of defense before session
log excerpts and event reason strings are POSTed to the auto-issue
tracker by issueReporter.js. It already covers OpenAI sk-, GitHub
gh[pousr]_, AWS AKIA, npm_, generic Bearer, and a few others. It was
missing four common formats:

  xoxb-... / xoxp-...   Slack bot and user tokens
  eyJ....eyJ....{sig}   JSON Web Tokens (header.payload.signature)
  AccountKey=...;       Azure storage connection strings (key field)
  3-segment base64url   Discord bot tokens

These formats show up routinely in agent runtime logs: Slack webhooks,
OIDC id_tokens exchanged by SDKs, Azure Storage SDK emitted logs, and
bot frameworks. Without these patterns any of them landing in a
gene's outcome.reason or session log reaches GitHub in cleartext.

The REDACT_PATTERNS array gets four new regexes, and LEAK_SCANNERS
gets four matching entries so the non-destructive scanner reports the
same categories.

Testing:

  $ node test/sanitize.test.js
  All sanitize tests passed (34 assertions)

  $ node -e "const {redactString} = require('./src/gep/sanitize');
  [...4 samples above...].forEach(([n, s]) =>
    console.log(n, '->', redactString(s) === s ? 'LEAKED' : 'REDACTED'));"
  slack-bot -> REDACTED
  jwt       -> REDACTED
  azure     -> REDACTED
  discord   -> REDACTED

Note: EvoMap#107 attempted a
similar hardening and was closed without merge. Scope here is minimal
(four additive regexes, no refactor, no behaviour change outside the
matched substrings) to keep the review surface tight.

Closes EvoMap#409

Follow-up on the initial patch in this PR. Two issues called out in
review:

1. REDACT_PATTERNS Discord regex was too broad:
     /\b[A-Za-z0-9_-]{24,}\.[A-Za-z0-9_-]{6,}\.[A-Za-z0-9_-]{27,}\b/
   This would also match lowercase dotted identifiers such as Python
   module paths (some_really_long_module_name.submod.
   another_really_long_module_name_here) and similarly structured
   hostnames. Because REDACT_PATTERNS performs a destructive
   replacement, false matches silently corrupt the sanitized output.

2. LEAK_SCANNERS discord_token still used the old narrow [MN] prefix
   and therefore did not detect the O-prefix tokens that the
   destructive pattern was intended to catch. Destructive and
   non-destructive paths reported inconsistent results.
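
Both review findings can be reproduced in a few lines. The broad expression is the one quoted in item 1. The narrow scanner expression is not quoted anywhere in this PR, so the sketch assumes it has the same three-segment shape with an [MN] prefix; the variable names are hypothetical.

```javascript
// Broad destructive pattern, as quoted in item 1.
const broadRedact = /\b[A-Za-z0-9_-]{24,}\.[A-Za-z0-9_-]{6,}\.[A-Za-z0-9_-]{27,}\b/;

// Assumption: the old scanner is modelled as the same shape with an [MN] prefix.
const narrowScanner = /\b[MN][A-Za-z0-9_-]{23,}\.[A-Za-z0-9_-]{6}\.[A-Za-z0-9_-]{27,}\b/;

const modulePath = 'some_really_long_module_name.submod.another_really_long_module_name_here';
const oPrefixToken = 'ODY4MzE2NTg4ODIwMTEyMzQ1.YPc-6Q.K3n1tfY9q9f4k_5vZl3Mw2X1AbCdefGhIjK';

// Issue 1: a harmless module path is treated as a token and would be replaced.
const falsePositive = broadRedact.test(modulePath);

// Issue 2: the destructive path catches the O-prefix token, the scanner does not.
const redactedButNotReported = broadRedact.test(oPrefixToken) && !narrowScanner.test(oPrefixToken);

console.log({ falsePositive, redactedButNotReported });
// { falsePositive: true, redactedButNotReported: true }
```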

Fix for both paths:

  /\b[MNO][A-Za-z0-9_-]{23,}\.[A-Za-z0-9_-]{6}\.[A-Za-z0-9_-]{27,}\b/

- Leading [MNO] requires a Discord-style uppercase prefix, ruling
  out lowercase dotted identifiers.
- Middle segment pinned to exactly 6 chars (matches the base64url
  encoding of the 4-byte token timestamp Discord uses).
- Third segment unchanged (27+ chars HMAC signature).
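
A minimal check of the tightened expression, using the synthetic Discord token and the module path from this PR. The hostname is a made-up example, and DISCORD_TOKEN is a hypothetical name. The Buffer line only confirms the arithmetic behind the six-character middle segment: 4 bytes are 32 bits, which need ceil(32 / 6) = 6 unpadded base64url characters.

```javascript
const DISCORD_TOKEN = /\b[MNO][A-Za-z0-9_-]{23,}\.[A-Za-z0-9_-]{6}\.[A-Za-z0-9_-]{27,}\b/;

// 4 bytes encode to 6 base64url characters (Node emits base64url without padding).
const timestampChars = Buffer.alloc(4).toString('base64url').length;

const checks = {
  // Synthetic token from the reproduction in this PR: should match.
  discordBot: DISCORD_TOKEN.test('ODY4MzE2NTg4ODIwMTEyMzQ1.YPc-6Q.K3n1tfY9q9f4k_5vZl3Mw2X1AbCdefGhIjK'),
  // Lowercase module path from the review: should not match.
  modulePath: DISCORD_TOKEN.test('some_really_long_module_name.submod.another_really_long_module_name_here'),
  // Hypothetical lowercase hostname: should not match.
  hostname: DISCORD_TOKEN.test('build-cache-eu-west-1-internal.example.artifact-storage-backend-cluster'),
};

console.log(timestampChars); // 6
console.log(checks.discordBot, checks.modulePath, checks.hostname); // true false false
```

The [MNO] prefix only rules out lowercase identifiers; a dotted string that starts with an uppercase M, N or O at a word boundary and has the same segment lengths would still be redacted.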

Verified:

  discord-bot (should REDACT)         -> redacted
  lowercase module path (should NOT)  -> unchanged
  lowercase hostname (should NOT)     -> unchanged
  slack / jwt / azure                 -> still redacted

  node test/sanitize.test.js
  All sanitize tests passed (34 assertions)
@autogame-17
Collaborator

autogame-17 commented Apr 21, 2026

Thank you for the contribution, @shaun0927. As noted in SECURITY.md and our previous triage on related PRs, EvoMap/evolver is a generated distribution mirror of an internal maintainer branch and cannot accept code PRs directly -- the public tarball carries the release integrity signature that only our release pipeline can produce correctly against the internal maintainer branch.

This specific change has been recorded and will be hand-ported into the internal maintainer branch ahead of the next release, with credit to you. Closing this PR accordingly; the underlying issue will be closed or referenced when the port lands.

If you have more suggestions we welcome them as issues (not PRs) so we can triage without creating merge-path confusion.

@autogame-17
Collaborator

Hi @shaun0927 — thanks again for this contribution. The changes inspired by this PR have shipped in v1.69.10 (release, npm).

Ported:

  • redact patterns + leak scanners for Slack tokens (xoxb/xoxp/xoxa), JSON Web Tokens, Azure storage AccountKey=, and Discord bot tokens
  • additional patterns for Azure AD client_secret= and Application Insights InstrumentationKey=
  • matching unit tests, including a negative assertion that lowercase dotted identifiers do not false-match the Discord pattern

Because the public mirror is built from an internal maintainer branch, the commit attribution uses our release commit rather than a direct merge. Credit to you is noted in the v1.69.10 release notes. Really appreciated.
