# Tokenization
Tokenization is the process of replacing secrets with deterministic, non-reversible tokens.
## Token Format

Tokens follow the pattern `<CATEGORY_N>`:

- **CATEGORY** — Type of secret (e.g., `API_KEY`, `AWS_SECRET`)
- **N** — Occurrence number for correlation
### Examples

| Original | Token |
|---|---|
| `sk-abc123xyz` | `<API_KEY_1>` |
| `AKIA1234567890` | `<AWS_KEY_1>` |
| `ghp_xxxxxxxxxxxx` | `<GITHUB_TOKEN_1>` |
## Correlation Preservation
The same secret always produces the same token within a bundle:
Input:

```
API_KEY=sk-secret123
curl -H "Authorization: Bearer sk-secret123"
```

Output:

```
API_KEY=<API_KEY_1>
curl -H "Authorization: Bearer <API_KEY_1>"
```
Because every occurrence maps to the same token, readers can follow data flow through the bundle without ever seeing the underlying values.
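A minimal sketch of this consistent replacement (illustrative only, not bugsafe's actual code; the `redact` helper and the pre-built secret-to-token mapping are assumptions):

```python
# Hypothetical sketch: replace every occurrence of each detected secret
# with its assigned token, preserving correlation across the bundle.
secrets = {"sk-secret123": "<API_KEY_1>"}

def redact(text: str, mapping: dict[str, str]) -> str:
    # Each secret is replaced everywhere it appears, so repeated
    # occurrences all map to the same placeholder.
    for secret, token in mapping.items():
        text = text.replace(secret, token)
    return text

bundle = 'API_KEY=sk-secret123\ncurl -H "Authorization: Bearer sk-secret123"'
print(redact(bundle, secrets))
```

Both occurrences of `sk-secret123` come out as `<API_KEY_1>`, so the reader can still tell they refer to the same credential.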
## How It Works
- Salt — Unique per session, prevents cross-bundle correlation
- Hash — One-way transformation (SHA-256)
- Token ID — Incrementing counter per category
- Token — Human-readable placeholder
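The four steps above can be sketched as follows. This `SketchTokenizer` is a hypothetical illustration under the stated design (per-session salt, SHA-256, per-category counter), not the real bugsafe implementation:

```python
import hashlib
import os
from collections import defaultdict

class SketchTokenizer:
    """Illustrative sketch of the salt -> hash -> counter -> token pipeline."""

    def __init__(self):
        self._salt = os.urandom(16)        # Salt: unique per session
        self._counters = defaultdict(int)  # Token ID: counter per category
        self._tokens = {}                  # hash digest -> assigned token

    def tokenize(self, secret: str, category: str) -> str:
        # Hash: one-way, salted SHA-256 of the secret
        digest = hashlib.sha256(self._salt + secret.encode()).hexdigest()
        if digest not in self._tokens:
            # First time seeing this secret: assign the next ID in its category
            self._counters[category] += 1
            self._tokens[digest] = f"<{category}_{self._counters[category]}>"
        # Token: human-readable placeholder, stable within the session
        return self._tokens[digest]
```

Because the salt is regenerated per session, the digests (and hence the secret-to-token mapping) cannot be correlated across bundles.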
## Security Properties
| Property | Guarantee |
|---|---|
| Non-reversible | Cannot recover original from token |
| Deterministic | Same input → same output (per session) |
| Session-isolated | Different sessions → different tokens |
| Collision-resistant | Different inputs → different tokens |
## API Usage

```python
from bugsafe.redact.tokenizer import Tokenizer

tokenizer = Tokenizer()

# Tokenize a secret
token = tokenizer.tokenize("sk-abc123", "API_KEY")
print(token)  # <API_KEY_1>

# Same secret = same token
token2 = tokenizer.tokenize("sk-abc123", "API_KEY")
assert token == token2

# Check if a string is a token
assert tokenizer.is_token("<API_KEY_1>")
```
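For reference, a plausible token-detection check in the spirit of `is_token` could be a simple pattern match against the `<CATEGORY_N>` format (the regex below is an assumption; the real implementation may differ):

```python
import re

# Matches the <CATEGORY_N> format, e.g. <API_KEY_1> or <GITHUB_TOKEN_2>.
# Hypothetical pattern, not necessarily bugsafe's.
TOKEN_RE = re.compile(r"<[A-Z][A-Z0-9_]*_\d+>")

def is_token(s: str) -> bool:
    # fullmatch ensures the whole string is a token, not just a substring
    return TOKEN_RE.fullmatch(s) is not None
```

This is useful when post-processing a redacted bundle, e.g. to skip strings that are already tokens.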