Tokenization¶
Tokenization is the process of replacing detected secrets with placeholder tokens that are deterministic within a session and cannot be reversed to recover the original value.
Token Format¶
Tokens have the form <CATEGORY_N>:
- CATEGORY — Type of secret (e.g., API_KEY, AWS_SECRET)
- N — Occurrence number for correlation: 1 for the first distinct secret of that category seen in the bundle, 2 for the next, and so on
Examples¶
| Original | Token |
|---|---|
| sk-abc123xyz | <API_KEY_1> |
| AKIA1234567890 | <AWS_KEY_1> |
| ghp_xxxxxxxxxxxx | <GITHUB_TOKEN_1> |
Correlation Preservation¶
The same secret always produces the same token within a bundle:
Input:
```text
API_KEY=sk-secret123
curl -H "Authorization: Bearer sk-secret123"
```
Output:
```text
API_KEY=<API_KEY_1>
curl -H "Authorization: Bearer <API_KEY_1>"
```
A reviewer can see that the value assigned to API_KEY is the same one sent in the Authorization header, and so follow the data flow, without ever seeing the value itself.
How It Works¶
- Salt — Generated fresh for each session and mixed into every hash, so a secret's hash in one bundle cannot be matched against the same secret's hash in another bundle
- Hash — One-way SHA-256 transformation of the salted secret, used only as the lookup key for its token
- Token ID — Counter kept per category; it advances each time a previously unseen secret of that category is encountered, and repeats of a known secret reuse its existing ID
- Token — Human-readable placeholder of the form <CATEGORY_N>, built from the category and the Token ID
Security Properties¶
| Property | Guarantee |
|---|---|
| Non-reversible | The token carries only a category and a counter; nothing in it can be worked back to the original value |
| Deterministic | Same input → same output within a session, so every occurrence of a secret shares one token |
| Session-isolated | The salt differs per session, so a secret's hash key in one bundle cannot be matched to its key in another |
| Collision-resistant | Different inputs → different tokens; each distinct salted SHA-256 digest receives its own counter value |
Note that the numeric suffix records the order in which distinct secrets of a category first appeared, so a bundle does reveal how many distinct secrets of each category were found, though nothing about their values.
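The following is an added, self-contained illustration of the four steps under How It Works and of the correlation and security properties described above. It is example code written for this page, not the bugsafe implementation: the SessionTokenizer class, the PATTERNS detectors, the session_isolation helper and the sample bundle are all constructed assumptions.

```python
import hashlib
import re
import secrets
from collections import defaultdict
from typing import Dict, Optional

# Constructed sample bundle (not taken from any real project): the two lines from
# the Correlation Preservation section plus an AWS-style access key ID.
SAMPLE_BUNDLE = (
    "API_KEY=sk-secret123\n"
    'curl -H "Authorization: Bearer sk-secret123"\n'
    "AWS_ACCESS_KEY_ID=AKIA1234567890\n"
)

# Hypothetical detectors (category -> regex). Deliberately tiny; a real redactor
# ships a much larger rule set.
PATTERNS = {
    "API_KEY": re.compile(r"sk-[A-Za-z0-9]+"),
    "AWS_KEY": re.compile(r"AKIA[0-9A-Z]{10,16}"),
}


class SessionTokenizer:
    """Reference tokenizer following the four steps above (illustrative only)."""

    TOKEN_RE = re.compile(r"<[A-Z][A-Z0-9_]*_\d+>")

    def __init__(self, salt: Optional[bytes] = None) -> None:
        # Step 1 (Salt): fresh random value per session. Pass an explicit salt
        # only to make a run reproducible, e.g. in tests.
        self.salt = salt if salt is not None else secrets.token_bytes(32)
        self._tokens: Dict[bytes, str] = {}                # digest -> token
        self._counters: Dict[str, int] = defaultdict(int)  # category -> last N

    def digest(self, secret: str, category: str) -> bytes:
        # Step 2 (Hash): one-way SHA-256 over salt || category || 0x00 || secret.
        # The raw secret is never stored; only this digest is kept as a lookup key.
        h = hashlib.sha256()
        h.update(self.salt)
        h.update(category.encode("utf-8"))
        h.update(b"\x00")
        h.update(secret.encode("utf-8"))
        return h.digest()

    def tokenize(self, secret: str, category: str) -> str:
        key = self.digest(secret, category)
        if key not in self._tokens:
            # Step 3 (Token ID): the counter advances only for a previously
            # unseen secret of this category, so repeats share one token.
            self._counters[category] += 1
            # Step 4 (Token): human-readable placeholder <CATEGORY_N>.
            self._tokens[key] = f"<{category}_{self._counters[category]}>"
        return self._tokens[key]

    def is_token(self, value: str) -> bool:
        return self.TOKEN_RE.fullmatch(value) is not None

    def redact(self, text: str) -> str:
        """Replace every detected secret in text with its token."""
        for category, pattern in PATTERNS.items():
            text = pattern.sub(lambda m, c=category: self.tokenize(m.group(0), c), text)
        return text


def session_isolation(secret: str = "sk-secret123", category: str = "API_KEY") -> Dict[str, bool]:
    """Compare two independent sessions that each see `secret` first.

    The visible placeholder is positional, so both sessions emit <API_KEY_1>;
    what differs is the salted digest, which is why a lookup table built in one
    bundle cannot resolve tokens from another.
    """
    a, b = SessionTokenizer(), SessionTokenizer()
    return {
        "same_token_text": a.tokenize(secret, category) == b.tokenize(secret, category),
        "same_digest": a.digest(secret, category) == b.digest(secret, category),
    }


if __name__ == "__main__":
    tok = SessionTokenizer()
    print(tok.redact(SAMPLE_BUNDLE))
    print(session_isolation())
```

Running redact over the sample bundle replaces both occurrences of sk-secret123 with <API_KEY_1> and the AWS key with <AWS_KEY_1>, which is the correlation property in action. The session_isolation helper makes the caveat concrete: the placeholder text is identical across two fresh sessions because it is only a counter, while the salted digests differ, so isolation is a property of the hash keys rather than of the visible token. Nothing in this class retains the original value, which is what makes the tokens non-reversible; a tool that needs to de-tokenize for the bundle owner would have to keep a separate, protected mapping.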
API Usage¶
```python
from bugsafe.redact.tokenizer import Tokenizer

# One Tokenizer instance corresponds to one session (one salt, one set of counters)
tokenizer = Tokenizer()

# Tokenize a secret: the first API_KEY seen in this session gets ID 1
token = tokenizer.tokenize("sk-abc123", "API_KEY")
print(token)  # <API_KEY_1>

# Same secret = same token within the session
token2 = tokenizer.tokenize("sk-abc123", "API_KEY")
assert token == token2

# Check whether a string is a token rather than a raw value
assert tokenizer.is_token("<API_KEY_1>")
```