Local Data Masking Tool

Client
Data Masking Tool
Role
Product Designer & Developer
Years
2025 — 2026
Stack
  • Python
  • Claude
  • HMAC-SHA256
  • Tkinter
  • pandas
  • openpyxl

The problem

Analysts working with client media data face a familiar bind: the most useful AI-assisted analysis requires sharing actual data with a model, but client contracts and basic data hygiene say you shouldn’t paste campaign names, placement IDs, or other potentially sensitive identifiers into a third-party API.

The usual workarounds — anonymizing by hand before sharing, using only aggregate totals, or just not using AI — each sacrifice something. Manual anonymization is slow and error-prone. Sharing only aggregates strips the structural detail that makes AI analysis useful. Skipping AI entirely leaves real capacity on the table.

I wanted a tool that solved this properly: local processing, cryptographically sound masking, and a clean path back to original values for authorized use.

What I built

The Local Data Masking Tool is a cross-platform desktop app (Windows and macOS) that fits directly into an analyst’s file workflow.

Masking. Load a CSV or Excel file. Mark each column as a dimension, a metric, or ignored. Choose which dimensions to mask and how each metric should be aggregated. The app replaces sensitive dimension values with keyed HMAC-SHA256 tokens and saves two separate outputs: the masked data file (safe to share with an AI model) and an encrypted .maskvault file (kept private). The app requires the masked output and the vault to be saved to different folders.
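A minimal sketch of the tokenizing step, assuming the per-job key comes from Python’s secrets module and that each token is the full hex digest. The names make_token and mask_column are hypothetical and are not taken from the app’s source; the sample campaign names are invented.

```python
import hashlib
import hmac
import secrets


def make_token(key: bytes, value: str) -> str:
    """Return a keyed HMAC-SHA256 token for one dimension value."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_column(values, key):
    """Mask a list of values; return (masked values, token -> original mapping)."""
    mapping = {}
    masked = []
    for value in values:
        token = make_token(key, value)
        mapping[token] = value  # this mapping is what would go into the vault
        masked.append(token)
    return masked, mapping


# Assumption: a fresh 32-byte random key for each masking job.
job_key = secrets.token_bytes(32)
campaigns = ["Spring Launch", "Holiday Promo", "Spring Launch"]
masked, vault_mapping = mask_column(campaigns, job_key)
```

Equal inputs map to equal tokens within a job, so grouping and counting still work on the masked file, while a different job key yields unrelated tokens.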

AI Context Layer. Before saving, you can add a plain-language blurb describing what the masked dimensions represent — “campaign refers to different product lines within a cookie brand’s portfolio” — without including any real values. That context travels with the masked file and gives the AI model enough to reason about the data structure without exposing identifiers.

Restore. When an authorized recipient needs original values, the Restore tab matches the masked file against its vault, decrypts the vault in memory, and writes a new workbook with a Restored_Data sheet and a Mapper sheet that maps every token back to its original value.
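The lookup half of a restore can be sketched as follows, assuming the vault has already been decrypted into a token-to-value dictionary. VaultMismatchError and restore_column are hypothetical names, and the check-everything-first behavior is one way to avoid a partial restore, not necessarily how the app does it.

```python
class VaultMismatchError(Exception):
    """Raised when the masked file contains tokens the vault does not know."""


def restore_column(masked_values, mapping):
    """Map tokens back to originals, or fail before producing any output."""
    missing = {token for token in masked_values if token not in mapping}
    if missing:
        raise VaultMismatchError(
            f"{len(missing)} token(s) in the masked file are not in this vault"
        )
    return [mapping[token] for token in masked_values]


# Hypothetical decrypted vault contents and masked column.
mapping = {"a1f3": "Spring Launch", "9c2e": "Holiday Promo"}
restored = restore_column(["a1f3", "9c2e", "a1f3"], mapping)
```

Because every token is validated before any value is returned, a mismatched vault raises an error instead of yielding a half-restored column.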

View the source on GitHub →

How AI shaped the build

Claude was a genuine co-developer across the full lifecycle, not just a code generator.

Security architecture. Early drafts offered selectable hash algorithms (MD5, SHA-256, SHA-512). Iterating with Claude surfaced a weakness: unsalted hashes of predictable values such as campaign names and placement labels are open to dictionary attacks, because anyone can hash a list of likely names and match the results against the masked file. That analysis led directly to the keyed HMAC-SHA256 approach with a fresh random key per job.
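To make the dictionary attack concrete, here is a small illustration with invented campaign names. An attacker who only guesses plausible names can reverse an unsalted SHA-256 hash, but the same lookup table fails against a keyed HMAC because the attacker does not hold the key.

```python
import hashlib
import hmac
import secrets


def unsalted_sha256(value: str) -> str:
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# The attacker's guess list of plausible campaign names (illustrative only).
guesses = ["Spring Launch", "Holiday Promo", "Back to School"]
lookup = {unsalted_sha256(guess): guess for guess in guesses}

# An unsalted hash found in a shared file is reversed by a table lookup.
leaked_hash = unsalted_sha256("Holiday Promo")
recovered = lookup.get(leaked_hash)

# The same value masked with a secret random key does not appear in the table.
secret_key = secrets.token_bytes(32)
keyed_token = hmac.new(secret_key, b"Holiday Promo", hashlib.sha256).hexdigest()
recovered_keyed = lookup.get(keyed_token)
```

Here recovered is the original campaign name, while recovered_keyed is None: without the key, the attacker cannot precompute matching tokens.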

The vault model. The original design stored masked data and lookup keys in a single combined workbook. That was simple, but it put “safe to share” and “never share” material in the same file. Claude helped reason through the threat model and spec out the separate encrypted vault design that ships in v2.0.

UI and workflow. I built the scrollable blue-and-white interface iteratively with Claude, working through the batch column selection model, the dimension/metric role system, show/hide password controls, and the status bar that confirms row and column counts on load.

The Context Layer prompt. Claude drafted the auto-generated starter text that seeds the context blurb from actual column names in the loaded file. The goal: give users something concrete to edit rather than a blank textarea, while keeping the prompt tight enough that they won’t accidentally drop in real values.
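One way such a starter could be generated, as a rough sketch: it uses only column names, never cell values, and leaves a bracketed placeholder for the user to fill in. The function name starter_blurb and the exact wording are assumptions, not the app’s actual text.

```python
def starter_blurb(masked_columns):
    """Build editable starter text from column names only (no cell values)."""
    lines = [
        "Context for the masked columns (describe categories only, "
        "no real names or IDs):"
    ]
    for column in masked_columns:
        lines.append(f'- "{column}" refers to [what kind of thing this column identifies].')
    return "\n".join(lines)


# Hypothetical column names from a loaded file.
blurb = starter_blurb(["campaign", "placement_id"])
print(blurb)
```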

Documentation. The user guide and release notes were drafted with Claude from a working spec, then corrected against actual app behavior.

Security design

A few deliberate choices distinguish this from a quick script:

The vault password is never saved or logged. It is held in memory for the duration of a restore session and then discarded. A wrong password or mismatched vault produces a clean error, not a partially restored file. Masked output and vault folders must be different; the app enforces this at save time so that the two are hard to bundle together by accident.
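The different-folders rule could be enforced with a check along these lines. This is a sketch under assumptions: check_output_folders is a hypothetical name, and comparing resolved parent directories is one reasonable way to catch paths that look different but point to the same folder.

```python
from pathlib import Path


def check_output_folders(masked_path: str, vault_path: str) -> None:
    """Raise ValueError if both outputs would land in the same folder."""
    masked_dir = Path(masked_path).expanduser().resolve().parent
    vault_dir = Path(vault_path).expanduser().resolve().parent
    if masked_dir == vault_dir:
        raise ValueError(
            "Masked output and vault must be saved to different folders."
        )


# Different folders: passes silently. The paths are illustrative.
check_output_folders("shared/masked.csv", "private/job.maskvault")
```

Resolving the paths first means that something like shared/../shared/job.maskvault is still recognized as the shared folder.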

The Context Layer explicitly warns against including sensitive values in the blurb. The point of the context is that it travels with the masked file — so it can’t contain what the masking is hiding.

Honest limitations

Masking dimensions reduces exposure but doesn’t automatically anonymize a dataset. Rare dimension combinations, free-text fields, dates, and small group sizes can still carry identifying signal. The user guide and README both call this out directly.
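As an illustration of the small-group problem, the sketch below counts how often each combination of dimension values occurs and flags the rare ones. The function small_groups, the threshold, and the sample rows are all assumptions for illustration; the app itself may not perform this check.

```python
from collections import Counter


def small_groups(rows, columns, min_size=5):
    """Return combinations of column values that occur fewer than min_size times."""
    counts = Counter(tuple(row[column] for column in columns) for row in rows)
    return {combo: n for combo, n in counts.items() if n < min_size}


# Invented masked rows: tokens hide the names, but a one-off combination stands out.
rows = [
    {"campaign": "tokA", "region": "West"},
    {"campaign": "tokA", "region": "West"},
    {"campaign": "tokB", "region": "East"},
]
risky = small_groups(rows, ["campaign", "region"], min_size=2)
```

In this sample, the single tokB/East row is flagged: even with the campaign name masked, a unique combination can point to one specific record.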

Aggregated metrics stay aggregated after restoration — row-level detail isn’t preserved if you chose to summarize. That’s a feature for some workflows and a limitation for others.
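A small worked example of why the row-level detail cannot come back, assuming a sum aggregation and invented numbers. It uses plain Python rather than pandas to stay dependency-free.

```python
from collections import defaultdict

# Invented source rows: (masked campaign token, impressions).
rows = [("tokA", 100), ("tokA", 250), ("tokB", 40)]

# Summing per token at masking time collapses the two tokA rows into one.
totals = defaultdict(int)
for token, impressions in rows:
    totals[token] += impressions
aggregated = dict(totals)

# Restoring swaps tokens for names but cannot split 350 back into 100 and 250.
mapping = {"tokA": "Spring Launch", "tokB": "Holiday Promo"}
restored = {mapping[token]: total for token, total in aggregated.items()}
```

After restoration there is one row per campaign with its total; the individual 100 and 250 values exist only in the original file.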