PHI Detection, Masking & the Unworldly Pattern

Duration: 50 min · Level: Advanced · Module: 6. HIPAA-Compliant AI Agent Deployment · Focus: PHI-detection, masking, Presidio, Unworldly, audit-trail

Learning objectives

By the end of this lesson you will be able to explain and apply:

  • PHI detection approaches
  • Presidio (Microsoft)
  • AWS Comprehend Medical
  • Synthetic replacement
  • Autosapien Unworldly pattern

Why this matters

One of the most dangerous failure modes in healthcare AI is accidental PHI leakage into logs, error messages, or model training data.

Overview

Automated PHI detection and masking at the system boundary, before any PHI enters a system that is not HIPAA-compliant, is an architectural requirement rather than an optional safeguard.
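One way to picture masking at the boundary is a logging filter that rewrites each record before any handler writes it. The sketch below uses only the Python standard library. The three patterns, the MRN format, and the names `mask_phi`, `PhiMaskingFilter`, and `build_logger` are illustrative assumptions, not a complete PHI detector.

```python
import io
import logging
import re

# Illustrative patterns only; real deployments need broader, validated coverage.
PHI_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    # Hypothetical MRN format: the literal "MRN" followed by 6 to 10 digits.
    "MRN": re.compile(r"\bMRN[:\s]*\d{6,10}\b", re.IGNORECASE),
}


def mask_phi(text: str) -> str:
    """Replace every pattern match with a typed placeholder such as [SSN]."""
    for label, pattern in PHI_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


class PhiMaskingFilter(logging.Filter):
    """Mask the fully formatted message before the handler emits it."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = mask_phi(record.getMessage())
        record.args = None  # args are already merged into msg
        return True


def build_logger(stream) -> logging.Logger:
    """Return a logger whose only handler masks PHI before writing."""
    logger = logging.getLogger("agent.boundary")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False  # keep unmasked records away from root handlers
    handler = logging.StreamHandler(stream)
    handler.addFilter(PhiMaskingFilter())
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    buffer = io.StringIO()
    log = build_logger(buffer)
    log.info("Lookup failed for %s, callback %s", "MRN: 00123456", "555-867-5309")
    print(buffer.getvalue().strip())
```

Note that this filter only covers the message text. Exception tracebacks attached through `exc_info` are formatted later by the handler and would need separate treatment.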

Key concepts

Key idea

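PHI detection approaches: rule-based (regular expressions for SSN, MRN, and phone patterns), NLP libraries and services (Microsoft Presidio, AWS Comprehend Medical), and ML classifiers (for example, BERT fine-tuned on a PHI dataset). For healthcare agent logs, combine all three in an ensemble so that a miss by one detector can be caught by another.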
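The ensemble idea can be sketched as a union of spans from several detectors, with overlapping spans merged before redaction. In this minimal example the regex detector is real, while `stub_ner_detector` is a hypothetical stand-in for an NLP or ML detector. The `Span` type and all function names are assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Span:
    start: int
    end: int
    label: str
    source: str  # which detector produced the span


Detector = Callable[[str], List[Span]]


def regex_detector(text: str) -> List[Span]:
    """Rule-based detector; patterns are illustrative, not exhaustive."""
    patterns = {
        "SSN": r"\b\d{3}-\d{2}-\d{4}\b",
        "PHONE": r"\b\d{3}[-.]\d{3}[-.]\d{4}\b",
    }
    return [
        Span(m.start(), m.end(), label, "regex")
        for label, pattern in patterns.items()
        for m in re.finditer(pattern, text)
    ]


def stub_ner_detector(text: str) -> List[Span]:
    """Hypothetical stand-in for an NLP or ML detector: a fixed name lookup."""
    known_names = ["Jane Doe"]  # toy data, not a real model
    return [
        Span(m.start(), m.end(), "PERSON", "stub_ner")
        for name in known_names
        for m in re.finditer(re.escape(name), text)
    ]


def ensemble_detect(text: str, detectors: List[Detector]) -> List[Span]:
    """Union all detector outputs and merge spans that overlap."""
    spans = sorted(
        (s for detect in detectors for s in detect(text)),
        key=lambda s: (s.start, -s.end),
    )
    merged: List[Span] = []
    for s in spans:
        if merged and s.start < merged[-1].end:
            last = merged[-1]
            # Keep the earlier span's label and extend it to cover both.
            merged[-1] = Span(last.start, max(last.end, s.end), last.label, last.source)
        else:
            merged.append(s)
    return merged


def redact(text: str, spans: List[Span]) -> str:
    """Replace spans right to left so earlier offsets stay valid."""
    for s in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:s.start] + f"[{s.label}]" + text[s.end:]
    return text


if __name__ == "__main__":
    note = "Jane Doe, SSN 123-45-6789, phone 555.867.5309"
    found = ensemble_detect(note, [regex_detector, stub_ner_detector])
    print(redact(note, found))
```

Taking the union favors recall: anything flagged by any detector is masked. That is usually the safer default for logs, at the cost of some over-masking.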

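  • Presidio (Microsoft): open-source Python library for PII/PHI detection and anonymization. It ships with built-in recognizers for many entity types and supports custom recognizers for healthcare-specific identifiers such as NPI, MRN, and DEA numbers; built-in coverage varies by version, so verify it against your own data. Production-ready and actively maintained.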
  • AWS Comprehend Medical: managed service that extracts medical entities (conditions, medications, anatomy, PHI) from clinical text. It is HIPAA-eligible and useful for parsing unstructured clinical notes.
  • Synthetic replacement: when masking PHI for logs or debugging, replace real values with realistic synthetic ones (a real name becomes a consistent fake name; a real DOB becomes a DOB shifted by ±1 to 5 years). Because the mapping is consistent, the same patient can still be traced across log lines, which preserves debuggability without exposing real PHI.
  • Autosapien Unworldly pattern: audit-trail-first architecture in which every agent action is logged as a PHI-masked summary BEFORE it executes. If the action fails, the audit trail still shows what was attempted, and no real PHI reaches the logs. Compatible with ISO/IEC 42001.
  • Model training data controls: never fine-tune models on production PHI without IRB approval and a data use agreement. Use de-identified data, or synthetic data generated from de-identified distributions, instead.
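Synthetic replacement and the audit-first idea fit together naturally, so the sketch below shows both in one place. It illustrates the pattern as described in this lesson and is not Autosapien's implementation. `SECRET_KEY`, the tiny `FAKE_NAMES` pool, the in-memory `AUDIT_LOG`, and every function name are assumptions; a real system would use a managed secret, a much larger name pool, and an append-only audit store.

```python
import hashlib
import hmac
import json
from datetime import date, timedelta
from typing import Any, Callable, Dict, List

SECRET_KEY = b"demo-key-rotate-me"  # hypothetical; load from a secrets manager
FAKE_NAMES = ["Alex Morgan", "Sam Rivera", "Jordan Lee", "Casey Kim"]  # toy pool


def _digest(value: str) -> int:
    """Keyed hash: a stable mapping that cannot be rebuilt without the key."""
    mac = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256)
    return int.from_bytes(mac.digest()[:8], "big")


def fake_name(real_name: str) -> str:
    """The same real name always maps to the same synthetic name."""
    return FAKE_NAMES[_digest(real_name) % len(FAKE_NAMES)]


def shift_dob(dob: date, patient_key: str) -> date:
    """Shift by a per-patient offset of roughly 1 to 5 years, either direction."""
    d = _digest(patient_key)
    years = 1 + d % 5
    sign = 1 if (d >> 8) % 2 else -1
    return dob + timedelta(days=sign * 365 * years)


AUDIT_LOG: List[Dict[str, Any]] = []  # stand-in for an append-only audit store


def run_audited(action: str, params: Dict[str, Any], fn: Callable[..., Any]) -> Any:
    """Write the masked audit entry BEFORE executing, then record the outcome."""
    masked = {
        "patient_name": fake_name(params["patient_name"]),
        "dob": shift_dob(params["dob"], params["patient_name"]).isoformat(),
    }
    entry = {"action": action, "params": masked, "status": "attempted"}
    AUDIT_LOG.append(entry)
    try:
        result = fn(**params)
    except Exception as exc:
        # Record only the exception type; the message itself may contain PHI.
        entry["status"] = f"failed:{type(exc).__name__}"
        raise
    entry["status"] = "succeeded"
    return result


if __name__ == "__main__":
    def fetch_chart(patient_name: str, dob: date) -> str:
        raise TimeoutError("EHR backend unavailable")

    try:
        run_audited(
            "fetch_chart",
            {"patient_name": "Jane Doe", "dob": date(1980, 5, 17)},
            fetch_chart,
        )
    except TimeoutError:
        pass
    print(json.dumps(AUDIT_LOG, indent=2))
```

Because the entry is appended before `fn` runs, a crash or timeout still leaves a record of what was attempted, and that record contains only synthetic values.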

Check your understanding

Try to recall each answer before reading it.

Q1. What do you know about PHI detection approaches?

Rule-based (regular expressions for SSN, MRN, and phone patterns), NLP libraries and services (Microsoft Presidio, AWS Comprehend Medical), and ML classifiers (for example, BERT fine-tuned on a PHI dataset). Use an ensemble of all three for healthcare agent logs.

Q2. What do you know about Presidio (Microsoft)?

An open-source Python library for PII/PHI detection and anonymization. It ships with built-in recognizers for many entity types and supports custom recognizers for healthcare-specific identifiers such as NPI, MRN, and DEA numbers. Production-ready and actively maintained.

Q3. What do you know about AWS Comprehend Medical?

A managed service that extracts medical entities (conditions, medications, anatomy, PHI) from clinical text. It is HIPAA-eligible and useful for parsing unstructured clinical notes.

Q4. What do you know about synthetic replacement?

When masking PHI for logs or debugging, replace real values with realistic synthetic ones (a real name becomes a consistent fake name; a real DOB becomes a DOB shifted by ±1 to 5 years). This preserves debuggability without exposing real PHI.

Q5. What do you know about the Autosapien Unworldly pattern?

An audit-trail-first architecture in which every agent action is logged as a PHI-masked summary BEFORE it executes. If the action fails, the audit trail shows what was attempted without real PHI in the logs. Compatible with ISO/IEC 42001.

References

  • Microsoft (2023). Presidio: Data Protection and De-identification SDK. GitHub/microsoft/presidio
