PHI Detection, Masking & the Unworldly Pattern
Duration: 50 min · Level: Advanced · Module: 6. HIPAA-Compliant AI Agent Deployment · Focus: PHI-detection, masking, Presidio, Unworldly, audit-trail
By the end of this lesson you will be able to explain and apply:
- PHI detection approaches
- Presidio (Microsoft)
- AWS Comprehend Medical
- Synthetic replacement
- Autosapien Unworldly pattern
Why this matters
One of the most dangerous failure modes in healthcare AI is accidental PHI leakage into logs, error messages, or model training data.
Overview
Automated PHI detection and masking at the system boundary, before any PHI enters a system that is not HIPAA-compliant, is therefore an architectural requirement.
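As a rough illustration of masking at the boundary, the sketch below attaches a masking filter to a Python log handler so that messages are rewritten before they are written anywhere. It uses only the standard library. The names `PHI_PATTERNS`, `mask_phi`, and `PHIMaskingFilter` are hypothetical, and the regexes (in particular the MRN format) are assumptions for demonstration, not validated detectors.

```python
import logging
import re

# Illustrative patterns only (assumptions): real deployments need broader,
# validated coverage and usually an NLP-based detector alongside regexes.
PHI_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    # Hypothetical MRN format: the literal "MRN" followed by 6 to 10 digits.
    "MRN": re.compile(r"\bMRN[:\s#]*\d{6,10}\b", re.IGNORECASE),
}


def mask_phi(text: str) -> str:
    """Replace every pattern match with a placeholder such as <SSN>."""
    for label, pattern in PHI_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text


class PHIMaskingFilter(logging.Filter):
    """Mask the fully formatted message before a handler emits it."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format first, so PHI passed via %-style arguments is masked too.
        record.msg = mask_phi(record.getMessage())
        record.args = ()
        return True  # keep the record, now masked
```

A handler would opt in with `handler.addFilter(PHIMaskingFilter())`. Note that this sketch does not touch exception tracebacks attached through `exc_info`, so error paths would need separate handling.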
Key concepts
- PHI detection approaches: rule-based (regex for SSN, MRN, and phone patterns), NLP-based (Microsoft Presidio, AWS Comprehend Medical), and ML classifiers (for example, BERT fine-tuned on a PHI dataset); for healthcare agent logs, combine all three in an ensemble so that one method's misses can be caught by another
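- Presidio (Microsoft): open-source PII detection and anonymization library; ships predefined recognizers for dozens of entity types (person names, phone numbers, US SSN, medical license numbers, and more) and supports custom recognizers for institution-specific identifiers such as MRN formats; Python library, actively maintained; detection is not guaranteed to be complete, so pair it with other safeguards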
- AWS Comprehend Medical: managed service for extracting medical entities (conditions, medications, anatomy, PHI) from clinical text; HIPAA eligible; useful for parsing unstructured clinical notes
- Synthetic replacement: when masking PHI for logs or debugging, replace with realistic synthetic values (real name → consistent fake name, real DOB → shifted DOB ±1-5 years); preserves debuggability without real PHI exposure
- Autosapien Unworldly pattern: audit-trail-first architecture in which every agent action is logged with a PHI-masked summary BEFORE execution; if the action fails, the audit trail still shows what was attempted, without real PHI in the logs; ISO/IEC 42001 compatible
- Model training data controls: never fine-tune models on production PHI without IRB approval and a data use agreement; prefer de-identified data, or synthetic data generated from de-identified distributions
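The synthetic replacement and audit-trail-first ideas in the list above can be sketched together. This is a minimal illustration under stated assumptions, not the Autosapien implementation: `run_audited`, `fake_name`, `shift_dob`, `AUDIT_LOG`, the key, and the name pool are all hypothetical. A keyed hash keeps the mapping consistent (the same real name always yields the same fake name), and the masked "attempted" entry is written before the action runs.

```python
import hashlib
import hmac
import json
from datetime import date, datetime, timedelta, timezone

# Assumption: a real deployment would load this key from a secrets manager.
PSEUDONYM_KEY = b"demo-key-not-for-production"
# Tiny illustrative pool; distinct patients can collide on the same fake name.
FAKE_NAMES = ["Alex Morgan", "Jordan Lee", "Sam Rivera", "Casey Kim"]
AUDIT_LOG: list[str] = []  # stand-in for an append-only audit store


def _digest(value: str) -> int:
    """Keyed hash: stable for a given input, hard to guess without the key."""
    mac = hmac.new(PSEUDONYM_KEY, value.encode("utf-8"), hashlib.sha256)
    return int.from_bytes(mac.digest()[:8], "big")


def fake_name(real_name: str) -> str:
    """Map a real name to a consistent fake name (case-insensitive)."""
    return FAKE_NAMES[_digest("name:" + real_name.lower()) % len(FAKE_NAMES)]


def shift_dob(dob: date, patient_id: str) -> date:
    """Shift a DOB by a fixed per-patient offset of roughly 1 to 5 years."""
    d = _digest("dob:" + patient_id)
    days = 365 + d % (4 * 365 + 1)  # 365..1825 days
    direction = 1 if (d >> 32) % 2 else -1  # later or earlier
    return dob + timedelta(days=direction * days)


def _audit(action: str, status: str, **fields) -> None:
    """Append one JSON audit entry with a UTC timestamp."""
    entry = {"ts": datetime.now(timezone.utc).isoformat(),
             "action": action, "status": status, **fields}
    AUDIT_LOG.append(json.dumps(entry))


def run_audited(action: str, params: dict, execute):
    """Write a PHI-masked audit entry BEFORE running execute(params)."""
    summary = {
        "patient": fake_name(params["patient_name"]),
        "dob": shift_dob(params["dob"], params["patient_id"]).isoformat(),
    }
    _audit(action, "attempted", summary=summary)
    try:
        result = execute(params)
    except Exception as exc:
        # Record only the exception type: its message may itself contain PHI.
        _audit(action, "failed", error_type=type(exc).__name__)
        raise
    _audit(action, "succeeded")
    return result
```

If `execute` raises, the audit store still holds the masked "attempted" entry followed by a "failed" entry, so the attempt can be reconstructed without real identifiers appearing in the log.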
Check your understanding
Try to recall each answer before reading it.
Q1. What do you know about PHI detection approaches?
rule-based (regex for SSN, MRN, and phone patterns), NLP-based (Microsoft Presidio, AWS Comprehend Medical), and ML classifiers (for example, BERT fine-tuned on a PHI dataset); for healthcare agent logs, combine all three in an ensemble
Q2. What do you know about Presidio (Microsoft)?
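open-source PII detection and anonymization library; ships predefined recognizers for dozens of entity types (person names, phone numbers, US SSN, medical license numbers, and more) and supports custom recognizers for institution-specific identifiers such as MRN formats; Python library, actively maintained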
Q3. What do you know about AWS Comprehend Medical?
managed service for extracting medical entities (conditions, medications, anatomy, PHI) from clinical text; HIPAA eligible; useful for parsing unstructured clinical notes
Q4. What do you know about Synthetic replacement?
when masking PHI for logs or debugging, replace with realistic synthetic values (real name → consistent fake name, real DOB → shifted DOB ±1-5 years); preserves debuggability without real PHI exposure
Q5. What do you know about the Autosapien Unworldly pattern?
audit-trail-first architecture in which every agent action is logged with a PHI-masked summary BEFORE execution; if the action fails, the audit trail still shows what was attempted, without real PHI in the logs; ISO/IEC 42001 compatible
References
- Presidio — Data Protection and De-identification SDK — Microsoft (2023). GitHub/microsoft/presidio