Large vision-language models (VLMs) are increasingly deployed for optical character recognition (OCR) in healthcare settings, raising critical concerns about protected health information (PHI) exposure during document processing. This work presents the first systematic evaluation of inference-time vision token masking as a privacy-preserving mechanism for medical document OCR using DeepSeek-OCR. We introduce seven masking strategies (V3-V9) targeting different architectural layers (SAM encoder blocks, compression layers, dual vision encoders, projector fusion) and evaluate PHI reduction across HIPAA-defined categories on 100 synthetic medical billing statements (drawn from a corpus of 38,517 annotated documents) with perfect ground-truth annotations. All masking strategies converge to 42.9% PHI reduction, successfully suppressing long-form, spatially distributed identifiers (patient names, dates of birth, physical addresses at 100% effectiveness) while failing to prevent leakage of short structured identifiers (medical record numbers, social security numbers, email addresses, account numbers at 0% effectiveness). Ablation studies varying the mask expansion radius (r=1, 2, 3) demonstrate that increased spatial coverage does not improve reduction beyond this ceiling, indicating that structured identifier leakage is driven by language model contextual inference rather than insufficient visual masking. A simulated hybrid architecture combining vision masking with NLP post-processing achieves 88.6% total PHI reduction (assuming 80% NLP accuracy on the remaining identifiers). This negative result establishes boundaries for vision-only privacy interventions in VLMs, provides guidance distinguishing PHI types amenable to vision-level versus language-level redaction, and redirects future research toward decoder-level fine-tuning and hybrid defense-in-depth architectures for HIPAA-compliant medical document processing.
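As a quick arithmetic check (a sketch only; the percentages are taken from the results stated above, and the assumption is that the NLP stage acts solely on PHI that vision masking missed), the 88.6% hybrid figure follows from compounding the two stages:

```python
# Worked check of the reported hybrid-architecture PHI reduction.
# Assumption: the NLP post-processing stage only sees identifiers
# that survived the vision-masking stage.

vision_reduction = 0.429   # PHI removed by vision token masking (reported)
nlp_accuracy = 0.80        # assumed NLP accuracy on remaining identifiers

remaining = 1.0 - vision_reduction            # PHI surviving the vision stage
hybrid = vision_reduction + remaining * nlp_accuracy

print(f"total PHI reduction: {hybrid:.1%}")   # total PHI reduction: 88.6%
```

This compounding model treats the two stages as independent filters over disjoint residual PHI, which matches the paper's framing of the NLP stage as post-processing on the OCR output.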