ASQ-PHI (Adversarial Synthetic Queries for Protected Health Information de-identification) is a dataset developed to address an emerging gap in clinical AI workflows. Hospitals increasingly deploy HIPAA-compliant large language models (LLMs) under Business Associate Agreements (BAAs), which allow clinicians to include Protected Health Information (PHI) in their prompts. However, these models are trained on static data with a temporal knowledge cutoff, so clinicians often turn to external tools such as live web search for recent medical evidence. Those external systems are typically not covered by BAAs, creating a “safe handoff” point at which PHI must be removed before data leaves the compliant environment.
Despite the importance of this handoff, current de-identification methods are poorly suited to it. Most were developed for long-form electronic health record (EHR) narratives such as discharge summaries, whereas real-world LLM use consists largely of short, query-style prompts.
Moreover, access to real clinician queries is restricted by institutional oversight and privacy regulations. To address this gap, ASQ-PHI provides a fully synthetic, publicly shareable dataset of clinician-style queries containing PHI, paired with de-identification labels.
The dataset contains 1,051 single-turn clinical queries designed to resemble real prompts entered into clinical LLM systems. Of these, 832 (79.2%) contain PHI, while 219 (20.8%) are hard negatives.
Across the dataset there are 2,973 annotated PHI elements covering 13 HIPAA Safe Harbor identifier types, including names, dates, medical record numbers, phone numbers, and geographic locations. Each query is paired with machine-readable JSON annotations that specify both the identifier type and the exact text span, enabling precise evaluation of de-identification systems.
A key strength of ASQ-PHI is its structure. Every record is split into a query section and a PHI annotation section by simple delimiters, making it easy to adapt for different evaluation tasks. The inclusion of both PHI-positive queries and carefully constructed hard negatives lets investigators assess not only how well systems remove PHI but also whether they over-redact harmless information.
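The exact file layout is not published in this summary, but a record of the kind described, with a delimiter separating the query from its JSON span annotations, might be parsed as follows. The delimiter string, field names, and the sample query itself are illustrative assumptions, not the dataset's actual schema:

```python
import json

# Hypothetical delimiter and annotation fields -- the real ASQ-PHI schema may differ.
DELIMITER = "---ANNOTATIONS---"

record = """My patient John Carter (MRN 4482913) was seen on 03/12/2024 for chest pain.
What does recent evidence say about troponin cutoffs?
---ANNOTATIONS---
[{"type": "NAME", "start": 11, "end": 22, "text": "John Carter"},
 {"type": "MEDICAL_RECORD_NUMBER", "start": 28, "end": 35, "text": "4482913"},
 {"type": "DATE", "start": 49, "end": 59, "text": "03/12/2024"}]"""

def parse_record(raw: str):
    """Split a record into its query text and its PHI span annotations."""
    query, _, annot_block = raw.partition(DELIMITER)
    query = query.strip()
    spans = json.loads(annot_block)
    # Sanity check: each annotated span must match the query text it points at.
    for s in spans:
        assert query[s["start"]:s["end"]] == s["text"], s
    return query, spans

query, spans = parse_record(record)
print(len(spans))  # 3 annotated PHI elements in this record
```

Character-offset annotations like these are what make span-exact evaluation possible, since a scorer can compare predicted and gold offsets directly rather than fuzzy-matching surface strings.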
The dataset was produced with an adversarial prompting pipeline built on GPT-4o via Azure OpenAI. High-temperature sampling (0.9) was used to generate diverse, challenging query phrasings and realistic test cases.
The generation process included automated validation steps to maintain a balanced proportion of hard negatives, reject malformed records, and bound PHI density per query. The release also provides an interactive Jupyter notebook and supporting code, making the pipeline reproducible and easy to adapt to domain-specific datasets.
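The paper's validation code is not reproduced here, but checks of the kind described, such as a target hard-negative ratio, well-formed records, and bounded PHI density, could be sketched like this. The thresholds and record fields are assumptions for illustration only:

```python
# Illustrative validation pass over generated records; the actual pipeline's
# thresholds and record schema are not specified in the summary above.
def validate_batch(records, target_neg_ratio=0.20, tolerance=0.05, max_phi_per_query=8):
    """Keep well-formed records and report whether the hard-negative ratio is on target.

    Each record is assumed to be a dict like {"query": str, "phi_spans": list}.
    """
    kept = []
    for r in records:
        if not isinstance(r.get("query"), str) or not r["query"].strip():
            continue  # malformed: missing or empty query text
        spans = r.get("phi_spans")
        if not isinstance(spans, list) or len(spans) > max_phi_per_query:
            continue  # malformed annotations or excessive PHI density
        kept.append(r)
    negatives = sum(1 for r in kept if not r["phi_spans"])
    ratio = negatives / len(kept) if kept else 0.0
    balanced = abs(ratio - target_neg_ratio) <= tolerance
    return kept, ratio, balanced

batch = [
    {"query": "Latest guidance on sepsis bundles?", "phi_spans": []},  # hard negative
    {"query": "Dosing for Jane Roe, DOB 1/2/1960?",
     "phi_spans": [{"type": "NAME"}, {"type": "DATE"}]},
    {"query": "", "phi_spans": []},  # malformed: empty query, gets dropped
]
kept, ratio, balanced = validate_batch(batch)
print(len(kept), round(ratio, 2))  # 2 0.5
```

In a real pipeline, an unbalanced batch would trigger further generation rounds targeted at the underrepresented class rather than simply being rejected.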
The authors demonstrated practical use by evaluating a commercial PHI detection system on ASQ-PHI, which revealed the trade-off between accurate PHI detection and over-masking. Expert review confirmed high quality, with 98% annotation accuracy and 96% clinical plausibility. However, the dataset is synthetic and English-only and may not reflect real-world diversity, so external validation is still required. ASQ-PHI is an effective benchmark for PHI removal in clinician-style queries and can support safer clinical AI use.
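The evaluation protocol itself is not detailed in this summary, but a span-level scorer of the kind such a benchmark supports, rewarding PHI detection while penalizing over-redaction on hard negatives, might look like the sketch below. Exact matching on character offsets is an assumption; the original evaluation may use different match criteria:

```python
def score_detections(gold_spans, predicted_spans):
    """Span-level precision/recall using exact (start, end) offset matching.

    gold_spans / predicted_spans: sets of (start, end) character offsets.
    On a hard negative (empty gold set), any prediction is an over-redaction.
    """
    gold, pred = set(gold_spans), set(predicted_spans)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 1.0
    recall = tp / len(gold) if gold else 1.0
    over_redactions = len(pred - gold) if not gold else 0
    return precision, recall, over_redactions

# PHI-positive query: one span found, one missed.
p, r, over = score_detections({(11, 22), (49, 59)}, {(11, 22)})
print(p, r, over)  # 1.0 0.5 0

# Hard negative: the system wrongly masked harmless text.
p, r, over = score_detections(set(), {(0, 4)})
print(p, r, over)  # 0.0 1.0 1
```

Scoring the two query types separately is what surfaces the detection-versus-over-masking balance the authors report: recall on PHI-positive queries measures removal quality, while the over-redaction count on hard negatives measures unnecessary masking.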
Reference: Weatherhead J, Golovko G, McCaffrey P. ASQ-PHI: An adversarial synthetic data benchmark for clinical de-identification and search utility. Data Brief. 2026;65:112586. doi:10.1016/j.dib.2026.112586