De-identifying DICOM files

DICOM files may contain sensitive, personally identifiable information (PII) about patients, including their name, date of birth, and medical record number. In these cases it is essential to anonymize each file to protect patient privacy and to comply with the legal and ethical regulations that govern healthcare data.

🚧

Caution

This is a premium feature and running it incurs additional costs. Contact us at [email protected] to learn about pricing.

In this tutorial you will learn how to anonymize / de-identify DICOM files in two steps:

  1. Setting up the de-identification function.
  2. Using the de-identification function.

Finally, you will learn how to interpret the JSON output of the de-identification process.

ℹ️

Note

Redacted tag values are completely removed.

Set up the de-identification function

Adjust the de-identification function found here to suit your needs.

De-identifying

The Python script below defines the evaluation criteria and calls the de-identification function.

👍

Tip

An SSH public / private key pair is required to use the sample script below. To learn how to generate one, see our documentation here.

👍

Tip

Make sure to edit the criteria used to evaluate each file to suit your needs - any number of criteria can be used.


import glob
import os
from typing import List
from encord import EncordUserClient
from encord.objects.common import (
    DeidentifyRedactTextMode,
    SaveDeidentifiedDicomConditionNotSubstr,
    SaveDeidentifiedDicomConditionIn
)

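# Map the .dcm files found in a local directory to their URLs in cloud storage.
# Replace s3://EXAMPLE-BUCKET/raw/ with the path to the file storage you're using
def filelist_helper(local_dir, prefix='s3://EXAMPLE-BUCKET/raw/'):
    # Append each local file name to the storage prefix
    return [prefix + os.path.basename(f) for f in glob.glob(os.path.join(local_dir, "*.dcm"))]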

# Add criteria to evaluate each file. See the 'Setting evaluation criteria' section below for more info. 
criteria = [
    SaveDeidentifiedDicomConditionNotSubstr("PRIMARY","ImageType"),              
    SaveDeidentifiedDicomConditionIn(["ct","pt","nm","mr","mg","pt"],"Modality") 
]

def deidentify(
    integration_title: str,
    dicom_urls: List[str],
) -> List[str]:

    # Authenticate with Encord using the path to your private key
    user_client = EncordUserClient.create_with_ssh_private_key(ssh_private_key_path="<private_key_path>")

    integration_hash = None

    # Find integration_hash for requested integration_title
    for integration in user_client.get_cloud_integrations():
        if integration.title == integration_title:
            integration_hash = integration.id

    if not integration_hash:
        raise Exception(f"Integration with integration_title={integration_title} not found")

    deidentified_dicom_urls = []

    # 'dicom_urls' should be a single list containing the URLs of all instances of a series to be de-identified.
    # Splitting a series into multiple lists might lead to inaccurate results and is therefore not recommended.
    deidentified_dicom_url = user_client.deidentify_dicom_files(
        dicom_urls=dicom_urls,
        integration_hash=integration_hash,
        redact_dicom_tags=True,
        redact_pixels_mode=DeidentifyRedactTextMode.REDACT_ALL_TEXT,
        save_conditions=criteria,
        upload_dir="s3://EXAMPLE-BUCKET/output",
    )
    print(f"Deidentified url: {deidentified_dicom_url}")
    deidentified_dicom_urls += deidentified_dicom_url

    return deidentified_dicom_urls

# Replace MY-ENCORD-INTEGRATION with the title of your private cloud integration
_integration_title = "MY-ENCORD-INTEGRATION"

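# Build the list of file URLs for one series. Replace <local_dicom_dir> with the
# local directory that mirrors the files stored in your bucket.
_dicom_urls = filelist_helper("<local_dicom_dir>/")

_deidentified_dicom_urls = deidentify(
    _integration_title,
    _dicom_urls,
)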
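
The comment in the script above recommends passing every instance of a series in a single list. As a minimal sketch, assuming you already know each file's SeriesInstanceUID (for example from your own index), the hypothetical helper below groups URLs by series so that the de-identification function can be called once per series. The helper name, the example URLs and the example UIDs are illustrative only and are not part of the Encord SDK.

```python
from collections import defaultdict
from typing import Dict, List


def group_urls_by_series(series_uid_by_url: Dict[str, str]) -> Dict[str, List[str]]:
    """Group file URLs by SeriesInstanceUID (hypothetical helper, not an Encord API)."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for url, series_uid in series_uid_by_url.items():
        grouped[series_uid].append(url)
    return dict(grouped)


# Illustrative input: the URLs and UIDs below are made up for this example
series_uid_by_url = {
    "s3://EXAMPLE-BUCKET/raw/a1.dcm": "1.2.3",
    "s3://EXAMPLE-BUCKET/raw/a2.dcm": "1.2.3",
    "s3://EXAMPLE-BUCKET/raw/b1.dcm": "1.2.4",
}

urls_by_series = group_urls_by_series(series_uid_by_url)
```

Each list in `urls_by_series` could then be passed as `dicom_urls` in its own call to `deidentify`.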

Setting evaluation criteria

Evaluation criteria are conditions that determine whether a file will be de-identified or not. Criteria can take many forms, but will always return either 'true' or 'false'.

ℹ️

Note

All strings and inputs into the criteria functions are case-insensitive.

There are two distinct criteria functions:

  • "SaveDeidentifiedDicomConditionNotSubstr" returns 'true' if the first argument (PRIMARY in the example above) is not contained in the value of the DICOM tag named by the second argument (ImageType in the example above). In plain English, the example above checks that the file's ImageType does not contain the word 'PRIMARY', and returns 'true' if this condition is fulfilled.

  • "SaveDeidentifiedDicomConditionIn" returns 'true' if any string in the first argument (["ct","pt","nm","mr","mg","pt"] in the example above) is contained in the value of the DICOM tag named by the second argument (Modality in the example above). In plain English, the example above checks whether the file's Modality contains any of the strings in the list. If any one of them matches, the function returns 'true'.
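
To make the semantics described above concrete, here is a minimal pure-Python sketch of how the two criteria could be evaluated against a file's tags. The functions `not_substr`, `is_in` and `file_passes` are hypothetical stand-ins written for this illustration; they are not the Encord classes and do not call the Encord service. The sketch assumes case-insensitive substring matching, as described above, and assumes that a file passes only when every criterion returns 'true'.

```python
from typing import Callable, Dict, List

Tags = Dict[str, str]
Criterion = Callable[[Tags], bool]


def not_substr(value: str, dicom_tag: str) -> Criterion:
    """True when `value` does not occur in the tag's value (case-insensitive)."""
    def check(tags: Tags) -> bool:
        return value.lower() not in tags[dicom_tag].lower()
    return check


def is_in(values: List[str], dicom_tag: str) -> Criterion:
    """True when any listed string occurs in the tag's value (case-insensitive)."""
    def check(tags: Tags) -> bool:
        tag_value = tags[dicom_tag].lower()
        return any(v.lower() in tag_value for v in values)
    return check


def file_passes(tags: Tags, criteria: List[Criterion]) -> bool:
    """Assumption: a file passes only if every criterion evaluates to True."""
    return all(criterion(tags) for criterion in criteria)


example_criteria = [
    not_substr("PRIMARY", "ImageType"),
    is_in(["ct", "pt", "nm", "mr", "mg"], "Modality"),
]

# Illustrative tag values, made up for this example
secondary_mr = {"ImageType": "DERIVED\\SECONDARY", "Modality": "MR"}
primary_mr = {"ImageType": "ORIGINAL\\PRIMARY", "Modality": "MR"}
secondary_us = {"ImageType": "DERIVED\\SECONDARY", "Modality": "US"}

results = [file_passes(t, example_criteria) for t in (secondary_mr, primary_mr, secondary_us)]
```

In this sketch only `secondary_mr` passes: `primary_mr` fails the ImageType check and `secondary_us` fails the Modality check.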

Output

This section explains the output of the de-identification function. A sample output file is shown below.

Sample JSON
{
    "url": "s3://EXAMPLE-BUCKETt/raw/should_fail_1.2.840.113619.2.283.6945.3146400.16119.1391477777.833.dcm",
    "StudyInstanceUID": "1.2.840.113619.6.283.4.679947340.8065.1391798290.307",
    "StudyInstanceUID_deid": "1.2.840.113619.6.283.4.679947340.8065.1391798290.307",
    "SeriesInstanceUID": "1.2.840.113619.2.283.6945.3146400.21673.1391477673.97",
    "SeriesInstanceUID_deid": "1.2.840.113619.2.283.6945.3146400.21673.1391477673.97",
    "save_conditions_evaluations":
    [
        {
            "condition":
            {
                "value": "PRIMARY",
                "condition_type": "NOT_SUBSTR",
                "dicom_tag": "ImageType"
            },
            "tag_value": "DEMOGRAPHICDATA",
            "evaluated_value": false
        },
        {
            "condition":
            {
                "value":
                [
                    "ct",
                    "pt",
                    "nm",
                    "mr",
                    "xr",
                    "mg"
                ],
                "condition_type": "IN",
                "dicom_tag": "Modality"
            },
            "tag_value": "MR",
            "evaluated_value": true
        }
    ],
    "save_condition": false,
    "save_condition_series_agg": false,
    "save_disabling_urls_series_agg":
    [
        "s3://EXAMPLE-BUCKETt/raw/should_fail_1.2.840.113619.2.283.6945.3146400.16119.1391477777.833.dcm"
    ],
    "url_deid": "s3://EXAMPLE-BUCKETt/Pixel-redaction-ready/deid_169089614143326592_should_fail_1.2.840.113619.2.283.6945.3146400.16119.1391477777.833.dcm"
}
| Key | Description | Notes |
|---|---|---|
| url | A URL to the DICOM file | |
| StudyInstanceUID | The study's ID | |
| StudyInstanceUID_deid | A converted version of the study's ID used by Encord | Identical to StudyInstanceUID, unless StudyInstanceUID was invalid |
| SeriesInstanceUID | The series' ID | |
| SeriesInstanceUID_deid | A converted version of the series' ID used by Encord | Identical to SeriesInstanceUID, unless SeriesInstanceUID was invalid |
| save_conditions_evaluations | Contains a list of conditions | These need to be met in order for a file to be de-identified |
| condition | Contains a given condition's details | A condition is satisfied when the value fulfils the condition_type |
| value | The elements being evaluated by the condition_type | Can be thought of as the 'answer' to a condition |
| condition_type | The type of condition being evaluated | Can be thought of as the 'question' to a condition |
| dicom_tag | The DICOM tag on which a condition is being evaluated | |
| save_condition | Whether the condition has to be true or false for the file to pass | |
| save_condition_series_agg | The series as a whole evaluated as true or false | For this to be true, all save_conditions need to be true. For this to be false, all save_conditions need to be false. |
| save_disabling_urls_series_agg | A list of files which did not meet the required save_condition | Only relevant if the series didn't pass |
| url_deid | The URL of the new, de-identified file | |
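
As a rough sketch of how this output might be inspected, the snippet below lists the conditions that evaluated to false for one record. It assumes the output has already been loaded into a Python dict (for example with `json.load`) and uses a trimmed, made-up record with the same keys as the sample above. The helpers `failed_conditions` and `describe` are hypothetical and not part of the Encord SDK.

```python
from typing import Any, Dict, List

# Trimmed, illustrative record with the same keys as the sample output
record: Dict[str, Any] = {
    "url": "s3://EXAMPLE-BUCKET/raw/example.dcm",
    "save_conditions_evaluations": [
        {
            "condition": {"value": "PRIMARY", "condition_type": "NOT_SUBSTR", "dicom_tag": "ImageType"},
            "tag_value": "PRIMARY",
            "evaluated_value": False,
        },
        {
            "condition": {"value": ["ct", "mr"], "condition_type": "IN", "dicom_tag": "Modality"},
            "tag_value": "MR",
            "evaluated_value": True,
        },
    ],
    "save_condition": False,
}


def failed_conditions(rec: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the evaluations whose 'evaluated_value' is false."""
    return [e for e in rec["save_conditions_evaluations"] if not e["evaluated_value"]]


def describe(evaluation: Dict[str, Any]) -> str:
    """Format one evaluation as a short human-readable line."""
    cond = evaluation["condition"]
    return (
        f'{cond["dicom_tag"]} {cond["condition_type"]} {cond["value"]!r} '
        f'(tag value: {evaluation["tag_value"]!r})'
    )


# Print the reason(s) a file did not pass
if not record["save_condition"]:
    for evaluation in failed_conditions(record):
        print(f'{record["url"]}: {describe(evaluation)}')
```

The same loop could be run over every record of a series to see which files appear in `save_disabling_urls_series_agg` and why.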