Remove duplicate images

The presence of duplicate or closely similar images can introduce bias in deep learning models. Encord Active provides the capability to identify and eliminate duplicate or near-duplicate images from datasets. This process contributes to enhancing data quality by removing redundant instances, ultimately leading to improved model performance.

In this workflow, the Uniqueness quality metric is used to identify duplicate and near-duplicate images.

Uniqueness metric

The Uniqueness metric evaluates all images within the dataset and assigns a uniqueness score to each, indicating their distinctiveness.

  • The uniqueness score falls within the [0,1] range. A higher score indicates a more unique image. The Duplicates summary on the Data > Overview tab counts images with a score between 0 and 0.0001.

  • A score of zero means the dataset contains at least one identical image. In a group of N identical images, N-1 are assigned a score of zero and one keeps a non-zero score, so the zero-scored copies can be excluded from the dataset.

  • Near-duplicate images are labeled as Near-duplicate image and are presented side by side in the Explorer's grid view. This setup simplifies the decision-making process when selecting which image to keep and which one to remove.
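The N-1 rule described above can be illustrated with a minimal sketch. This is not Encord Active's algorithm; it assumes, purely for illustration, that "identical" means byte-identical files, and the function name `mark_exact_duplicates` is hypothetical. It only shows why one copy in each group survives.

```python
import hashlib
from collections import defaultdict


def mark_exact_duplicates(images: dict[str, bytes]) -> dict[str, bool]:
    """Return {image_name: is_redundant} for byte-identical images.

    In each group of N identical images, the first name (in sorted order)
    is kept and the other N-1 are flagged as redundant.
    """
    groups = defaultdict(list)
    for name in sorted(images):
        digest = hashlib.sha256(images[name]).hexdigest()
        groups[digest].append(name)

    redundant = {}
    for names in groups.values():
        for position, name in enumerate(names):
            redundant[name] = position > 0  # only the first copy is kept
    return redundant


# Illustrative data: three identical images and one distinct image.
flags = mark_exact_duplicates(
    {"a.png": b"\x01\x02", "b.png": b"\x01\x02", "c.png": b"\x01\x02", "d.png": b"\x09"}
)
```

Here two of the three identical images are flagged, which mirrors N-1 images receiving a score of zero.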

Quick Tour

All the sections in the Quick Tour assume that you are already in a Project.

👍

Tip

Choose any image in the Explorer workspace and click its Similar items button. This displays images similar to the selected one, including any duplicates.

Explorer

The Explorer page has three areas that can help you find duplicate images in your Project.

1: Duplicates Shortcut

The Duplicates shortcut is on the Overview tab. It highlights images with a Uniqueness value between 0 and 0.0001 as duplicates. You can adjust this range from the Filter tab.

Duplicates shortcut

2: Sorting by `Uniqueness`

The entire Project can be sorted by Uniqueness. Sort by ascending order to display duplicates first.

Sorting by `Uniqueness`

3: Filtering by `Uniqueness`

Filter the entire project using Uniqueness.

Go to Filter tab > Add Filter > Data Quality Metrics > Uniqueness. A small histogram appears above the filter.

You can then change the filter settings to specify a range closer to 0.

Filtering by `Uniqueness`
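Outside the UI, sorting and filtering follow the same logic. The sketch below assumes you already have a plain mapping of image names to Uniqueness scores. The mapping, the function name `filter_by_uniqueness`, and the inclusive bounds are illustrative assumptions, not part of any Encord API.

```python
def filter_by_uniqueness(
    scores: dict[str, float], low: float = 0.0, high: float = 0.0001
) -> list[str]:
    """Return image names whose score lies in [low, high], lowest score first."""
    selected = [name for name, score in scores.items() if low <= score <= high]
    return sorted(selected, key=lambda name: scores[name])


# Illustrative scores only.
example_scores = {"a.png": 0.0, "b.png": 0.00005, "c.png": 0.03, "d.png": 0.8}

default_range = filter_by_uniqueness(example_scores)            # 0 to 0.0001
wider_range = filter_by_uniqueness(example_scores, high=0.05)   # 0 to 0.05
```

Widening the upper bound pulls in more candidates, just as adjusting the filter range in the Explorer does.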

Analytics

In a Project, go to the Analytics page and pick the Uniqueness quality metric for the Metric Distribution section.

The chart displays the distribution of data based on the Uniqueness scores.
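A distribution chart like this can be thought of as a histogram over the [0,1] score range. As a rough sketch, assuming a plain list of scores and ten equal-width bins (both choices are illustrative, and `uniqueness_histogram` is a hypothetical helper):

```python
from collections import Counter


def uniqueness_histogram(scores: list[float], bins: int = 10) -> list[int]:
    """Count scores per equal-width bin over [0, 1]; a score of 1.0 goes in the last bin."""
    counts = Counter(min(int(score * bins), bins - 1) for score in scores)
    return [counts.get(index, 0) for index in range(bins)]


# Illustrative scores only.
histogram = uniqueness_histogram([0.0, 0.0, 0.03, 0.45, 0.5, 0.99, 1.0])
```

A tall first bin suggests that many images have scores near zero and are worth checking for duplicates.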

Remove duplicate images

To remove duplicate images from a dataset, tag the duplicates and then create a Collection that excludes them.

To remove duplicate images from your Project:
  1. Log in to the Encord platform.
    The landing page for the Encord platform appears.

  2. Click Active in the main menu.
    The landing page for Active appears.

  3. Click the Project.
    The landing page for the Project appears with the Explorer tab and Data selected.

  4. Click the Duplicates shortcut under the Overview tab.
    The Duplicates shortcut applies the Uniqueness filter to all images in the Project. The Uniqueness filter returns images with a Uniqueness value between 0 and 0.0001.

  5. Sort the filtered data in ascending order by Uniqueness.

  6. Adjust the Uniqueness filter from the default value to find all the duplicate images in the Project.
    As you adjust the filter, the images that appear in the Explorer workspace change.

  7. Select one image, and then select all images.

  8. Click the Add to a Collection button to create a Collection.

  9. Click New Collection.

  10. Name the Collection Duplicates.
    All selected images have the tag Duplicates applied to them.

  11. Reset all Filters.

  12. Add a Collections filter that excludes Duplicates.

  13. Select one image, and then select all images.

  14. Click the Add to a Collection button to create a Collection.

  15. Click New Collection.

  16. Specify a meaningful name for the Collection.

  17. Go to the Collections page.

  18. Select the Collection that excludes Duplicates.

  19. Click Create Dataset.

  20. Specify a meaningful name and description for the Dataset and Project.

  21. Click Submit.
    The Dataset and Project appear in Annotate.
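Conceptually, the steps above tag the duplicates and then build a second Collection from everything that does not carry that tag. A minimal sketch of that split, using plain Python lists rather than any Encord API (the function name and the `Deduplicated` Collection name are hypothetical):

```python
def split_collections(
    all_images: list[str], duplicates: list[str]
) -> dict[str, list[str]]:
    """Split images into a 'Duplicates' group and a group that excludes them."""
    duplicate_set = set(duplicates)
    keep = [image for image in all_images if image not in duplicate_set]
    return {"Duplicates": sorted(duplicate_set), "Deduplicated": keep}


# Illustrative data only.
collections = split_collections(
    all_images=["a.png", "b.png", "c.png", "d.png"],
    duplicates=["b.png", "c.png"],
)
```

The `Deduplicated` group corresponds to the Collection used to create the new Dataset.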

Incorporating this workflow into dataset management strategies can significantly enhance data quality, eliminate redundancies, and contribute to more accurate model training and evaluation.

Remove near-duplicate images

An example of near-duplicate image pairs detected with Encord Active

Near-duplicate images are images that differ only slightly from one another, for example through shifts, blurriness, or distortion. Like exact duplicates, they should be removed from the dataset, but you must decide which image to keep and which to discard. Near-duplicates have Uniqueness scores slightly greater than 0 and appear next to one another in the Explorer grid view, which makes them easy to compare.

  1. Log in to the Encord platform.
    The landing page for the Encord platform appears.

  2. Click Active in the main menu.
    The landing page for Active appears.

  3. Click the Project.
    The landing page for the Project appears with the Explorer tab and Data selected.

  4. Click the Duplicates shortcut under the Overview tab.
    The Duplicates shortcut applies the Uniqueness filter to all images in the Project. The Uniqueness filter returns images with a Uniqueness value between 0 and 0.0001.

  5. Sort the filtered data in ascending order by Uniqueness.

  6. Adjust the Uniqueness filter from the default value to a range of 0 to 0.05.

  7. Examine the images in the Explorer workspace and select the images you want removed from the Project.

  8. Click the Add to a Collection button to create a Collection.

  9. Click New Collection.

    ℹ️

    Note

    If you already have a Collection called Duplicates, add the images to the existing Collection and go to step 11.

  10. Name the Collection Duplicates.
    All selected images have the tag Duplicates applied to them.

  11. Reset all Filters.

  12. Add a Collections filter that excludes Duplicates.

  13. Select one image, and then select all images.

  14. Click the Add to a Collection button to create a Collection.

  15. Click New Collection.

  16. Specify a meaningful name for the Collection.

  17. Go to the Collections page.

  18. Select the Collection that excludes Duplicates.

  19. Click Create Dataset.

  20. Specify a meaningful name and description for the Dataset and Project.

  21. Click Submit.
    The Dataset and Project appear in Annotate.
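The difference between the two workflows can be summarized as a triage: images in the default range (0 to 0.0001) are treated as duplicates, while images up to 0.05 are only candidates that a person still has to review. The sketch below uses the thresholds from the steps above; the score mapping and the function name `triage` are illustrative assumptions.

```python
def triage(
    scores: dict[str, float], exact_max: float = 0.0001, near_max: float = 0.05
) -> tuple[list[str], list[str]]:
    """Return (duplicates, review_candidates), each ordered by ascending score."""
    duplicates, review_candidates = [], []
    for name, score in sorted(scores.items(), key=lambda item: item[1]):
        if score <= exact_max:
            duplicates.append(name)
        elif score <= near_max:
            # Near-duplicate candidate: a person decides which image to keep.
            review_candidates.append(name)
    return duplicates, review_candidates


# Illustrative scores only.
duplicates, review_candidates = triage(
    {"a.png": 0.0, "b.png": 0.02, "c.png": 0.04, "d.png": 0.6}
)
```

Only the first list is safe to exclude without inspection; the second list corresponds to the images you examine side by side in the Explorer.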

With these actions, users can efficiently manage near-duplicate images and improve dataset quality.

