Update Readme to add a new image.
Former-commit-id: a6b3315
tanujjain committed Oct 1, 2019
1 parent b552cc2 commit cff2981
Showing 9 changed files with 63 additions and 161 deletions.
10 changes: 5 additions & 5 deletions CONTRIBUTING.md
@@ -1,6 +1,6 @@
# Contribution Guide

We welcome any contributions whether it's,
We welcome contributions of any kind, including:

- Submitting feedback
- Fixing bugs
@@ -9,7 +9,7 @@ We welcome any contributions whether it's,
Please read this guide before making any contributions.

#### Submit Feedback
The feedback should be submitted by creating an issue at [GitHub issues](https://github.com/idealo/image-dedup/issues).
The feedback should be submitted by creating an issue on [GitHub issues](https://github.com/idealo/image-dedup/issues).
Select the related template (bug report, feature request, or custom) and add the corresponding labels.

#### Fix Bugs:
@@ -19,9 +19,9 @@ You may look through the [GitHub issues](https://github.com/idealo/image-dedup/i
You may look through the [GitHub issues](https://github.com/idealo/image-dedup/issues) for feature requests.

## Pull Requests (PR)
1. Fork the repository and a create a new branch from the master branch.
2. For bug fixes, add new tests and for new features please add changes to the documentation.
3. Do a PR from your new branch to our `dev` branch of the original Image Super-Resolution repo.
1. Fork the repository and create a new branch from the master branch.
2. For bug fixes, add new tests; for new features, please update the documentation.
3. Open a PR from your new branch to the `dev` branch of the original Imagededup repo.

## Documentation
- Make sure any new function or class you introduce has proper docstrings.
200 changes: 51 additions & 149 deletions README.md
@@ -1,47 +1,34 @@
# imagededup
# Image Deduplicator (imagededup)

Finding duplicates in an image dataset is a recurring task. imagededup is a Python package that provides functionality
to carry out this task effectively. The deduplication problem generally falls into two broad categories:

* Finding exact duplicates

<p align="center">
<img src="readme_figures/103500.jpg" width="300" />
<img src="readme_figures/103500.jpg" width="300" />
</p>

* Finding near duplicates
imagededup is a Python package that simplifies the task of finding **exact** and **near duplicates** in an image collection.

<p align="center">
<img src="readme_figures/103500.jpg" width="300" />
<img src="readme_figures/103501.jpg" width="300" />
<img src="readme_figures/mona_lisa.png" width="600" />
</p>

Traditional methods such as hashing algorithms are particularly good at finding exact duplicates while more modern
methods involving convolutional neural networks are also adept at finding near duplicates due to their ability to
capture basic contours in images.
This package provides functionality to make use of hashing algorithms, which are particularly good at finding exact
duplicates, as well as convolutional neural networks, which are also adept at finding near duplicates. Additionally, an
evaluation framework is provided to judge the quality of deduplication for a given dataset.

This package provides functionality to address both problems. Additionally, an evaluation framework is also provided to
judge the quality of deduplication. Following details the functionality provided by the package:
The following details the functionality provided by the package:

- Finding duplicates in a directory using one of the following algorithms:
- [Convolutional Neural Network](https://arxiv.org/abs/1704.04861)
- [Perceptual hashing](http://www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html)
- [Difference hashing](http://www.hackerfactor.com/blog/index.php?/archives/529-Kind-of-Like-That.html)
- [Wavelet hashing](https://fullstackml.com/wavelet-image-hash-in-python-3504fdd282b5)
- [Average hashing](http://www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html)
- Generation of features for images using one of the above stated algorithms.
- [Convolutional Neural Network](https://arxiv.org/abs/1704.04861) (CNN)
- [Perceptual hashing](http://www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html) (PHash)
- [Difference hashing](http://www.hackerfactor.com/blog/index.php?/archives/529-Kind-of-Like-That.html) (DHash)
- [Wavelet hashing](https://fullstackml.com/wavelet-image-hash-in-python-3504fdd282b5) (WHash)
- [Average hashing](http://www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html) (AHash)
- Generation of encodings for images using one of the above stated algorithms.
- Framework to evaluate effectiveness of deduplication given a ground truth mapping.
- Plotting duplicates found for a given image file.

Detailed documentation for the package can be found at: [https://idealo.github.io/imagededup/](https://idealo.github.io/imagededup/)

imagededup is compatible with Python 3.6 and is distributed under the Apache 2.0 license.

## Table of contents
## Contents
- [Installation](#installation)
- [Finding duplicates](#finding-duplicates)
- [Feature generation](#feature-generation)
- [Evaluation of deduplication](#evaluation-of-deduplication-quality)
- [Plotting duplicates](#plotting-duplicates-of-an-image)
- [Quick Start](#quick-start)
- [Contribute](#contribute)
- [Citation](#citation)
- [Maintainers](#maintainers)
@@ -59,150 +46,64 @@ pip install imagededup
Install imagededup from the GitHub source:

```
git clone https://github.com/idealo/image-dedup.git
cd image-dedup
git clone https://github.com/idealo/imagededup.git
cd imagededup
python setup.py install
```

## Quick start
### Finding duplicates
There are two methods available to perform deduplication:

- [find_duplicates()](#find_duplicates)
- [find_duplicates_to_remove()](#find_duplicates_to_remove)
To find duplicates in an image directory using perceptual hashing, the following workflow can be used:

- Import perceptual hashing method

#### find_duplicates
To deduplicate an image directory using perceptual hashing:
```python
from imagededup.methods import PHash
phasher = PHash()
duplicates = phasher.find_duplicates(image_dir='path/to/image/directory', max_distance_threshold=15)
```
Other hashing methods can be used instead of PHash: AHash, DHash, WHash.
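For instance, a minimal sketch that swaps in difference hashing (assuming DHash exposes the same *find_duplicates* signature as PHash above) could look like:
```python
from imagededup.methods import DHash

# Same workflow as the PHash example above, only the hashing class changes
dhasher = DHash()
duplicates = dhasher.find_duplicates(image_dir='path/to/image/directory', max_distance_threshold=15)
```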

To deduplicate an image directory using cnn:
```python
from imagededup.methods import CNN
cnn_encoder = CNN()
duplicates = cnn_encoder.find_duplicates(image_dir='path/to/image/directory', min_similarity_threshold=0.85)
```
where the returned variable *duplicates* is a dictionary with the following content:
```
{
  'image1.jpg': ['image1_duplicate1.jpg',
                 'image1_duplicate2.jpg'],
  'image2.jpg': [..],
  ..
}
```
Each key in the *duplicates* dictionary corresponds to a file in the image directory passed to the *image_dir* parameter
of the *find_duplicates* function. The value is a list of all file names in the image directory that were found to be
duplicates for the key file.

For advanced usage, look at the user guide.
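One advanced option described in the user guide is the *scores* parameter. A short sketch, assuming *scores* is a boolean flag and that each duplicate file name is then returned together with its hamming distance from the key file:
```python
from imagededup.methods import PHash

phasher = PHash()
# With scores enabled, every duplicate file name is paired with its score
duplicates = phasher.find_duplicates(image_dir='path/to/image/directory',
                                     max_distance_threshold=15,
                                     scores=True)
# Illustrative structure of the result:
# {'image1.jpg': [('image1_duplicate1.jpg', 3), ('image1_duplicate2.jpg', 12)], ...}
```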

#### find_duplicates_to_remove
Returns a list of files in the image directory that are considered duplicates. Does **NOT** remove those files.
- Generate encodings for all images in an image directory

The API is similar to the *find_duplicates* function (except for the *score* attribute in *find_duplicates*). This function
returns a single list of file names in the directory that are found to be duplicates.

To deduplicate an image directory using cnn:
```python
from imagededup.methods import CNN
cnn_encoder = CNN()
duplicates = cnn_encoder.find_duplicates_to_remove(image_dir='path/to/image/directory', min_similarity_threshold=0.85)
```
*duplicates* is a list containing the names of image files that are found to be
duplicates of some file in the directory:
encodings = phasher.encode_images(image_dir='path/to/image/directory')
```
```
[
  'image1_duplicate1.jpg',
  'image1_duplicate2.jpg',
  ..
]
```

For advanced usage, look at the user guide.

### Feature generation
To only generate the hashes/cnn encodings for a given image or all images in the directory:

- [Feature generation for all images in a directory](#feature-generation-for-all-images-in-a-directory)
- [Feature generation for a single image](#feature-generation-for-a-single-image)


#### Feature generation for all images in a directory
The *encode_images* function can be used here:
- Find duplicates using the generated encodings
```python
from imagededup.methods import DHash
dhasher = DHash()
encodings = dhasher.encode_images(image_dir='path/to/image/directory')
duplicates = phasher.find_duplicates(encoding_map=encodings)
```
where the returned *encodings* dictionary has the following content:
```
{
  'image1.jpg': <feature-image-1>,
  'image2.jpg': <feature-image-2>,
  ..
}
```
For hashing algorithms, the features are 64-bit hashes represented as 16-character hexadecimal strings.

For cnn, the features are numpy arrays with shape (1, 1024).
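As a small illustration of the two encoding types (a sketch, assuming the CNN class exposes the same *encode_images* method; paths and outputs are illustrative):
```python
from imagededup.methods import CNN, DHash

# Hashing methods return a 16-character hexadecimal string (64 bits) per image
hash_encodings = DHash().encode_images(image_dir='path/to/image/directory')
print(len(next(iter(hash_encodings.values()))))   # -> 16

# The CNN method returns a numpy array per image
cnn_encodings = CNN().encode_images(image_dir='path/to/image/directory')
print(next(iter(cnn_encodings.values())).shape)   # -> (1, 1024)
```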

#### Feature generation for a single image
To generate encodings for a single image, the *encode_image* function can be used:
- Plot duplicates obtained for a given file using the duplicates dictionary
```python
from imagededup.methods import AHash
ahasher = AHash()
encoding = ahasher.encode_image(image_file='path/to/image/file')
from imagededup.utils import plot_duplicates
plot_duplicates(image_dir='path/to/image/directory',
                duplicate_map=duplicates,
                filename='ukbench00120.jpg')
```
where the returned variable *encoding* is either a hexadecimal string if a hashing method is used or a (1, 1024) numpy
array if cnn is used.
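Similarly, a single-image encoding with the CNN could be generated as below (a sketch, assuming the CNN class exposes the same *encode_image* signature as the hashing classes):
```python
from imagededup.methods import CNN

cnn_encoder = CNN()
# Expected to return a (1, 1024) numpy array for the given image
encoding = cnn_encoder.encode_image(image_file='path/to/image/file')
```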

### Evaluation of deduplication quality
To determine the quality of the deduplication algorithm and the corresponding threshold, an evaluation framework is provided.
The output looks as below:

Given a ground truth mapping consisting of file names and a list of duplicates for each file, along with a retrieved
mapping from the deduplication algorithm for the same files, the following metrics can be obtained using the framework:
![figs](readme_figures/plot_dups.png)

- [Mean Average Precision](https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Mean_average_precision) (MAP)
- [Mean Normalized Discounted Cumulative Gain](https://en.wikipedia.org/wiki/Discounted_cumulative_gain) (NDCG)
- [Jaccard Index](https://en.wikipedia.org/wiki/Jaccard_index)
- Per class [Precision](https://en.wikipedia.org/wiki/Precision_and_recall) (class 0 = non-duplicate image pairs, class 1 = duplicate image pairs)
- Per class [Recall](https://en.wikipedia.org/wiki/Precision_and_recall) (class 0 = non-duplicate image pairs, class 1 = duplicate image pairs)
- Per class [f1-score](https://en.wikipedia.org/wiki/F1_score) (class 0 = non-duplicate image pairs, class 1 = duplicate image pairs)

The API for obtaining these metrics is as follows:
The complete code for the workflow is:
```python
from imagededup.evaluation import evaluate
metrics = evaluate(ground_truth_map, retrieved_map, metric='<metric-name>')
```
where the returned variable *metrics* is a dictionary containing the following content:
```
{
  'map': <map>,
  'ndcg': <mean ndcg>,
  'jaccard': <mean jaccard index>,
  'precision': <numpy array having per class precision>,
  'recall': <numpy array having per class recall>,
  'f1-score': <numpy array having per class f1-score>,
  'support': <numpy array having per class support>
}
```
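For illustration, both mappings can be plain dictionaries of file names to lists of duplicate file names, mirroring the *duplicates* dictionary shown earlier; a sketch with hypothetical file names and the 'map' metric:
```python
from imagededup.evaluation import evaluate

# Hypothetical ground truth: the true duplicates for each file
ground_truth_map = {
    'image1.jpg': ['image1_duplicate1.jpg', 'image1_duplicate2.jpg'],
    'image2.jpg': ['image2_duplicate1.jpg'],
}

# Hypothetical retrieved mapping, e.g. produced by find_duplicates
retrieved_map = {
    'image1.jpg': ['image1_duplicate1.jpg'],
    'image2.jpg': ['image2_duplicate1.jpg'],
}

# Metric name assumed to match the keys shown above
metrics = evaluate(ground_truth_map, retrieved_map, metric='map')
```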
from imagededup.methods import PHash
phasher = PHash()

### Plotting duplicates of an image
Duplicates for an image can be plotted using the *plot_duplicates* method as below:
```python
# Generate encodings for all images in an image directory
encodings = phasher.encode_images(image_dir='path/to/image/directory')

# Find duplicates using the generated encodings
duplicates = phasher.find_duplicates(encoding_map=encodings)

# plot duplicates obtained for a given file using the duplicates dictionary
from imagededup.utils import plot_duplicates
plot_duplicates(image_dir, duplicate_map, filename)
plot_duplicates(image_dir='path/to/image/directory',
                duplicate_map=duplicates,
                filename='ukbench00120.jpg')
```
where *duplicate_map* is the duplicate map obtained after running [find_duplicates()](#find_duplicates) and *filename* is the file for which duplicates are to be plotted.

The output looks as below:

![figs](readme_figures/plot_dups.png)
For more detailed usage of the package functionality, refer to: [https://idealo.github.io/imagededup/](https://idealo.github.io/imagededup/)

## Contribute
We welcome all kinds of contributions.
@@ -213,15 +114,16 @@ Please cite Imagededup in your publications if this is useful for your research.
```
@misc{idealods2019imagededup,
title={Imagededup},
author={Tanuj Jain and Christopher Lennan and Zubin John},
author={Tanuj Jain and Christopher Lennan and Zubin John and Dat Tran},
year={2019},
howpublished={\url{https://github.com/idealo/image-dedup}},
howpublished={\url{https://github.com/idealo/imagededup}},
}
```

## Maintainers
* Tanuj Jain, github: [tanujjain](https://github.com/tanujjain)
* Christopher Lennan, github: [clennan](https://github.com/clennan)
* Dat Tran, github: [datitran](https://github.com/datitran)

## Copyright
See [LICENSE](LICENSE) for details.
See [LICENSE](LICENSE) for details.
2 changes: 1 addition & 1 deletion mkdocs/build_docs.sh
@@ -3,7 +3,7 @@
cp ../README.md docs/index.md
cp ../CONTRIBUTING.md docs/CONTRIBUTING.md
cp ../LICENSE docs/LICENSE.md
cp -R ../_readme_figures docs/
cp -R ../readme_figures docs/
python autogen.py
mkdir ../docs
mkdocs build -c -d ../docs/
Binary file added mkdocs/docs/readme_figures/mona_lisa.png
4 changes: 2 additions & 2 deletions mkdocs/docs/user_guide/finding_duplicates.md
@@ -46,7 +46,7 @@ The 'method-name' corresponds to one of the deduplication methods available and

- *encoding_map*: Optional, used instead of *image_dir* attribute. Set it equal to the dictionary of file names and
corresponding features (hashes/cnn encodings). The mentioned dictionary can be generated using the corresponding
[*encode_images*](feature_generation.md) method.
[*encode_images*](encoding_generation.md) method.
- *scores*: Setting it to *True* returns the scores representing the hamming distance (for hashing) or cosine similarity
(for cnn) of each of the duplicate file names from the key file. In this case, the returned 'duplicates' dictionary has
the following content:
@@ -149,7 +149,7 @@ The 'method-name' corresponds to one of the deduplication methods available and

- *encoding_map*: Optional, used instead of image_dir attribute. Set it equal to the dictionary of file names and
corresponding features (hashes/cnn encodings). The mentioned dictionary can be generated using the corresponding
[*encode_images*](feature_generation.md) method. Each key in the 'duplicates' dictionary corresponds to a file in the image directory passed to
[*encode_images*](encoding_generation.md) method. Each key in the 'duplicates' dictionary corresponds to a file in the image directory passed to
the image_dir parameter of the find_duplicates function. The value is a list of all tuples representing the file names
and corresponding scores in the image directory that were found to be duplicates for the key file.

4 changes: 2 additions & 2 deletions mkdocs/mkdocs.yml
@@ -1,11 +1,11 @@
site_name: Image-dedup
site_name: Imagededup
site_author: idealo Data Science Team

nav:
- Home: index.md
- User Guide:
- Finding duplicates: user_guide/finding_duplicates.md
- Feature generation: user_guide/feature_generation.md
- Encoding generation: user_guide/encoding_generation.md
- Evaluating performance: user_guide/evaluating_performance.md
- Plotting duplicates: user_guide/plotting_duplicates.md
- API reference:
Binary file added readme_figures/mona_lisa.png
4 changes: 2 additions & 2 deletions setup.py
@@ -23,8 +23,8 @@
url='',
long_description=long_description,
license='Apache 2.0',
author='Tanuj Jain, Christopher Lennan, Zubin John',
author_email='[email protected], [email protected], [email protected]',
author='Tanuj Jain, Christopher Lennan, Zubin John, Dat Tran',
author_email='[email protected], [email protected], [email protected], [email protected]',
description='Package for image deduplication',
install_requires=[
'numpy==1.16.3',
