diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 05a58ab3..4da521a3 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -1,6 +1,6 @@
 # Contribution Guide
 
-We welcome any contributions whether it's,
+We welcome any contributions, whether it is:
 
 - Submitting feedback
 - Fixing bugs
@@ -9,7 +9,7 @@ We welcome any contributions whether it's,
 Please read this guide before making any contributions.
 
 #### Submit Feedback
-The feedback should be submitted by creating an issue at [GitHub issues](https://github.com/idealo/image-dedup/issues).
+The feedback should be submitted by creating an issue on [GitHub issues](https://github.com/idealo/image-dedup/issues).
 Select the related template (bug report, feature request, or custom) and add the corresponding labels.
 
 #### Fix Bugs:
@@ -19,9 +19,9 @@ You may look through the [GitHub issues](https://github.com/idealo/image-dedup/i
 You may look through the [GitHub issues](https://github.com/idealo/image-dedup/issues) for feature requests.
 
 ## Pull Requests (PR)
-1. Fork the repository and a create a new branch from the master branch.
-2. For bug fixes, add new tests and for new features please add changes to the documentation.
-3. Do a PR from your new branch to our `dev` branch of the original Image Super-Resolution repo.
+1. Fork the repository and create a new branch from the master branch.
+2. For bug fixes, add new tests; for new features, please update the documentation.
+3. Open a PR from your new branch to the `dev` branch of the original Imagededup repo.
 
 ## Documentation
 - Make sure any new function or class you introduce has proper docstrings.
diff --git a/README.md b/README.md
index a1ca5ecf..1cf18de3 100644
--- a/README.md
+++ b/README.md
@@ -1,47 +1,34 @@
-# imagededup
+# Image Deduplicator (imagededup)
 
-Finding duplicates in an image dataset is a recurring task. imagededup is a python package that provides functionality
-to carry out this task effectively.
-The deduplication problem generally caters to 2 broad issues:
-
-* Finding exact duplicates
-
-
-
-* Finding near duplicates
+imagededup is a Python package that simplifies the task of finding **exact** and **near duplicates** in an image collection.
-
-
+
-Traditional methods such as hashing algorithms are particularly good at finding exact duplicates while more modern
-methods involving convolutional neural networks are also adept at finding near duplicates due to their ability to
-capture basic contours in images.
+This package provides functionality to make use of hashing algorithms, which are particularly good at finding exact
+duplicates, as well as convolutional neural networks, which are adept at finding near duplicates. Additionally, an
+evaluation framework is provided to judge the quality of deduplication for a given dataset.
 
-This package provides functionality to address both problems. Additionally, an evaluation framework is also provided to
-judge the quality of deduplication. Following details the functionality provided by the package:
+The package provides the following functionality:
 
 - Finding duplicates in a directory using one of the following algorithms:
-  - [Convolutional Neural Network](https://arxiv.org/abs/1704.04861)
-  - [Perceptual hashing](http://www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html)
-  - [Difference hashing](http://www.hackerfactor.com/blog/index.php?/archives/529-Kind-of-Like-That.html)
-  - [Wavelet hashing](https://fullstackml.com/wavelet-image-hash-in-python-3504fdd282b5)
-  - [Average hashing](http://www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html)
-- Generation of features for images using one of the above stated algorithms.
+  - [Convolutional Neural Network](https://arxiv.org/abs/1704.04861) (CNN)
+  - [Perceptual hashing](http://www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html) (PHash)
+  - [Difference hashing](http://www.hackerfactor.com/blog/index.php?/archives/529-Kind-of-Like-That.html) (DHash)
+  - [Wavelet hashing](https://fullstackml.com/wavelet-image-hash-in-python-3504fdd282b5) (WHash)
+  - [Average hashing](http://www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html) (AHash)
+- Generation of encodings for images using one of the above stated algorithms.
 - Framework to evaluate effectiveness of deduplication given a ground truth mapping.
 - Plotting duplicates found for a given image file.
 
+Detailed documentation for the package can be found at: [https://idealo.github.io/imagededup/](https://idealo.github.io/imagededup/)
+
 imagededup is compatible with Python 3.6 and is distributed under the Apache 2.0 license.
 
-## Table of contents
+## Contents
 
 - [Installation](#installation)
-- [Finding duplicates](#finding-duplicates)
-- [Feature generation](#feature-generation)
-- [Evaluation of deduplication](#evaluation-of-deduplication-quality)
-- [Plotting duplicates](#plotting-duplicates-of-an-image)
+- [Quick Start](#quick-start)
 - [Contribute](#contribute)
 - [Citation](#citation)
 - [Maintainers](#maintainers)
@@ -59,150 +46,64 @@ pip install imagededup
 
 Install imagededup from the GitHub source:
 
 ```
-git clone https://github.com/idealo/image-dedup.git
-cd image-dedup
+git clone https://github.com/idealo/imagededup.git
+cd imagededup
 python setup.py install
 ```
 
 ## Quick start
 
-### Finding duplicates
-There are two methods available to perform deduplication:
-- [find_duplicates()](#find_duplicates)
-- [find_duplicates_to_remove()](#find_duplicates_to_remove)
+To find duplicates in an image directory using perceptual hashing, the following workflow can be used:
+
+- Import perceptual hashing method
 
-#### find_duplicates
-To deduplicate an image directory using perceptual hashing:
 ```python
 from imagededup.methods import PHash
 phasher = PHash()
-duplicates = phasher.find_duplicates(image_dir='path/to/image/directory', max_distance_threshold=15)
 ```
-Other hashing methods can be used instead of PHash: Ahash, DHash, WHash
-To deduplicate an image directory using cnn:
-```python
-from imagededup.methods import CNN
-cnn_encoder = CNN()
-duplicates = cnn_encoder.find_duplicates(image_dir='path/to/image/directory', min_similarity_threshold=0.85)
-```
-where the returned variable *duplicates* is a dictionary with the following content:
-```
-{
-  'image1.jpg': ['image1_duplicate1.jpg',
-                 'image1_duplicate2.jpg'],
-  'image2.jpg': [..],
-  ..
-}
-```
-Each key in the *duplicates* dictionary corresponds to a file in the image directory passed to the *image_dir* parameter
-of the *find_duplicates* function. The value is a list of all file names in the image directory that were found to be
-duplicates for the key file.
-
-For an advanced usage, look at the user guide.
-
-#### find_duplicates_to_remove
-Returns a list of files in the image directory that are considered as duplicates. Does **NOT** remove the said files.
+- Generate encodings for all images in an image directory
 
-The api is similar to *find_duplicates* function (except the *score* attribute in *find_duplicates*). This function
-allows the return of a single list of file names in directory that are found to be duplicates.
-
-To deduplicate an image directory using cnn:
 ```python
-from imagededup.methods import CNN
-cnn_encoder = CNN()
-duplicates = cnn_encoder.find_duplicates_to_remove(image_dir='path/to/image/directory', min_similarity_threshold=0.85)
-```
-*duplicates* is a list containing the name of image files that are found to be
-duplicates of some file in the directory:
+encodings = phasher.encode_images(image_dir='path/to/image/directory')
 ```
-[
-  'image1_duplicate1.jpg',
-  'image1_duplicate2.jpg'
-  ,..
-]
-```
-
-For an advanced usage, look at the user guide.
-
-### Feature generation
-To only generate the hashes/cnn encodings for a given image or all images in the directory:
-
-- [Feature generation for all images in a directory](#feature-generation-for-all-images-in-a-directory)
-- [Feature generation for a single image](#feature-generation-for-a-single-image)
-
-#### Feature generation for all images in a directory
-*encode_images* function can be used here:
+- Find duplicates using the generated encodings
 
 ```python
-from imagededup.methods import Dhash
-dhasher = Dhash()
-encodings = dhasher.encode_images(image_dir='path/to/image/directory')
+duplicates = phasher.find_duplicates(encoding_map=encodings)
 ```
-where the returned *encodings*:
-```
-{
-  'image1.jpg': ,
-  'image2.jpg': ,
-  ..
-}
-```
-For hashing algorithms, the features are 64 bit hashes represented as 16 character hexadecimal strings.
-For cnn, the features are numpy array with shape (1, 1024).
-
-#### Feature generation for a single image
-To generate encodings for a single image *encode_image* function can be used:
+- Plot duplicates obtained for a given file using the duplicates dictionary
 
 ```python
-from imagededup.methods import AHash
-ahasher = AHash()
-encoding = ahasher.encode_image(image_file='path/to/image/file')
+from imagededup.utils import plot_duplicates
+plot_duplicates(image_dir='path/to/image/directory',
+                duplicate_map=duplicates,
+                filename='ukbench00120.jpg')
 ```
-where the returned variable *encoding* is either a hexadecimal string if a hashing method is used or a (1, 1024) numpy
-array if cnn is used.
-
-### Evaluation of deduplication quality
-To determine the quality of deduplication algorithm and the corresponding threshold, an evaluation framework is provided.
+The output looks as below:
 
-Given a ground truth mapping consisting of file names and a list of duplicates for each file along with a retrieved
-mapping from the deduplication algorithm for the same files, the following metrics can be obtained using the framework:
+![figs](readme_figures/plot_dups.png)
 
-- [Mean Average Precision](https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Mean_average_precision) (MAP)
-- [Mean Normalized Discounted Cumulative Gain](https://en.wikipedia.org/wiki/Discounted_cumulative_gain) (NDCG)
-- [Jaccard Index](https://en.wikipedia.org/wiki/Jaccard_index)
-- Per class [Precision](https://en.wikipedia.org/wiki/Precision_and_recall) (class 0 = non-duplicate image pairs, class 1 = duplicate image pairs)
-- Per class [Recall](https://en.wikipedia.org/wiki/Precision_and_recall) (class 0 = non-duplicate image pairs, class 1 = duplicate image pairs)
-- Per class [f1-score](https://en.wikipedia.org/wiki/F1_score) (class 0 = non-duplicate image pairs, class 1 = duplicate image pairs)
-The api for obtaining these metrics is as below:
+The complete code for the workflow is:
 
 ```python
-from imagededup.evaluation import evaluate
-metrics = evaluate(ground_truth_map, retrieved_map, metric='')
-```
-where the returned variable *metrics* is a dictionary containing the following content:
-```
-{
-  'map': ,
-  'ndcg': ,
-  'jaccard': ,
-  'precision': ,
-  'recall': ,
-  'f1-score': ,
-  'support': 
-}
-```
+from imagededup.methods import PHash
+phasher = PHash()
 
-### Plotting duplicates of an image
-Duplicates for an image can be plotted using *plot_duplicates* method as below:
-```python
+# Generate encodings for all images in an image directory
+encodings = phasher.encode_images(image_dir='path/to/image/directory')
+
+# Find duplicates using the generated encodings
+duplicates = phasher.find_duplicates(encoding_map=encodings)
+
+# Plot duplicates obtained for a given file using the duplicates dictionary
 from imagededup.utils import plot_duplicates
-plot_duplicates(image_dir, duplicate_map, filename)
+plot_duplicates(image_dir='path/to/image/directory',
+                duplicate_map=duplicates,
+                filename='ukbench00120.jpg')
 ```
-where *duplicate_map* is the duplicate map obtained after running [find_duplicates()](#find_duplicates) and *filename* is the file for which duplicates are to be plotted.
-The output looks as below:
-
-![figs](readme_figures/plot_dups.png)
+
+For more detailed usage of the package functionality, refer to [https://idealo.github.io/imagededup/](https://idealo.github.io/imagededup/).
 
 ## Contribute
 We welcome all kinds of contributions.
@@ -213,15 +114,16 @@ Please cite Imagededup in your publications if this is useful for your research.
 
 ```
 @misc{idealods2019imagededup,
   title={Imagededup},
-  author={Tanuj Jain and Christopher Lennan and Zubin John},
+  author={Tanuj Jain and Christopher Lennan and Zubin John and Dat Tran},
   year={2019},
-  howpublished={\url{https://github.com/idealo/image-dedup}},
+  howpublished={\url{https://github.com/idealo/imagededup}},
 }
 ```
 
 ## Maintainers
 * Tanuj Jain, github: [tanujjain](https://github.com/tanujjain)
 * Christopher Lennan, github: [clennan](https://github.com/clennan)
+* Dat Tran, github: [datitran](https://github.com/datitran)
 
 ## Copyright
-See [LICENSE](LICENSE) for details.
\ No newline at end of file
+See [LICENSE](LICENSE) for details.
diff --git a/mkdocs/build_docs.sh b/mkdocs/build_docs.sh
index 7135832b..bcf12e47 100755
--- a/mkdocs/build_docs.sh
+++ b/mkdocs/build_docs.sh
@@ -3,7 +3,7 @@
 cp ../README.md docs/index.md
 cp ../CONTRIBUTING.md docs/CONTRIBUTING.md
 cp ../LICENSE docs/LICENSE.md
-cp -R ../_readme_figures docs/
+cp -R ../readme_figures docs/
 python autogen.py
 mkdir ../docs
 mkdocs build -c -d ../docs/
\ No newline at end of file
diff --git a/mkdocs/docs/readme_figures/mona_lisa.png b/mkdocs/docs/readme_figures/mona_lisa.png
new file mode 100644
index 00000000..06440bb9
Binary files /dev/null and b/mkdocs/docs/readme_figures/mona_lisa.png differ
diff --git a/mkdocs/docs/user_guide/feature_generation.md b/mkdocs/docs/user_guide/encoding_generation.md
similarity index 100%
rename from mkdocs/docs/user_guide/feature_generation.md
rename to mkdocs/docs/user_guide/encoding_generation.md
diff --git a/mkdocs/docs/user_guide/finding_duplicates.md b/mkdocs/docs/user_guide/finding_duplicates.md
index 13daf0a0..080eabb0 100644
--- a/mkdocs/docs/user_guide/finding_duplicates.md
+++ b/mkdocs/docs/user_guide/finding_duplicates.md
@@ -46,7 +46,7 @@ The 'method-name' corresponds to one of the deduplication methods available and
 - *encoding_map*: Optional, used instead of *image_dir* attribute. Set it equal to the dictionary of file names and
 corresponding features (hashes/cnn encodings). The mentioned dictionary can be generated using the corresponding
-[*encode_images*](feature_generation.md) method.
+[*encode_images*](encoding_generation.md) method.
 - *scores*: Setting it to *True* returns the scores representing the hamming distance (for hashing) or cosine
 similarity (for cnn) of each of the duplicate file names from the key file. In this case, the returned 'duplicates'
 dictionary has the following content:
@@ -149,7 +149,7 @@ The 'method-name' corresponds to one of the deduplication methods available and
 - *encoding_map*: Optional, used instead of image_dir attribute. Set it equal to the dictionary of file names and
 corresponding features (hashes/cnn encodings). The mentioned dictionary can be generated using the corresponding
-[*encode_images*](feature_generation.md) method. Each key in the 'duplicates' dictionary corresponds to a file in the image directory passed to
+[*encode_images*](encoding_generation.md) method. Each key in the 'duplicates' dictionary corresponds to a file in the image directory passed to
 the image_dir parameter of the find_duplicates function. The value is a list of all tuples representing the file names
 and corresponding scores in the image directory that were found to be duplicates for the key file.
diff --git a/mkdocs/mkdocs.yml b/mkdocs/mkdocs.yml
index fcb96a4a..8d13c6a3 100644
--- a/mkdocs/mkdocs.yml
+++ b/mkdocs/mkdocs.yml
@@ -1,11 +1,11 @@
-site_name: Image-dedup
+site_name: Imagededup
 site_author: idealo Data Science Team
 
 nav:
 - Home: index.md
 - User Guide:
   - Finding duplicates: user_guide/finding_duplicates.md
-  - Feature generation: user_guide/feature_generation.md
+  - Encoding generation: user_guide/encoding_generation.md
   - Evaluating performance: user_guide/evaluating_performance.md
   - Plotting duplicates: user_guide/plotting_duplicates.md
 - API reference:
diff --git a/readme_figures/mona_lisa.png b/readme_figures/mona_lisa.png
new file mode 100644
index 00000000..06440bb9
Binary files /dev/null and b/readme_figures/mona_lisa.png differ
diff --git a/setup.py b/setup.py
index 1ed4378a..91bd7375 100644
--- a/setup.py
+++ b/setup.py
@@ -23,8 +23,8 @@
     url='',
    long_description=long_description,
     license='Apache 2.0',
-    author='Tanuj Jain, Christopher Lennan, Zubin John',
-    author_email='tanuj.jain.10@gmail.com, christopherlennan@gmail.com, zrjohn@yahoo.com',
+    author='Tanuj Jain, Christopher Lennan, Zubin John, Dat Tran',
+    author_email='tanuj.jain.10@gmail.com, christopherlennan@gmail.com, zrjohn@yahoo.com, datitran@gmail.com',
     description='Package for image deduplication',
     install_requires=[
        'numpy==1.16.3',
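A note for reviewers of this patch: the README links the article on difference hashing (DHash) but never shows what such a hash computes. The sketch below is a toy, self-contained illustration in pure Python, not imagededup's actual implementation; the 9x8 input size and the nested-list image format are assumptions made for the example only.

```python
def dhash(pixels):
    """Toy difference hash: compare horizontally adjacent pixels of a
    9x8 grayscale image (nested lists of 0-255 ints) and pack the
    resulting 64 bits into a 16-character hexadecimal string."""
    bits = 0
    for row in pixels:                         # 8 rows
        for left, right in zip(row, row[1:]):  # 8 comparisons per 9-pixel row
            bits = (bits << 1) | (1 if left > right else 0)
    return format(bits, '016x')

def hamming(h1, h2):
    """Number of differing bits between two hexadecimal hash strings."""
    return bin(int(h1, 16) ^ int(h2, 16)).count('1')

# A synthetic 9x8 "image" whose brightness increases left to right
gradient = [[col * 28 for col in range(9)] for _ in range(8)]
# Mirroring it reverses every adjacent-pixel comparison
mirrored = [list(reversed(row)) for row in gradient]

assert hamming(dhash(gradient), dhash(gradient)) == 0   # identical images: distance 0
assert hamming(dhash(gradient), dhash(mirrored)) == 64  # every comparison flipped
```

On real photos the pixel grid would come from grayscaling and resizing the image first; near-duplicate images then yield hashes with a small Hamming distance, which is what a threshold such as the `max_distance_threshold` parameter removed in this diff operates on.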