Update Readme to add a new image.
Former-commit-id: a6b3315
tanujjain committed Oct 1, 2019
1 parent b552cc2 commit cff2981
Showing 9 changed files with 63 additions and 161 deletions.
10 changes: 5 additions & 5 deletions CONTRIBUTING.md
@@ -1,6 +1,6 @@
# Contribution Guide

We welcome any contributions whether it's,
We welcome contributions of any kind, including:

- Submitting feedback
- Fixing bugs
@@ -9,7 +9,7 @@ We welcome any contributions whether it's,
Please read this guide before making any contributions.

#### Submit Feedback
The feedback should be submitted by creating an issue at [GitHub issues](https://github.com/idealo/image-dedup/issues).
The feedback should be submitted by creating an issue on [GitHub issues](https://github.com/idealo/image-dedup/issues).
Select the related template (bug report, feature request, or custom) and add the corresponding labels.

#### Fix Bugs:
@@ -19,9 +19,9 @@ You may look through the [GitHub issues](https://github.com/idealo/image-dedup/i
You may look through the [GitHub issues](https://github.com/idealo/image-dedup/issues) for feature requests.

## Pull Requests (PR)
1. Fork the repository and a create a new branch from the master branch.
2. For bug fixes, add new tests and for new features please add changes to the documentation.
3. Do a PR from your new branch to our `dev` branch of the original Image Super-Resolution repo.
1. Fork the repository and create a new branch from the master branch.
2. For bug fixes, add new tests; for new features, please update the documentation.
3. Open a PR from your new branch to the `dev` branch of the original Imagededup repo.

## Documentation
- Make sure any new function or class you introduce has proper docstrings.
200 changes: 51 additions & 149 deletions README.md
@@ -1,47 +1,34 @@
# imagededup
# Image Deduplicator (imagededup)

Finding duplicates in an image dataset is a recurring task. imagededup is a Python package that provides functionality
to carry out this task effectively. The deduplication problem generally falls into two broad categories:

* Finding exact duplicates

<p align="center">
<img src="readme_figures/103500.jpg" width="300" />
<img src="readme_figures/103500.jpg" width="300" />
</p>

* Finding near duplicates
imagededup is a Python package that simplifies the task of finding **exact** and **near duplicates** in an image collection.

<p align="center">
<img src="readme_figures/103500.jpg" width="300" />
<img src="readme_figures/103501.jpg" width="300" />
<img src="readme_figures/mona_lisa.png" width="600" />
</p>

Traditional methods such as hashing algorithms are particularly good at finding exact duplicates while more modern
methods involving convolutional neural networks are also adept at finding near duplicates due to their ability to
capture basic contours in images.
This package provides functionality to make use of hashing algorithms, which are particularly good at finding exact
duplicates, as well as convolutional neural networks, which are also adept at finding near duplicates. Additionally, an
evaluation framework is provided to judge the quality of deduplication for a given dataset.

This package provides functionality to address both problems. Additionally, an evaluation framework is also provided to
judge the quality of deduplication. Following details the functionality provided by the package:
The following details the functionality provided by the package:

- Finding duplicates in a directory using one of the following algorithms:
- [Convolutional Neural Network](https://arxiv.org/abs/1704.04861)
- [Perceptual hashing](http://www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html)
- [Difference hashing](http://www.hackerfactor.com/blog/index.php?/archives/529-Kind-of-Like-That.html)
- [Wavelet hashing](https://fullstackml.com/wavelet-image-hash-in-python-3504fdd282b5)
- [Average hashing](http://www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html)
- Generation of features for images using one of the above stated algorithms.
- [Convolutional Neural Network](https://arxiv.org/abs/1704.04861) (CNN)
- [Perceptual hashing](http://www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html) (PHash)
- [Difference hashing](http://www.hackerfactor.com/blog/index.php?/archives/529-Kind-of-Like-That.html) (DHash)
- [Wavelet hashing](https://fullstackml.com/wavelet-image-hash-in-python-3504fdd282b5) (WHash)
- [Average hashing](http://www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html) (AHash)
- Generation of encodings for images using one of the above stated algorithms.
- Framework to evaluate effectiveness of deduplication given a ground truth mapping.
- Plotting duplicates found for a given image file.

Detailed documentation for the package can be found at: [https://idealo.github.io/imagededup/](https://idealo.github.io/imagededup/)

imagededup is compatible with Python 3.6 and is distributed under the Apache 2.0 license.

## Table of contents
## Contents
- [Installation](#installation)
- [Finding duplicates](#finding-duplicates)
- [Feature generation](#feature-generation)
- [Evaluation of deduplication](#evaluation-of-deduplication-quality)
- [Plotting duplicates](#plotting-duplicates-of-an-image)
- [Quick Start](#quick-start)
- [Contribute](#contribute)
- [Citation](#citation)
- [Maintainers](#maintainers)
@@ -59,150 +46,64 @@ pip install imagededup
Install imagededup from the GitHub source:

```
git clone https://github.com/idealo/image-dedup.git
cd image-dedup
git clone https://github.com/idealo/imagededup.git
cd imagededup
python setup.py install
```

## Quick start
### Finding duplicates
There are two methods available to perform deduplication:

- [find_duplicates()](#find_duplicates)
- [find_duplicates_to_remove()](#find_duplicates_to_remove)
To find duplicates in an image directory using perceptual hashing, the following workflow can be used:

- Import perceptual hashing method

#### find_duplicates
To deduplicate an image directory using perceptual hashing:
```python
from imagededup.methods import PHash
phasher = PHash()
duplicates = phasher.find_duplicates(image_dir='path/to/image/directory', max_distance_threshold=15)
```
Other hashing methods can be used instead of PHash: AHash, DHash, WHash.
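For instance, a minimal sketch that swaps in difference hashing (assuming DHash exposes the same *find_duplicates* signature as PHash above) could look like:
```python
from imagededup.methods import DHash

# Same workflow as the PHash example above, only the hashing class changes
dhasher = DHash()
duplicates = dhasher.find_duplicates(image_dir='path/to/image/directory', max_distance_threshold=15)
```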

To deduplicate an image directory using cnn:
```python
from imagededup.methods import CNN
cnn_encoder = CNN()
duplicates = cnn_encoder.find_duplicates(image_dir='path/to/image/directory', min_similarity_threshold=0.85)
```
where the returned variable *duplicates* is a dictionary with the following content:
```
{
  'image1.jpg': ['image1_duplicate1.jpg',
                 'image1_duplicate2.jpg'],
  'image2.jpg': [..],
  ..
}
```
Each key in the *duplicates* dictionary corresponds to a file in the image directory passed to the *image_dir* parameter
of the *find_duplicates* function. The value is a list of all file names in the image directory that were found to be
duplicates for the key file.

For advanced usage, look at the user guide.
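One advanced option described in the user guide is the *scores* parameter. A short sketch, assuming *scores* is a boolean flag and that each duplicate file name is then returned together with its hamming distance from the key file:
```python
from imagededup.methods import PHash

phasher = PHash()
# With scores enabled, every duplicate file name is paired with its score
duplicates = phasher.find_duplicates(image_dir='path/to/image/directory',
                                     max_distance_threshold=15,
                                     scores=True)
# Illustrative structure of the result:
# {'image1.jpg': [('image1_duplicate1.jpg', 3), ('image1_duplicate2.jpg', 12)], ...}
```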

#### find_duplicates_to_remove
Returns a list of files in the image directory that are considered duplicates. Does **NOT** remove those files.
- Generate encodings for all images in an image directory

The API is similar to the *find_duplicates* function (except for the *score* attribute in *find_duplicates*). This function
returns a single list of file names in the directory that are found to be duplicates.

To deduplicate an image directory using cnn:
```python
from imagededup.methods import CNN
cnn_encoder = CNN()
duplicates = cnn_encoder.find_duplicates_to_remove(image_dir='path/to/image/directory', min_similarity_threshold=0.85)
```
*duplicates* is a list containing the names of image files that are found to be
duplicates of some file in the directory:
encodings = phasher.encode_images(image_dir='path/to/image/directory')
```
```
[
  'image1_duplicate1.jpg',
  'image1_duplicate2.jpg',
  ..
]
```

For advanced usage, look at the user guide.

### Feature generation
To only generate the hashes/cnn encodings for a given image or all images in the directory:

- [Feature generation for all images in a directory](#feature-generation-for-all-images-in-a-directory)
- [Feature generation for a single image](#feature-generation-for-a-single-image)


#### Feature generation for all images in a directory
The *encode_images* function can be used here:
- Find duplicates using the generated encodings
```python
from imagededup.methods import DHash
dhasher = DHash()
encodings = dhasher.encode_images(image_dir='path/to/image/directory')
duplicates = phasher.find_duplicates(encoding_map=encodings)
```
where the returned *encodings* dictionary has the following content:
```
{
  'image1.jpg': <feature-image-1>,
  'image2.jpg': <feature-image-2>,
  ..
}
```
For hashing algorithms, the features are 64-bit hashes represented as 16-character hexadecimal strings.

For cnn, the features are numpy arrays with shape (1, 1024).
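As a small illustration of the two encoding types (a sketch, assuming the CNN class exposes the same *encode_images* method; paths and outputs are illustrative):
```python
from imagededup.methods import CNN, DHash

# Hashing methods return a 16-character hexadecimal string (64 bits) per image
hash_encodings = DHash().encode_images(image_dir='path/to/image/directory')
print(len(next(iter(hash_encodings.values()))))   # -> 16

# The CNN method returns a numpy array per image
cnn_encodings = CNN().encode_images(image_dir='path/to/image/directory')
print(next(iter(cnn_encodings.values())).shape)   # -> (1, 1024)
```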

#### Feature generation for a single image
To generate encodings for a single image, the *encode_image* function can be used:
- Plot duplicates obtained for a given file using the duplicates dictionary
```python
from imagededup.methods import AHash
ahasher = AHash()
encoding = ahasher.encode_image(image_file='path/to/image/file')
from imagededup.utils import plot_duplicates
plot_duplicates(image_dir='path/to/image/directory',
                duplicate_map=duplicates,
                filename='ukbench00120.jpg')
```
where the returned variable *encoding* is either a hexadecimal string if a hashing method is used or a (1, 1024) numpy
array if cnn is used.
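Similarly, a single-image encoding with the CNN could be generated as below (a sketch, assuming the CNN class exposes the same *encode_image* signature as the hashing classes):
```python
from imagededup.methods import CNN

cnn_encoder = CNN()
# Expected to return a (1, 1024) numpy array for the given image
encoding = cnn_encoder.encode_image(image_file='path/to/image/file')
```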

### Evaluation of deduplication quality
To determine the quality of the deduplication algorithm and the corresponding threshold, an evaluation framework is provided.
The output looks as below:

Given a ground truth mapping consisting of file names and a list of duplicates for each file, along with a retrieved
mapping from the deduplication algorithm for the same files, the following metrics can be obtained using the framework:
![figs](readme_figures/plot_dups.png)

- [Mean Average Precision](https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Mean_average_precision) (MAP)
- [Mean Normalized Discounted Cumulative Gain](https://en.wikipedia.org/wiki/Discounted_cumulative_gain) (NDCG)
- [Jaccard Index](https://en.wikipedia.org/wiki/Jaccard_index)
- Per class [Precision](https://en.wikipedia.org/wiki/Precision_and_recall) (class 0 = non-duplicate image pairs, class 1 = duplicate image pairs)
- Per class [Recall](https://en.wikipedia.org/wiki/Precision_and_recall) (class 0 = non-duplicate image pairs, class 1 = duplicate image pairs)
- Per class [f1-score](https://en.wikipedia.org/wiki/F1_score) (class 0 = non-duplicate image pairs, class 1 = duplicate image pairs)

The API for obtaining these metrics is as follows:
The complete code for the workflow is:
```python
from imagededup.evaluation import evaluate
metrics = evaluate(ground_truth_map, retrieved_map, metric='<metric-name>')
```
where the returned variable *metrics* is a dictionary containing the following content:
```
{
  'map': <map>,
  'ndcg': <mean ndcg>,
  'jaccard': <mean jaccard index>,
  'precision': <numpy array having per class precision>,
  'recall': <numpy array having per class recall>,
  'f1-score': <numpy array having per class f1-score>,
  'support': <numpy array having per class support>
}
```
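For illustration, both mappings can be plain dictionaries of file names to lists of duplicate file names, mirroring the *duplicates* dictionary shown earlier; a sketch with hypothetical file names and the 'map' metric:
```python
from imagededup.evaluation import evaluate

# Hypothetical ground truth: the true duplicates for each file
ground_truth_map = {
    'image1.jpg': ['image1_duplicate1.jpg', 'image1_duplicate2.jpg'],
    'image2.jpg': ['image2_duplicate1.jpg'],
}

# Hypothetical retrieved mapping, e.g. produced by find_duplicates
retrieved_map = {
    'image1.jpg': ['image1_duplicate1.jpg'],
    'image2.jpg': ['image2_duplicate1.jpg'],
}

# Metric name assumed to match the keys shown above
metrics = evaluate(ground_truth_map, retrieved_map, metric='map')
```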
from imagededup.methods import PHash
phasher = PHash()

### Plotting duplicates of an image
Duplicates for an image can be plotted using the *plot_duplicates* method as below:
```python
# Generate encodings for all images in an image directory
encodings = phasher.encode_images(image_dir='path/to/image/directory')

# Find duplicates using the generated encodings
duplicates = phasher.find_duplicates(encoding_map=encodings)

# plot duplicates obtained for a given file using the duplicates dictionary
from imagededup.utils import plot_duplicates
plot_duplicates(image_dir, duplicate_map, filename)
plot_duplicates(image_dir='path/to/image/directory',
                duplicate_map=duplicates,
                filename='ukbench00120.jpg')
```
where *duplicate_map* is the duplicate map obtained after running [find_duplicates()](#find_duplicates) and *filename* is the file for which duplicates are to be plotted.

The output looks as below:

![figs](readme_figures/plot_dups.png)
For more detailed usage of the package functionality, refer to: [https://idealo.github.io/imagededup/](https://idealo.github.io/imagededup/)

## Contribute
We welcome all kinds of contributions.
@@ -213,15 +114,16 @@ Please cite Imagededup in your publications if this is useful for your research.
```
@misc{idealods2019imagededup,
title={Imagededup},
author={Tanuj Jain and Christopher Lennan and Zubin John},
author={Tanuj Jain and Christopher Lennan and Zubin John and Dat Tran},
year={2019},
howpublished={\url{https://github.com/idealo/image-dedup}},
howpublished={\url{https://github.com/idealo/imagededup}},
}
```

## Maintainers
* Tanuj Jain, github: [tanujjain](https://github.com/tanujjain)
* Christopher Lennan, github: [clennan](https://github.com/clennan)
* Dat Tran, github: [datitran](https://github.com/datitran)

## Copyright
See [LICENSE](LICENSE) for details.
See [LICENSE](LICENSE) for details.
2 changes: 1 addition & 1 deletion mkdocs/build_docs.sh
@@ -3,7 +3,7 @@
cp ../README.md docs/index.md
cp ../CONTRIBUTING.md docs/CONTRIBUTING.md
cp ../LICENSE docs/LICENSE.md
cp -R ../_readme_figures docs/
cp -R ../readme_figures docs/
python autogen.py
mkdir ../docs
mkdocs build -c -d ../docs/
Binary file added mkdocs/docs/readme_figures/mona_lisa.png
4 changes: 2 additions & 2 deletions mkdocs/docs/user_guide/finding_duplicates.md
@@ -46,7 +46,7 @@ The 'method-name' corresponds to one of the deduplication methods available and

- *encoding_map*: Optional, used instead of *image_dir* attribute. Set it equal to the dictionary of file names and
corresponding features (hashes/cnn encodings). The mentioned dictionary can be generated using the corresponding
[*encode_images*](feature_generation.md) method.
[*encode_images*](encoding_generation.md) method.
- *scores*: Setting it to *True* returns the scores representing the hamming distance (for hashing) or cosine similarity
(for cnn) of each of the duplicate file names from the key file. In this case, the returned 'duplicates' dictionary has
the following content:
@@ -149,7 +149,7 @@ The 'method-name' corresponds to one of the deduplication methods available and

- *encoding_map*: Optional, used instead of image_dir attribute. Set it equal to the dictionary of file names and
corresponding features (hashes/cnn encodings). The mentioned dictionary can be generated using the corresponding
[*encode_images*](feature_generation.md) method. Each key in the 'duplicates' dictionary corresponds to a file in the image directory passed to
[*encode_images*](encoding_generation.md) method. Each key in the 'duplicates' dictionary corresponds to a file in the image directory passed to
the image_dir parameter of the find_duplicates function. The value is a list of all tuples representing the file names
and corresponding scores in the image directory that were found to be duplicates for the key file.

4 changes: 2 additions & 2 deletions mkdocs/mkdocs.yml
@@ -1,11 +1,11 @@
site_name: Image-dedup
site_name: Imagededup
site_author: idealo Data Science Team

nav:
- Home: index.md
- User Guide:
- Finding duplicates: user_guide/finding_duplicates.md
- Feature generation: user_guide/feature_generation.md
- Encoding generation: user_guide/encoding_generation.md
- Evaluating performance: user_guide/evaluating_performance.md
- Plotting duplicates: user_guide/plotting_duplicates.md
- API reference:
Binary file added readme_figures/mona_lisa.png
4 changes: 2 additions & 2 deletions setup.py
@@ -23,8 +23,8 @@
url='',
long_description=long_description,
license='Apache 2.0',
author='Tanuj Jain, Christopher Lennan, Zubin John',
author_email='[email protected], [email protected], [email protected]',
author='Tanuj Jain, Christopher Lennan, Zubin John, Dat Tran',
author_email='[email protected], [email protected], [email protected], [email protected]',
description='Package for image deduplication',
install_requires=[
'numpy==1.16.3',
