ICME

Intelligent Catalogue Management for E-Commerce.

1. Problem Statement

E-commerce companies constantly strive for efficient catalogue management to deliver a better customer experience and manage inventory well.

A key issue in catalogue management is duplicate product listings. Duplicate listings occur for a couple of reasons: a seller may upload the same listing across multiple e-commerce sites, or a seller may upload the same listing multiple times within a single e-commerce site for reasons known only to them.

In this project we focus on the problem of duplicate detection and try to identify a strategy that is optimal in terms of both speed and accuracy.

2. Project Structure

The project data and code are arranged as follows:

icme
  ├── src
  |   ├── main.py
  |   ├── config.py
  |   ├── huew/       {dataset-specific pre-processing techniques}
  |   ├── dedupe/     {different deduplication strategies}
  |   ├── networks/   {architectures used for feature extraction}
  |   └── utils/
  └── data
      ├── 2oq-c1r.zip
      ├── tops.csv
      ├── refine_tops.csv
      ├── images/     {86K+ images of the TOPS category}
      ├── dedup_sample_submission_file.json  {submission format}
      ├── duplicates/ {sample results}
      └── pretrained/ {weights for feature-extraction models}

Data:
The data folder is not part of this git project as it is very heavy (30GB+). The same data can be downloaded from the CSV file using this script.

main.py is the driver file for all processing. The configuration for different experiment runs is controlled through config.py.

3. Preparing dataset

The given dataset is large: 4057189 data-points with 32 features each. Keeping in mind the constraints on time and available resources, we take the subset of the dataset with product category "TOPS" for algorithm demonstration.

For extracting the "TOPS" category from the dataset, we check for all categories that end with ">TOPS". The categories obtained are:

  • Apparels>Women>Western Wear>Shirts, Tops & Tunics>Tops,
  • Apparels>Women>Fusion Wear>Shirts, Tops & Tunics>Tops,
  • Apparels>Women>Maternity Wear>Shirts, Tops & Tunics>Tops,
  • Apparels>Kids>Girls>T-Shirts & Tops>Tops,
  • Apparels>Kids>Infants>Baby Girls>T-Shirts & Tops>Tops

The scripts for extraction can be found here.
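As an illustration, the category filter can be expressed in pandas roughly as follows (the input file name is a placeholder; the linked script is authoritative):

  import pandas as pd

  # Load the raw dump (ships as 2oq-c1r.zip; the extracted CSV name is
  # assumed here) and keep rows whose category path ends with ">Tops".
  df = pd.read_csv("data/raw_dump.csv")
  tops = df[df["categories"].str.endswith(">Tops", na=False)]
  tops.to_csv("data/tops.csv", index=False)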

After TOPS extraction, 347694 data-points remain (shape: 347694x32).

The imageUrlStr field had more than one representative image for each product ID (PID). To represent each PID with one image, imageUrlStr was split to extract the first link (titled primaryImageUrlStr). The script for this can be found here.
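A one-line sketch of this split, assuming the URLs are concatenated with ';' (the delimiter is an assumption):

  # Keep only the first URL per product as its representative image.
  df["primaryImageUrlStr"] = df["imageUrlStr"].str.split(";").str[0]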

Some obvious pre-processing steps included dropping duplicates on fields like productId, productUrl, and primaryImageUrlStr. The script for this can be found here.
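In pandas terms, this amounts to something like the following sketch:

  # Drop rows that repeat any of the identifying fields.
  for col in ["productId", "productUrl", "primaryImageUrlStr"]:
      df = df.drop_duplicates(subset=col)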

After this step, 87968 data-points remain under the TOPS category (previously 347694).

Further, each data-point has 32 features, and not all of them seem useful. After manual examination, the following features were removed:

['imageUrlStr', 'description', 'categories', 'sellingPrice', 'specialPrice', 'inStock', 'codAvailable', 'offers', 'discount', 'shippingCharges', 'deliveryTime', 'sizeUnit', 'storage', 'displaySize', 'detailedSpecsStr', 'specificationList', 'sellerName', 'sellerAverageRating', 'sellerNoOfRatings', 'sellerNoOfReviews', 'sleeve', 'neck', 'idealFor']

This is done so that word2vec (among other strategies) can be applied to the remaining features to identify duplicates.

The final saved file (titled refine_tops.csv) has shape 87968x10.
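For completeness, the column pruning and save step as a sketch (the repo script is authoritative):

  # Remove the manually identified low-value features and save the result.
  FEATURES_TO_DROP = [
      'imageUrlStr', 'description', 'categories', 'sellingPrice', 'specialPrice',
      'inStock', 'codAvailable', 'offers', 'discount', 'shippingCharges',
      'deliveryTime', 'sizeUnit', 'storage', 'displaySize', 'detailedSpecsStr',
      'specificationList', 'sellerName', 'sellerAverageRating', 'sellerNoOfRatings',
      'sellerNoOfReviews', 'sleeve', 'neck', 'idealFor',
  ]
  df = df.drop(columns=FEATURES_TO_DROP)
  df.to_csv("data/refine_tops.csv", index=False)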

The script for downloading images of the "TOPS" category can be found here.
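A hypothetical sketch of such a download loop (file layout and naming are assumptions; the linked script is the reference):

  import os
  import pandas as pd
  import requests

  df = pd.read_csv("data/refine_tops.csv")
  os.makedirs("data/images", exist_ok=True)
  for pid, url in zip(df["productId"], df["primaryImageUrlStr"]):
      out = os.path.join("data/images", f"{pid}.jpg")
      if os.path.exists(out):
          continue  # resume-friendly: skip images already on disk
      resp = requests.get(url, timeout=10)
      if resp.ok:
          with open(out, "wb") as f:
              f.write(resp.content)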

4. Deduplication strategy

For finding duplicate product listings, different strategies have been tried; others are yet to be explored.

1. Image Hashing
For each image, a "difference hash" (dHash) is generated based on Neal Krawetz's dHash algorithm. The hashes are then compared. The implementation can be found here.
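The core of dHash fits in a few lines; a sketch using Pillow (the linked implementation is authoritative):

  from PIL import Image

  def dhash(path, hash_size=8):
      # Resize to (hash_size+1) x hash_size in grayscale, then compare each
      # pixel with its right neighbour to produce hash_size*hash_size bits.
      img = Image.open(path).convert("L").resize((hash_size + 1, hash_size))
      pixels = list(img.getdata())
      bits = 0
      for row in range(hash_size):
          for col in range(hash_size):
              left = pixels[row * (hash_size + 1) + col]
              right = pixels[row * (hash_size + 1) + col + 1]
              bits = (bits << 1) | (left > right)
      return bits

  # Identical hashes flag exact duplicates; a small Hamming distance
  # (bin(h1 ^ h2).count("1")) would flag near-duplicates.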

This method is fast compared to K-means or feature extraction. However, it identifies only exact duplicates and cannot be used for near-duplicate files or files with minimal variation.

2. K-means using a custom distance function
I tried K-means with a custom distance function, inspired by the implementation here.

Presently the Structural Similarity Index (SSIM) is used as the distance between image vectors. Other possibilities need to be explored. Convergence of the cluster definitions takes a very long time for the 80k+ images.
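A minimal sketch of such a clustering loop, assuming images are preloaded as equal-sized grayscale float arrays in [0, 1] (the repo's implementation may differ):

  import numpy as np
  from skimage.metrics import structural_similarity as ssim

  def ssim_kmeans(images, k=10, iters=5, seed=0):
      rng = np.random.default_rng(seed)
      centroids = [images[i] for i in rng.choice(len(images), k, replace=False)]
      labels = [0] * len(images)
      for _ in range(iters):
          # Assign each image to the centroid with the highest SSIM
          # (SSIM is a similarity, so higher means "closer").
          labels = [
              max(range(k), key=lambda j: ssim(img, centroids[j], data_range=1.0))
              for img in images
          ]
          # Recompute each centroid as the pixel-wise mean of its cluster.
          for j in range(k):
              members = [img for img, lab in zip(images, labels) if lab == j]
              if members:
                  centroids[j] = np.mean(members, axis=0)
      return labels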

3. Feature extraction using a CNN followed by cosine distance

For feature extraction using deep neural networks, we adopted the SqueezeNet network pretrained on ImageNet.

SqueezeNet has several advantages over other networks:

  • It is lightweight (model size is less than 5MB).
  • Its accuracy is comparable to heavier architectures like Inception-v3.
  • At the same accuracy as AlexNet, SqueezeNet can be 3 times faster and 500 times smaller.
  • The fire modules used in SqueezeNet perform feature extraction at different resolutions. This mitigates the CNN limitation of lacking a global context of the image.

SqueezeNet generates a 512-dimensional feature vector for each image. First, features for all the images are extracted. Next, in a sorted loop, each image is compared with the remaining images: in a loop of n images, the m-th image is compared with at most (n - m) images.
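A sketch of this pipeline in PyTorch (the actual models live under networks/ and pretrained/; the framework and preprocessing here are assumptions):

  import torch
  import torchvision.models as models
  import torchvision.transforms as T
  from PIL import Image

  # Pretrained SqueezeNet backbone; global average pooling over the final
  # 512-channel feature map yields one 512-dimensional descriptor per image.
  model = models.squeezenet1_1(weights="IMAGENET1K_V1")  # torchvision >= 0.13
  extractor = torch.nn.Sequential(model.features, torch.nn.AdaptiveAvgPool2d(1))
  extractor.eval()

  preprocess = T.Compose([
      T.Resize(256), T.CenterCrop(224), T.ToTensor(),
      T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
  ])

  def features(path):
      x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
      with torch.no_grad():
          return extractor(x).flatten()

  # Cosine similarity close to 1.0 suggests a (near-)duplicate pair.
  sim = torch.nn.functional.cosine_similarity(
      features("a.jpg"), features("b.jpg"), dim=0)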

4. Haar PSI
The Haar wavelet-based perceptual similarity index (HaarPSI) is a similarity measure for images that aims to assess the perceptual similarity between two images as a human viewer would.

Here local similarity between two images is obtained from high-frequency Haar wavelet filter responses. The similarity values are then passed through a sigmoid function to introduce non-linearity.

The HaarPSI expresses the perceptual similarity of two digital images as a value in the interval [0,1]. For the implementation, refer here.
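Illustratively, the sigmoid aggregation step looks like this (only a fragment: the full HaarPSI of Reisenhofer et al. also computes the Haar filter responses and weight maps not shown here; alpha = 4.2 is the value reported in the paper):

  import numpy as np

  def logistic(x, alpha=4.2):
      return 1.0 / (1.0 + np.exp(-alpha * x))

  def haarpsi_aggregate(local_similarities, weights, alpha=4.2):
      # Weighted mean of sigmoid-mapped local similarity values.
      s = np.sum(logistic(local_similarities, alpha) * weights) / np.sum(weights)
      # Inverse logistic, then squared, mapping the score into [0, 1].
      return (np.log(s / (1.0 - s)) / alpha) ** 2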

5. Zeroth Law of Deduplication

Search optimisation for the deduplication strategy:
I have extended the Zeroth Law of thermodynamics to the optimisation of the deduplication search. Let's call it the Zeroth Law of Deduplication:

Given three images A, B and C: if C is a duplicate of A and B is not a duplicate of C, then B cannot be a duplicate of A.

Following this, whenever an image is marked as a duplicate in the loop, it is removed from the queue of comparisons for subsequent images. This reduces the search space as the loop progresses.
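A minimal sketch of this pruning, with is_duplicate standing in for any of the pairwise checks above (hash equality, cosine similarity above a threshold, etc.):

  def find_duplicates(items, is_duplicate):
      # items: ordered list of image ids; is_duplicate: pairwise predicate.
      duplicates = set()
      for i, a in enumerate(items):
          if a in duplicates:
              continue  # already marked, so never used as a query again
          for b in items[i + 1:]:
              if b not in duplicates and is_duplicate(a, b):
                  duplicates.add(b)  # pruned from all later comparisons
      return duplicates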

6. Initial Results

Some sample results can be found in the data/duplicates/ folder.

7. To-Do

  • Data download and analysis.
  • Data scripts preparation.
  • Explore image hashing (partially done).
  • Explore FAISS.
  • Explore Kernelized Locality-Sensitive Hashing (KLSH).
  • Explore Siamese networks (need to see whether that is the optimal solution for this problem).
  • Explore DELF for duplicate detection.
  • Explore word2vec for identifying duplicates.
  • Integration with a database (MongoDB) for faster search results.
  • Integration with Apache Spark (PySpark) for faster processing.
  • Time and accuracy analysis of different methods for duplicate detection.
  • Time analysis to benchmark the improvement from the Zeroth Law of Deduplication.
