Product Quantization for n dimensional dataset using Inverted Multi-Index
This project implements a product quantization-based approach for approximate nearest neighbor search. The idea is to decompose the space into a Cartesian product of low-dimensional subspaces and to quantize each subspace separately. A vector is represented by a short code composed of its subspace quantization indices. The L1 (manhattan) distance between two vectors can be efficiently estimated from their codes. An asymmetric version increases precision, as it computes the approximate distance between a vector and a code. Experimental results show that this approach searches for nearest neighbors efficiently, in particular in combination with an inverted file system. Results for SIFT and GIST image descriptors show excellent search accuracy, outperforming three stateof-the-art approaches. The scalability of our approach is validated on a data set of two billion vectors.
A new data structure for efficient similarity search in very large datasets of high-dimensional vectors is implemented. This structure called the inverted multi-index generalizes the inverted index idea by replacing the standard quantization within inverted indices with product quantization. For very similar retrieval complexity and pre-processing time, inverted multi-indices achieve a much denser subdivision of the search space compared to inverted indices, while retaining their memory efficiency. Experiments with large datasets of SIFT and GIST vectors demonstrate that because of the denser subdivision, inverted multi-indices are able to return much shorter candidate lists with higher recall, according to the research given below. Augmented with a suitable reranking procedure, multi-indices were able to significantly improve the speed of approximate nearest neighbor search on the dataset of 1 billion SIFT vectors compared to the best previously published systems, while achieving better recall and incurring only few percent of memory overhead.
$ python -m venv ./venv
$ venv\Scripts\activate.bat | $ source ./venv/bin/activate
$ pip install -r requirements.txt
max_iter
iterations.The pq()
method returns a codebook and codes for the data vectors, where
Each datapoint is partitioned P times to get sub-vectors. The idea is to find the cluster point which is closest to the sub-vector using Manhattan distance (L1/’cityblock’). Running that for all sub-vectors gives us the information of which cluster centroid represents which points. Once a sub-vector is assigned a cluster, the cluster center changes so as to best represent the vectors that belong in it. New cluster center is calculated by taking median of all the member sub-vectors. We chose L1-norm because the L2 norm squares values, so it increases the cost of outliers exponentially; the L1 norm only takes the absolute value, so it considers them linearly. The exponentiation amplifies the effect of outliers. This entire process is run for a maximum of ‘max_iter’ iterations which is provided with input. At the end, we get our codebooks with values(maybe different from initial values) which will be used to calculate the codes.
Codes is the information of sub-vectors and shows which cluster the sub-vector belongs to. Values are obtained by running the above process once and store the centroid index which is closest to the sub-vector. Codes are returned as ‘uint8’ data type along with the codebooks.
codebooks
returned by pq()
in part 1.codes
returned by pq()
in part 1.The query()
method returns an array contains the candidates for each query. Specifically, it returns
Part 2 implements the query method using inverted multi-index list with L1-norm. The objective is to query the codebook and obtain at least T candidates for each query. A multi-index list for each query is maintained that stores the distance of the query from the corresponding centroids. The distance is Manhattan distance(L1/’cityblock’) as per the reasons mentioned above. This list is sorted in ascending order of distance to get the centroid that is closest to the query. We also maintain a ‘subvectors_clusters’ python dictionary whose key is the tuple of centroids and value is the data points that belong to these centroids calculated from codes.
For algorithm 3.1, the closest combination of centroids is chosen to get the candidates. In order to prevent repetition of combination, a ‘dedup’ set is maintained. We maintain a stencil matrix so as to systematically navigate and pick the next combination of closest centroids. A min-Heap of Nodes (complex object) is maintained to always fetch the minimum farthest combination of centroid. Node is a custom Data Structure that stores the combinations centroids for all sub-vectors of query and the respective sum of distance. This structure exists simply to aide in retrieval of combination(centroid indexes) and distance. Each navigation lands us to a new combination and we add it to the heap. The min-Heap is sorted in ascending order hence the smallest element on the basis of distance (which we want) is the first element. Should the combination not exist in ‘dedup’ set, then it is added and we look up ‘subvectors_clusters’ to get the member data points. A spread, which we define as the point from which we need to start looking at the next closest combination. The closest point of one iteration becomes the new point of spread from where we navigate along one step in each dimension (maintained via stencil matrix). This process of finding the closest centroid combination (and consequently the data points) is performed till we have at least ‘T’ data points per query. The process boils down to finding the members of the cluster the query point belongs to. Should the result be less than ‘T’ candidates then we go for the next best cluster(closest cluster) and get its data points.
import submission
import pickle
import time
# How to run your implementation for Part 1
with open('./toy_example/Data_File', 'rb') as f:
Data_File = pickle.load(f, encoding = 'bytes')
with open('./toy_example/Centroids_File', 'rb') as f:
Centroids_File = pickle.load(f, encoding = 'bytes')
start = time.time()
codebooks, codes = submission.pq(data, P=2, init_centroids=centroids, max_iter = 20)
end = time.time()
time_cost_1 = end - start
# How to run your implementation for Part 2
with open('./toy_example/Query_File', 'rb') as f:
queries = pickle.load(f, encoding = 'bytes')
start = time.time()
candidates = submission.query(queries, codebooks, codes, T=10)
end = time.time()
time_cost_2 = end - start
# output for part 2.
print(candidates)
[{3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}]
To understand the algorithms and implement them. These algorithms are widely used in many Data Science packages.