# Summary: Privacy-Preserving Multi-Keyword Fuzzy Search over Encrypted Data in the Cloud

This is a summary of the paper “Privacy-Preserving Multi-Keyword Fuzzy Search over Encrypted Data in the Cloud” by Wang et al. (INFOCOM 2014). All the information is based on my understanding of the paper and may contain inaccuracies.

## Overview

This work addresses the scenario of outsourcing data storage and processing, where the data must be encrypted before being uploaded in order to preserve its privacy. It reviews previous work that enables search over encrypted data, highlighting solutions for single-keyword search, multi-keyword search, and fuzzy single-keyword search. To tackle the problem of fuzzy multi-keyword search, the authors propose a solution that uses locality-sensitive hashing (LSH) functions to build Bloom filters, which serve as indices for the files uploaded to the cloud.

Their security model assumes an honest-but-curious server, trusted users, and two threat models: the known ciphertext model (the server only sees the encrypted data) and the known background model (the server additionally has access to background information).

The first proposal, Basic Multi-Keyword Fuzzy Search, is a scheme that creates one index per file. Each index is a Bloom filter that contains all the keywords extracted from the file. This scheme has the following three steps:

1. Bigram vector representation of keywords: This step enables the use of LSH functions. It transforms each keyword extracted from the file into its set of bigrams, which is stored in a $26^2$-bit vector.
2. Bloom filter representation: Both indices and queries are represented as Bloom filters so that they can be compared. Instead of regular hash functions, the Bloom filters are constructed with LSH functions, which hash similar inputs to the same value with high probability. This work uses the p-stable LSH family.
3. Inner product based matching: The inner product between the index vector of a file and the query vector indicates whether the file contains the keywords of the query: the higher the inner product, the more query keywords are present in the file.
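
The three steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the Bloom filter length, the number of LSH functions, and the width parameter `W` are made-up values, and the p-stable hash is the standard $h(x) = \lfloor (a \cdot x + b) / w \rfloor$ construction with Gaussian $a$.

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
VEC_LEN = 26 * 26      # one position per possible bigram
BF_LEN = 8000          # Bloom filter length (illustrative choice)
NUM_LSH = 30           # number of LSH functions (illustrative choice)
W = 4.0                # p-stable bucket width (illustrative choice)

rng = np.random.default_rng(42)
# p-stable LSH: h(x) = floor((a . x + b) / w), with a drawn from N(0, 1)
A = rng.standard_normal((NUM_LSH, VEC_LEN))
B = rng.uniform(0, W, NUM_LSH)

def bigram_vector(keyword):
    """Step 1: encode a keyword as a 26^2-bit vector of its bigrams."""
    v = np.zeros(VEC_LEN)
    for i in range(len(keyword) - 1):
        a, b = keyword[i], keyword[i + 1]
        if a in ALPHABET and b in ALPHABET:
            v[ALPHABET.index(a) * 26 + ALPHABET.index(b)] = 1.0
    return v

def lsh_positions(vec):
    """Step 2: hash the bigram vector with every LSH function."""
    return np.floor((A @ vec + B) / W).astype(int) % BF_LEN

def bloom_filter(keywords):
    """Insert each keyword into one Bloom filter via the LSH functions."""
    bf = np.zeros(BF_LEN)
    for kw in keywords:
        bf[lsh_positions(bigram_vector(kw))] = 1.0
    return bf

# Step 3: inner product -- a higher score means more query keywords
# are (probably) present in the file, even under small misspellings.
index = bloom_filter(["network", "security", "cloud"])
query = bloom_filter(["netwrok"])   # misspelled query keyword
print(index @ query)
```

Because "netwrok" shares most of its bigrams with "network", its bigram vector is close to that of "network" in Euclidean distance, so the LSH functions hash it to many of the same Bloom filter positions and the inner product stays high.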

The solution ensures security by encrypting both the indices and the trapdoors used to query the uploaded data. The encryption splits each vector in two according to a random bit vector, and then multiplies each part by a previously generated invertible matrix.
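A sketch of this split-and-multiply encryption, in the style of the secure kNN technique the construction resembles: where the secret bit vector `S` is 0 the index parts are equal and the query parts are split randomly, and vice versa, so the inner product of the encrypted vectors equals that of the plaintext vectors. The vector length and key generation here are illustrative assumptions, not the paper's parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 6                                # toy vector length

# Secret key: a random bit vector S and two invertible matrices.
# Random Gaussian matrices are invertible with probability 1.
S = rng.integers(0, 2, m)
M1 = rng.standard_normal((m, m))
M2 = rng.standard_normal((m, m))

def split(v, split_where):
    """Split v into (v1, v2): random split where split_where is True,
    identical copies elsewhere, so that v1 + v2 = 2*v off the mask."""
    v1, v2 = v.astype(float).copy(), v.astype(float).copy()
    r = rng.standard_normal(len(v))
    v1[split_where] = r[split_where]
    v2[split_where] = v[split_where] - r[split_where]
    return v1, v2

def encrypt_index(p):
    p1, p2 = split(p, S == 1)        # index split where S[j] = 1
    return M1.T @ p1, M2.T @ p2

def encrypt_query(q):
    q1, q2 = split(q, S == 0)        # query split where S[j] = 0
    return np.linalg.inv(M1) @ q1, np.linalg.inv(M2) @ q2

p = np.array([1, 0, 1, 1, 0, 1])     # toy index Bloom filter
q = np.array([1, 0, 0, 1, 0, 0])     # toy query Bloom filter

e1, e2 = encrypt_index(p)
t1, t2 = encrypt_query(q)
score = e1 @ t1 + e2 @ t2            # equals p . q up to rounding
```

The matrices cancel because $(M_1^T p_1) \cdot (M_1^{-1} q_1) = p_1 \cdot q_1$, and the complementary splitting makes $p_1 \cdot q_1 + p_2 \cdot q_2 = p \cdot q$, while the server only ever sees the transformed vectors.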

The Basic Scheme is secure under the Known Ciphertext Model, but it is vulnerable under the Known Background Model. This is because the adversary can make associations between keywords and their trapdoors, which can lead to the recovery of indices.

The proposed solution for this vulnerability is the Enhanced Multi-keyword Fuzzy Search Scheme. It introduces an extra security layer in the form of a pseudo-random function that is applied in the encryption step of the indices and the queries. Since the function is applied consistently to both indices and queries, it does not affect the result of the inner product.

The experimental evaluation shows that the time to generate the indices and the trapdoors grows linearly with the number of keywords in the file or in the query, since each keyword has to be hashed many times. The search time also grows linearly with the number of documents, since the index of each file has to be compared against the query. On the other hand, the number of keywords in a file does not affect the search time, since the size of the compared Bloom filters is independent of the number of keywords. Finally, accuracy is measured in terms of precision and recall. Precision decreases slightly for larger numbers of query keywords, due to the accumulation of false positives across keywords.

## Contributions

The paper proposes a multi-keyword fuzzy search scheme that does not require a predefined keyword dictionary.

It uses LSH to construct Bloom filters, and the similarity between the index vector and the query vector is calculated with the inner product. Although it is not mentioned in the paper, the computation of inner products allows parallel processing, which can greatly improve the performance of the system.
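For instance, all per-file inner products can be computed at once by stacking the (encrypted) indices as rows of a matrix and issuing a single matrix-vector product, which vectorized or parallel libraries handle natively. The sizes below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
num_files, bf_len = 1000, 8000       # illustrative sizes

# Stack every per-file index Bloom filter as one row of a matrix.
indices = rng.integers(0, 2, (num_files, bf_len)).astype(float)
query = rng.integers(0, 2, bf_len).astype(float)

scores = indices @ query             # all inner products in one shot
top10 = np.argsort(scores)[::-1][:10]  # best-matching files
```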

## Limitations

The use of Bloom filters can generate false positive results, since items that are not similar can be hashed to the same value. Similarly, LSH functions can produce both false positives and false negatives. The paper provides a theoretical analysis of both, and highlights that the trade-off between them can be tuned by varying the size of the Bloom filters and the number of LSH functions used to create them.
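As a rough intuition for the filter-size knob (using the textbook approximation for a standard hash-based Bloom filter, not the paper's LSH-specific analysis, which additionally models false negatives): with $n$ inserted items, $k$ hash functions, and $m$ bits, the false-positive rate is approximately $(1 - e^{-kn/m})^k$, so growing $m$ drives it down quickly.

```python
import math

def bloom_fp_rate(m, n, k):
    """Textbook Bloom filter false-positive approximation:
    (1 - e^{-k*n/m})^k for m bits, n items, k hash functions."""
    return (1 - math.exp(-k * n / m)) ** k

# Larger filters lower the false-positive rate for fixed n and k.
for m in (1000, 5000, 20000):
    print(m, bloom_fp_rate(m, n=100, k=5))
```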

## Final Notes

A simple presentation about this paper is available on SlideShare: