# Summary: Inverted Index Based Multi-Keyword Public-key Searchable Encryption with Strong Privacy Guarantee

This is a summary of the paper “Inverted Index Based Multi-Keyword Public-key Searchable Encryption with Strong Privacy Guarantee” by Wang et al. (INFOCOM 2015). All the information is based on my understanding of the paper and may contain inaccuracies.

## Overview

This work proposes a searchable encryption protocol based on inverted indices. The authors highlight the advantages of inverted indices, especially for indexing large amounts of text documents. Based on these advantages and on its wide adoption in the context of plaintext data, this research presents a system that encrypts the existing inverted indices while maintaining their search capabilities.

The focus of the paper is on overcoming two important limitations presented by the previous works:

1. Since the server can see the patterns of the encrypted array representing the index for a given keyword, the index must be rebuilt after a query uses that keywords.
2. Lack of support for conjunctive multi-keyword search.

The main building blocks for this work are inverted indices themselves, private set intersection and blind storage:

• A inverted index has a number of inverted lists (one for each keyword), and each list has the IDs of the documents that contain that keyword.
• Private set intersection can securely find the intersection of the inverted lists that satisfy each of the keywords in the query.
• The blind storage protocol securely retrieves documents from the cloud server after the document IDs have been obtained using the searchable encryption protocol.

There are three main entities in the system:

1. A data owner who constructs the index, encrypts the documents and issues trapdoors for queries.
2. A honest-but-curious cloud server that stores the encrypted documents and the encrypted index. The server performs the search using the index, but should not learn anything about the documents, the index or the queries.
3. Trusted users that request trapdoors from the data owner and then use those trapdoors to perform queries on the server.

The main steps of the protocol are as follows:

1. It is assumed that the data owner already has a plaintext inverted index for the documents. The encryption of the index is done by creating polynomials for each inverted list and encrypting the coefficients using the Paillier cryptosystem. The index and the encrypted documents are then sent to the cloud server.
2. A user contacts the data owner to obtain a trapdoor for a given query. The trapdoor contains the coefficients of the query polynomial, which is padded with random terms to avoid leakage of the number of keywords in the query.
3. The server uses the encrypted index provided by the data owner and the trapdoor provided by the user to calculate the encrypted intersection of the inverted lists that contain the query keywords.
4. The user decrypts the result with the help of the data owner, obtaining the IDs of the documents that satisfy the query.
5. Using a blind storage protocol, the user can retrieve the documents from the cloud server without leaking any additional information.

A experimental evaluation is performed using datasets of up to 12000 documents:

1. The index generation step happens only once and its cost depends mainly on the size of the encryption keys used in the system.
2. The trapdoor generation can become expensive due to matrix multiplications, but the authors claim that it can be optimized with special matrix multiplication algorithms.
3. In the computation done by the server, the encryption takes most of the time, but it can be parallelized for better performance.

## Contributions

Assuming a existing inverted index contributes to an easier application of this method to systems that already use inverted indices over plaintext data.

The use of cheaper operations (e.g.: multiplication and exponentiation) instead of more expensive ones (e.g.: pairing) contributes to a better performance in comparison to other works.

Allowing conjunctive multi-keyword queries is a desirable feature for the use of this system in real-world scenarios.

## Limitations

The data owner must always be online and participating in the system by issuing trapdoors for the queries and helping in the decryption of results. This can become a bottleneck for the scalability of the system.

Considering that the data owner has to participate in the protocol anyway, it is not clear why the processing to find the documents’ IDs should be done at the cloud server and not at the data owner. In other words, it seems the cloud could be responsible only for the storage and the distribution of documents for the authorized users, but the search using the inverted index could be done at the data owner.

The proposal does not support disjunctive multi-keyword queries, which are also frequently used in real-word applications.

Final Notes

A presentation introducing this work is available at SlideShare.

# Summary: Privacy Preserving Record Linkage with PPJoin

This is a summary of the paper “Privacy Preserving Record Linkage with PPJoin” by Sehili et al. (BTW 2015). All the information is based on my understanding of the paper and may contain inaccuracies.

## Overview

This work deals with the problem of joining records that are similar while maintaining privacy and security. The proposal, called P4Join, extends the algorithm PPJoin, which is one of the most effective similarity join algorithms.

The encryption is done by applying cryptographic hash functions (MD5 and SHA-1) to n-grams and hashing them into a Bloom filter. After the Bloom filter is constructed, in order to reduce the quadratic complexity of comparing all pairs of items, the filtering techniques of PPJoin (namely, length filter, prefix filter and position filter) prune pairs that would be unlikely to match.

The authors implement the solution using off-the-shelf GPUs. In the GPU kernel, each record from the first relation is compared to the records on the other relation and checked by the length filter. The pairs that pass the length filtering are then checked by the prefix filter.

The experiments use synthetic data and the datasets are split into partitions of a maximum size of 2,000 records in order to fit in the GPUs memory. Compared to the naive nested-loop join approach, P4Join saves most of the time due to the length filter. On the other hand, the prefix filtering does not yield any performance improvement due to the overhead of comparing large prefixes found in the Bloom filters.

In general, the solution achieved speedups of 20 times compared to the single-thread CPU implementation, and another 10-20% improvement was obtained when using GPU and multi-thread CPUs.

## Contributions

The authors present an adaptation of an efficient similarity join algorithm (PPJoin) to the scenario where the data is encrypted.

By using GPUs for parallel processing, the algorithm achieved speedups of 20 times compared to the serial execution.

## Limitations

Although there is the mention of record linkage, the proposal focus on the matching step and the joining of the records.

The comparison between two Bloom filters can only be done because they were encrypted deterministically, which can lead to dictionary and frequency attacks.

There are no experiments comparing the GPU implementation with multi-thread CPU implementation or with a parallel nested-loop join implementation.

## Final Notes

A presentation introducing this work is available at SlideShare.

# Summary: Efficient Similarity Search over Encrypted Data

This is a summary of the paper “Efficient Similarity Search over Encrypted Data” by Kuzu et al. (ICDE 2012). All the information is based on my understanding of the paper and may contain inaccuracies.

## Overview

This paper proposes a method to perform similarity search over encrypted data by using locality sensitive hashing (LSH).

The similarity searchable encryption scheme used is a non-deterministic method that can generate a trapdoor $T$ for a feature $f$ of a particular data item. A secure index $I$ is created for the data items and it is used to find items that have a specific feature.

Three schemes are proposed: Basic single server, two-round multi-server and one-round multi-server.

The basic single server search scheme has six main steps:

1. Key generation: When private keys are generated.
2. Index construction: A secure index $I$ is created for the collection of data items
3. Data encryption: The data items are encrypted and uploaded, and the keys and hash functions are shared with the users.
4. Trapdoor construction: Using LSH, a trapdoor is constructed for the query feature and it is sent to the server.
5. Search: The server searches on the index using the received trapdoor and returns the bit vectors that represent items that can be part of the result. The user decrypts the bit vectors, ranks the identifiers and sends the desired identifiers to the server. Finally, the server returns the encrypted items.
6. Data decryption: The user decrypts the items that belong to the result.

The basic scheme can leak the association between identifiers and trapdoors, which can be used to perform statistical attacks. The solution for this uses two servers: One server stores the index and the other stores the data items. After users decrypt and rank the bit vectors from the server holding the index, they can request the items from the other server by using the desired identifiers.

Although the multi-server approach is more secure, it still requires the user to do a considerable amount of work to decrypt the bit vectors and rank the identifiers. To reduce the client computation, a one-round search scheme uses the Paillier cryptosystem, which is non-deterministic and homomorphic additive, to rank the items. In this scheme, the server holding the index receives a trapdoor from the user and computes scores between the trapdoor and the other bit vectors. The encrypted scores are sent directly to the other server, which decrypts them and ranks the items according to the scores. After that, the encrypted results are sent to the user, who decrypts them and obtains the desired plaintexts.

In general terms, the two-round multi-server scheme is good for scenarios that have a large number of data items, and the one-round scheme is better if the items have a large number of features.

## Contributions

The index constructed using LSH is secure under an adapted definition of semantic security.

The fuzzy search provides a fault-tolerant way to search for keywords, allowing the user to find the desired results even if the query contains typos or other errors.

Information leakages (e.g., the association between identifiers and trapdoors) that allow statistical attacks are also considered by this work.

## Limitations

The multi-server model assumes that the servers are honest-but-curious and do not collaborate with each other, with might be unrealistic in some scenarios.

Although the one-round scheme minimizes the computation perform by the user, it incurs in an extra communication cost between the servers.

# Summary: Privacy-preserving Search for Chemical Compound Databases

This is a summary of the paper “Privacy-preserving Search for Chemical Compound Databases” by Shimizu et al. (BMC Bioinformatics 2015). All the information is based on my understanding of the paper and may contain inaccuracies.

## Overview

The proposal assumes a two party computation scenario in which a user holding a chemical compound fingerprint wants to verify if there are similar compounds stored in a database. The user should not know anything other than the result of the query (database privacy), and the server should not know anything about the user’s query compound (user privacy). The similarity is calculated using the Tversky Index, which is a more general form of the Jaccard Index and the Dice Coefficient.

The compounds are assumed to be stored in plaintext and only encrypted when the search is performed. To make a search, the user encrypts its compound with a public key (from a additive homomorphic cryptosystem) and sends it to the server. The server exploits the additive homomorphic property to check whether the Tversky Index between the query compound and the other compounds in the database is greater than a given threshold.
After that, the server encrypts the Tversky Index using the user’s public key, and send the ciphertext back to the user.

Since the user receives the Tversky Index, it is possible to perform regression attacks using multiple queries and discover more information about a particular compound based on its distance to other query compounds. To prevent such attacks, the user should only know the sign of the Tversky Index (non-negative means that the similarity is higher than the threshold). The server shuffles the real Tversky Index among dummy values and sends their encryption to the user along with the number of non-negative dummy values. The user can decrypt the ciphertexts and count the total number of non-negative values. If the total number of non-negative values subtracted by the number of non-negative dummies is the value 1, then the user knows that the query compound is similar to a compound in the server.

In addition, to guarantee that the protocol works even against malicious users, zero knowledge proofs can be used to check the validity of the query.

## Contributions

This work uses additive-homomorphic cryptosystems to present a secure way to query databases for similarity of chemical compounds.

The user privacy is assured by the non-deterministic encryption using the user’s public key.

To ensure database privacy, the result of the comparison (how many similar items were found in the database) is shuffled with dummy values.

## Limitations

The data in the server has to be stored in plaintext and then encrypted with the querying user’s public key in order to perform the matching. Therefore, this model cannot be applied in the case the server is honest-but-curious.

Furthermore, the process of returning dummy data to the user has to be performed to all the pairs of items, thus incurring a large processing and communication overhead.

# Summary: Privacy-Preserving Multi-Keyword Fuzzy Search over Encrypted Data in the Cloud

This is a summary of the paper “Privacy-Preserving Multi-Keyword Fuzzy Search over Encrypted Data in the Cloud” by Wang et al. (INFOCOM 2014). All the information is based on my understanding of the paper and may contain inaccuracies.

## Overview

This work locates itself in the scenario of outsourcing the storage and processing of data which needs to be encrypted before being uploaded in order to preserve its privacy. It mentions previous works that enables search over encrypted data, highlighting solutions that provide single keyword search, multi-keyword search and fuzzy single keyword search. To tackle to problem of fuzzy multi-keyword search, they propose a solution that uses LSH functions to create Bloom filters which are used as indices for the files to be uploaded to the cloud.

Their security model assumes a honest-but-curious server, trusted users and two threat models: known ciphertext model (limited server knowledge) and known background model (background information available).

The first proposal, Basic Multi-Keyword Fuzzy Search, is a scheme that creates one index per file. Each index is a Bloom filter that contains all the keywords extracted from the file. This scheme has the following three steps:

1. Bigram vector representation of keyword: This steps enables the use of LSH functions. It transforms each keyword extracted from the file into a bigram set, which is are stored in a $26^2$-bit vector.
2. Bloom filter representation: Both indices and queries are represented as Bloom filters in order to be compared. Instead of using regular hash functions, Bloom filters are constructed using LSH functions which will hash similar inputs to the same hash values with a high probability. In this work they use the p-stable family of functions.
3. Inner product based matching: The result of the inner product between the index vector for a given file  and the query vector can be indicate if the file contains the keywords of the query. High values for the inner product mean that more keywords are present in the file.

The solution assures security by encrypting the index and the trapdoors used to query the uploaded data. The encryption is done by splitting the index according to a random vector, and then multiplying each part by previously created invertible matrices.

The Basic Scheme is secure under the Known Ciphertext Model, but it is vulnerable under the Known Background Model. This is because the adversary can make associations between keywords and their trapdoors, which can lead to the recovery of indices.

The proposed solution for this vulnerability is the Enhanced Multi-keyword Fuzzy Search Scheme. It introduces a new security layer in the form of a pseudo-random function that is applied in the encryption step of the indices and the queries. This function does not affect the result of the inner product, since pseudo-random functions are collision free.

The experimental evaluation shows that the time to generate indices and the trapdoors grows linearly with the number of keywords in the file or in the query, since each keyword has to be hashed many times. The search time also grows linearly with the number of documents, since the index for each file has to be compared to the query. On the other hand, the number of keywords in the file does not affect the search time, since the size of the compared Bloom filters is independent from the number of keywords. Finally, the accuracy is measured in terms of precision and recall. The precision for a larger number of keywords decreases slightly due to the accumulation of false positives for each keyword.

## Contributions

The paper proposes a multi-keyword fuzzy search scheme that does not require a predefined keyword dictionary.

It uses LSH to construct Bloom filters, and the similarity between the index vector and the query vector is calculated using inner product. Although it is not mentioned in the paper, the computation of inner products allow parallel processing, which can greatly improve the performance of the system.

## Limitations

The use of Bloom filters can generate false positive results, since items that are not similar can be hashed into the same value. Similarly, LSH functions can create false positives and false negatives. The paper provides a theoretical analysis of both false positives and false negatives, and highlights that the trade-off between both can be tuned by varying the size of the Bloom filters and the number of LSH functions used to create it.

# Summary: Fuzzy Keyword Search over Encrypted Data in Cloud Computing

This is a summary of the paper “Fuzzy Keyword Search over Encrypted Data in Cloud Computing” by Li et al. (INFOCOM 2010). All the information is based on my understanding of the paper and may contain inaccuracies.

## Overview

This work is motivated by the security concerns present in the cloud computing environment. As a solution for protecting data privacy, the data is encrypted before being uploaded. Since searchable encryption schemes only provide exact keyword match, there is a need for fuzzy search over encrypted data.

The fuzzy search can identify items that are not exactly the same, but similar enough to satisfy a given similarity threshold. This solution considers edit distance as the similarity metric, defined as the number of operations (substitution, insertion and deletion) required to transform one string into another. The result of the fuzzy search is a set containing file IDs whose corresponding files contain words that are similar to the query keyword (i.e., $ed(w, w') \leq k$, for query keyword $w$ and the similarity threshold $k$). The file IDs are encrypted using symmetric encryption and are retrieved using trapdoors calculated using the same encryption key.

Considering all the operations (substitution, insertion and deletion) separately can create large fuzzy sets, thus increasing the storage requirements. The wildcard-based fuzzy search uses wildcards to denote all the operations at a given position in the word. Using this approach, the number of elements in the fuzzy sets are reduced from $(2 \ell + 1) \times + 1$ to $2 \ell + 1 + 1$ ($\ell$ is the length of the keyword).

## Contributions

The wildcard-based approach reduces the size of the fuzzy sets used to find candidates for matching the query keyword.

The scheme uses symmetric encryption, which offers good performance.

Search privacy is assured by the system.

## Limitations

Even reducing the size of the fuzzy sets, the number of possibilities is still large.

The system requires a predefined set of query keywords.

Multiple keywords are not supported in one query.

The proposal does not present an experimental evaluation.

# Summary: Fast, Private and Verifiable: Server-aided Approximate Similarity Computation over Large-Scale Datasets

This is a summary of the paper “Fast, Private and Verifiable: Server-aided Approximate Similarity Computation over Large-Scale Datasets” by Qiu et al. (SCC 2016). All the information is based on my understanding of the paper and may contain inaccuracies.

## Overview

The main objective of this work is to compute the Jaccard similarity (JS) of two sets in an efficient and secure way. Jaccard similarity is a similarity metric used in many applications, ranging from plagiarism detection to recommendation systems. To tackle the efficiency aspect, the MinHash technique is used to approximate the Jaccard similarity. Regarding security, the solution uses deterministic encryption and performs computations over the ciphertexts.

The core idea of MinHash is to use hash functions to create signatures for the sets. These signatures are a compact representation of the original sets, thus saving storage and computation. On the other hand, the value obtained is an approximation of the real similarity and therefore has an approximation bias ($\epsilon = \frac{1}{\sqrt{k}}$, where $k$ is the number of hash functions to construct the signatures).

Two protocols are proposed in this work: Protocol 1 assumes a honest-but-curious server that computes the similarity between sets $A$ and $B$. Protocol 2 considers the possibility of an malicious server and performs a consistency check to see whether the server is really malicious.

The steps of Protocol 1 are as follows:

1. Each client calculates the signatures for each set
2. Each client encrypts the signatures using deterministic encryption
3. The encrypted signatures are uploaded to the server
4. The server compares the signatures and calculates the similarity between the sets
5. The server outputs the similarity value to the clients

In Protocol 1, the hash functions used to calculate the signatures are shared between the clients before the protocol starts. In addition, the secret key must be shared in order to allow comparisons of ciphertext.

Protocol 2 builds on top of Protocol 1 and tries to verify whether the server is malicious by doing the following steps:

1. Use Protocol 1 to calculate $\mathit{JS}(A,B)$
2. Use Protocol 2 to calculate the similarity between dummied sets $D_A$ and $D_B$, $\mathit{JS}(D_A, D_B)$
3. The clients can perform a consistency check to verify if the values $\mathit{JS}(A,B)$ and $\mathit{JS}(D_A, D_B)$ are consistent

The consistency checks returns $1$ (i.e., the server is not malicious) if the server returns a value for $\mathit{JS}(D_A, D_B)$ that is inside an error range from the value $\mathit{JS}(A, B)$. If the value is not inside this range, the server is considered malicious and the check returns $0$.

Due to the use of approximations, there might be cases of false positives (i.e., the server is honest, but the check says it is malicious) and false negatives (i.e., the server is malicious, but the check says it is honest).

The experiments evaluate the efficiency comparing Protocol 1 and Protocol 2 in a sequential and in a parallel implementation. The parallelism is especially useful to speedup the computation of the MinHash signatures, which can be calculated concurrently. The parallel version has speedups of about five times compared to the sequential version.

Regarding the accuracy of the consistency checks, experiments show that the False Positive Rate (FPR) does not change if the number of hash functions used ($k$), but it does change based on the similarity of the sets. In other words, situations in which sets that are more similar give a higher FPR. On the other hand, the number of hash functions used $k$ has an impact on the False Negative Rate (FNR), and higher values of $k$ yield lower numbers of false negatives. However, increasing $k$ also increases the storage requirements and the computation time, so there is a trade-off between accuracy and efficiency.

## Contributions

The main contribution of this work is the consistency check mechanism to verify whether the server is malicious.

The experimental evaluation supports the theoretical claims and raises an important practical issue, namely the trade-off between accuracy and efficiency by varying the number of hash functions used to construct the signatures.

## Limitations

The solution assumes no collusion between the parties (clients and server) in the protocol.

Furthermore, it is assumed that the server does not have background knowledge, and that is why it is possible to use deterministic encryption. If the server have background knowledge, using deterministic encryption can create vulnerabilities to dictionary and frequency attacks.

Finally, since the server also knows the value of the calculated similarity, it may perform regression attacks to infer the similarity between other sets.

## Final Notes

A presentation introducing this work is available at SlideShare: