# Summary: Fast, Private and Verifiable: Server-aided Approximate Similarity Computation over Large-Scale Datasets

This is a summary of the paper “Fast, Private and Verifiable: Server-aided Approximate Similarity Computation over Large-Scale Datasets” by Qiu et al. (SCC 2016). All the information is based on my understanding of the paper and may contain inaccuracies.

## Overview

The main objective of this work is to compute the Jaccard similarity (JS) of two sets in an efficient and secure way. Jaccard similarity is a similarity metric used in many applications, ranging from plagiarism detection to recommendation systems. To tackle the efficiency aspect, the MinHash technique is used to approximate the Jaccard similarity. Regarding security, the solution uses deterministic encryption and performs computations over the ciphertexts.

The core idea of MinHash is to use hash functions to create signatures for the sets. These signatures are a compact representation of the original sets, thus saving storage and computation. On the other hand, the value obtained is an approximation of the real similarity and therefore carries an approximation error ($\epsilon = \frac{1}{\sqrt{k}}$, where $k$ is the number of hash functions used to construct the signatures).
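As a concrete illustration, the following sketch estimates Jaccard similarity from MinHash signatures. The random affine hash family is an illustrative assumption; the summary does not specify which hash functions the paper uses.

```python
import random


def minhash_signature(s, hash_funcs):
    """Signature component i = minimum of hash function i over the set."""
    return [min(h(x) for x in s) for h in hash_funcs]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions approximates JS(A, B)."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


# k shared hash functions (random affine maps modulo a prime -- an
# illustrative choice, not the paper's concrete hash family).
k, p = 256, 2_147_483_647
rng = random.Random(42)
hash_funcs = [
    (lambda a, b: (lambda x: (a * hash(x) + b) % p))(
        rng.randrange(1, p), rng.randrange(p)
    )
    for _ in range(k)
]

A = set(range(0, 80))
B = set(range(40, 120))
true_js = len(A & B) / len(A | B)  # 40 / 120 = 1/3
est = estimate_jaccard(minhash_signature(A, hash_funcs),
                       minhash_signature(B, hash_funcs))
```

With $k = 256$ the estimate typically lands within $1/\sqrt{256} = 0.0625$ of the true value, illustrating the error bound above.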

Two protocols are proposed in this work: Protocol 1 assumes an honest-but-curious server that computes the similarity between sets $A$ and $B$. Protocol 2 considers a possibly malicious server and performs a consistency check to determine whether the server is behaving honestly.

The steps of Protocol 1 are as follows:

1. Each client calculates the signatures for each set
2. Each client encrypts the signatures using deterministic encryption
3. The encrypted signatures are uploaded to the server
4. The server compares the signatures and calculates the similarity between the sets
5. The server outputs the similarity value to the clients

In Protocol 1, the hash functions used to calculate the signatures are shared between the clients before the protocol starts. In addition, the secret key must be shared in order to allow comparisons of ciphertexts.
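Under these assumptions (shared hash functions and a shared secret key), the steps of Protocol 1 can be sketched as below. Modeling the deterministic encryption as a keyed PRF (HMAC-SHA256) is an illustrative assumption: the paper only needs equal plaintexts to map to equal ciphertexts under the shared key, and this summary does not name the concrete scheme.

```python
import hashlib
import hmac


def det_encrypt(key, value):
    """Deterministic encryption modeled as HMAC-SHA256 (illustrative
    stand-in): equal MinHash values yield equal ciphertexts."""
    return hmac.new(key, str(value).encode(), hashlib.sha256).digest()


def client_upload(key, signature):
    # Steps 1-3: compute the signature, encrypt each component with the
    # shared key, and upload the encrypted signature to the server.
    return [det_encrypt(key, v) for v in signature]


def server_similarity(enc_sig_a, enc_sig_b):
    # Step 4: the server compares ciphertexts positionwise; ciphertext
    # equality implies equality of the underlying MinHash values.
    matches = sum(1 for a, b in zip(enc_sig_a, enc_sig_b) if a == b)
    return matches / len(enc_sig_a)


key = b"shared-secret"    # shared between the two clients beforehand
sig_a = [3, 17, 5, 9]     # toy MinHash signatures for A and B
sig_b = [3, 17, 8, 9]
result = server_similarity(client_upload(key, sig_a),
                           client_upload(key, sig_b))
# signatures match at positions 0, 1, and 3 -> estimate 3/4
```

Note that the server learns the similarity value (step 5) but never sees the plaintext signatures, which is the privacy goal of the protocol.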

Protocol 2 builds on top of Protocol 1 and tries to verify whether the server is malicious by doing the following steps:

1. Use Protocol 1 to calculate $\mathit{JS}(A,B)$
2. Use Protocol 1 again to calculate the similarity between dummied sets $D_A$ and $D_B$, $\mathit{JS}(D_A, D_B)$
3. The clients can perform a consistency check to verify if the values $\mathit{JS}(A,B)$ and $\mathit{JS}(D_A, D_B)$ are consistent

The consistency check returns $1$ (i.e., the server is not malicious) if the server returns a value for $\mathit{JS}(D_A, D_B)$ that is within an error range of the value $\mathit{JS}(A, B)$. If the value falls outside this range, the server is considered malicious and the check returns $0$.
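A minimal sketch of this check is shown below. The exact relation between $\mathit{JS}(A, B)$ and the expected value of $\mathit{JS}(D_A, D_B)$ depends on how the dummied sets are constructed, which this summary does not detail, so the expected value is passed in as an abstract parameter; the tolerance uses the $\epsilon = 1/\sqrt{k}$ bound from above.

```python
def consistency_check(js_dummy, expected_dummy_js, k):
    """Return 1 (server judged honest) if the server's answer on the
    dummied sets lies within the MinHash error bound of the value the
    clients expect, else 0 (server judged malicious).

    expected_dummy_js is whatever value the clients derive from
    JS(A, B) and their known dummy construction (left abstract here).
    """
    epsilon = 1 / k ** 0.5  # approximation error bound for k hashes
    return 1 if abs(js_dummy - expected_dummy_js) <= epsilon else 0


# With k = 256, epsilon = 0.0625: an answer near the expected 1/3
# passes, while a grossly wrong answer is flagged as malicious.
honest = consistency_check(0.34, 1 / 3, k=256)     # -> 1
malicious = consistency_check(0.60, 1 / 3, k=256)  # -> 0
```

Because the tolerance comes from an approximation bound rather than an exact equality, the check is inherently probabilistic, which is the source of the false positives and false negatives discussed next.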

Due to the use of approximations, there might be cases of false positives (i.e., the server is honest, but the check says it is malicious) and false negatives (i.e., the server is malicious, but the check says it is honest).

The experiments compare the efficiency of Protocol 1 and Protocol 2 in both a sequential and a parallel implementation. Parallelism is especially useful to speed up the computation of the MinHash signatures, which can be calculated concurrently. The parallel version achieves speedups of about five times over the sequential version.

Regarding the accuracy of the consistency check, experiments show that the False Positive Rate (FPR) does not change with the number of hash functions used ($k$), but it does depend on the similarity of the sets: more similar sets give a higher FPR. On the other hand, $k$ does have an impact on the False Negative Rate (FNR), with higher values of $k$ yielding fewer false negatives. However, increasing $k$ also increases the storage requirements and the computation time, so there is a trade-off between accuracy and efficiency.

## Contributions

The main contribution of this work is the consistency check mechanism to verify whether the server is malicious.

The experimental evaluation supports the theoretical claims and raises an important practical issue, namely the trade-off between accuracy and efficiency by varying the number of hash functions used to construct the signatures.

## Limitations

The solution assumes no collusion between the parties (clients and server) in the protocol.

Furthermore, it is assumed that the server has no background knowledge, which is what makes the use of deterministic encryption viable. If the server does have background knowledge, deterministic encryption becomes vulnerable to dictionary and frequency attacks.

Finally, since the server also knows the value of the calculated similarity, it may perform regression attacks to infer the similarity between other sets.

## Final Notes

A presentation introducing this work is available at SlideShare: