This is a summary of the paper “Privacy-preserving Search for Chemical Compound Databases” by Shimizu et al. (BMC Bioinformatics 2015). All the information is based on my understanding of the paper and may contain inaccuracies.
The proposal assumes a two party computation scenario in which a user holding a chemical compound fingerprint wants to verify if there are similar compounds stored in a database. The user should not know anything other than the result of the query (database privacy), and the server should not know anything about the user’s query compound (user privacy). The similarity is calculated using the Tversky Index, which is a more general form of the Jaccard Index and the Dice Coefficient.
The compounds are assumed to be stored in plaintext and only encrypted when the search is performed. To make a search, the user encrypts its compound with a public key (from a additive homomorphic cryptosystem) and sends it to the server. The server exploits the additive homomorphic property to check whether the Tversky Index between the query compound and the other compounds in the database is greater than a given threshold.
After that, the server encrypts the Tversky Index using the user’s public key, and send the ciphertext back to the user.
Since the user receives the Tversky Index, it is possible to perform regression attacks using multiple queries and discover more information about a particular compound based on its distance to other query compounds. To prevent such attacks, the user should only know the sign of the Tversky Index (non-negative means that the similarity is higher than the threshold). The server shuffles the real Tversky Index among dummy values and sends their encryption to the user along with the number of non-negative dummy values. The user can decrypt the ciphertexts and count the total number of non-negative values. If the total number of non-negative values subtracted by the number of non-negative dummies is the value 1, then the user knows that the query compound is similar to a compound in the server.
In addition, to guarantee that the protocol works even against malicious users, zero knowledge proofs can be used to check the validity of the query.
This work uses additive-homomorphic cryptosystems to present a secure way to query databases for similarity of chemical compounds.
The user privacy is assured by the non-deterministic encryption using the user’s public key.
To ensure database privacy, the result of the comparison (how many similar items were found in the database) is shuffled with dummy values.
The data in the server has to be stored in plaintext and then encrypted with the querying user’s public key in order to perform the matching. Therefore, this model cannot be applied in the case the server is honest-but-curious.
Furthermore, the process of returning dummy data to the user has to be performed to all the pairs of items, thus incurring a large processing and communication overhead.
I prepared a simple presentation about this work and made it available on SlideShare: