This is a summary of the paper “Privacy Preserving Record Linkage with PPJoin” by Sehili et al. (BTW 2015). All the information is based on my understanding of the paper and may contain inaccuracies.
This work deals with the problem of joining records that are similar while maintaining privacy and security. The proposal, called P4Join, extends the algorithm PPJoin, which is one of the most effective similarity join algorithms.
The encryption is done by applying cryptographic hash functions (MD5 and SHA-1) to n-grams and hashing them into a Bloom filter. After the Bloom filter is constructed, in order to reduce the quadratic complexity of comparing all pairs of items, the filtering techniques of PPJoin (namely, length filter, prefix filter and position filter) prune pairs that would be unlikely to match.
The authors implement the solution using off-the-shelf GPUs. In the GPU kernel, each record from the first relation is compared to the records on the other relation and checked by the length filter. The pairs that pass the length filtering are then checked by the prefix filter.
The experiments use synthetic data and the datasets are split into partitions of a maximum size of 2,000 records in order to fit in the GPUs memory. Compared to the naive nested-loop join approach, P4Join saves most of the time due to the length filter. On the other hand, the prefix filtering does not yield any performance improvement due to the overhead of comparing large prefixes found in the Bloom filters.
In general, the solution achieved speedups of 20 times compared to the single-thread CPU implementation, and another 10-20% improvement was obtained when using GPU and multi-thread CPUs.
The authors present an adaptation of an efficient similarity join algorithm (PPJoin) to the scenario where the data is encrypted.
By using GPUs for parallel processing, the algorithm achieved speedups of 20 times compared to the serial execution.
Although there is the mention of record linkage, the proposal focus on the matching step and the joining of the records.
The comparison between two Bloom filters can only be done because they were encrypted deterministically, which can lead to dictionary and frequency attacks.
There are no experiments comparing the GPU implementation with multi-thread CPU implementation or with a parallel nested-loop join implementation.
A presentation introducing this work is available at SlideShare.