Sum of Encrypted Vectors

Recently I received a question about what would be a good way to do the element wise summation between two encrypted vectors.

I have written before about how to do the secure sum of single values using HElib and using Paillier lib, and also about how to do the secure scalar product between vectors using HElib, but I believe summing vectors is also a common thing to do, so I decided to try a few different ways to do it and compared them.

The first three methods are the same as the ones I mentioned in the inner product using HElib post (i.e., no packing, polynomial packing and subfield packing), and I also tried to do it using Paillier lib.

The source code is available at GitHub.

For all the methods, what is being done is basically initializing two vectors of fixed size (VEC\_SIZE). One contains even numbers, and the other contains odd numbers:

// Array u will have even numbers, and array v will have odd numbers
  for (int i = 0; i < VEC_SIZE; i++) {
    u[i] = 2*i;
    v[i] = 2*i + 1;
}

After that, we do three steps which depend on the cryptosystem: encryption, element wise sum and decryption.

Method 1

The first method uses HElib and encrypts the elements of the vectors individually, so we have one ciphertext for each element.

// Vectors to hold the ciphertexts created from the elements of u and v
  std::vector<Ctxt> encU(VEC_SIZE, Ctxt(pk));
  std::vector<Ctxt> encV(VEC_SIZE, Ctxt(pk));

// Each element is encrypted individually
  for (int i = 0; i < VEC_SIZE; i++) {
    Ctxt tempU(pk);
    pk.Encrypt(tempU, to_ZZX(u[i]));
    encU[i] = tempU;

    Ctxt tempV(pk);
    pk.Encrypt(tempV, to_ZZX(v[i]));
    encV[i] = tempV;
}

The sum and the decryption also work on each vector position independently:

// Sum vectors element wise. The sum is stored in U
for (int i = 0; i < VEC_SIZE; i++) {
  encU[i] += encV[i];
}

// Decrypt result
for (int i = 0; i < VEC_SIZE; i++) {
  ZZX element;
  sk.Decrypt(element, encU[i]);
  result[i] = conv<long>(element[0]);
}

Method 2

For Method 2 and Method 3, we use HElib’s very useful ciphertext packing feature to reduce the number of ciphertexts generated and, consequently, reduce the overall execution time.

The second method uses the polynomial packing, which embeds a vector’s elements in the coefficients of a polynomial:

// ZZX is a class for polynomials from the NTL library, on top of which HElib is built
ZZX U, V;                               

// Set the length of the polynomial U(x) and V(x)
U.SetLength(VEC_SIZE);
V.SetLength(VEC_SIZE);

// Set the coefficients of the polynomials U(x) and V(x).
for (int i = 0; i < VEC_SIZE; i++) {
  SetCoeff(U, i, u[i]);
  SetCoeff(V, i, v[i]);
}

The polynomials are then encrypted as just one ciphertext each:

// Ciphertexts that will hold the polynomials encrypted using public key pk
Ctxt encU(pk);
Ctxt encV(pk);                          

// Encrypt the polynomials into the ciphertexts
pk.Encrypt(encU, U);
pk.Encrypt(encV, V);

After that, we do the sum and the decryption of the resulting polynomial:

// Multiply the ciphertexts and store the result into encU
encU += encV;

// Decrypt the multiplied ciphertext into a polynomial using the secret key sk
ZZX resultPoly;
sk.Decrypt(resultPoly, encU);

The final step for this method is to unpack the polynomial into the original result form (array of integers):

// Assign the values of the polynomial's coefficients to the result vector
for (int i = 0; i < VEC_SIZE; i++) {
  result[i] = conv<long>(resultPoly[i]);
}

Method 3

The third method is similar to the second one, but uses HElib’s subfield packing. A way to think about subfield packing is that we still have the vector representation of our data (instead of transforming them to polynomials), but all the elements are packed into one single ciphertext.

Despite the similarities with the second method, one difference is that we need to create an EncryptedArray object that helps us work with this kind of packing. Another difference is that the length of the EncryptedArray depends on the context we are using (which depends on the initialization parameters) and not on the vectors we want to process. This means that if our vectors are shorter than the EncryptedArray, we just fill the rest of the positions with zero until the lengths of our vectors and of the EncryptedArray are the same:

// Creates a helper object based on the context
EncryptedArray ea(context, context.alMod.getFactorsOverZZ()[0]); 

// The vectors should have the same size as the EncryptedArray (ea.size()),
// so fill the other positions with 0 which won't change the result
std::vector<long> U(u, u + VEC_SIZE);
std::vector<long> V(v, v + VEC_SIZE);
for (int i = VEC_SIZE; i < ea.size(); i++) {
  U.push_back(0);
  V.push_back(0);
}

The encryption works in a way similar to the second method:

// Ciphertexts that will hold the encrypted vectors
Ctxt encU(pk);
Ctxt encV(pk);

// Encrypt the whole vector into one ciphertext using packing
ea.encrypt(encU, pk, U);
ea.encrypt(encV, pk, V);

After that, we do the summation, the decryption and the unpacking just like before:

// Sum ciphertexts and set the result to encU
encU += encV;

// Decrypt the result
std::vector<long> res(ea.size(), 0);
ea.decrypt(encU, sk, res);

// Assign the values of the polynomial's coefficients to the result vector
for (int i = 0; i < VEC_SIZE; i++) {
  result[i] = conv<long>(res[i]);
}

Method 4

The final method I tested is to use Paillier lib instead of HElib. Since we are just doing summation, we don’t need the homomorphic multiplication provided by HElib, and the homomorphic addition property from Paillier lib is enough.

I don’t know if there are ways to do ciphertext packing using the Paillier cryptosystem (let me know in the comments if there is), so, like the first one, the fourth method also creates one ciphertext per element:

// Vectors that will hold the ciphertexts
std::vector<paillier_ciphertext_t*> encU(VEC_SIZE);
std::vector<paillier_ciphertext_t*> encV(VEC_SIZE);    

// Encrypt elements
for (int i = 0; i < VEC_SIZE; i++) {
  paillier_plaintext_t* ptxtU = paillier_plaintext_from_ui((int)u[i]);
  paillier_ciphertext_t* ctxtU = paillier_enc(NULL, pubKey, ptxtU, paillier_get_rand_devurandom);
  encU[i] = ctxtU;

  paillier_plaintext_t* ptxtV = paillier_plaintext_from_ui((int)v[i]);
  paillier_ciphertext_t* ctxtV = paillier_enc(NULL, pubKey, ptxtV, paillier_get_rand_devurandom);
  encV[i] = ctxtV;
}

Then, we use the paillier\_mul function on each pair of encrypted elements to give us the sum of the plaintext values:

// Sum encrypted vectors element wise
for (int i = 0; i < VEC_SIZE; i++) {
  paillier_ciphertext_t* encryptedSum = paillier_create_enc_zero();
  paillier_mul(pubKey, encryptedSum, encU[i], encV[i]);
  encU[i] = encryptedSum;
}

Finally, we do the decryption:

// Decrypt the result
for (int i = 0; i < VEC_SIZE; i++) { paillier_plaintext_t* dec; dec = paillier_dec(NULL, pubKey, secKey, encU[i]); result[i] = mpz_get_ui(dec->m);
}

I did not include memory management code in the snippets, but the code at GitHub does the proper cleaning.

Checking Results

For all the methods, we can check the result of the summation to confirm that it is correct:

// Checking results
  for (int i = 0; i < VEC_SIZE; i++) {
  std::cout << u[i] << "+" << v[i] << " = " << result[i] << std::endl;
}

Evaluation

Both cryptosystems used are probabilistic and give us semantic security.

Regarding the performance, I ran a few experiments to compare the methods in terms of execution time (in seconds) when summing vectors of different lengths.

The server used was an Intel(R) Core(TM) i7-2600 3.40Ghz 8GB RAM and the next figure summarizes the results:

Approaches Serial Comparison

Although all the approaches using HElib take much more time than the Paillier lib, the breakdown of the execution time (in seconds) for summing vectors of length 1000 gives an interesting insight on the reason for that:

Steps Method 1 Method 2 Method 3 Method 4
Initialization 9.4 9.4 9.4 0.003
Encryption 80.63 0.08 0.13 1.50
Sum 0.15 0.0002 0.0002 0.0004
Decryption 18.09 0.02 0.37 0.03

So we can see that HElib’s initialization takes a lot of time, and constitutes most of the execution time in Method 2 and Method 3. In fact, if we don’t consider the initialization step, using HElib with ciphertext packing gives a better performance than even the partially homomorphic Paillier cryptosystem.

Also, we see that varying the length of the vectors doesn’t affect much the execution time for the methods that use packing. Since the whole vector is encrypted as a single ciphertext, the number of elements in it didn’t have much effect using the values from 100 to 1000.

But I would imagine that this has a limit, and eventually a vector length would be too much for that size of ciphertext, so we’d need to have a bigger ciphertext that would take more time to process. I’m not sure about this though, but I’ll update this post if I find some more details about it.

Parallelization

The sum of vectors is an operation that offers opportunities for parallel processing, because the processing of vector elements is independent from each other.

I tried to do a basic form of parallelization using OpenMP, by just adding #pragma omp parallel for a few times here and there (check the code at GitHub to see where). I know there are other (and probably better) ways to do this, but I just wanted to make it a simple test that doesn’t take our focus from the security to the performance.

The following figure shows the result of the comparison between the serial and the parallel versions using 8 threads.

comparisonSerialParallel

From the figure we can see that Method 1 benefits the most from parallelization, and the others don’t show much improvement. I believe the main reason for this is the length of vectors being too short, so the performance gains don’t stand out much because the initialization phase is still the bottleneck.

Conclusion

Although ciphertext packing using HElib gives us a good scalability, the initialization phase takes a long time and, for the short vectors used in the tests, the benefits of packing are hidden by the initialization.

The comment section welcomes suggestions about other methods to do the summation of encrypted vectors as a whole or about other topics in this post.

Using Paillier Library

The Paillier cryptosystem is a probabilistic asymmetric encryption scheme commonly cited in the literature related to encrypted databases. The main reason for that is the additive homomorphic property offered by it. Given two plaintext messages m_1 and m_2, which can be encrypted (E()) and decrypted (D()) using the Paillier cryptosystem, it is possible to multiply the ciphertexts in order to obtain the sum of the plaintexts: D(E(m_1) * E(m_2)) = m_1 + m_2.

This property is useful to database systems because it allows the summation (e.g., SQL’s SUM()) of values in a column to be executed in a secure way. In fact, CryptDB implements this cryptosystem to allow this kind of operation.

Among the various implementations of the Paillier cryptosystem, I tried a C++ library called Paillier Library. It offers the features of key generation, encryption, decryption, homomorphic additon and import/export of keys and ciphertexts.

Although the library has only a few functions, I could not find a more step-by-step set of examples that might help get used to the library. For this reason, in this post I’ll give an overview on how to use the above features through samples. The complete samples are available at GitHub.

This post assumes the Paillier Library and its dependencies (e.g., GMP) are available on the system. This can be done either by installing the package in the official page or by just downloading the files paillier.h and paillier.c.

Generating, Exporting and Importing Keys

Since all the operations need either the public or the secret key, let’s start by creating those. After that, we export them to files to emulate someone sharing the public key with others.

First, we define the security parameter (n) that represents the number of bits of the modulus used in the cryptosytem:

// Security parameter (number of bits of the modulus)
const long n = 1024;

This library is built on top of GMP and uses some of its data structures (e.g., mpz\_t), but wraps them using some structs. This is the case for the public and secret keys, which are presented as the structs paillier\_pubkey\_t and paillier\_prvkey\_t. The paillier\_keygen function then generates the keys by taking the number of bits of the modulo and random input from the file /dev/urandom:

// Generate public and secret keys
paillier_pubkey_t* pubKey;
paillier_prvkey_t* secKey;
paillier_keygen(n, &pubKey, &secKey, paillier_get_rand_devurandom);

And this is all that is needed for generating the keys. After that, we can transform them to the hexadecimal notation (using the library’s functions paillier\_pubkey\_to\_hex() and paillier\_prvkey\_to\_hex()) and export them to files:

// Export keys to file
std::fstream secKeyFile("seckey.txt", std::fstream::out|std::fstream::trunc);
std::fstream pubKeyFile("pubkey.txt", std::fstream::out|std::fstream::trunc);

char* hexSecKey = paillier_prvkey_to_hex(secKey);
char* hexPubKey = paillier_pubkey_to_hex(pubKey);

secKeyFile << hexSecKey;
pubKeyFile << hexPubKey;

The file for the public key can then be given to others to allow encryption of messages which can be decrypted using the private key.

The steps to recover keys from a file are similar to the ones used to export them. First we read the files and then convert the keys from hexadecimal notation (using the functions paillier\_pubkey\_from\_hex() and paillier\_prvkey\_from\_hex()):

// Read public key from disk and initialize it
std::fstream pubKeyFile("pubkey.txt", std::fstream::in);
std::fstream secKeyFile("seckey.txt", std::fstream::in);    
    
std::string hexPubKey;
std::string hexSecKey;    
std::getline(pubKeyFile, hexPubKey);
std::getline(secKeyFile, hexSecKey);    

paillier_pubkey_t* pubKey = paillier_pubkey_from_hex(&hexPubKey[0]);
paillier_prvkey_t* secKey = paillier_prvkey_from_hex(&hexSecKey[0], pubKey);

Encryption and Decryption

After generating the keys, we can use them to encrypt plaintext messages and decrypt ciphertexts. Paillier Library offers the paillier\_plaintext\_t struct which will hold plaintexts, and the paillier\_ciphertext\_t which will hold ciphertexts.

Suppose the first plaintext we want to encrypt is the integer 2. The first step to do that is the initialization of the plaintext as follows:

// Plaintext initialization
paillier_plaintext_t* m;
m = paillier_plaintext_from_ui(2);

The paillier\_plaintext\_from\_ui() function is used for integers, but similar functions are available for array of bytes or strings.

Next, the encryption itself is done with the paillier\_enc function, which also receives a pointer to where the result ciphertext should be stored (if it is NULL, then the functions allocates the ciphertext and return it), the public key, the number of bits of the security parameter and the randomness from /dev/urandom:

// Encrypt the message
paillier_ciphertext_t* ctxt;
ctxt = paillier_enc(NULL, pubKey, m, paillier_get_rand_devurandom);

The decryption follows a similar pattern:

// Decrypt the ciphertext
paillier_plaintext_t* dec;
dec = paillier_dec(NULL, pubKey, secKey, ctxt);

Exporting and Importing Ciphertexts

After creating a ciphertext, we might want to send it to someone else, like to the person who holds the secret key. Paillier Library supports exporting ciphertexts by provides the function paillier\_ciphertext\_to\_bytes() to convert the ciphertext into bytes. The first parameter of this function is the length of the ciphertext to be exported. In the Paillier cryptosystem, the plaintext space is Z_n and the ciphertext space is Z_{n^2}. This means that the length of the ciphertext is twice the length of the plaintext (hence the multiplication by 2 in the sample code). Since the length should be passed in bytes, we use the macro PAILLIER\_BITS\_TO\_BYTES() provided by Paillier Library to make this conversion:

std::fstream ctxt1File("ciphertext1.txt", std::fstream::out|std::fstream::trunc|std::fstream::binary);

// The length of the ciphertext is twice the length of the key
char* byteCtxt1 = (char*)paillier_ciphertext_to_bytes(PAILLIER_BITS_TO_BYTES(pubKey->bits)*2, ctxt1);

ctxt1File.write(byteCtxt1, PAILLIER_BITS_TO_BYTES(pubKey->bits)*2);

To import the ciphertext from a file, we do some similar steps:

std::fstream ctxt1File("ciphertext1.txt", std::fstream::in|std::fstream::binary);

// The length of the ciphertext is twice the length of the key
char* byteCtxt1 = (char*)malloc(PAILLIER_BITS_TO_BYTES(pubKey->bits)*2);

ctxt1File.read(byteCtxt1, PAILLIER_BITS_TO_BYTES(pubKey->bits)*2);

paillier_ciphertext_t* ctxt1 = paillier_ciphertext_from_bytes((void*)byteCtxt1, PAILLIER_BITS_TO_BYTES(pubKey->bits)*2);

Homomorphic Addition

One of the best features of the Paillier cryptosystem is its homomorphic additive property. The Paillier Library provides the paillier\_mul() function to multiply two ciphertexts, which, after decryption, will give us the sum of the corresponding plaintext messages (D(E(m_1) * E(m_2)) = m_1 + m_2). The parameters for the paillier\_mul() are the public key, the ciphertext which will hold the result, and the two ciphertexts to be multiplied. To initialize the result ciphertext, we just create a encryption of zero (0) using the auxiliary function paillier\_create\_enc\_zero():

// Initialize the ciphertext that will hold the sum with an encryption of zero
paillier_ciphertext_t* encryptedSum = paillier_create_enc_zero();

// Sum the encrypted values by multiplying the ciphertexts
paillier_mul(pubKey, encryptedSum, ctxt1, ctxt2);

After this, the sum will be stored in the encryptedSum ciphertext, which can be decrypted as explained before.

Printing Values

We might want to check the values of plaintext messsages and ciphertexts in any point of our programs, but the way to print it is not so straightforward. This is because they are represented with GMP’s mpz\_t struct, so we should also use GMP’s way to print them with the gmp\_printf() function:

// Decrypt and print the ciphertext (encryptedSum)
paillier_plaintext_t* dec;
dec = paillier_dec(NULL, pubKey, secKey, encryptedSum);
gmp_printf("Decrypted sum: %Zd\n", dec);

Final Notes

This tutorial covered basic operations that can be done using Paillier Library. As always, the source code for all the samples is available on GitHub, and questions and other comments are welcome.

I/O of Keys and Ciphertexts Using HElib

When using a public key cryptosystem, it is common to have the public key available for other parties to encrypt messages. Usually the tutorials using HElib focus on a particular operation and do not consider that the encryption and decryption may be performed by different parties. This post will focus on three steps related to the I/O of keys and ciphertexts using HElib:

  1. Generate and output the public and secret keys to a file;
  2. Read the public key from a file and use it to encrypt a message, writing the ciphertext to a file;
  3. Read a ciphertext from a file and use the secret key to decrypt it.

Although HElib already comes with an example of I/O, I found it a bit too complex for a first view on how to do I/O. For this reason, I tried to divide the above steps in different programs and run them separately as if they were done by different parties. The complete code is available at GitHub.

For the rest of the post, let’s assume the existence of two parties: Alice and Bob. Bob wants to send a message to Alice and decides to use HElib to encrypt the message.

Step 1: Keys Generation

The first step (keyGenerator.cpp) consists in Alice generating the secret and public keys that will be used in the protocol. After doing the usual initialization when working with HElib (i.e., setting parameters and generating the context from those parameters), the first thing that should be exported to file is the context. The context will be used to recreate the keys when reading them from the files.

// Files that will contain the public and secret keys
fstream secKeyFile("seckey.txt", fstream::out|fstream::trunc);
fstream pubKeyFile("pubkey.txt", fstream::out|fstream::trunc);
assert(secKeyFile.is_open());
assert(pubKeyFile.is_open());	

// Write the context to the files that will contain the keys
// The context information is used to recreate the keys later
writeContextBase(secKeyFile, context);
writeContextBase(pubKeyFile, context);
secKeyFile << context << std::endl;
pubKeyFile << context << std::endl;

After that, Alice can generate the keys and export them as well to the created files (which already contain the context):

// Generates the secret key
FHESecKey secretKey(context);
const FHEPubKey&amp; publicKey = secretKey;
secretKey.GenSecKey(w);
addSome1DMatrices(secretKey);

// Writes both the secret and the public keys to files
secKeyFile << secretKey << std::endl;
pubKeyFile << publicKey << std::endl;

This step will generate two new files: seckey.txt and pubkey.txt. Alice can then send the pubkey.txt file to Bob so he encrypts the message to be transmitted. This encryption is done in the second step.

Step 2: Encryption

The second step (encrypt.cpp) considers that Bob received Alice’s public key, and now he wants to encrypt a message (message.txt) using this key. Before reading the public key itself, Bob needs to read the context information that will be used to construct the context of this public key:

// Parameters needed to reconstruct the context
unsigned long m, p, r;
vector<long> gens, ords;

fstream pubKeyFile("pubkey.txt", fstream::in);
assert(pubKeyFile.is_open());

// Initializes a context object with some parameters from the file
readContextBase(pubKeyFile, m, p, r, gens, ords);
FHEcontext context(m, p, r, gens, ords);

// Reads the context itself
pubKeyFile >> context;

Next, Bob can initialize a public key object using the context and read the public key from the file as well:

FHEPubKey publicKey(context);
pubKeyFile >> publicKey;

After that, Bob encode the plaintext message in a vector of integers and encrypt it. The encoding of the message can vary (e.g., hash n-grams into a Bloom filter) and I don’t cover that in this post. The size of the vector depends on the size of the EncryptedArray, which is defined by the initialization parameters:

// Creates the EncryptedArray object based on the context previously read from the file
EncryptedArray ea(context);
uint nslots = ea.size();

std::fstream messageFile("message.txt", fstream::in);
assert(messageFile.is_open());

// Tokenizes the message (a collection of integers)
// Each integer will be in a different position in the array to be encrypted
uint count = 0;
std::vector<long int> plaintext(nslots, 0);
for (;;) {
  std::string token;
  if(!(messageFile>>token)) break;
  plaintext[count] = std::stoi(token);
  count++;
}
messageFile.close();

// Encrypts the plaintext vector
Ctxt ctxt(publicKey);
ea.encrypt(ctxt, publicKey, plaintext);

The last part of this step consists in outputting the ciphertext to a file (ciphertext.txt):

std::fstream ciphertext("ciphertext.txt", fstream::out|fstream::trunc);
assert(ciphertextFile.is_open());
ciphertext << ctxt;
ciphertext.close();

So now that Bob that created the ciphertext, he can send it to Alice. In the last step, she will then use her secret key to decrypt this ciphertext and obtain the original message that Bob wanted to transmit.

Step 3: Decryption

Alice starts the third step (decrypt.cpp) by reading the file containing the secret key that she generated previously in Step 1. This is very similar to what Bob did when reading the public key:

// Parameters used to reconstruct the context
unsigned long m, p, r;
vector<long> gens, ords;

// Read the context to reconstruct the secret key
fstream secKeyFile("seckey.txt", fstream::in);
readContextBase(secKeyFile, m, p, r, gens, ords);
FHEcontext context(m, p, r, gens, ords);
secKeyFile >> context;

// Initializes the secret key object using the context and reads the key from the file
FHESecKey secretKey(context);
const FHEPubKey&amp; publicKey = secretKey;
secKeyFile >> secretKey;

Note that the public key was not read, but it was generated from the context. It will be used to initialize a ciphertext object later.

Alice then reads the ciphertext to be decrypted:

// Initializes a ciphertext object using the public key
fstream ciphertextFile("ciphertext.txt", fstream::in);
Ctxt ctxt(publicKey);
ciphertextFile >> ctxt;

Finally, Alice decrypts the ciphertext using a EncryptedArray that was constructed with the context she also read from the secret key file. She can check the decrypted vector to see the message that Bob wanted to send her:

EncryptedArray ea(context);
long nslots = ea.size();

std::vector<long int> decrypted(nslots, 0);
ea.decrypt(ctxt, secretKey, decrypted);

Final Notes

This example can be improved in many ways (e.g., the coding and the protocol itself), but I believe it shows the basics of I/O of keys and ciphertexts using HElib.

Regarding the files created (seckey.txt, pubkey.txt and ciphertext.txt), they can be very large, and their sizes depend on the initialization parameters.

As always, the comments section welcomes questions and suggestions, and the code is available at GitHub.

Simple Line Plots: Using ggplot2

This post belongs to a series of tutorials about how to draw simple line plots for academic papers. Information about the data used is available in the first post of the series, and the source code is on GitHub. Here I will focus on using ggplot2, which is a R package. The other two implementations are done using matplotlib and using PGFPLOTS.

Compared to the two other tools in this series (matplotlib and PGFPLOTS), importing the data using ggplot2 was a bit tricky for me due to the use of data frames, which is a data structure that is passed to the ggplot() function to draw the plot. It helped me to imagine that “data frames are like matrices, but with named columns of different types (similar to database tables)“. Knowing that, the next thing was to create a single data frame with the data from the three .csv files.

For that, we use the merge() function:

#Read experiment data
experimentsResults = data.frame()
xvalues = c(10:19)
implementation = c("gpu", "cpuParallel", "cpuSerial")
for (i in 1:length(implementation)) {
    fileName = paste(c(implementation[i], ".csv"), collapse="")
    csvFile = read.csv(file=fileName, head=TRUE, sep = ",")
    csvFile["implementation"] = i
    if (is.data.frame(experimentsResults) && nrow(experimentsResults)==0) {
        experimentsResults = csvFile
    }
    experimentsResults = merge(experimentsResults, csvFile, all=TRUE)
}

Note that there’s a check to see whether the data frame (named experimentResults) is empty. After the first file data is inserted in the data frame, we just “append” the other files to it. I found helpful to think of merge() as a database join. So, if we think about it as a database table, there are three columns now (one more than in the files): time, size and implementation.

Having the data frame ready, we can pass it to ggplot2 so it draws the plot:

# Draw the plot
p = ggplot(experimentsResults, aes(x = size, y = time, group=implementation))
p = p + geom_line(size=1.5)
p = p + geom_point(aes(shape=factor(implementation, labels=c("GPU", "CPU Parallel", "CPU Serial"))), size = 7, fill="black")

It seems it is a common practice to assign the plot to a variable (p in this case) and then “increment” it with changes (hence the p = p + ... notation).

The ggplot() function takes the data frame as the first parameter, and we can specify what columns will be the X and Y coordinates. The group argument does something like a GROUPBY in SQL, so we can divide the data in three groups corresponding to the implementations. This may sound strange since we first joined the files just to separate them now, but we want three separated lines representing the three different implementations.

The data is actually drawn using geoms (geometric objects) and in this case we create a line geom (geom\_line()) and a point geom (geom\_point()) for each implementation. The line will be the same for all of them, but the geom\_point() function will receive the implementations as parameter to the shapes of the points (i.e., three different point shapes for three different implementations). Other than arguments for the size of the points and their colors, there is label, which will be used in the legend.

The result of this first step is the following. The figure looks weird because the title of the legend (“factor(implementation, labels = c(“GPU”, “CPU Parallel”, “CPU Serial”)) is too long and squeezes the plot itself.

ggplot2step1

To fix the problem with the legend, we can just remove its title. We also do a few other changes regarding the legend (e.g., changing the size of the shapes, the position of the legend, the background color and the background color). The final option in the following code makes the order of the legend match the order of the plotted lines (i.e., “CPU Serial” on top, “CPU Parallel” in the middle and “GPU” on the bottom):

# Format the legend
p = p + theme(legend.title=element_blank())
p = p + theme(legend.key=element_blank())
p = p + theme(legend.key.width=unit(40, "pt"))
p = p + theme(legend.key.height=unit(30, "pt"))
p = p + theme(legend.position=c(0.25,0.85))
p = p + theme(legend.background=element_rect(fill="transparent"))
p = p + theme(legend.text = element_text(size=30))
p = p + guides(shape = guide_legend(reverse=TRUE))

And the result of this second formatting step is:

ggplot2step2

Next, we make some changes regarding the whole figure (e.g., color theme of the plot, font and grids):

# Format font, background and grids
p = p + theme_bw()
p = p + theme(text=element_text(family="Times New Roman"))
p = p + theme(panel.grid.minor.x = element_blank())
p = p + theme(panel.grid.major.x = element_blank())

This gives us:

ggplot2step3

It looks almost ready now, but there are still a few changes to be done, especially regarding the axis (e.g., titles, colors, sizes, limits and scale):

# Format axis
p = p + xlab("|R|")
p = p + ylab("Elapsed time (s)")
p = p + theme(axis.line = element_line(color="black"))
p = p + theme(axis.text = element_text(size=30))
p = p + theme(axis.title = element_text(size=30))
p = p + theme(axis.title.y = element_text(vjust = 1.2))
p = p + theme(axis.title.x = element_text(vjust = -0.4))
p = p + theme(aspect.ratio=1)
p = p + coord_cartesian(ylim = c(0.1, 40000))
p = p + scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), labels = trans_format("log10", math_format(10^.x)))
p = p + scale_x_continuous(breaks = xvalues, labels = math_format(2^.x)(xvalues))

And that gets us to the final result we wanted:

ggplot2final

To save the figure, we use the ggsave() function and to show it on the screen we use print():

# Save figure
ggsave(plot = p, filename="ggplot2Final.png")

# Show on screen
print(p)

Now, for the sake of the explanation I used a particular formatting order. But if one uses the same order (i.e., plot->legend->general->axis), the result will not be the same. The reason for that is that the legend formatting will be overwritten by the general formatting. So the order that actually works is plot->general->axis->legend, as can be seen in the complete source code.

The full source file is available at GitHub. The comment section is open for discussion and suggestions about the design choices for the plot or about the way they were implemented in this tutorial.

Simple Line Plots: Using PGFPLOTS

This post belongs to a series of tutorials about how to draw simple line plots for academic papers. Information about the data used is available in the first post of the series, and the source code is on GitHub. Here I will focus on using PGFPLOTS, which is a package for TeX. The other two implementations are done using matplotlib and using ggplot2.

The preamble used in this implementation imports the PGFPLOTS package and has a couple of options to enable the plot to be exported.

\documentclass{article}
\usepackage{pgfplots}
\usepgfplotslibrary{external}
\tikzexternalize

The data to be plotted is divided in three files, but this times there is no need to combine them before the actual drawing. We can just pass the file name and the points in the file will be plotted by PGFPLOTS:

\begin{figure}
  \begin{tikzpicture}
    \begin{semilogyaxis}[
      \addplot [mark=*] table [x=size, y=time, col sep = comma]{gpu.csv};
      \addplot [mark=triangle*] table [x=size, y=time, col sep = comma]{cpuParallel.csv};
      \addplot [mark=square*] table [x=size, y=time, col sep = comma]{cpuSerial.csv};
    \end{semilogyaxis}
  \end{tikzpicture}
\end{figure}

It is needed, though, to place the plots in a axis tag (in this case, semilogyaxis), which is inside a tikzpicture tag, which is inside a figure tag. And everything belongs to inside a document tag. The result of this first step is as follows:

pgfstep1

Next, we make the fonts and the markers bigger, as well as some options for the legend that will be added later:

\tikzset{
    every plot/.append style={mark size=4pt, ultra thick, color=black},
    every tick label/.append style={font=\Large},
    every axis label/.append style={font=\Large},
    every axis legend/.append style={legend pos=north west, draw=none, fill=none, very thick, font=\Large, nodes=right},
}

These changes are inserted in the tikzset tag, which is placed before the actual plot. Although we need to go back in the code a bit, that’s the easiest way I found to explain the steps. The result is:

pgfsetp2

Finally, we make some modifications like the size of the figure, the limits of the axes, removing the minor axis ticks, inserting a horizontal grid, specifying the axes’ labels and the legend entries. To do that, we actually have to modify the previous code a bit and pass these options to the axis tag (semilogyaxis in this case):

\begin{semilogyaxis}[
  width=250pt,
  height=250pt,
  ymin=0.05,
  ymax=50000,
  xtick=data,
  yminorticks=false,
  ymajorgrids,
  major grid style={dotted},
  tick pos = left,
  xlabel = {$|R|$},
  ylabel = {Elapsed time (s)},
  xticklabels={$2^{10}$, $2^{11}$, $2^{12}$, $2^{13}$, $2^{14}$, $2^{15}$, $2^{16}$, $2^{17}$, $2^{18}$, $2^{19}$},
  legend entries={GPU, CPU (Parallel), (CPU Serial)},
  reverse legend
]
% ...
\end{semilogyaxis}

This gives us the final version of the plot:

pgffinalAs mentioned, the full source file is available at GitHub. The comment section is open for discussion and suggestions about the design choices for the plot or about the way they were implemented in this tutorial.

Simple Line Plots: Using matplotlib

This post belongs to a series of tutorials about how to draw simple line plots for academic papers. Information about the data used is available in the first post of the series, and the source code is on GitHub. Here I will focus on using matplotlib, which is a Python library. The other two implementations are done using PGFPLOTS and using ggplot2.

Since the data is divided in three files, the first thing to do is read those files and create some lists to hold the data:


import csv
import matplotlib.pyplot as plt

implementations = ["gpu", "cpuParallel", "cpuSerial"]
input_file_suffix = ".csv"
x_values = range(10, 20)
y_values = []
for i in implementations:
  implementation_y_values = []
  csv_file = open(i + input_file_suffix)
  reader = csv.reader(csv_file)
  # Skip the first line with the columns' names
  next(reader)
  for row in reader:
    implementation_y_values.append(row[0])
  y_values.append(implementation_y_values)

Now we have two structures with the coordinates for the plot: x\_values (a list with dataset sizes) and y\_values (a list containing one list of values for each implementation).

Next, a couple of configurations regarding the size and the background of the canvas where the plot will be drawn:

# Format figure
plt.rcParams["figure.figsize"] = [10, 10]
plt.rcParams["figure.facecolor"] = 'white'

After that, the actual drawing is as follows:

# Construct the plot with lines for different implementations
gpuLine, = plt.plot(x_values, y_values[0],'-ko', linewidth=4, markersize=20)
cpuParallelLine, = plt.plot(x_values, y_values[1],'-k^', linewidth=4, markersize=20)
cpuSerialLine, = plt.plot(x_values, y_values[2], '-ks', linewidth=4, markersize=20)

To see how it looks, we can use the plt.show() command. This give us the following partial result:

pythonstep1

Note that each implementation calls the plot() function, but with a few different parameters: the values of the Y coordinates and the style of the markers for that implementation (-ko for a solid line with circles, -k\wedge for a solid line with triangles, -ks for a solid line with squares). For all the plots, I choose a thicker line and a larger marker than the default since I believe it makes it easier to read the plot.

The next step is to format the axes to change the labels of the axes, the labels of the ticks, the fonts used for both sets of labels, leave only the ticks on the left and the bottom, the scale of the Y axis to the logarithmic scale, set lower/upper bounds for the axes and add horizontal grid lines.:

# Format axes
plt.ylabel("Elapsed time (s)", fontname='Times New Roman', fontsize=35)
plt.xlabel("|R|", fontname='Times New Roman', fontsize=35)
x_labels = ["$\mathregular{2^{" + str(label) + "}}$" for label in x_values]
plt.xticks(x_values, x_labels, fontname='Times New Roman')
plt.yticks(fontname='Times New Roman')
plt.tick_params(axis='x', top='off', which='both', labelsize=35, pad=15)
plt.tick_params(axis='y', right='off', which='both', labelsize=35)
plt.tick_params(axis='y', left='off', which='minor')
plt.yscale('log', nonposy='clip')
plt.axis([9, 20, 0.05, 50000])
plt.axes().yaxis.grid()
plt.tight_layout()

It looks almost done, but the legend is still missing:

pythonStep2.png

We can fix that with the following configurations:

# Format legend
plt.legend([cpuSerialLine, cpuParallelLine, gpuLine], ["CPU (Serial)", "CPU (Parallel)", "GPU"], frameon=False, fontsize=30, prop={'family': 'Times New Roman', 'size': 32}, numpoints=1, loc=2)

This will give us the look we wanted and now we can save it to a .pdf file:

# Export figure
plt.savefig(‘linePlot.pdf’, format=’pdf’)

pythonFinal.png

The full source file is available at GitHub. The comment section is open for discussion and suggestions about the design choices for the plot or about the way they were implemented in this tutorial.

Simple Line Plots

I would like to share a few things I’ve learned about drawing plots for academic papers and applied in my own papers. As different fields usually have different preferences in terms of presentation of results (e.g., textual description, plots, tables), this might not be applicable in all areas, but I believe many of these details can be generalized.

There are many good documents with tips on how to make good plots, and I won’t cover all points here. Instead, I’ll focus on how to implement the graphs using three different tools: matplotlib, PGFPLOTS and ggplot2 (I assume the reader has basic knowledge about Python, TeX and R).

Input Data

The data used in the tutorials are the measured times for running a similarity join algorithm implemented in three different ways: GPU, CPU Parallel and CPU Serial. To check the scalability of the proposal, the data used in the experiments was composed by collections of text documents with size varying from 1,024 to 524,288 documents. Based on that, the main idea I wanted to show with this plot is how each of the implementations perform (“Elapsed time” in Y axis) for growing size of data (“|R|” in X axis, where |R| represents the cardinality of one of the relations used in the join).

The data is divided in three CSV files: gpu.csv, cpuParallel.csv, cpuSerial.csv. Each file has a time column and a size column, and from that it is possible to know how much time each implementation took for a given dataset size. Although this is probably not the best way to have the data in the first place, it will serve for illustrating this tutorial.

Goal

Since the used tools have different ways to work with the data itself, I will divide the post in three posts (Using matplotlib, Using PGFPLOTS, Using ggplot2), but all of the code will be in the same repository. Also, the final result (i.e., a .pdf file with the plot ready to be inserted in a .tex document) should look similar, no matter what tool was used:

samplePlot.png

A few comments about the design choices:

  • The font size in the plot is the same as in the rest of the paper.
  • The size of the points and the thickness of the lines make it easy to identify the different implementations and their trends.
  • Since the implementations have similar elapsed times for small datasets and due to the difference between the time taken for small and for large datasets, using logarithmic scale on the Y axis helps identify those differences more clearly.
  • The legend inside the plot saves a bit of space compared to if it were on the right or on the top. This space can be used to make the whole plot a bit bigger.
  • Smaller ticks on the axes can be distracting (especially for the logarithmic scale), so keeping only the bigger ticks makes the plot look cleaner.
  • Similarly to the ticks, grid lines can make some readers lose the focus. Keeping the main horizontal grid lines only strengthens the idea that the varying parameter is on the X axis.
  • The order of the legend matches the order of the plotted points (CPU Serial on top, CPU Parallel in the middle and GPU on the bottom) and avoid confusions about what line represents what implementation.

A few improvements that can be done:

  • The font family should also be the same as the text’s font family (Times New Roman in the paper).
  • The background of the legend should be transparent to avoid covering parts of the grid lines.
  • The scale of the X axis could be changed to the result of the exponentiation (i.e., 1024, 2048, 4096, …) to facilitate the reading.
  • Maybe the lines aren’t needed at all and just a simple plot with the points might look better.

These design choices are based on what I’ve read about making plots and on my observations reading papers. Other people might have different preferences, and the comment section welcomes discussions about that.

Using the Tools

Here are the links for the posts for each tool: