I would like to share a few things I’ve learned about drawing plots for academic papers and have applied in my own work. Since different fields have different preferences for presenting results (e.g., textual descriptions, plots, tables), not everything here will apply everywhere, but I believe many of these details generalize.
There are many good documents with tips on making good plots, and I won’t cover every point here. Instead, I’ll focus on how to implement the graphs using three different tools: matplotlib, PGFPLOTS and ggplot2 (I assume the reader has basic knowledge of Python, TeX and R).
The data used in the tutorials consists of the measured times for running a similarity join algorithm implemented in three different ways: GPU, CPU Parallel and CPU Serial. To check the scalability of the proposal, the experiments used collections of text documents whose sizes varied from 1,024 to 524,288 documents. The main idea I want to convey with this plot is how each implementation performs (“Elapsed time” on the Y axis) as the data grows (“|R|” on the X axis, where |R| is the cardinality of one of the relations used in the join).
The data is divided into three CSV files: gpu.csv, cpuParallel.csv and cpuSerial.csv. Each file has a time column and a size column, so it is possible to tell how much time each implementation took for a given dataset size. Although this is probably not the ideal way to store the data in the first place, it will serve to illustrate this tutorial.
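As a quick illustration, here is a minimal sketch of how one of these files could be parsed in Python. The sample rows are made up for the example; the real files just follow the size/time layout described above:

```python
import csv
import io

# Hypothetical contents of cpuSerial.csv: one row per run, with the
# dataset size and the measured elapsed time for that size.
sample = """size,time
1024,0.8
2048,1.7
4096,3.5
"""

def read_runs(fileobj):
    """Return a list of (size, time) pairs from a CSV file object."""
    reader = csv.DictReader(fileobj)
    return [(int(row["size"]), float(row["time"])) for row in reader]

# With a real file you would use: open("cpuSerial.csv")
runs = read_runs(io.StringIO(sample))
print(runs[0])  # (1024, 0.8)
```

Each tool-specific post reads the same data in its own way (pandas, pgfplots table, readr, etc.), so this is only meant to show the shape of the input.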
Since the tools handle the data in different ways, I will divide this into three posts (Using matplotlib, Using PGFPLOTS, Using ggplot2), but all of the code will live in the same repository. The final result (i.e., a .pdf file with the plot ready to be inserted in a .tex document) should look similar regardless of which tool was used:
A few comments about the design choices:
- The font size in the plot is the same as in the rest of the paper.
- The size of the points and the thickness of the lines make it easy to identify the different implementations and their trends.
- Since the implementations have similar elapsed times for small datasets, and since the gap between the times for small and large datasets is wide, a logarithmic scale on the Y axis makes these differences easier to see.
- Placing the legend inside the plot saves space compared to placing it on the right or on top. That space can be used to make the whole plot a bit bigger.
- Smaller ticks on the axes can be distracting (especially with a logarithmic scale), so keeping only the major ticks makes the plot look cleaner.
- As with the ticks, grid lines can make some readers lose focus. Keeping only the main horizontal grid lines reinforces the idea that the varying parameter is on the X axis.
- The order of the legend matches the vertical order of the plotted lines (CPU Serial on top, CPU Parallel in the middle and GPU at the bottom), which avoids confusion about which line represents which implementation.
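To make these choices concrete, here is a sketch of how they could be expressed in matplotlib (one of the three tools covered later). The timings below are placeholders standing in for the CSV data, and the specific figure size is an arbitrary choice for the example:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; we only save to a file
import matplotlib.pyplot as plt

# Placeholder data standing in for the three CSV files.
sizes = [2**k for k in range(10, 20)]      # 1024 .. 524288
cpu_ser = [s * 1e-3 for s in sizes]
cpu_par = [s * 1e-4 for s in sizes]
gpu = [s * 1e-5 for s in sizes]

fig, ax = plt.subplots(figsize=(4, 3))

# Legend entries in the same order as the lines appear, top to bottom.
ax.plot(sizes, cpu_ser, "o-", label="CPU Serial")
ax.plot(sizes, cpu_par, "s-", label="CPU Parallel")
ax.plot(sizes, gpu, "^-", label="GPU")

ax.set_yscale("log")                  # log scale separates the curves
ax.set_xlabel("|R|")
ax.set_ylabel("Elapsed time (s)")
leg = ax.legend(loc="upper left")     # legend inside the plot area
ax.yaxis.grid(True, which="major")    # only the main horizontal grid lines
ax.minorticks_off()                   # drop the smaller, distracting ticks
fig.tight_layout()
fig.savefig("plot.pdf")
```

The PGFPLOTS and ggplot2 posts implement the same choices with their own options; this snippet only shows that each bullet above maps to one or two lines of configuration.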
A few improvements that could still be made:
- The font family should also match the text’s font family (Times New Roman in the paper).
- The background of the legend should be transparent to avoid covering parts of the grid lines.
- The X axis labels could show the actual sizes (i.e., 1024, 2048, 4096, …) instead of the exponents, to make them easier to read.
- Maybe the lines aren’t needed at all, and a plot with just the points might look better.
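In matplotlib, these improvements would amount to a few extra settings. The sketch below assumes Times New Roman is installed (matplotlib falls back to another serif font otherwise), and again uses placeholder data:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Match the paper's font family (falls back to a generic serif
# if Times New Roman is not available on the system).
plt.rcParams["font.family"] = "serif"
plt.rcParams["font.serif"] = ["Times New Roman", "Times"]

sizes = [2**k for k in range(10, 20)]          # 1024 .. 524288
times = [s * 1e-4 for s in sizes]              # placeholder timings

fig, ax = plt.subplots()
ax.plot(sizes, times, "o")                     # points only, no line
ax.set_xscale("log", base=2)                   # base-2 X axis
ax.set_xticks(sizes)
ax.set_xticklabels([str(s) for s in sizes], rotation=45)  # 1024, 2048, ...
leg = ax.legend(["CPU Parallel"], framealpha=0.0)  # transparent legend box
fig.tight_layout()
```

`framealpha=0.0` keeps the grid lines visible behind the legend, and `set_xticklabels` replaces the exponent-style labels with the actual sizes.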
These design choices are based on what I’ve read about making plots and on my observations reading papers. Other people might have different preferences, and the comment section welcomes discussions about that.
Using the Tools
Here are the links to the posts for each tool: