Simple Line Plots: Using ggplot2

This post belongs to a series of tutorials about how to draw simple line plots for academic papers. Information about the data used is available in the first post of the series, and the source code is on GitHub. Here I will focus on ggplot2, which is an R package. The other two implementations use matplotlib and PGFPLOTS.

Compared to the two other tools in this series (matplotlib and PGFPLOTS), importing the data for ggplot2 was a bit tricky for me because of data frames, the data structure that is passed to the ggplot() function to draw the plot. It helped me to imagine that “data frames are like matrices, but with named columns of different types (similar to database tables)”. Knowing that, the next step was to create a single data frame with the data from the three .csv files.
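To make the analogy concrete, here is a minimal sketch of a data frame built by hand. The column names mirror the ones used later in this post, but the values are invented purely for illustration and are not the experiment data.

```r
# Toy data frame: values are made up, only the column names follow the post
toy = data.frame(
    size = c(10L, 11L, 12L),                  # integer column
    time = c(0.5, 1.1, 2.3),                  # numeric column
    implementation = c("gpu", "gpu", "gpu"),  # character column
    stringsAsFactors = FALSE
)

str(toy)                         # one type per column, like a database table
toy$time                         # columns are accessed by name
largeRows = toy[toy$size > 10, ] # rows can be filtered, much like a WHERE clause
```

Each column keeps its own type, which is the main difference from a matrix.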

For that, we use the merge() function:

# Read experiment data
experimentsResults = data.frame()
xvalues = c(10:19)
implementation = c("gpu", "cpuParallel", "cpuSerial")
for (i in seq_along(implementation)) {
    fileName = paste0(implementation[i], ".csv")
    csvFile = read.csv(file = fileName, header = TRUE, sep = ",")
    # Tag each row with the index of its implementation
    # (1 = gpu, 2 = cpuParallel, 3 = cpuSerial)
    csvFile["implementation"] = i
    if (nrow(experimentsResults) == 0) {
        # First file: nothing to merge with yet
        experimentsResults = csvFile
    } else {
        experimentsResults = merge(experimentsResults, csvFile, all = TRUE)
    }
}

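Note that there’s a check to see whether the data frame (named experimentsResults) is empty. The data from the first file is assigned to the data frame directly, and each of the other files is then “appended” to it. I found it helpful to think of merge() as a database join: with all=TRUE it behaves like a full outer join, and because rows from different files never match in every column (they differ at least in implementation), all of them are kept. So, if we think about it as a database table, there are three columns now (one more than in the files): time, size and implementation.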
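The “append” behavior can be checked on a small scale. The sketch below uses two in-memory data frames with invented values in place of the .csv files (an assumption made so that it runs on its own) and compares merge() with plain row stacking via rbind().

```r
# Invented stand-ins for two of the .csv files (not real measurements)
gpu = data.frame(size = c(10, 11), time = c(0.2, 0.4), implementation = 1)
cpuParallel = data.frame(size = c(10, 11), time = c(1.5, 3.1), implementation = 2)

# Full outer join on all shared columns: no rows match, so every row is kept
merged = merge(gpu, cpuParallel, all = TRUE)

# Row-wise stacking gives the same rows without the join step
stacked = rbind(gpu, cpuParallel)

nrow(merged)   # 4
nrow(stacked)  # 4
```

One difference worth keeping in mind: merge() sorts the result by the join columns and would match rows that are identical in both inputs instead of duplicating them, while rbind() keeps the input order and every row.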

Having the data frame ready, we can pass it to ggplot2 so it draws the plot:

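# Draw the plot
library(ggplot2)
p = ggplot(experimentsResults, aes(x = size, y = time, group=implementation))
# Note: from ggplot2 3.4.0 on, line thickness is set with linewidth instead of size
p = p + geom_line(size=1.5)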
p = p + geom_point(aes(shape=factor(implementation, labels=c("GPU", "CPU Parallel", "CPU Serial"))), size = 7, fill="black")

It seems to be common practice to assign the plot to a variable (p in this case) and then “increment” it with changes (hence the p = p + ... notation).

The ggplot() function takes the data frame as the first parameter, and we can specify which columns will be the X and Y coordinates. The group argument does something like a GROUP BY in SQL, so we can divide the data into three groups corresponding to the implementations. This may sound strange since we first joined the files just to separate them now, but we want three separate lines representing the three different implementations.
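The effect of the group argument can be inspected without looking at the picture. The following sketch uses invented data (two sizes for each of three implementations) and the layer_data() helper from ggplot2, which returns the data as the geom sees it, to count how many separate lines would be drawn.

```r
library(ggplot2)

# Invented data: two sizes for each of three implementations
toy = data.frame(
    size = rep(c(10, 11), times = 3),
    time = c(0.2, 0.4, 1.5, 3.1, 9.0, 18.5),
    implementation = rep(1:3, each = 2)
)

# Without group: a single line running through all six points
ungrouped = ggplot(toy, aes(x = size, y = time)) + geom_line()

# With group: one line per implementation
grouped = ggplot(toy, aes(x = size, y = time, group = implementation)) + geom_line()

# Count the distinct groups that each geom_line() layer receives
nGroupsUngrouped = length(unique(layer_data(ungrouped)$group))
nGroupsGrouped = length(unique(layer_data(grouped)$group))
```

Here nGroupsUngrouped is 1 and nGroupsGrouped is 3, which matches the three lines we want in the final figure.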

The data is actually drawn using geoms (geometric objects), and in this case we create a line geom (geom_line()) and a point geom (geom_point()). The line looks the same for all implementations, but geom_point() maps the implementation to the shape of the points (i.e., three different point shapes for three different implementations). Besides the arguments for the size of the points and their color, there is the labels argument of factor(), which provides the names shown in the legend.
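How factor() turns the numeric implementation column into legend names can be seen in isolation. This is a minimal sketch with a hand-written vector standing in for the column; the labels are the ones used in the plot code.

```r
# Stand-in for the implementation column (1 = gpu, 2 = cpuParallel, 3 = cpuSerial)
implementation = c(1, 2, 3, 1)

# Labels are assigned to the sorted unique values, in order
shapes = factor(implementation, labels = c("GPU", "CPU Parallel", "CPU Serial"))

levels(shapes)        # "GPU" "CPU Parallel" "CPU Serial"
as.character(shapes)  # "GPU" "CPU Parallel" "CPU Serial" "GPU"
```

The mapping depends on the order of the names in the implementation vector of the reading loop, so the labels have to be listed in that same order.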

The result of this first step is shown below. The figure looks weird because the title of the legend (“factor(implementation, labels = c("GPU", "CPU Parallel", "CPU Serial"))”) is too long and squeezes the plot itself.

[Figure: ggplot2step1]

To fix the problem with the legend, we can just remove its title. We also make a few other changes to the legend (e.g., changing the size of the keys, the position of the legend, the background color and the text size). The final option in the following code makes the order of the legend match the order of the plotted lines (i.e., “CPU Serial” on top, “CPU Parallel” in the middle and “GPU” on the bottom):

# Format the legend
p = p + theme(legend.title=element_blank())
p = p + theme(legend.key=element_blank())
p = p + theme(legend.key.width=unit(40, "pt"))
p = p + theme(legend.key.height=unit(30, "pt"))
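# Note: from ggplot2 3.5.0 on, use legend.position="inside" together with legend.position.inside=c(0.25,0.85)
p = p + theme(legend.position=c(0.25,0.85))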
p = p + theme(legend.background=element_rect(fill="transparent"))
p = p + theme(legend.text = element_text(size=30))
p = p + guides(shape = guide_legend(reverse=TRUE))

And the result of this second formatting step is:

[Figure: ggplot2step2]

Next, we make some changes regarding the whole figure (e.g., color theme of the plot, font and grids):

# Format font, background and grids
p = p + theme_bw()
p = p + theme(text=element_text(family="Times New Roman"))
p = p + theme(panel.grid.minor.x = element_blank())
p = p + theme(panel.grid.major.x = element_blank())

This gives us:

[Figure: ggplot2step3]

It looks almost ready now, but there are still a few changes to be made, especially regarding the axes (e.g., titles, colors, sizes, limits and scale):

# Format axes (trans_breaks, trans_format and math_format come from the scales package)
library(scales)
p = p + xlab("|R|")
p = p + ylab("Elapsed time (s)")
p = p + theme(axis.line = element_line(color="black"))
p = p + theme(axis.text = element_text(size=30))
p = p + theme(axis.title = element_text(size=30))
p = p + theme(axis.title.y = element_text(vjust = 1.2))
p = p + theme(axis.title.x = element_text(vjust = -0.4))
p = p + theme(aspect.ratio=1)
p = p + coord_cartesian(ylim = c(0.1, 40000))
p = p + scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), labels = trans_format("log10", math_format(10^.x)))
p = p + scale_x_continuous(breaks = xvalues, labels = math_format(2^.x)(xvalues))

And that gets us to the final result we wanted:

[Figure: ggplot2final]

To save the figure, we use the ggsave() function and to show it on the screen we use print():

# Save figure
ggsave(plot = p, filename="ggplot2Final.png")

# Show on screen
print(p)

Now, for the sake of the explanation I used a particular formatting order. But if one uses the same order in a script (i.e., plot->legend->general->axis), the result will not be the same. The reason is that theme_bw() is a complete theme: adding it replaces the theme settings made before it, so the legend formatting is overwritten by the general formatting. The order that actually works is plot->general->axis->legend, as can be seen in the complete source code.
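The ordering issue can be reproduced on theme objects alone, without any data. The sketch below applies a single legend tweak before and after theme_bw() and inspects the resulting legend.title element; the variable names are only for illustration.

```r
library(ggplot2)

# Legend tweak applied BEFORE the complete theme: theme_bw() replaces it
before = theme(legend.title = element_blank()) + theme_bw()

# Legend tweak applied AFTER the complete theme: the tweak survives
after = theme_bw() + theme(legend.title = element_blank())

class(before$legend.title)  # same class as in a plain theme_bw()
class(after$legend.title)   # the blank element set by the tweak
```

The same happens when these pieces are added to a plot, which is why the legend formatting has to come after theme_bw().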

The full source file is available on GitHub. The comment section is open for discussion and suggestions about the design choices for the plot or about the way they were implemented in this tutorial.
