Richard A. Lent

Updated 01 December 2016


This is a mockup of a scientific paper (although the field data are real) to illustrate the use of R Notebooks as a means of creating reproducible research. The objective is to produce publication-quality output, in HTML, PDF, and Microsoft Word formats, with text, literature citations, a formatted bibliography, statistical analyses, tables, and graphics, all from one, plain-text R Notebook. It illustrates the basic components for creating reproducible research using the tools available in R and RStudio.

The R Notebook file sites.Rmd that produced the rendered document you are now reading contains all of the R code that produced the accompanying analysis and graphics. Within the R code chunks embedded in the text of the notebook, hash symbols (#) are used to denote comments. Comments are ignored by the R interpreter and are used to document the code. Thus the main text of the notebook describes the research, using the format of a traditional scientific paper, and the embedded R code and comments document how the analyses and graphics were produced. Additional comments in the main text are placed inside of HTML comment tags (<!-- -->). An additional file, sessioninfo.txt, was produced using the R sessionInfo() function to document the R environment (R version, computer platform and operating system, and installed R packages) used to produce this document, the analysis, and graphics. By storing the raw data file (sites.csv) along with a companion metadata file (sites.metadata.txt) together with the R Notebook and sessioninfo.txt, we can provide other researchers with all of the materials required to replicate our research.

A zip archive that packages all of the above-mentioned files, along with publication-quality documents rendered from sites.Rmd in PDF and Microsoft Word format, can be downloaded from here. When extracted, the zip archive will produce a folder named sites that contains all of the data, code, and documentation required to replicate the analyses and produce the graphics. It is thus a self-contained package of reproducible research.

An HTML notebook (sites.nb.html) is automatically created whenever the R Notebook sites.Rmd is modified and saved in RStudio, and is the usual way to share reproducible research documents created with RStudio. The HTML notebook contains the rendered HTML version of the R Markdown file sites.Rmd plus all R output, graphics, code, and a downloadable copy of sites.Rmd

What follows is a presentation of the research in standard scientific format, followed by some closing comments in the Epilogue.


The satyrid butterfly Coenonympha tullia, the Common Ringlet, has recently expanded its range southward into New England.1 Formerly restricted to the north and western United States and Canada, the Ringlet is now found throughout much of New England.2 Its range continues to expand in the eastern United States.3 A small butterfly, with maximum wingspan around 1.5 inches, Ringlets are quite phenotypically variable (Figure 1), particularly in background color and in the size and number of eyespots and other markings on the wing surfaces.4–6 This phenotypic variation, in combination with the range expansion, suggests the possibility of segregation by habitat of different phenotypes, with resultant variation in the forces of natural selection.1,7 Because the Ringlet is a weak flier and is specific in its habitat requirements, selecting early-successional fields and pastures containing certain species of grasses (e.g., Festuca, Poa pratensis, Stipa, Agrostis), it exists as a metapopulation, consisting of many locally isolated colonies.8 The Ringlet system is thus ideally suited for field studies in population ecology and evolutionary biology.


In summer 1994 we9–11 visited 11 grassland sites located from north-central Massachusetts to the Canadian border (Figure 2). Habitat structure was measured at each site along a 50-meter transect running through the center of the site (Figure 3). Habitat structure variables are shown in Table 1. Values for the eight habitat variables are the percentage of 50 transect points at which each habitat type (grass, moss, sedge, forb, fern, bare ground, woody herbaceous plants, and shrub stems) was present. A habitat type was considered present at a transect point if it touched a thin metal rod passed down vertically through the vegetation at the point.12

Specimens of male Common Ringlets were collected from each site and their phenotypes measured (Table 2). The five phenotypic variables in Table 2 are length of right forewing measured from insertion on the thorax to the wing apex (fwlength, mm), length of thorax (thorax, mm), number of eyespots or ocelli on the undersides of the right fore- and hindwings (spotting, count), length of right forewing ray (rfray, ordinal 0 through 3, where 0 codes for absent and 1 - 3 code for small, medium, and large), and length of right hindwing ray (rhray, ordinal 0 through 3). The final variable, n, is the number of male Ringlets measured at each site. Values of the phenotypic variables in Table 2 are means for each site calculated across the n individuals. Ocelli, forewing rays (linear, light-colored patches), and hindwing rays are on the ventral wing surfaces and are shown in Figure 1.

Two Euclidean distance matrices were computed from the field data, based on habitat structure variables (Table 1) and butterfly phenotype variables (Table 2). A third Euclidean distance matrix contained geographic distances between pairs of sites, computed using the latitude and longitude of each site in UTM coordinates (meters). (When Euclidean distance is calculated using spatial coordinates it yields the approximate geographic distance between sites.) A distance matrix expresses dissimilaries between all pairs of objects (in this case, the 11 sites) using a numeric distance measure based on several variables. Euclidean distance is one of several such distance measures13,14 and is the simple straight-line distance between two points in a Cartesian coordinate system. Euclidean distance can be extended to an n-dimensional space in which each of the n dimensions is a variable used to compute the distance matrix. Pairs of sites that have similar values of all vegetation or phenotypic variables have smaller values of Euclidean distance, and are thus more similar to each other in a multivariate sense, as if they were plotted in an n-dimensional space where n is the number of variables. Euclidean distance ranges from zero for identical sites (i.e., along the diagonal of the distance matrix, the distance of each site to itself) to an arbitrary upper bound.14

Hierarchical cluster analysis was performed on the habitat and phenotype distance matrices. Cluster analysis is a technique for finding groups in multivariate data. The technique produces a tree diagram, or dendrogram, that groups similar objects together and shows the distance relationships among objects.13,15

We examined between-site relationships in habitat structure, butterfly phenotype, and geographic distance using the Mantel test14,16 for matrix correlation. Matrix correlation can be quantified by computing the Pearson correlation coefficient between the elements of two distance matrices. But because of the statistical dependence among the cells of a distance matrix, ordinary tests of significance cannot be used. Instead, the Mantel test uses random permutations of the rows and columns of the distance matrices. This is done a large number of times (10,000 in our analyses), and the correlation coefficient calculated each time, forming a sampling distribution. The observed matrix correlation is then compared with the sampling distribution to assess the significance of the observed correlation under the assumption of randomness.

We tested for matrix correlations between (1) habitat and phenotype, (2) habitat and geographic distance, and (3) phenotype and geographic distance. The questions17 addressed with these tests were:

  1. Do sites that are similar in terms of their habitat structure variables also tend to be similar in terms of butterfly phenotype variables?

  2. Are sites that are closer together also similar in habitat structure?

  3. Are sites that are closer together also similar in butterfly phenotype?

All computations, graphics, and document preparation were done using R and RStudio.18,19

# This is an R code chunk that just contains comments. Use them to document your code.
# In code chunks, lines beginning with the # symbol are comments and are ignored by the R
# interpreter.
# We can give names to chunks, letting us navigate the chunks in RStudio.
# We can also navigate the sections of the document in RStudio.
# Comments can also be used to turn code on and off during testing. 
# A good thing about R Notebooks is that you can selectively run code chunks. 
# Thus you could include exploratory data analysis that you don't want to have 
# in the final manuscript, but want to have included in your R Notebook for documentation. 
# Note: If you edit the comments in a code chunk that just contains informational comments,
# you have to run the chunk for the edits to appear in the rendered document.
# We first bring our sites.csv data file into R and get it into an R dataframe.
# Metadata for sites.csv is in sites.metadata.txt.
# Documentation of the R environment is in sessioninfo.txt.
# The sites dataframe created here will be accessible in the R environment 
# throughout this document.
# We also load some required R packages.

sites <- read.csv("sites.csv") # sites is an R dataframe.

# Compute distance matrices and Mantel tests for later use in text and graphics.
# These objects need to be created here so they are accessible later in the document.

sites.geo = dist(cbind(lat_utm, lon_utm), method = "euclidean")
sites.veg = dist(cbind(moss,sedge,forb,fern,bare,woody,shrub), method = "euclidean")
sites.pheno = dist(cbind(fwlength,thorax,spotting,rfray,rhray),method="euclidean")
mt1 <- mantel.rtest(sites.pheno, sites.geo, nrepet = 10000)
mt2 <- mantel.rtest(sites.veg, sites.geo, nrepet = 10000)
mt3 <- mantel.rtest(sites.pheno, sites.veg, nrepet = 10000)

Results and Discussion

Because sites were initially chosen to be predominantly herbaceous, we dropped habitat variable grass from the analysis because it did not vary much among sites (Table 1). The 11 sites ranged from 46 to 50% grass with varying amounts of forbs.

Three site clusters were recognized based on habitat structure (Table 3 and Figure 4). Cluster 1 (Orleans, Bellows Falls, Springfield, Bernardston) contained sites that were shrubby and woody. Sites in cluster 2 (Mount Ascutney, Lyme, Fairlee, Lyndonville) had the lowest percentage occurrence of forbs. Moss, sedges, bare ground, and shrubs were absent at the sites in Cluster 3 (Hanover, Newport, St. Johnsbury). The habitat structure clustering was broadly related to geographic location of the sites (Figure 2). Cluster 1 sites were mostly southern, cluster 3 sites were mostly northern, and sites in cluster 2 were somewhat in the middle.

Three site clusters were also apparent based on butterfly phenotype (Table 4 and Figure 5). Cluster 1 (Bellows Falls, Lyme) had the smallest butterflies in terms of forewing and thorax length, with more ocelli but shorter fore- and hindwing rays (Table 4). Ringlets in cluster 3 (Bernardston, Newport) had the largest forewings. Butterflies in cluster 2 (St. Johnsbury, Springfield, Fairlee, Hanover, Mount Ascutney, Orleans, Lyndonville) had an intermediate forewing length and the longest rays on both forewings and hindwings. Clusters 1 and 3 each contained only two sites and differed from each other mainly in Ringlet body size. In contrast, sites in the large cluster 2 had Ringlets with characters intermediate to those of clusters 1 and 3. Clusters 1 and 3 can be viewed as outlying clusters that differ from each other in addition to being different from the central cluster 2 containing the majority of sites. In contrast to the habitat clustering results, no strong relationship was apparent with regard to butterfly clusters and site location (Figures 2 and 5).

While these patterns in the clustering of sites by habitat structure and butterfly phenotype are interesting, Mantel tests showed that the underlying matrix correlations were all nonsignificant (Figure 5). The observed matrix correlations of r = 0.199 for phenotype vs distance, r = -0.061 for habitat vs distance, and r = -0.25 for phenotype vs habitat were indistinguishable from randomly-generated values.

The most likely explanation for the lack of a relationship between site habitat structure and butterfly phenotype is that we measured the wrong variables, or that we measured habitat and butterfly variables at the wrong scale. For example, Dennis and Eales20 classified 166 UK sites according to occupancy by Coenonympha tullia (simple presence-absence) and found that habitat quality, size of habitat patch, and patch isolation accounted for 61% of the variation in C. tullia occupancy of sites. Their 23 measures of habitat quality were much more detailed and species-specific than our eight simple measures of habitat structure.

Habitat variables thus may have a greater influence on whether or not Ringlets are present at a site, rather than being related to phenotypic characteristics of the local population. Morphological variation in Ringlets, for example thorax size, may be related to factors such as flight activity in differing thermal environments.21 Wing size, ocellation, and size of rays and other wing markings have been shown to be influenced by opposing forces of natural selection favoring crypticity versus predator avoidance.22

Our data do suggest the presence of some heterogeneity in Ringlet phenotypes among the 11 sites that were sampled (Table 4, Figure 5). This agrees with the findings of Wiernasz,1 who showed that range expansion in C. tullia was associated with significant increases in morphological variation and changes in other life history parameters. Such variation, coupled with small population sizes occupying isolated habitat patches, may result in increased genetic differentiation among local populations.23


In this example of using RStudio for creating reproducible research, everything was kept in a single R Notebook. Our intent was to show the basic pieces needed in order to produce reproducible research using the tools available in R and RStudio (e.g., R Markdown, code chunks, BibTeX). A more complex project would require a more complex organizational structure. As discussed in detail by Gandrud,24 this might entail a separate folder for the raw data and metadata files, one for R scripts that process the data, a folder that has R scripts for doing the analysis, one for graphics, one for production of the final documents, a plain-text README file to explain how everything fits together, and so on. Having all R code in a single R notebook, as we have done in this example, would probably only be suitable for a small project. Moreover, to control the length and complexity of this example we did not provide the original, raw field data collected on the habitat structure and morphology of individual butterflies that was processed and averaged to produce Tables 1 and 2. Truly reproducible research should provide all of the original data along with all of the computer code used to process and reduce the data for the final results. We also could have included some exploratory data analysis,25 but in the interest of space we did not include this important step (see An Introduction to R).

Creating reproducible research is good not only for the advancement of science, but is also good for the individual researcher. Gandrud24 lists the following benefits of reproducible research:

I hope that this document encourages you to explore the free tools provided by R and RStudio and to use them to generate your own brand of plain-text, cross-platform, future-proof reproducible research.

# We use the kable function, part of the knitr package, to generate tables. 
# kable can create nice tables from R data frames and matrices.
# Other R table packages include xtable and stargazer.
# An advantage of kable is that it will automatically generate the appropriate table
# code for the document type being rendered (e.g., HTML, PDF, or Word).
# Note the use of the results='asis' chunk option. This is required to ensure
# that the raw table output is not processed further by knitr. 

# Subset out the habitat variables for Table 1. The character vector hab contains the
# names of the variables we want to subset out of the main dataframe sites. The hab
# vector can then be used to index into the sites dataframe by enclosing it in square
# brackets: sites[hab].

hab <- c("locality","code","grass","moss","sedge","forb","fern","bare","woody","shrub")
habvars <- sites[hab]
kable(habvars, caption = "Table 1. Habitat structure variables.", digits = 0)
Table 1. Habitat structure variables.
locality code grass moss sedge forb fern bare woody shrub
Bellows Falls BF 46 4 0 34 0 4 10 0
Springfield SP 49 0 0 24 1 0 14 0
Mount Ascutney MA 49 0 4 17 5 0 0 0
Bernardston BE 49 1 0 29 0 0 11 7
Lyme LY 50 0 0 14 4 0 0 0
Fairlee FA 50 5 0 12 0 2 0 0
Newport NW 49 0 0 35 0 0 0 0
Orleans OR 49 0 0 41 0 0 19 0
Lyndonville LN 50 0 0 9 0 0 3 0
St. Johnsbury SJ 49 0 0 42 0 0 0 0
Hanover HA 49 0 0 38 5 0 5 0

# Subset out the phenotypic variables for Table 2.

lep <- c("locality","code","fwlength","thorax","spotting","rfray","rhray","n")
lepvars <- sites[lep]
kable(lepvars, caption = "Table 2. Butterfly phenotypic variables (means).")
Table 2. Butterfly phenotypic variables (means).
locality code fwlength thorax spotting rfray rhray n
Bellows Falls BF 15.8 3.6 1.6 1.6 1.4 62
Springfield SP 16.0 3.7 1.1 2.0 1.5 51
Mount Ascutney MA 16.2 3.8 1.2 1.9 1.6 58
Bernardston BE 16.7 3.7 1.4 1.8 1.6 27
Lyme LY 15.6 3.6 1.1 1.8 1.3 52
Fairlee FA 15.9 3.8 1.0 1.8 1.5 25
Newport NW 16.7 3.7 1.1 1.8 1.3 21
Orleans OR 16.1 3.8 1.4 1.9 1.7 41
Lyndonville LN 16.0 3.7 1.4 1.9 1.7 52
St. Johnsbury SJ 15.7 3.6 1.2 1.9 1.7 56
Hanover HA 16.0 3.6 1.2 1.8 1.8 65
# Compare habitat variable means for habitat clusters.
# A simple way to compare clusters in a dendrogram is to create a 
# grouping variable that indicates the cluster membership of each
# object, and then compare variables by cluster membership.
# The cutree function creates a vector of cluster memberships at the
# height in the dendrogram indicated by the h parameter. We then
# merge this cluster membership vector back into the original data,
# create a character vector of the habitat variables we want to
# summarize, and then use the aggregate() function to compute means
# of the habitat variables by cluster. The result is passed to 
# the kable() function to create a pretty table.

veg.clust = hclust(sites.veg, method="average")
cluster = cutree(veg.clust, h=16)
vars <- c('moss','sedge','forb','fern','bare','woody','shrub')
veg.groups = cbind(cluster,sites[vars])
table3 <- aggregate(veg.groups[vars], list(veg.groups$cluster), mean)
kable(table3, caption = "Table 3. Means of site habitat variables by cluster; see Figure 4.", digits = 1, col.names = c('Cluster','moss','sedge','forb','fern','bare','woody','shrub'))
Table 3. Means of site habitat variables by cluster; see Figure 4.
Cluster moss sedge forb fern bare woody shrub
1 1.1 0 32.0 0.4 1.1 13.6 1.8
2 1.2 1 13.0 2.2 0.5 0.8 0.0
3 0.0 0 38.3 1.7 0.0 1.7 0.0
# Compare phenotypic variable means for phenotype clusters.
# Same procedure as for Table 3, except that here we use the cluster
# analysis objects created for Figure 5.

pheno.clust = hclust(sites.pheno, method="average")
cluster = cutree(pheno.clust, h=0.59)
vars <- c('fwlength','thorax','spotting','rfray','rhray')
pheno.groups = cbind(cluster,sites[vars])
table4 <- aggregate(pheno.groups[vars], list(pheno.groups$cluster), mean)
kable(table4, caption = "Table 4. Means of site phenotype variables by cluster; see Figure 5.", digits = 1, col.names = c('Cluster','fwlength','thorax','spotting','rfray','rhray'))
Table 4. Means of site phenotype variables by cluster; see Figure 5.
Cluster fwlength thorax spotting rfray rhray
1 15.7 3.6 1.4 1.7 1.4
2 16.0 3.7 1.2 1.9 1.6
3 16.7 3.7 1.2 1.8 1.5
# We use the imager package, with function load.image, to create an image object that 
# we can then give to the plot function.

im2 <- load.image("ringlets.jpg")
par(mar = c(0,0,0,0))
plot(im2, asp = 1, axes=FALSE, xlab="", ylab="")