Some important differences between the academic and nonacademic job markets that are useful to consider at the start:
Probably the biggest transition when starting to apply for data science jobs was the shift from an academic CV to a nonacademic résumé. A CV lists functionally every major accomplishment you’ve achieved in your time in the field, while a résumé is highly targeted for a specific position. When applying to academic jobs, I wrote a (semi) customized cover letter for every job, and then included the relevant version of my CV (conflict, methods, or teaching). Each of these CVs contained the same information, just in a different order. In contrast, I significantly edited the skills section of most résumés I sent out based on the job listing. The WashU career center has a fantastic handout on differences between the two documents and how to adapt a CV into a résumé that I drew on heavily in this process.
In my opinion, the conventional wisdom that a résumé can only ever be one page is an overcorrection from the never-ending academic CV. The résumé I used to apply for jobs was two pages: the first included work experience, education, and a list of technical skills, while the second was project-oriented and covered two publications, a couple of blog posts, a Shiny dashboard, and teaching materials for the grad stats lab I taught. You definitely want to include links here, not just to the final product, but also to the code behind it where relevant (replication materials for publications, git repos for smaller projects). This is an excellent opportunity to showcase work that uses data science skills to show something interesting but wouldn’t be considered novel enough for publication in an academic journal. Here are some other points that may be helpful when writing a résumé:
Below is a list of non-résumé-related things I did to prepare for and during my nonacademic job search that I found helpful:
Social science PhD programs are good at teaching research design, formal modeling, and statistical methodology. They spend far less time on what I’ll call supporting technical skills. Here are some suggestions in this domain based on my observations so far:
So far this post has mainly been oriented around a list of discrete things you can do to (potentially) improve your odds of securing a data science job as someone with a social science PhD. This last section reflects a perspective I developed throughout my job search process as I participated in more and more interviews, and, I hope, it will serve as a source of motivation for anyone pursuing a similar career transition.
The vast majority of quantitative social science PhDs (myself very much included) are never going to be machine learning engineers who run neural networks all day long. Instead, we’re going to be working with those engineers, running our own analyses (which might include some deep learning models, but plenty of other types of models as well), and also working with less-technical stakeholders.
Based on conversations with other data scientists and my experiences as a data scientist thus far, a large part of a data scientist’s job is communicating the value of the work you and your more-technical team members have done to people with less technical training. Even if they have a strong background in statistics or research design more generally, they’re still likely to be less familiar with your specific area of expertise. Communicating effectively in this situation requires distilling large amounts of information, drawing conclusions based on data, and then summarizing what you did, why you did it, and what you learned from doing it. To me, that sounds exactly like what social science PhD programs train their students to do.4
Three months is also far too short a time to reach a definitive conclusion on this topic. ↩
Talking to other data scientists with similar backgrounds, which I discuss below, was useful because it gave me information and context that I was able to draw on when negotiating salary. However, an extensive body of research finds that women are penalized for negotiating where men are rewarded for it. This is just one reminder of the fact that something that I found helpful may be less useful for you. ↩
This was a pleasant surprise for me, as I still have vivid memories of sending résumé after résumé out into the void as a fresh poli sci BA in 2012 and almost never hearing back. ↩
To make this even more concrete: being able to communicate effectively with software engineers means that they help make your models more efficient with less work from you; being able to communicate with stakeholders means that you are more likely to get recognition for the work you did. ↩
Two things have changed since then. One: an already precarious academic job market that never recovered from the global financial crisis has imploded even further. Two: opportunities for people with the set of skills you pick up in a quantitative social science PhD program have exploded. Quit lit is often deeply personal and centered around the path one took to deciding to leave academia; see this piece for links to several prominent examples.1 This is not that kind of quit lit, because that’s not where my communication skills are strongest. Instead, I’m writing this post to illustrate the contrast between my academic and nonacademic job search processes in the hopes that it may be a useful data point for current grad students, postdocs, adjuncts, and maybe even some early-career faculty.2 When reading this post, bear in mind that I am presenting data from an n of one, and my experiences may not generalize outside of quantitative social science, or even very far within it.
I had an enormous amount of support in this process from both my institutions and my networks; in no way could I have gotten a data science job as easily on my own. I talk more about the help I received in this post.
Let’s get straight to the numbers. Out of 142 jobs I applied to, I received two job offers. That’s a 98.6% rejection rate.3 Visualizing this (with apologies to Andrew Heiss) looks like so.
Five jobs expressed interest in me beyond my initial application, which translates to a 3.5% response rate. The ‘Nothing’ category encompasses both jobs that sent me an automated HR rejection email (often several months after their chosen candidate had accepted the offer) as well as ones that never got back to me. Many searches for faculty positions will conduct Zoom/Skype/Teams interviews with their long short list of candidates before inviting the short list to an on-campus visit, colloquially termed a flyout, but some may skip straight to the on-campus visit. Some postdoc positions conduct virtual interviews, while others simply make an offer to their preferred candidate. I used a rough ranking of potential outcomes as Offer > Flyout ≥ Interview > Rejection in constructing this plot, with each dot representing the final outcome for that application.
I applied to a wide range of permanent (tenure-track and teaching-track) faculty positions, as well as a number of temporary (postdoc, visiting assistant professor, lecturer) positions. Splitting my applications along this dimension shows that I had noticeably more success in my applications for temporary positions (10.3% response rate) than permanent ones (1.8% response rate).
Since my non-nothing outcomes are so few, I can easily list them in more detail:
If we break down the jobs I applied for by academic subfield, some unsurprising patterns emerge. Data science includes jobs listed as computational social science; jobs listed for both a substantive subfield and methods are coded under the substantive area; and international relations, conflict, peace studies, security studies, and international political economy are all grouped in the IR category.
The majority of jobs I applied to (92) were advertised as international relations. While much of my research sits at the intersection of international relations and comparative politics, very few of the jobs I applied to did. I didn’t track how many such jobs were advertised, so this may simply reflect how few of them there were to apply to. Data science (24) handily outnumbers the more traditional subfield of methods (14), reflecting the discipline’s increasing interest in the former.
The map below geographically visualizes the jobs I applied to. Each circle represents one institution, with the size of the circle denoting how many positions I applied for. I applied to five positions at UCSD, the most of any institution.
I focused primarily on the Eastern US and California. I applied to jobs in 31 states and the District of Columbia, meaning there were 19 states I did not apply to any jobs in. Looking at my applications over time helps tell the story of my academic job search process.
The 2018-19 academic job market season was my final year of grad school. I wanted to be done, so I applied to a wide variety of jobs. The postdoc I received an offer from was actually the last position I applied for in this cycle. I was a little more selective in the 2019-20 job market season because I had an excellent postdoc, with a high chance of a second year of funding. I started a new postdoc in 2020 and knew that I had a second year of funding guaranteed. COVID-19 absolutely devastated the job market that cycle as well. With a second year of funding secure and precious few institutions hiring, I decided to spend my time focusing on improving my CV and applied to a total of four jobs that cycle, all tenure-track. The market somewhat recovered in the 2021-22 cycle, but there were still far fewer jobs than in my first two cycles. I applied to 19 jobs this cycle, all of them permanent. There were some great postdocs this cycle, but three years as a postdoc had been enough for me.
Two jobs did show interest in me this last cycle, but it was too little, too late. I had an offer for a data science job when I received an on-campus interview for one job, and had already accepted the offer when I received a Zoom interview for another. Given the typical pace of an academic search, it was possible that even if I were successful in getting an offer for either of these positions, it wouldn’t be for another month or two. My postdoc ended in June, and an offer in hand doing interesting research was an easy sell compared to that uncertainty.
Across all 142 applications, I ended up submitting 399 letters of recommendation to search committees. I was very fortunate that UNC has a department administrator handle letters for grad students as a sort of discount (read: free) Interfolio Dossier service. They generously provide this service to graduates of the department until they secure their first permanent job, even after they have left. I spent so long on the academic job market that I had no fewer than three different people help me with this process. I am incredibly grateful for their efforts and want to highlight the support they gave me.
I haven’t done as good of a job tracking my applications to nonacademic jobs because the process is much less structured and standardized. Some applications require a cover letter, so I can count up all the cover letters saved in my job search folder: 25. You can also apply for many jobs with just a résumé. Let’s say I applied to 10 of those, which makes 35 applications total.
I started the interview process with seven of these employers. Acknowledging some uncertainty in the denominator, that’s a 20% response rate, nearly six times my academic response rate of 3.5%. I completed the interview process with four of these employers, receiving one rejection and three offers (I withdrew from the other three interview processes after accepting one of those offers). A 75% interview success rate is pretty incredible compared to my experience on the academic job market. That’s an overall success rate of 8.6%, which is more than six times higher than my overall success rate for academic jobs.
Or is it a 50% success rate? I actually interviewed for two different positions with two of these employers, so you could also slice the data less favorably and say I received offers for three out of six positions I interviewed for. That’s still an overall success rate of 8.1%, which is pretty damn good in my eyes. I also want to highlight some of the experiences I had on the nonacademic job market that I can’t imagine ever happening on the academic one:
Others have written extensively about why you shouldn’t view a nonacademic job as a backup option or a failure, but sometimes it’s just nice to know that people want to pay you. If you’re striking out on the academic job market, there are plenty of other options out there. So it goes.
People have criticized the term quit lit for focusing on the individual and ignoring the systemic forces that contribute to many people’s decision to leave academia. I am very persuaded by this argument, but no one has yet coined a similarly catchy and succinct alternative. ↩
I’m using the term nonacademic instead of industry, which is usually presented as the alternative to academic jobs for people with a PhD, because I applied for jobs in both the private and public sectors. ↩
I considered Kilgore Trout’s intended epitaph from Breakfast of Champions as a title for this post, but decided it was both too obscure and too bleak: he tried. ↩
Sometimes it’s faster (easier) to just write code that works for you, on your system, without any consideration for some poor researcher who may try to replicate your results in the future.1 This tendency was especially bad for this project because at various points in time I was writing code to run on my personal laptop and two different high performance computing clusters. This is a recipe for code that doesn’t travel well and will almost certainly fail to replicate.
There were a lot of changes I made to my code to ensure my results replicate, but the most tedious (and time consuming, by far) was cleaning up my file paths. Due to the computationally intensive GIS work and Bayesian statistics involved in the project, I ran lots of code on a cluster, and then pulled the results back to my laptop to summarize and create figures. This unsurprisingly resulted in a huge mess when looking at the project as a whole, rather than any individual script. Luckily, R and Rstudio made things (relatively) painless to fix.
Anytime you load a dataset into R, you need to specify the path to that file. The same’s true when you save R output to a file. This article started as a chapter of my dissertation, so all of the code originally lived in the Dissertation folder on my laptop. However, as I started adapting it to an article length manuscript, I created a new Conflict Preemption folder in my Projects folder. By the time the article was accepted, I had two main folders I needed to combine:
/Users/Rob/Dropbox/UNC/Dissertation/Onset
/Users/Rob/Dropbox/WashU/Projects/Conflict Preemption
Both of these folders live in my Dropbox, but that’s about where the similarities end. I wrote most of the code for running models while still at UNC, so when I added new scripts to run models to respond to reviewer comments, I still stuck them in the UNC folder. That also means that all of the output of these models ended up in the UNC folder when it got transferred from the cluster. However, when I needed to do something simpler like create a time series plot of the number of separatist groups in existence, I wrote that code in the WashU folder. I also had a script in the WashU folder to load all of the results and generate plots from them. Because this script and the data it needed to load were in completely different directories, this is what I had to do to load the data to create one of the main figures:
load('/Users/Rob/Dropbox/UNC/Dissertation/Onset/Figure Data/marg_eff_pop_df_cy.RData')
Not particularly likely to work on anyone else’s computer. To fix this, I needed to move all of the data to the Conflict Preemption folder, which was easy, and then rewrite all of the code that referenced file locations, which was less easy.
As a first step, I needed to chop off /Users/Rob/Dropbox/UNC/Dissertation/Onset/ from the start of every file path. All the files for the article, including both the R scripts and the various data files, now live in /Users/Rob/Dropbox/WashU/Projects/Conflict Preemption, but all of the file paths in the scripts still start with /Users/Rob/Dropbox/UNC/Dissertation/Onset, because that’s where all the files were before. You can do this using the standard find and replace functionality built into RStudio. However, there’s no guarantee that someone in the future will correctly set R’s working directory before running the code. I used the here R package to ensure that R can always find everything it needs for my code. All you have to do is wrap file paths in the here() function from the package, and they’ll be automatically completed with the full file path, letting R find your files.2
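As a quick illustration of how this works (a minimal sketch; it assumes the project root contains an .Rproj file, which is how here() locates it), here() with no arguments reports the project root, and any path components you pass get appended to that root:

library(here)
## here() finds the project root by looking for an .Rproj (or .here) file
here()
## path components are appended to the root, building the full absolute path
here('Figure Data', 'marg_eff_pop_df_cy.RData')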
You need to use the relative path to each file, so for a file with an absolute path of /Users/Rob/Dropbox/WashU/Projects/Conflict Preemption/Figure Data/marg_eff_pop_df_cy.RData, the relative path (relative to the project folder of /Users/Rob/Dropbox/WashU/Projects/Conflict Preemption) would be Figure Data/marg_eff_pop_df_cy.RData. The final bit of R code looks like this:
load(here('Figure Data/marg_eff_pop_df_cy.RData'))
The addition of that here() in between load() and the file path means that things are no longer as simple as finding and replacing the start of the file path.
Luckily, I was able to take advantage of RStudio’s built in support for regular expressions to save myself from having to manually change each line of code that either loaded or saved a file. Regular expressions are a powerful way to search through and manipulate text. You can activate them in RStudio’s find and replace dialog by checking the Regex box:
Once you’ve done that, certain characters in your search will no longer be interpreted literally. The most important difference is probably ., which is a stand-in for any character.3 This is similar to how * is a wildcard in the Unix shell, e.g., you can use ls *.R to list all R script files in a folder. The main regular expression feature I used is the capturing group, which allows you to identify and extract a subset of a line of text. You designate a capturing group by surrounding the desired text with parentheses. To fix all of the code loading RData files from the Figure Data folder, my regular expression looked like this:
'/Users/Rob/Dropbox/UNC/Dissertation/Onset/(Figure Data/.*\.RData)'
It starts with /Users/Rob/Dropbox/UNC/Dissertation/Onset/, which is the part I want to get rid of. Next, (Figure Data/.*\.RData) tells the regular expression to look for any character (.) repeated an unlimited number of times (*) followed by .RData. Because . is a special character in regular expressions, we have to escape it with a backslash (\). This will match any file name ending in .RData in the Figure Data folder. If we left out the leading /Users/Rob/Dropbox/UNC/Dissertation/Onset/, we would still end up with the capturing group we want, but since /Users/Rob/Dropbox/UNC/Dissertation/Onset/ wouldn’t be part of the search string, it wouldn’t end up getting replaced. This is the same reason we need to include the opening and closing quotation marks; if we didn’t, we’d end up with a here() command inside quotation marks, which R would just treat as a string and not a command.
At this point I had the core of the line that I wanted to keep, but now I needed to extract it and place it inside of a call to here(). You accomplish this using a backreference to the capturing group. To reference the first capturing group, you use either \1 or $1, depending on which flavor of regular expressions you are using. This is often difficult to figure out and is one of the more annoying things about regular expressions; you’ll frequently just have to determine which one to use through trial and error. Luckily, RStudio accepts either version!
To replace the absolute path with a relative one wrapped in a here() call, this is what I typed into the Replace field in the find and replace dialog:
here('$1')
and it resulted in this:
here('Figure Data/marg_eff_pop_df_cy.RData')
Thanks to the power of capture groups, you can just hit the replace all button and instantly transform every file path into a much more portable and replication-friendly one.
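If you’d rather script the replacement than click through RStudio’s dialog, the same pattern and capturing group work with base R’s gsub(); this is just a sketch of the idea (the script name is a placeholder), and note that gsub() writes the backreference as \\1 rather than $1:

## read in a script, swap absolute paths for here() calls, and write it back out
## 'Analysis/figures.R' is a hypothetical name for whichever script you're cleaning up
## note the doubled backslashes: R strings need \\ to represent a literal \ in the regex
script <- readLines('Analysis/figures.R')
script <- gsub("'/Users/Rob/Dropbox/UNC/Dissertation/Onset/(Figure Data/.*\\.RData)'",
               "here('\\1')",
               script)
writeLines(script, 'Analysis/figures.R')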
If you’re feeling really confident that you moved every file correctly, you can replace all file paths with the following regular expression:
'/Users/Rob/Dropbox/UNC/Dissertation/Onset/(.*\..*)'
This will match any file with an extension (a literal period, matched by \., followed by the extension, matched by .*), along with any preceding subdirectories (the initial .*), and stick the whole relative path into the resulting here() call. As an example, this will successfully turn this:
fileConn <- file('/Users/Rob/Dropbox/UNC/Dissertation/Onset/Tables/pd_pop_cy.tex')
groups <- readRDS('/Users/Rob/Dropbox/UNC/Dissertation/Onset/Input Data/groups_nightlights.RDS')
into this:
fileConn <- file(here::here('Tables/pd_pop_cy.tex'))
groups <- readRDS(here::here('Input Data/groups_nightlights.RDS'))
I’m using ‘replication’ here to mean that the code used to generate quantitative results from a dataset should produce those same results when run by another researcher, not in the sense that means that independent researchers following the published protocol can collect the data themselves and arrive at the same conclusion. I use the term ‘reproducible’ to describe this property. Annoyingly, different fields use opposing definitions of these two terms. ↩
Specifically, here() will key into the .Rproj file included in my replication materials and use that to properly locate everything else. ↩
Except for newlines, carriage returns, and other end of line special characters. ↩
See this simple example, which displays the area of each county in North Carolina, from the sf package documentation.1
First, we need to load sf and then get the built-in nc dataset:
library(sf)
nc <- st_read(system.file('shape/nc.shp', package = 'sf'))
plot(nc[1])
Since I needed to generate choropleths for multiple countries, I decided to use ggplot2’s powerful faceting functionality. Unfortunately, as I discuss below, ggplot2 and sf don’t work together perfectly in ways that become more apparent (and problematic) the more complex your plots get. I moved away from faceting and just glued together a bunch of separate plots, but then I had to figure out how to end up with a shared legend for five separate plots. Read on to see how I solved both of these issues.
I already loaded sf to make the plot of North Carolina above, so now let’s load the remaining packages we’ll use:
library(tidyverse) # data manipulation and plotting
library(tmap) # spatial plots
library(cowplot) # combine plots
library(RWmisc) # clean plot theme
I’m working with cleaned and subsetted versions of ACLED and GADM, which I’ve uploaded to my website as PKO.Rdata if you want to download them and run this code yourself. The acled object contains a list of attacks on peacekeepers in active Chapter VII UN peacekeeping missions in sub-Saharan Africa, while the adm object contains all of the second-order administrative districts (ADM2) in the five countries with active missions.
## load data
load(url('https://jayrobwilliams.com/data/PKO.Rdata'))
## inspect
head(acled)
head(adm)
## Simple feature collection with 6 features and 30 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -3.6102 ymin: 0.4966 xmax: 29.4654 ymax: 19.4695
## Geodetic CRS: WGS 84
## # A tibble: 6 x 31
## data_id iso event_id_cnty event_id_no_cnty event_date year time_precision
## <dbl> <dbl> <chr> <dbl> <date> <dbl> <dbl>
## 1 6713346 140 CEN47283 47283 2019-12-27 2019 1
## 2 6689432 180 DRC16211 16211 2019-12-08 2019 1
## 3 7578005 180 DRC16182 16182 2019-12-04 2019 1
## 4 7191069 466 MLI3253 3253 2019-10-21 2019 1
## 5 6759702 466 MLI3225 3225 2019-10-06 2019 1
## 6 6023339 466 MLI3224 3224 2019-10-06 2019 1
## # … with 24 more variables: event_type <chr>, sub_event_type <chr>,
## # actor1 <chr>, assoc_actor_1 <chr>, inter1 <dbl>, actor2 <chr>,
## # assoc_actor_2 <chr>, inter2 <dbl>, interaction <dbl>, region <chr>,
## # country <chr>, admin1 <chr>, admin2 <chr>, admin3 <chr>, location <chr>,
## # geo_precision <dbl>, source <chr>, source_scale <chr>, notes <chr>,
## # fatalities <dbl>, timestamp <dbl>, iso3 <chr>, month <dbl>,
## # geometry <POINT [°]>
## Simple feature collection with 6 features and 19 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 18.54607 ymin: 4.221635 xmax: 22.395 ymax: 9.774724
## Geodetic CRS: WGS 84
## # A tibble: 6 x 20
## GID_0 NAME_0 GID_1 NAME_1 NL_NAME_1 GID_2 NAME_2 VARNAME_2 NL_NAME_2 TYPE_2
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 CAF Central… CAF.1… Bamin… <NA> CAF.… Bamin… <NA> <NA> Sous-…
## 2 CAF Central… CAF.1… Bamin… <NA> CAF.… Ndélé <NA> <NA> Sous-…
## 3 CAF Central… CAF.2… Bangui <NA> CAF.… Bangui <NA> <NA> Sous-…
## 4 CAF Central… CAF.3… Basse… <NA> CAF.… Alind… <NA> <NA> Sous-…
## 5 CAF Central… CAF.3… Basse… <NA> CAF.… Kembé <NA> <NA> Sous-…
## 6 CAF Central… CAF.3… Basse… <NA> CAF.… Minga… <NA> <NA> Sous-…
## # … with 10 more variables: ENGTYPE_2 <chr>, CC_2 <chr>, HASC_2 <chr>,
## # ID_0 <dbl>, ISO <chr>, ID_1 <dbl>, ID_2 <dbl>, CCN_2 <dbl>, CCA_2 <chr>,
## # geometry <MULTIPOLYGON [°]>
ggplot2
The first step is to associate each individual attack with the ADM2 it occurred in. We can do this with the st_join() function. This function executes a left join by default, so by using adm for the x argument and acled for the y argument, we end up with one row for every ADM2 with no attacks in it, and n rows for each ADM2 with attacks in it, where n equals the number of attacks in that ADM2. We can then use group_by() and summarize() to create a count of attacks for each ADM2 by summing the number of non-NA observations of event_id_cnty, the main ID field in ACLED. Finally, I log this count variable (using log1p() to account for the ADM2s without any attacks, because ln(0) is undefined) to make the resulting plot more informative given the outliers in Northern Mali and the Eastern DRC. Putting it all together:
st_join(adm, acled) %>%
group_by(NAME_0, NAME_1, NAME_2) %>%
summarize(attacks = log1p(sum(!is.na(event_id_cnty)))) %>%
ggplot(aes(fill = attacks)) +
geom_sf(lwd = NA) + # no borders
scale_fill_continuous(name = 'PKO targeting\nevents (logged)') +
theme_rw() + # clean plot
theme(axis.text = element_blank(), # no lat/long values
axis.ticks = element_blank()) # no lat/long ticks
That’s a lot of wasted white space, and it can make certain countries harder to see. Let’s split it out using facet_wrap(). We simply add a facet_wrap() call to our ggplot2 code, and tell it to split by our country name variable, NAME_0:
adm %>%
st_join(acled) %>%
group_by(NAME_0, NAME_1, NAME_2) %>%
summarize(attacks = log1p(sum(!is.na(event_id_cnty)))) %>%
ggplot(aes(fill = attacks)) +
geom_sf(lwd = NA) +
scale_fill_continuous(name = 'PKO targeting\nevents (logged)') +
facet_wrap(~ NAME_0) +
theme_rw() +
theme(axis.text = element_blank(),
axis.ticks = element_blank())
We’ve got facets, but everything is still clearly on the same scale. Let’s set scales = 'free' in our call to facet_wrap() to try and fix that.
st_join(adm, acled) %>%
group_by(NAME_0, NAME_1, NAME_2) %>%
summarize(attacks = log1p(sum(!is.na(event_id_cnty)))) %>%
ggplot(aes(fill = attacks)) +
geom_sf(lwd = NA) +
scale_fill_continuous(name = 'PKO targeting\nevents (logged)') +
facet_wrap(~ NAME_0, scales = 'free') +
theme_rw() +
theme(axis.text = element_blank(),
axis.ticks = element_blank())
## Error: coord_sf doesn't support free scales
And we get an error. It turns out that the ggplot2 codebase assumes it can manipulate axes independently of one another. This is very much not the case with geographic data, where a meter vertically needs to equal a meter horizontally, so coord_sf() locks the axes in much the same manner as coord_fixed().2 To try and get around the limitations of ggplot2’s non-spatial origins, I turned to a package written from the ground up for plotting spatial data.
tmap
My googling led me to this Stack Overflow answer extolling the virtues of the tmap package.3 tmap is a package for drawing thematic maps from sf objects using a syntax very similar to ggplot2. We can reuse the same data wrangling code and, as before, pipe it into our plotting function, which this time is tm_shape(). We then add a call to tm_polygons() to get our colored features and tm_facets() to split them apart. Note that unlike ggplot2, we need to quote the names of variables in tmap functions:
st_join(adm, acled) %>%
group_by(NAME_0, NAME_1, NAME_2) %>%
summarize(attacks = log1p(sum(!is.na(event_id_cnty)))) %>%
tm_shape() +
tm_polygons('attacks', title = 'PKO targeting\nevents (logged)') +
tm_facets('NAME_0')
Much better so far! However, notice that tmap defaults to assuming that our attacks variable is discrete. We’ll need to tell it that it’s continuous. And what if we moved that legend down to the bottom right to get rid of the wasted space currently there?
adm %>%
st_join(acled) %>%
group_by(NAME_0, NAME_1, NAME_2) %>%
summarize(attacks = log1p(sum(!is.na(event_id_cnty)))) %>%
tm_shape() +
tm_polygons(col = 'attacks',
title = 'PKO targeting\nevents (logged)',
style = 'cont') + # continuous variable
tm_facets('NAME_0') +
tm_layout(legend.outside.position = "bottom", # legend outside below
legend.position = c(.8, 1.1)) # manually position legend
This is…fine. You’ll notice that there’s a lot of white space at the bottom of the plot, which I still haven’t figured out how to eliminate, and I personally prefer the color palette options available in ggplot2. Finally, there’s not much control over the legend compared to what you get with ggplot2, so let’s head back there and try to come at this problem from a different direction.
cowplot
While we’re still using ggplot2 to make individual plots, we need some way to combine them into a final plot. We can rely on the plot_grid() function in the cowplot library for that.4 We need to create five subplots, which we could do manually, but let’s do it programmatically because at some point you may need to do this for 27 different countries. The best way to store our five subplots is in a list, because lists in R can contain any type of R object as their elements.5 I’m going to use the map() function from the purrr package to accomplish this, but you could also use lapply(). map() takes a list as its first argument, .x, and a function as its second, .f. To see how map() works, look at the following example:
map(1:3, sample)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 1 2
##
## [[3]]
## [1] 2 3 1
map() returns a list of length 3 because our input .x was a vector of length three, and it applies the function .f to each element of .x. I’m going to use an anonymous function to filter adm to only contain ADM2s from one country at a time, then create our subplots separately like we did together above:
pko_countries <- c('Central African Republic', 'Democratic Republic of the Congo',
'Mali', 'South Sudan', 'Sudan')
## create maps in separate plots, force common scale between them
maps <- map(.x = pko_countries,
.f = function(x) adm %>%
filter(NAME_0 == x) %>%
st_join(acled) %>%
group_by(NAME_0, NAME_1, NAME_2) %>%
summarize(attacks = log1p(sum(!is.na(event_id_cnty)))) %>%
ggplot(aes(fill = attacks)) +
geom_sf(lwd = NA) +
scale_fill_continuous(name = 'PKO targeting\nevents (logged)') +
theme_rw() +
theme(axis.text = element_blank(),
axis.ticks = element_blank()))
We can either supply each individual subplot to plot_grid() separately, or we can use the plotlist argument to pass a list of plots; good thing we saved them in a list:
## use COWplot to combine and add single legend
plot_grid(plotlist = maps, labels = LETTERS[1:5], label_size = 10, nrow = 2)
I tried using the name of each country as the subplot label, but because label positioning is relative to the width of labels it was impossible to get them all nicely left-aligned. As a result, I had to settle on using letters to label the subplots and then identifying them in the figure caption in text. As you’ll see later, there’s no perfect way of accomplishing this and you’ll have to make a trade-off somewhere.
Setting aside that compromise, there’s still one issue with this plot that we can fix. We’re measuring the same thing (attacks on UN peacekeeping personnel) in all five choropleths, so there’s no need for five separate scales.
The cowplot documentation demonstrates how to use the get_legend() function to extract the legend from one of the subplots and then add it as another element to plot_grid(), placing it in the bottom right like we sort of managed to do with tmap. However, we need to add theme(legend.position = 'none') to the ggplot call for each subplot, otherwise we’ll just end up with six legends. That means we need to apply it to each element of our list of maps, which means it’s another job that map() is perfect for! We’ll use map() to take each subplot in maps and remove the legend from it, then use get_legend() to add a legend in the bottom right.
## use COWplot to combine and add single legend
plot_grid(plotlist = c(map(.x = maps,
.f = function(x) x + theme(legend.position = 'none'))),
get_legend(maps[[1]]),
labels = LETTERS[1:5], label_size = 10, nrow = 2)
This doesn’t look right! We told plot_grid() to start with our maps, so why is the legend the first thing in the plot? If you look closely at the documentation for plot_grid(), you’ll see that the ... argument comes before the plotlist argument in the function definition. Even when we specify plotlist first, the function will still add plotlist after the contents of ....6 To fix this, all we need to do is concatenate the results of get_legend() with the results of our call to map(). Note that we need to first transform the former to a list with list(), otherwise each element of it will be concatenated separately rather than as a single grob object:
## use COWplot to combine and add single legend
plot_grid(plotlist = c(map(.x = maps,
.f = function(x) x + theme(legend.position = 'none')),
list(get_legend(maps[[1]]))),
labels = LETTERS[1:5],
label_size = 10,
nrow = 2)
So far so good. But if we try using a different map in our call to get_legend(), things get weird:
## use COWplot to combine and add single legend
plot_grid(plotlist = c(map(.x = maps,
.f = function(x) x + theme(legend.position = 'none')),
list(get_legend(maps[[4]]))),
labels = LETTERS[1:5], label_size = 10, nrow = 2)
Each subplot has its own unique legend that’s automatically generated from the values of attacks it contains. This is even worse than it might seem at first glance, because it means that the various subplots are in no way comparable to one another!
To avoid misrepresenting the data, we need to ensure that each subplot has the same legend. The easiest way to do this is to manually set the legend for each subplot in our call to scale_fill_continuous(). Even though we’re manually setting the bounds of the legend, that doesn’t mean we have to hard code them. We can use a simpler version of our code to join attacks to ADM2s and then calculate the range of (logged) attacks across all countries in the data. Then we take advantage of the fact that scale_fill_continuous() can pass additional parameters to continuous_scale() via the ... argument. The continuous_scale() function is a low-level function used throughout ggplot2 to construct continuous scales, and it has a limits argument that sets the bounds of the scale. All we have to do is pass the minimum and maximum (logged) numbers of attacks in the data and we’re in business:
st_join(adm, acled) %>%
st_drop_geometry() %>% # we don't need a map at the end; drop geometry to speed up
group_by(NAME_0, NAME_1, NAME_2) %>%
summarize(attacks = log1p(sum(!is.na(event_id_cnty)))) %>%
pull(attacks) %>% # extract attacks variable
range() -> attacks_range # get min and max
## create maps in separate plots, force common scale between them
maps_shared <- map(.x = pko_countries,
.f = function(x) adm %>%
filter(NAME_0 == x) %>%
st_join(acled) %>%
group_by(NAME_0, NAME_1, NAME_2) %>%
summarize(attacks = log1p(sum(!is.na(event_id_cnty)))) %>%
ggplot(aes(fill = attacks)) +
geom_sf(lwd = NA) +
scale_fill_continuous(limits = attacks_range,
name = 'PKO targeting\nevents (logged)') +
theme_rw() +
theme(axis.text = element_blank(),
axis.ticks = element_blank()))
Now all that’s left is to use plot_grid() to put it all together:
## use COWplot to combine and add single legend
plot_grid(plotlist = c(map(.x = maps_shared,
.f = function(x) x + theme(legend.position = 'none')),
list(get_legend(maps_shared[[1]]))),
labels = LETTERS[1:5], label_size = 10, nrow = 2)
And unlike before, the legend is identical regardless of which subplot we use with get_legend():
## use COWplot to combine and add single legend
plot_grid(plotlist = c(map(.x = maps_shared,
.f = function(x) x + theme(legend.position = 'none')),
list(get_legend(maps_shared[[4]]))),
labels = LETTERS[1:5], label_size = 10, nrow = 2)
This approach is still useful even if you’re not working with spatial data. plot_grid() is powerful because it lets you make asymmetric arrangements like this example from the cowplot documentation:
p1 <- ggplot(mtcars, aes(disp, mpg)) +
geom_point()
p2 <- ggplot(mtcars, aes(qsec, mpg)) +
geom_point()
plot_grid(p1, p2, labels = c('A', 'B'), rel_widths = c(1, 2))
If the units you’re faceting by contain substantially different observations, you might end up in a situation where the automatically generated legends are different from one another. Manually creating the scale of the legend and ensuring it’s the same for all plots would solve this problem here, too.
Don’t let anyone convince you they know everything. I still haven’t managed to get my ideal solution (conditional on regular faceting with facet_wrap() being out of the question) to this problem working. I tried to create five subplots and just add a facet label to each, with each one being a facet of one panel. Straightforward enough, right?
maps_facet <- map(.x = pko_countries,
.f = function(x) adm %>%
filter(NAME_0 == x) %>%
st_join(acled) %>%
group_by(NAME_0, NAME_1, NAME_2) %>%
summarize(attacks = log1p(sum(!is.na(event_id_cnty)))) %>%
ggplot(aes(fill = attacks)) +
geom_sf(lwd = NA) +
scale_fill_continuous(limits = attacks_range,
name = 'PKO targeting\nevents (logged)') +
facet_wrap(~NAME_0) +
theme_rw() +
theme(axis.text = element_blank(),
axis.ticks = element_blank()))
plot_grid(plotlist = c(map(.x = maps_facet,
.f = function(x) x + theme(legend.position = 'none')),
list(get_legend(maps_facet[[1]]))),
nrow = 2)
Not so much, and no amount of tinkering with the align and axis arguments to plot_grid() has yielded any improvement. The specific paper this plot is for doesn’t have any other plots with facets, so I’m content to go with my inelegant solution of lettered labels and a key to them in the figure caption. If that weren’t the case, I might still be fiddling with this and getting deeper and deeper into the source code for plot_grid().
If you’re wondering why the largest county area is in the ballpark of 0.25, it’s because the data are in square degrees, an old non-SI unit of measurement that’s defined in terms of how much the field of view from a given point is obstructed by an object. GIS is so easy these days, folks. ↩
The more I learn about how ggplot2 and sf work under the hood, the more amazed I am that geom_sf() Just Works in 80% of cases, let alone works at all. ↩
The answer also listed the geom_spatial() function from the ggspatial package as an alternative option, but I couldn’t get it to work. The answer is three and a half years old, which means it’s very possible something changed in either sf or ggspatial that broke this solution. So it goes. ↩
It’s much more powerful and easily customizable than gridExtra::grid.arrange(). ↩
They can also contain heterogeneous elements which will come in handy later. ↩
If you check out the actual source code of plot_grid(), line 9 shows you that the function is indeed putting ... ahead of plotlist: plots <- c(list(...), plotlist). ↩
That means I’m spending a decent chunk of time thinking about and planning future trips. At some point in the process of doing this, I realized that I could use the GIS skills from my day job to help make planning future trips more efficient. In this post I walk through how you can use GIS tools in R to help with some of the route planning for a multiday backpacking trip. Specifically, how you can use open source spatial data on geography and transportation infrastructure to identify potential campsites along a hiking trail.
This was largely an exercise in seeing how I could apply GIS skills I’ve learned in the study of political violence to small-scale GPS navigation. I haven’t had the opportunity to hit the trail and test out any of the assumptions I use in this process yet, so you should view this post as more of a (loose) method than concrete suggestions. For a short and simple point-to-point hike with only one route, there’s really no need to engage in this level of GIS analysis. I’ve kept things simple to make them easier to follow, but this approach could actually be useful and save some time when planning a longer trip with many potential routes.
At some point in the future, I want to hike the Uwharrie Trail in Uwharrie National Forest in central North Carolina, near where I went to grad school. As I think about this (probably far off) trip, I’ve been using CalTopo to plan my route.
If you spend any amount of time in the outdoors, you should know about CalTopo. CalTopo is a website that lets you plan routes (hiking, skiing, rafting, etc.) on top of super high resolution topographic maps. You can then turn your smartphone into a full-featured GPS and use it to follow those routes (CalTopo offers a mobile app, as does Gaia GPS, both for about $20 a year). While the Uwharrie Trail is a pretty straightforward hike, I’ve been using this as an excuse to try and apply my GIS skills in a new context.
CalTopo is great, but it’s very point and click. I like doing things programmatically when I can, so that means it’s time to grab some of the open source data that CalTopo uses so we can play around with it in R. The base map in CalTopo is called MapBuilder Topo, and uses OpenStreetMap data as its starting point, so let’s start there.
This guide is intended to show how to identify potential backcountry campsites on public land where dispersed camping is permitted. If you are backpacking in an area with designated, maintained backcountry campsites, you should use them. Dispersed camping is typically permitted in less-traveled areas where the impact of campers is better minimized by diffusing it rather than concentrating it into a handful of designated sites.1
Always check regulations for any land you plan to camp on to see if there are specific requirements for site selection or areas where camping is prohibited. Picking an actual campsite requires identifying areas where your safety will be maximized and the long-term impact of your stay will be minimized. See this guide for the basics and this series for a slightly more hardcore set of principles to follow. And remember, never go into the wilderness without telling someone where you’re going and when you should be back.
OpenStreetMap (OSM) is an open source map of the entire globe; think of it as a hybrid of Google Maps and Wikipedia. OSM is designed so that anyone can easily add to or edit it. Setting aside the normative value of this perspective, this is helpful for us because it means that OSM is transparent. We can use the excellent osmdata R package to query OSM via the Overpass API, and we can use OSM itself via the OSM website to learn the various parameters we’ll use to query OSM.
The getting started vignette covers much of the basics of using osmdata. The key functions are osmdata::opq(), which builds a query to the Overpass API, and osmdata::add_osm_feature(), which requests specific features. OSM classifies features using key-value pairs, and we can use the OSM website to figure out just which pairs we need. Navigate to an area of interest, right-click on the feature of interest, and then select “query features.”
Next, select the desired feature in the dialog on the left of the screen. In this case, select the “Relation” rather than the “Path” because the path will only include one segment of the trail while the relation will include its entire length.
We can see here that the Uwharrie Trail relation has route=hiking, so that’s the key-value pair we’ll have to specify in our query.
Make sure to use the bbox argument to osmdata::opq(), otherwise you’ll request every hiking trail in the world! You can manually specify the four edges of a bounding box to search in, or you can use the osmdata::getbb() function to get it automatically using the Nominatim geocoder.
library(tidyverse)
library(sf)
library(osmdata)
## get hiking routes in Uwharrie National Forest
unf_trails <- opq(bbox = getbb('uwharrie national forest usa')) %>%
add_osm_feature(key = 'route', value = 'hiking') %>%
osmdata_sf()
Notice that we use the osmdata::osmdata_sf() function to convert the resulting object for use with the sf R package. Let’s inspect the resulting object of class osmdata_sf.
## inspect
unf_trails
## Object of class 'osmdata' with:
## $bbox : 35.3951403,-80.0236608,35.4351403,-79.9836608
## $overpass_call : The call submitted to the overpass API
## $meta : metadata including timestamp and version numbers
## $osm_points : 'sf' Simple Features Collection with 3341 points
## $osm_lines : 'sf' Simple Features Collection with 26 linestrings
## $osm_polygons : 'sf' Simple Features Collection with 0 polygons
## $osm_multilines : 'sf' Simple Features Collection with 1 multilinestrings
## $osm_multipolygons : NULL
We can see that the unf_trails object includes points, lines, polygons, multilines, and multipolygons. We want to use the lines since that will include any short trail segments that aren’t part of a larger trail. We can easily plot the trails using this object.
## plot
plot(unf_trails$osm_lines$geometry, col = 'coral4')
Let’s do some quick sanity checks. First, Wikipedia tells us the trail should be about 20 miles. We can use the sf::st_length() function to measure the length of each trail segment, and the sf::st_union() function to combine all segments. We’ll get our answer in meters, which, as a metric-deprived American, won’t be all that helpful to me. To get around this, we can use the units::set_units() function to convert from meters to miles.
## measure total trail length
st_union(unf_trails$osm_lines$geometry) %>% # combine all segments.
st_length() %>% # measure length
units::set_units(mi) # convert to miles
## 28.26457 [mi]
While that’s initially concerning, a closer reading of the Wikipedia article for the trail reveals that it was originally 40 miles long, so OSM likely includes some of the Northern section of the trail beyond what’s officially recognized today.
We should also plot the bounding box that osmdata::getbb() ends up generating to ensure we’re not missing any part of the trail. We can do this with the OpenStreetMap R package (https://cran.r-project.org/package=OpenStreetMap). Here we unfortunately need to manually specify the bounding box as a series of two vectors with the latitude and longitude coordinates of the upper-left and lower-right corners of the box. OpenStreetMap::openmap() uses (latitude, longitude) pairs, not (longitude, latitude) pairs as is more common in GIS, i.e., (y, x) not (x, y), so be sure to include them in that order.[^lat-long] OpenStreetMap::openproj() also requires a projection argument, so I use sf::st_crs(4326)$proj4string to generate one automatically, ensuring I don’t introduce a typo somewhere by accident.
[^lat-long]: I spent 20 minutes not understanding why I couldn’t get this to work before I finally read the documentation. Don’t be like me, folks.
library(OpenStreetMap)
## get bounding box
unf_bb <- getbb('uwharrie national forest usa')
## get OSM tiles
unf_tile <- openmap(c(unf_bb[2,1], # lat
unf_bb[1,1]), # long
c(unf_bb[2,2], # lat
unf_bb[1,2]), # long
type = 'osm', mergeTiles = T)
## project map tiles and plot (OSM comes in Mercator...)
plot(openproj(unf_tile), projection = st_crs(4326)$proj4string)
## plot trails
plot(unf_trails$osm_lines$geometry, add = T, col = 'coral4')
Uh oh. We can see that we’re only getting a small portion of the total trail and that it trails (heh) off the map on three sides. That’s not great, so let’s fix it. We can start by looking up Uwharrie National Forest itself on the OSM website. This gives us the boundaries of the official forest land in orange.
We can see from the dialog on the left that the forest’s OSM ID is 2918413, so we can use the osmdata::opq_osm_id() function to get the polygons for the forest’s boundaries. Let’s grab the forest boundaries and plot them, along with the bounding box they imply and the bounding box that resulted from osmdata::getbb() (in red) for comparison.
## get Uwharrie National Forest Boundaries
unf <- opq_osm_id(type = 'relation', id = 2918413) %>%
osmdata_sf()
## plot Uwharrie National Forest polygons
plot(unf$osm_multipolygons$geometry, col = 'lightgreen', border = NA, bty = 'n')
## construct line for original bounding box
plot(st_multilinestring(list(matrix(c(unf_bb[1, 1], unf_bb[2, 1],
unf_bb[1, 1], unf_bb[2, 2],
unf_bb[1, 2], unf_bb[2, 2],
unf_bb[1, 2], unf_bb[2, 1],
unf_bb[1, 1], unf_bb[2, 1]),
ncol = 2, byrow = T))),
add = T, col = 'red')
## plot bounding box for Uwharrie National Forest polygons
plot(st_as_sfc(st_bbox(unf$osm_multipolygons)), add = T)
## plot trails
plot(unf_trails$osm_lines$geometry, add = T, col = 'coral4')
Wow, we were missing a lot before. Let’s use the bounding box for the entire forest as our new bounding box. First, we plot OSM using this new bounding box. st_bbox() yields a vector of four numbers, rather than the matrix that osmdata::getbb() produces, so we need to work around this and specify the top-left and bottom-right corners of our new, bigger bounding box.
## get OSM tile for Uwharrie National Forest polygons
unf_full_tile <- openmap(c(st_bbox(unf$osm_multipolygons)[4], # lat
st_bbox(unf$osm_multipolygons)[1]), # long
c(st_bbox(unf$osm_multipolygons)[2], # lat
st_bbox(unf$osm_multipolygons)[3]), # long
type = 'osm', mergeTiles = T)
## project and plot OSM tile
plot(openproj(unf_full_tile), projection = st_crs(4326)$proj4string)
## plot trails
plot(unf_trails$osm_lines$geometry, add = T, col = 'coral4')
That’s much better! We’re getting a lot of area beyond the trail, but it’s easy to filter that out later so it’s better to grab too much than too little.
Now we can go back and grab all hiking trails in Uwharrie National Forest using our new bounding box. osmdata::opq() expects a bounding box in a certain format, so let’s inspect it to see what we’re working with and what we need to reshape the output of sf::st_bbox(unf$osm_multipolygons) into:
## bbox format osmdata::opq() expects
unf_bb
## min max
## x -80.02366 -79.98366
## y 35.39514 35.43514
## rearrange sf::st_bbox() output
matrix(st_bbox(unf$osm_multipolygons), ncol = 2,
dimnames = list(c('x', 'y'), c('min', 'max')))
## min max
## x -80.17085 -79.73170
## y 35.21987 35.63684
Note that I’m specifying row and column names when creating the new bounding box. Without them, osmdata::opq() will fail! We can now plug this new bounding box object into osmdata::opq() and get all hiking routes in the forest.
## get hiking trails in all of Uwharrie National Forest
unf_trails_full <- opq(bbox = matrix(st_bbox(unf$osm_multipolygons), ncol = 2,
dimnames = list(c('x', 'y'), c('min', 'max')))) %>%
add_osm_feature(key = 'route', value = 'hiking') %>%
osmdata_sf()
## plot
plot(unf_trails_full$osm_lines$geometry, col = 'coral4')
Now we’re getting a bunch of trails across the Pee Dee River in Morrow Mountain State Park. Again, it’s easy to drop these extra trails later, so for the moment, more complete is better than less complete. These data come from OpenStreetMap, so they also include lots of usable data. Let’s take a look at the fields included in our lines:
## inspect
glimpse(unf_trails_full$osm_lines)
## Rows: 106
## Columns: 37
## $ osm_id <chr> "32024414", "216945232", "216945234", "216945241", …
## $ name <chr> "Uwharrie Trail", "Mountain Loop Trail", "Mountain …
## $ alt_name <chr> "Uwharrie National Recreation Trail", NA, NA, NA, N…
## $ bicycle <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no…
## $ bridge <chr> NA, NA, "yes", "yes", NA, NA, "boardwalk", NA, NA, …
## $ construction <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ dog <chr> NA, "leashed", "leashed", "leashed", "leashed", NA,…
## $ foot <chr> "designated", "designated", "designated", "designat…
## $ footway <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ highway <chr> "path", "path", "path", "path", "path", "path", "pa…
## $ horse <chr> NA, "no", "no", "no", "no", "no", NA, NA, NA, NA, "…
## $ lanes <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ layer <chr> NA, NA, "1", "1", NA, NA, "1", NA, NA, NA, NA, NA, …
## $ motor_vehicle <chr> NA, "no", "no", "no", "no", "no", NA, NA, NA, NA, "…
## $ name_1 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ oneway <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ rcn_ref <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ sac_scale <chr> NA, "mountain_hiking", "mountain_hiking", "mountain…
## $ service <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "parkin…
## $ smoothness <chr> NA, "bad", "good", "good", "bad", NA, NA, NA, NA, N…
## $ source <chr> NA, NA, NA, NA, NA, "GPS_2009", "GPS_2009", "GPS_20…
## $ surface <chr> "dirt", "ground", "wood", "wood", "ground", "ground…
## $ symbol <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "wh…
## $ tiger.cfcc <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ tiger.county <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ tiger.name_base <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ tiger.name_base_1 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ tiger.name_type <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ tiger.reviewed <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ tiger.zip_left <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ tiger.zip_left_1 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ tiger.zip_right <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ tiger.zip_right_1 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ tracktype <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ trail_visibility <chr> NA, "excellent", "excellent", "excellent", "excelle…
## $ wheelchair <chr> NA, "no", "no", "no", "no", NA, NA, NA, NA, NA, "no…
## $ geometry <LINESTRING [°]> LINESTRING (-80.0435 35.310..., LINESTRI…
We can use the “name” field to subset the data. If you were considering some parallel or spur trails, you could use sf::st_filter() in combination with sf::st_is_within_distance() to instead just grab trails near your primary trail.
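For instance, a rough sketch of that alternative (the 500 m cutoff is an arbitrary choice for illustration) might look like this:

## hypothetical: keep the main trail plus any other trails within 500 m of it
main_trail <- unf_trails_full$osm_lines %>% filter(name == 'Uwharrie Trail')
nearby_trails <- unf_trails_full$osm_lines %>%
  st_filter(main_trail, .predicate = st_is_within_distance, dist = 500)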
## extract OSM lines and filter
ut <- unf_trails_full$osm_lines %>% filter(name == 'Uwharrie Trail')
Now we’ve gotten the Uwharrie Trail twice: once using a smaller bounding box and once using a larger one. We can plot them both and see if there were any segments the initial query missed.
## plot
plot(ut$geometry, col = 'red')
plot(unf_trails$osm_lines$geometry, add = T, col = 'coral4')
Luckily the initial query still picked up every segment, but that won’t always be the case if you start with an inaccurate initial bounding box. If the entire Uwharrie Trail wasn’t collected into a relation, we might have missed large chunks of it on either end. Now we can use the bounding box for the Uwharrie Trail to capture any other features we care about nearby.
The first other feature we need is water. On any multi-day trip, being able to refill your water is essential. The OSM wiki page on waterways shows us that the values we need to grab relevant water sources are river and stream. Although not well documented, you can supply multiple values to the value argument of osmdata::add_osm_feature() using c(). This will let us quickly and easily grab both rivers and streams in the area.2
## create bbox for just the Uwharrie Trail; no need for all water in the whole National Forest
ut_bb <- matrix(st_bbox(ut), ncol = 2, dimnames = list(c('x', 'y'), c('min', 'max')))
## get rivers and streams and extract OSM lines
ut_water <- opq(bbox = ut_bb) %>%
add_osm_feature(key = 'waterway', value = c('river', 'stream')) %>%
osmdata_sf()
Our next step will be to drop any water sources more than a kilometer from the trail. This will simplify our analysis later and also minimize our environmental impact. To conduct GIS operations in meters, we need to project our data from latitude- and longitude-based WGS84 to a meter-based coordinate reference system (CRS). The CRS database epsg.io shows that NAD83 / North Carolina (EPSG:32119) is the projection for data in North Carolina, so we use sf::st_transform() along with sf::st_crs() to project our trail and water source objects. This lets us calculate distances in feet/meters rather than decimal degrees. We’ll use this to limit the water features to those that fall within 1 km of the trail. This way we’re not limiting ourselves to only water features that directly intersect the trail, but we’re also not retaining a bunch of features that are farther off-trail than I like to hike for water.
## project trail
ut <- st_transform(ut, st_crs(32119))
## project water sources
ut_water <- ut_water$osm_lines %>%
st_transform(st_crs(32119)) %>%
st_filter(ut, .predicate = st_is_within_distance, dist = 1000)
## plot
plot(ut_water$geometry, col = 'lightblue')
plot(ut$geometry, add = T, col = 'coral4')
If we want to be near water, we want to be far from roads. OpenStreetMap has lots of different categories of roads, so we’ll want to capture all the major ones, as well as service roads and “tracks”, which is how OpenStreetMap refers to forest roads.3 OSM identifies roads with the key “highway,” and inspecting the OSM wiki page on roads shows us the various values we’ll need to grab all relevant roads.
## get roads, project, and limit to w/in 1000 m of trail
ut_roads <- opq(bbox = ut_bb) %>%
add_osm_feature(key = 'highway',
value = c('primary', 'secondary', 'tertiary', 'residential',
'unclassified', 'track', 'service')) %>%
osmdata_sf() %>%
magrittr::extract2('osm_lines') %>%
st_transform(st_crs(32119)) %>%
st_filter(ut, .predicate = st_is_within_distance, dist = 1000)
## plot
plot(ut_roads$geometry, col = 'black')
plot(ut$geometry, add = T, col = 'coral4')
Note the use of magrittr::extract2() to extract the osm_lines object from the osmdata_sf object returned by osmdata::osmdata_sf(). This is how you can access a list element in a pipeline, and it is equivalent to $osm_lines.
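To make the equivalence concrete, both of these return the same object:
## two ways to pull the lines layer out of an osmdata_sf object
a <- unf_trails_full$osm_lines
b <- unf_trails_full %>% magrittr::extract2('osm_lines')
identical(a, b) # TRUE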
To locate potential campsites we need to identify our priorities and use them to define a set of rules for selecting potential sites. For this exercise, I’m using the following:
I’d like to be within 750 feet of a water source. Some (more hardcore) backpackers prefer to be farther away from water sources to minimize the chance of encountering animals. Since Uwharrie National Forest isn’t an area with heightened bear activity, I’m willing to trade the chance of a raccoon sniffing around my bear canister for a shorter walk to refill my water.
The US Forest Service requires that you camp at least 200 feet away from any water source. This is good practice everywhere, but it’s required in National Forests, so we want to make sure any potential campsites are at least 200 feet from any water features.
The Uwharrie Trail is a fairly heavily-trafficked trail, so I’d like to avoid going more than 1/4 mile off-trail to find a campsite. This will minimize the disturbance to the surrounding area.4 All of the semi-official campsites on the Uwharrie Trail are a good ways off the trail itself, so staying near the trail will contain my impact on a large scale, but minimize it locally.
If you’re not in a designated campsite, you should be at least 200 feet away from any trail. Again, this seeks to minimize your impact on the area by spreading out campsites over time.
If I’m making the effort to carry my shelter, sleep system, and food on my back, you better believe I don’t want to be hearing any cars at night. To try and minimize the chances of this happening, I want to be at least 1,000 feet from any roads. The lower section of the trail skirts particularly close to a residential neighborhood, so this is an important consideration.
I’m going to drop any potential campsites smaller than 0.1 km^2. Choosing where to actually pitch your tent within a potential site area requires many considerations like drainage, wind exposure, and avoiding dead trees overhead. This means that we want to have ample space in which to find the ideal tent spot, so dropping small potential sites reduces the possibility of arriving at a spot and finding that there’s no good place for your tent.
With all of those factors in mind, we can now define our potential campsites and then narrow them down. I start by buffering the rivers and streams by 750 feet with sf::st_buffer(), which gives us every area within 750 feet of a water source. Then I move down my list of conditions, buffering the relevant feature and using sf::st_intersection() when I want to ensure I stay within a given distance of that feature and sf::st_difference() when I want to stay a given distance away from that feature.
Since NAD83 uses meters as the unit of measurement, we need to convert these distances in feet into meters. Again, the units package makes this easy with the units::set_units() function.
## buffer water 750 ft
campsites <- st_union(ut_water) %>%
st_buffer(dist = units::set_units(750, ft))
## buffer water 200 ft and subtract
campsites <- st_union(ut_water) %>%
st_buffer(dist = units::set_units(200, ft)) %>%
st_difference(x = campsites)
## buffer trail 1/4 mile and intersect
campsites <- st_union(ut) %>%
st_buffer(dist = units::set_units(.25, mi)) %>%
st_intersection(x = campsites)
## buffer trail 200 ft and subtract
campsites <- st_union(ut) %>%
st_buffer(dist = units::set_units(200, ft)) %>%
st_difference(x = campsites)
## buffer roads to 1000 ft and subtract
campsites <- st_union(ut_roads) %>%
st_buffer(dist = units::set_units(1000, ft)) %>%
st_difference(x = campsites)
## cast multipolygon to polygons and convert to sf
campsites <- campsites %>%
st_cast('POLYGON') %>%
st_sf() %>%
mutate(id = 1:n()) %>% # create ID variable
filter(st_area(.) > units::set_units(.1, km^2)) # filter to > .1 sq km
The animation below shows each step in the process in order:
So far we haven’t really done anything that you couldn’t do on CalTopo, albeit in a less programmatic way. Let’s change that by bringing in some elevation data. Elevation is important when hiking because it determines how many climbs your lungs will have to endure and how many descents your knees will. CalTopo has great built-in tools for generating elevation profiles and more detailed terrain statistics that can tell you what to expect along a given route. However, you can only calculate them for lines or polygons you’ve manually drawn.
While we could import the potential campsite polygons we’ve just generated into CalTopo and then calculate the terrain statistics, this has two major drawbacks. First, you have to point and click through generating the report for each polygon because there’s no way to batch process. Second, and more importantly, this would use a lot of processing power and computing time on CalTopo’s servers. If, unlike me, you have a paid subscription, you might feel less bad about this, but I’m trying not to take advantage of such an awesome service that CalTopo currently provides for free.
We can use R’s capabilities to handle raster data to solve both of these problems! The elevatr package lets you easily download elevation data in the form of a digital elevation model. These models combine multiple measurements from satellites to produce a single image of the earth where the brightness of each pixel represents the elevation of a given area. elevatr allows you to easily access elevation data compiled from a number of different data sources. The main function is elevatr::get_elev_raster(), which takes an sf object as its first argument and z, a zoom level from 1 to 14. We can also specify the clip = 'bbox' argument to crop the resulting raster to just the bounding box of our potential campsites, and not the entire tile they fall in.
library(raster)
library(elevatr)
## get elevation raster and clip to bbox
elev <- get_elev_raster(campsites, z = 13, clip = 'bbox')
## plot to inspect
plot(elev, col = grey(1:100/100))
plot(ut$geometry, add = T, col = 'coral4')
Since we can see that the highest point in the area is only about 300 feet above sea level, we don’t need to worry about absolute elevation when picking potential sites. Instead, we want to know how level these areas are; no one wants to wake up smushed against the downhill wall of their tent. We can use the raster::terrain() function to calculate the slope in each pixel.
## calculate slope
camp_slope <- terrain(elev, opt = 'slope', unit = 'degrees')
## plot slope
plot(camp_slope)
plot(ut$geometry, add = T, col = 'coral4')
All that’s left to do is aggregate slope measures to each polygon, and then calculate some sort of summary statistic to tell us how steep each potential site is overall. I’m going to use the median of each area’s slope rather than its average to avoid giving undue influence to outliers (if a 0.5 km^2 area is largely flat with a cliff at one edge, then it’s likely still a good candidate for a campsite). Let’s filter out all areas with a median slope of more than 10°.
## calculate median slope for each polygon and filter
campsites <- campsites %>%
mutate(med_slope = (raster::extract(camp_slope, ., fun = median, na.rm = T))) %>%
filter(med_slope < 10)
With that done, we can now plot our potential campsite locations and all the features used to define them:
## plot campsites and all features
plot(ut$geometry, type = 'n')
plot(campsites$geometry, add = T, col = 'lightgreen', border = NA)
plot(ut_water$geometry, add = T, col = 'lightblue')
plot(ut_roads$geometry, add = T, col = 'black')
plot(ut$geometry, add = T, col = 'coral4')
This is a pretty picture, but it’s not very useful. To make it so that we can actually navigate to any of these spots, we need to get them onto a topographic map.
To make our map usable, all we have to do is export the potential campsite polygons from R so that we can import them into CalTopo. CalTopo supports a number of file formats for importing, but the one we want to use is GeoJSON. We can use the geojsonio package to easily convert our polygons from sf objects to GeoJSON format and then save them to disk to import into CalTopo.5
There are two (potentially) tricky things we need to do. First, make sure we reproject our NAD83 data back to decimal degree-based WGS84 so that CalTopo can properly reference them. Second, we want to take advantage of R’s capabilities to efficiently wrangle data and create a name field for our polygons so they’ll be easy to identify and reference once they’re in CalTopo. To do this, we need to create a “title” field in our sf object before we convert it to GeoJSON.6
library(geojsonio)
## create site number field; transmute b/c all fields other than label are lost on import
campsites %>%
transmute(title = str_c('Potential Site ', row_number())) %>%
st_transform(st_crs(4326)) %>% # project to WGS84
geojson_json() %>%
geojson_write(file = 'campsites.json')
## export Uwharrie Trail to save the trouble of tracing it
ut %>%
st_transform(st_crs(4326)) %>% # project to WGS84
geojson_json() %>%
geojson_write(file = 'trail.json')
At this point all that’s left to do is click the “Import” button in CalTopo and select your newly created .json file. You can check out the potential campsites live on CalTopo below:
Some closing thoughts viewing the potential sites in context on CalTopo:
See here for a discussion of different types of campsites and contexts in which they are usually found. ↩
If we didn’t do this, we’d have to use c() to combine multiple osmdata_sf objects and then extract the osm_lines object from the combined osmdata_sf object. ↩
The US Forest Service maintains GIS data on forest roads on National Forest land, but the API to access them is…less than user friendly so I’m ignoring them for this illustration. ↩
In very sparsely-traveled areas, it can be better to seek out campsites far from the trail to avoid camping in areas where others have recently stayed. This can help prevent the emergence of ‘social’ campsites that are not officially recognized or maintained but are frequently used. It will also reduce the chance that you’ll encounter any local wildlife that have learned that such spots can be a source of easy meals. ↩
Want to find potential campsites for a trail that’s not in OpenStreetMap? CalTopo supports exports as well as imports, so you can trace the route in CalTopo, export it, load it into R with sf::st_read(), and then carry out the steps above! ↩
CalTopo refers to an object’s name field as its “Label” in the interface, but this isn’t what it’s called under the hood. I had to export a line I created and inspect the resulting .json file to find out that it’s referred to as a “title” instead. ↩
Update (05/19/2021): John MacFarlane helpfully pointed out that this is all incredibly unnecessary because pandoc makes it easy to add support for footnotes to GitHub-Flavored Markdown. The documentation notes that you can add extensions to output formats that don’t normally support them. Since standard markdown natively supports footnotes when used as an output format, I didn’t even think to look into manually enabling them for GitHub-Flavored Markdown.
If you’re running pandoc from the command line, all you need to do is add -t gfm+footnotes to your pandoc command. If you’re working with .Rmd files like me, all you need to do is add +footnotes to the end of the variant: gfm line in your YAML header (a minimal sketch of the relevant block is below). As a side benefit, you can drop the --wrap=preserve flag and end up with .md files that aren’t hundreds of columns wide. I’m leaving the original post up below in case anyone who has an even weirder use case than me might find it helpful, or in case any of my students ever stumble across this page and don’t believe that I’m still constantly learning, too.
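Under that approach, the output section of the YAML header reduces to something like this (a sketch; the other fields stay as they are):
output:
  md_document:
    variant: gfm+footnotes
    preserve_yaml: TRUE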
I use jekyll to create my website. Jekyll converts Markdown files into the HTML that your browser renders into the pages you see. As others and I have written before, it’s pretty easy to use R Markdown to generate pages with R code and output all together. One thing has consistently eluded me, however: footnotes.
Every time I try to include footnotes in my .Rmd file, they end up mangled and not actually footnotes in the final HTML page. My solution thus far has been to just avoid footnotes and lean heavily on parenthetical asides when I’m using R Markdown to generate a page. My recent post on using SQL style filtering to preprocess large spatial datasets before loading them into memory needed a whopping six footnotes, so I finally had to sit down and figure it out.
The ‘standard’ method for adding footnotes in R Markdown is actually a bit of a cheat compared to the method in the official Markdown specification. R Markdown lets you use a LaTeX-esque syntax for defining footnotes:
Here is some body text.^[This footnote will appear at the bottom of the page.]
However, Jekyll uses the official Markdown specification for footnotes, so this won’t work. Instead, we need to define them with the official syntax:
Here is some body text.[^1]
[^1]: This footnote will appear at the bottom of the page.
However, when R Markdown converts your file from standard Markdown to GitHub-Flavored Markdown, something strange happens and the output in your .md file will look like this:
Here is some body text.\[1\]
1. This footnote will appear at the bottom of the page.
When Jekyll converts the Markdown file to HTML, you end up with a sad lonely unclickable [1] where your footnote should go. The content of the footnote does appear at the bottom of the page, but it lacks the footnote formatting so it just looks like regular text and there’s no link to click and return to the footnote’s place in the text.
Understanding what’s happening here (and thus how to fix it) requires a slightly detailed explanation of what exactly happens when you hit that Knit button in RStudio. First, the knitr package runs all of the code in your .Rmd file and creates a .md file. Next, pandoc takes the .md file and converts it to whatever output format you want.1
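If you ever want to run those two steps by hand rather than through the Knit button, a rough sketch (file names hypothetical) looks like this:
## step 1: knitr evaluates the R code in the .Rmd and writes a plain .md file
knitr::knit('post.Rmd', output = 'post.md')
## step 2: pandoc converts that .md file to the desired output format
rmarkdown::pandoc_convert('post.md', to = 'gfm', output = 'post-gfm.md')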
Pandoc is the source of our problems here. The square brackets that set off a footnote are metacharacters in Markdown, since they’re used to construct links (among other things, like citations with pandoc-citeproc). When Pandoc sees them in the process of converting from standard Markdown to GitHub-Flavored Markdown, it (logically) decides that they’re important content and escapes them with a backslash so they survive as literal text in the GitHub-Flavored Markdown. Unfortunately for us, we want our square brackets to be treated as special characters, not turned into text. This is a known issue with Pandoc (see this issue on GitHub), so it will eventually get fixed, but in the meantime I’ve come up with a workaround.
Pandoc allows you to tag both code chunks and inline code with a special raw attribute which will ensure they’re passed on to the output format unmodified. To do this, just enclose any text with backticks (`) and then put {=markdown} immediately after the closing backtick. This will ensure that Pandoc doesn’t alter the ‘code’ in the backticks at all. It’s debatable whether the [^1] used to define a footnote is really code, but for our purposes treating it like code will ensure that our footnotes work in the final output:
Here is some body text.`[^1]`{=markdown}
`[^1]:`{=markdown} This footnote will appear at the bottom of the page.
There’s one more tweak we have to make to get this to work. If any of your footnotes are longer than 72 characters,2 then Pandoc will split them up and divide them into multiple lines in the output .md file. Since footnotes need to be all on the same line, this will break them and you’ll have a bunch of sentence fragments at the end of your page right above the equally fragmented footnotes. To fix this, we need to use the --wrap argument to Pandoc in our YAML header. Below is the YAML header for the .Rmd file that generates the .md file that Jekyll uses to generate the HTML your browser uses to render this page.
---
title: Footnotes in `.Rmd` files
output:
  md_document:
    variant: gfm
    preserve_yaml: TRUE
    pandoc_args:
      - "--wrap=preserve"
knit: (function(inputFile, encoding) {
  rmarkdown::render(inputFile, encoding = encoding, output_dir = "../_posts") })
date: 2020-10-26
permalink: /posts/2020/10/jeykll-footnotes
excerpt_separator: <!--more-->
toc: true
tags:
  - jekyll
  - rmarkdown
---
By specifying --wrap=preserve, we tell Pandoc to respect the line breaks present in the .Rmd file when generating the .md file.3 Accordingly, our footnotes will be intact and functional in the final web page.
And now, to prove to you that this post really did start out as a .Rmd file, here’s some R code and a plot. Everyone’s seen mtcars a million times, and it turns out that iris was originally published in the Annals of Eugenics, so I went digging for a new built-in dataset.4 I landed on the Loblolly pines dataset, which records the heights of 14 different loblolly pine trees over time.5
library(ggplot2)
ggplot(Loblolly, aes(x = age, y = height, group = Seed)) +
geom_line(alpha = .5) +
labs(x = 'Age (years)', y = 'Height (feet)') +
theme_bw()
It looks like all of the trees in the sample followed a pretty similar growth trajectory! Finally, to really, really prove this page started out as a .Rmd file, here’s the sessionInfo():
sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-apple-darwin17.7.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
##
## Matrix products: default
## BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLAPACK.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_3.3.2
##
## loaded via a namespace (and not attached):
## [1] rstudioapi_0.11 knitr_1.30 magrittr_1.5
## [4] tidyselect_1.1.0 munsell_0.5.0 colorspace_1.4-1
## [7] here_0.1 R6_2.4.1 rlang_0.4.8
## [10] dplyr_1.0.2 stringr_1.4.0 tools_4.0.2
## [13] grid_4.0.2 gtable_0.3.0 xfun_0.18
## [16] withr_2.3.0 htmltools_0.5.0.9001 ellipsis_0.3.1
## [19] yaml_2.2.1 rprojroot_1.3-2 digest_0.6.25
## [22] tibble_3.0.4 lifecycle_0.2.0 crayon_1.3.4
## [25] purrr_0.3.4 vctrs_0.3.4 glue_1.4.2
## [28] evaluate_0.14 rmarkdown_2.3 stringi_1.5.3
## [31] compiler_4.0.2 pillar_1.4.6 generics_0.0.2
## [34] scales_1.1.1 backports_1.1.10 pkgconfig_2.0.3
Pandoc is incredibly powerful, but it’s also incredibly opaque and difficult to learn. You can create incredibly fancy PDF and HTML documents in R Markdown without ever having to know anything about Pandoc. ↩
The default output width defined by the --columns argument to Pandoc. ↩
You can also use --wrap=none, which will put every paragraph in a single gigantic line of text. ↩
If you’re willing to install additional packages, Allison Horst’s palmerpenguins package is fantastic and fills much the same educational niche as iris. See here for even more alternatives. ↩
Fun fact, loblolly pine seeds were carried aboard Apollo 14 and subsequently planted throughout the US. ↩
As a result, I finally had to teach myself how to break large spatial datasets into more manageable chunks. In the process I learned a little SQL and a lot about the underlying software libraries that power the r-spatial ecosystem of R packages. In this post, I walk through the workflow I developed for this task and explain the logic behind each step.
The general idea is to work with data ‘on disk’ instead of ‘in memory’. Normally, when you load a dataset into R, your computer reads it from whatever storage media it uses (hard drive or solid state drive) into memory (RAM). Memory is considerably faster to read from and write to than storage, which is what lets you complete simple operations in R in the blink of an eye. Most consumer computers have much more storage than RAM (my 2015 MacBook Pro has 256 GB of storage and 8 GB of memory) so it’s very possible to end up with a dataset larger than your computer’s memory. In fact, it doesn’t have to be anywhere near the size of your computer’s memory to bump into this limit because every other application you have running uses up memory as well.
To deal with this issue, you can extract just the parts of a dataset you need to work with at any given time; this subset will be loaded into memory, and the rest remain on disk and invisible to R1. There are a couple of R packages that exist for dealing with this issue, such as bigmemory for basic R data types like numerics or disk.frame for dplyr-compatible operations, but neither supports spatial data.
I’m going to use the cshapes dataset to illustrate and explain this workflow2. You can download and extract it from within R:
## download cshapes dataset
download.file('http://downloads.weidmann.ws/cshapes/Shapefiles/cshapes_0.6.zip',
'cshapes.zip')
## extract cshapes dataset
unzip('cshapes.zip')
## check that dataset extracted correctly
list.files(path = '.', pattern = 'cshapes')
## [1] "cshapes_shapefile_documentation.txt" "cshapes.dbf"
## [3] "cshapes.prj" "cshapes.shp"
## [5] "cshapes.shx" "cshapes.zip"
Then use the sf package to load the data and check them out:
## load packages
library(tidyverse)
library(sf)
## read in cshapes
cshapes <- st_read('cshapes.shp')
The cshapes dataset is specifically designed to be easy to load and manipulate on a conventional laptop computer. To do this, it sacrifices a significant degree of detail in the polygons that represent each individual state. For many analyses, this is fine and won’t affect the results. However, sometimes you need to measure the length of borders between states, and the coastline paradox dictates that you use the highest-resolution spatial data possible. In that case, the data might be too large for your computer to hold in memory. If that’s the case, then it’s time to start thinking about leaving the data on disk and only loading what you really need at any given point.
Luckily the sf package supports SQL queries to filter the data on disk and only read in a subset of the total data. SQL is a language for interacting with relational databases, and is incredibly fast compared to loading data into R and then filtering it. SQL has many variants, referred to as dialects, and the sf package uses one called the OGR SQL dialect to interact with spatial datasets. The basic structure of a SQL call is SELECT col FROM "table" WHERE cond.
SELECT tells the database what columns (fields in SQL parlance) we want
FROM tells the database what table (databases can have many tables) to select those columns from
WHERE tells the database we only want rows where some condition is true
If you use the tidyverse a lot, this may seem familiar to you because it’s pretty similar to dplyr syntax, except dplyr already knows which data frame you want to work with. If we want to only load one polygon at a time into R, then we need to know the field (or combination of fields) that uniquely identifies a polygon. To demonstrate, let’s load just the polygon for Morocco that begins in 1976 when it annexed the Northern part of Western Sahara. Let’s cheat by looking at the data I’ve loaded into R:
## filter to Morocco beginning in 1976
cshapes %>% filter(CNTRY_NAME == 'Morocco', GWSYEAR == 1976)
## Simple feature collection with 1 feature and 24 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -15.22687 ymin: 23.11465 xmax: -1.011809 ymax: 35.91916
## geographic CRS: WGS 84
## CNTRY_NAME AREA CAPNAME CAPLONG CAPLAT FEATUREID COWCODE COWSYEAR COWSMONTH COWSDAY COWEYEAR
## 1 Morocco 576351.8 Rabat -6.83 34.02 220 600 1976 4 1 1979
## COWEMONTH COWEDAY GWCODE GWSYEAR GWSMONTH GWSDAY GWEYEAR GWEMONTH GWEDAY ISONAME ISO1NUM ISO1AL2
## 1 8 4 600 1976 4 1 1979 8 4 Morocco 504 MA
## ISO1AL3 geometry
## 1 MAR MULTIPOLYGON (((-4.420418 3...
The cshapes dataset records when states change territorial boundaries or capital locations, so the combination of a state name or identifier and a start or end date uniquely identifies all rows in the data. Since this polygon begins on April 1, 1976 and the Gleditsch and Ward code for Morocco is 600, plugging it all into the query argument to st_read() gets us:
## read in morocco polygon
morocco <- st_read('cshapes.shp',
query = 'SELECT * FROM "cshapes" WHERE GWCODE = 600 AND GWSYEAR = 1976 AND GWSMONTH = 4 AND GWSDAY = 1')
## verify country name
morocco$CNTRY_NAME
## [1] "Morocco"
Awesome! We were able to read in just one polygon from the cshapes dataset. Note that * means all columns. As I mentioned above, this is cheating, since we had to read the whole dataset into R with a standard st_read() call to learn the names and values of the variables we then filtered on.
When this isn’t an option, we can sneak a peek at the data by loading just the first observation into R. This requires significantly less memory than loading an entire dataset, and can give us the information we need to filter the full dataset and read in one observation at a time. Most SQL implementations don’t have row numbers, so it’s hard to just grab the first row of the data for this purpose. However, the OGR SQL dialect documentation notes that it implements a special field called FID that is a feature ID, i.e., a row number. We can take advantage of FID to select the first polygon from the data using the query argument to st_read() again:
## read in first row of the data
cshapes_row <- st_read('cshapes.shp', query = 'SELECT * FROM "cshapes" WHERE FID = 1')
## inspect
cshapes_row
## Simple feature collection with 1 feature and 24 fields
## geometry type: POLYGON
## dimension: XY
## bbox: xmin: -58.0714 ymin: 1.836245 xmax: -53.98612 ymax: 6.001809
## geographic CRS: WGS 84
## CNTRY_NAME AREA CAPNAME CAPLONG CAPLAT FEATUREID COWCODE COWSYEAR COWSMONTH COWSDAY
## 1 Suriname 145952.3 Paramaribo -55.2 5.833333 1 115 1975 11 25
## COWEYEAR COWEMONTH COWEDAY GWCODE GWSYEAR GWSMONTH GWSDAY GWEYEAR GWEMONTH GWEDAY ISONAME
## 1 2016 6 30 115 1975 11 25 2016 6 30 Suriname
## ISO1NUM ISO1AL2 ISO1AL3 _ogr_geometry_
## 1 740 SR SUR POLYGON ((-55.12796 5.82217...
Even if we knew that the data had an ID column and start and end dates, we wouldn’t know the precise formatting (capitalization, underscores or dashes) of column names, or whether start and end dates are stored as one column or sets of three like they are here.
We still need more information if we want to iterate through the polygons in the data and load them one at a time. We know what columns uniquely identify the rows, but we don’t know all the values they take on. Without that, we’re stuck. What (usually) makes spatial data big is not the tabular data themselves, but the spatial features they’re attached to. This is particularly the case with polygons, which can be incredibly large in size for complex features. So, the goal here is to get the data we care about (ID column and start date) and ditch everything else, loading only the bare minimum into memory.
To do this, we’ll use the ogr2ogr() function in the gdalUtils package3. ogr2ogr() converts between different spatial data formats. It also offers two features that we’re going to use to cut down the data to the bare minimum. The select argument is a SQL selection, so we’re going to create a comma separated list of our key columns. The nlt argument specifies what type of geometry to create in the output. Conveniently, it accepts NONE as a value, which will yield a plain table of data with none of the memory-hogging geometries:
## load package
library(gdalUtils)
## convert to nonspatial geometry
ogr2ogr(src_datasource_name = 'cshapes.shp', dst_datasource_name = 'cshapes_no_geom',
select = 'GWCODE,GWSYEAR,GWSMONTH,GWSDAY', nlt = 'NONE')
This will create a shapefile in the new directory cshapes_no_geom called cshapes. The usual .shp and .shx components of a shapefile are missing, but the .dbf part is there, and that’s the one we care about. Load it up with st_read() and we’ll have what we need:
## load non-geometry table
cshapes_id <- st_read('cshapes_no_geom/cshapes.dbf')
## inspect
head(cshapes_id)
## GWCODE GWSYEAR GWSMONTH GWSDAY
## 1 110 1966 5 26
## 2 115 1975 11 25
## 3 52 1962 8 31
## 4 101 1946 1 1
## 5 990 1962 1 1
## 6 972 1970 6 4
Now you can load polygons one at a time and perform whatever geometric operations you need to. To illustrate, I’ll load the first four polygons in the dataset, calculate their area, and then plot them.
## set up four panel plot
par(mfrow = c(1, 4), mar = c(6.1, 4.1, 4.1, 4.1))
## read in each polygon and plot
for (i in 1:4) {
## build SQL query
query_str <- str_c('SELECT * FROM "cshapes" WHERE GWCODE = ', cshapes_id$GWCODE[i],
' AND GWSYEAR = ', cshapes_id$GWSYEAR[i],
' AND GWSMONTH = ', cshapes_id$GWSMONTH[i],
' AND GWSDAY = ', cshapes_id$GWSDAY[i])
## read in data
pol <- st_read('cshapes.shp', query = query_str)
## plot data
pol %>%
st_geometry() %>%
plot(main = pol$CNTRY_NAME,
sub = str_c(round(units::set_units(st_area(pol), 'km^2'), digits = 0),
' km^2'))
}
Sometimes (oftentimes in spatial analysis) we need not just a polygon, but also its neighbors. That means loading just one polygon is insufficient. If your data are already in R, this is easy with the st_filter() function, but it’s much trickier if you’re trying to filter data before loading them into R4. Luckily, st_read() has you covered! The wkt_filter argument accepts a well-known text string that can be used to filter the data before loading them into R5. Well-known text is a standard string representation of geometry, and is actually how the sf package prints geometry in R:
st_point(c(1, 2)) # prints as POINT (1 2)
We want to use the wkt_filter argument to only load polygons that intersect with our Morocco polygon into R. To do that, we need to convert our polygon to a well-known text string with the st_as_text() function, then pass it to st_read(). However, st_as_text() only accepts sfc and sfg objects, not sf objects:
## create well known text object to filter cshapes on disk
morocco_wkt <- st_as_text(morocco)
## Error in UseMethod("st_as_text"): no applicable method for 'st_as_text' applied to an object of class "c('sf', 'data.frame')"
To get around this, we need to drop the data on morocco and extract just the geometry of the polygon with st_geometry():
## create well known text object to filter cshapes on disk
morocco_wkt <- morocco %>%
st_geometry() %>% # convert to sfc
st_as_text() # convert to well known text
## plot morocco and neighbors
st_read('cshapes.shp', wkt_filter = morocco_wkt) %>%
st_geometry() %>%
plot(main = morocco$CNTRY_NAME)
## add morocco polygon on top
morocco %>%
st_geometry() %>%
plot(add = T, col = rgb(0, 1, 0, .5))
Notice that there are multiple polygon boundaries within the green area of our Morocco polygon. That’s because there are 4 Morocco polygons in the data starting in 1956, 1958, 1976, and 1979. Be sure to filter the dataset, either as part of the SQL query or in a dplyr::filter() call, so that you only get polygons that existed contemporaneously with your polygon of interest.
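For instance, keeping only the neighbors whose tenure includes 1976 (a sketch using the start and end year fields shown above) could look like this:
## keep only polygons in existence in 1976
neighbors_1976 <- st_read('cshapes.shp', wkt_filter = morocco_wkt) %>%
  filter(GWSYEAR <= 1976, GWEYEAR >= 1976)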
So far, we’ve covered:
You can technically skip the first two steps and just move the .shp and .shx files out of the directory before loading the .dbf file with st_read(), but that kind of feels like cheating to me6 and it only works with shapefiles. If you have another type of spatial dataset, read on.
In my research, I often need to work with spatial data that’s measured at or aggregated up to different administrative divisions (ADMs). GADM helpfully provides a global dataset of ADMs. Although you can download ADMs for specific countries, I work with data in enough different countries that I finally decided to just download the entire dataset. While the cshapes example above just illustrated how to implement a pipeline for working with spatial data on disk, you may actually need to use one with these data depending on your machine’s hardware.
This master dataset comes as a GeoPackage. Most importantly for us, that means we can’t just delete a few component files to load the non-spatial table from the dataset; we have to convert it from a spatial dataset to a non-spatial one with ogr2ogr(). The GeoPackage contains ADMs from level 0 (countries) all the way down to level 5. Each level is stored as a separate layer in the .gpkg, and we can get a list of available layers with the st_layers() function:
## get layers
st_layers('~/Dropbox/Datasets/GADM/gadm34_levels_gpkg/gadm34_levels.gpkg')
## Driver: GPKG
## Available layers:
## layer_name geometry_type features fields
## 1 level0 Multi Polygon 256 2
## 2 level1 Multi Polygon 3610 10
## 3 level2 Multi Polygon 45962 13
## 4 level3 Multi Polygon 147427 16
## 5 level4 Multi Polygon 138053 14
## 6 level5 Multi Polygon 51427 15
We want to work with the third-order administrative divisions (cities, towns, and other municipalities in the US context), so we need the level3 layer. Where we just used the name of the dataset in our SQL call before, this time we’ll use level3. Now we just follow the same workflow as with the cshapes dataset above:
## get first observation
level3 <- st_read('~/Dropbox/Datasets/GADM/gadm34_levels_gpkg/gadm34_levels.gpkg',
query = 'SELECT * FROM "level3" WHERE FID = 1', layer = 'level3')
## inspect
level3
## Simple feature collection with 1 feature and 16 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 13.08792 ymin: -8.010127 xmax: 13.59943 ymax: -7.708598
## geographic CRS: WGS 84
## GID_0 NAME_0 GID_1 NAME_1 NL_NAME_1 GID_2 NAME_2 NL_NAME_2 GID_3 NAME_3 VARNAME_3
## 1 AGO Angola AGO.1_1 Bengo <NA> AGO.1.1_1 Ambriz <NA> AGO.1.1.1_1 Ambriz <NA>
## NL_NAME_3 TYPE_3 ENGTYPE_3 CC_3 HASC_3 geom
## 1 <NA> Commune Commune <NA> <NA> MULTIPOLYGON (((13.12764 -7...
This time we have a single column that uniquely identifies observations, GID_3, so we only have to extract one column from the dataset. We use the ogr2ogr() function as before, but we have to specify the layer = 'level3' argument since the GeoPackage has more than one layer and we want to work with a specific one. Since GID_3 is our identifier column, that’s what we select from the dataset:
## convert to nonspatial geometry
ogr2ogr(src_datasource_name = '/Users/Rob/Dropbox/Datasets/GADM/gadm34_levels_gpkg/gadm34_levels.gpkg',
dst_datasource_name = 'gadm34_levels_no_geom',
layer = 'level3',
select = 'GID_3',
nlt = 'NONE')
## load non-geometry table
gadm_ids <- st_read('gadm34_levels_no_geom/level3.dbf')
## inspect
head(gadm_ids)
## GID_3
## 1 AGO.1.1.1_1
## 2 AGO.1.1.2_1
## 3 AGO.1.1.3_1
## 4 AGO.1.2.1_1
## 5 AGO.1.2.2_1
## 6 AGO.1.2.3_1
And we can again read the polygons into R one at a time and perform whatever spatial operations we need. Since our identifying column is a string this time, we need to enclose it in quotes in our SQL call. SQL is very picky about quotation mark types, so while we needed to surround our layer name with double quotes, we need to surround our identifier value with single quotes. I’m already using single quotes to define the character string for the SQL call, so I need to escape the single quotes around the identifier. You can do this with a single backslash (\). Thus, you can include single quotes in a single-quoted string like this: 'this is a string \'this is another part of a string\''. Other than that wrinkle, things are pretty much the same as with cshapes:
## for reproducibility
set.seed(27599)
## set up four panel plot
par(mfrow = c(1, 4), mar = c(2.1, 4.1, 4.1, 4.1))
## read in each polygon and plot
for (i in sample(1:nrow(gadm_ids), 4, replace = F)) { # mix it up
## build SQL query
query_str <- str_c('SELECT * FROM "level3" WHERE GID_3 = \'',
gadm_ids$GID_3[i], '\'')
## read in polygon for ADM3 i
adm3 <- st_read('~/Dropbox/Datasets/GADM/gadm34_levels_gpkg/gadm34_levels.gpkg',
query = query_str, layer = 'level3')
## plot polygon and label with full name
print(plot(adm3$geom,
main = adm3 %>%
select(starts_with('NAME_')) %>% # get all name variables
st_drop_geometry() %>% # drop geometry
rev() %>% # reverse order of names to 3, 2, 1, 0
str_c(collapse = ', '), # collapse w/ commas
cex.main = .6))
}
Spatially filtering the GADM dataset is just as easy as with cshapes. To illustrate, I’m going to pull out a random polygon and use it to filter the data. However, these are third-order administrative divisions, and so it’s possible that even capturing all adjacent polygons won’t cover a very large area. To deal with this concern, we can buffer the polygon with the st_buffer() function before we convert it to well-known text:
## import single polygon
adm3 <- st_read('~/Dropbox/Datasets/GADM/gadm34_levels_gpkg/gadm34_levels.gpkg',
query = str_c('SELECT * FROM "level3" WHERE FID = 63130'))
## create well known text object to filter GADM on disk
adm3_wkt <- adm3 %>%
st_geometry() %>% # convert to sfc
st_buffer(.025) %>% # buffer .025 decimal degrees
st_as_text() # convert to well known text
## plot Dakkoun and neighbors w/in .025 decimal degrees
st_read('~/Dropbox/Datasets/GADM/gadm34_levels_gpkg/gadm34_levels.gpkg',
layer = 'level3', wkt_filter = adm3_wkt) %>%
st_geometry() %>%
plot(main = adm3 %>%
select(starts_with('NAME_')) %>%
st_drop_geometry() %>%
rev() %>%
str_c(collapse = ', '))
## plot Dakkoun and highlight
adm3 %>%
st_geometry() %>%
plot(add = T, col = 'green')
## plot buffered polygon used to filter GADM on disk
adm3 %>%
st_geometry() %>%
st_buffer(.025) %>%
st_cast('LINESTRING') %>%
plot(add = T, col = 'blue')
The green polygon above is Dakkoun, the 63,130th polygon in the dataset. The blue line is the extent of the .025 decimal degree buffer applied to it before filtering the dataset. This workflow can speed things up when working with these data, considering there are 884,562 third-order administrative division polygons in the dataset.
The query and wkt_filter arguments to st_read() can help you work with large spatial datasets that are either too big to load into memory, or too slow to work with once loaded. While this is less of a concern with low resolution datasets created by social scientists, it can be incredibly useful if you ever have to work with super high resolution data created by remote sensing technologies or actual cartographers and geographers.
This is the approach that the raster package uses. R only stores information on the extent and resolution of a raster in memory; the actual values in each cell of a raster are only loaded into memory when accessed by R using a function like extract(). ↩
Although I’m using cshapes as an example throughout this post so you can easily follow along and run the code yourself, it’s a small enough dataset that no modern machine should have trouble loading it. At the end of this post I also apply this workflow to a much larger dataset where you’d actually benefit from it. ↩
This function is just a wrapper around the GDAL utility ogr2ogr. You could also do this with ogr2ogr directly in the shell, but it’s much uglier: ogr2ogr -f "ESRI SHAPEFILE" cshapes_no_geom.shp cshapes.shp cshapes -nlt NONE -select GWCODE,GWSYEAR,GWSMONTH,GWSDAY. ↩
st_filter() accepts various spatial predicates beyond the default of st_intersects(). This filtering on disk gives much less fine-grained control. If you need more precision, you can load more nearby polygons by buffering the polygon before filtering the input like here and then using st_filter() with your spatial predicate of choice. ↩
I spent over an hour trying to figure out how to tell the query parameter to use the PostGIS or SpatiaLite dialects instead of the OGR SQL dialect so I could execute a spatial filter before finding the wkt_filter argument to st_read(). Always read the documentation carefully. ↩
Having to move or delete files also risks losing them; the ogr2ogr() approach is safer in this regard. ↩
While just plotting them on a map is easy, since it will be on a web page, I figured why not also embed links to each program in the map as well. In theory this is easy thanks to R packages like leaflet, which leverages the (unsurprisingly named) leaflet JavaScript library for interactive webmaps. However, because I use Jekyll instead of Hugo for my site, I can’t just use the blogdown R package and have everything magically work.
Steven Miller’s tutorial on integrating R Markdown and Jekyll is the starting point for my own use of R Markdown and Jekyll, so check that out first for a quick primer on how to use R Markdown to render .Rmd files into the .md files that Jekyll uses to render your website. This approach works fantastically well for static images, and requires just a little tweaking to make interactive widgets like leaflet maps work.
We’ll use three packages to create our map. The tidyverse is pretty well-documented at this point, but I use it to write efficient and readable code. tidygeocoder is a geocoder that can use a variety of geocoding services and works well with data frames and tibbles. Finally, leaflet is what we’ll use to create our actual map widget.
library(tidyverse)
library(tidygeocoder)
library(leaflet)
First, we need to load our data. This is a CSV file of program information that I’ve compiled myself.
## read in data
predoc <- read_csv('predoc.csv')
## inspect the data
predoc
## # A tibble: 9 x 4
## Institution Name Location URL
## <chr> <chr> <chr> <chr>
## 1 University of South… POIR Predoctoral Summer… Los Angeles,… https://dornsife.usc.edu/poir/predoct…
## 2 Duke University Ralph Bunche Summer Ins… Durham, NC, … https://www.apsanet.org/rbsi
## 3 UC San Diego START La Jolla, CA… https://grad.ucsd.edu/diversity/progr…
## 4 MIT MSRP Cambridge, M… https://oge.mit.edu/graddiversity/msr…
## 5 UC Irvine SURF Irvine, CA, … https://grad.uci.edu/about-us/diversi…
## 6 University of Washi… NSF REU: Spatial Models… Tacoma, WA, … https://www.tacoma.uw.edu/smed/nsf-re…
## 7 University of North… NSF REU: Civil Conflict… Denton, TX, … https://untconflictmgmtreu.wordpress.…
## 8 Princeton University Emerging Scholars in Po… Princeton, N… https://politics.princeton.edu/gradua…
## 9 Harvard University PS-Prep Cambridge, M… https://projects.iq.harvard.edu/ps-pr…
First, we need to get latitude and longitude coordinates from our place names to plot them on a map. We’ll use the geocode() function, where the first argument is a data frame containing a column with the location information we want to use. The second argument is address, which tells the geocoder to use the information stored in the Location column of our data frame, and then method = 'osm' dispatches it to the OpenStreetMap geocoder, Nominatim.
Next, we’ll use mutate() to create a new variable to hold the popup text a user will see when they click on a point. I want to provide the university name, the program’s name, and then a link to the program’s information page. I use the str_c() function to combine the Institution and Name columns, and then I use another call to str_c() to format the URL. This second call looks like str_c('<a href="', URL, '" target="_PARENT">Program Info</a>'), where URL is the name of the URL field. It combines the standard start of an HTML anchor tag (<a href=") with the URL itself, adds the link text of “Program Info”, and then closes the tag. The one unusual element is target="_PARENT" in the anchor tag. This is necessary to make any links a user clicks open normally, instead of within the frame used to embed it into the page (more on that later).
Once we’ve prepped our popup text, we just pass the data frame to leaflet(), add a background map (I’ve used a styled map, but you can also get the default map with addTiles()), and then the markers themselves. The one tricky part of addMarkers() is that it expects its arguments as one-sided formulas, not just variable names like tidyverse functions. geocode() has created lat and long columns, so pass those through as well as our label column, and we’re good to go.
Putting all the above code together in a pipeline looks like this:
## prep and plot
predoc %>%
geocode(address = Location, method = 'osm') %>% ## gecode locations
mutate(lab = str_c(Institution, Name,
str_c('<a href="', URL, '" target="_PARENT">Program Info</a>'),
sep = '<br>')) %>% # paste fields into popup text
leaflet() %>% # create leaflet map widget
addProviderTiles(providers$CartoDB.Positron) %>% # add muted palette basemap
addMarkers(lng = ~ long, lat = ~ lat, popup = ~ lab) # add markers with popup text
Unfortunately this code produces an error that stops R Markdown dead in its tracks; like, the-error = T-knitr-chunk-option-won’t-even-save-you dead in its tracks. What gives? R Markdown is supposed to be able to render interactive widgets no problem. The issue is that R Markdown can render those widgets for HTML output, but since we’re creating a GitHub Flavored Markdown document that Jekyll then turns into HTML, R Markdown chokes. It can’t embed an HTML widget into a plain text markdown document. Luckily there is a way around this, but it involves an extra step and dealing with some file paths.
To make things work, we have to manually save the HTML from our widget, and then embed it into our resulting markdown document. Then, when Jekyll renders the markdown to HTML, it will be visible in the final HTML files that comprise your website. This involves telling R where to save the HTML, then referencing it using raw HTML code in our markdown document. We’re going to do this with the htmlwidgets R package.
## load htmlwidgets to save map widget
library(htmlwidgets)
## prep and plot
predoc %>%
geocode(address = Location, method = 'osm') %>% ## gecode locations
mutate(lab = str_c(Institution, Name,
str_c('<a href="', URL, '" target="_PARENT">Program Info</a>'),
sep = '<br>')) %>% # paste fields into popup text
leaflet() %>% # create leaflet map widget
addProviderTiles(providers$CartoDB.Positron) %>% # add muted palette basemap
addMarkers(lng = ~ long, lat = ~ lat, popup = ~ lab) %>% # add markers with popup text
saveWidget(here::here('/files/html/posts', 'predoc_map.html')) # save map widget
The code is identical to that above, with the addition of the final line, which saves the map widget as an HTML file called predoc_map.html in /files/html/posts using the saveWidget() function. You’ll notice I use the here() function from the here R package to supply the file argument to saveWidget(). here is great because it very intelligently finds the top level of whatever project you’re working on and then constructs file paths from there. It has a number of ways to determine where a project ‘starts’, but for us it works because our website is a git repo and contains a .git directory.
All that’s left to do is embed the map widget in the page using an iframe. iframes allow you to embed an HTML page inside of another HTML page. Since saveWidget() saved our map widget as an HTML file that’s nothing but our map, we can then embed it into our page using an iframe. Jekyll allows raw HTML in markdown files, which it ignores and passes through untouched into the final HTML files it produces. Here’s the code I used for the map in this post.
<iframe src="/files/html/posts/predoc_map.html" height="600px" width="100%" style="border:none;"></iframe>
The main argument is src="...", which tells the iframe what content it will contain. Notice that this is the same file path I just specified above in saveWidget(). As long as that directory exists in your website repo, everything will work smoothly. There are three important arguments in addition to the content of the iframe itself:
height is how tall you want the iframe to be; here I’ve specified it in pixels, but you can also use inches, centimeters, or percentages as you’ll see below
width is how wide you want the iframe to be; I’ve used a percentage here because the AcademicPages template is responsive and will resize itself on smaller screens
style is where I tell the iframe not to include a border so it blends seamlessly with the rest of the page
Here’s what the final map looks like. If you didn’t know the extra effort it took, it would blend seamlessly into the page. Theoretically this should work for any HTML widget, like those produced by the plotly R package. If you haven’t checked plotly out, you really should. It can turn ggplot2 plots into interactive widgets with a single line of code!
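For example (with p standing in for a ggplot2 object you’ve already built, so purely a sketch), the same save-and-embed workflow as the map would be:
## turn a hypothetical ggplot2 object p into an interactive widget and save it
library(plotly)
saveWidget(ggplotly(p), here::here('/files/html/posts', 'my_plot.html'))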
These missions are distinguished from other current UN peacekeeping missions by high levels of violence (both overall and against UN personnel) and expansive mandates that go beyond ‘traditional’ goals of stabilizing post-conflict peace. The conflict management aims of these operations necessarily expose peacekeepers to high levels of risk. If we want to try to understand what the future of MINUSMA might look like dealing with a new government in Mali, it’s important to place MINUSMA in context among the remainder of the big 5 missions. To help do so, I turned to the source for data on peacekeeping missions, the UN.
When we wrote the piece, the Peacekeeping open data portal page on fatalities only had a link to this PDF report instead of the usual CSV file (the CSV file is back, so you don’t technically have to go through all of these steps to recreate this figure). Here’s what the first page of that PDF looks like:
Since we were working on a short deadline, I needed to get these data out of that PDF. The most direct option is to just copy and paste the data into an Excel sheet. However, these data run to 148 pages, so all that copying and pasting would be tiring and risks introducing errors when your attention eventually slips and you forget to include page 127.
Enter the tabulizer R package. This package is just a (much) friendlier wrapper to the Tabula Java library, which is designed to extract tables from PDF documents. To do so, just plug in the file name of the local PDF you want or the URL for a remote one:
library(tabulizer)
## data PDF URL
dat <- 'https://peacekeeping.un.org/sites/default/files/fatalities_june_2020.pdf'
## get tables from PDF
pko_fatalities <- extract_tables(dat, method = 'stream')
The extract_tables() function has two different methods for extracting data: lattice for more structured, spreadsheet-like PDFs and stream for messier files. While the PDF looks pretty structured to me, method = 'lattice' returned a series of one-variable-per-line gibberish, so I specify method = 'stream' to speed up the process by not forcing tabulizer to determine which algorithm to use on each page.
Note that you may end up getting several warnings, such as the ones I received:
## WARNING: An illegal reflective access operation has occurred
## WARNING: Illegal reflective access by RJavaTools to method java.util.ArrayList$Itr.hasNext()
## WARNING: Please consider reporting this to the maintainers of RJavaTools
## WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
## WARNING: All illegal access operations will be denied in a future release
Everything still worked out fine for me, but you may run into problems in the future based on the warning about future releases.
We end up with a list that is 148 elements long, one per page. Each element is a matrix, reflecting the structured nature of the data. Normally, we could just combine this list of matrices into a single object with do.call(rbind, pko_fatalities):
do.call(rbind, pko_fatalities)
## Error in (function (..., deparse.level = 1) : number of columns of matrices must match (see arg 2)
But if we do this, we get an error! Let’s take a look and see what’s going wrong. We can use lapply() in combination with dim() to do so:
head(lapply(pko_fatalities, dim))
## [[1]]
## [1] 54 9
##
## [[2]]
## [1] 54 7
##
## [[3]]
## [1] 54 7
##
## [[4]]
## [1] 54 7
##
## [[5]]
## [1] 54 7
##
## [[6]]
## [1] 54 7
The first matrix has an extra two columns, causing our attempt to rbind() them all together to fail.
head(pko_fatalities[[1]])
## [,1] [,2] [,3] [,4]
## [1,] "Casualty_ID" "Incident_Date Mission_Acronym" "" "Type_of_Casualty"
## [2,] "BINUH‐2019‐12‐00001" "30/11/2019 BINUH" "" "Fatality"
## [3,] "BONUCA‐2004‐06‐04251" "01/06/2004 BONUCA" "" "Fatality"
## [4,] "IPTF‐1997‐01‐02515" "31/01/1997 IPTF" "" "Fatality"
## [5,] "IPTF‐1997‐09‐02720" "17/09/1997 IPTF" "" "Fatality"
## [6,] "IPTF‐1997‐09‐02721" "17/09/1997 IPTF" "" "Fatality"
## [,5] [,6] [,7] [,8]
## [1,] "Casualty_Nationality" "M49_Code ISOCode3" "" "Casualty_Personnel_Type"
## [2,] "Haiti" "332 HTI" "" "Other"
## [3,] "Benin" "204 BEN" "" "Military"
## [4,] "Germany" "276 DEU" "" "Police"
## [5,] "United States of America" "840 USA" "" "Police"
## [6,] "United States of America" "840 USA" "" "Police"
## [,9]
## [1,] "Type_Of_Incident"
## [2,] "Malicious Act"
## [3,] "Illness"
## [4,] "Accident"
## [5,] "Accident"
## [6,] "Accident"
head(pko_fatalities[[2]])
## [,1] [,2] [,3] [,4] [,5]
## [1,] "MINUSCA‐2015‐10‐09459" "06/10/2015 MINUSCA" "Fatality" "Burundi" "108 BDI"
## [2,] "MINUSCA‐2015‐10‐09468" "13/10/2015 MINUSCA" "Fatality" "Burundi" "108 BDI"
## [3,] "MINUSCA‐2015‐11‐09509" "10/11/2015 MINUSCA" "Fatality" "Cameroon" "120 CMR"
## [4,] "MINUSCA‐2015‐11‐09510" "22/11/2015 MINUSCA" "Fatality" "Rwanda" "646 RWA"
## [5,] "MINUSCA‐2015‐11‐09511" "30/11/2015 MINUSCA" "Fatality" "Cameroon" "120 CMR"
## [6,] "MINUSCA‐2015‐12‐09542" "06/12/2015 MINUSCA" "Fatality" "Congo" "178 COG"
## [,6] [,7]
## [1,] "Military" "Malicious Act"
## [2,] "Military" "Accident"
## [3,] "Military" "Malicious Act"
## [4,] "Military" "To Be Determined"
## [5,] "International Civilian" "Illness"
## [6,] "Military" "Illness"
We can see that the first page has two blank columns, accounting for the 9 columns compared to the 7 columns for all other pages. Closer inspection of the header on the first page and the columns on both the first and second pages reveals that there actually should be 9 columns in the data.
The Incident_Date and Mission_Acronym columns are combined into one, as are the M49_Code and ISOCode3 columns. We’ll fix the data in those two columns in a bit, but first we have to get rid of the empty columns in the first page before we can merge the data from all the pages. We could just tell R to drop those columns manually with pko_fatalities[[1]][, -c(3, 7)], but this isn’t a very scalable solution if we have lots of columns with this issue.
To do this programmatically, we need a way to identify empty columns. If this was a list of data frames, we could use colnames() to identify the empty columns. However, extract_tables() has given us a matrix with the column names in the first row. Instead, we'll just grab the first row of the matrix. Since we're accessing a matrix that is the first element in a list, we want to use pko_fatalities[[1]][1, ] to index pko_fatalities. Next, we'll use the grepl() function to identify the empty columns. We want to search for the regular expression ^$, which means the start of a line immediately followed by the end of a line, i.e., an empty string. Finally, we negate it with ! to keep only the columns with non-empty names:
## drop two false empty columns on first page
pko_fatalities[[1]] <- pko_fatalities[[1]][, !grepl('^$', pko_fatalities[[1]][1,])]
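If the regular expression looks opaque, here's a toy illustration on a made-up character vector (not the real header row):
grepl('^$', c('Casualty_ID', '', 'Type_of_Casualty'))
## [1] FALSE  TRUE FALSE
Negating that result flips the values, so indexing with it keeps only the columns whose header entry is non-empty.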
With that out of the way, we can now combine all the pages into one giant matrix. After that, I convert the matrix into a data frame, set the first row as the column names, and then drop the first row.
## rbind pages
pko_fatalities <- do.call(rbind, pko_fatalities)
## set first row as column names and drop
pko_fatalities <- data.frame(pko_fatalities)
colnames(pko_fatalities) <- (pko_fatalities[1, ])
pko_fatalities <- pko_fatalities[-1, ]
Now that we’re working with a data frame, we can finally tackle those two sets of mashed-up columns. To do this, we’ll use the separate() function from the tidyr package, which I load via the tidyverse package. separate() is magically straightforward. It takes a column name (which I have to enclose in backticks thanks to the space), a character vector of names for the resulting columns, and a regular expression to split on. I use \\s, which matches any whitespace character. I also filter out any duplicate header rows that may have crept in (there’s one on page 74, at the very least).
library(tidyverse)
library(lubridate) # provides dmy() used below
## separate columns tabulizer incorrectly merged
pko_fatalities <- pko_fatalities %>%
filter(Casualty_ID != 'Casualty_ID') %>% # drop any repeated header(s)
separate(`Incident_Date Mission_Acronym`, c('Incident_Date', 'Mission_Acronym'),
sep = '\\s', convert = T, extra = 'merge') %>%
separate(`M49_Code ISOCode3`, c('M49_Code', 'ISOCode3'),
sep = '\\s', convert = T) %>%
mutate(Incident_Date = dmy(Incident_Date)) # convert date to date object
You’ll notice I also supply two other arguments here: convert and extra. The former automatically converts the data type of the resulting columns, which is useful because it turns M49_Code into an integer; Incident_Date still comes out as a character string, which is why the final mutate() passes it through dmy() to get a proper Date object. The latter tells separate() what to do if it detects more matches of the splitting expression than you’ve supplied column names. There are 18 observations where the mission acronym is listed as "UN Secretariat", which means separate() will detect a second whitespace character in those rows. If you don’t explicitly set extra, you’ll get a warning telling you what happened with the extra characters. By setting extra = 'merge', you’re telling separate() to effectively ignore any space after the first one and keep everything to the right of the first space as part of the output. Thus, our "UN Secretariat" observations are preserved instead of being chopped off to just "UN".
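As a quick sanity check, here's what extra = 'merge' does on a tiny made-up data frame (the values are hypothetical but mimic the problem column):
toy <- data.frame(x = c('30/11/2019 BINUH', '01/06/2004 UN Secretariat'))
separate(toy, x, c('Incident_Date', 'Mission_Acronym'), sep = '\\s', extra = 'merge')
## Incident_Date splits off cleanly, and Mission_Acronym keeps 'UN Secretariat' intact instead of being cut to 'UN'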
Now that we’ve got the data imported and cleaned up, we can recreate the plot from the Monkey Cage piece. However, first we need to bring in some outside information and calculate some simple statistics.
Before we can plot the data, we need to bring in some mission-level
information, namely what country each mission operates in. We can get
this easily from the Peacekeeping open data portal master
dataset.
Once I load the data into R I select just the mission acronym and
country of operation. I then edit the strings for CAR and DRC to add
newlines between words with \n
to make them fit better into the plot.
## get active PKO data and clean up country names
read_csv('https://data.humdata.org/dataset/819dce10-ac8a-4960-8756-856a9f72d820/resource/7f738eb4-6f77-4b5c-905a-ed6d45cc5515/download/coredata_activepkomissions.csv') %>%
select(Mission_Acronym, Country = ACLED_Country) %>%
mutate(Country = case_when(Country == 'Central African Republic' ~
'Central\nAfrican\nRepublic',
Country == 'Democratic Republic of Congo' ~
'Democratic\nRepublic\nof the Congo',
TRUE ~ Country)) -> pko_data
We’re looking to see how dangerous peacekeeping missions are for peacekeepers, so we want to only look at fatalities that are the result of deliberate acts. The data contain 6 different types of incident, so let’s check them out:
table(pko_fatalities$Type_Of_Incident)
##
## Accident Illness Malicious Act Self‐Inflicted To Be Determined
## 2712 2582 2096 268 244
## Unknown
## 50
Malicious acts are only the third most common type of incident, so it’s important to subset the data to make sure we’re only counting the deliberate attacks we’re interested in. Since we’re looking at fatalities in the big 5 missions, we also need to subset the data to just those missions. We’re going to use the summarize() function in conjunction with group_by() to calculate several summary statistics for each mission. We’ll also use the time_length() and interval() functions from the lubridate package, so load that as well.
library(lubridate)
## list of PKOs to include
pkos <- c('MINUSMA', 'UNAMID', 'MINUSCA', 'MONUSCO', 'UNMISS')
## aggregate mission level data
pko_fatalities %>%
filter(Type_Of_Incident == 'Malicious Act',
Mission_Acronym %in% pkos) %>%
group_by(Mission_Acronym) %>%
summarize(casualties = n(),
casualties_mil = sum(Casualty_Personnel_Type == 'Military'),
casualties_pol = sum(Casualty_Personnel_Type == 'Police'),
casualties_obs = sum(Casualty_Personnel_Type == 'Military Observer'),
casualties_civ = sum(Casualty_Personnel_Type == 'International Civilian'),
casualties_oth = sum(Casualty_Personnel_Type == 'Other'),
casualties_loc = sum(Casualty_Personnel_Type == 'Local'),
duration = time_length(interval(min(Incident_Date),
max(Incident_Date)),
unit = 'year')) %>%
mutate(MINUSMA = case_when(Mission_Acronym == 'MINUSMA' ~ 'MINUSMA',
TRUE ~ '')) %>%
left_join(pko_data, by = 'Mission_Acronym') %>%
mutate(Country = factor(Country,
levels = Country[order(casualties,
decreasing = T)])) -> data_agg
- casualties = n() counts the total number of fatalities in each mission, because each row is one fatality
- casualties_mil = sum(Casualty_Personnel_Type == 'Military') counts how many of those casualties were UN troops
- the other casualties_... lines do the same for different categories of UN personnel
- duration calculates how long each mission has lasted by taking the earliest and latest Incident_Date, creating an interval object from those dates, and measuring that interval's length in years (see the small sketch below)

Finally, we merge on the country information contained in pko_data and convert Country to a factor with levels that are decreasing in fatalities. This last step is necessary to have a nicely ordered plot.
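If the duration calculation looks dense, here's a minimal sketch using two made-up dates rather than the real incident dates:
time_length(interval(ymd('2013-07-01'), ymd('2020-01-01')), unit = 'year')
## roughly 6.5, i.e., about six and a half years
interval() builds the span between the two dates, and time_length() expresses that span in the unit you ask for.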
With that taken care of, we can create the plot using ggplot
. I’m
using the label
argument to place mission acronyms inside the bars
with geom_text()
, and a second call to geom_text()
with the
casualties
variable to place fatality numbers above the bars. The
nudge_y
argument in each call to geom_text()
ensures that they’re
vertically spaced out, making them readable instead of overlapping.
ggplot(data_agg, aes(x = Country, y = casualties, label = Mission_Acronym)) +
geom_bar(stat = 'identity', fill = '#5b92e5') +
geom_text(color = 'white', nudge_y = -10) +
geom_text(aes(x = Country, y = casualties, label = casualties),
data = data_agg, inherit.aes = F,
nudge_y = 10) +
labs(x = '', y = 'UN Fatalities',
title = 'UN fatalities in big 5 peacekeeping operations') +
theme_bw()
We can also create some other plots to visualize how dangerous each mission is to peacekeeping personnel. While total fatalities are an important piece of information, the rate of fatalities can tell us more about the intensity of the danger in a given conflict.
data_agg %>%
ggplot(aes(x = duration, y = casualties, label = MINUSMA)) +
geom_point(size = 2.5, color = '#5b92e5') +
geom_text(nudge_x = 1) +
expand_limits(x = 0, y = 0) +
labs(x = 'Mission duration (years)', y = 'Fatalities (total)',
title = 'UN fatalities in big 5 peacekeeping operations') +
theme_bw()
We can see from this plot that not only does MINUSMA have the most
peacekeeper fatalities out of any mission, it reached that point in a
comparatively short amount of time. To really drive this point home, we
can draw on the fantastic gganimate
package. We’re going to animate
cumulative fatality totals over time, so we need a yearly version of our
mission-level data frame from above. The code below is pretty similar, except we’re grouping by both Mission_Acronym and a variable called Year that we’re generating with the year() function in lubridate (it extracts the year from a Date object).
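For example, with a made-up date:
year(as.Date('2015-10-06'))
## [1] 2015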
pko_fatalities %>%
filter(Type_Of_Incident == 'Malicious Act',
Mission_Acronym %in% pkos) %>%
group_by(Mission_Acronym, Year = year(Incident_Date)) %>%
summarize(casualties = n(),
casualties_mil = sum(Casualty_Personnel_Type == 'Military'),
casualties_pol = sum(Casualty_Personnel_Type == 'Police'),
casualties_obs = sum(Casualty_Personnel_Type == 'Military Observer'),
casualties_civ = sum(Casualty_Personnel_Type == 'International Civilian'),
casualties_oth = sum(Casualty_Personnel_Type == 'Other'),
casualties_loc = sum(Casualty_Personnel_Type == 'Local')) %>%
mutate(MINUSMA = case_when(Mission_Acronym == 'MINUSMA' ~ 'MINUSMA',
TRUE ~ ''),
Mission_Year = Year - min(Year) + 1) %>%
left_join(pko_data, by = 'Mission_Acronym') %>%
mutate(Country = factor(Country, levels = levels(data_agg$Country))) -> data_yr
Once we’ve done that, we need to make a couple tweaks to our data to
ensure that our plot animates correctly. I use the new across()
function (which is likely going to eventually replace mutate_at
,
mutate_if
, and similar functions) to select all columns that start
with “casualties”. Then, I supply the cumsum()
function to the .fns
argument, and use the .names
argument to append “_cml” to the end of
each resulting variable’s name. This argument uses glue
syntax, which allows you to embed R
code in strings by enclosing it in curly braces. The complete()
function uses the full_seq()
function to fill in any missing years in
each mission, i.e., a year in the middle of a mission without any
fatalities due to malicious acts. Finally, the fill()
function fills
in any rows we just added that are missing fatality data due to an
absence of fatalities that year.
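To make the across() call less abstract before we see it in context, here's a tiny sketch with made-up numbers (the column names mirror the real data, but the values are invented):
tibble(casualties = c(1, 2, 0), casualties_mil = c(1, 1, 0)) %>%
mutate(across(starts_with('casualties'), .fns = cumsum, .names = '{col}_cml'))
## adds casualties_cml = 1, 3, 3 and casualties_mil_cml = 1, 2, 2 alongside the original columns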
Now we’re ready to animate our plot! We construct the ggplot
object
like before, but this time we add the transition_manual()
function to
the end of the plot specification. This function tells gganimate
what
the ‘steps’ in our animation are. Since we’ve got individual years,
we’re using the manual
version of transition_
instead of the many
fancier versions included in the package.
If you check out the documentation for transition_manual()
, you’ll
notice that there are a handful of special label variables you can use
when constructing your plot. These will update as the plot cycles
through its frames, allowing you to convey information about the flow of
time. I’ve used the current_frame
variable, again with glue syntax, to
make the title of the plot display the current mission year as the
frames advance.
library(gganimate)
data_yr %>%
arrange(Mission_Year) %>%
mutate(across(starts_with('casualties'), .fns = cumsum, .names = '{col}_cml')) %>%
complete(Mission_Year = full_seq(Mission_Year, 1)) %>%
fill(Year:casualties_loc_cml, .direction = 'down') %>%
filter(Mission_Year <= 6) %>% # youngest mission is UNMISS
ggplot(aes(x = Country, y = casualties_cml, label = casualties_cml)) +
geom_bar(stat = 'identity', fill = '#5b92e5') +
geom_text(nudge_y = 10) +
labs(x = '', y = 'UN Fatalities',
title = 'UN fatalities in big 5 peacekeeping operations: mission year {current_frame}') +
theme_bw() +
transition_manual(Mission_Year)
While the scatter plot above illustrates that UN personnel working for MINUSMA have suffered the most violence in the shortest time of any big 5 mission, this animation makes it abundantly clear, especially since MONUSCO and UNMISS both have years without a single UN fatality from a deliberate attack. Visualizations like these are a great way to showcase your work, especially if you’re dealing with dynamic data. While you still can’t easily include them in a journal article, they’re fantastic tools for conference presentations or your personal website.
Editing the welcome page for your site (_pages/about.md) is relatively straightforward. Things get a little trickier if you want to add an entirely new page to your website. You’ll notice that I have a software page on my site that isn’t part of the academicpages template. I’ll use that page as a running example to walk you through adding a new page to your site.
First things first, we need to create a file for the page itself. The main pages for your website are generated from Markdown files contained in the _pages
directory. Create a new file called software.md
in _pages
. Now, open it up in RStudio or your text editor of choice. If you’ve looked at the .md
files for other pages, you’ll notice that they all start with a similar block of text. This is a YAML header that tells Jekyll the basic information needed to build the page. There are lots of different options you can include, but the only two you really need are the permalink
for the page and its title
. Add the following to the top of software.md
:
---
permalink: /software/
title: "Software"
---
Anything after that second line of dashes will be translated into actual content on the page.
Now we need to make our new page actually say something. My software page lists the R packages I’ve contributed to and includes links to miscellaneous other bits of code like functions for working with video data in Python or the LaTeX template I used for my dissertation. You can check out the .md
source file for my software page on my GitHub.
A couple of things to notice:
- Links use standard Markdown syntax: [link text](url)
- To link to another .md file in _pages, just put a slash before the page name and don’t include any extension, e.g., [software](/software)
- Images work the same way, with an exclamation point before the [ in Markdown link syntax, e.g., ![](/images/profile.png)
These tools should be sufficient to let you build an awesome new page for your website. However, letting visitors actually get to your new page requires a little more work.
If you want to just add a link to your new page from an existing page, like the homepage, that’s easy and can be accomplished by adding a link to the Markdown source in _pages/about.md. That’s how I added my teaching materials page; it’s just a link on my teaching page. But what if you want your new page to be easily accessible from the fancy navigation bar at the top of the site?
To do that, we’ll need to edit the files Jekyll uses to control navigation on the site. Open up _data/navigation.yml
and get ready to add our new page to the menu. This is what it looks like in the template:
# main links links
main:
- title: "Publications"
url: /publications/
- title: "Talks"
url: /talks/
- title: "Teaching"
url: /teaching/
- title: "Portfolio"
url: /portfolio/
- title: "Blog Posts"
url: /year-archive/
- title: "CV"
url: /cv/
- title: "Guide"
url: /markdown/
The order that items appear in top-to-bottom in this file is also the order they’ll appear in left-to-right in the navigation bar. So decide where you want your new page to go, and slot it in. This is what _data/navigation.yml
looks like for my website:
# main links links
main:
- title: "Publications"
url: /publications/
- title: "Research"
url: /research/
- title: "Teaching"
url: /teaching/
- title: "Software"
url: /software/
- title: "Posts"
url: /posts/
- title: "CV"
url: /cv/
Again, you can check out the current version of this file for my site at my GitHub if you want. If I’ve changed anything in the navigation menu since I wrote this post, those changes will be reflected there.
You’ll also notice that the Guide link is no longer there in my _data/navigation.yml
. Removing elements from this file drops them from the navigation menu, so if there are any other pages in the template you don’t plan to use, go ahead and remove them now.
Once you’re happy with your new page, it’s time to tell git about your changes and then upload them to GitHub. You can do this with
git add _pages/software.md _data/navigation.yml
git commit -m "add software page"
git push
If you followed the guide on uploading changes to GitHub in my post on making an academic website, all of the above code should run smoothly and in a few minutes you’ll have a new page on your website.
One of the advantages of using GitHub Pages to host your website is that you don’t have to use Dropbox to host PDFs of your working papers and published articles, not to mention your CV. If you use Wix or WordPress, you may have to upload your files to Dropbox and then link to them on your site, a clunky workflow with several downsides.
Luckily, GitHub Pages has the capability to address these problems already built in. When you make an update to your website and git push it to GitHub, all tracked files get uploaded with it. This means it’s super easy to upload your PDFs to your site and link directly to them. I’ll walk through how to do this with an example PDF called working-paper.pdf.
First, copy the PDF into the files/pdf
directory in your site’s directory. Next we need to tell git about this file, which we do with
git add files/pdf/working-paper.pdf
git commit -m "add working paper"
git push
Don’t forget to add a link to the paper somewhere on your research page so that visitors can access it. Here’s an example of what that link might look like: [Working Paper](/files/pdf/working-paper.pdf)
. And if you want to use the fancy button from my post on customizing your site, you would do this: [Working Paper](/files/pdf/working-paper.pdf){: .btn--research}
.
One of the advantages of the academicpages template is that it is responsive, meaning that layouts change automatically with screen size to present content in the most efficient manner. Take a look at my website on your phone to see how a smaller device changes the site’s layout. When you’re editing your website, it’s a good idea to periodically check how it appears on a phone, as it’s likely that a number of visitors to your site will view it on their phones.1
To do so, you can use tools like Chrome’s device mode, but this can be annoying and doesn’t perfectly capture the experience of navigating your site on a small touchscreen. The best way to do that is, unsurprisingly, to use your actual phone. However, this requires a slight tweak to our usual bundle exec jekyll serve
command. We need to add a --host
argument to the command, where the value of the argument is our computer’s IP address. There are many ways to look this up, but here are two quick ones you can execute from the terminal:
ifconfig en0 | grep inet | grep -v inet6 | awk '{print $2}'
hostname -I
What each of these will do is capture the local IP address of the computer. Often this will be something like 192.168.1.x
or 10.0.0.x
. This won’t let you access the site from outside your network over the internet, but it will let you access it locally on your own network. Once you’ve found your local IP address, you can serve your site on your local network, letting you view it on your phone or tablet. For example, my IP address is 192.168.1.6
, so putting it all together I get:
bundle exec jekyll serve --host 192.168.1.6
This is quite a lot to type, and your computer’s local IP address can change occasionally, so you can’t just keep putting in the same IP address each time. You can save yourself some time by creating an alias for the command. An alias is simply a way to refer to a longer command with a shorter label. To do this, you’ll need to edit your .bash_profile
configuration file.2 The easiest way to do this is to run
nano ~/.bash_profile
This will open up your .bash_profile
in nano, a simple text editor.3 I’ve decided to call my aliased command serve-site
, but you could call it anything you want. Scroll down to the end of the file and add either
alias serve-site="bundle exec jekyll serve --host=$(ifconfig en0 | grep inet | grep -v inet6 | awk '{print $2}')"
for MacOS or
alias serve-site="bundle exec jekyll serve --host=$(hostname -I)"
for Linux. Once you’ve added this line, save the file by pressing ctrl+o and then enter to use the existing filename, overwriting the old version of .bash_profile
. Then press ctrl+x to close nano. The last step is to tell your terminal about this new alias. You can accomplish this with
source ~/.bash_profile
regardless of whether you’re on Linux or MacOS. Now whenever you want to check out your website on a mobile device, you can just navigate to your website’s directory and use the new serve-site
alias to launch it locally.
If you’re trying to figure out how to do this on Windows, I haven’t forgotten about you, I just have no idea how to do this on Windows ¯\_(ツ)_/¯. My recommendation would be to do a lot of googling, or to install the Windows Subsystem for Linux, which will allow you to use a bash shell to interact with your files.
As a senior faculty member once pointed out to me, the search committee member who didn’t fully read your application is most likely to pull up your website on their phone during a committee meeting. ↩
I’m assuming that you’re using bash as your shell. If you’re using a different shell, see this list for which configuration files you should be editing. Other shells may also define aliases in different ways. ↩
Feel free to use a different editor or use the edit
command if you’ve set the default editor to your preferred editor. ↩