The GESIS Data Catalogue offers a repository of approximately 5,000 datasets. Due to a lack of an API, however, accessing these datasets in a programmatic and reproducible way is difficult. The gesis package seeks to solve this issue through the use of Selenium and the RSelenium package.
In essence, the gesis package allows the user to emulate a web browser session, wherein he or she logs in to the GESIS website, browses to the data set of interest, clicks to download that dataset, agrees to accept the terms of use, and, ultimately, downloads the dataset. This whole process can be done through R in three steps:
1. Set up the Selenium session (setup_gesis)
2. Log in to the GESIS website (login_gesis)
3. Download the dataset (download_dataset)
The next section describes this workflow by working through a simple example. A second example then shows how the package can be leveraged for more advanced uses.
The first step in using the gesis package is always to set up a Selenium remote driver by running setup_gesis. This function takes care of all the preliminaries for using Selenium, including checking for the existence of a Selenium server, starting such a server, creating a remote driver, and opening a web browser window. In addition, it specifies browser settings such that files can be downloaded without prompting the user. (NB: The setup_gesis function currently only supports Firefox. See the help file for how to use other browsers.)
if(!dir.exists("downloads")) dir.create("downloads")
gesis_remDr <- setup_gesis(download_dir = "downloads")
An empty browser window should now pop up. Leave this window open; this is where we will emulate a session to access the GESIS website.
Next we go to the GESIS main page and log in by providing our user name and password. (To avoid having to provide the user name and password in plain text in a script, the default behavior is to fetch these as options using getOption("gesis_user") and getOption("gesis_pass"). You can thus specify these in your .Rprofile with options("gesis_user" = "myusername", "gesis_pass" = "mypassword").)
login_gesis(gesis_remDr, user = "myusername", pass = "mypassword")
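The options-based approach mentioned above can be sketched as follows. This is a minimal illustration, not part of the gesis package: the credentials are placeholders, and the stopifnot guard is an assumption added here to show how a script might fail early when the options are missing.

```r
# Placeholder credentials; in practice this line would live in ~/.Rprofile
# so that the values never appear in an analysis script.
options(gesis_user = "myusername", gesis_pass = "mypassword")

# Read the credentials back the same way the text says the package does by default.
user <- getOption("gesis_user")
pass <- getOption("gesis_pass")

# Illustrative guard (not part of the package): stop early if either is unset or empty.
stopifnot(is.character(user), nzchar(user), is.character(pass), nzchar(pass))
```

If the defaults behave as described, the user and pass arguments could then be left out of the login call altogether.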
Switching to the browser window opened earlier, we should now see that we are logged in. Now all we have to do is figure out the unique identifier for the data set we are interested in. This is called a “DOI” and can be found on every data set’s description page.
download_dataset(gesis_remDr, doi = 5928, filetype = "dta", purpose = 1)
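The above function will browse to the page of the requested data set, click to download the .dta (Stata) version of this dataset, agree to the terms of use, and download the file.
Finally, we can now check that the downloaded file is in the folder we specified, and then close the browser window and the Selenium server.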
dir("downloads")
gesis_remDr$close()
gesis_remDr$closeServer()
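The check on the download folder can be made a little stricter than a plain directory listing. The sketch below uses a hypothetical helper, find_downloads, which is not part of the gesis package; it lists only files with a given extension. It is demonstrated on a temporary folder with a made-up file name, so it runs without a browser session.

```r
# Hypothetical helper (not part of the gesis package): list files in `dir`
# whose extension matches `ext`, e.g. "dta".
find_downloads <- function(dir, ext) {
  list.files(dir, pattern = paste0("\\.", ext, "$"), ignore.case = TRUE)
}

# Demonstration on a fresh temporary folder with an empty stand-in file
# (the file name is made up for illustration).
demo_dir <- tempfile("gesis_demo")
dir.create(demo_dir)
file.create(file.path(demo_dir, "example_study.dta"))

found <- find_downloads(demo_dir, "dta")
length(found) > 0
```

In the workflow above, the equivalent call would be find_downloads("downloads", "dta"), run before the session is closed.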
To simplify further analysis, the package also provides a convenience function for browsing the codebook of a specified dataset. This function does not require an active Selenium session, but does require that the xml2 package be installed:
browse_codebook(doi = 5928)
The workflow described above is clearly more laborious than just downloading data sets by hand if you are only downloading a handful of data sets. However, many opinion surveys take the form of repeated cross-sections, meaning that each time a survey is conducted it is distributed as a separate file. If one is interested in analyzing these surveys over time, one therefore needs to download a separate data set for each point in time.
An example of such a repeated cross-section is a study called “Atlantic Trends”, for which there are annual surveys between 2002 and 2013. We can easily scrape the DOIs for these data sets.
library(xml2)
# Browsing the gesis website, we find the url for the main page for these studies
url <- "https://dbk.gesis.org/dbksearch/GDesc2.asp?no=0074&ll=10&db=d&notabs=1"
page <- read_html(url)
doi_links <- xml_find_all(page, "//a[contains(text(), 'ZA')]")
doi <- substr(xml_text(doi_links), 3, 7)
str(doi)
## chr [1:16] "4218" "4219" "4220" "4262" "4518" "4746" ...
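To make the extraction step concrete without querying the website, here is a small sketch that applies the same substr call to made-up link texts. The sample strings are an assumption about what the link text looks like (a "ZA" prefix followed by a study number), and the regular-expression variant is an alternative added for illustration, not something the package provides.

```r
# Made-up link texts, assumed to resemble those found on the GESIS page.
link_text <- c("ZA4218", "ZA4219", "ZA4220")

# Same approach as above: taking characters 3 to 7 drops the "ZA" prefix.
doi_substr <- substr(link_text, 3, 7)

# Alternative sketch: capture the digits that follow "ZA", which does not
# depend on the number having a fixed width.
doi_regex <- sub("^ZA([0-9]+).*$", "\\1", link_text)

# Both approaches should agree on these inputs.
identical(doi_substr, doi_regex)
```

The regular-expression form may be the safer choice if the link text ever carries trailing characters after the number.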
Using the gesis package just like before, we can now batch download all these surveys:
# Setup preliminaries
if(!dir.exists("downloads")) dir.create("downloads")
gesis_remDr <- setup_gesis(download_dir = "downloads")
# Log in
login_gesis(gesis_remDr, user = "myusername", pass = "mypassword")
# Loop over DOIs to download
lapply(doi, download_dataset, remDr = gesis_remDr)
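For larger batches, it may be worth pausing between requests and recording failures rather than letting one error abort the whole loop. The sketch below is one way to do that. download_politely and fake_download are hypothetical names introduced here for illustration; neither is part of the gesis package, and the stand-in download function exists only so the example runs without a browser session.

```r
# Hypothetical wrapper (not part of the gesis package): call `download_fun`
# once per DOI, pause between requests, and record failures instead of
# aborting the whole batch.
download_politely <- function(dois, download_fun, pause = 5, ...) {
  results <- vector("list", length(dois))
  names(results) <- dois
  for (i in seq_along(dois)) {
    results[[i]] <- tryCatch(
      {
        download_fun(doi = dois[[i]], ...)
        "ok"
      },
      error = function(e) paste("failed:", conditionMessage(e))
    )
    # Wait before the next request, except after the last one.
    if (i < length(dois)) Sys.sleep(pause)
  }
  unlist(results)
}

# Stand-in for download_dataset so the sketch runs without a browser session;
# it simulates a failure for one DOI.
fake_download <- function(doi, ...) {
  if (doi == "4219") stop("simulated timeout")
  invisible(NULL)
}

status <- download_politely(c("4218", "4219", "4220"), fake_download, pause = 0)
status
```

With a live session, the call would presumably look like download_politely(doi, download_dataset, remDr = gesis_remDr), with pause set to a few seconds; the returned vector then shows which DOIs need a retry.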
Disclaimer: the gesis package is neither affiliated with, nor endorsed by, the Leibniz Institute for the Social Sciences. I have been unable to find any indication that programmatic access to the website is disallowed under its terms of use (indeed, its guidelines appear to encourage it). That said, I would discourage users from using the gesis package to put undue pressure on their servers by initiating unnecessary (or unnecessarily large) batch downloads.