What do Wikipedia’s readers care about? Is Britney Spears more popular than Brittany? Is Asia Carrera more popular than Asia? How many people looked at the article on Santa Claus in December? How many looked at the article on Ron Paul?
What can you find?
Source: http://stats.grok.se/

The wikipediatrend package provides convenient access to daily page view counts (Wikipedia article traffic statistics) stored at http://stats.grok.se/ .

If you want to know how often an article has been viewed over time and work with that data from within R, this package is for you. Maybe you want to compare how much attention articles in different languages got, and when; this package is for you, too. Are you into policy studies or epidemiology? Have a look at page counts for Flu, Ebola, Climate Change or the Millennium Development Goals and maybe build a model or two. Again, this package is for you.

If you simply want to browse Wikipedia page view statistics without all that coding, visit http://stats.grok.se/ and have a look around.

If nothing short of big data will do, get the raw data in its entirety at http://dumps.wikimedia.org/other/pagecounts-raw/ .

If you think days are a crude measure of time, that seconds might do if need be, and that information about which articles the views belong to is useless anyway, go to http://datahub.io/dataset/english-wikipedia-pageviews-by-second .

To get further information on the data source (Who? When? How? How good?) there is a Wikipedia article for that: http://en.wikipedia.org/wiki/Wikipedia:Pageview_statistics and another one: http://en.wikipedia.org/wiki/Wikipedia:About_page_view_statistics .

1 Installation

stable CRAN version

install.packages("wikipediatrend")

development version

devtools::install_github("petermeissner/wikipediatrend")

… and load it via:

library(wikipediatrend)

2 A first try

The workhorse of the package is the wp_trend() function that allows you to get page view counts as neat data frames like this:

page_views <- wp_trend("main_page")
    
page_views
##    date       count    lang page      rank month  title    
## 3  2015-05-01 15195088 en   Main_page 2    201505 Main_page
## 2  2015-05-02 13800408 en   Main_page 2    201505 Main_page
## 1  2015-05-03 19462469 en   Main_page 2    201505 Main_page
## 7  2015-05-04 21295053 en   Main_page 2    201505 Main_page
## 6  2015-05-05 21338940 en   Main_page 2    201505 Main_page
## 5  2015-05-06 21198056 en   Main_page 2    201505 Main_page
## 4  2015-05-07 20128000 en   Main_page 2    201505 Main_page
## 11 2015-05-08 17191834 en   Main_page 2    201505 Main_page
## 10 2015-05-09 19560505 en   Main_page 2    201505 Main_page
## 26 2015-05-10 21168444 en   Main_page 2    201505 Main_page
## 27 2015-05-11 22101221 en   Main_page 2    201505 Main_page
## 28 2015-05-12 22320344 en   Main_page 2    201505 Main_page
## 29 2015-05-13 20420337 en   Main_page 2    201505 Main_page
## 23 2015-05-15 17176625 en   Main_page 2    201505 Main_page
## 24 2015-05-16 15845474 en   Main_page 2    201505 Main_page
## 25 2015-05-17 21462364 en   Main_page 2    201505 Main_page
## 20 2015-05-18 23386371 en   Main_page 2    201505 Main_page
## 21 2015-05-19 22999646 en   Main_page 2    201505 Main_page
## 9  2015-05-20 22486802 en   Main_page 2    201505 Main_page
## 8  2015-05-21 19693422 en   Main_page 2    201505 Main_page
## 19 2015-05-22 16096041 en   Main_page 2    201505 Main_page
## 18 2015-05-23 20243939 en   Main_page 2    201505 Main_page
## 13 2015-05-24 43322132 en   Main_page 2    201505 Main_page
## 12 2015-05-25 21908990 en   Main_page 2    201505 Main_page
## 15 2015-05-26 21954108 en   Main_page 2    201505 Main_page
## 14 2015-05-27 22918926 en   Main_page 2    201505 Main_page
## 16 2015-05-28 19572988 en   Main_page 2    201505 Main_page
## 31 2015-05-29 15049919 en   Main_page 2    201505 Main_page
## 30 2015-05-30 20611835 en   Main_page 2    201505 Main_page
## 
## ... 2 rows of data not shown

… that can easily be turned into a plot …

library(ggplot2)

ggplot(page_views, aes(x=date, y=count)) + 
  geom_line(size=1.5, colour="steelblue") + 
  geom_smooth(method="loess", colour="#00000000", fill="#001090", alpha=0.1) +
  scale_y_continuous( breaks=c(10e6, 15e6, 20e6), 
                      labels=c("10 M","15 M","20 M")) +
  theme_bw()

3 wp_trend() options

wp_trend() has several options and most of them are set to defaults:

page, from = prev_month_start(), to = prev_month_end(), lang = "en", file = wp_cache_file()
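
For orientation, a call with every argument spelled out might look like the following sketch; the argument values are simply the defaults listed above, and the page value is just an example:

page_views <- 
  wp_trend(
    page = "Main_page",
    from = prev_month_start(),
    to   = prev_month_end(),
    lang = "en",
    file = wp_cache_file()
  )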

3.1 page

The page option allows you to specify one or more article titles for which data should be retrieved.

These titles should be given in the same format as shown in your browser's address bar to ensure that the pages are found. If we want page views for the United Nations Millennium Development Goals and the article is found at http://en.wikipedia.org/wiki/Millennium_Development_Goals , the title to pass to wp_trend() should be Millennium_Development_Goals, not Millennium Development Goals, Millennium_development_goals, or any other ‘mostly-like-the-original’ variation.
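
If you start from a prose title, a small helper along these lines can produce the underscore form; url_title() is a made-up name for this sketch, and the capitalization still has to match the article's URL:

# hypothetical helper: trim whitespace and replace spaces by underscores
url_title <- function(x) gsub(" ", "_", trimws(x))

url_title("Millennium Development Goals")
## [1] "Millennium_Development_Goals"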

To ease data gathering, the page argument accepts whole vectors of page titles, and wp_trend() will retrieve data for each of them, one after another.

page_views <- 
  wp_trend( 
    page = c( "Millennium_Development_Goals", "Climate_Change") 
  )
library(ggplot2)

ggplot(page_views, aes(x=date, y=count, group=page, color=page)) + 
  geom_line(size=1.5) + theme_bw()

3.2 from and to

These two options determine the time frame for which data shall be retrieved. The defaults gather the previous month, but they can be set to cover larger time frames as well. Note that there is no data prior to December 2007, so any earlier date will be set to this minimum.

page_views <- 
  wp_trend( 
    page = "Millennium_Development_Goals" ,
    from = "2000-01-01",
    to   = prev_month_end()
  )
library(ggplot2)

ggplot(page_views, aes(x=date, y=count, color=wp_year(date))) + 
  geom_line() + 
  stat_smooth(method = "lm", formula = y ~ poly(x, 22), color="#CD0000a0", size=1.2) +
  theme_bw() 

3.3 lang

This option determines from which Wikipedia the page views shall be retrieved: English, German, Chinese, Spanish, … . The default is "en" for the English Wikipedia. Either pass a single language shorthand, which is then used for all pages, or specify a corresponding language shorthand for each page.

page_views <- 
  wp_trend( 
    page = c("Objetivos_de_Desarrollo_del_Milenio", "Millennium_Development_Goals") ,
    lang = c("es", "en"),
    from = Sys.Date()-100
  )
library(ggplot2)

ggplot(page_views, aes(x=date, y=count, group=lang, color=lang, fill=lang)) + 
  geom_smooth(size=1.5) + 
  geom_point() +
  theme_bw() 

3.4 file

This last option allows storing the data retrieved by a call to wp_trend() in a file, e.g. file = "MyCache.csv". MyCache.csv will be created if it does not exist already, but it will never be overwritten by wp_trend(), which allows you to accumulate data from subsequent calls to wp_trend(). To load the stored data back into R, use wp_load(file = "MyCache.csv").

wp_trend("Cheese", file="cheeeeese.csv")
wp_trend("Käse", lang="de", file="cheeeeese.csv")

cheeeeeese <- wp_load( file="cheeeeese.csv" )
cheeeeeese
##    date       count lang page      rank month  title 
## 33 2015-05-02  179  de   K%C3%A4se 6057 201505 Käse  
## 32 2015-05-03  261  de   K%C3%A4se 6057 201505 Käse  
## 35 2015-05-07  319  de   K%C3%A4se 6057 201505 Käse  
## 41 2015-05-09  193  de   K%C3%A4se 6057 201505 Käse  
## 58 2015-05-11  297  de   K%C3%A4se 6057 201505 Käse  
## 59 2015-05-12  356  de   K%C3%A4se 6057 201505 Käse  
## 53 2015-05-14  218  de   K%C3%A4se 6057 201505 Käse  
## 55 2015-05-16  201  de   K%C3%A4se 6057 201505 Käse  
## 56 2015-05-17  239  de   K%C3%A4se 6057 201505 Käse  
## 51 2015-05-18  267  de   K%C3%A4se 6057 201505 Käse  
## 52 2015-05-19  324  de   K%C3%A4se 6057 201505 Käse  
## 39 2015-05-21  366  de   K%C3%A4se 6057 201505 Käse  
## 49 2015-05-23  173  de   K%C3%A4se 6057 201505 Käse  
## 44 2015-05-24  450  de   K%C3%A4se 6057 201505 Käse  
## 43 2015-05-25  295  de   K%C3%A4se 6057 201505 Käse  
## 46 2015-05-26  250  de   K%C3%A4se 6057 201505 Käse  
## 61 2015-05-30  226  de   K%C3%A4se 6057 201505 Käse  
## 2  2015-05-02 1409  en   Cheese    705  201505 Cheese
## 1  2015-05-03 1642  en   Cheese    705  201505 Cheese
## 6  2015-05-05 2053  en   Cheese    705  201505 Cheese
## 5  2015-05-06 2165  en   Cheese    705  201505 Cheese
## 4  2015-05-07 2983  en   Cheese    705  201505 Cheese
## 28 2015-05-12 2066  en   Cheese    705  201505 Cheese
## 22 2015-05-14 2040  en   Cheese    705  201505 Cheese
## 25 2015-05-17 1421  en   Cheese    705  201505 Cheese
## 21 2015-05-19 1982  en   Cheese    705  201505 Cheese
## 18 2015-05-23 1205  en   Cheese    705  201505 Cheese
## 15 2015-05-26 1877  en   Cheese    705  201505 Cheese
## 31 2015-05-29 1696  en   Cheese    705  201505 Cheese
## 
## ... 33 rows of data not shown

4 Caching

4.1 Session caching

When using wp_trend() you will notice that subsequent calls to the function might take considerably less time than previous ones, provided the later calls ask for data that has already been downloaded. This is due to the caching system running in the background, which keeps track of everything downloaded so far. You can tell that wp_trend() had to download something when it reports one or more links to the stats.grok.se server, e.g. …

wp_trend("Cheese")
## http://stats.grok.se/json/en/201505/Cheese
wp_trend("Cheese")

… while a time frame that is not yet covered by the cache triggers another download …

wp_trend("Cheese", from = Sys.Date()-60)
## http://stats.grok.se/json/en/201504/Cheese

The current cache in memory can be accessed via:

wp_get_cache()
##      date       count    lang page             rank month 
## 3280 2015-01-20       52 ar   %D8%AF%D8%A7 ... -1   201501
## 3329 2015-03-08     2110 ar   %D8%AF%D8%A7 ... -1   201503
## 3379 2014-08-21    18755 de   Islamischer_ ... -1   201408
## 3409 2014-09-14    14530 de   Islamischer_ ... -1   201409
## 3543 2015-01-05     2912 de   Islamischer_ ... -1   201501
## 3604 2015-03-20     3554 de   Islamischer_ ... -1   201503
## 6304 2015-04-26     1674 en   Cheese           705  201504
## 6267 2015-05-06     2165 en   Cheese           705  201505
## 3021 2015-02-06    69654 en   Islamic_Stat ... -1   201502
## 1150 2009-05-10      496 en   Millennium_D ... 7435 200905
## 1208 2009-07-14      493 en   Millennium_D ... 7435 200907
## 1795 2011-03-14     1832 en   Millennium_D ... 7435 201103
## 2114 2012-01-15     1139 en   Millennium_D ... 7435 201201
## 2355 2012-09-23     1668 en   Millennium_D ... 7435 201209
## 2537 2013-03-28     2341 en   Millennium_D ... 7435 201303
## 2621 2013-06-11     2712 en   Millennium_D ... 7435 201306
## 2616 2013-06-16     1920 en   Millennium_D ... 7435 201306
## 2796 2013-11-18     2837 en   Millennium_D ... 7435 201311
## 429  2014-12-23      798 en   Millennium_D ... 7435 201412
## 473  2015-01-04      840 en   Millennium_D ... 7435 201501
## 4868 2011-11-01     8578 en   Syria            1802 201111
## 4949 2012-01-19     6412 en   Syria            1802 201201
## 5262 2012-12-14    10552 en   Syria            1802 201212
## 5322 2013-02-13     6295 en   Syria            1802 201302
## 5811 2014-06-12     6170 en   Syria            1802 201406
## 5929 2014-10-03     5323 en   Syria            1802 201410
## 3827 2015-01-23      320 es   Estado_Isl%C ... -1   201501
## 3837 2015-02-25      368 es   Estado_Isl%C ... -1   201502
## 577  2015-03-18      756 es   Objetivos_de ... 4160 201503
##      title           
## 3280 <U+062F><U+0627><U+0639><U+0634>
## 3329 <U+062F><U+0627><U+0639><U+0634>
## 3379 Islamischer_ ...
## 3409 Islamischer_ ...
## 3543 Islamischer_ ...
## 3604 Islamischer_ ...
## 6304 Cheese          
## 6267 Cheese          
## 3021 Islamic_Stat ...
## 1150 Millennium_D ...
## 1208 Millennium_D ...
## 1795 Millennium_D ...
## 2114 Millennium_D ...
## 2355 Millennium_D ...
## 2537 Millennium_D ...
## 2621 Millennium_D ...
## 2616 Millennium_D ...
## 2796 Millennium_D ...
## 429  Millennium_D ...
## 473  Millennium_D ...
## 4868 Syria           
## 4949 Syria           
## 5262 Syria           
## 5322 Syria           
## 5811 Syria           
## 5929 Syria           
## 3827 Estado_Islám ...
## 3837 Estado_Islám ...
## 577  Objetivos_de ...
## 
## ... 6294 rows of data not shown

… and emptied by a call to wp_cache_reset().
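
In code:

# empty the in-memory cache
wp_cache_reset()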

4.2 Caching across sessions 1

While everything that is downloaded during a session is cached in memory, it may come in handy to save the cache to disk in parallel, so it can be reused in the next R session. To activate disk-caching for a session simply use:

wp_set_cache_file( file = "myCache.csv" )

The function will reload whatever is stored in the file, and subsequent calls to wp_trend() will automatically add data as it is downloaded. The file used for disk-caching can be changed by another call to wp_set_cache_file(file = "myOtherCache.csv"), or disk-caching can be turned off completely by leaving the file argument empty.
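
As a sketch (the file names are placeholders):

# switch disk-caching to another file ...
wp_set_cache_file( file = "myOtherCache.csv" )

# ... or turn disk-caching off again by leaving the file argument empty
wp_set_cache_file()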

4.3 Caching across sessions 2

If disk-caching should be enabled by default, one can define a path in the system/environment variable WP_CACHE_FILE. When the package is loaded it will look for this variable via Sys.getenv("WP_CACHE_FILE") and use the path for caching as if …

wp_set_cache_file( Sys.getenv("WP_CACHE_FILE") )

… had been typed in by the user.
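
One way to set the variable is an entry like WP_CACHE_FILE=~/wp_cache.csv in your .Renviron file; another is to set it from within R before the package is loaded, as in this sketch with a placeholder path:

# make the variable available, then load the package
Sys.setenv(WP_CACHE_FILE = "~/wp_cache.csv")
library(wikipediatrend)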

5 Counts for other languages

If comparing languages is important, one needs to specify the exact article title for each language: while the article about the Millennium Development Goals has an English title in the English Wikipedia, it is of course named differently in Spanish, German, Chinese, … . One might look these titles up by hand or use the handy wp_linked_pages() function like this:

titles <- wp_linked_pages("Islamic_State_of_Iraq_and_the_Levant", "en")
titles <- titles[titles$lang %in% c("en", "de", "es", "ar", "ru"),]
titles 
##   page             lang title           
## 1 Islamic_Stat ... en   Islamic_Stat ...
## 2 %D8%AF%D8%A7 ... ar   <U+062F><U+0627><U+0639><U+0634>
## 3 Islamischer_ ... de   Islamischer_ ...
## 4 Estado_Isl%C ... es   Estado_Islám ...
## 5 %D0%98%D1%81 ... ru   <U+0418><U+0441><U+043B><U+0430><U+043C><U+0441><U+043A><U+043E><U+0435>_<U+0433><U+043E> ...

… then we can use the information to get data for several languages …

page_views <- 
  wp_trend(
    page = titles$page[1:5], 
    lang = titles$lang[1:5],
    from = "2014-08-01"
  )
library(ggplot2)

for(i in unique(page_views$lang) ){
  iffer <- page_views$lang==i
  page_views[iffer, ]$count <- scale(page_views[iffer, ]$count)
}

ggplot(page_views, aes(x=date, y=count, group=lang, color=lang)) + 
  geom_line(size=1.2, alpha=0.5) + 
  ylab("standardized count\n(by lang: m=0, var=1)") +
  theme_bw() + 
  scale_colour_brewer(palette="Set1") + 
  guides(colour = guide_legend(override.aes = list(alpha = 1)))

6 Going beyond Wikipediatrend – Anomalies and mean shifts

6.1 Identifying anomalies with AnomalyDetection

Currently the AnomalyDetection package is not available on CRAN, so we have to get it from elsewhere, e.g. via install_github() from the devtools package or, as in the commented line further below, from a drat repository.
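
The GitHub route might look like the commented line below; the twitter/AnomalyDetection repository name is an assumption:

# devtools::install_github("twitter/AnomalyDetection")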

# install.packages( "AnomalyDetection", repos="http://ghrr.github.io/drat",  type="source")
library(AnomalyDetection)
library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

The package is a little picky about the data it accepts for processing, so we have to build a new data frame. It should contain only the date and count variables. Furthermore, date should be renamed to timestamp and transformed to type POSIXct.

page_views <- wp_trend("Syria", from = "2010-01-01")
## http://stats.grok.se/json/en/201505/Syria
page_views_br <- 
  page_views  %>% 
  select(date, count)  %>% 
  rename(timestamp=date)  %>% 
  unclass()  %>% 
  as.data.frame() %>% 
  mutate(timestamp = as.POSIXct(timestamp))

Having transformed the data, we can detect anomalies via AnomalyDetectionTs(). The function offers various options, e.g. the significance level for rejecting normal values (alpha); the maximum fraction of the data that is allowed to be flagged as anomalous (max_anoms); whether upward deviations, downward deviations, or irregularities in both directions should form the basis of anomaly detection (direction); and, last but not least, whether the time frame for detection is longer than one month (longterm).

Let's choose a greedy set of parameters and detect possible anomalies:

res <- 
AnomalyDetectionTs(
  x         = page_views_br, 
  alpha     = 0.05, 
  max_anoms = 0.40,
  direction = "both",
  longterm  = T
)$anoms

res$timestamp <- as.Date(res$timestamp)

head(res)
##    timestamp anoms
## 1 2010-02-02  5567
## 2 2010-02-04  5191
## 3 2010-01-23     0
## 4 2010-02-03  4322
## 5 2010-02-01  3918
## 6 2010-02-08     0

… and play back the detected anomalies to our page_views data set:

page_views <- 
  page_views  %>% 
  mutate(normal = !(page_views$date %in% res$timestamp))  %>% 
  mutate(anom   =   page_views$date %in% res$timestamp )

class(page_views) <- c("wp_df", "data.frame")

Now we can plot counts and anomalies …

(
  p <-
    ggplot( data=page_views, aes(x=date, y=count) ) + 
      geom_line(color="steelblue") +
      geom_point(data=filter(page_views, anom==T), color="red2", size=2) +
      theme_bw()
)

… as well as compare running means:

p + 
  geom_line(stat = "smooth", size=2, color="red2", alpha=0.7) + 
  geom_line(data=filter(page_views, anom==F), 
            stat = "smooth", size=2, color="dodgerblue4", alpha=0.5) 

It seems that upward and downward anomalies cancel each other out most of the time, since the two smoothed lines (with and without anomalies) do not differ much. Nonetheless, keeping the anomalies in would bias the counts slightly upwards, so we proceed with a cleaned-up data set:

page_views_clean <- 
  page_views  %>% 
  filter(anom==F)  %>% 
  select(date, count, lang, page, rank, month, title)

page_views_br_clean <- 
  page_views_br  %>% 
  filter(page_views$anom==F)

6.2 Identifying mean shifts with BreakoutDetection

BreakoutDetection is a package that allows you to search data for mean level shifts by dividing it into timespans of change and timespans of stability, even in the presence of seasonal noise. Like AnomalyDetection, the BreakoutDetection package is not available on CRAN but has to be obtained from GitHub or another non-CRAN source.
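
Again, the GitHub route might look like the commented line below; the twitter/BreakoutDetection repository name is an assumption:

# devtools::install_github("twitter/BreakoutDetection")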

# install.packages(  "BreakoutDetection",   repos="http://ghrr.github.io/drat", type="source")
library(BreakoutDetection)
library(dplyr)
library(ggplot2)
library(magrittr)

… again the workhorse function (breakout()) is picky and requires “a data.frame which has ‘timestamp’ and ‘count’ components” like our page_views_br_clean.

The function has two general options: one tweaks the minimum length of a timespan (min.size) and the other determines how many mean level changes may occur during the whole time frame (method). In addition there are several method-specific options, e.g. degree, beta, and percent, which control how sensitive the algorithm is to adding further breakpoints. In the following case the last option tells the function that adding a breakpoint has to improve the overall model fit by at least 5 percent.

br <- 
  breakout(
    page_views_br_clean, 
    min.size = 30, 
    method   = 'multi', 
    percent  = 0.05,
    plot     = TRUE
  )
br
## $loc
##  [1]  53 105 137 174 263 306 389 426 458 488 518 566 601 640 670 751 784
## 
## $time
## [1] 1.79
## 
## $pval
## [1] NA
## 
## $plot

In the following snippet we combine the break information with our page views data and can have a look at the dates at which the breaks occurred.

breaks <- page_views_clean[br$loc,]
breaks
##           date count lang  page rank  month title
## 53  2010-02-13  3327   en Syria 1802 201002 Syria
## 105 2010-04-08  5210   en Syria 1802 201004 Syria
## 137 2010-06-03  5176   en Syria 1802 201006 Syria
## 174 2010-07-14  3874   en Syria 1802 201007 Syria
## 263 2010-11-22  6090   en Syria 1802 201011 Syria
## 306 2010-12-05  6182   en Syria 1802 201012 Syria
## 389 2011-04-15  6113   en Syria 1802 201104 Syria
## 426 2011-05-12  8217   en Syria 1802 201105 Syria
## 458 2011-06-07  9442   en Syria 1802 201106 Syria
## 488 2011-07-04  4745   en Syria 1802 201107 Syria
## 518 2011-08-07  7506   en Syria 1802 201108 Syria
## 566 2011-10-22  7449   en Syria 1802 201110 Syria
## 601 2011-11-27  8492   en Syria 1802 201111 Syria
## 640 2012-01-31 11496   en Syria 1802 201201 Syria
## 670 2012-02-22 23197   en Syria 1802 201202 Syria
## 751 2012-06-04  9461   en Syria 1802 201206 Syria
## 784 2012-07-27 17383   en Syria 1802 201207 Syria

Next, we add a span variable capturing which page_views observations belong to which span, so that we can aggregate the data per span.

page_views_clean$span <- 0
for (d in breaks$date ) {
  page_views_clean$span[ page_views_clean$date > d ] %<>% add(1)
}

page_views_clean$mcount <- 0
for (s in unique(page_views_clean$span) ) {
  iffer <- page_views_clean$span == s
  page_views_clean$mcount[ iffer ] <- mean(page_views_clean$count[iffer])
}

spans <- 
  page_views_clean  %>% 
  as_data_frame() %>% 
  group_by(span) %>% 
  summarize(
    start      = min(date), 
    end        = max(date), 
    length     = end-start,
    mean_count = round(mean(count)),
    min_count  = min(count),
    max_count  = max(count),
    var_count  = var(count)
  )
spans
## Source: local data frame [18 x 8]
## 
##    span      start        end length mean_count min_count max_count
## 1     0 2010-01-01 2010-02-13     43       3662         0      4734
## 2     1 2010-02-14 2010-04-08     53       4768         0      8179
## 3     2 2010-04-10 2010-06-03     54       4900      3849      5741
## 4     3 2010-06-04 2010-07-14     40       3454         0      5270
## 5     4 2010-07-20 2010-11-22    125       5760      4172      8711
## 6     5 2010-11-24 2010-12-05     11       5488      4752      6182
## 7     6 2010-12-06 2011-04-15    130       7158      3725     22825
## 8     7 2011-04-16 2011-05-12     26      10075      6713     19661
## 9     8 2011-05-13 2011-06-07     25       6374      3672      9442
## 10    9 2011-06-08 2011-07-04     26       7135      4194     11989
## 11   10 2011-07-05 2011-08-07     33       4940      3395      9729
## 12   11 2011-08-08 2011-10-22     75       5729      3510     12574
## 13   12 2011-10-24 2011-11-27     34       8060      5195     13217
## 14   13 2011-11-28 2012-01-31     64       6479         0     11496
## 15   14 2012-02-01 2012-02-22     21      18015      7005     36378
## 16   15 2012-02-23 2012-06-04    102       9042         0     24728
## 17   16 2012-06-05 2012-07-27     52      12042      6464     25414
## 18   17 2012-07-28 2015-05-31   1037       7287         0    111331
## Variables not shown: var_count (dbl)

Also, we can now plot the shifting mean.

ggplot(page_views_clean, aes(x=date, y=count) ) + 
  geom_line(alpha=0.5, color="steelblue") + 
  geom_line(aes(y=mcount), alpha=0.5, color="red2", size=1.2) + 
  theme_bw()