Research:Characterizing Wikipedia Citation Usage/First Round of Analysis - Meta-Wiki
This page summarizes the findings of the analysis of the first round of data collection.
We analyzed the frequency of clicks on references linking to an external source, by crossing the data collected with our instrumentation with the page views recorded in the webrequests table in the Data Lake.
Data
Reference click data collection
We collected 10 days of data, from Jun 28th to Jul 9th 2018.
The schema for the data we collected is here: Schema:CitationUsage. The data comes from non-logged-in users only.
More information on the data collection is available on the main project page.
The schema detected around 3M events per day, for a total of 32 million events over the course of 10 days. We detected 4 types of events:
`extClick`: click on an external URL;
`upClick`: click that takes the user from the reference at the bottom back to the anchor (e.g., “[1]”) in the main text (e.g., a click on “^”);
`fnClick`: click on a page-internal link (e.g., “[1]”) that takes the user to the reference section at the bottom;
`fnHover`: hover of at least 1000 ms over a reference (e.g., “[1]”) in an article's main text.
| Date | Number of Events | upClick | extClick | fnClick | fnHover |
|---|---|---|---|---|---|
| 29 June 2018 | 2912911 | 15827 | 1380607 | 605643 | 910834 |
| 30 June | 2509565 | 13259 | 1175389 | 601582 | 719335 |
| 01 July | 2912911 | 14336 | 1292807 | 670076 | 781537 |
| 02 July | 3218551 | 17072 | 1506356 | 679826 | 1015297 |
| 03 July | 3160465 | 15639 | 1478172 | 658227 | 1008406 |
| 04 July | 3015490 | 19191 | 1396499 | 660644 | 939093 |
| 05 July | 3170142 | 21594 | 1473209 | 663434 | 1011812 |
| 06 July | 2980773 | 15261 | 1380396 | 643607 | 941431 |
| 07 July | 2603657 | 13514 | 1216515 | 641242 | 732296 |
| 08 July | 3324488 | 15003 | 1341844 | 692435 | 810974 |
Reference text collection
We also parse the XML dumps to collect information about the text and templates used to reference external sources in English Wikipedia. Below is a plot of the most popular citation templates in English Wikipedia. Each bar represents how many times a given template appears in the references of all English Wikipedia articles.
Page requests data collection
We counted the page views relevant to our analysis by using the table wmf.webrequest. We limited the selection to requests for the English version of Wikipedia, in namespace 0, generated from desktop or mobile web (no app), where the user is not logged in. Additionally, we detect requests potentially generated by bots through regex matching on the user-agent string and discard them, since automated requests are not relevant to our analysis.
We grouped the page views by four variables to allow different stratified analyses: page_id, continent, country_code, access_method.
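The filtering and grouping steps described above can be sketched in plain Python. This is only an illustration under stated assumptions: the record field names (`project`, `namespace_id`, `logged_in`, `user_agent`), the access-method labels, and the bot pattern are hypothetical stand-ins, not the actual columns of wmf.webrequest or the regex used in the analysis.

```python
import re
from collections import Counter

# Hypothetical bot pattern; the regex used in the actual analysis is not given here.
BOT_PATTERN = re.compile(r"bot|crawler|spider", re.IGNORECASE)


def count_pageviews(requests):
    """Count page views per (page_id, continent, country_code, access_method)."""
    counts = Counter()
    for r in requests:
        if r["project"] != "en.wikipedia":      # English Wikipedia only
            continue
        if r["namespace_id"] != 0:              # article namespace only
            continue
        if r["access_method"] not in ("desktop", "mobile web"):  # no app traffic
            continue
        if r["logged_in"]:                      # non-logged-in users only
            continue
        if BOT_PATTERN.search(r["user_agent"]):  # drop likely automated requests
            continue
        key = (r["page_id"], r["continent"], r["country_code"], r["access_method"])
        counts[key] += 1
    return counts


def make_request(**overrides):
    """Build a sample record (all field names are assumed, for illustration only)."""
    record = {
        "project": "en.wikipedia", "namespace_id": 0, "access_method": "mobile web",
        "logged_in": False, "user_agent": "Mozilla/5.0",
        "page_id": 42, "continent": "Europe", "country_code": "GB",
    }
    record.update(overrides)
    return record


sample_requests = [
    make_request(),
    make_request(),
    make_request(user_agent="ExampleBot/1.0"),  # discarded: matches the bot pattern
    make_request(logged_in=True),               # discarded: logged-in user
    make_request(access_method="mobile app"),   # discarded: app traffic
]
pageviews = count_pageviews(sample_requests)
print(pageviews[(42, "Europe", "GB", "mobile web")])
```

Keying the counter on the full tuple keeps every stratum separate, so coarser totals (for example, per country) can be obtained later by summing over the remaining fields.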
This dataset can answer a question like: "How many times was article A loaded by non-logged-in users on mobile devices in the UK?"
Dimensions of Analysis
We analyze the frequency of external clicks according to 4 dimensions.
Topic: we extracted the topic of the ~2 million pages where we recorded events, using the draft topic prediction model from the Scoring Platform team.
Country: we infer the country where the event was generated from the geocoded_data field, available in both the webrequest and our citationusage tables.
Domain: we segmented the clicks on external references according to the domain of the external link (e.g., "www.theguardian.com" or "www.imdb.com"). Below is a plot of the most popular domains in English Wikipedia references, that is, the domains that appear most often across all articles in English Wikipedia. The top cited domains belong to book and newspaper sites.
Number of References in Page: we parse all pages to get the number of references with an external link. Here is a plot of the distribution of pages over number of references: the majority of pages have 1 to 5 external links. Around 1M pages have 0 external links.
How many pages have 0 external links in their references? How many have 1-5? This plot shows the distribution of number of pages vs. number of references.
Number of references linking to external domains, breakdown by domain.
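As an illustration of the Domain dimension, the sketch below segments external links by host name using Python's standard `urllib.parse`. It is an assumption that the domain corresponds to the URL's host, as the examples "www.theguardian.com" and "www.imdb.com" suggest; the function names and sample URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def link_domain(url):
    """Return the lowercased host of an external link, or None if it has none."""
    return urlparse(url).hostname


def count_domains(urls):
    """Count how often each domain appears in a collection of reference URLs."""
    domains = (link_domain(u) for u in urls)
    return Counter(d for d in domains if d)


# Made-up URLs for illustration only.
sample_urls = [
    "https://www.theguardian.com/world/example-article",
    "https://www.theguardian.com/sport/another-article",
    "https://www.imdb.com/title/tt0000000/",
    "not a url",  # no host, so it is skipped
]
domain_counts = count_domains(sample_urls)
print(domain_counts.most_common(2))
```

The same counter can be fed either the links found in article references (domain popularity) or the links recorded in `extClick` events (clicks per domain).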
Next, we compute the ratio between page views and external clicks on these 4 dimensions.
Results
Most Visited References
We looked at the most popular references among readers during our data collection period. We found that the most clicked external references are strongly influenced by events happening during the data collection period. Among the most clicked links we found, for example, news about movie releases that happened during that week, as well as links to websites related to the football World Cup and other popular sports events of that week. To even out the influence of these localized events on the statistics, we might need to collect the second round of data over a longer period.
Breakdown by topic
We compute the total number of sessions with at least one external click, as captured by our schema, aggregate this value by page topic, and divide it by the total number of web requests in each topic. We find that the topics whose external references are clicked most often are Mathematics and Engineering. Note that, since we aggregate data at the session level (and not at the user level), some of these patterns might be biased by the presence of superusers (e.g., a reader interested in mathematics who clicks on external references in every session).
Ratio between sessions with one click on an external reference and all sessions on pages with at least one external link, aggregated by topic
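The ratio described in this section can be sketched as follows. This is a minimal illustration, assuming each `extClick` event carries a session identifier and a page id, and that page views and topics are available as per-page lookups; all names and sample values are hypothetical.

```python
from collections import defaultdict


def click_ratio_by_topic(click_events, pageviews, topic_of):
    """Sessions with at least one external click, divided by page views, per topic.

    click_events: iterable of (session_id, page_id) pairs, one per extClick event
    pageviews:    dict mapping page_id to its number of page views
    topic_of:     dict mapping page_id to its predicted topic
    """
    # A set per topic counts each session once, however many clicks it produced.
    sessions = defaultdict(set)
    for session_id, page_id in click_events:
        topic = topic_of.get(page_id)
        if topic is not None:
            sessions[topic].add(session_id)

    views = defaultdict(int)
    for page_id, n in pageviews.items():
        topic = topic_of.get(page_id)
        if topic is not None:
            views[topic] += n

    return {t: len(sessions[t]) / v for t, v in views.items() if v > 0}


# Made-up data for illustration only.
sample_clicks = [("s1", 1), ("s1", 1), ("s2", 2), ("s3", 3)]
sample_views = {1: 10, 2: 10, 3: 5}
sample_topics = {1: "Mathematics", 2: "Mathematics", 3: "Sports"}
ratios = click_ratio_by_topic(sample_clicks, sample_views, sample_topics)
print(ratios)
```

Replacing `topic_of` with a lookup from event to country gives the analogous per-country ratio.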
Breakdown by country
We compute the total number of sessions with at least one external click, as captured by our schema, aggregate this value by the country of origin of the event, and divide it by the total number of web requests in each country. We find that around 6% of the sessions coming from the US or the UK convert into a click on an external reference. We also find that Iran and some Pacific islands are among the countries with the lowest click-through rates for external citations.
Ratio between sessions with one click on an external reference and all sessions on pages with at least one external link, aggregated by country
Ratio between sessions with one click on an external reference and all sessions on pages with at least one external link, aggregated by country (bottom 20)
Breakdown by domain
Finally, we look at the breakdown of the number of clicks per domain. Below is a plot of the domains in English Wikipedia whose links readers click most often. Although Google Books is the most popular domain in English Wikipedia references, we find that the top-clicked domain is the Internet Archive's Wayback Machine, while Google Books is the second most visited domain, followed by a number of newspapers.
Total clicks on external links, broken down by link domain.