Commons Impact Metrics
The Commons Impact Metrics data product is a collection of datasets designed to provide insight into the impact of community contributions to Wikimedia Commons. So far, the data focuses on media files uploaded by, and categories belonging to, GLAM actors (affiliates, projects, individual contributors, etc., related to galleries, libraries, archives, and museums). This page describes the project, the main properties of the data pipeline, and how to access the data and code. If you are looking for developer documentation on the shape of the data and how to query it, see the data model docs.
Access the Commons Analytics API
Project rationale
There has been a long-standing community request for a data product that would give insight into the impact of Commons contributions. While the WMF has not been able to address this request, the community has created a list of tools that compute such data and serve it via visual web applications: tools such as GLAM Wiki Dashboard, BaGLAMa2, and GLAMorgan. In the couple of years before this project, community members reported difficulties maintaining these tools for several reasons, and the tools became less useful to the community due to data outages, data inconsistency between tools, and the complexity of the calculations. This project aims to improve on those issues by delivering a data product that:
Answers most of the use cases covered by the mentioned tools.
Is robust, not subject to data outages.
Is standardized and can be used consistently across a range of tools.
Provides pre-calculated data that is easy to query and manage.
Properties and caveats
Category allow-list
Because of computational complexity, data size, and dataset semantics, we have scoped this data product to report only on a curated list of GLAM primary categories. Each of those categories belongs to a GLAM institution, event, contributor, project, etc. The data product also reports on all sub-categories under those listed primary categories, and on the media files directly associated with them. The initial allow-list was compiled from the existing tools mentioned above, but it is open to additions. See the current Commons Impact Metrics allow-list on GitLab.
Allow-list updates
The Commons Impact Metrics category allow-list is open to update requests (addition of new categories, and renaming or removal of existing ones). You can request an update to the allow-list on Phabricator (guidelines for the process at m:Wikimedia Foundation Culture and Heritage team/Commons Impact Metrics Requests).
New categories should correspond to the primary (top) category of either:
A Commons mass contributor actor/entity. For instance, the category of a specific museum, library, individual mass contributor, etc. ("Media_contributed_by_someone" or "National_Museum_of_someplace").
An event or a project aimed at generating Commons mass contribution. For instance, an editathon organized to generate mass contribution. ("Wiki Loves Something", "Images_uploaded_as_part_of_some_collaboration").
The category should not refer to other things, such as media locations, media subjects, media formats, or tools used to upload media ("Modern_art", "Wales", "Uploaded_with_some_tool"), especially if the category is vague or overarching, like "Images about art". Such categories could quickly compromise the performance of the data pipeline and make the dumps unusable by the community. Exceptions can be made on a case-by-case basis.
Category renames
If an allow-listed category is renamed in Commons, the Commons Impact Metrics pipeline will cease to calculate metrics for it on its next monthly run. To prevent that, the allow-list has to be updated by replacing the old name with the new name before the end of the month. This can be done using the allow-list update process above. Note that even when an allow-listed category is properly renamed, the data collected before the rename will still be associated with the old name, while the data collected after the rename will be associated with the new name.
No retroactive calculations by default
When new categories are added to the allow-list, the pipeline will calculate metrics for them from the time of the addition onward. By default, there are no retroactive data re-runs or back-fills for new categories: because the calculation needs to happen for all categories at once (not just the new ones), ad-hoc re-runs or back-fills are expensive and impractical in terms of computation and engineering resources. If necessary, it would be possible to have general re-runs every 6 months, which would back-fill data for the last 6 months.
Max depth
In Commons' category graph, most sub-graphs are interconnected: you can start navigating from a sub-graph about a given museum in a given country and end up in a sub-graph about a project on the other side of the world. In practice, if the allow-list mentioned above is big enough, navigating through the listed sub-graphs without limits might result in traversing the whole of Commons' category graph. Because we want to report on GLAM-specific sub-graphs, we impose a limit on how deep an allow-listed category tree is considered. Learn more about why in this deep dive on the algorithm. Currently the max depth is 7: this data product only reports on sub-categories that are at most 7 steps away from the allow-listed primary category. The data also reports on all media files directly associated with any of those categories and sub-categories.
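The depth-limited traversal can be sketched as a breadth-first search. This is an illustrative Python sketch, not the production Spark-Scala implementation; the function name and the adjacency-dict representation of the category graph are assumptions:

```python
from collections import deque

def subcategories_within_depth(graph, root, max_depth=7):
    """Breadth-first traversal of a category graph, stopping at max_depth
    steps from the allow-listed primary category. Returns {category: depth}."""
    seen = {root: 0}
    queue = deque([root])
    while queue:
        cat = queue.popleft()
        depth = seen[cat]
        if depth == max_depth:
            continue  # categories at the depth limit are not expanded further
        for sub in graph.get(cat, []):
            if sub not in seen:  # sub-graphs are interconnected; visit each once
                seen[sub] = depth + 1
                queue.append(sub)
    return seen

# Toy graph: "C" is 2 steps from "A"; "D" is 3 steps away and gets cut off.
graph = {"A": ["B"], "B": ["C"], "C": ["D"]}
print(subcategories_within_depth(graph, "A", max_depth=2))
# → {'A': 0, 'B': 1, 'C': 2}
```

Tracking already-seen categories is what keeps the traversal from looping forever in Commons' highly interconnected graph.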
Aggregated and released monthly
One of the design criteria of this product is that it should be manageable for community members, who usually do not have access to a cluster to run queries on top of hundreds of gigabytes of data. For that reason, we aggregate the data at a monthly granularity, which makes it lighter and more manageable. In addition, because this dataset depends on source data that we currently only ingest at a monthly pace, we can only offer a monthly release schedule.
Pageviews vs. Mediarequests
The previously existing tools developed by the community used two different base metrics: mediarequests and pageviews. In an effort to unify into a single metric, the Data Products team analyzed the pros and cons of each metric. The main ones are listed below:
Mediarequests
Pros:
No monthly drift. Mediarequests are associated directly with media files, so they do not cause any monthly drift.
Cons:
Not associated with a wiki page. We cannot filter or break down mediarequests per wiki page.
Less bot filtering. The Mediarequests pipeline filters out self-identified bots, but not other automated traffic.
Pageviews
Pros:
Associated with a wiki page. Pageviews are associated directly with wiki pages; we can filter and break down pageviews per wiki page.
Better automated traffic detection. Since Pageviews is a core pipeline in WMF's Data Engineering platform, it benefits from the automated traffic detection pipeline.
Cons:
Monthly drift. Pageviews are associated directly with wiki pages, which causes monthly drift. See the corresponding section for more details.
Data Products chose Pageviews as the base metric for the dataset, based on the combined evaluation of all pros and cons. However, at several points during the project, some community members noted that the monthly drift problem was a significant drawback. Data Products agrees, and plans to mitigate the monthly drift in the future. Note: Mediarequests (outside the context of Commons Impact Metrics) are already publicly available in the form of dumps and an API (AQS).
Monthly drift
The base metric used for the Commons Impact Metrics data product is currently Pageviews. More specifically, Pageviews to wiki pages containing media files categorized under an allow-listed category tree.
The problem arises when a media file belonging to an allow-listed category is added to a wiki page. The only way of knowing the exact date of addition is parsing the wikitext history for media file updates; this is possible in theory, but it would require a long time and a big engineering effort. Instead, the pipeline approximates the date of addition by querying MediaWiki's imagelinks table. However, it can only do so at a monthly pace, since the current MediaWiki database imports to Data Engineering's data lake happen monthly. As a result, we only know the month a media file was added to a wiki page (not the day or hour), so we can only calculate pageview aggregations for the full month. Even when a media file is added, for example, on the 15th of the month, the pipeline will aggregate pageviews for the corresponding wiki page since the 1st of the month, thus overcounting the pageviews from the 1st to the 14th.
Note that the monthly drift only happens in the month a media file is added to a wiki page. It does not happen in subsequent months, because the media file is then present from the first to the last day of the month and all pageviews are rightfully counted. Nor does it change past aggregations for that media file or its associated categories (a known issue of previous tools); it only affects metrics moving forward.
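The overcounting described above can be illustrated with a small Python calculation. The numbers and the function are hypothetical, purely for illustration:

```python
def month_aggregation(daily_views, day_added):
    """Illustrate the monthly drift: the pipeline only knows the month a
    media file was added to a page, so it counts the page's pageviews for
    the whole month. Returns (reported total, overcounted drift)."""
    reported = sum(daily_views)                  # what the pipeline reports
    actual = sum(daily_views[day_added - 1:])    # views while the file was on the page
    return reported, reported - actual

# A page with 10 views per day over a 30-day month; the file is added on the 15th.
reported, drift = month_aggregation([10] * 30, day_added=15)
print(reported, drift)  # → 300 140: 300 reported views, 140 of them drift
```

In the subsequent month the file is present from day 1, so `day_added=1` and the drift is zero, matching the note above.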
Potential mitigations of the monthly drift include:
Adding Mediarequests to the data product
Increasing the granularity and release schedule of the data product to daily
How to access the data
The Commons Impact Metrics data pipeline populates different datastores where the data can be queried:
the Commons Analytics Query Service API (AQS);
the Commons Impact Metrics dumps; and
the Data Engineering team's data lake.
The data is also ingested into Cassandra, but that is just for internal AQS consumption, not for user queries.
Data lake
Data Engineering's data lake stores the base datasets of the Commons Impact Metrics product. They are the basis from which the dumps are formatted, and also from which the Cassandra tables that serve AQS are populated. That said, they can also be directly queried by people who have access to the data lake (i.e., who have WMF Kerberos credentials). You do not need further "analytics-privatedata-users" permissions to access this data, since it is not private. There are 5 base datasets for Commons Impact Metrics (read more about the data model):
All 5 tables live in the wmf_contributors Hive database and are stored as Iceberg tables:
commons_category_metrics_snapshot (HDFS location /wmf/data/wmf_contributors/commons/category_metrics_snapshot): metrics about CIM categories.
commons_media_file_metrics_snapshot (HDFS location /wmf/data/wmf_contributors/commons/media_file_metrics_snapshot): metrics about CIM media files.
commons_pageviews_per_category_monthly (HDFS location /wmf/data/wmf_contributors/commons/pageviews_per_category_monthly): aggregated pageview counts for CIM categories.
commons_pageviews_per_media_file_monthly (HDFS location /wmf/data/wmf_contributors/commons/pageviews_per_media_file_monthly): aggregated pageview counts for CIM media files.
commons_edits (HDFS location /wmf/data/wmf_contributors/commons/edits): CIM edit events.
Dumps
The Commons Impact Metrics dumps consist of 5 public datasets updated on a monthly schedule. They follow exactly the same data model as the data lake datasets above; see Commons Impact Metrics/Data Model for the data model details. Anyone can download them. Take into account that:
They are formatted in TSV (tab separated values).
They are compressed using Bzip2.
Some fields contain lists of strings; in that case, the strings are separated by | (pipe) symbols.
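A minimal Python sketch of reading the dump format described above (bzip2-compressed TSV with pipe-separated list fields). The column names in the usage example are illustrative assumptions, not the real dump schema; see the data model docs for the actual columns:

```python
import bz2
import csv

def parse_cim_rows(lines, list_fields=()):
    """Parse Commons Impact Metrics dump rows from an iterable of TSV lines.
    Fields named in list_fields hold |-separated lists of strings."""
    for row in csv.DictReader(lines, delimiter="\t"):
        for field in list_fields:
            row[field] = row[field].split("|") if row[field] else []
        yield row

def parse_cim_dump(path, list_fields=()):
    """Stream rows straight out of a bzip2-compressed .tsv.bz2 dump file."""
    with bz2.open(path, "rt", newline="") as f:
        yield from parse_cim_rows(f, list_fields)

# Illustrative column names only (not the real schema):
sample = ["category\twiki_pages\n", "Some_Category\tenwiki|frwiki\n"]
rows = list(parse_cim_rows(sample, list_fields=("wiki_pages",)))
print(rows)  # → [{'category': 'Some_Category', 'wiki_pages': ['enwiki', 'frwiki']}]
```

Streaming through `bz2.open` avoids decompressing a whole monthly dump into memory at once.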
API
The Commons Impact Metrics data is also served publicly (without authentication) via the Analytics Query Service API (AQS). The service has 14 endpoints that you can query with different parameters. To use the API, see the Analytics API documentation.
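As a sketch of what an unauthenticated AQS query looks like from Python: the base URL, endpoint name, and path parameters below are illustrative assumptions, not the confirmed endpoint list; check the Analytics API documentation for the real endpoints and parameters.

```python
import json
from urllib.request import Request, urlopen

# Assumed base URL for the commons-analytics AQS service (illustrative only).
BASE = "https://wikimedia.org/api/rest_v1/metrics/commons-analytics"

def build_url(endpoint, *path_params):
    """Assemble an AQS request URL from an endpoint name and its path parameters."""
    return "/".join([BASE, endpoint, *path_params])

def fetch(url):
    """Issue an unauthenticated GET request and decode the JSON response."""
    req = Request(url, headers={"User-Agent": "cim-example/0.1 (contact@example.org)"})
    with urlopen(req) as resp:
        return json.load(resp)

# Hypothetical query: monthly metrics for one category over a date range.
url = build_url("pageviews-per-category-monthly", "Some_Category", "20240101", "20240601")
print(url)
```

Setting a descriptive User-Agent is standard etiquette for Wikimedia APIs, so clients can be identified and contacted if they cause trouble.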
Pipeline architecture
The Commons Impact Metrics data pipeline consists of 4 pieces of software:
the transformation of the source data into the base datasets;
the generation of the dumps;
the transformation and loading of the base data into Cassandra; and
the AQS service (which consumes the data in Cassandra).
Base datasets
There is an Airflow Directed Acyclic Graph (DAG) that waits for source data to be present, and processes it to generate the 5 base datasets and store them in Hive (Iceberg) tables. It uses a Spark-Scala module to put the Commons category graph together, and a set of SparkSQL queries (excluding the ones starting with dump_) to compute the final data on top of the category graph. The DAG executes on a monthly schedule. The source data includes several MediaWiki tables (imported monthly into the data lake via Gobblin), i.e. page, image, imagelinks, categorylinks, etc.; it also includes wmf.pageview_hourly and other minor data lake tables.
Dumps
There is another Airflow DAG that triggers once the process for the base datasets above has finished, and produces the new monthly release of the Commons Impact Metrics dumps files. It uses a set of SparkSQL queries (the ones starting with dump_) to extract and format the files. There is also a README file, which can be modified in puppet.
Cassandra loading
There is a third Airflow DAG that also triggers once the base datasets process has finished, and loads data to 14 different Cassandra tables, each designed to serve an AQS endpoint. The DAG uses a set of SparkSQL queries (the ones starting with load_cassandra_commons) to extract and format the data into the expected shape.
AQS service
Finally, we have an AQS service named commons-analytics that serves the data stored in Cassandra through 14 endpoints. The service uses a common generic AQS library named aqsassist, and also this testing environment.