Event Platform/Schemas - Wikitech
From Wikitech
Motivation and Overview
Event Schemas are essential for an Event Streaming Platform. They allow disparate continuously changing producers and consumers to reliably communicate with each other. By explicitly declaring the shape of data, schemas ease integration between various systems.
Schemas should be readily available to any producer or consumer code that might need them. Schemas are needed to validate data, but they can also be used to automate data integration tasks, e.g. automatically creating the SQL tables into which events will be imported. Access to those schemas should be reliable, and the schemas themselves should be immutable for any given deployed service.
WMF uses JSON as our preferred in-flight data serialization format, and as such we have chosen* to use JSONSchema for our event schemas. Schema evolution is necessary to be able to reliably upgrade producer and consumer code, but unfortunately, JSONSchema does not have any built-in features for schema evolution. Therefore, each change (even a small one) requires the creation of a totally separate JSONSchema file.
WMF has chosen to distribute schemas using Git. This allows us to do development, CI, versioning and deployment for schemas the same way we do any code project. However, even though we use Git, we do not rely on Git history for schema versioning. Each schema version is an explicit static file in the schema repository. For more background, see RFC: Modern Event Platform: Schema Registry.
To make developing many schema version files in Git easier, WMF has developed the jsonschema-tools library. This tooling makes it easier for developers to design and evolve schemas dynamically, while allowing production services to use static and immutable versions of those schemas. jsonschema-tools will be used in the rest of this documentation to set up and develop schemas in a Git schema repository. Please skim the jsonschema-tools README before proceeding.
jsonschema-tools is a NodeJS module, so you'll need a recent (Node 10 or greater) version of NodeJS and npm installed. You can get NodeJS and npm at nodejs.org. Once installed, cd to the schema repository and run npm install.
Heads-up: the full path to the directory cannot contain spaces. For example, ~/Documents/analytics\ engineering/event\ schemas/primary is likely to yield errors, but ~/Documents/analytics-engineering/event-schemas/primary would be fine.
*There are plenty of other schema technologies out there (Avro, Thrift, etc.), but JSON and JSONSchema fit our use cases better than any of those. (For more information about how JSONSchema was chosen, see
RFC: Modern Event Platform - Choose Schema Tech
and
Event Schema Design Rules and Conventions
Event Platform/Schemas/Guidelines
Schema Repositories
A schema repository is a Git repository with a hierarchy of versioned JSONSchema files, with a file layout something like:
jsonschema
└── analytics
    ├── button
    │   ├── click
    │   │   ├── 1.0.0 -> 1.0.0.yaml
    │   │   ├── 1.0.0.yaml
    │   │   ├── current.yaml
    │   │   └── latest -> 1.0.0
    │   └── release
    │       ├── 1.0.0 -> 1.0.0.yaml
    │       ├── 1.0.0.yaml
    │       ├── 1.0.1 -> 1.0.1.yaml
    │       ├── 1.0.1.yaml
    │       ├── current.yaml
    │       └── latest -> 1.0.1
    └── page_preview
        └── visibility_change
            ├── 1.0.0 -> 1.0.0.yaml
            ├── 1.0.0.yaml
            ├── 2.0.0 -> 2.0.0.yaml
            ├── 2.0.0.yaml
            ├── current.yaml
            └── latest -> 2.0.0
JSONSchema has title and $id fields that we use to associate event data with a schema, as well as for semantically versioning schemas. The actual hierarchy layout shown here is arbitrary, but each schema's title and $id must match the layout in a specific way. More on this below.
Note the 'current.yaml' files. These files represent the current working version of the schema. The current schemas are never themselves used as a schema for validation or data integration. Instead, they are 'materialized' by jsonschema-tools into static versioned schema files. These versioned schema files are the canonical schemas used by event processing systems.
Hierarchy Rules
Each schema's title should match its relative path in the schema repository. E.g. all schema version files in namespace1/entity1/verbB should have title: namespace1/entity1/verbB. Each schema's $id field should be set to the path (starting with /) and (extensionless) version. E.g. namespace1/entity1/verbB/1.0.1.yaml should have $id: /namespace1/entity1/verbB/1.0.1.
This layout, combined with the title and $id, allows event data to point specifically to its schema via relative URIs. By semantically versioning schema files, jsonschema-tools is able to associate schemas with the same title and enforce backwards compatibility. The relative and versioned $id URIs can also be used as JSON $ref links and with JSON Pointers. More on this below as well.
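The hierarchy rules above can be sketched as a small check. This is only an illustration: the function names expectedTitleAndId and checkHierarchy are hypothetical and are not part of jsonschema-tools, and the schema is assumed to have already been parsed into a plain object.

```javascript
const path = require('path');

// Derive the title and $id a versioned schema file is expected to carry,
// given its path relative to the schema base directory
// (e.g. namespace1/entity1/verbB/1.0.1.yaml).
function expectedTitleAndId(relativeFilePath) {
  const posixPath = relativeFilePath.split(path.sep).join('/');
  const dir = path.posix.dirname(posixPath);          // namespace1/entity1/verbB
  const file = path.posix.basename(posixPath);        // 1.0.1.yaml
  const version = file.replace(/\.(yaml|json)$/, ''); // 1.0.1
  return { title: dir, $id: `/${dir}/${version}` };
}

// Return a list of human-readable problems; empty when the schema conforms.
function checkHierarchy(relativeFilePath, schema) {
  const expected = expectedTitleAndId(relativeFilePath);
  const problems = [];
  if (schema.title !== expected.title) {
    problems.push(`title should be ${expected.title}`);
  }
  if (schema.$id !== expected.$id) {
    problems.push(`$id should be ${expected.$id}`);
  }
  return problems;
}

console.log(expectedTitleAndId('namespace1/entity1/verbB/1.0.1.yaml'));
```

The repository tests shipped with jsonschema-tools (described under Testing schemas) are the authoritative check; this sketch only restates the two rules in code.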
Creating a new schema repository
Most likely you will already be working with a schema repository. If so, skip to Creating a new schema or Modifying schemas.
jsonschema-tools is a NodeJS library and CLI for managing JSONSchema Git repositories. To create a new schema repository, you'll create a package.json file, install and configure jsonschema-tools, and set up jsonschema-tools tests.
mkdir my_schema_repository
cd my_schema_repository
git init

# Our schemas will go in the jsonschema/ directory
mkdir jsonschema

# Create a configuration file for jsonschema-tools.
echo -e 'schemaBasePath: ./jsonschema/\nlogLevel: info' > .jsonschema-tools.yaml

# Create a package.json file. (Modify this as desired.)
echo '{
  "name": "my_schema_repository",
  "scripts": {
    "test": "mocha test/jsonschema",
    "build-modified": "jsonschema-tools materialize-modified --no-git-add",
    "build-new": "jsonschema-tools materialize"
  },
  "devDependencies": {
    "@wikimedia/jsonschema-tools": "^0.6.0",
    "mocha": "^6.2.0"
  }
}' > package.json

# Install jsonschema-tools.
npm install

# Install jsonschema-tools tests.
mkdir -p test/jsonschema
echo "'use strict';
require('@wikimedia/jsonschema-tools').tests.all({ logLevel: 'info' });" > test/jsonschema/repository.test.js

# Create the first git commit.
echo 'node_modules**' >> .gitignore
git add .
git commit -m 'New schema repository'
Creating a new schema
Once you are working in a repository with jsonschema-tools, you can create new schemas. By 'new schema', we mean a brand new schema lineage, not just a new schema version. To create a new schema, we need to first decide on its title (and hierarchy), create the directory structure, write a new current.yaml schema file, and materialize the schema. For this example, we'll create a new event schema that represents a MediaWiki UI button click.
NOTE: since you will be writing JSONSchema, you should know how to do that. See this tutorial and reference for help working with JSONSchema.
mkdir -p jsonschema/mediawiki/desktop/button/click
Open jsonschema/mediawiki/desktop/button/click/current.yaml. We'll build this up piece by piece and explain each part.
Schema metadata
First we need some schema metadata that describe and identify the schema.
Note that this schema metadata is not describing any aspect of your event data.
# This is the title of the schema.
# It should match the relative path to this file's parent directory.
title: mediawiki/desktop/button/click
# Document what the schema represents.
description: Mediawiki desktop web button clicked
# The $id uniquely identifies this schema. It should be a versioned (and extensionless) URI.
$id: /mediawiki/desktop/button/click/1.0.0
# This is the meta-schema of this schema. This should probably always be the same
# for every schema, and should point to the main JSONSchema meta-schema at json-schema.org.
$schema: 'https://json-schema.org/draft-07/schema#'
Event fields
...continuing on to event data fields. Your event should be a JSON object with each field explicitly declared here.
type: object
additionalProperties: false
properties:
Required event data
In addition to the $schema field, WMF has defined common fields for event data. These common fields give us some consistency across all event data, and are also used to support backend functionality (deduplication, Hive table ingestion, etc.).
$schema
Each event needs to identify its schema. Right now we are just writing the schema, but later on your code will produce JSON event data that conforms to this schema. We need to be able to look up the schema for any given event just from the event data itself. To do this, we re-use the JSONSchema $schema field in the event properties.
  $schema:
    type: string
    description: >
      The URI identifying the JSONSchema for this event. This should be
      a short URI containing only the name and version at the end of the
      URI path. e.g. /schema_name/1.0.0 is acceptable. This should match
      the schema's $id field.
Timestamps: meta.dt and dt
These timestamps have different semantics, but in most cases they will be very close, if not the same. These are both ISO-8601 UTC datetime strings, e.g. '2020-07-01T00:00:00Z'.
Every event happens at a certain date-time. That event time should be stored in the dt field.
meta.dt is the system ingestion time, i.e. the time at which the intake system has received the event. Depending on the pipeline your event is flowing through, this might be set at different levels. For events that are received first by our intake service (EventGate), EventGate will set it if the client has not already done so.
NOTE: meta.dt will be used as the Kafka timestamp as well as for Hive hourly partitioning. You should allow EventGate to fill in this field so that you don't end up with incorrect timestamps.
See also T240460 and T267648.
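As a minimal sketch of the distinction above, a producer might set dt itself and leave meta.dt unset for the intake service to fill in. The helper name stampEventTime is hypothetical and not part of any Event Platform client library. Note that JavaScript's Date.prototype.toISOString() always includes milliseconds, which is still a valid ISO-8601 UTC datetime string.

```javascript
// Return a copy of `event` with the event time stored in `dt`.
// `meta.dt` is deliberately left untouched so that the intake service
// can fill it in when it receives the event.
function stampEventTime(event, eventTime = new Date()) {
  return { ...event, dt: eventTime.toISOString() };
}

const clicked = stampEventTime(
  { meta: { stream: 'mediawiki.desktop.button-click' }, button_name: 'Edit source' },
  new Date(Date.UTC(2020, 6, 1)) // months are zero-based: 6 is July
);

console.log(clicked.dt);      // 2020-07-01T00:00:00.000Z
console.log(clicked.meta.dt); // undefined
```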
meta.stream
Every event should belong to a named dataset. While events are in flight, this dataset is called a stream of events. Each event needs to specify which stream it belongs to. For example, the resource_change schema is re-used in the `mediawiki.resource_change`, `transcludes.resource_change`, `change-prop.retry.resource_change`, etc. streams. You might want to design a single button_clicked schema that covers all button clicks, but keep the different types of button click events in different streams.
We do this using the meta.stream field. meta.stream is used for routing incoming events to specific streams and downstream 'datasets'. Each distinct meta.stream will correspond with certain Kafka topics and a Hive table. In most cases, the Kafka topic will be the stream name prefixed with the datacenter name where the event was received.
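The routing idea can be illustrated with a short sketch. Everything here is an assumption made for illustration: kafkaTopicFor is a hypothetical helper, 'eqiad' is used only as a sample datacenter name, and the text above does not specify the separator, so joining the prefix and stream name with a dot is a guess.

```javascript
// Hypothetical helper: derive a Kafka topic name for an event from its
// meta.stream and the datacenter that received it.
// ASSUMPTION: the datacenter prefix and stream name are joined with a dot.
function kafkaTopicFor(event, datacenter) {
  if (!event.meta || typeof event.meta.stream !== 'string' || event.meta.stream === '') {
    throw new Error('event.meta.stream must be a non-empty string');
  }
  return `${datacenter}.${event.meta.stream}`;
}

const topic = kafkaTopicFor(
  { meta: { stream: 'mediawiki.desktop.button-click' } },
  'eqiad'
);
console.log(topic); // eqiad.mediawiki.desktop.button-click
```

Because the topic is derived from meta.stream rather than from the schema, events sharing one schema can still land in different streams.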
There are a few more common and optional meta fields that WMF defines, but we don't need to explain them all here. For now we will write out just these 2 example meta fields. Later we will show how to include the event meta schema using $ref.
  ### Metadata object. All events schemas should have this.
  meta:
    type: object
    properties:
      dt:
        type: string
        # Whenever a format is used on a field, we require that maxLength is also set.
        # See https://github.com/epoberezkin/ajv#security-risks-of-trusted-schemas
        format: date-time
        maxLength: 128
        description: Time stamp of the event, in ISO-8601 format
      stream:
        type: string
        description: Name of the stream/queue that this event belongs in
    required:
      - stream
Event data fields
Finally we can add any fields that we really want our event to have.
  button_name:
    type: string
    description: Name of the button that was clicked
  page_title:
    type: string
    description: Page the button appeared on when clicked
The new schema
Here is the new schema we just wrote:
title: mediawiki/desktop/button/click
description: Mediawiki desktop web button clicked
$id: /mediawiki/desktop/button/click/1.0.0
$schema: 'https://json-schema.org/draft-07/schema#'
type: object
properties:
  $schema:
    type: string
    description: >
      The URI identifying the JSONSchema for this event. This should be
      a short URI containing only the name and version at the end of the
      URI path. e.g. /schema_name/1.0.0 is acceptable. This often will
      (and should) match the schema's $id field.
  ### Metadata object. All events schemas should have this.
  meta:
    type: object
    properties:
      dt:
        type: string
        format: date-time
        maxLength: 128
        description: Time stamp of the event, in ISO-8601 format
      stream:
        type: string
        description: Name of the stream/queue that this event belongs in
    required:
      - dt
      - stream
  button_name:
    type: string
    description: Name of the button that was clicked
  page_title:
    type: string
    description: Page the button appeared on when clicked
examples:
  - {
      "$schema": "/mediawiki/desktop/button/click/1.0.0",
      "meta": {
        "dt": "2019-01-01T00:00:00Z",
        "stream": "mediawiki.desktop.button-click"
      },
      "button_name": "Edit source",
      "page_title": "Delayed-choice quantum eraser"
    }
Note the examples. This is optional, but can be nice if you want to give schema readers an example of what you expect event data to look like. Notice how the event's $schema matches exactly the schema's $id.
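The relationship between an example's $schema and the schema's $id is easy to check mechanically. The sketch below assumes the YAML schema has already been parsed into a plain object; mismatchedExamples is a hypothetical helper, not a jsonschema-tools function.

```javascript
// Return the examples whose $schema does not equal the schema's own $id.
function mismatchedExamples(schema) {
  return (schema.examples || []).filter(
    (example) => example.$schema !== schema.$id
  );
}

// A trimmed-down, already-parsed version of the schema written above.
const buttonClickSchema = {
  title: 'mediawiki/desktop/button/click',
  $id: '/mediawiki/desktop/button/click/1.0.0',
  examples: [
    {
      $schema: '/mediawiki/desktop/button/click/1.0.0',
      meta: { dt: '2019-01-01T00:00:00Z', stream: 'mediawiki.desktop.button-click' },
      button_name: 'Edit source',
      page_title: 'Delayed-choice quantum eraser',
    },
  ],
};

console.log(mismatchedExamples(buttonClickSchema).length); // 0
```

A check like this is most useful after bumping the version in $id, when it is easy to forget to update the examples.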
Materializing the schema
jsonschema-tools calls the process of dereferencing, merging and generating the static versioned files 'materializing'. So far, we've saved our new schema as ./jsonschema/mediawiki/desktop/button/click/current.yaml. current.yaml will be the 'current working copy' of a schema. It can contain $ref URI pointers (more on this below). Any changes we make to schemas should always be done on their current.yaml files. We'll use jsonschema-tools to materialize current.yaml into a statically versioned schema file.
WMF's schema repositories are set up with npm scripts to help materialize schemas. (These scripts are just wrappers of the jsonschema-tools CLI).
To materialize a new schema, you'll run npm run build-new:
# materialize the new current.yaml schema
npm run build-new ./jsonschema/mediawiki/desktop/button/click/current.yaml
2022-03-21 13:57:14.397 +0000 Dereferencing schema with $id /mediawiki/desktop/button/click/1.0.0 using schema base URIs ./jsonschema/,https://schema.wikimedia.org/repositories/primary/jsonschema/
2022-03-21 13:57:14.424 +0000 Materialized schema at jsonschema/mediawiki/desktop/button/click/1.0.0.json.
2022-03-21 13:57:14.425 +0000 Materialized schema at jsonschema/mediawiki/desktop/button/click/1.0.0.yaml.
2022-03-21 13:57:14.425 +0000 Created latest symlink jsonschema/mediawiki/desktop/button/click/latest.json -> 1.0.0.json.
2022-03-21 13:57:14.426 +0000 Created latest symlink jsonschema/mediawiki/desktop/button/click/latest.yaml -> 1.0.0.yaml.
2022-03-21 13:57:14.426 +0000 Created extensionless symlink jsonschema/mediawiki/desktop/button/click/1.0.0 -> 1.0.0.yaml.
2022-03-21 13:57:14.427 +0000 Created latest symlink jsonschema/mediawiki/desktop/button/click/latest -> 1.0.0.yaml.

# Git add the new current.yaml schema and the materialized schema files.
git add ./jsonschema/mediawiki/desktop/button/click/*
git commit -m 'Created mediawiki/desktop/button/click 1.0.0 schema'
The version to materialize will be obtained from the value of $id in current.yaml. By default, both yaml and json files will be materialized, and the versioned extensionless symlink will point to the versioned yaml file.
Alternatively you can manually materialize a schema using the jsonschema-tools CLI. See $(npm bin)/jsonschema-tools --help for more information (on npm 9 and later, where npm bin is no longer available, use npx jsonschema-tools --help).
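To make the naming in the log output above concrete, here is a sketch that derives the version from a schema's $id and lists the files and symlinks reported for it. The function names are hypothetical, the default base path of jsonschema is taken from the example repository, and the sketch only mirrors the default behaviour described here; it does not create any files.

```javascript
// Split a schema $id such as /mediawiki/desktop/button/click/1.0.0 into
// its title and semantic version.
function parseSchemaId($id) {
  const match = /^\/(.+)\/(\d+\.\d+\.\d+)$/.exec($id);
  if (!match) {
    throw new Error(`Not a versioned schema id: ${$id}`);
  }
  return { title: match[1], version: match[2] };
}

// List the files and symlinks (link path -> target) that the log output
// reports when one schema version is materialized with default settings.
function materializedPaths($id, schemaBasePath = 'jsonschema') {
  const { title, version } = parseSchemaId($id);
  const dir = `${schemaBasePath}/${title}`;
  return {
    files: [`${dir}/${version}.yaml`, `${dir}/${version}.json`],
    symlinks: {
      [`${dir}/${version}`]: `${version}.yaml`,
      [`${dir}/latest`]: `${version}.yaml`,
      [`${dir}/latest.yaml`]: `${version}.yaml`,
      [`${dir}/latest.json`]: `${version}.json`,
    },
  };
}

console.log(materializedPaths('/mediawiki/desktop/button/click/1.0.0'));
```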
Modifying schemas
Versioned schemas should be (mostly) immutable. Once committed and merged, they may be used by many active producers and consumers. An existing version should not be changed (if you think you need to do so, get in touch with the Analytics or Core Platform Engineering teams). Instead, to modify a schema, create a new backwards compatible version.
Let's add a user_id to our event data. Edit jsonschema/mediawiki/desktop/button/click/current.yaml and add the following at the bottom of the schema.
  # ...
  user_id:
    type: string
    description: ID of the user
# Add a user_id onto our examples field too:
examples:
  - {
      "$schema": "/mediawiki/desktop/button/click/1.1.0",
      "meta": {
        "dt": "2019-01-01T00:00:00Z",
        "stream": "mediawiki.desktop.button-click"
      },
      "button_name": "Edit source",
      "page_title": "Delayed-choice quantum eraser",
      "user_id": "123"
    }
Since we've changed the schema, we MUST manually change the version in the schema's $id field. According to semantic versioning, our addition of the user_id field should be a minor version increment. So change $id to:
$id: /mediawiki/desktop/button/click/1.1.0
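The version bump is a manual edit, but the arithmetic follows semantic versioning and can be sketched as below. bumpSchemaId is a hypothetical helper for illustration only; jsonschema-tools expects you to edit $id in current.yaml yourself.

```javascript
// Return a new $id with the major, minor or patch component incremented,
// resetting the lower components as semantic versioning prescribes.
function bumpSchemaId($id, level) {
  const match = /^(.*)\/(\d+)\.(\d+)\.(\d+)$/.exec($id);
  if (!match) {
    throw new Error(`Not a versioned schema id: ${$id}`);
  }
  let [major, minor, patch] = match.slice(2).map(Number);
  if (level === 'major') {
    major += 1; minor = 0; patch = 0;
  } else if (level === 'minor') {
    minor += 1; patch = 0;
  } else if (level === 'patch') {
    patch += 1;
  } else {
    throw new Error(`Unknown level: ${level}`);
  }
  return `${match[1]}/${major}.${minor}.${patch}`;
}

// Adding the optional user_id field is a backwards compatible addition,
// so it is a minor increment.
console.log(bumpSchemaId('/mediawiki/desktop/button/click/1.0.0', 'minor'));
// /mediawiki/desktop/button/click/1.1.0
```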
npm run build-modified is able to detect any current.yaml files that have been modified by checking their git status. Before staging the modified schema in Git, run this to materialize all modified current.yaml files:
npm run build-modified
> schemas-event-secondary@1.0.0 build-modified /home/user/my_schema_repository
> jsonschema-tools materialize-modified -G
2022-03-21 13:59:30.321 +0000 Looking for modified current.yaml schema files in ./jsonschema/
2022-03-21 13:59:30.380 +0000 Materializing /home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/current.yaml...
2022-03-21 13:59:30.385 +0000 Dereferencing schema with $id /mediawiki/desktop/button/click/1.1.0 using schema base URIs ./jsonschema/,https://schema.wikimedia.org/repositories/primary/jsonschema/
2022-03-21 13:59:30.405 +0000 Materialized schema at /home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/1.1.0.yaml.
2022-03-21 13:59:30.407 +0000 Materialized schema at /home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/1.1.0.json.
2022-03-21 13:59:30.409 +0000 Created latest symlink /home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/latest.yaml -> 1.1.0.yaml.
2022-03-21 13:59:30.409 +0000 Created latest symlink /home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/latest.json -> 1.1.0.json.
2022-03-21 13:59:30.409 +0000 Created extensionless symlink /home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/1.1.0 -> 1.1.0.yaml.
2022-03-21 13:59:30.411 +0000 Created latest symlink /home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/latest -> 1.1.0.yaml.
2022-03-21 13:59:30.411 +0000 New schema files have been materialized. Adding them to git: /home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/1.1.0.yaml,/home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/latest.yaml,/home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/1.1.0,/home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/latest,/home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/1.1.0.json,/home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/latest.json

git add jsonschema/mediawiki/desktop/button/click/*
git commit -m '1.1.0 version of mediawiki/desktop/button/click'
Including sub schemas
When materializing schemas, jsonschema-tools will dereference any $ref pointers and merge any allOf it finds. This allows us to DRY up common subschemas to avoid copy/paste bugs. It also allows us to standardize and reuse common fields, e.g. these MediaWiki entity fragment schemas.
For WMF, all event schemas should have a $schema event field, as well as use a common event meta sub object. The Wikimedia common schema is in the primary schema repository at /fragment/common.
In our example schema repository, assume we have a common schema at jsonschema/fragment/common/2.0.0 as:
title: fragment/common
description: Common schema fields for event schemas
$id: /fragment/common/2.0.0
$schema: 'https://json-schema.org/draft-07/schema#'
type: object
additionalProperties: false
required:
  - $schema
  - meta
  - dt
properties:
  $schema:
    description: >
      A URI identifying the JSONSchema for this event. This should match a
      schema's $id in a schema repository. E.g. /schema/title/1.0.0
    type: string
  dt:
    description: >
      ISO-8601 formatted timestamp of when the event occurred/was generated (in
      UTC), AKA 'event time'. This is different than meta.dt, which is used as
      the time the system received this event.
    type: string
    format: date-time
    maxLength: 128
  meta:
    type: object
    required:
      - stream
    properties:
      domain:
        description: Domain the event or entity pertains to
        type: string
        minLength: 1
      dt:
        description: 'Time the event was received by the system, in UTC ISO-8601 format'
        type: string
        format: date-time
        maxLength: 128
      id:
        description: Unique ID of this event
        type: string
      request_id:
        description: Unique ID of the request that caused the event
        type: string
      stream:
        description: Name of the stream (dataset) that this event belongs in
        type: string
        minLength: 1
      uri:
        description: Unique URI identifying the event or entity
        type: string
        format: uri-reference
        maxLength: 8192
We want to include this schema (including its required properties) in our button/click example schema. Let's make a new version of the button/click schema and include the common schema using $ref. Edit jsonschema/mediawiki/desktop/button/click/current.yaml to:
title: mediawiki/desktop/button/click
description: Mediawiki desktop web button clicked
$id: /mediawiki/desktop/button/click/1.2.0
$schema: 'https://json-schema.org/draft-07/schema#'
type: object
allOf:
  - $ref: /fragment/common/2.0.0
properties:
  button_name:
    type: string
    description: Name of the button that was clicked
  page_title:
    type: string
    description: Page the button appeared on when clicked
  user_id:
    type: string
    description: ID of the user
examples:
  - {
      "$schema": "/mediawiki/desktop/button/click/1.2.0",
      "meta": {
        "dt": "2019-01-01T00:00:00Z",
        "stream": "mediawiki.desktop.button-click",
        "id": "12345678-1234-5678-1234-567812345678"
      },
      "button_name": "Edit source",
      "page_title": "Delayed-choice quantum eraser",
      "user_id": "123"
    }
Notice that we've bumped the version number in $id again to 1.2.0. Materialize and commit this new schema version.
npm run build-modified
# ...
git add ./jsonschema/mediawiki/desktop/button/click/*
git commit -m 'Using $ref to common in new version mediawiki/desktop/button/click 1.2.0'
...
The newly materialized ./jsonschema/mediawiki/desktop/button/click/1.2.0.yaml now has both our schema and the included common schema merged together.
How this works
When jsonschema-tools encounters a $ref, it will attempt to resolve it and then replace it with the resolved content. After dereferencing, anything in allOf is merged together with the top level schema fields to create a fully dereferenced and merged schema without any $ref or allOf keywords.
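The two steps described above, dereferencing and then merging allOf, can be sketched in a simplified form. This is not jsonschema-tools' implementation: $ref lookup is replaced by an in-memory map keyed by $id, only top-level allOf is merged, and the conflict rules (fields already set on the including schema win, required lists are unioned) are assumptions made for this illustration.

```javascript
const isPlainObject = (v) => v !== null && typeof v === 'object' && !Array.isArray(v);

// Replace every { $ref: ... } with a fresh copy of the referenced schema.
// ASSUMPTION: schemas are looked up in an in-memory map instead of being
// loaded from files or URLs.
function dereference(schema, schemasById) {
  if (Array.isArray(schema)) {
    return schema.map((item) => dereference(item, schemasById));
  }
  if (!isPlainObject(schema)) {
    return schema;
  }
  if (typeof schema.$ref === 'string') {
    if (!(schema.$ref in schemasById)) {
      throw new Error(`Cannot resolve $ref ${schema.$ref}`);
    }
    return dereference(schemasById[schema.$ref], schemasById);
  }
  const copy = {};
  for (const [key, value] of Object.entries(schema)) {
    copy[key] = dereference(value, schemasById);
  }
  return copy;
}

// Merge `source` into `target`: objects are merged recursively, `required`
// lists are unioned, and scalars already present on `target` are kept.
function mergeInto(target, source) {
  for (const [key, value] of Object.entries(source)) {
    if (key === 'required' && Array.isArray(value)) {
      target.required = [...new Set([...(target.required || []), ...value])];
    } else if (isPlainObject(value) && isPlainObject(target[key])) {
      mergeInto(target[key], value);
    } else if (!(key in target)) {
      target[key] = value;
    }
  }
  return target;
}

// Dereference, then fold each allOf entry into the top-level schema.
function materialize(schema, schemasById) {
  const { allOf = [], ...rest } = dereference(schema, schemasById);
  return allOf.reduce(
    (merged, sub) => mergeInto(merged, materialize(sub, schemasById)),
    rest
  );
}

const common = {
  title: 'fragment/common',
  $id: '/fragment/common/2.0.0',
  type: 'object',
  required: ['$schema', 'meta'],
  properties: {
    $schema: { type: 'string' },
    meta: { type: 'object', required: ['stream'], properties: { stream: { type: 'string' } } },
  },
};
const click = {
  title: 'mediawiki/desktop/button/click',
  $id: '/mediawiki/desktop/button/click/1.2.0',
  type: 'object',
  allOf: [{ $ref: '/fragment/common/2.0.0' }],
  properties: { button_name: { type: 'string' } },
};

const merged = materialize(click, { '/fragment/common/2.0.0': common });
console.log(Object.keys(merged.properties)); // [ 'button_name', '$schema', 'meta' ]
```

The merged result keeps the including schema's title and $id while gaining the fragment's properties and required list, which matches the shape of the materialized 1.2.0 file described above.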
Absolute $ref
If the $ref starts with a URI protocol (http:// or file://), jsonschema-tools will attempt to load it as is. For example, a $ref that is a full URL will load the content at that URL.
Relative to baseSchemaUris
jsonschema-tools can be configured (in .jsonschema-tools.yaml) with multiple baseSchemaUris, the default of which is just the schemaBasePath (in our case, ./jsonschema). When a $ref starts with a slash (/), jsonschema-tools will iterate through each of the configured baseSchemaUris, prepend the base URI to the $ref value, and attempt to resolve it. If your baseSchemaUris contains both ./jsonschema and a remote base URI, jsonschema-tools will look for your $ref path in both of those locations.
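The lookup order described above can be sketched as a function that lists the locations to try for a given $ref. candidateUris is a hypothetical name, and the sketch only builds the candidate list; it does not load anything. The remote base URI in the usage example is the one that appears in the materialize log output earlier on this page.

```javascript
// List, in order, the locations at which a $ref would be looked up.
function candidateUris(ref, baseSchemaUris) {
  // Absolute: starts with a URI protocol such as http:// or file://.
  if (/^[a-z][a-z0-9+.-]*:\/\//i.test(ref)) {
    return [ref];
  }
  // Relative to baseSchemaUris: starts with a slash, so prepend each base
  // URI in turn (dropping any trailing slash on the base first).
  if (ref.startsWith('/')) {
    return baseSchemaUris.map((base) => base.replace(/\/+$/, '') + ref);
  }
  // Other forms of $ref are not covered by this sketch.
  return [];
}

console.log(candidateUris('/fragment/common/2.0.0', [
  './jsonschema/',
  'https://schema.wikimedia.org/repositories/primary/jsonschema/',
]));
```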
Testing schemas
jsonschema-tools comes with a series of tests that ensure your schema repository is nice and clean. We showed how to install these tests in the section Creating a new schema repository above. These are mocha tests, so all we need to do is run npm test. These tests will ensure that your schema repository structure is correct, that your schemas have required fields, and that schema versions are backwards compatible.
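To give a feel for what a backwards compatibility check involves, here is a deliberately simplified sketch. It is not the logic used by the jsonschema-tools tests: it assumes that a new version is incompatible if it removes a property or changes a property's type, and it ignores everything else (required lists, formats, and so on). The function name incompatibilities is hypothetical.

```javascript
// Compare two materialized schema versions and list the properties that
// were removed or whose type changed. Adding new properties is allowed.
function incompatibilities(oldSchema, newSchema, prefix = '') {
  const problems = [];
  const oldProps = oldSchema.properties || {};
  const newProps = newSchema.properties || {};
  for (const [name, oldProp] of Object.entries(oldProps)) {
    const here = prefix ? `${prefix}.${name}` : name;
    const newProp = newProps[name];
    if (newProp === undefined) {
      problems.push(`${here}: removed`);
    } else if (oldProp.type !== newProp.type) {
      problems.push(`${here}: type changed from ${oldProp.type} to ${newProp.type}`);
    } else if (oldProp.type === 'object') {
      // Recurse into nested objects such as meta.
      problems.push(...incompatibilities(oldProp, newProp, here));
    }
  }
  return problems;
}

const v1 = {
  properties: {
    button_name: { type: 'string' },
    meta: { type: 'object', properties: { stream: { type: 'string' } } },
  },
};
// Adds user_id only: compatible under the assumptions of this sketch.
const v2 = { properties: { ...v1.properties, user_id: { type: 'string' } } };
// Changes a type and removes a nested field: incompatible.
const v3 = {
  properties: {
    button_name: { type: 'integer' },
    meta: { type: 'object', properties: {} },
  },
};

console.log(incompatibilities(v1, v2)); // []
console.log(incompatibilities(v1, v3));
```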