⚓ T138709 Use microformats on Wiktionary to improve term parsing
Closed, Declined
Public
Assigned To
None
Authored By
jberkel
Jun 26 2016, 11:36 AM
2016-06-26 11:36:54 (UTC+0)
Tags
Mobile-Content-Service
(Incoming)
All-and-every-Wiktionary
(Backlog)
Technical-Debt
(Unsorted)
Product-Infrastructure-Team-Backlog-Deprecated
(Backlog)
Referenced Files
None
Subscribers
Aklapper
bearND
Darkdadaah
GWicke
jberkel
jeremyb
Kelson
Description
(Task opened as a follow-up to a conversation with @GWicke at Wikimania.)
There is an experimental definition term endpoint deployed which makes some assumptions about the HTML structure of the Wiktionary markup (cf. parseDefinition.js).
To avoid future maintenance problems and to scale up the extraction to other languages / Wiktionaries, it would be helpful to have the Wiktionary templates already include semantic information in their output. On Wikisource this has already been done successfully (in the context of ebook metadata export).
It would also be a good starting point for future Wikidata <–> Wiktionary integration work. There is already some minimal semantic information included in a few places, for example gender markers (cf. the Spanish entry casa):

f

A first step would be to formalize these conventions and implement them consistently on the English Wiktionary, then adapt the parsing code on the API side.
Thoughts?
Related Objects
Task Graph
Status | Subtype | Assigned | Task
Open | Feature | None | T13996: A way to select which parts of Wiktionary articles to show
Open | Feature | None | T14213: Following a link to a language entry in Wiktionary should display only that entry
Open | Feature | None | T13998: A way to show only those languages on Wiktionary that the user is interested in
Open | Feature | None | T38881: Wiktionary needs usable API
Declined | - | None | T151914: Support more languages in the Wiktionary definition endpoint
Declined | - | None | T138709: Use microformats on Wiktionary to improve term parsing
Mentioned In
T187430: Duplicate usage examples in Wiktionary page definition endpoint
T164739: Allow page previews to display in wiktionary
T176242: [EPIC] Representing / extracting wiki-specific application-level semantics
T17017: Wikimedia static HTML dumps broken
T38881: Wiktionary needs usable API
Mentioned Here
T114949: Create mocks for Wiktionary popup when highlighting word(s)
Event Timeline
jberkel
created this task.
Jun 26 2016, 11:36 AM
2016-06-26 11:36:54 (UTC+0)
Restricted Application
added subscribers:
Zppix
Aklapper
Jun 26 2016, 11:36 AM
2016-06-26 11:36:54 (UTC+0)
Mholloway
added a comment.
Edited
Jun 27 2016, 2:07 PM
2016-06-27 14:07:44 (UTC+0)
As I wrote by email a few minutes ago, I really like the idea of Wiktionary including more semantic markup. Without it, of course, efforts like the current content service endpoint are necessarily pretty brittle and not at all scalable to other languages. Also, I wrote the existing endpoint to parse just the information needed to match the product spec created by the design team (see T114949), but of course being able to expose the entire page content in a structured way would be much more useful.
I'm going to tag the Wikipedia app as well for tracking purposes. User engagement with the Wiktionary definition popup feature has been limited so far, but we're interested in seeing how we can increase it, and along with improving discoverability, expanding it beyond English would surely help.
Mholloway
added a project:
Wikipedia-Android-App-Backlog
Jun 27 2016, 2:07 PM
2016-06-27 14:07:59 (UTC+0)
Mholloway
moved this task from
Needs Triage
to
Tracking
on the
Wikipedia-Android-App-Backlog
board.
Mholloway
awarded a token.
Jun 27 2016, 2:14 PM
2016-06-27 14:14:13 (UTC+0)
Mholloway
added a subscriber:
Jhernandez
Jun 27 2016, 2:18 PM
2016-06-27 14:18:51 (UTC+0)
@Jhernandez
Are Wiktionary definition popups something the web team would be interested in, if we can get the endpoint into a more robust and scalable state?
Jhernandez
added a comment.
Jul 20 2016, 4:06 PM
2016-07-20 16:06:48 (UTC+0)
@Mholloway
I'm not really sure, why are you asking?
Jhernandez
added a comment.
Jul 20 2016, 4:07 PM
2016-07-20 16:07:16 (UTC+0)
Sounds like a good idea to me, but I'm no PO :p
Mholloway
added a comment.
Jul 21 2016, 12:38 PM
2016-07-21 12:38:37 (UTC+0)
Fair 'nuff. :) Just thought of you because of your interest in node services. We don't really keep such a strict engineer/PO division of labor over here in Android-land.
GWicke
added a comment.
Jul 22 2016, 4:40 PM
2016-07-22 16:40:10 (UTC+0)
@Mholloway, could you document which elements you would like to see marked up? I know this is somewhat implicit in the current extraction logic, but I think it would help move the discussion forward to have a concrete list that we could discuss with the community.
GWicke
added a parent task:
T38881: Wiktionary needs usable API
Jul 22 2016, 4:43 PM
2016-07-22 16:43:20 (UTC+0)
GWicke
mentioned this in
T38881: Wiktionary needs usable API
Jul 22 2016, 4:46 PM
2016-07-22 16:46:41 (UTC+0)
Mholloway
triaged this task as
Medium
priority.
Jul 22 2016, 5:42 PM
2016-07-22 17:42:55 (UTC+0)
Sure, I'll do some thinking and put together a list.
Mholloway
claimed this task.
Jul 22 2016, 5:43 PM
2016-07-22 17:43:54 (UTC+0)
Mholloway
unsubscribed.
mxn
subscribed.
Jul 22 2016, 11:36 PM
2016-07-22 23:36:53 (UTC+0)
Darkdadaah
added a project:
All-and-every-Wiktionary
Jul 25 2016, 10:15 AM
2016-07-25 10:15:15 (UTC+0)
Darkdadaah
subscribed.
Alkamid
subscribed.
Jul 29 2016, 4:34 PM
2016-07-29 16:34:30 (UTC+0)
Alkamid
unsubscribed.
Mholloway
added a comment.
Edited
Aug 1 2016, 7:48 PM
2016-08-01 19:48:14 (UTC+0)
It seems like adding microformats to identify
- language headers,
- part-of-speech headers,
- definitions, and
- examples
would be a good start.
Could we add class="wd-header" and class="wd-part-of-speech" to the relevant header tags, and class="wd-definition" and class="wd-example" to the relevant <li> tags?
Edit: to incorporate @GWicke's suggestion below.
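As a rough illustration of how a consumer could use the proposed classes, here is a minimal sketch using only the Python standard library. The sample markup is hypothetical (Wiktionary is not known to emit these classes), and `WdExtractor` and `extract` are illustrative names, not part of the existing endpoint. It also assumes well-formed HTML with explicit closing tags.

```python
from html.parser import HTMLParser

# Hypothetical markup using the class names proposed in this task.
SAMPLE_HTML = """
<h2 class="wd-header">Spanish</h2>
<h3 class="wd-part-of-speech">Noun</h3>
<ol>
  <li class="wd-definition">house
    <ul><li class="wd-example">Mi casa es tu casa.</li></ul>
  </li>
</ol>
"""

WD_CLASSES = ("wd-header", "wd-part-of-speech", "wd-definition", "wd-example")
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta", "wbr"}


class WdExtractor(HTMLParser):
    """Collect the text of every element carrying one of the wd- classes."""

    def __init__(self):
        super().__init__()
        self.stack = []  # one entry per open element: index into self.found, or None
        self.found = []  # [class_name, [text chunks]] in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never get a closing tag, so keep them off the stack
        classes = (dict(attrs).get("class") or "").split()
        match = next((c for c in classes if c in WD_CLASSES), None)
        if match is None:
            self.stack.append(None)
        else:
            self.found.append([match, []])
            self.stack.append(len(self.found) - 1)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Attribute text to the innermost open element that has a wd- class.
        for idx in reversed(self.stack):
            if idx is not None:
                self.found[idx][1].append(data)
                break


def extract(html):
    parser = WdExtractor()
    parser.feed(html)
    # Collapse runs of whitespace so each entry is a single clean string.
    return [(name, " ".join("".join(chunks).split())) for name, chunks in parser.found]


for name, text in extract(SAMPLE_HTML):
    print(name, "->", text)
```

Because the extractor keys on class names rather than on heading levels or list nesting, it would keep working if the surrounding page layout changed, which is the robustness argument made in the task description.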
    GWicke
    added a comment.
    Aug 1 2016, 8:06 PM
    2016-08-01 20:06:04 (UTC+0)
    @Mholloway
    : Those class names are very general, which makes conflicts more likely. Perhaps consider adding a prefix, such as "wd-" (wiktionary definition)?
    Mholloway
    added a comment.
    Aug 1 2016, 8:40 PM
    2016-08-01 20:40:53 (UTC+0)
    Good idea. I updated my comment above to reflect it.
    PeterBowman
    subscribed.
    Sep 12 2016, 7:35 PM
    2016-09-12 19:35:57 (UTC+0)
    WMDE-leszek
    subscribed.
    Sep 27 2016, 7:14 AM
    2016-09-27 07:14:06 (UTC+0)
    bearND
    moved this task from
    Incoming
    to
    Backlog
    on the
    Mobile-Content-Service
    board.
    Oct 3 2016, 6:11 PM
    2016-10-03 18:11:24 (UTC+0)
    Niedzielski
    added a project:
    Technical-Debt
    Nov 9 2016, 7:41 PM
    2016-11-09 19:41:31 (UTC+0)
    Niedzielski
    subscribed.
    This kind of sounds like technical debt. Please drop the tag if I'm mistaken.
    Mholloway
    added a parent task:
    T151914: Support more languages in the Wiktionary definition endpoint
    Nov 29 2016, 6:29 PM
    2016-11-29 18:29:49 (UTC+0)
    Volker_E
    subscribed.
    Jan 12 2017, 11:43 PM
    2017-01-12 23:43:04 (UTC+0)
    bearND
    subscribed.
    Jan 13 2017, 6:56 AM
    2017-01-13 06:56:34 (UTC+0)
    jberkel
    added a comment.
    Feb 23 2017, 1:29 AM
    2017-02-23 01:29:16 (UTC+0)
    I finally managed to get some time to work on this and also did some research on microformats. In the last few years this area has become increasingly confusing with a variety of options (microformats1/2, W3C microdata, schema.org, RDFa (lite), JSON-LD etc).
    In my opinion the simplest option in terms of marking up and parsing of content seems to be microformats2. They are based around prefix classes which can be mapped easily to JSON types with generic parsers available for different languages.
    To test this out I added microformats2-compatible classes to usage examples rendered on Wiktionary (diff).

    This is an example
    This is the translation

    This allows one to write a simple extractor in a few lines of Python, with the help of the mf2py parser:
    import mf2py

    # Fetch the entry and parse its microformats2 markup (requires html5lib).
    obj = mf2py.parse(url='https://en.wiktionary.org/wiki/fazer', html_parser='html5lib')
    # Collect the 'value' of every "example" property on h-usage-example items.
    examples = (example['value']
                for item in obj.get('items', []) if 'h-usage-example' in item['type']
                for example in item['properties'].get('example', []))

    for example in examples:
        print(example)
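To make the "prefix classes map to JSON" idea concrete without depending on mf2py, here is a toy sketch using only the Python standard library. It handles just h-* roots and e-* properties, one level deep. The markup in `SAMPLE_HTML` (including the class name `e-translation`) is an assumption for illustration, since the actual classes live in the linked diff, and `TinyMf2Parser` is a hypothetical helper rather than part of any library. A real microformats2 parser also covers the p-, u- and dt- prefixes, nested items, and the `html` field of e-* values.

```python
from html.parser import HTMLParser

# Hypothetical markup for a templated usage example.
SAMPLE_HTML = (
    '<div class="h-usage-example">'
    '<i class="e-example">This is an example</i> '
    '<span class="e-translation">This is the translation</span>'
    '</div>'
)

ROOT = object()  # stack marker for an open h-* element
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta", "wbr"}


class TinyMf2Parser(HTMLParser):
    """Toy subset of microformats2: h-* items with e-* properties."""

    def __init__(self):
        super().__init__()
        self.items = []  # parsed items, in document order
        self.stack = []  # per open element: ROOT, a property dict, or None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        entry = None
        roots = [c for c in classes if c.startswith("h-")]
        if roots:
            # An h-* class opens a new item; "type" keeps the full class names.
            self.items.append({"type": roots, "properties": {}})
            entry = ROOT
        elif ROOT in self.stack:
            # Inside an item, an e-* class names a property (prefix stripped).
            name = next((c[2:] for c in classes if c.startswith("e-")), None)
            if name is not None:
                entry = {"value": ""}
                self.items[-1]["properties"].setdefault(name, []).append(entry)
        self.stack.append(entry)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Text belongs to the innermost open property element, if any.
        for entry in reversed(self.stack):
            if isinstance(entry, dict):
                entry["value"] += data
                break


def parse(html):
    parser = TinyMf2Parser()
    parser.feed(html)
    return {"items": parser.items}


obj = parse(SAMPLE_HTML)
examples = [example["value"]
            for item in obj.get("items", []) if "h-usage-example" in item["type"]
            for example in item["properties"].get("example", [])]
print(examples)
```

The resulting dictionary has the same overall shape the extractor above relies on (`items`, `type`, `properties`), so the same comprehension works on it unchanged.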
    GWicke
    added a subscriber:
    Lydia_Pintscher
    Feb 23 2017, 3:08 AM
    2017-02-23 03:08:14 (UTC+0)
    Lydia_Pintscher
    added a comment.
    Feb 23 2017, 5:31 AM
    2017-02-23 05:31:03 (UTC+0)
    Follow up here today based on conversations at the colab jam: Please be aware of the work the Wikidata team is doing on supporting lexicographical data in Wikidata as part of our Wiktionary support work. It will still take some time to get that finished but it'll give you nice machine-readable data like it is in Wiktionary now.
    jberkel
    added a comment.
    Edited
    Feb 23 2017, 7:35 AM
    2017-02-23 07:35:11 (UTC+0)
    @Lydia_Pintscher I'm aware of the efforts of the Wikidata team, and it is great to see that this is happening. The approach presented here is meant to be a temporary solution until we have this data. Then there's also the chicken-and-egg question: we first need to get the data present on Wiktionary into Wikidata. This task will be a lot easier if we already have some semantic information present in the generated output, since it would let us automate that process. That's what I meant in the initial task description:
    It would also be a good starting point for future Wikidata <–> Wiktionary integration work.
    I'm still unsure about how this aspect of the transition to Wikidata will work out in practice, what are the current ideas around this?
    Lydia_Pintscher
    added a comment.
    Feb 23 2017, 7:09 PM
    2017-02-23 19:09:46 (UTC+0)
    Just like for the rest of the data in Wikidata, editors will handle it via manual entry, bots and other tools.
    jberkel
    mentioned this in
    T17017: Wikimedia static HTML dumps broken
    Mar 15 2017, 8:07 AM
    2017-03-15 08:07:14 (UTC+0)
    jberkel
    added a comment.
    Mar 15 2017, 8:15 AM
    2017-03-15 08:15:22 (UTC+0)
    @Lydia_Pintscher
    OK, so making Wiktionary easier to parse right now will help with that transition. It will be great to have at least some of the data easily accessible.
    Nemo_bis
    subscribed.
    Mar 15 2017, 9:16 AM
    2017-03-15 09:16:00 (UTC+0)
    In T138709#3048681 @jberkel wrote:
    In my opinion the simplest option in terms of marking up and parsing of content seems to be microformats2. They are based around prefix classes which can be mapped easily to JSON types with generic parsers available for different languages.
    To test this out I added microformats2-compatible classes to usage examples rendered on Wiktionary (diff).
    Thanks for the on-wiki work. This has value mostly if it is adopted or adoptable by multiple subdomains: have you written to the grease pit and Wiktionary-l about this work? Since microformats2 is very generic, it would be useful to start writing down your "spec" on a Meta-Wiki page, so that other Wiktionary users can more easily comment on (and adopt) it.
    (Note: since this is about wiki editing work, the main component would be
    WMF-General-or-Unknown
    once accepted by the editors.)
    jberkel
    added a comment.
    Mar 15 2017, 10:13 AM
    2017-03-15 10:13:53 (UTC+0)
    No, I haven't; this was just an initial test / proof of concept. To me at least it has proven useful: I can now extract usage examples quite easily from the HTML output of the templates, provided that they actually get used (Wiktionary has many cases where templates are recommended but are in fact optional).
    I'll start a discussion but it sometimes feels like a “touchy” subject, there's no clear consensus and some editors don't see the value of semantic data and prefer fewer keystrokes or “more legible markup” as they put it. I won't give up so easily though in my quest to persuade them.
    Nemo_bis
    added a comment.
    Edited
    Mar 15 2017, 10:35 AM
    2017-03-15 10:35:30 (UTC+0)
    Well, editors surely have fewer concerns as long as you change the HTML output while keeping the wikitext identical.
    jberkel
    added a comment.
    Mar 15 2017, 10:43 AM
    2017-03-15 10:43:36 (UTC+0)
    Yes, that's the idea, editors wouldn't even notice the fact that extra markup gets generated. However it would also mean to promote the usage of templates wherever possible, and to possibly automate the conversion of non-templated content with bots.
    Nemo_bis
    added a comment.
    Mar 21 2017, 2:03 PM
    2017-03-21 14:03:54 (UTC+0)
    However it would also mean to promote the usage of templates wherever possible, and to possibly automate the conversion of non-templated content with bots.
    The existing templates are widely accepted, so I think there will be a rather natural push in that direction once editors can "touch" the benefits. (It's also understandable to keep some flexibility, since making a dictionary requires involvement of many people.)
    NHarateh_WMF
    added a project:
    Product-Infrastructure-Team-Backlog-Deprecated
    Apr 25 2017, 12:27 PM
    2017-04-25 12:27:23 (UTC+0)
    NHarateh_WMF
    moved this task from
    Needs triage
    to
    Needs investigation
    on the
    Product-Infrastructure-Team-Backlog-Deprecated
    board.
    Apr 25 2017, 12:31 PM
    2017-04-25 12:31:18 (UTC+0)
    NHarateh_WMF
    moved this task from
    Backlog
    to
    Incoming
    on the
    Mobile-Content-Service
    board.
    Apr 25 2017, 4:33 PM
    2017-04-25 16:33:02 (UTC+0)
    Mholloway
    removed
    Mholloway
    as the assignee of this task.
    May 12 2017, 2:38 AM
    2017-05-12 02:38:04 (UTC+0)
    Mholloway
    subscribed.
    Kelson
    subscribed.
    Jul 9 2017, 12:54 PM
    2017-07-09 12:54:35 (UTC+0)
    ssastry
    subscribed.
    Sep 13 2017, 4:05 AM
    2017-09-13 04:05:18 (UTC+0)
    Related WIP document:
    cscott
    mentioned this in
    T176242: [EPIC] Representing / extracting wiki-specific application-level semantics
    Sep 19 2017, 6:08 PM
    2017-09-19 18:08:55 (UTC+0)
    Mholloway
    mentioned this in
    T164739: Allow page previews to display in wiktionary
    Jan 9 2018, 12:48 PM
    2018-01-09 12:48:42 (UTC+0)
    jberkel
    mentioned this in
    T187430: Duplicate usage examples in Wiktionary page definition endpoint
    Feb 15 2018, 10:53 AM
    2018-02-15 10:53:46 (UTC+0)
    jeremyb
    subscribed.
    Mar 21 2018, 8:39 PM
    2018-03-21 20:39:09 (UTC+0)
    LGoto
    removed a project:
    Wikipedia-Android-App-Backlog
    Apr 11 2018, 9:51 PM
    2018-04-11 21:51:11 (UTC+0)
    Jhernandez
    lowered the priority of this task from
    Medium
    to
    Lowest
    Feb 20 2019, 4:42 PM
    2019-02-20 16:42:57 (UTC+0)
    Jhernandez
    raised the priority of this task from
    Lowest
    to
    Low
    Jhernandez
    moved this task from
    Needs investigation
    to
    Backlog
    on the
    Product-Infrastructure-Team-Backlog-Deprecated
    board.
    Jhernandez
    unsubscribed.
    Apr 2 2020, 6:46 PM
    2020-04-02 18:46:25 (UTC+0)
    LGoto
    closed this task as
    Declined
    Oct 9 2020, 4:50 PM
    2020-10-09 16:50:37 (UTC+0)
    Content licensed under Creative Commons Attribution-ShareAlike (CC BY-SA) 4.0 unless otherwise noted; code licensed under GNU General Public License (GPL) 2.0 or later and other open source licenses. By using this site, you agree to the Terms of Use, Privacy Policy, and Code of Conduct.