T138709: Use microformats on Wiktionary to improve term parsing
Closed, Declined
Public
Assigned To
None
Authored By
jberkel
Jun 26 2016, 11:36 AM
2016-06-26 11:36:54 (UTC+0)
Tags
Mobile-Content-Service
(Incoming)
All-and-every-Wiktionary
(Backlog)
Technical-Debt
(Unsorted)
Product-Infrastructure-Team-Backlog-Deprecated
(Backlog)
Referenced Files
None
Subscribers
Aklapper
bearND
Darkdadaah
GWicke
jberkel
jeremyb
Kelson
View All 17 Subscribers
Description
(Task opened as a follow-up to a conversation with @GWicke at Wikimania.)
There is an experimental definition term endpoint deployed which makes some assumptions about the HTML structure of the Wiktionary markup (cf. parseDefinition.js).
To avoid future maintenance problems and to scale up the extraction to other languages / Wiktionaries it would be helpful to have the Wiktionary templates already include semantic information in the output. On Wikisource this has already been done successfully (in the context of ebook metadata export).
It would also be a good starting point for future Wikidata <–> Wiktionary integration work. There is already some minimal semantic information included in a few places, for example gender markers (cf. the Spanish entry casa):
f
A first step would be to formalize these conventions and implement them consistently on the English Wiktionary, then adapt the parsing code on the API side.
Thoughts?
Task Graph
Status | Subtype | Assigned | Task
Open | Feature | None | T13996: A way to select which parts of Wiktionary articles to show
Open | Feature | None | T14213: Following a link to a language entry in Wiktionary should display only that entry
Open | Feature | None | T13998: A way to show only those languages on Wiktionary that the user is interested in
Open | Feature | None | T38881: Wiktionary needs usable API
Declined | - | None | T151914: Support more languages in the Wiktionary definition endpoint
Declined | - | None | T138709: Use microformats on Wiktionary to improve term parsing
Mentioned In
T187430: Duplicate usage examples in Wiktionary page definition endpoint
T164739: Allow page previews to display in wiktionary
T176242: [EPIC] Representing / extracting wiki-specific application-level semantics
T17017: Wikimedia static HTML dumps broken
T38881: Wiktionary needs usable API
Mentioned Here
T114949: Create mocks for Wiktionary popup when highlighting word(s)
Event Timeline
jberkel
created this task.
Jun 26 2016, 11:36 AM
2016-06-26 11:36:54 (UTC+0)
Restricted Application
added subscribers:
Zppix
Aklapper
Jun 26 2016, 11:36 AM
2016-06-26 11:36:54 (UTC+0)
Mholloway
added a comment.
Edited
Jun 27 2016, 2:07 PM
2016-06-27 14:07:44 (UTC+0)
As I wrote by email a few minutes ago, I really like the idea of Wiktionary including more semantic markup. Without it, of course, efforts like the current content service endpoint are necessarily pretty brittle and not at all scalable to other languages. Also, I wrote the existing endpoint to parse just the information needed to match the product spec created by the design team (see T114949), but of course being able to expose the entire page content in a structured way would be much more useful.
I'm going to tag the Wikipedia app as well for tracking purposes. User engagement with the Wiktionary definition popup feature has been limited so far, but we're interested in seeing how we can increase it, and along with improving discoverability, expanding it beyond English would surely help.
Mholloway
added a project:
Wikipedia-Android-App-Backlog
Jun 27 2016, 2:07 PM
2016-06-27 14:07:59 (UTC+0)
Mholloway
moved this task from
Needs Triage
to
Tracking
on the
Wikipedia-Android-App-Backlog
board.
Mholloway
awarded a token.
Jun 27 2016, 2:14 PM
2016-06-27 14:14:13 (UTC+0)
Mholloway
added a subscriber:
Jhernandez
Jun 27 2016, 2:18 PM
2016-06-27 14:18:51 (UTC+0)
@Jhernandez
Are Wiktionary definition popups something the web team would be interested in, if we can get the endpoint into a more robust and scalable state?
Jhernandez
added a comment.
Jul 20 2016, 4:06 PM
2016-07-20 16:06:48 (UTC+0)
@Mholloway
I'm not really sure, why are you asking?
Jhernandez
added a comment.
Jul 20 2016, 4:07 PM
2016-07-20 16:07:16 (UTC+0)
Sounds like a good idea to me, but I'm no PO :p
Mholloway
added a comment.
Jul 21 2016, 12:38 PM
2016-07-21 12:38:37 (UTC+0)
Fair 'nuff. :) Just thought of you because of your interest in node services. We don't really keep such a strict engineer/PO division of labor over here in Android-land.
GWicke
added a comment.
Jul 22 2016, 4:40 PM
2016-07-22 16:40:10 (UTC+0)
@Mholloway, could you document which elements you would like to see marked up? I know this is somewhat implicit in the current extraction logic, but I think it would help move the discussion forward to have a concrete list that we could discuss with the community.
GWicke
added a parent task:
T38881: Wiktionary needs usable API
Jul 22 2016, 4:43 PM
2016-07-22 16:43:20 (UTC+0)
GWicke
mentioned this in
T38881: Wiktionary needs usable API
Jul 22 2016, 4:46 PM
2016-07-22 16:46:41 (UTC+0)
Mholloway
triaged this task as
Medium
priority.
Jul 22 2016, 5:42 PM
2016-07-22 17:42:55 (UTC+0)
Sure, I'll do some thinking and put together a list.
Mholloway
claimed this task.
Jul 22 2016, 5:43 PM
2016-07-22 17:43:54 (UTC+0)
Mholloway
unsubscribed.
mxn
subscribed.
Jul 22 2016, 11:36 PM
2016-07-22 23:36:53 (UTC+0)
Darkdadaah
added a project:
All-and-every-Wiktionary
Jul 25 2016, 10:15 AM
2016-07-25 10:15:15 (UTC+0)
Darkdadaah
subscribed.
Alkamid
subscribed.
Jul 29 2016, 4:34 PM
2016-07-29 16:34:30 (UTC+0)
Alkamid
unsubscribed.
Mholloway
added a comment.
Edited
Aug 1 2016, 7:48 PM
2016-08-01 19:48:14 (UTC+0)
It seems like adding microformats to identify
- language headers,
- part-of-speech headers,
- definitions, and
- examples
would be a good start.
Could we add class="wd-header", class="wd-part-of-speech" to the relevant header tags, and class="wd-definition" and class="wd-example" to the relevant
Edit: updated to incorporate @GWicke's suggestion below.
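A consumer of the proposed classes could look roughly like the sketch below. It uses only the Python standard library. The `wd-*` class names are taken from the proposal above; the sample markup, the elements carrying the classes, and the names `WdExtractor` and `extract_wd` are assumptions made for illustration, and the sketch expects well-nested HTML with explicit end tags.

```python
from html.parser import HTMLParser

# Hypothetical markup: only the wd-* class names come from the proposal;
# the elements carrying them (h2, h3, li) are assumed for illustration.
SAMPLE_HTML = """
<h2 class="wd-header">Spanish</h2>
<h3 class="wd-part-of-speech">Noun</h3>
<ol>
  <li class="wd-definition">house
    <ul><li class="wd-example">Mi casa es tu casa.</li></ul>
  </li>
</ol>
"""


class WdExtractor(HTMLParser):
    """Collect the text of every element that has a wd-* class."""

    VOID_TAGS = {"br", "hr", "img", "input", "link", "meta", "wbr"}

    def __init__(self):
        super().__init__()
        self.open_elements = []  # one entry per open element; None if no wd-* class
        self.found = []          # (class_name, text_parts) in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID_TAGS:
            return  # void elements never get an end tag
        classes = (dict(attrs).get("class") or "").split()
        wd_class = next((c for c in classes if c.startswith("wd-")), None)
        entry = (wd_class, []) if wd_class else None
        if entry:
            self.found.append(entry)
        self.open_elements.append(entry)

    def handle_endtag(self, tag):
        if tag not in self.VOID_TAGS and self.open_elements:
            self.open_elements.pop()

    def handle_data(self, data):
        # Attribute text to the nearest enclosing wd-* element only, so an
        # example nested in a definition is not repeated in the definition.
        for entry in reversed(self.open_elements):
            if entry is not None:
                entry[1].append(data)
                return

    def results(self):
        # Join the collected fragments and collapse runs of whitespace.
        return [(name, " ".join("".join(parts).split()))
                for name, parts in self.found]


def extract_wd(html):
    parser = WdExtractor()
    parser.feed(html)
    parser.close()
    return parser.results()


for class_name, text in extract_wd(SAMPLE_HTML):
    print(class_name, "->", text)
```

Because the extractor keys on class names rather than on the position of headings and list items, it would keep working if the surrounding page layout changed, which is the maintenance benefit the task description is after.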
GWicke
added a comment.
Aug 1 2016, 8:06 PM
2016-08-01 20:06:04 (UTC+0)
@Mholloway: Those class names are very general, which makes conflicts more likely. Perhaps consider adding a prefix, such as "wd-" (wiktionary definition)?
Mholloway
added a comment.
Aug 1 2016, 8:40 PM
2016-08-01 20:40:53 (UTC+0)
Good idea. I updated my comment above to reflect it.
PeterBowman
subscribed.
Sep 12 2016, 7:35 PM
2016-09-12 19:35:57 (UTC+0)
WMDE-leszek
subscribed.
Sep 27 2016, 7:14 AM
2016-09-27 07:14:06 (UTC+0)
bearND
moved this task from
Incoming
to
Backlog
on the
Mobile-Content-Service
board.
Oct 3 2016, 6:11 PM
2016-10-03 18:11:24 (UTC+0)
Niedzielski
added a project:
Technical-Debt
Nov 9 2016, 7:41 PM
2016-11-09 19:41:31 (UTC+0)
Niedzielski
subscribed.
This kind of sounds like technical debt. Please drop the tag if I'm mistaken.
Mholloway
added a parent task:
T151914: Support more languages in the Wiktionary definition endpoint
Nov 29 2016, 6:29 PM
2016-11-29 18:29:49 (UTC+0)
Volker_E
subscribed.
Jan 12 2017, 11:43 PM
2017-01-12 23:43:04 (UTC+0)
bearND
subscribed.
Jan 13 2017, 6:56 AM
2017-01-13 06:56:34 (UTC+0)
jberkel
added a comment.
Feb 23 2017, 1:29 AM
2017-02-23 01:29:16 (UTC+0)
I finally managed to get some time to work on this and also did some research on microformats. In the last few years this area has become increasingly confusing, with a variety of options (microformats1/2, W3C microdata, schema.org, RDFa (Lite), JSON-LD, etc.).
In my opinion the simplest option in terms of marking up and parsing of content seems to be microformats2. They are based around prefix classes which can be mapped easily to JSON types with generic parsers available for different languages.
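As a rough illustration of the prefix convention: microformats2 reserves `h-` for root items and `p-`, `u-`, `dt-` and `e-` for plain-text, URL, date/time and embedded-markup properties. The sketch below only classifies class names by prefix; it is not a parser, and `MF2_PREFIXES` and `classify_classes` are names invented for this example.

```python
# Meaning of each microformats2 class prefix.
MF2_PREFIXES = {
    "h": "root",
    "p": "plain-text property",
    "u": "URL property",
    "dt": "date/time property",
    "e": "embedded markup property",
}


def classify_classes(class_attr):
    """Return (class_name, role) for each microformats2-style class name."""
    roles = []
    for name in class_attr.split():
        prefix, sep, rest = name.partition("-")
        # Ignore ordinary classes such as "Latn" or "mw-parser-output".
        if sep and rest and prefix in MF2_PREFIXES:
            roles.append((name, MF2_PREFIXES[prefix]))
    return roles


print(classify_classes("h-usage-example"))
print(classify_classes("Latn mention e-example"))
```

A generic parser applies the same idea to a whole document: every `h-*` element becomes a JSON object, and the prefixed classes nested inside it become that object's properties.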
To test this out I added microformats2-compatible classes to usage examples rendered on Wiktionary (diff).
This is an example
This is the translation
This allows one to write a simple extractor in a few lines of Python, with the help of the mf2py parser:
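import mf2py

# Fetch the rendered entry and parse its microformats2 markup.
obj = mf2py.parse(url='https://en.wiktionary.org/wiki/fazer', html_parser='html5lib')

# Each h-usage-example item carries its text in the 'example' property.
examples = (example['value']
            for item in obj.get('items', []) if 'h-usage-example' in item['type']
            for example in item['properties'].get('example', []))

for example in examples:
    print(example)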
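To make the traversal above easier to follow without a network request, here is a sketch of the JSON-like structure a microformats2 parser returns for one usage example, plus a helper that pairs each example with its translation. The `h-usage-example` type and the `example` property come from the snippet above; the `translation` property name, the sample `html` values and the helper names are assumptions.

```python
# Hand-written stand-in for a parse result. An e-* property is represented
# as a dict with "html" and "value" keys; a p-* property is a plain string.
SAMPLE_PARSED = {
    "items": [
        {
            "type": ["h-usage-example"],
            "properties": {
                "example": [
                    {"html": "<i>This is an example</i>",
                     "value": "This is an example"}
                ],
                # Assumed property name, for illustration only.
                "translation": [
                    {"html": "This is the translation",
                     "value": "This is the translation"}
                ],
            },
        }
    ],
    "rels": {},
    "rel-urls": {},
}


def _text(prop):
    """Return the plain text of a property value (dict or string)."""
    return prop["value"] if isinstance(prop, dict) else prop


def usage_examples(parsed):
    """Yield (example, translation) pairs; translation may be None."""
    for item in parsed.get("items", []):
        if "h-usage-example" not in item.get("type", []):
            continue
        props = item.get("properties", {})
        examples = [_text(p) for p in props.get("example", [])]
        translations = [_text(p) for p in props.get("translation", [])]
        for index, example in enumerate(examples):
            translation = translations[index] if index < len(translations) else None
            yield (example, translation)


for example, translation in usage_examples(SAMPLE_PARSED):
    print(example, "=>", translation)
```

The same helper should accept the dictionary returned by `mf2py.parse(...)`, provided the live markup uses the property names assumed here.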
GWicke
added a subscriber:
Lydia_Pintscher
Feb 23 2017, 3:08 AM
2017-02-23 03:08:14 (UTC+0)
Lydia_Pintscher
added a comment.
Feb 23 2017, 5:31 AM
2017-02-23 05:31:03 (UTC+0)
Follow up here today based on conversations at the colab jam: Please be aware of the work the Wikidata team is doing on supporting lexicographical data in Wikidata as part of our Wiktionary support work. It will still take some time to get that finished but it'll give you nice machine-readable data like it is in Wiktionary now.
jberkel
added a comment.
Edited
Feb 23 2017, 7:35 AM
2017-02-23 07:35:11 (UTC+0)
@Lydia_Pintscher I'm aware of the efforts of the Wikidata team, and it is great to see that this is happening. The approach presented here is meant to be a temporary solution until we have this data. Then there's also the chicken-and-egg question: we first need to get the data present on Wiktionary into Wikidata. This task will be a lot easier if we already have some semantic information present in the generated output, since it would let us automate that process. That's what I meant in the initial task description:
It would also be a good starting point for future Wikidata <–> Wiktionary integration work.
I'm still unsure about how this aspect of the transition to Wikidata will work out in practice. What are the current ideas around this?
Lydia_Pintscher
added a comment.
Feb 23 2017, 7:09 PM
2017-02-23 19:09:46 (UTC+0)
Just like for the rest of the data in Wikidata, editors will handle it via manual entry, bots and other tools.
jberkel
mentioned this in
T17017: Wikimedia static HTML dumps broken
Mar 15 2017, 8:07 AM
2017-03-15 08:07:14 (UTC+0)
jberkel
added a comment.
Mar 15 2017, 8:15 AM
2017-03-15 08:15:22 (UTC+0)
@Lydia_Pintscher
OK, so making Wiktionary easier to parse right now will help with that transition. It will be great to have at least some of the data easily accessible.
Nemo_bis
subscribed.
Mar 15 2017, 9:16 AM
2017-03-15 09:16:00 (UTC+0)
In T138709#3048681, @jberkel wrote:
In my opinion the simplest option in terms of marking up and parsing of content seems to be microformats2. They are based around prefix classes which can be mapped easily to JSON types with generic parsers available for different languages.
To test this out I added microformats2-compatible classes to usage examples rendered on Wiktionary (diff).
Thanks for the on-wiki work. This is valuable mostly if it is adopted or adoptable by multiple subdomains: have you written to the Grease pit and Wiktionary-l about this work? Since microformats2 is very generic, it would be useful to start writing down your "spec" on a Meta-Wiki page, so that other Wiktionary users can more easily comment on (and adopt) it.
(Note: since this is about wiki editing work, the main component would be WMF-General-or-Unknown once accepted by the editors.)
jberkel
added a comment.
Mar 15 2017, 10:13 AM
2017-03-15 10:13:53 (UTC+0)
No, I haven't; this was just an initial test / proof of concept. To me at least it has proven useful: I can now extract usage examples quite easily from the HTML output of the templates, provided that they actually get used (Wiktionary has many cases where templates are recommended but are in fact optional).
I'll start a discussion, but it sometimes feels like a “touchy” subject: there's no clear consensus, and some editors don't see the value of semantic data and prefer fewer keystrokes or “more legible markup”, as they put it. I won't give up so easily, though, in my quest to persuade them.
Nemo_bis
added a comment.
Edited
Mar 15 2017, 10:35 AM
2017-03-15 10:35:30 (UTC+0)
Well, editors surely have fewer concerns as long as you change the HTML output while keeping the wikitext identical.
jberkel
added a comment.
Mar 15 2017, 10:43 AM
2017-03-15 10:43:36 (UTC+0)
Yes, that's the idea: editors wouldn't even notice that extra markup gets generated. However, it would also mean promoting the use of templates wherever possible, and possibly automating the conversion of non-templated content with bots.
Nemo_bis
added a comment.
Mar 21 2017, 2:03 PM
2017-03-21 14:03:54 (UTC+0)
However, it would also mean promoting the use of templates wherever possible, and possibly automating the conversion of non-templated content with bots.
The existing templates are widely accepted, so I think there will be a rather natural push in that direction once editors can "touch" the benefits. (It's also understandable to keep some flexibility, since making a dictionary requires involvement of many people.)
NHarateh_WMF
added a project:
Product-Infrastructure-Team-Backlog-Deprecated
Apr 25 2017, 12:27 PM
2017-04-25 12:27:23 (UTC+0)
NHarateh_WMF
moved this task from
Needs triage
to
Needs investigation
on the
Product-Infrastructure-Team-Backlog-Deprecated
board.
Apr 25 2017, 12:31 PM
2017-04-25 12:31:18 (UTC+0)
NHarateh_WMF
moved this task from
Backlog
to
Incoming
on the
Mobile-Content-Service
board.
Apr 25 2017, 4:33 PM
2017-04-25 16:33:02 (UTC+0)
Mholloway
removed
Mholloway
as the assignee of this task.
May 12 2017, 2:38 AM
2017-05-12 02:38:04 (UTC+0)
Mholloway
subscribed.
Kelson
subscribed.
Jul 9 2017, 12:54 PM
2017-07-09 12:54:35 (UTC+0)
ssastry
subscribed.
Sep 13 2017, 4:05 AM
2017-09-13 04:05:18 (UTC+0)
Related WIP document:
cscott
mentioned this in
T176242: [EPIC] Representing / extracting wiki-specific application-level semantics
Sep 19 2017, 6:08 PM
2017-09-19 18:08:55 (UTC+0)
Mholloway
mentioned this in
T164739: Allow page previews to display in wiktionary
Jan 9 2018, 12:48 PM
2018-01-09 12:48:42 (UTC+0)
jberkel
mentioned this in
T187430: Duplicate usage examples in Wiktionary page definition endpoint
Feb 15 2018, 10:53 AM
2018-02-15 10:53:46 (UTC+0)
jeremyb
subscribed.
Mar 21 2018, 8:39 PM
2018-03-21 20:39:09 (UTC+0)
LGoto
removed a project:
Wikipedia-Android-App-Backlog
Apr 11 2018, 9:51 PM
2018-04-11 21:51:11 (UTC+0)
Jhernandez
lowered the priority of this task from
Medium
to
Lowest
Feb 20 2019, 4:42 PM
2019-02-20 16:42:57 (UTC+0)
Jhernandez
raised the priority of this task from
Lowest
to
Low
Jhernandez
moved this task from
Needs investigation
to
Backlog
on the
Product-Infrastructure-Team-Backlog-Deprecated
board.
Jhernandez
unsubscribed.
Apr 2 2020, 6:46 PM
2020-04-02 18:46:25 (UTC+0)
LGoto
closed this task as
Declined
Oct 9 2020, 4:50 PM
2020-10-09 16:50:37 (UTC+0)