T589: RFC: image and oldimage tables
Closed, Resolved
Public
Assigned To
Krinkle
Authored By
Qgil
Oct 8 2014, 9:10 PM
Tags
TechCom-RFC (TechCom-RFC-Closed)
(Approved)
MediaWiki-libs-Rdbms
(Schema changes)
Referenced Files
None
Subscribers
aaron
Addshore
Aklapper
Bawolff
brooke
daniel
GWicke
View All 23 Subscribers
Description
The image and oldimage tables RfC proposes to change the database layout from having two tables for "current" and "old" image (file) records to having one table for the file and one for file revisions.
Quoting T28741:
The most recent version of each image is stored in the image table, with historic versions stored in the oldimage table. This structure resembles cur and old, with all the nastiness that comes with it. In MW 1.5 we ditched cur/old in favor of page/revision, but we never did a similar thing for files.
We talked about this a bit at the 2011 Amsterdam Hackathon and decided that, while it's complex and hairy, we want to do it at some point in the future.
RFC:
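The proposed split can be sketched with a toy schema. All table and column names below are illustrative assumptions loosely based on the file/filerevision naming in the parent task, not the final design:

```python
import sqlite3

# Minimal sketch of the proposed layout (names are illustrative, not the
# final schema): one row per file, one row per file revision, with the
# "current" version found via a pointer instead of living in its own table.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE file (
    file_id     INTEGER PRIMARY KEY,
    file_name   TEXT UNIQUE NOT NULL,
    file_latest INTEGER            -- points at filerevision.fr_id
);
CREATE TABLE filerevision (
    fr_id        INTEGER PRIMARY KEY,
    fr_file      INTEGER NOT NULL REFERENCES file(file_id),
    fr_timestamp TEXT NOT NULL,
    fr_size      INTEGER NOT NULL
);
""")

# Upload a file, then re-upload it: both versions are plain INSERTs into
# filerevision; no row ever moves to an "old" table.
db.execute("INSERT INTO file (file_id, file_name) VALUES (1, 'Foo.jpg')")
for fr_id, ts, size in [(1, '20040912170154', 1000), (2, '20160701000000', 2400)]:
    db.execute("INSERT INTO filerevision VALUES (?, 1, ?, ?)", (fr_id, ts, size))
    db.execute("UPDATE file SET file_latest = ? WHERE file_id = 1", (fr_id,))

current = db.execute("""
    SELECT fr.fr_timestamp FROM file f
    JOIN filerevision fr ON fr.fr_id = f.file_latest
    WHERE f.file_name = 'Foo.jpg'
""").fetchone()
print(current[0])  # the latest upload's timestamp
```

This mirrors how the page/revision split of MW 1.5 works: history is append-only, and "current" is just a pointer.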
Related Objects
This task is connected to more than 200 other tasks. Only direct parents and subtasks are shown here.
Status · Assigned · Task
Open · Ladsgroup · T28741: Migrate file tables to a modern layout (image/oldimage; file/filerevision; add primary keys)
Resolved · Krinkle · T589: RFC: image and oldimage tables
Mentioned In
T299764: Document media recovery use case proposals and decide their priority
T155320: Implement strict mime type detection and media type inferring of audio/video files
T154520: API to return all user uploads
T6715: Allow comments longer than 255 bytes
T153333: RFC: How should we store longer revision comments?
E228: ArchCom RFC Meeting W28: image and oldimage tables (2016-07-13, #wikimedia-office)
E266: ArchCom RFC Meeting W35: image and oldimage tables (2016-08-31, #wikimedia-office)
T149534: Scaling the MediaWiki database schema
T5361: Embed image author, description, and copyright data in file metadata fields
P3425 ArchCom-RFC-2016W28-irc-E228.txt
T28741: Migrate file tables to a modern layout (image/oldimage; file/filerevision; add primary keys)
E187: RFC Meeting: triage meeting (2016-05-25, #wikimedia-office)
T6421: Image file extension should not be part of the name
T68108: [Epic] Store media information for files on Wikimedia Commons as structured data
T96384: Integrate file revisions with description page history
Mentioned Here
T32906: Store DjVu, PDF extracted text in a structured table instead of img_metadata
T99263: Store Pdf extracted text in a structured table instead of img_metadata
E266: ArchCom RFC Meeting W35: image and oldimage tables (2016-08-31, #wikimedia-office)
P3425 ArchCom-RFC-2016W28-irc-E228.txt
T66214: Define an official thumb API
T107595: [RFC] Multi-Content Revisions
T28741: Migrate file tables to a modern layout (image/oldimage; file/filerevision; add primary keys)
T135851: Preserve InnoDB table auto_increment on restart
T139346: All s4 (commons) slaves failed replication
E228: ArchCom RFC Meeting W28: image and oldimage tables (2016-07-13, #wikimedia-office)
T96384: Integrate file revisions with description page history
Duplicates Merged Here
T73198: Image and oldimage tables lack auto-incrementing primary keys
Event Timeline
There are a very large number of changes, so older changes are hidden.
Krinkle
mentioned this in
T28741: Migrate file tables to a modern layout (image/oldimage; file/filerevision; add primary keys)
Jul 1 2016, 5:23 PM
Krinkle
updated the task description.
Krinkle
moved this task from
Old
to
P1: Define
on the
TechCom-RFC
board.
Jul 1 2016, 5:29 PM
Scott_WUaS
subscribed.
Jul 6 2016, 9:58 PM
RobLa-WMF
subscribed.
Jul 8 2016, 6:22 AM
This RFC is tentatively scheduled for our ArchCom IRC meeting next week:
E228: ArchCom RFC Meeting W28: image and oldimage tables (2016-07-13, #wikimedia-office)
jcrespo
subscribed.
Jul 8 2016, 6:35 AM
Here are two ancient rules of database design (that I have just made up):
Do not delete rows
Do not move rows between tables
In particular,
INSERT... SELECT
is bad (it either requires a lot of locking or it is unsafe; we are in the second case, for performance reasons), and MySQL already emits a warning every time revisions are archived and unarchived.
Given that the parent task says "This structure resembles cur and old, with all the nastiness that comes with it.", I probably do not need to convince anybody of how bad that is. But if I had to, just have a look at bugs such as:
T139346
and
T135851#2312924
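The row-moving pattern the comment warns about can be sketched with a heavily simplified toy version of the current schema (columns reduced to a bare minimum for illustration):

```python
import sqlite3

# Simplified sketch of the problematic pattern: on re-upload, the current
# row is copied from `image` into `oldimage` (INSERT ... SELECT) and then
# deleted, i.e. the row physically moves between tables.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE image    (img_name TEXT PRIMARY KEY, img_timestamp TEXT);
CREATE TABLE oldimage (oi_name TEXT, oi_timestamp TEXT);  -- no primary key!
""")
db.execute("INSERT INTO image VALUES ('Foo.jpg', '20040912170154')")

# Re-upload of Foo.jpg: the old current row moves between tables.
db.execute("""
    INSERT INTO oldimage (oi_name, oi_timestamp)
    SELECT img_name, img_timestamp FROM image WHERE img_name = 'Foo.jpg'
""")
db.execute("DELETE FROM image WHERE img_name = 'Foo.jpg'")
db.execute("INSERT INTO image VALUES ('Foo.jpg', '20160708000000')")

# The historic row is now identified only by (name, timestamp); with no
# primary key, nothing prevents exact duplicates, and the move is hard to
# replicate safely - the nastiness being described.
old = db.execute("SELECT oi_timestamp FROM oldimage").fetchall()
print(old)  # [('20040912170154',)]
```

In MySQL with statement-based replication, an INSERT ... SELECT like this must either lock the selected rows or risk the replica seeing different rows than the master, which is the locking/safety trade-off the comment refers to.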
Qgil
unsubscribed.
Jul 8 2016, 9:13 AM
RobLa-WMF
assigned this task to
Krinkle
Jul 8 2016, 7:16 PM
@Krinkle, I don't remember if you explicitly volunteered to shepherd this one, but since you seem to be the sponsor for this change, I'm assigning this to you.
RobLa-WMF
moved this task from
P1: Define
to
Request IRC meeting
on the
TechCom-RFC
board.
Jul 9 2016, 12:52 AM
RobLa-WMF
added a parent task:
T17441: Some tables lack unique or primary keys, may allow confusing duplicate data
Jul 12 2016, 4:24 AM
RobLa-WMF
updated the task description.
Jul 12 2016, 4:35 AM
daniel
added a comment.
Edited
Jul 12 2016, 3:13 PM
My two cents on integrating file history with page history (T96384), using Multi-Content-Revisions (T107595):
When a new version of a file is uploaded, a new revision of the file description page is created, in which the "file" slot is updated (optionally, the "main" slot with the wikitext would also be updated).
The "file" slot has a content model called "uploaded-media" (or just "file"), which consists of structured meta-data (json?) about the uploaded file.
The most important information is the internal identifier of the file, which can be used to find this version of the file both on disk and via HTTP. Additional information about the file should probably be included, like size, mime type, hash (enabling migration to content-addressable storage!), and maybe even extracted meta-data (exif).
This requires internal and external identifiers that remain stable. 'Vulpes_vulpes_standing_in_snow.jpg' is not sufficient; it would have to be something like '20040912170154!Vulpes_vulpes_standing_in_snow.jpg' from the start.
We could achieve this by allowing such "archive IDs" to also work on the current version. So '20040912170158!Vulpes_vulpes_standing_in_snow.jpg' should currently be an alias for the plain 'Vulpes_vulpes_standing_in_snow.jpg' (or the other way around).
For display, ImagePage will have to know about the "file" slot. Some API modules will probably also have to be aware of this.
I think all the information that is currently in the image table can be managed as "slot content", but we should keep the image table as a secondary table for quick queries by size or mime type. The image table would become a "derived" table like the link tables, that gets updated when content (of the file slot) gets saved.
The oldimage table, however, can be dropped, after creating "fake" revisions for each upload, with the necessary information in the file slot. The same is possible for the filearchive table, with the difference that the "fake" revisions would be created in the archive table.
The result would be:
oldimage and filearchive table can be removed
full integration of file uploads with the page history, including patrolling, reverts, undeletion, etc
image table is still available for queries
file metadata would become available in XML dumps.
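What the proposed "file" slot content might look like can be sketched as a JSON blob; every field name here is an illustrative assumption, not a defined format:

```python
import json

# Hypothetical content of the proposed "file" slot (content model
# "uploaded-media"). Field names and values are invented for illustration.
file_slot = {
    "fileIdentifier": "20040912170154!Vulpes_vulpes_standing_in_snow.jpg",
    "size": 391406,
    "mime": "image/jpeg",
    "sha1": "0123456789abcdef0123456789abcdef01234567",
    "metadata": {"exif": {"Model": "Canon EOS 300D"}},
}

# The slot would be stored as a text blob alongside the wikitext slot;
# a JSON round-trip shows the structured data survives storage unchanged
# (and could therefore appear verbatim in XML dumps).
stored = json.dumps(file_slot)
assert json.loads(stored) == file_slot
```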
brooke
subscribed.
Jul 12 2016, 4:07 PM
In T589#2453326, @daniel wrote:
> My two cents on integrating file history with page history (T96384), using Multi-Content-Revisions (T107595):
> When a new version of a file is uploaded, a new revision of the file description page is created, in which the "file" slot is updated (optionally, the "main" slot with the wikitext would also be updated). The "file" slot has a content model called "uploaded-media" (or just "file"), which consists of structured meta-data (json?) about the uploaded file.

+1

> The most important information is the internal identifier of the file, which can be used to find this version of the file both on disk and via HTTP. Additional information about the file should probably be included, like size, mime type, hash (enabling migration to content-addressable storage!), and maybe even extracted meta-data (exif).

Potentially some overlap with derived data (derived from the file), but consistent storage is of course nice.

> This requires internal and external identifiers that remain stable. 'Vulpes_vulpes_standing_in_snow.jpg' is not sufficient; it would have to be something like '20040912170154!Vulpes_vulpes_standing_in_snow.jpg' from the start. We could achieve this by allowing such "archive IDs" to also work on the current version. So '20040912170158!Vulpes_vulpes_standing_in_snow.jpg' should currently be an alias for the plain 'Vulpes_vulpes_standing_in_snow.jpg' (or the other way around).

Ah, this comes back to needing permanent storage ids; I'm not sure I like the aliasing idea, though it may work more simply than forcing an immediate migration...

> For display, ImagePage will have to know about the "file" slot. Some API modules will probably also have to be aware of this. I think all the information that is currently in the image table can be managed as "slot content", but we should keep the image table as a secondary table for quick queries by size or mime type. The image table would become a "derived" table like the link tables, that gets updated when content (of the file slot) gets saved.

Sounds sensible... the image table is mostly used for metadata lookups, but does get used for listings. I think mime type searches could be done more effectively by tagging the search index on the pages than by anything we do with the image table (unless you're just pulling a list of all files of a certain type, which is probably rare).

> The oldimage table however can be dropped, after creating "fake" revisions for each upload, with the necessary information in the file slot. The same is possible for the filearchive table, with the difference that the "fake" revisions would be created in the archive table.

These'll need to either stay for old items, or stay long enough for a migration.

> The result would be:
> oldimage and filearchive table can be removed
> full integration of file uploads with the page history, including patrolling, reverts, undeletion, etc
> image table is still available for queries
> file metadata would become available in XML dumps.

Whee! :)
RobLa-WMF
added a project:
TechCom-Has-shepherd
Jul 13 2016, 5:28 AM
RobLa-WMF
moved this task from
Backlog
to
Krinkle
on the
TechCom-Has-shepherd
board.
GWicke
subscribed.
Jul 13 2016, 8:22 PM
The RFC could use a more detailed problem statement, which would help to establish a scope. For example, to me it is not entirely clear whether the issues and solutions described in T66214: Define an official thumb API are in scope or not, or how those two proposals would interact.
RobLa-WMF
mentioned this in
P3425 ArchCom-RFC-2016W28-irc-E228.txt
Jul 13 2016, 10:06 PM
daniel
added a comment.
Jul 14 2016, 4:44 PM
Was handling of image revisions via multi-content revisions discussed during yesterday's IRC session?
Krinkle
added a comment.
Edited
Jul 14 2016, 5:45 PM
In T589#2459544, @GWicke wrote:
> The RFC could use a more detailed problem statement, which would help to establish a scope. For example, to me it is not entirely clear whether the issues and solutions described in T66214: Define an official thumb API are in scope or not.

The design of the query API for imageinfo and thumbnails, and the url pattern for thumbnails itself, are orthogonal to this RFC. It shouldn't matter to this RFC whether those interfaces use file names (that resolve to the current version of a mutable/versioned file) or refer to a specific version directly (using the filename/timestamp or content hash). Either public interface can be built on top of the current or here-proposed schema. In fact, even the implementation of T66214 shouldn't be affected by the schema changes. The implementation details that change in this RFC are mostly hidden behind abstract methods.
The only difference is that this RFC would introduce a new "file revision ID" concept that we may or may not want to support as a parameter in the imageinfo query API.
File hashes are already stored and queryable. T66214 may need to add an index or link table for the content hashes, but that's an unrelated schema change.
I also assume that T66214 has no intention of changing the fact that we use canonical references inside wikitext/html (e.g., not embedding an older file directly). The expansion to refer to content hashes and/or versioned thumbnail urls would apply to the parser (and would need to purge pages when files change). If we want to support embedding old versions directly in wikitext, that would be a separate change as well.
Krinkle
added a subscriber:
tstarling
Jul 14 2016, 6:55 PM
Summary of yesterday's IRC discussion of this RFC follows. (Full chat log at P3425.)
"Problems" section added to
File revisions should have better unique identifiers than "current file title + timestamp".
Uploading file revisions must not involve rows moving across database tables.
This RFC would provide stable file IDs. While they are not intended to become public identifiers, T66214 might want to use them. It would allow thumbnails (addressed by content hash) to be associated with a file ID rather than a file name. This would give thumbnail storage a more stable and hierarchical structure, making purges easier, especially when users rename a file.
Having a primary key will make future schema changes easier.
Merely adding a primary key to the current schema would not solve Problem 2. As such, Proposal 1 was rejected.
Several different schema migration strategies were discussed afterward in #wikimedia-tech. Ideas from @tstarling and @jcrespo are now summarised at P3425 ArchCom-RFC-2016W28-irc-E228.txt
21:02:21 #startmeeting RfC: image and oldimage tables
21:02:21 Meeting started Wed Jul 13 21:02:21 2016 UTC and is due to finish in 60 minutes. The chair is robla. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:02:21 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:02:21 The meeting name has been set to 'rfc__image_and_oldimage_tables'
21:02:41 #topic Please note: Channel is logged and publicly posted (DO NOT REMOVE THIS NOTE) | Logs: http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-office/
21:03:03 hi everyone!
21:03:27 https://phabricator.wikimedia.org/T589 is the topic for today
21:03:54 Krinkle: can you give a quick summary of the topic?
21:04:03 (Hi Everyone :)
21:04:12 Yep
21:04:25 Reference: https://www.mediawiki.org/wiki/Requests_for_comment/image_and_oldimage_tables - outlining here now
21:05:14 hmm
21:05:19 There is a fair number of problems with how we store multimedia in MediaWiki. This RFC will focus specifically on internal storage model for files and revisions.
21:05:27 (we're also trying to have a field narrowing conversation as defined here: https://www.mediawiki.org/wiki/Good_meetings#Taxonomy )
21:06:13 (if we're succesful, that means: "tentative consensus about the limited set of options, with someone assigned to clearly document the tentative consensus in the consensus-building forum of record")
21:06:19 I think we have a lot of flexibility on where we want to go with this. Historically, the direction has been to replace it with a page/revision-like set up where files and file revisions have numerical IDs.
21:06:24 if the idea is to get rid of INSERT SELECT then moving it to multi-content revisions won't do that
21:06:27 But recently, several other ideas have come up.
21:07:07 when jcrespo complained about how inefficient INSERT SELECT is, he linked to an incident involving filearchive, i.e. file deletion
21:07:09 For the moment, I've summarised two problems on top of the page: "File revisions should have better unique identifiers" and "Files should not be identified by their caption, but be given a language-neutral identifier."
21:07:39 TimStarling: Yeah, I think whatever we come up with, the idea of rows being moved should definitely go away.
21:07:48 Which to me seems like proposal #1 is unfit.
21:07:50 those two problem statements have some overlap with the idea to use content hashes
21:08:03 https://phabricator.wikimedia.org/T66214
21:08:54 gwicke: That RFC primarily deals with the thumbnail API, and HTTP api for files. It doesn't neccecarily dictate the storage model and/or how people reference them in wikitext/html.
21:08:57 as I said on the task, I am not entirely sure whether this is in scope, or how it would interact
21:09:06 #info topic of discussion - what is the scope of this RFC
21:09:53 One use case you could think about is, if we have file IDs, those will likely be per-wiki, which makes them unsuitable for use from a foreign file repo (e.g. if we have {{#file:1234}} how should that work cross-wiki, do we still allow shadowing of file names?
21:09:56 much of the motivation for content-based addressing is to solve the cache invalidation / consistency issue
21:10:24 which is also a part of the motivation for moving towards more stable ids that don't change when images are renamed
21:10:54 Another use case is to disallow mutation of files, and require a 1:1 mapping of a file upload and a description page - in which case re-upload would require updating all usage - which isn't practical cross-wiki and to third-parties. Though redirects could be use as canonical alias for frequently updated files.
21:11:22 you could have a 1:n mapping from descriptions to files
21:11:39 m:n if you factor in languages
21:11:55 Yeah, though I think that would move internal details to the user land too much. We can duplicate internally on a per-fileblob level.
21:12:05 By using hashes indeed.
21:12:16 Having a 1:n mapping for mutable files and descriptions doesn't seem useful.
21:12:34 we already have that, and it's shown as a "file history"
21:12:47 #1 is a good idea regardless right? All tables should have primary keys...
21:13:00 there is one description page per file history, not multiple for the same "file"
21:13:05 And it doesn't prevent us from implementing any of the other solutions
21:13:29 legoktm: Yeah, it would be a very minimal change, but wouldn't solve any of the underlying problems such as killing the concept of rows being moved across tables, which is the main blocker (/me adds to problem statement)
21:13:32 Krinkle: I think we agree, it's already 1:n
21:13:39 one description, n file revisions
21:14:05 And having the primary key will make future schema changes (whatever they are) much easier
21:14:08 #info 14:12:48 #1 is a good idea regardless right? All tables should have primary keys...
21:14:18 gwicke: Ah, yes.
21:14:29 gwicke: I was thinking 1 versioned file, multiple versioned description pages
21:14:32 which we don't.
21:14:53 yeah
21:14:53 not currently
21:15:07 although with languages, that might be in scope (?)
21:15:20 #info [primary keys] wouldn't solve any of the underlying problems such as killing the concept of rows being moved across tables, which is the main blocker (/me adds to problem statement)
21:16:06 at a high level, is the goal of this RFC to clean up how the entities in the media / image space are structured?
21:16:20 I'd like to disqualify option 1 from the RFC. But I'm curious of people think that would be worth the effort to do first. Personally I don't think it would benefit much given virtually any future direction would end up replacing that intermediary state completely.
21:16:38 also, are concerns around multi-project & multi-language media use in scope?
21:16:55 gwicke: What do you mean by 'how they are structured'?
21:17:14 Krinkle: how the data model is structured
21:17:24 (I hope all this Wikicommons' media including video can feed in to, conceptually, a Google Street View with OpenSimulator some years ahead, where we can all "wiki-add" to Google Street View and OpenSim (conceptually again) since WMF may have a film-realistic, interactive, 3D, group build-able, with avatars planned for all 8k languages with Time Slider virtual earth for wiki STEM research, on the horizon :)
21:17:28 Krinkle: I'm not sure if option 1 solves the goals of what you want to do, but I think all tables do need primary keys regardless, so unless your plan is to remove the tables entirely, I don't see how it can be disqualified.
21:17:29 for example, how many descriptions can we have per media blob
21:18:07 * aude waves
21:18:14 legoktm: option 2 involves a schema change that also includes primary keys. option 3 involves dropping the tables effectively, and re-creating them as secondary data (like categories) - which would have primary keys, but for completely new rows.
21:18:40 #info Krinkle wants to remove option #1, legoktm suggests we need primary keys regardless
21:18:41 gwicke: I think whether descriptions are smashes into 1 wikitext blob or multiple should be discussed separately.
21:18:42 so really option 1 is already part of option 2 and 3?
21:19:20 Krinkle: it affects the data model, in the sense that each language would probably have its own page id & history
21:19:21 fine by me to drop option 1
21:19:37 option 1 is: add img_id and oi_id to the existing tables
21:19:40 while the license is probably tied to the blob itself
21:19:50 legoktm: Kind of, option 1 has one aspect the other one doeesn't: It makes moving rows more complicated by lazy-assinging primary keys as they move.
21:19:53 Which is worse in some ways.
21:20:31 It means you still don't have a revision ID until after it becomes the N-1 revision.
21:20:43 okay
21:20:54 As long as the tables have primary keys, I don't have an opinion :)
21:21:02 Cool
21:21:27 jynus: welcome! we've already decided everything; hope you don't mind ;-)
21:21:36 * aude reads the rfc
21:21:52 gwicke: If we want to split translations of descriptions onto separate wiki pages (e.g. sub pages as opposed to slots), I'd prefer to keep that separate. It woudl also involve creating a linking table of file to wiki page.
21:21:53 the particular strategy suggested in #2 may need some refinement
21:21:59 * robla now resumes trying to be a responsible chair
21:22:19 of course with cur/old we renamed old and proxied cur, introducing a completely new page table
21:22:24 which I think was the right way to do it
21:22:35 INSERT SELECT is not inneficient, it is dangerous
21:22:35 I think option 2 would be good to do even if we consider option 3 later. It would put us in a better shape. But the migration is very large, so we shoudl keep in mind the migration cost.
21:22:59 revision was also a completely new table
21:23:16 basically a new abstraction layer which allowed non-intrusive migration
21:23:19 I'm not entirely sure that we need an image id in addition to description ids & blob hashes
21:23:27 Krinkle: we will have some linking like that when we add structured data support for commons
21:23:34 but then sounds like the rfc is more generally about mediawiki
21:23:41 Yeah.
21:24:17 TimStarling: Right. If we create them as new tables, we could populate them one by one and do it again until it is caught up with the master, and then switch masters.
21:24:28 with structured data, descriptions might be part of the data item which is multilingual
21:24:49 I don't know how oldimage and image are in terms of scale.
21:24:56 Probably bigger than enwiki was at time of live-1.5
21:25:08 Can we afford to have a complete copy of it for a short time?
21:25:11 jynus: ^
21:25:16 (in the same db)
21:26:31 forget about implementation
21:26:47 on WMF, that is my job to figure it out
21:26:55 yeah, but we are figuring it out
21:27:02 so do not worry about it
21:27:17 not the implementation
21:27:19 jynus: I was just wondering whether it is feasible space-wise to have a complete duplicate of 'image' and 'oldimage' in the commons prod as part of the migration script.
21:27:21 the migration, I mean
21:27:34 why do you need that?
21:27:43 :D
21:28:11 jynus: Tim mentioned the best way to migrate with minimal read-only time is to create new tables and populate those instead of changing the existing one.
21:28:24 jynus: I think a lot of the thought around migration is to make sure we have something that works outside of the Wikimedia context (as well)
21:28:34 I wouldn't be surprised if full migration would take days if not weeks.
21:28:48 cannot we just convert image into image_revision?
21:28:55 so, in a future world where descriptions are in wikidata & this RFC introduces an image id, we'd have a mapping from name to image id, Q-item to image ID(?), and image id to list of blob hashes?
21:29:13 gwicke: That sounds pretty good.
21:29:24 Note the concern about the practical use of image IDs, though.
21:29:33 this RFC is about image revision backend, we're not going to have this discussion on terms of "don't think about that, that's the DBA's job"
21:29:46 I do not mean that
21:29:51 Where file names (terrible as they are) are kind-of practical to use globally (across wikis), file IDs would obviously conflict if kept numerical.
21:30:03 I mean that sometimes you block tourselves thinking "that cannot be done"
21:30:17 and 99% of the times things can be done
21:30:35 discuss *if* you want a final state or not
21:30:39 Krinkle: maybe globally, then need to be qualified with interwiki prefix or something
21:30:42 ok, but I think you're jumping in without being fully up to date on the discussion
21:30:49 Okay, as long as it doens't require a larger hard drive or newer hardware just to migrate.
21:30:49 there is always a way
21:31:01 to make globally unique
21:31:03 I do wonder if it could be name to blob hashes, name to Q-item, and Q-item to blob hashes instead
21:31:17 I read the backlog
21:31:55 ok, can we talk about what you're proposing then?
21:32:21 convert image to image_revision -- not sure what this means
21:32:53 I am not proposing anything, I am asking why you want to duplicate the 'image' and 'oldimage' tables
21:33:55 here image_revision (in my mind) is the image table
21:34:31 jynus: The current tables have each row describe the object in full (with no primary keys). With 'image' holding the current revs and oldimage holding the non-current ones. one the new schemas proposed would involve the 'image' table no longer containing full descriptions (just a pointer to the current revision in image_revision) and both current and
132
21:34:31 non-current rows being in image_revision.
133
21:34:57 I know that
134
21:35:15 I just do not see why you want to duplicate them
135
21:35:22 it was just an idea
136
21:35:30 I'm happy to back off from it if you have a better idea
137
21:35:37 tell me about that idea
138
21:36:30 Firstly, the software would need to know which table to query, and whether to expect the old or new system in it during the migration.
139
21:36:34 I do not see how it fits proposal 1 or 2
140
21:36:57 ok, so you are not duplicating things
141
21:37:13 you just have 4 tables during the migration
142
21:37:28 jynus: one idea to migrate to the new schema was to create the newly schemad and named tables, import all the data, and drop/flip once finised.
143
21:37:29 Yes
144
21:37:32 which is ok
145
21:37:41 that doesn't need *duplicating data*
146
21:37:49 legoktm: Re [14:20] As long as the tables have primary keys, I don't have an opinion :) ... what are the steps for anticipating orders of magnitude more primary keys ... say for modeling a) brain neurons and b) at the much smaller nano-level (in a hypothetical Google Street View with OpenSimulator, conceptually and some years' ahead, and for brain research? Thanks.
147
21:37:54 nothing against it
148
21:38:01 and it does not require extra resources
149
21:38:34 my idea was to do it in a similar way to how page/revision migration was done
150
21:38:48 jynus: Well, commonswiki would temporarily hold a significantly larger data set. Since we'd have essentially a copy of those two tables?
151
21:39:01 Or does mariadb have a way to import string and binrary values by reference?
152
21:39:19 but images would be on one format or another, not both, right?
153
21:39:19 that is, add a new table (say imagerevision) which has one row per image revision, but with very small rows
154
21:39:52 If what Tim says is #2, I like that better
155
21:40:11 "just adding PKs" would be very inneficient
156
21:40:19 jynus: No, the software would keep using the old tables until the new tables are completely ready for use (with a small read-only time to catch up)
157
21:40:19 and denoramized
158
21:40:39 no, I do not like that, Krinkle
159
21:40:45 Exactly.
160
21:41:04 we can do better
161
21:41:07 All of this is #2. We're just talking about how the migration would go.
162
21:41:11 the image table currently has things like img_height, img_metadata
163
21:41:22 we can migrate progresively
164
21:41:25 img_metadata in particular can be enormous
165
21:41:37 #info that is, add a new table (say imagerevision) which has one row per image revision, but with very small rows [if this is proposal #2] I like that better
166
21:41:41 ftr, I'm still not convinced that an image id would help us solve the changing-image-dimension or more generally image metadata caching problem
167
21:41:55 so you could introduce imagerevision which would just give a pointer to the location of the full image metadata
168
21:42:12 yes, and that would be relativelly small
169
21:42:22 TimStarling: What location is that?
170
21:42:39 initially it could point to the existing image and oldimage tables
171
21:42:47 I do not have 100% clear the fields
172
21:43:00 but I think those are details
173
21:43:13 * gwicke heads out for lunch
174
21:43:58 I would push for a non-freezing migration- being temporarily compatible with both or filling it up in parallel
175
21:44:11 so...we have three proposals in the current RFC, and option #2 is the one that this group seems to be the most interested in fleshing out. is that right?
21:44:11 so... we have three proposals in the current RFC, and option #2 is the one that this group seems to be the most interested in fleshing out. is that right?
21:44:11 TimStarling: Hmm, I guess we could even not have the pointers if this file-metadata table becomes keyed by image_revision
21:44:42 potentially
21:44:55 it may be that I'm overcomplicating things
21:45:00 jynus: Yeah, but I don't think that is feasible due to the lack of the needed indexes and primary keys. I don't see a way to make the software maintain both in parallel.
21:45:29 Anyhow, let's worry about migration later.
21:45:39 Let's continue discussing what direction we want to go in.
21:45:39 since the problem we were dealing with with cur/old was that the old table at the time had the actual text in it, something like 90% of our entire database, and we didn't even have the disk space to duplicate it
21:45:39 since the problem we were dealing with with cur/old was that the old table at the time had the actual text in it, something like 90% of our entire database and we didn't even have the disk space to duplicate it
21:46:03 Yeah, good point.
21:46:06 It predates ES
21:46:16 Krinkle, remember that I am going to add an id to the watchlist table (which has no PK) in a hot way
21:46:39 so there are many tricks to do
21:46:51 what is better
21:46:54 Adding keys is simple imho. The software can even be feature-flagged as to whether to create/query those.
21:46:56 #info old cur/old->page migration happened before External Storage existed
21:47:00 I added an autoincrement PK to the logging table, it was a nuisance IIRC, but possible
21:47:19 I plan to reduce our database size magically: https://phabricator.wikimedia.org/T139055
21:47:35 that was my point about "do not worry too much about that"
21:47:44 I have you covered
21:48:21 If we go with option 2, what about file IDs? Would they be appropriate for use in APIs and user-generated content? How would this go cross-wiki?
21:48:28 #info jynus plans to employ InnoDB compression, discussed in T139055
21:48:28 T139055: Test InnoDB compression - https://phabricator.wikimedia.org/T139055
21:48:41 Krinkle: you mean like https://www.mediawiki.org/wiki/User:NeilK/Multimedia2011/Titles ?
21:48:57 with file IDs used as the main user-visible identifiers?
21:49:28 [[File:38798231]] etc.
21:50:01 TimStarling: Yeah, I mean, initially I guess that wouldn't be relevant, since we'd still have file description pages, which have a unique wiki page name, and transclusion of files is indirectly behind resolving the page title first.
21:50:23 here, does file mean actual files that can be overridden by another revision, or a revision?
21:50:24 But it is something we may wish we had done differently if we don't think about it now.
21:50:31 * robla looks in the backlog for gwicke's task number regarding file ids
21:50:57 I am a long way from convinced on this
21:51:24 * gwicke returns with burrito
21:51:28 I'm okay with saying that file IDs will stay internal like we do with page IDs.
21:51:29 T66214 is the task gwicke mentioned earlier
21:51:29 T66214: Use content hash based image / thumb URLs & define an official thumb API - https://phabricator.wikimedia.org/T66214
21:52:08 the more I think about it, the more this is non-trivial
21:52:13 it would however defer the group of file name problems to a later RFC (need to have unique names at upload time, sometimes awkward to use in wikitext, not stable, subject to redirects being deleted, etc.)
21:52:25 purging caches? will it affect them?
21:52:57 jynus: The current option proposed here doesn't affect that in any way. However, if we adopt stable file IDs, all that will become significantly easier, not harder.
21:53:13 (I need to check the logging and how it is stored in the db)
21:53:20 jynus: those problems are what content hashes are designed to solve
21:54:21 Krinkle: this is mostly independent of the backend issue, isn't it?
21:54:27 from what I can tell, for most use cases listed in the present RFC, content hashes would actually have better caching properties
21:54:29 TimStarling: re "I am a long way from convinced on this" - can you elaborate? I'd agree that there doesn't seem to be an easy solution to having a unique file ID that is stable and yet usable across wikis.
21:54:37 we can add a UI to access file IDs later if we want to
21:54:49 Yeah, totally, we don't need to tackle that at all.
21:55:04 we're nearing the end of our time. we started the meeting with 3 proposals in the RFC, and option 2 seems the most viable at the moment. is that right?
21:55:04 we're nearing the end of our time. we started the meeting with 3 proposals in the RFC, and option 2 seems the most viable at the moment. is that right?
21:55:07 The only thing that would become exposed is the file revision ID, not the file ID.
21:55:16 Since we'd use that instead of the current (file name, timestamp) tuple.
21:55:21 in APIs
21:55:42 TimStarling: Right?
21:55:49 yes
21:55:53 (and that tuple is already fragmented by wiki, so no concerns there)
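The addressing change being discussed — replacing the (file name, timestamp) tuple with a single wiki-local revision ID in APIs — can be illustrated with a toy lookup. All names and data structures here are invented for the sketch, not any actual MediaWiki API.

```python
# Hypothetical in-memory revision store; field names invented for illustration.
revisions = [
    {"id": 1, "name": "Example.jpg", "timestamp": "20140101000000"},
    {"id": 2, "name": "Example.jpg", "timestamp": "20160101000000"},
]

def get_by_tuple(name, timestamp):
    """Current style: address a revision by (file name, timestamp).
    Fragile: renames change the name, and the tuple is only meaningful
    within one wiki anyway."""
    return next(r for r in revisions
                if r["name"] == name and r["timestamp"] == timestamp)

def get_by_revid(rev_id):
    """Proposed style: a stable integer that survives file renames."""
    return next(r for r in revisions if r["id"] == rev_id)

# Both addressing schemes resolve to the same revision record.
assert get_by_tuple("Example.jpg", "20160101000000") == get_by_revid(2)
```

Since the tuple is already scoped per wiki, swapping it for a per-wiki integer ID changes nothing about cross-wiki semantics, which is the point made above.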
21:56:38 revision id would be unique, as with page revisions
21:56:38 sounds good, robla:
21:56:41 TimStarling: I suppose link tables will also stay the same, for the same reason we changed them to be based on titles instead of page ids.
21:57:06 wait, would this be for the file, or the page description?
21:57:08 jynus: Yeah, but unlike file IDs (which would need to make sense from another wiki), file revisions are never referenced without a wiki context.
21:57:20 yes
21:57:23 URLs make sense in a global context
21:57:27 as in, you are right
21:57:31 autoincrement IDs, not so much
21:57:38 UUIDs could work
21:57:46 you could assign a 128-bit random ID or something
21:58:09 but it's a discussion for another day
21:58:26 gwicke: If we adopt something better than user-generated page names for public file identifiers in the future, the file page would presumably use that same ID in its title.
21:58:46 I think there is general support for that, but maybe this needs to mature a bit more, along with other outstanding related issues?
21:58:52 would the ID identify a blob, or a description of a blob?
21:59:05 But I agree with Tim that we should leave that as-is for the purpose of this RFC (we keep using file "titles" as the file page and transclusion name for InstantCommons etc.)
21:59:33 we need to support titles anyway, at least for b/c
21:59:36 Krinkle: any decisions you hope to make sure we document here?
21:59:44 so I don't think we're boxing ourselves in by considering it later
22:00:34 gwicke: The file ID I theorise here would identify the file as a mutable entity. A revisioned wikitext description page would use that ID in its page name. (e.g. File:12345)
22:00:44 But anyway, issues. Let's keep that outside the scope for now.
22:00:51 I'm going to summarize this as "Krinkle updated the RFC to give 3 options, we discussed them, and #2 seems to be the current favorite"
22:01:01 LGTM.
22:01:25 * robla is going to hit #endmeeting in a couple of minutes; further discussion welcome on #wikimedia-tech
22:01:27 option #1 is now excluded, but option #3 was not really discussed
22:01:58 I'll update the task and call for fleshing out the details with regards to the exact schema (separate image metadata? any other columns to be dropped?) and how to migrate there (rename and mutate, or re-create, or both for a while)