English subtitles for clip: File:08 Abstract Wikipedia.webm

1
00:00:00,000 --> 00:00:07,710
I'm running late. I am not here. So I'm here to
talk about... I'm here to talk about Abstract

2
00:00:07,710 --> 00:00:13,080
Wikipedia: language versus knowledge. So I've
spoiled my talk a little bit by presenting

3
00:00:13,080 --> 00:00:17,640
earlier, a little bit unplanned. Sorry about that.
Hope it's not too boring for you. So I'm James

4
00:00:17,640 --> 00:00:22,650
Forrester. I'm the tech lead for the Abstract
Wikipedia team, working for the Wikimedia

5
00:00:22,650 --> 00:00:29,670
Foundation, and thrilled to be presenting here
today. So I wanted to give a little bit of

6
00:00:29,670 --> 00:00:35,310
context around what the motivations are for
Wikifunctions, and the Abstract Wikipedia

7
00:00:35,310 --> 00:00:42,750
dream. Vision, should I say; it sounds much
better than a dream, right? So, vision. And so I

8
00:00:42,750 --> 00:00:48,030
just just a quick canter through, you know,
what's leading us to our project and where we

9
00:00:48,030 --> 00:00:53,820
are with it. So here's the agenda. There's a
bunch of reasons for why we're building Wikifunctions,

10
00:00:53,820 --> 00:00:58,290
which hopefully will resonate with a
lot of people who are current Wikipedians. And

11
00:00:58,290 --> 00:01:03,960
there's a lot of vision for Abstract Wikipedia,
which hopefully will inspire and not sound

12
00:01:03,960 --> 00:01:11,730
boring. So: wiki work, unshared. So there are a
lot of Wikipedias. I mean, depending on what you

13
00:01:11,730 --> 00:01:15,360
count as an active Wikipedia, there are a
different number. But there are over 300

14
00:01:15,390 --> 00:01:21,480
Wikipedia domains in different languages that
are live right now. That is a different concept

15
00:01:21,480 --> 00:01:26,250
of knowledge each time but it's the same concept
of the work, right? There's the creating

16
00:01:26,250 --> 00:01:29,490
updating articles, there's the patrolling, all the
different things that are very familiar to

17
00:01:29,490 --> 00:01:36,120
people who edit Wikipedia, curate Wikipedia,
make Wikimedia amazingly successful. And these

18
00:01:36,120 --> 00:01:39,780
are the same pieces of work whether you speak
English or Urdu, whether you speak you know,

19
00:01:39,810 --> 00:01:46,290
French or Swahili, but they don't have the
same size, right? English Wikipedia has 125,000 monthly

20
00:01:46,290 --> 00:01:52,350
contributors, roughly, on average; about 19,000
pages in the Lua namespace, the Module

21
00:01:52,350 --> 00:01:57,060
namespace, one and a half billion speakers,
about 900 million of which turn up each month.

22
00:01:57,810 --> 00:02:05,520
The Hausa Wikipedia has 60 monthly contributors,
500 pages in the Lua namespace, which is already

23
00:02:05,520 --> 00:02:11,820
quite big; 70 million speakers in the world, but
only 200,000 of them show up each month. And the

24
00:02:11,820 --> 00:02:16,830
nature of people showing up may be different for
different people; you know, ten hours

25
00:02:16,830 --> 00:02:21,690
continually reading on your desktop computer
reading 10,000 pages versus a snippet you read,

26
00:02:21,960 --> 00:02:27,690
you know, on Google; we count both of those, basically,
as one person, but the nature of that consumption is

27
00:02:27,690 --> 00:02:35,460
different. So here is a screenshot from the
Module (Lua) namespace of the Malay Wikipedia.

28
00:02:37,590 --> 00:02:42,120
So one of the things you might notice here, so
Malay is the principal language spoken in

29
00:02:42,120 --> 00:02:47,550
Malaysia, it is written in the Latin script. It
is not however, based on English, but all these

30
00:02:47,580 --> 00:02:53,880
articles, or rather these modules, are named in English.
And that's because they aren't for the Malay

31
00:02:53,880 --> 00:02:58,650
community, they were written for the English
Wikipedia, and then mostly copied across. So

32
00:02:58,650 --> 00:03:06,810
here is the one for age, right? As you can see,
it's written in English for English speakers for

33
00:03:06,810 --> 00:03:11,070
English people, people who speak English, at
least pretty fluently, if not completely

34
00:03:11,070 --> 00:03:17,370
fluently. And in fact, if you look at the history of
this article, it was imported in one great big

35
00:03:17,370 --> 00:03:22,710
bulk in 2017. And essentially hasn't been
touched since except one time where someone very

36
00:03:22,710 --> 00:03:27,930
helpfully actually translated a bunch of the
labels that it would output so that on Malay

37
00:03:28,140 --> 00:03:34,650
wiki, it looks useful, right. And that's a great
thing to do. But here's English Wikipedia's

38
00:03:35,430 --> 00:03:41,490
version of this, the original, if that makes
sense. Once they start forking off, it has

39
00:03:41,490 --> 00:03:45,390
TemplateData, you know, stuff that helps you
when you're using VisualEditor actually use

40
00:03:45,390 --> 00:03:53,100
this script, not present on Malay at all. It
has revision history with dozens and dozens and

41
00:03:53,100 --> 00:03:57,690
dozens of improvements, maybe good improvements,
maybe bad; the equivalents do not exist on the Malay

42
00:03:57,690 --> 00:04:06,930
Wikipedia. And this is replicated on 86
communities. Well, 85, because English is one of

43
00:04:06,930 --> 00:04:14,100
them, right, across different Wikipedias. This is
just a very simple script that takes a date, and

44
00:04:14,130 --> 00:04:18,090
the current date or a different date and
calculates how old someone was at that point.

45
00:04:18,390 --> 00:04:23,400
Right. So it's a really useful thing for writing
prose; it's a really boring thing for taking two

46
00:04:23,400 --> 00:04:27,720
numbers apart from each other. And you know, it
doesn't differ necessarily by different

47
00:04:27,720 --> 00:04:33,300
contexts. It's a thing we could share. But we
currently don't have that technology. And in

48
00:04:33,300 --> 00:04:38,850
fact, you can look at a lot of the versions of
the age module across the different Wikipedias,

49
00:04:39,000 --> 00:04:43,530
and the Malay example we picked because it's
kind of an extreme version of this, but lots of

50
00:04:43,530 --> 00:04:47,400
Wikipedias have this problem, including the
English Wikipedia, where people find something

51
00:04:47,400 --> 00:04:52,410
that works for them on German or Arabic, they
copy it over to English, localize it a bit, but

52
00:04:52,410 --> 00:04:56,340
the English Wikipedia also to a much smaller
extent suffers from the problem of having

53
00:04:56,340 --> 00:05:02,280
modules that are kind of disconnected from the
original ethos. So this is why we built

54
00:05:02,280 --> 00:05:06,750
Wikifunctions. Fundamentally, the dream for
Wikifunctions is that you're gonna have a site where

55
00:05:06,750 --> 00:05:11,820
you can go, which has one version of this Lua
script, this module, this function that works

56
00:05:11,820 --> 00:05:17,310
for everyone, regardless of whether you're Malay
Wikipedia, or English Wikipedia, or German, or

57
00:05:17,310 --> 00:05:23,730
any of the others. This is still a dream, not a
reality; this does not work. But it's here, right? So

58
00:05:23,730 --> 00:05:29,400
here is an age function. Now forgive the
screenshots, they look a bit tacky, we have

59
00:05:29,400 --> 00:05:37,740
actually changed the design as of like, two
weeks ago. So these things are still landing.

60
00:05:37,920 --> 00:05:41,310
There are other things here that are really
interesting, though, you'll notice under type,

61
00:05:41,310 --> 00:05:48,480
it says Gregorian proleptic calendar date, and
that is perfectly correct and perfectly unusable

62
00:05:48,480 --> 00:05:52,230
for almost every human on Earth. But like,
that's a little thing where we, as a community

63
00:05:52,230 --> 00:05:56,190
can agree: actually, why don't we just say "date",
you know, and then the other ones can say that

64
00:05:56,190 --> 00:06:00,540
we have ones instead. But you can see here, you
write the value 23, in the month of May; that's a

65
00:06:00,540 --> 00:06:06,900
dropdown, that's not writing the word, anyway.
And the positive integer, you know, 45. And you

66
00:06:06,900 --> 00:06:11,370
run the function. Yeah, it's a dropdown for Common
Era versus Before the Common Era. And it

67
00:06:11,370 --> 00:06:17,640
would output a bit of text that says, you know,
how old someone was today, as of that date. So

68
00:06:17,640 --> 00:06:22,800
Wikifunctions is here; you know, that function
exists, but it's not here here, right. So

69
00:06:22,800 --> 00:06:26,730
here's the front page of Wikifunctions, and you
can go to it now; you can log in and

70
00:06:26,730 --> 00:06:31,560
create a function. And here's an example of a
real function that exists: a "not" function.

71
00:06:31,560 --> 00:06:36,930
This was one of the simplest functions someone
built, you put in true, and you get back false,

72
00:06:37,020 --> 00:06:42,180
you put in false, you get back true. World-changing
stuff. But the important thing here is

73
00:06:42,180 --> 00:06:46,620
that there's one version of that function that
can be used as the basis for every bit of

74
00:06:46,650 --> 00:06:51,930
Boolean logic you're using in the entire system.
And this can be translated. So if I go back,

75
00:06:52,170 --> 00:06:57,570
you'll see here it says, takes a truth value and
returns its opposite. That's the description in

76
00:06:57,600 --> 00:07:03,690
English. But if I click edit here, which doesn't
appear because this is a slide deck, you'll be

77
00:07:03,690 --> 00:07:10,320
able to set it in a dozen different languages or
actually 650, something like that, roughly

78
00:07:10,320 --> 00:07:15,810
languages. The concept of what a language is starts
to get very difficult, and I will not pretend to

79
00:07:15,810 --> 00:07:21,330
answer that here. This is controlled by the
community. So you'll see here, this is not I

80
00:07:21,330 --> 00:07:25,560
have written the correct version of the not
function, you will use it; this is written on the

81
00:07:25,560 --> 00:07:28,950
wiki, controlled by the wiki community on
Wikifunctions. It has a bunch of

82
00:07:28,950 --> 00:07:33,510
implementations, a bunch of tests. And so here's
an implementation in JavaScript that someone's

83
00:07:33,510 --> 00:07:39,420
written. And here's a different implementation
in Python. And so the important thing here is

84
00:07:39,420 --> 00:07:43,560
that both of them are equally valid, we're not
making decisions for the community about what

85
00:07:43,830 --> 00:07:47,820
what language they should write in. And we have
to make some decisions about what we're going to

86
00:07:47,820 --> 00:07:52,680
support, but we want to be open rather than
make decisions for people. You can also write it

87
00:07:52,680 --> 00:07:57,330
and, this is the thing that we think is really
important, in a composition. And so this is not a

88
00:07:57,330 --> 00:08:01,080
programming language. I mean, it is, but it's
not somebody else's programming language; this is the

89
00:08:01,080 --> 00:08:06,630
Wikifunctions kind of way. And we hope at some
point, this will be something that you can

90
00:08:06,630 --> 00:08:12,390
actually represent visually. So you can already
read it in prose. If I was logged in,

91
00:08:12,420 --> 00:08:17,340
however, in Polish, this prose would be in
Polish, right? It's not, it's not the most

92
00:08:17,340 --> 00:08:22,170
beautiful prose, you know, we're still quite
techy, and we want to be more welcoming. But

93
00:08:22,200 --> 00:08:26,790
this is in Polish, this is not in English. In
fact, pretty much no words appear here that are

94
00:08:26,790 --> 00:08:31,590
in English at all. If you're logged in, or you
just went to the Polish entry through Google,

95
00:08:31,800 --> 00:08:36,210
you would end up thinking this website was a
Polish website, in Polish for Polish people. And

96
00:08:36,210 --> 00:08:42,930
the same, theoretically, is true for all 800 of
our languages. So what's the status right now? So

97
00:08:42,930 --> 00:08:46,320
you can create functions, you can create
implementations, create test cases; you can do it

98
00:08:46,320 --> 00:08:50,670
with Booleans, strings, and you can do it with lists
of those. More is coming soon. But we do not

99
00:08:50,670 --> 00:08:57,120
support things like dates or numbers yet; that's
coming. What's coming later: calling functions

100
00:08:57,150 --> 00:09:01,440
embedded within a Wikipedia article. So you
can't actually use this as a replacement for the

101
00:09:01,470 --> 00:09:07,530
age module on English Wikipedia or Malay
Wikipedia or any of the others. Coming. You can't

102
00:09:07,530 --> 00:09:11,250
call those functions either from tools or
gadgets or third party apps. That's a really big

103
00:09:11,250 --> 00:09:14,370
thing, because it's not just powering Wikipedia,
it's powering all the other tools across

104
00:09:14,370 --> 00:09:20,670
the Wikimedia universe and third parties. And you
can't yet use Wikidata items or lexemes. But

105
00:09:20,670 --> 00:09:25,980
that's a big thing we definitely want to work on
very soon. And so this comes to language versus

106
00:09:25,980 --> 00:09:33,720
knowledge. So if you look at Wikidata, it has a
lot of structured data statements about the

107
00:09:33,720 --> 00:09:39,540
world. It says Jupiter is the fifth planet, is the
largest planet, it is a gas giant. It doesn't

108
00:09:39,540 --> 00:09:41,640
necessarily let you say

109
00:09:43,350 --> 00:09:49,020
things about statements. Certain forms of
statements are not things you can make about

110
00:09:49,980 --> 00:09:55,590
things in Wikidata; it's not there yet. And yet
you can with a function say different kinds of

111
00:09:55,590 --> 00:10:00,450
things. So here, this is an example function
called reverse string, which takes a string and

112
00:10:00,450 --> 00:10:07,290
reverses it. And this has an input of "diputs".
And if you press "Run function", you will get the

113
00:10:07,290 --> 00:10:15,000
output "stupid". This is a toy example from my
lead's daughter, who finds it hilarious, but she

114
00:10:15,000 --> 00:10:22,500
is seven. There are other examples, which are
still pretty toy examples. So this is an encode

115
00:10:22,500 --> 00:10:28,140
NATO phonetic alphabet, right. So you type in
the word wiki, and it says Whiskey India Kilo

116
00:10:28,140 --> 00:10:32,790
India; obviously, we all know that, right? This
isn't necessarily a thing that everyone's going

117
00:10:32,790 --> 00:10:36,990
to use all the time. But it is particularly
relevant if you actually have an article about

118
00:10:36,990 --> 00:10:42,540
the NATO phonetic alphabet. And we might be able
to provide readers with the ability to like run

119
00:10:42,540 --> 00:10:47,070
an actual example in the article as they're
reading about it. So rather than try and get

120
00:10:47,070 --> 00:10:51,360
their head around what it means they can explore
what it means. This is still knowledge. But it's

121
00:10:51,360 --> 00:10:55,500
not really the kind of written form of articles
that we've been generally providing on Wikipedia.

122
00:10:57,450 --> 00:11:02,880
There's possibly the way to make content after
this. So these are some examples that we

123
00:11:02,880 --> 00:11:08,700
actually created as a team as a way of exploring
what it would be to do natural language

124
00:11:08,700 --> 00:11:12,960
generation. So these are not... so these are not
templates in the wikitext sense; these are

125
00:11:13,200 --> 00:11:17,580
templates in the AI sense of
the term. You have some three inputs, and you press

126
00:11:17,580 --> 00:11:22,290
the button, and it outputs a statement. So:
Berlin, city, Germany, and it says Berlin is the

127
00:11:22,290 --> 00:11:27,060
largest city in Germany, if I change that word
from Berlin to Stuttgart, it will say Stuttgart

128
00:11:27,090 --> 00:11:31,980
is the largest city in Germany, which might not
be true, you can, in fact, say, you know, Berlin

129
00:11:31,980 --> 00:11:36,900
is the largest city on the moon, so probably not
true. And then there's a version of this in

130
00:11:36,900 --> 00:11:40,200
German; we actually wrote versions of this in a
bunch of different languages to kind of explore

131
00:11:40,200 --> 00:11:46,350
what it is that could make this useful. This
in and of itself, isn't necessarily incredibly

132
00:11:46,350 --> 00:11:53,310
powerful. But an early usage of Wikifunctions
that's already there right now, is this function

133
00:11:53,310 --> 00:12:01,020
here. So Wikidata has this set of content
called lexemes, which are words, you know,

134
00:12:01,140 --> 00:12:06,780
entries about words in all the different
languages of the world. When you say a verb in

135
00:12:06,780 --> 00:12:12,930
English, generally, it doesn't change very much
in different forms, if you think so you stand

136
00:12:13,050 --> 00:12:17,610
and you know, you stood and you're standing, but
the you know, there's only three or four

137
00:12:17,610 --> 00:12:23,310
variations. But in other languages, there can be
dozens or even hundreds of stem endings of those

138
00:12:23,310 --> 00:12:28,530
words. And those all need to be in Wikidata so
that it can automatically look up, oh, this word

139
00:12:28,530 --> 00:12:34,560
is probably this word in context. So this
function here is a relatively simple one that

140
00:12:34,560 --> 00:12:38,670
gets you the nominative plural of the Finnish
verb. So you type in an input, and it assumes

141
00:12:38,670 --> 00:12:44,730
you are correct that it is a Finnish verb... noun, and
it gets you the results. So you can say

142
00:12:44,730 --> 00:12:51,150
"vanhempi" goes to "vanhemmat". And here's what
the Wikidata lexeme entry for that would look

143
00:12:51,150 --> 00:12:55,320
like. So you see, grammatical features don't just have
singular versus plural; they have

144
00:12:55,320 --> 00:13:03,000
different forms. And then there's also this data
on Wiktionary, which is not connected to

145
00:13:03,000 --> 00:13:06,750
Wikidata to pull this data; it's all stored
separately on Wiktionary, and in fact, each

146
00:13:06,750 --> 00:13:11,280
Wiktionary. So this is a screenshot from the
English Wiktionary. But the French Wiktionary,

147
00:13:11,370 --> 00:13:15,210
and the Swedish Wiktionary and the Finnish Wiktionary
each have to have their own version of this

148
00:13:15,210 --> 00:13:21,510
content. But here is a page which actually lets
you bulk create all of these different stem

149
00:13:21,510 --> 00:13:28,620
endings onto Wikidata in one go, and lets you
get it powered by Wikifunctions. So this

150
00:13:29,010 --> 00:13:34,890
slightly inappropriately not inappropriate,
ahead of our planning, looks into the internal

151
00:13:34,890 --> 00:13:39,480
API for Wikifunctions and calls the
Wikifunctions function that I just showed, and

152
00:13:39,480 --> 00:13:44,430
outputs a whole bunch of outputs. So I
remember, as a young kid, trying to learn Latin,

153
00:13:44,520 --> 00:13:49,560
and being disheartened to realize that for each
irregular verb, I had to memorize 147 stem

154
00:13:49,560 --> 00:13:55,860
endings, which made me sad. And now a function
could remember all of these forms, and I

155
00:13:56,250 --> 00:14:01,350
don't speak very much Latin anymore, it turns
out. So that's kind of an option of where we can

156
00:14:01,350 --> 00:14:07,680
go with Wikifunctions itself. But then the
bigger picture is the Abstract Wikipedia

157
00:14:07,680 --> 00:14:13,380
vision. And so this ties back in to the kind
of sample "Berlin is the largest city in Germany"

158
00:14:13,560 --> 00:14:18,990
idea. So you can take some kind of natural
language functions, really basic things. So here

159
00:14:18,990 --> 00:14:28,140
this takes in a word in English and outputs
the plural of it, based on some smart paradigms

160
00:14:28,140 --> 00:14:33,900
about what words turn into what forms in
English. So baby to babies, for instance. And

161
00:14:33,930 --> 00:14:39,330
you can see, down in the left here, plural of
chair is chairs, plural of fairy is fairies,

162
00:14:39,390 --> 00:14:45,300
you know, so it's not, it's not just slapping an
S on the end, but it's not very clever. Long

163
00:14:45,300 --> 00:14:49,620
term, we'd want to pull this data from Wikidata
rather than probably having functions for it.

164
00:14:50,520 --> 00:14:55,560
But you can also build that together. So here
this takes a person name and a place name, and

165
00:14:55,560 --> 00:15:00,270
says Albert Einstein was born in Ulm, or James
Forrester was born in London, or all those

166
00:15:00,270 --> 00:15:06,360
different things that you could say. And you can
see how little fragments of content statements

167
00:15:06,360 --> 00:15:10,350
like this could be built together into a way
that can actually make a really quite

168
00:15:10,380 --> 00:15:18,480
significant statement. So this, this is a piece
of text that we cobbled together as a team, the

169
00:15:18,480 --> 00:15:23,910
natural language generation kind of volunteers
in working with the team built, which is

170
00:15:23,940 --> 00:15:29,880
essentially the lead of the Marie Curie article in... I
mean, it's amalgamated from a bunch of different

171
00:15:29,880 --> 00:15:34,650
wikis, but basically, it's what you would write
as the lead of the article. So Marie Curie has a whole

172
00:15:34,650 --> 00:15:38,970
bunch of statements, you can pull out a wiki
data right now looks like when she was born.

173
00:15:39,870 --> 00:15:43,980
There are some interesting questions about what
her nationality is that actually, if you read

174
00:15:43,980 --> 00:15:47,610
different Wikipedia articles about Marie
Curie, you get very different views as to

175
00:15:47,610 --> 00:15:52,890
what nation she was a citizen of, which is one
of those kinds of political things that becomes

176
00:15:52,890 --> 00:15:58,860
very difficult to put into nuance in
Wikidata. But the thing that's really important

177
00:15:58,860 --> 00:16:05,550
here is that Marie Curie was the first woman to win
the Nobel Prize; she was awarded the Nobel Prize in

178
00:16:05,550 --> 00:16:11,640
chemistry, and the Nobel Prize in Physics. And
actually, she was the only human to win two

179
00:16:11,640 --> 00:16:17,070
Nobel Prizes in different fields, which is not a
state, that is a really interesting thing to

180
00:16:17,070 --> 00:16:21,450
say. Generally, we are going to want to say
something like that, in any article lead about

181
00:16:21,450 --> 00:16:27,030
Marie Curie. And it's also not a thing that the
structure of Wikidata lets you say inside

182
00:16:27,030 --> 00:16:32,730
Wikidata. But with Wikifunctions, you could say
something like this. And spoiler, this text is

183
00:16:32,730 --> 00:16:40,260
actually generated text from a set of both what
we call constructors, and then renderers. The

184
00:16:40,260 --> 00:16:46,080
constructors construct each individual fragment
and sentence. And then the renderer turns that

185
00:16:46,080 --> 00:16:52,860
from data into English, and in this case, into
Hebrew, and if you say, actually, this is wrong,

186
00:16:52,890 --> 00:16:58,860
you can make a typo correction, you can change.
So you see Nobel Prize in Chemistry 1909, change

187
00:16:58,860 --> 00:17:03,870
it to 1911, and it just updates because, you
know, like a wikitext template, this is just a

188
00:17:03,870 --> 00:17:08,640
thing where you get data in and data out. So the
plus side of that is you can update it once; it

189
00:17:08,640 --> 00:17:12,480
updates in all the languages. The downside of
that is you update it once, it updates in all the

190
00:17:12,480 --> 00:17:16,350
languages. And so that has all of the moderation
concerns that you have around people changing

191
00:17:16,350 --> 00:17:20,430
templates, all the moderation concerns you have
about people changing inputs to things, or

192
00:17:20,430 --> 00:17:25,710
things on wiki data. And so this is not in any
way a solved problem. But it's a thing that we

193
00:17:25,710 --> 00:17:31,440
have as a dream. So, and particularly, the dream
here is not that some magic AI is going to

194
00:17:31,440 --> 00:17:36,510
come along and write content. I think (a) that's
not going to happen. AI is generally based on

195
00:17:36,510 --> 00:17:40,950
Wikipedia. And so it can't really very helpfully
write Wikipedia without going into a kind of

196
00:17:40,980 --> 00:17:47,640
spiral of decay, but also (b) that's not who we
are as Wikimedia; like, the humans in the loop are

197
00:17:47,640 --> 00:17:52,170
the magic pixie dust, right? It's not an
accident that we have humans write Wikipedia

198
00:17:52,170 --> 00:17:57,960
articles. That is actually why we're here to do
it. And so the idea here is that you have, you

199
00:17:57,960 --> 00:18:03,570
know, input through control from Wikimedia
communities; the statements of fact will

200
00:18:03,570 --> 00:18:09,540
live on Wikidata, controlled by that community;
the translation of statements into prose will

201
00:18:09,540 --> 00:18:16,110
live on Wikifunctions, and then the output will
live somewhere. We don't know where that is. But

202
00:18:16,110 --> 00:18:23,940
you see here, you can just change the text here
from English into, in this case, Croatian. And it

203
00:18:23,940 --> 00:18:32,850
magically comes out with prose in, in Croatian,
not English. So cool. This is a dream, this is a

204
00:18:32,850 --> 00:18:37,890
vision. There's a whole bunch of content written
by humans, that's really great. What what what

205
00:18:37,890 --> 00:18:43,560
are we going to do with this vision? Well, (a),
apparently I need to build it at some point. But

206
00:18:43,560 --> 00:18:48,240
also (b): even once you've got a technology
as an available thing, what do we as the

207
00:18:48,240 --> 00:18:54,450
Wikimedia community want to do with it, one of
the things that we think might be an early win,

208
00:18:54,780 --> 00:19:01,230
an early place to start with this work is
Wikidata item descriptions. So if you look here, a

209
00:19:01,230 --> 00:19:05,790
lot of these languages don't have anything
defined at all; the ones that are defined are all

210
00:19:05,790 --> 00:19:10,440
saying the very basic thing that the University
of Lagos is a university in Nigeria, except,

211
00:19:10,500 --> 00:19:15,900
interestingly, the ones in English and Hausa,
which both mention that it is a public university

212
00:19:16,050 --> 00:19:23,760
in Nigeria. But that's something that a
Wikifunctions bot, as you can think of it, could

213
00:19:23,790 --> 00:19:29,460
replace those statements, made
individually on a per-language basis, with a kind

214
00:19:29,460 --> 00:19:33,570
of pan-language translation, still controlled by
the Wikidata community. So

215
00:19:33,000 --> 00:19:38,370
if you change it to former university, because
it's now closed, they all get updated instantly,

216
00:19:38,520 --> 00:19:44,610
or the country of Nigeria changes its name
because it, you know, splits in half or doubles

217
00:19:44,610 --> 00:19:48,300
in size or whatever, and those will all get
updated without a human having to go through

218
00:19:48,300 --> 00:19:53,820
each of 200 languages and manually changing the
label. But that's, you know, still a bit scary.

219
00:19:53,850 --> 00:19:57,540
And that's just Wikidata item descriptions,
right. There are other areas that are slightly

220
00:19:57,540 --> 00:20:02,790
less scary... sorry, things slightly more scary. So
there are automatic lists. So this is a list of

221
00:20:02,790 --> 00:20:08,070
national capitals by population. This is an
article on English Wikipedia. These are statements

222
00:20:08,070 --> 00:20:12,300
of fact; these statements of fact do not change
very often. But when they change, we generally

223
00:20:12,300 --> 00:20:16,860
don't remember to change them in all the different
places, right? If I update the article on

224
00:20:16,860 --> 00:20:21,930
Palau, to say what the population, the
population of its capital is, and someone

225
00:20:21,930 --> 00:20:26,700
moved in, so it's not zero anymore. Do I also
remember to go to update the entry on this table

226
00:20:26,910 --> 00:20:30,810
that I probably don't even know exists? And,
you know, if I went into Special:WhatLinksHere,

227
00:20:30,810 --> 00:20:35,730
I'd eventually find it 500 pages down; no
one's ever going to do that. And so content gets

228
00:20:35,730 --> 00:20:40,620
out of date and out of sync with each other. It'd be
really nice to pull this all together. But

229
00:20:40,740 --> 00:20:45,630
that's also like real user facing content in a
way that Wikidata item descriptions are only

230
00:20:45,630 --> 00:20:50,130
halfway. So we're not going to, we're not going
to throw some switch, you know, mad scientist

231
00:20:50,130 --> 00:20:54,240
style and, tough, you're gonna have to deal with
it. This is very much a proposal idea that we're

232
00:20:54,240 --> 00:21:01,920
going to work through as a community. More
contentiously, even still: infoboxes.

233
00:21:01,920 --> 00:21:07,590
Infoboxes are a big area of concern. And by that I
don't just mean that different wikis have

234
00:21:07,590 --> 00:21:12,990
different contents in their different wikis'
infoboxes; you know, if you compare the article

235
00:21:13,020 --> 00:21:17,190
on Marie Curie in English, Polish, German, they
disagree about her nationality, they also

236
00:21:17,190 --> 00:21:22,290
disagree about her name. But more importantly,
there are things that appear in English that

237
00:21:22,290 --> 00:21:26,910
don't appear in German. For a long time, her
signature, I think, appeared in the infobox and

238
00:21:26,910 --> 00:21:33,540
finally got dropped a few years ago. Audrey Tang,
who's a brilliant person, and is Minister

239
00:21:33,540 --> 00:21:39,030
of Digital Affairs in Taiwan, or at least was until
the election yesterday. I don't know if that's

240
00:21:39,030 --> 00:21:44,880
still the case. Anyway, we have a whole
bunch of data about them, stored in Wikidata;

241
00:21:44,880 --> 00:21:50,430
all of it, theoretically, could be automatically
put into infoboxes in some form using a

242
00:21:50,430 --> 00:21:57,900
function. But English Wikipedia especially, more
than actually, I think any other Wikipedia has a

243
00:21:57,900 --> 00:22:02,730
really contentious view about infoboxes, where
it's not even agreed on the wiki of what they

244
00:22:02,730 --> 00:22:06,360
should look like. And so you have some
WikiProjects that decided they have a different view

245
00:22:06,360 --> 00:22:12,120
of infoboxes to other WikiProjects, and long
running very slow edit wars back and forth as to

246
00:22:12,120 --> 00:22:17,310
which template to use; you know, is
Napoleon a monarch? Or is he a military figure?

247
00:22:17,400 --> 00:22:21,840
Because they get different infoboxes. And that's
really important to some people. I think it's

248
00:22:21,840 --> 00:22:27,360
more important that our readers get the right
answers, but, you know, it's not

249
00:22:27,360 --> 00:22:32,340
for me, as an individual, let alone me from the
Foundation, to tell the community how to run

250
00:22:32,340 --> 00:22:39,840
itself, right. But I want to point out sometimes
when it goes in the wrong way. Anyway, that's

251
00:22:39,840 --> 00:22:44,490
the kind of overview of what the idea
for the Abstract Wikipedia vision is. There is...

252
00:22:45,180 --> 00:22:50,970
there is no plan to replace articles; that is,
like, pointless, and not a good value for

253
00:22:51,060 --> 00:22:55,830
anyone. But hopefully, it's giving you some
ideas about things that we could do with

254
00:22:55,830 --> 00:23:00,810
Abstract Wikipedia. Hopefully, you're even excited
for Wikifunctions to actually ship at some

255
00:23:00,810 --> 00:23:08,010
point. I am. And yeah, I've got, hopefully, 15
minutes now for questions, if anybody has any.

256
00:23:08,760 --> 00:23:17,610
When does it ship? Okay, we shipped
Wikifunctions itself in July last year, so that got

257
00:23:17,610 --> 00:23:23,520
it out the door very early. At the
time, only a very small number of blessed

258
00:23:23,520 --> 00:23:28,500
humans, you know, were given the ability to
actually write functions or even run functions.

259
00:23:29,910 --> 00:23:34,620
We have done a lot of work on the security side
and on the scaling side, so that Security are happy,

260
00:23:35,130 --> 00:23:42,480
happy enough with us at this point, that we now
allow any community member to create functions,

261
00:23:42,810 --> 00:23:47,940
but only approved community members, approved by the
community, can actually make the function live so

262
00:23:47,940 --> 00:23:52,590
that other people can run them. And then once a
function is approved as a live function, any

263
00:23:52,590 --> 00:23:58,440
human logged out or logged in, can run it by
putting inputs into it. So I could go right now

264
00:23:58,440 --> 00:24:03,900
to the article, you know, to the entry for the
NOT gate, and make true go to false and false go to

265
00:24:03,900 --> 00:24:06,960
true and I don't need to be logged in for that.
I certainly don't need any special rights.

266
00:24:08,430 --> 00:24:16,110
However, right? You know, a car is great inside
the garage, but you actually want to drive it on

267
00:24:16,110 --> 00:24:20,910
the road. The thing a lot of people want to do
is use Wikifunctions to power their apps

268
00:24:20,910 --> 00:24:26,040
and to power Wikipedia articles and to pull in
Wikidata content. Those are the big things we're

269
00:24:26,040 --> 00:24:32,550
focused on as a team. We've got a meeting
next week, we've got next week, to actually decide

270
00:24:32,580 --> 00:24:36,480
on the kind of priorities for the coming quarter
and, you know, share that with the community as

271
00:24:36,570 --> 00:24:43,620
and when. But assuming the team agrees with me, I
think we're probably going to have a big focus on

272
00:24:43,620 --> 00:24:50,010
types and on Wikidata usage, and a second focus
on using it on Wikipedia. But it's months, not

273
00:24:50,010 --> 00:24:58,050
weeks, unfortunately, at best. Um, that's a good
question. So the first thing I'd say is that

274
00:24:58,080 --> 00:25:02,310
ultimately Wikifunctions is community logic,
controlled by the Wikifunctions community,

275
00:25:02,610 --> 00:25:07,590
which you can be part of, if you go sign up,
right, you know, it's not controlled by me. And

276
00:25:07,590 --> 00:25:12,930
so what functions the community focuses on and
thinks are important is kind of in their

277
00:25:12,930 --> 00:25:20,280
control, not ours. We will come support them in
the directions they want to go. I think the

278
00:25:20,280 --> 00:25:27,420
simple natural language generation concepts are
actually already existing across the Wikimedia

279
00:25:27,480 --> 00:25:33,180
wikis in some cases, right? There's a module,
a Lua function, where you pass it a list of

280
00:25:33,180 --> 00:25:39,720
between one and n entries, and it will say "foo",
or "foo and bar", or "foo, bar, and", you

281
00:25:39,720 --> 00:25:45,240
know, "baz". And so that kind of
templating or language generation already

282
00:25:45,240 --> 00:25:49,620
exists. And kind of a lift and shift of that over
to Wikifunctions is not necessarily a thing

283
00:25:49,620 --> 00:26:03,840
that will take a very long time. The problem is
it's very easy to come up with very naive and

284
00:26:03,840 --> 00:26:11,070
simple heuristics for how to describe entities
like biographies or whatever: born then, died

285
00:26:11,070 --> 00:26:16,320
then. And they always run into the buffers of,
you know, Wikipedia is an amazing place. And it

286
00:26:16,320 --> 00:26:22,680
has articles about people, where we don't have
any of that data. Right? So there are articles

287
00:26:22,680 --> 00:26:26,760
or, I mean, you know, there's this wonderful
standard blog post that goes around the web, that's,

288
00:26:26,790 --> 00:26:32,730
you know, "50 lies programmers believe about
names", or dates, or whatever, right? So we have

289
00:26:32,730 --> 00:26:37,320
Wikipedia articles about people where we don't know
the date of birth, or the date of death. And we

290
00:26:37,320 --> 00:26:42,150
don't... and sometimes we just say, with this
wonderful term, floruit, as in when they were

291
00:26:42,150 --> 00:26:47,670
flourishing, so we have like, well, they
published a book in 522. So they must have been

292
00:26:47,670 --> 00:26:51,810
alive then. But we don't exactly know whether
they were alive 20 years before or after that.

293
00:26:51,810 --> 00:26:58,740
So that's tough. For some of them, we don't know
their names; we have kind of "the author of X,

294
00:26:59,070 --> 00:27:07,800
unknown", you know. And so often, there's a
risk with Wikipedia work that we can let perfect

295
00:27:07,800 --> 00:27:15,420
be the enemy of the good. But at the same time,
it's very easy to get stuck into only answering

296
00:27:15,420 --> 00:27:22,590
questions for English content, where, frankly,
there's quite a lot of that already. I'm very

297
00:27:22,590 --> 00:27:28,680
happy when English Wikipedia gets value out of the
things we build, but the early Wikipedia, the Dagbani

298
00:27:28,680 --> 00:27:34,410
Wikipedia, the Igbo Wikipedia, are all much,
much smaller, and the potential for

299
00:27:34,410 --> 00:27:39,570
helping readers is much higher there with less
work, you know, the same amount of community

300
00:27:39,570 --> 00:27:47,400
work could go further. So this is not answering
your question at all, is it? Like in practice,

301
00:27:47,520 --> 00:27:53,280
the vast majority of the current technically
minded community members in the Wikimedia movement

302
00:27:53,370 --> 00:27:58,860
currently speak English or German, natively, or
at least fluently, right. And that's not great,

303
00:27:58,920 --> 00:28:02,220
because there's a whole lot of communities out
there that don't have one of those people sitting around

304
00:28:02,220 --> 00:28:06,390
who can help out. But it does mean that in
practice, a large amount of the Wikifunctions

305
00:28:06,390 --> 00:28:10,620
content, you know, even if it's
written by someone who doesn't speak those

306
00:28:10,620 --> 00:28:18,000
languages very well, will very early on get
English and German versions. And so I'm

307
00:28:18,600 --> 00:28:24,360
pretty confident that very early on it will be
usable for English content. I just worry

308
00:28:24,390 --> 00:28:29,520
about using that as the benchmark to aim for,
because I think we may aim at the wrong things

309
00:28:29,520 --> 00:28:35,460
by building that. But I want to build it soon.
So people use it and tell us what actually works

310
00:28:35,460 --> 00:28:40,500
and doesn't work. And then we can course correct
then rather than endlessly building kind of

311
00:28:40,500 --> 00:28:44,640
castles in the sky of my mind about what does
and doesn't work without validating it in the

312
00:28:44,640 --> 00:28:50,010
real world. I like to say that, you know, until
the user is using it, all software is vaporware.

313
00:28:50,430 --> 00:29:00,120
And this is more vaporware than most. Yeah,
yeah. So this is kind of one of the things I

314
00:29:00,150 --> 00:29:08,670
put my hand on the scale for. So early on, yeah,
early on in the vision of Abstract Wikipedia, we

315
00:29:08,670 --> 00:29:14,490
said it was multilingual. And we said both
multilingual in terms of all natural languages,

316
00:29:14,520 --> 00:29:18,540
but also programming languages. And so the
question was, we've currently got two

317
00:29:18,540 --> 00:29:24,270
programming languages. Why two? Why these two?
So the two we have are JavaScript and Python,

318
00:29:24,780 --> 00:29:36,090
specifically QuickJS, version ECMAScript 2020,
and RustPython, 3.11 alpha, off the top of my

319
00:29:36,090 --> 00:29:41,520
head. Why those two is because those are the two
that we could build into a WebAssembly compact

320
00:29:41,520 --> 00:29:47,340
profile, enough for us to ship into production.
Why two, is because when we originally planned

321
00:29:47,340 --> 00:29:51,810
it out, we were going to ship with one. And I said
if we shipped with one that means we shipped

322
00:29:51,810 --> 00:29:56,370
with none. Because, you know, we'll say oh,
we'll add a second one later and we'll have

323
00:29:56,370 --> 00:30:01,650
built everything around the idea of having only
one. It's very... Even when you promise yourself,

324
00:30:01,650 --> 00:30:04,680
you're going to add support for two, if you
don't actually ship with two, you don't ship

325
00:30:04,680 --> 00:30:09,900
with the ability. So that's why we shipped with
two. Those two in particular are just because

326
00:30:09,900 --> 00:30:13,950
those are the two really easy programming
languages that community members, you know,

327
00:30:13,980 --> 00:30:19,890
general people on the web, understand and use,
and that we can wrap together. We're

328
00:30:19,890 --> 00:30:23,610
definitely not planning to restrict it to just
those two. So one of the things we want to

329
00:30:23,640 --> 00:30:29,010
support is different versions of the same
programming language. So Python 3.11 versus 3.10

330
00:30:29,010 --> 00:30:33,930
versus 3.12. And so as new versions of Python
come out, new runners will be available,

331
00:30:34,530 --> 00:30:38,610
community members will get a notification
saying, hey, this function you're watching

332
00:30:38,640 --> 00:30:42,420
doesn't actually run on the new version of
Python; do you want to update it so it does, or

333
00:30:42,420 --> 00:30:47,760
add some more tests or whatever. We also want to
support other programming languages. The very

334
00:30:47,760 --> 00:30:54,660
easy, obvious answer to the next programming language
for us to add is Lua. The problem with that

335
00:30:55,080 --> 00:30:59,880
is that lots of people at Wikipedia already use
Lua, but they use MediaWiki Lua, which is kind of

336
00:30:59,880 --> 00:31:05,850
a fork, and a dated fork, of Lua. And I
think we would need to do a fair bit of work to

337
00:31:05,850 --> 00:31:11,370
explain to people what is and isn't in scope for
those things, because for example, there's lots

338
00:31:11,370 --> 00:31:16,140
of functions that are disabled in MediaWiki
Lua because they have security issues. Whereas

339
00:31:16,140 --> 00:31:22,530
we have a security box that, touch wood, does
mean that we wouldn't need that. But in return

340
00:31:23,010 --> 00:31:29,220
MediaWiki Lua injects a whole bunch of useful
functions and framing devices that we wouldn't

341
00:31:29,250 --> 00:31:32,940
make available. And so people would be like, well,
hang on, why does it work here but not here? So

342
00:31:32,970 --> 00:31:38,460
it'd be a complicated migration. Other people
have asked if we can add things like Rust or Go;

343
00:31:38,790 --> 00:31:47,790
those are exciting languages. Probably pretty
easy from a technology point of view

344
00:31:47,790 --> 00:31:53,880
for us to support but possibly pretty hard from
a kind of community understanding, especially if

345
00:31:53,940 --> 00:31:59,070
someone wanted C++, which means we have to then
compile into objects, those objects then have to live

346
00:31:59,070 --> 00:32:03,180
somewhere, we need to cache them, and expunge
them in a different way. And then the community is

347
00:32:03,180 --> 00:32:07,320
like, "Well, hang on, which version? I looked at
revision X, but you're running revision Y."

348
00:32:07,350 --> 00:32:11,640
That kind of stuff is even more confusing when
it's not literally take this text and run it in

349
00:32:11,640 --> 00:32:21,570
the interpreter. We will not support every
language forever, you know, languages have life

350
00:32:21,630 --> 00:32:27,810
to them nowadays, a lifecycle of languages coming
in, languages going out. We just... I, as a

351
00:32:27,810 --> 00:32:35,040
different-hatted me, just disabled Node 16 running in the
CI system for Wikimedia. Things like that are

352
00:32:35,040 --> 00:32:39,930
going to have to happen for Wikifunctions to
keep running. And so that means it's not good

353
00:32:39,930 --> 00:32:45,150
enough to do a fire-and-forget; people have to
keep their functions, the implementations,

354
00:32:45,150 --> 00:32:50,310
updated for new versions of languages, new
community requests and requirements. And so

355
00:32:50,790 --> 00:32:55,650
building not just a "someone got a
notification in their watchlist", but actually

356
00:32:55,650 --> 00:33:00,960
an active curation dashboard for these are the
functions lots of people are using, these are

357
00:33:00,960 --> 00:33:04,800
things that are really slow. And if someone
could make a faster version, that'd be great. Or

358
00:33:04,950 --> 00:33:08,280
these are the versions where it's in Python
3.7, that's end of life, so please

359
00:33:08,280 --> 00:33:13,230
migrate it to a newer version, that'd be good.
So that's kind of a bunch of things in my

360
00:33:13,230 --> 00:33:18,810
thoughts around language support, but we're not
fixed on a particular language. We already do a

361
00:33:18,810 --> 00:33:24,390
really poor job as a movement of explaining the
provenance of content to users, right? You're

362
00:33:24,390 --> 00:33:29,400
looking at an article. And the text is from the
article except the bits of text that are from a

363
00:33:29,400 --> 00:33:32,910
template, except the bits of text that are from
Commons, except the bits of text that are from

364
00:33:32,910 --> 00:33:36,360
Wikidata, except the bits of text that are actually
interpolated by a Lua script between three of

365
00:33:36,360 --> 00:33:41,370
these different sources. And we kind of just
like, oh, yeah, it's Wikipedia, go with it.

366
00:33:41,910 --> 00:33:47,550
Adding Wikifunctions adds a fourth layer to
that. And I agree: if there isn't already the

367
00:33:47,550 --> 00:33:51,720
straw that breaks the camel's back, this is
definitely risking that. And I think, I mean,

368
00:33:51,720 --> 00:33:55,950
not just for editors, but for readers, like we
probably need to do a good job of explaining:

369
00:33:56,520 --> 00:34:00,750
Why should you trust this? Why should you
believe this? How do you know where this comes

370
00:34:00,750 --> 00:34:06,000
from? And I think, and this is a bit larger
scale than just Wikifunctions. But I think

371
00:34:06,330 --> 00:34:11,160
we're probably going to want to do something
like badging source content from different

372
00:34:11,160 --> 00:34:16,320
sources with little icons or little "last updated"
flags or whatever, things like that. So people

373
00:34:16,320 --> 00:34:19,590
can understand, at least to some extent, as
they're reading and looking at it, where it

374
00:34:19,590 --> 00:34:25,260
comes from. On the editing side, even more so,
we're going to have to do a really good job of

375
00:34:25,320 --> 00:34:30,810
onboarding people and holding them through: what
does it mean when it says their age is 39? They

376
00:34:30,810 --> 00:34:35,850
had a birthday this morning. It should
say 40. How do I fix that? And, like,

377
00:34:36,060 --> 00:34:39,990
what ultimately you need to find out is, oh, it
was cached and it will be fixed in an hour. Right?

378
00:34:40,020 --> 00:34:45,450
But that's, that's like arcane, Wikimedia stack
knowledge that humans shouldn't ever have to

379
00:34:45,450 --> 00:34:49,440
learn, I or deletion have to learn, but
certainly editors should not have to learn that.

380
00:34:49,860 --> 00:35:00,120
And so if we're doing things like that, then we
need to I don't think we probably want to turn

381
00:35:00,240 --> 00:35:04,380
read mode into an edit mode where there's a
little edit box, and if you click into it

382
00:35:04,380 --> 00:35:08,160
magically, you're not editing Wikipedia, you're
editing Wikidata or editing Wikifunctions or

383
00:35:08,160 --> 00:35:14,790
whatever. But we do need some way of learning
that there is a kind of through line between how

384
00:35:14,790 --> 00:35:19,410
the content gets to your page. And maybe like
demonstrating the little pipelines, or something

385
00:35:19,410 --> 00:35:24,630
like automatically diagramming the content
flows that end up with content being shown to

386
00:35:24,630 --> 00:35:31,950
you might be something to explore. But I am now
freelancing way outside my team's work. And the

387
00:35:31,950 --> 00:35:37,020
Web team may have views as to me saying this
publicly. So hey, did that answer that question?

388
00:35:37,350 --> 00:35:42,780
Cool. All right. I think that's us at time. So
thank you all very much. And hopefully this was

389
00:35:42,810 --> 00:35:46,530
interesting or at least not too shocking or
worrying. Thank you.