English subtitles for clip: File:08 Abstract Wikipedia.webm
1 00:00:00,000 --> 00:00:07,710 I'm running late. I am not here. So I'm here to talk about Abstract 2 00:00:07,710 --> 00:00:13,080 Wikipedia: language versus knowledge. So I've spoiled my talk a little bit by presenting 3 00:00:13,080 --> 00:00:17,640 earlier a little bit on the plan. Sorry about that. Hope it's not too boring for you. So I'm James 4 00:00:17,640 --> 00:00:22,650 Forrester. I'm the tech lead for the Abstract Wikipedia team, working for the Wikimedia 5 00:00:22,650 --> 00:00:29,670 Foundation, and thrilled to be presenting here today. So I wanted to give a little bit of 6 00:00:29,670 --> 00:00:35,310 context around what the motivations are for Wikifunctions and the Abstract Wikipedia 7 00:00:35,310 --> 00:00:42,750 dream. Vision, I should say; it sounds much better than a dream, right? So, vision. And so, 8 00:00:42,750 --> 00:00:48,030 just a quick canter through, you know, what's leading us to our project and where we 9 00:00:48,030 --> 00:00:53,820 are with it. So here's the agenda. There's a bunch of reasons for why we're building 10 00:00:53,820 --> 00:00:58,290 Wikifunctions, which hopefully will resonate with a lot of people who are current Wikipedians. And 11 00:00:58,290 --> 00:01:03,960 there's a lot of vision for Abstract Wikipedia, which hopefully will inspire and not sound 12 00:01:03,960 --> 00:01:11,730 boring. So: wiki work, unshared. There are a lot of Wikipedias. I mean, depending on what you 13 00:01:11,730 --> 00:01:15,360 count as an active Wikipedia, there's a different number. But there are over 300 14 00:01:15,390 --> 00:01:21,480 Wikipedia domains in different languages that are live right now. That is a different concept 15 00:01:21,480 --> 00:01:26,250 of knowledge each time, but it's the same concept of the work, right?
There's the creating and 16 00:01:26,250 --> 00:01:29,490 updating articles, there's the patrolling, all the different things that are very familiar to 17 00:01:29,490 --> 00:01:36,120 people who edit Wikipedia, curate Wikipedia, make Wikimedia amazingly successful. And these 18 00:01:36,120 --> 00:01:39,780 are the same pieces of work whether you speak English or Urdu, whether you speak, you know, 19 00:01:39,810 --> 00:01:46,290 French or Swahili, but they don't have the same size, right? English Wikipedia has 125,000 monthly 20 00:01:46,290 --> 00:01:52,350 contributors, roughly, on average; 19,000 pages in the Lua namespace, the Module 21 00:01:52,350 --> 00:01:57,060 namespace; one and a half billion speakers, about 900 million of which turn up each month. 22 00:01:57,810 --> 00:02:05,520 The Hausa Wikipedia has 60 monthly contributors; 500 pages in the Lua namespace, which is already 23 00:02:05,520 --> 00:02:11,820 quite big; 70 million speakers in the world, but only 200,000 of them show up each month. And the 24 00:02:11,820 --> 00:02:16,830 nature of people showing up may be different for different people; you know, ten hours 25 00:02:16,830 --> 00:02:21,690 continually reading on your desktop computer, reading 10,000 pages, versus a snippet you read, 26 00:02:21,960 --> 00:02:27,690 you know, on Google. We count both of those, basically, as one person, but the nature of that consumption is 27 00:02:27,690 --> 00:02:35,460 different. So here is a screenshot from the Module, Lua, namespace of the Malay Wikipedia. 28 00:02:37,590 --> 00:02:42,120 So one of the things you might notice here: Malay is the principal language spoken in 29 00:02:42,120 --> 00:02:47,550 Malaysia, and it is written in the Latin script. It is not, however, based on English, but all these 30 00:02:47,580 --> 00:02:53,880 articles, or rather these modules, are named in English.
And that's because they aren't for the Malay 31 00:02:53,880 --> 00:02:58,650 community; they were written for the English Wikipedia, and then mostly copied across. So 32 00:02:58,650 --> 00:03:06,810 here is the one for age, right? As you can see, it's written in English for English speakers, for 33 00:03:06,810 --> 00:03:11,070 people who speak English at least pretty fluently, if not completely fluently. 34 00:03:11,070 --> 00:03:17,370 And in fact, if you look at the history of this article, it was imported in one great big 35 00:03:17,370 --> 00:03:22,710 bulk in 2017, and essentially hasn't been touched since, except one time where someone very 36 00:03:22,710 --> 00:03:27,930 helpfully actually translated a bunch of the labels that it would output, so that on Malay 37 00:03:28,140 --> 00:03:34,650 Wikipedia it looks useful, right? And that's a great thing to do. But here's English Wikipedia's 38 00:03:35,430 --> 00:03:41,490 version of this, the original, if that makes sense, once they start forking off. It has 39 00:03:41,490 --> 00:03:45,390 TemplateData, you know, stuff that helps you when you're using VisualEditor actually use 40 00:03:45,390 --> 00:03:53,100 this script; not present on Malay at all. It has a revision history with dozens and dozens and 41 00:03:53,100 --> 00:03:57,690 dozens of improvements, maybe good improvements, maybe bad; the equivalents do not exist on the Malay 42 00:03:57,690 --> 00:04:06,930 Wikipedia. And this is replicated on 86 communities. Well, 85, because English is one of 43 00:04:06,930 --> 00:04:14,100 them, right, across different Wikipedias. This is just a very simple script that takes a date, and 44 00:04:14,130 --> 00:04:18,090 the current date or a different date, and calculates how old someone was at that point. 45 00:04:18,390 --> 00:04:23,400 Right. So it's a really useful thing for writing prose; it's a really boring thing, taking two 46 00:04:23,400 --> 00:04:27,720 numbers apart from each other.
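[Editorial illustration, not part of the spoken talk.] The age calculation described above (a date, plus the current date or some other date, giving an age in whole years) could be sketched roughly as follows. This is a minimal sketch in Python; the name `age_in_years` is hypothetical, and the code is not taken from Module:Age or from any Wikifunctions implementation.

```python
from datetime import date
from typing import Optional


def age_in_years(born: date, on: Optional[date] = None) -> int:
    """Whole years elapsed between `born` and `on` (defaults to today)."""
    on = on or date.today()
    if on < born:
        raise ValueError("reference date is before the birth date")
    # One year fewer if the birthday has not yet come round in the year of `on`.
    had_birthday = (on.month, on.day) >= (born.month, born.day)
    return on.year - born.year - (0 if had_birthday else 1)


if __name__ == "__main__":
    print(age_in_years(date(2000, 5, 23), date(2024, 5, 22)))
    print(age_in_years(date(2000, 5, 23), date(2024, 5, 23)))
```

The point the talk makes is that nothing in this logic depends on the language of the wiki; only the labels around the resulting number would need translating.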
And you know, it doesn't differ necessarily by different 47 00:04:27,720 --> 00:04:33,300 contexts. It's a thing we could share. But we currently don't have that technology. And in 48 00:04:33,300 --> 00:04:38,850 fact, you can look at a lot of the versions of the age module across the different Wikipedias, 49 00:04:39,000 --> 00:04:43,530 and the Malay example we picked because it's kind of an extreme version of this, but lots of 50 00:04:43,530 --> 00:04:47,400 Wikipedias have this problem, including the English Wikipedia, where people find something 51 00:04:47,400 --> 00:04:52,410 that works for them on German or Arabic, they copy it over to English, localize it a bit, but 52 00:04:52,410 --> 00:04:56,340 the English Wikipedia also, to a much smaller extent, suffers from the problem of having 53 00:04:56,340 --> 00:05:02,280 modules that are kind of disconnected from the original ethos. So this is why we built 54 00:05:02,280 --> 00:05:06,750 Wikifunctions. Fundamentally, the dream for Wikifunctions is that you're gonna have a site where 55 00:05:06,750 --> 00:05:11,820 you can go, which has one version of this Lua script, this module, this function, that works 56 00:05:11,820 --> 00:05:17,310 for everyone, depending on whether you're Malay Wikipedia, or English Wikipedia, or German, or 57 00:05:17,310 --> 00:05:23,730 any of the others. This is still a dream, not a reality; it does not work. But it's here, right? So 58 00:05:23,730 --> 00:05:29,400 here is an age function. Now, forgive the screenshots, they look a bit tacky; we have 59 00:05:29,400 --> 00:05:37,740 actually changed the design as of, like, two weeks ago. So these things are still landing. 60 00:05:37,920 --> 00:05:41,310 There are other things here that are really interesting, though. You'll notice under type, 61 00:05:41,310 --> 00:05:48,480 it says 'Gregorian proleptic calendar date', and that is perfectly correct and perfectly unusable 62 00:05:48,480 --> 00:05:52,230 for almost every human on Earth.
But, like, that's a little thing where we, as a community, 63 00:05:52,230 --> 00:05:56,190 can agree: actually, why don't we just say 'date', you know, and then the other ones can say that 64 00:05:56,190 --> 00:06:00,540 we have ones instead. But you can see here, you write the value 23, in the month of May; that's a 65 00:06:00,540 --> 00:06:06,900 drop-down, that's not writing the word anyway. And the positive integer, you know, 45, and then, 66 00:06:06,900 --> 00:06:11,370 yeah, a drop-down for Common Era versus Before the Common Era. And it 67 00:06:11,370 --> 00:06:17,640 would output a bit of text that says, you know, how old someone was today or as of that date. So 68 00:06:17,640 --> 00:06:22,800 Wikifunctions is here. You know, that function exists, but it's not here-here, right? So 69 00:06:22,800 --> 00:06:26,730 here's the front page of Wikifunctions, and I can go to it now; you can log in and 70 00:06:26,730 --> 00:06:31,560 create a function. And here's an example of a real function that exists, a NOT function. 71 00:06:31,560 --> 00:06:36,930 This was one of the simplest functions someone built: you put in true, and you get back false; 72 00:06:37,020 --> 00:06:42,180 you put in false, you get back true. World-changing stuff. But the important thing here is 73 00:06:42,180 --> 00:06:46,620 that there's one version of that function that can be used as the basis for every bit of 74 00:06:46,650 --> 00:06:51,930 Boolean logic you're using in the entire system. And this can be translated. So if I go back, 75 00:06:52,170 --> 00:06:57,570 you'll see here it says 'takes a truth value and returns its opposite'. That's the description in 76 00:06:57,600 --> 00:07:03,690 English. But if I click edit here, which doesn't appear because this is a slide deck, you'll be 77 00:07:03,690 --> 00:07:10,320 able to set it in a dozen different languages, or actually 650, something like that, roughly, 78 00:07:10,320 --> 00:07:15,810 languages.
The concept of what a language is starts to get very difficult, and I will not pretend to 79 00:07:15,810 --> 00:07:21,330 answer that here. This is controlled by the community. So you'll see here, this is not 'I 80 00:07:21,330 --> 00:07:25,560 have written the correct version of the NOT function, you will use it'; this is written on the 81 00:07:25,560 --> 00:07:28,950 wiki, controlled by the wiki community on Wikifunctions. It has a bunch of 82 00:07:28,950 --> 00:07:33,510 implementations, a bunch of tests. And so here's an implementation in JavaScript that someone's 83 00:07:33,510 --> 00:07:39,420 written. And here's a different implementation in Python. And so the important thing here is 84 00:07:39,420 --> 00:07:43,560 that both of them are equally valid; we're not making decisions for the community about 85 00:07:43,830 --> 00:07:47,820 what language they should write in. And we have to make some decisions about what we're going to 86 00:07:47,820 --> 00:07:52,680 support, but we want to be open rather than make decisions for people. You can also write it, 87 00:07:52,680 --> 00:07:57,330 and this is the thing that we think is really important, in a composition. And so this is not a 88 00:07:57,330 --> 00:08:01,080 programming language. I mean, it is, but it's not somebody else's programming language; this is the 89 00:08:01,080 --> 00:08:06,630 Wikifunctions kind of way. And we hope at some point this will be something that you can 90 00:08:06,630 --> 00:08:12,390 actually represent visually. So you can already read it in prose. If I was logged in, 91 00:08:12,420 --> 00:08:17,340 however, in Polish, this prose would be in Polish, right? It's not the most 92 00:08:17,340 --> 00:08:22,170 beautiful prose; you know, we're still quite techy, and we want to be more welcoming. But 93 00:08:22,200 --> 00:08:26,790 this is in Polish, this is not in English.
In fact, pretty much no words appear here that are 94 00:08:26,790 --> 00:08:31,590 in English at all. If you're logged in, or you just went to the Polish entry through Google, 95 00:08:31,800 --> 00:08:36,210 you would end up thinking this website was a Polish website, in Polish, for Polish people. And 96 00:08:36,210 --> 00:08:42,930 the same, theoretically, is true for all 800 of our languages. So what's the status right now? So 97 00:08:42,930 --> 00:08:46,320 you can create functions, you can create implementations, create test cases. You can do it 98 00:08:46,320 --> 00:08:50,670 with Booleans and strings, and you can do it with lists of those. More is coming soon. But we do not 99 00:08:50,670 --> 00:08:57,120 support things like dates or numbers yet; that's coming. What's coming later: calling functions 100 00:08:57,150 --> 00:09:01,440 embedded within a Wikipedia article. So you can't actually use this as a replacement for the 101 00:09:01,470 --> 00:09:07,530 age module on English Wikipedia or Malay Wikipedia or any of the others; it's coming. You can't 102 00:09:07,530 --> 00:09:11,250 call those functions either from tools or gadgets or third-party apps. That's a really big 103 00:09:11,250 --> 00:09:14,370 thing, because it's not just powering Wikipedia, it's powering all the other tools across 104 00:09:14,370 --> 00:09:20,670 the Wikimedia universe and third parties. And you can't yet use Wikidata items or lexemes. But 105 00:09:20,670 --> 00:09:25,980 that's a big thing we definitely want to work on very soon. And so this comes to language versus 106 00:09:25,980 --> 00:09:33,720 knowledge. So if you look at Wikidata, it has a lot of structured data statements about the 107 00:09:33,720 --> 00:09:39,540 world. It says Jupiter is the fifth planet, it is the largest planet, it is a gas giant. It doesn't 108 00:09:39,540 --> 00:09:41,640 necessarily let you say 109 00:09:43,350 --> 00:09:49,020 things about statements.
Certain forms of statements are not things you can make about 110 00:09:49,980 --> 00:09:55,590 things in Wikidata; it's not there yet. And yet you can, with a function, say different kinds of 111 00:09:55,590 --> 00:10:00,450 things. So here, this is an example function called 'reverse string', which takes a string and 112 00:10:00,450 --> 00:10:07,290 reverses it. And this has an input of 'diputs'. And if you press Run function, you will get the 113 00:10:07,290 --> 00:10:15,000 output 'stupid'. This is a toy example from my lead's daughter, who finds it hilarious, but she 114 00:10:15,000 --> 00:10:22,500 is seven. There are other examples, which are still pretty toy examples. So this is an 'encode 115 00:10:22,500 --> 00:10:28,140 NATO phonetic alphabet', right? So you type in the word 'wiki', and it says 'Whiskey India Kilo 116 00:10:28,140 --> 00:10:32,790 India'. Obviously, we all know that, right? This isn't necessarily a thing that everyone's going 117 00:10:32,790 --> 00:10:36,990 to use all the time. But it is particularly relevant if you actually have an article about 118 00:10:36,990 --> 00:10:42,540 the NATO phonetic alphabet. And we might be able to provide readers with the ability to, like, run 119 00:10:42,540 --> 00:10:47,070 an actual example in the article as they're reading about it. So rather than try and get 120 00:10:47,070 --> 00:10:51,360 their head around what it means, they can explore what it means. This is still knowledge. But it's 121 00:10:51,360 --> 00:10:55,500 not really the kind of written form of articles that we've been generally providing on Wikipedia. 122 00:10:57,450 --> 00:11:02,880 There's possibly the way to make content after this. So these are some examples that we 123 00:11:02,880 --> 00:11:08,700 actually created as a team as a way of exploring what it would be to do natural language 124 00:11:08,700 --> 00:11:12,960 generation.
So these are not templates in the wikitext sense; these are 125 00:11:13,200 --> 00:11:17,580 templates in the AI sense of the term. You have some three inputs, and you press 126 00:11:17,580 --> 00:11:22,290 the button, and it outputs a statement. So: Berlin, city, Germany, and it says 'Berlin is the 127 00:11:22,290 --> 00:11:27,060 largest city in Germany'. If I change that word from Berlin to Stuttgart, it will say 'Stuttgart 128 00:11:27,090 --> 00:11:31,980 is the largest city in Germany', which might not be true. You can, in fact, say, you know, 'Berlin 129 00:11:31,980 --> 00:11:36,900 is the largest city on the Moon', so probably not true. And then there's a version of this in 130 00:11:36,900 --> 00:11:40,200 German; we actually wrote versions of this in a bunch of different languages to kind of explore 131 00:11:40,200 --> 00:11:46,350 what it is that could make this useful. This, in and of itself, isn't necessarily incredibly 132 00:11:46,350 --> 00:11:53,310 powerful. But an early usage of Wikifunctions that's already there right now is this function 133 00:11:53,310 --> 00:12:01,020 here. So Wikidata has this set of content called lexemes, which are words, you know, 134 00:12:01,140 --> 00:12:06,780 entries about words in all the different languages of the world. When you say a verb in 135 00:12:06,780 --> 00:12:12,930 English, generally it doesn't change very much in different forms, if you think: you stand, 136 00:12:13,050 --> 00:12:17,610 and, you know, you stood, and you're standing, but, you know, there's only three or four 137 00:12:17,610 --> 00:12:23,310 variations. But in other languages, there can be dozens or even hundreds of stem endings of those 138 00:12:23,310 --> 00:12:28,530 words. And those all need to be in Wikidata so that it can automatically look up, oh, this word 139 00:12:28,530 --> 00:12:34,560 is probably this word in context.
So this function here is a relatively simple one that 140 00:12:34,560 --> 00:12:38,670 gets you the nominative plural of a Finnish noun. So you type in an input, and it assumes 141 00:12:38,670 --> 00:12:44,730 you are correct that it is a Finnish noun, and it gets you the results. So you can say 'vanhempi' 142 00:12:44,730 --> 00:12:51,150 goes to 'vanhemmat'. And here's what the Wikidata lexeme entry for that would look 143 00:12:51,150 --> 00:12:55,320 like. So you see, grammatical features: they don't have just singular versus plural, they have 144 00:12:55,320 --> 00:13:03,000 different forms. And then there's also this data on Wiktionary, which is not connected to 145 00:13:03,000 --> 00:13:06,750 Wikidata to pull this data; it's all stored separately on Wiktionary, and in fact, each 146 00:13:06,750 --> 00:13:11,280 Wiktionary. So this is a screenshot from the English Wiktionary, but the French Wiktionary, 147 00:13:11,370 --> 00:13:15,210 and the Swedish Wiktionary, and the Finnish Wiktionary each have to have their own version of this 148 00:13:15,210 --> 00:13:21,510 content. But here is a page which actually lets you bulk-create all of these different stem 149 00:13:21,510 --> 00:13:28,620 endings onto Wikidata in one go, and lets you get it powered by Wikifunctions. So this, 150 00:13:29,010 --> 00:13:34,890 slightly inappropriately, well, not inappropriately, ahead of our planning, looks into the internal 151 00:13:34,890 --> 00:13:39,480 API for Wikifunctions and calls the Wikifunctions function that I just showed, and 152 00:13:39,480 --> 00:13:44,430 outputs a whole bunch of outputs. So I remember, as a young kid, trying to learn Latin, 153 00:13:44,520 --> 00:13:49,560 and being disheartened to realize that for each irregular verb, I had to memorize 147 stem 154 00:13:49,560 --> 00:13:55,860 endings, which made me sad.
And now a function could remember all of these forms, and I 155 00:13:56,250 --> 00:14:01,350 don't speak very much Latin anymore, it turns out. So that's kind of an option of where we can 156 00:14:01,350 --> 00:14:07,680 go with Wikifunctions itself. But then the bigger picture is the Abstract Wikipedia 157 00:14:07,680 --> 00:14:13,380 vision. And so this ties back in to the kind of sample 'Berlin is the largest city in Germany' 158 00:14:13,560 --> 00:14:18,990 idea. So you can take some kind of natural language function, a really basic thing. So here, 159 00:14:18,990 --> 00:14:28,140 this takes in a word in English and outputs the plural of it, based on some smart paradigms 160 00:14:28,140 --> 00:14:33,900 about what words turn into what forms in English. So 'baby' to 'babies', for instance. And 161 00:14:33,930 --> 00:14:39,330 you can see, down in the left here, plural of 'chair' is 'chairs', plural of 'fairy' is 'fairies', 162 00:14:39,390 --> 00:14:45,300 you know, so it's not just slapping an S on the end, but it's not very clever. Long 163 00:14:45,300 --> 00:14:49,620 term, we'd want to pull this data from Wikidata rather than, probably, having functions for it. 164 00:14:50,520 --> 00:14:55,560 But you can also build that together. So here, this takes a person name and a place name, and 165 00:14:55,560 --> 00:15:00,270 says 'Albert Einstein was born in Ulm', or 'James Forrester was born in London', or all those 166 00:15:00,270 --> 00:15:06,360 different things that you could say. And you can see how little fragments of content, statements 167 00:15:06,360 --> 00:15:10,350 like this, could be built together in a way that can actually make a really quite 168 00:15:10,380 --> 00:15:18,480 significant statement.
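[Editorial illustration, not part of the spoken talk.] The two small natural-language functions described above, a rule-based English plural and a fixed 'X was born in Y' sentence pattern, might look something like the following. This is a hedged sketch in Python under stated assumptions: the names `english_plural` and `born_in` are hypothetical, the rules cover only a few regular patterns, and none of this is the code that actually runs on Wikifunctions.

```python
def english_plural(noun: str) -> str:
    """Naive rule-based plural for regular English nouns (illustrative only)."""
    vowels = "aeiou"
    # Consonant + y: drop the y and add "ies" (baby -> babies, fairy -> fairies).
    if noun.endswith("y") and len(noun) > 1 and noun[-2] not in vowels:
        return noun[:-1] + "ies"
    # Sibilant endings take "es" (box -> boxes, church -> churches).
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"
    # Everything else just takes "s" (chair -> chairs).
    return noun + "s"


def born_in(person: str, place: str) -> str:
    """Fill a fixed English sentence pattern with a person name and a place name."""
    return f"{person} was born in {place}."


if __name__ == "__main__":
    for word in ("baby", "chair", "fairy"):
        print(word, "->", english_plural(word))
    print(born_in("James Forrester", "London"))
```

As the talk notes, irregular forms are better looked up from lexeme data than derived by rules like these; the sketch only shows how small fragments compose.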
So this is a piece of text that we cobbled together as a team; the 169 00:15:18,480 --> 00:15:23,910 natural language generation kind of volunteers working with the team built it, which is 170 00:15:23,940 --> 00:15:29,880 essentially the lead of the Marie Curie article. I mean, it's amalgamated from a bunch of different 171 00:15:29,880 --> 00:15:34,650 wikis, but basically it's what you would write as the article lead. So Marie Curie has a whole 172 00:15:34,650 --> 00:15:38,970 bunch of statements you can pull out of Wikidata right now, like when she was born. 173 00:15:39,870 --> 00:15:43,980 There are some interesting questions about what her nationality is; actually, if you read 174 00:15:43,980 --> 00:15:47,610 different Wikipedia articles about Marie Curie, you'll find very different views as to 175 00:15:47,610 --> 00:15:52,890 what nation she was a citizen of, which is one of those kinds of political things that becomes 176 00:15:52,890 --> 00:15:58,860 very difficult to put with nuance into Wikidata. But the thing that's really important 177 00:15:58,860 --> 00:16:05,550 here is that Marie Curie was the first woman to win the Nobel Prize; she was awarded the Nobel Prize in 178 00:16:05,550 --> 00:16:11,640 Chemistry and the Nobel Prize in Physics. And actually, she was the only human to win two 179 00:16:11,640 --> 00:16:17,070 Nobel Prizes in different fields, which is a really interesting thing to 180 00:16:17,070 --> 00:16:21,450 say. Generally, we are going to want to say something like that in any article lead about 181 00:16:21,450 --> 00:16:27,030 Marie Curie. And it's also not a thing that the structure of Wikidata lets you say inside 182 00:16:27,030 --> 00:16:32,730 Wikidata. But with Wikifunctions, you could say something like this.
And spoiler: this text is 183 00:16:32,730 --> 00:16:40,260 actually generated text, from a set of both what we call constructors and then renderers. The 184 00:16:40,260 --> 00:16:46,080 constructors construct each individual fragment and sentence, and then the renderer turns that 185 00:16:46,080 --> 00:16:52,860 from data into English, and in this case, into Hebrew. And if you say, actually, this is wrong, 186 00:16:52,890 --> 00:16:58,860 you can make a typo correction, you can change it. So you see 'Nobel Prize in Chemistry 1909'; change 187 00:16:58,860 --> 00:17:03,870 it to 1911, and it just updates, because, you know, like a wikitext template, this is just a 188 00:17:03,870 --> 00:17:08,640 thing where you get data in, data out. So the plus side of that is you can update it once and it 189 00:17:08,640 --> 00:17:12,480 updates in all the languages. The downside of that is you update it once and it updates in all the 190 00:17:12,480 --> 00:17:16,350 languages. And so that has all of the moderation concerns that you have around people changing 191 00:17:16,350 --> 00:17:20,430 templates, all the moderation concerns you have about people changing inputs to things, or 192 00:17:20,430 --> 00:17:25,710 things on Wikidata. And so this is not in any way a solved problem. But it's a thing that we 193 00:17:25,710 --> 00:17:31,440 have as a dream. And particularly, the dream here is not that some magic AI is going to 194 00:17:31,440 --> 00:17:36,510 come along and write content. I think (a) that's not going to happen: AI is generally based on 195 00:17:36,510 --> 00:17:40,950 Wikipedia, and so it can't really very helpfully write Wikipedia without going into a kind of 196 00:17:40,980 --> 00:17:47,640 spiral of decay; but also (b) that's not who we are as Wikimedia. Like, the humans in the loop are 197 00:17:47,640 --> 00:17:52,170 the magic pixie dust, right? It's not an accident that we have humans write Wikipedia 198 00:17:52,170 --> 00:17:57,960 articles.
That is actually why we're here to do it. And so the idea here is that you have, you 199 00:17:57,960 --> 00:18:03,570 know, input through control from Wikimedia communities: the statements of fact will 200 00:18:03,570 --> 00:18:09,540 live on Wikidata, controlled by that community; the translation of statements into prose will 201 00:18:09,540 --> 00:18:16,110 live on Wikifunctions; and then the output will live somewhere. We don't know where that is. But 202 00:18:16,110 --> 00:18:23,940 you see here, you can just change the text here from English into, in this case, Croatian. And it 203 00:18:23,940 --> 00:18:32,850 magically comes out with prose in Croatian, not English. So, cool. This is a dream, this is a 204 00:18:32,850 --> 00:18:37,890 vision. There's a whole bunch of content written by humans; that's really great. What 205 00:18:37,890 --> 00:18:43,560 are we going to do with this vision? Well, (a), apparently I need to build it at some point. But 206 00:18:43,560 --> 00:18:48,240 also (b), even once you've got a technology as an available thing, what do we as the 207 00:18:48,240 --> 00:18:54,450 Wikimedia community want to do with it? One of the things that we think might be an early win, 208 00:18:54,780 --> 00:19:01,230 an early place to start with this work, is Wikidata item descriptions. So if you look here, a 209 00:19:01,230 --> 00:19:05,790 lot of these languages don't have anything defined at all; the ones that are defined are all 210 00:19:05,790 --> 00:19:10,440 saying the very basic thing, that the University of Lagos is a university in Nigeria, except, 211 00:19:10,500 --> 00:19:15,900 interestingly, the ones in English and Hausa, which both mention that it is a public university 212 00:19:16,050 --> 00:19:23,760 in Nigeria.
But that's something that a Wikifunctions bot, as you can think of it, could 213 00:19:23,790 --> 00:19:29,460 replace: those statements made individually, on a per-language basis, by a kind 214 00:19:29,460 --> 00:19:33,570 of can language translation, still controlled by the Wikidata community. So 215 00:19:33,000 --> 00:19:38,370 if you change it to 'former university', because it's now closed, they all get updated instantly; 216 00:19:38,520 --> 00:19:44,610 or the country of Nigeria changes its name because it, you know, splits in half or doubles 217 00:19:44,610 --> 00:19:48,300 in size or whatever, and those will all get updated without a human having to go through 218 00:19:48,300 --> 00:19:53,820 each of 200 languages and manually changing the label. But that's, you know, still a bit scary. 219 00:19:53,850 --> 00:19:57,540 And that's just Wikidata item descriptions, right? There are other areas that are slightly 220 00:19:57,540 --> 00:20:02,790 less scary... sorry, things slightly more scary. So there are automatic lists. So this is a list of 221 00:20:02,790 --> 00:20:08,070 national capitals by population. This is an article on English Wikipedia. These are statements 222 00:20:08,070 --> 00:20:12,300 of fact; the statements of fact do not change very often. But when they change, we generally 223 00:20:12,300 --> 00:20:16,860 don't remember to change them in all the different places, right? If I update the article on 224 00:20:16,860 --> 00:20:21,930 Palau to say what the population of its capital is, and someone 225 00:20:21,930 --> 00:20:26,700 moved in, so it's not zero anymore, do I also remember to go to update the entry on this table 226 00:20:26,910 --> 00:20:30,810 that I probably don't even know exists? And, you know, if I went into Special:WhatLinksHere, 227 00:20:30,810 --> 00:20:35,730 I'd eventually find it 500 pages down; no one's ever going to do that.
And so content gets 228 00:20:35,730 --> 00:20:40,620 out of date and out of sync with each other; it'd be really nice to pull this all together. But 229 00:20:40,740 --> 00:20:45,630 that's also, like, real user-facing content in a way that Wikidata item descriptions aren't, 230 00:20:45,630 --> 00:20:50,130 in a way. So we're not going to throw some switch, you know, mad-scientist 231 00:20:50,130 --> 00:20:54,240 style, and 'tough, you're gonna have to deal with it'. This is very much a proposal, an idea that we're 232 00:20:54,240 --> 00:21:01,920 going to work through as a community. More contentiously even still: infoboxes. 233 00:21:01,920 --> 00:21:07,590 Infoboxes are a big area of concern. And by that I don't just mean that different wikis have 234 00:21:07,590 --> 00:21:12,990 different contents in their different wikis' infoboxes; you know, if you compare the article 235 00:21:13,020 --> 00:21:17,190 on Marie Curie in English, Polish, German, they disagree about her nationality, they also 236 00:21:17,190 --> 00:21:22,290 disagree about her name. But more importantly, there are things that appear in English that 237 00:21:22,290 --> 00:21:26,910 don't appear in German. For a long time her signature, I think, appeared in the infobox and 238 00:21:26,910 --> 00:21:33,540 finally got dropped a few years ago. Audrey Tang, who's a brilliant person and is Minister 239 00:21:33,540 --> 00:21:39,030 of Digital Affairs in Taiwan, or at least was until the election yesterday; I don't know if that's 240 00:21:39,030 --> 00:21:44,880 still the case. Anyway, we have a whole bunch of data about them. Once stored in 241 00:21:44,880 --> 00:21:50,430 Wikidata, all of it, theoretically, could be automatically put into infoboxes in some form using a 242 00:21:50,430 --> 00:21:57,900 function.
But English Wikipedia especially, more than, actually, I think, any other Wikipedia, has a 243 00:21:57,900 --> 00:22:02,730 really contentious view about infoboxes, where it's not even agreed on the wiki what they 244 00:22:02,730 --> 00:22:06,360 should look like. And so you have some WikiProjects that decided they have a different view 245 00:22:06,360 --> 00:22:12,120 of infoboxes to other WikiProjects, and long-running, very slow edit wars back and forth as to 246 00:22:12,120 --> 00:22:17,310 which template to use. You know, is Napoleon a monarch, or is he a military figure? 247 00:22:17,400 --> 00:22:21,840 Because they get different infoboxes. And that's really important to some people. I think it's 248 00:22:21,840 --> 00:22:27,360 more important that our readers get the right answers, but, you know, it's not 249 00:22:27,360 --> 00:22:32,340 for me as an individual, let alone me from the Foundation, to tell the community how to run 250 00:22:32,340 --> 00:22:39,840 itself, right? But I want to point out sometimes when it goes in the wrong way. Anyway, that's 251 00:22:39,840 --> 00:22:44,490 the kind of overview of what the idea for the Abstract Wikipedia vision is. 252 00:22:45,180 --> 00:22:50,970 There is no plan to replace articles; that is, like, pointless, and not a good value for 253 00:22:51,060 --> 00:22:55,830 anyone. But hopefully it's giving you some ideas about things that we could do with 254 00:22:55,830 --> 00:23:00,810 Abstract Wikipedia. Hopefully, you're even excited for Wikifunctions to actually ship at some 255 00:23:00,810 --> 00:23:08,010 point. I am. And yeah, I've got hopefully 15 minutes now for questions, if anybody has any. 256 00:23:08,760 --> 00:23:17,610 When does it ship?
Okay, we shipped Wikifunctions itself in July last year, so we got 257 00:23:17,610 --> 00:23:23,520 it out the door very early. At the time, only a very small number of blessed 258 00:23:23,520 --> 00:23:28,500 humans, you know, were given the ability to actually write functions or even run functions. 259 00:23:29,910 --> 00:23:34,620 We have done a lot of work on the security side and on the scaling side, so that Security are happy, 260 00:23:35,130 --> 00:23:42,480 happy enough with us, at this point, that we now allow any community member to create functions. 261 00:23:42,810 --> 00:23:47,940 Only approved community members, approved by the community, can actually make the function live so 262 00:23:47,940 --> 00:23:52,590 that other people will run them. And then once a function is approved as a live function, any 263 00:23:52,590 --> 00:23:58,440 human, logged out or logged in, can run it by putting inputs into it. So I could go right now 264 00:23:58,440 --> 00:24:03,900 to the article, you know, to the entry for the NOT gate, and make true go to false and false go to 265 00:24:03,900 --> 00:24:06,960 true, and I don't need to be logged in for that. I certainly don't need any special rights. 266 00:24:08,430 --> 00:24:16,110 However, right? You know, a car is great inside the garage, but you actually want to drive it on 267 00:24:16,110 --> 00:24:20,910 the road. The thing a lot of people want to do is use Wikifunctions to power their apps 268 00:24:20,910 --> 00:24:26,040 and to power Wikipedia articles and to pull in Wikidata content.
Those are big things we're 269 00:24:26,040 --> 00:24:32,550 focused on as a team. We've got a meeting next week, next week, to actually decide 270 00:24:32,580 --> 00:24:36,480 on the kind of priorities for the coming quarter and, you know, share that with the community as 271 00:24:36,570 --> 00:24:43,620 and when. But assuming the team agrees with me, I think we're probably going to have a big focus on 272 00:24:43,620 --> 00:24:50,010 types and on Wikidata usage, and a second focus on using it on Wikipedia. But it's months, not 273 00:24:50,010 --> 00:24:58,050 weeks, unfortunately, at best. Um, that's a good question. So the first thing I'd say is that 274 00:24:58,080 --> 00:25:02,310 ultimately Wikifunctions is community logic, controlled by the Wikifunctions community, 275 00:25:02,610 --> 00:25:07,590 which you can be part of, if you go sign up, right? You know, it's not controlled by me. And 276 00:25:07,590 --> 00:25:12,930 so what functions the community focuses on and thinks are important is kind of in their 277 00:25:12,930 --> 00:25:20,280 control, not ours; we will support them in the directions they want to go. I think the 278 00:25:20,280 --> 00:25:27,420 simple natural language generation concepts are actually already existing across the Wikimedia 279 00:25:27,480 --> 00:25:33,180 wikis. In some cases, right, there's a module, a Lua function, where you will pass it a list of 280 00:25:33,180 --> 00:25:39,720 between one and n entries, and it will say foo, or foo and bar, or foo comma bar comma and, you 281 00:25:39,720 --> 00:25:45,240 know, baz. And so that kind of templating or language generation already 282 00:25:45,240 --> 00:25:49,620 exists. And kind of a lift and shift of that over to Wikifunctions is not necessarily a thing 283 00:25:49,620 --> 00:26:03,840 that will take a very long time.
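A minimal Python sketch of the list-joining behaviour described above (foo; foo and bar; foo, bar, and baz). The function name `join_list` and its `conjunction` parameter are hypothetical, chosen for illustration; this is not the existing Lua module or any actual Wikifunctions function, and it only handles English-style lists.

```python
def join_list(items, conjunction="and"):
    """Join between one and n strings into an English-style list phrase.

    Hypothetical helper for illustration only; not an existing
    Wikifunctions function or Lua module.
    """
    if not items:
        raise ValueError("expected at least one item")
    if len(items) == 1:
        # A single entry is returned unchanged: "foo"
        return items[0]
    if len(items) == 2:
        # Two entries need no comma: "foo and bar"
        return f"{items[0]} {conjunction} {items[1]}"
    # Three or more entries: comma-separated, conjunction before the last
    return f"{', '.join(items[:-1])}, {conjunction} {items[-1]}"


print(join_list(["foo"]))
print(join_list(["foo", "bar"]))
print(join_list(["foo", "bar", "baz"]))
```

Other languages would need their own separators, conjunctions, and ordering rules, which is presumably where per-language implementations would come in.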
The problem is it's very easy to come up with very naive and 284 00:26:03,840 --> 00:26:11,070 simple heuristics for how to describe entities like biographies or whatever: born then, died 285 00:26:11,070 --> 00:26:16,320 then. And they always run into the buffers of, you know, Wikipedia is an amazing place, and it 286 00:26:16,320 --> 00:26:22,680 has articles about people where we don't have any of that data. Right? So there are articles, 287 00:26:22,680 --> 00:26:26,760 I mean, you know, there's this wonderful standard blog post that goes around the web, that's, 288 00:26:26,790 --> 00:26:32,730 you know, 50 lies programmers believe about names, or dates, or whatever, right? So we have 289 00:26:32,730 --> 00:26:37,320 Wikipedia articles about people where we don't know the date of birth or the date of death. 290 00:26:37,320 --> 00:26:42,150 And sometimes we just say, with this wonderful term, floruit, as in when they were 291 00:26:42,150 --> 00:26:47,670 flourishing. So we have, like, well, they published a book in 522, so they must have been 292 00:26:47,670 --> 00:26:51,810 alive then. But we don't exactly know whether they were alive 20 years before or after that. 293 00:26:51,810 --> 00:26:58,740 So that's tough. For some of them, we don't know their names; we have kind of "the author of X", 294 00:26:59,070 --> 00:27:07,800 unknown, you know. And so often, there's a risk with Wikipedia work that we can let perfect 295 00:27:07,800 --> 00:27:15,420 be the enemy of the good. But at the same time, it's very easy to get stuck into only answering 296 00:27:15,420 --> 00:27:22,590 questions for English content, where, frankly, there's quite a lot of that already.
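To make the missing-data problem concrete, here is a small Python sketch of a biography descriptor that does not assume birth date, death date, or even a name are known. Everything in it (the function `describe_person`, its parameters, and the output wording) is an assumption made for illustration, not anything from Wikifunctions or Wikidata.

```python
def describe_person(name=None, born=None, died=None, floruit=None, work=None):
    """Build a short English description from whatever data is available.

    All parameters are optional; this is a hypothetical sketch, not an
    actual Wikifunctions implementation.
    """
    # Fall back to "the author of X" when the name itself is unknown.
    if name is not None:
        subject = name
    elif work is not None:
        subject = f"the author of {work}"
    else:
        subject = "an unknown person"

    # Prefer real dates; use floruit only when neither date is known.
    if born is not None and died is not None:
        dates = f"born {born}, died {died}"
    elif born is not None:
        dates = f"born {born}"
    elif died is not None:
        dates = f"died {died}"
    elif floruit is not None:
        dates = f"fl. {floruit}"
    else:
        dates = None

    return subject if dates is None else f"{subject} ({dates})"


print(describe_person(name="Marie Curie", born=1867, died=1934))
print(describe_person(work="X", floruit=522))
```

The naive "born then, died then" heuristic corresponds to only the first branch; the remaining branches are the cases that heuristic runs into.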
I'm very 297 00:27:22,590 --> 00:27:28,680 happy when English Wikipedia gets value out of the things we build, but the Urdu Wikipedia, the Dagbani 298 00:27:28,680 --> 00:27:34,410 Wikipedia, the Igbo Wikipedia, are all much, much smaller, and the potential for 299 00:27:34,410 --> 00:27:39,570 helping readers is much higher there with less work; you know, the same amount of community 300 00:27:39,570 --> 00:27:47,400 work could go further. So this is not answering your question at all, is it? Like, in practice, 301 00:27:47,520 --> 00:27:53,280 the vast majority of the current technically minded community members in the Wikimedia movement 302 00:27:53,370 --> 00:27:58,860 currently speak English or German, natively or at least fluently, right? And that's not great, 303 00:27:58,920 --> 00:28:02,220 because there's a whole lot of communities out there that don't have one of those people sitting around 304 00:28:02,220 --> 00:28:06,390 who can help out. But it does mean that in practice, a large amount of the Wikifunctions 305 00:28:06,390 --> 00:28:10,620 content, you know, even when it's written by someone who doesn't speak those 306 00:28:10,620 --> 00:28:18,000 languages very well, will very early on get English and German versions. And so I am 307 00:28:18,600 --> 00:28:24,360 pretty confident that very early on it will be usable for English content. I just worry 308 00:28:24,390 --> 00:28:29,520 about using that as the benchmark to aim for, because I think we may aim at the wrong things 309 00:28:29,520 --> 00:28:35,460 by building that. But I want to build it soon, so people use it and tell us what actually works 310 00:28:35,460 --> 00:28:40,500 and doesn't work. And then we can course-correct then, rather than endlessly building kind of 311 00:28:40,500 --> 00:28:44,640 castles in the sky of my mind about what does and doesn't work without validating it in the 312 00:28:44,640 --> 00:28:50,010 real world.
I like to say that, you know, until the user is using it, all software is vaporware. 313 00:28:50,430 --> 00:29:00,120 And this is more vaporware than most. Yeah, yeah. So this is kind of one of the things I 314 00:29:00,150 --> 00:29:08,670 put my hand on the scale for early on. Yeah, early on in the vision of Abstract Wikipedia, we 315 00:29:08,670 --> 00:29:14,490 said it was multilingual. And we said multilingual both in terms of all natural languages, 316 00:29:14,520 --> 00:29:18,540 but also programming languages. And so the question was, we've currently got two 317 00:29:18,540 --> 00:29:24,270 programming languages. Why two? Why these two? So the two we have are JavaScript and Python, 318 00:29:24,780 --> 00:29:36,090 specifically QuickJS, version ECMAScript 2020, and RustPython, 3.11 alpha, off the top of my 319 00:29:36,090 --> 00:29:41,520 head. Why those two is because those are the two that we could build into a WebAssembly compact 320 00:29:41,520 --> 00:29:47,340 profile, enough for us to ship into production. Why two is because when we originally planned 321 00:29:47,340 --> 00:29:51,810 it out, we were going to ship with one. And I said, if we ship with one, that means we ship 322 00:29:51,810 --> 00:29:56,370 with none. Because, you know, we'll say, oh, we'll add a second one later, and we'll have 323 00:29:56,370 --> 00:30:01,650 built everything around the idea of having only one. Even when you promise yourself 324 00:30:01,650 --> 00:30:04,680 you're going to add support for two, if you don't actually ship with two, you don't ship 325 00:30:04,680 --> 00:30:09,900 with the ability. So that's why we shipped with two. Those two in particular are just because 326 00:30:09,900 --> 00:30:13,950 those are the two really easy programming languages that community members, you know, 327 00:30:13,980 --> 00:30:19,890 general people on the web, understand and use, and that we can wrap together.
We're 328 00:30:19,890 --> 00:30:23,610 definitely not planning to restrict it to just those two. So one of the things we want to 329 00:30:23,640 --> 00:30:29,010 support is different versions of the same programming language. So Python 3.11 versus 3.10 330 00:30:29,010 --> 00:30:33,930 versus 3.12. And so as new versions of Python come out, new runners will be available, 331 00:30:34,530 --> 00:30:38,610 and community members will get a notification saying, hey, this function you're watching 332 00:30:38,640 --> 00:30:42,420 doesn't actually run on the new version of Python, do you want to update it so it does, or 333 00:30:42,420 --> 00:30:47,760 add some more tests, or whatever. We also want to support other programming languages. The very 334 00:30:47,760 --> 00:30:54,660 easy, obvious answer to the next programming language for us to add is Lua. The problem with that 335 00:30:55,080 --> 00:30:59,880 is that lots of people at Wikipedia already use Lua, but they use MediaWiki Lua, which is kind of 336 00:30:59,880 --> 00:31:05,850 an old and dated fork of Lua. And I think we would need to do a fair bit of work to 337 00:31:05,850 --> 00:31:11,370 explain to people what is and isn't in scope for those things, because, for example, there's lots 338 00:31:11,370 --> 00:31:16,140 of functions that are disabled in MediaWiki Lua because they have security issues, whereas 339 00:31:16,140 --> 00:31:22,530 we have a security sandbox that, touch wood, does mean that we wouldn't need that. But in return, 340 00:31:23,010 --> 00:31:29,220 MediaWiki Lua injects a whole bunch of useful functions and framing devices that we wouldn't 341 00:31:29,250 --> 00:31:32,940 make available. And so people would be like, well, hang on, why does it work here and not here? You know, 342 00:31:32,970 --> 00:31:38,460 it'd be a complicated migration. Other people have asked if we can add things like Rust or Go; 343 00:31:38,790 --> 00:31:47,790 those are exciting languages.
Probably pretty easy from a technology point of view 344 00:31:47,790 --> 00:31:53,880 for us to support, but possibly pretty hard from a kind of community understanding, especially if 345 00:31:53,940 --> 00:31:59,070 someone wanted C++, which means we have to then compile into objects, those objects have to live 346 00:31:59,070 --> 00:32:03,180 somewhere, we need to cache them and expunge them in a different way. And then the community is 347 00:32:03,180 --> 00:32:07,320 like, well, hang on, which version? I looked at revision X, but you're running revision Y. 348 00:32:07,350 --> 00:32:11,640 That kind of stuff is even more confusing when it's not literally "take this text and run it in 349 00:32:11,640 --> 00:32:21,570 the interpreter". We will not support every language forever. You know, languages have a life 350 00:32:21,630 --> 00:32:27,810 to them nowadays, a lifecycle of languages coming in, languages going out. I, in a different 351 00:32:27,810 --> 00:32:35,040 hat, just disabled Node 16 running in the CI system for Wikimedia. Things like that are 352 00:32:35,040 --> 00:32:39,930 going to have to happen for Wikifunctions to keep running. And so that means it's not good 353 00:32:39,930 --> 00:32:45,150 enough to do a fire-and-forget; people have to keep their functions, the implementations, 354 00:32:45,150 --> 00:32:50,310 updated for new versions of languages, new community requests and requirements. And so 355 00:32:50,790 --> 00:32:55,650 building not just "someone got a notification in their watchlist", but actually 356 00:32:55,650 --> 00:33:00,960 an active curation dashboard: these are the functions lots of people are using; these are 357 00:33:00,960 --> 00:33:04,800 things that are really slow, and if someone could make a faster version, that'd be great.
Or 358 00:33:04,950 --> 00:33:08,280 these are the versions where it's in Python 3.7, that's end of life, so please 359 00:33:08,280 --> 00:33:13,230 migrate it to a newer version, that'd be good. So that's kind of a bunch of things in my 360 00:33:13,230 --> 00:33:18,810 thoughts around language support, but we're not fixed on a particular language. We already do a 361 00:33:18,810 --> 00:33:24,390 really poor job as a movement of explaining the provenance of content to users, right? You're 362 00:33:24,390 --> 00:33:29,400 looking at an article. And the text is from the article, except the bits of text that are from a 363 00:33:29,400 --> 00:33:32,910 template, except the bits of text that are from Commons, except the bits of text that are from 364 00:33:32,910 --> 00:33:36,360 Wikidata, except the bits of text that are actually interpolated by a Lua script between three of 365 00:33:36,360 --> 00:33:41,370 these different sources. And we kind of just say, oh, yeah, it's Wikipedia, go with it. 366 00:33:41,910 --> 00:33:47,550 Adding Wikifunctions adds a fourth layer to that. And I agree. If there isn't already the 367 00:33:47,550 --> 00:33:51,720 straw that breaks the camel's back, this is definitely risking that. And I think, I mean, 368 00:33:51,720 --> 00:33:55,950 not just for editors, but for readers, we probably need to do a good job of explaining: 369 00:33:56,520 --> 00:34:00,750 Why should you trust this? Why should you believe this? How do you know where this comes 370 00:34:00,750 --> 00:34:06,000 from? And I think, and this is a bit larger scale than just Wikifunctions, but I think 371 00:34:06,330 --> 00:34:11,160 we're probably going to want to do something like badging content from different 372 00:34:11,160 --> 00:34:16,320 sources with little icons or little last-updated flags or whatever, things like that.
So people 373 00:34:16,320 --> 00:34:19,590 can understand, at least to some extent, as they're reading and looking at it, where it 374 00:34:19,590 --> 00:34:25,260 comes from. On the editing side, even more so, we're going to have to do a really good job of 375 00:34:25,320 --> 00:34:30,810 onboarding people and holding them through: what does it mean when it says their age is 39? Didn't 376 00:34:30,810 --> 00:34:35,850 they have a birthday this morning? It should say 40. How do I fix that? And, like, 377 00:34:36,060 --> 00:34:39,990 what ultimately you need to find out is, oh, it was cached and it will be fixed in an hour. Right? 378 00:34:40,020 --> 00:34:45,450 But that's like arcane Wikimedia stack knowledge that humans shouldn't ever have to 379 00:34:45,450 --> 00:34:49,440 learn, I or deletion have to learn, but certainly editors should not have to learn that. 380 00:34:49,860 --> 00:35:00,120 And so if we're doing things like that, then we need to... I don't think we probably want to turn 381 00:35:00,240 --> 00:35:04,380 read mode into an edit mode where there's a little edit box, and if you click into it, 382 00:35:04,380 --> 00:35:08,160 magically you're not editing Wikipedia, you're editing Wikidata or editing Wikifunctions or 383 00:35:08,160 --> 00:35:14,790 whatever. But we do need some way of learning that there is a kind of through line for how 384 00:35:14,790 --> 00:35:19,410 the content gets to your page. And maybe, like, demonstrating the little pipelines, or something 385 00:35:19,410 --> 00:35:24,630 like automatically diagramming the content flows that end up with content being shown to 386 00:35:24,630 --> 00:35:31,950 you, might be something to explore. But I am now freelancing way outside my team's work. And the 387 00:35:31,950 --> 00:35:37,020 Web team may have views as to me saying this publicly. So hey, did that answer that question? 388 00:35:37,350 --> 00:35:42,780 Cool. All right. I think that's us at time.
So thank you all very much. And hopefully this was 389 00:35:42,810 --> 00:35:46,530 interesting or at least not too shocking or worrying. Thank you.