English subtitles for clip: File:WikiConference RU - Wikidata Query Service Tutorial in Tunisian - Part 1.webm
1 00:00:00,000 --> 00:00:06,000 Hello everyone. Today, we will demonstrate the Wikidata Query Service
2 00:00:06,000 --> 00:00:12,000 which allows you to extract data from Wikidata.
3 00:00:12,000 --> 00:00:16,000 But, before we begin, let's see what Wikidata is.
4 00:00:16,000 --> 00:00:24,000 Wikidata is a structured database that assigns an item to each concept, whether a human,
5 00:00:24,000 --> 00:00:30,000 a building, a country, etc. Each item is then described
6 00:00:30,000 --> 00:00:36,000 by statements
7 00:00:36,000 --> 00:00:42,000 in the form of triples: subject - predicate - object.
8 00:00:42,000 --> 00:00:48,000 This is the equivalent of subject - verb - object in a sentence.
9 00:00:48,000 --> 00:01:00,000 These statements can be the name (label), description, or alternative names (aliases) of the item in multiple languages.
10 00:01:00,000 --> 00:01:06,000 They can also be its class or its country of origin.
11 00:01:06,000 --> 00:01:17,000 They can even be external identifiers in other databases, a link to the item's Wikipedia page, etc.
12 00:01:17,000 --> 00:01:28,000 This organization of Wikidata makes data extraction and analysis very easy.
13 00:01:28,000 --> 00:01:33,000 We can ask Wikidata to give us the details we need.
14 00:01:33,000 --> 00:01:39,000 This is done using the Wikidata Query Service.
15 00:01:39,000 --> 00:01:53,000 This service is accessible at https://query.wikidata.org.
16 00:01:53,000 --> 00:01:59,000 When you open the URL, you will see this interface.
17 00:01:59,000 --> 00:02:06,000 You will find a field in which to write the query (the question, in SPARQL).
18 00:02:06,000 --> 00:02:13,000 You will also find a query assistant which will make this task easier for you.
19 00:02:13,000 --> 00:02:19,000 But we are not going to use the query assistant in this tutorial.
20 00:02:19,000 --> 00:02:26,000 Okay, let's try something simple.
21 00:02:26,000 --> 00:02:32,000 To begin, you need to write SELECT.
22 00:02:32,000 --> 00:02:43,000 Then, you must give the name of the variable (the information) you want to extract.
23 00:02:43,000 --> 00:02:48,000 To define a variable, you must write its name preceded by a question mark (?).
24 00:02:48,000 --> 00:03:07,000 Then you write WHERE and open and close a pair of braces.
25 00:03:07,000 --> 00:03:12,000 Then, between the braces, you define the conditions that the variable must meet, in the form of triples, just like in Wikidata.
26 00:03:12,000 --> 00:03:27,000 We put the ?protein variable as the subject. Then, we choose a particular predicate (type of relationship).
27 00:03:27,000 --> 00:03:36,000 We write wdt: then press Ctrl+Space. Next, we start typing the name of the property.
28 00:03:36,000 --> 00:03:48,000 We choose the “instance of” property (P31).
29 00:03:48,000 --> 00:03:56,000 For the object, we write wd: then press Ctrl+Space. After this, we type "protein" to find the corresponding item.
30 00:03:56,000 --> 00:04:03,000 Then, we choose the item and end the condition with a period (.). Finally, we run the query by clicking on the blue button.
31 00:04:03,000 --> 00:04:21,000 The query can also be run by pressing Ctrl+Enter. Let's go.
32 00:04:21,000 --> 00:04:33,000 That's it. We found 986793 proteins in eight seconds.
33 00:04:33,000 --> 00:04:39,000 You may find this number very large. You can limit the number of results
34 00:04:39,000 --> 00:04:48,000 by writing LIMIT followed by the number of results wanted. For this example, we will restrict the number to 500.
35 00:04:48,000 --> 00:04:52,000 We obtain 500 results.
36 00:04:52,000 --> 00:05:01,000 But, as you can see, the results are displayed using their language-independent Wikidata IDs.
37 00:05:01,000 --> 00:05:06,000 So how can we display item names in a given language?
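
Editor's sketch: the query dictated so far (SELECT, a ?protein variable, WHERE with one triple, then LIMIT) can be written out as below. It is kept as a Python string so that it can be printed and pasted into the query field. The item ID wd:Q8054 for "protein" is an assumption: the speaker picks the item through autocompletion and never reads its ID aloud. The function name is hypothetical.

```python
# Assumption: wd:Q8054 is the Wikidata item for "protein". The transcript only
# says the item is chosen from the Ctrl+Space suggestions.
PROTEIN_ITEM = "wd:Q8054"


def build_protein_query(limit=None):
    """Return a SPARQL query listing every item that is an instance of protein."""
    lines = [
        "SELECT ?protein",                       # the variable to extract
        "WHERE {",
        f"  ?protein wdt:P31 {PROTEIN_ITEM} .",  # subject - predicate - object
        "}",
    ]
    if limit is not None:
        lines.append(f"LIMIT {int(limit)}")      # optional cap on the row count
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_protein_query(limit=500))
```

Calling build_protein_query() without an argument gives the first, unlimited version of the query; passing limit=500 gives the restricted one.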
38 00:05:06,000 --> 00:05:15,000 We write SERVICE then press Ctrl+Space, and we find a suggestion that begins with SERVICE wikibase:label.
39 00:05:15,000 --> 00:05:26,000 We delete everything that is not part of the chosen line.
40 00:05:26,000 --> 00:05:32,000 This line gives you the names of the items in a given language.
41 00:05:32,000 --> 00:05:38,000 Let's say we choose the Russian language: we put "ru" here. That's all.
42 00:05:38,000 --> 00:05:45,000 We run the query and see what happens. We find that nothing changes
43 00:05:45,000 --> 00:06:02,000 because we did not add a variable for the name. So, we need to add ?proteinLabel (variable name + Label) as a variable.
44 00:06:02,000 --> 00:06:18,000 That's all! We get the name of each item in Russian.
45 00:06:18,000 --> 00:06:30,000 Now, as you saw on Wikidata, there are several types of statements defined per language.
46 00:06:30,000 --> 00:06:43,000 We have the label, the description and the aliases. How can we extract them?
47 00:06:43,000 --> 00:06:48,000 For this, we no longer use SERVICE wikibase:label. Instead, we write a condition that begins with ?protein.
48 00:06:48,000 --> 00:06:59,000 Then, we write rdfs:label. Then, we define a variable for the name, called ?data.
49 00:06:59,000 --> 00:07:12,000 So, we write rdfs:label if we need the label, schema:description to get the description
50 00:07:12,000 --> 00:07:27,000 and skos:altLabel for the aliases. We will choose rdfs:label for this example.
51 00:07:27,000 --> 00:07:38,000 We run the query. We find that the labels are not filtered by language.
52 00:07:38,000 --> 00:07:42,000 I want only labels in a given language (Russian) to be displayed.
53 00:07:42,000 --> 00:07:46,000 I will now start writing FILTER.
54 00:07:46,000 --> 00:08:04,000 Then, I will write (LANG(?data) = "ru"). That's all.
55 00:08:04,000 --> 00:08:18,000 Thus, we obtain only the labels in Russian.
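
Editor's sketch: the two ways of getting names described in this part could look roughly like this. The first function uses the SERVICE wikibase:label line together with ?proteinLabel; the second uses an explicit predicate (rdfs:label, schema:description or skos:altLabel) with a language FILTER. As before, wd:Q8054 for "protein" is an assumption, the function names are hypothetical, and the transcript does not say whether LIMIT 500 was still present at this point, so the limit is left optional.

```python
def label_service_query(language="ru", limit=None):
    """Names via the label service: needs ?proteinLabel after SELECT."""
    lines = [
        "SELECT ?protein ?proteinLabel",
        "WHERE {",
        "  ?protein wdt:P31 wd:Q8054 .",  # assumption: Q8054 = protein
        f'  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{language}" . }}',
        "}",
    ]
    if limit is not None:
        lines.append(f"LIMIT {int(limit)}")
    return "\n".join(lines)


def language_filtered_query(predicate="rdfs:label", language="ru", limit=None):
    """Names via an explicit predicate, restricted to one language."""
    lines = [
        "SELECT ?protein ?data",
        "WHERE {",
        "  ?protein wdt:P31 wd:Q8054 .",
        f"  ?protein {predicate} ?data .",          # label, description or alias
        f'  FILTER(LANG(?data) = "{language}")',    # keep one language only
        "}",
    ]
    if limit is not None:
        lines.append(f"LIMIT {int(limit)}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(label_service_query())
    print()
    print(language_filtered_query("skos:altLabel"))
```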
56 00:08:18,000 --> 00:08:24,000 Suppose you don't want just the labels. You also want the aliases.
57 00:08:24,000 --> 00:08:35,000 You must specify that the ?data variable satisfies one of two conditions.
58 00:08:35,000 --> 00:08:58,000 Normally, each condition is written on its own line and is linked to the other conditions by a logical AND.
59 00:08:58,000 --> 00:09:05,000 However, in our situation, we need two conditions linked by a logical OR.
60 00:09:05,000 --> 00:09:35,000 In such a situation, we use UNION. Thus, we will find the labels (rdfs:label) and the aliases (skos:altLabel) at the same time.
61 00:09:35,000 --> 00:10:04,000 Let's see what this gives. We find that some items now have two names, here and there.
62 00:10:04,000 --> 00:10:25,000 Now, you may want to display all variables without naming each of them after SELECT.
63 00:10:25,000 --> 00:10:46,000 We can simply put an asterisk (*) after SELECT. That's all!
64 00:10:46,000 --> 00:11:02,000 I don't know if the query will take a long time to give its results.
65 00:11:02,000 --> 00:11:16,000 Good! Here are the results, with the identifiers as well as the protein labels and aliases on Wikidata!
66 00:11:16,000 --> 00:11:26,000 Anyway. I now want to determine the number of result rows.
67 00:11:26,000 --> 00:11:31,000 It should be 928 rows. But it would be interesting to see how to do this using SPARQL.
68 00:11:31,000 --> 00:11:34,000 It's simple. We use an aggregate function called COUNT.
69 00:11:34,000 --> 00:11:44,000 Aggregate functions generally allow you to compute statistics from the results of a SPARQL query.
70 00:11:44,000 --> 00:11:54,000 So, we write COUNT(*) and run the query. We get nothing because we did not define the computed expression as a variable.
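
Editor's sketch: the UNION step, combined with SELECT *, might be written as follows. Each alternative goes in its own pair of braces and the alternatives are joined by UNION, which plays the role of the logical OR. The exact layout typed in the video is not visible in the transcript, so this is one plausible form; wd:Q8054 for "protein" is an assumption and the function name is hypothetical.

```python
def names_query(predicates=("rdfs:label", "skos:altLabel"), language="ru", select="*"):
    """Return a query where ?data matches any one of the given predicates."""
    # One braced group per alternative; UNION between groups acts as a logical OR.
    branches = [f"  {{ ?protein {p} ?data . }}" for p in predicates]
    union_block = "\n  UNION\n".join(branches)
    return "\n".join([
        f"SELECT {select}",                        # "*" shows every variable
        "WHERE {",
        "  ?protein wdt:P31 wd:Q8054 .",           # assumption: Q8054 = protein
        union_block,
        f'  FILTER(LANG(?data) = "{language}")',   # conditions outside UNION are ANDed
        "}",
    ])


if __name__ == "__main__":
    print(names_query())
```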
71 00:11:54,000 --> 00:12:06,000 What we do is put the expression in parentheses and define it as a variable using AS followed by the variable name (?count).
72 00:12:06,000 --> 00:12:36,000 We run it. 928. This is the expected result.
73 00:12:36,000 --> 00:12:52,000 Now, we want to calculate the number of names (labels and aliases) for each protein.
74 00:12:52,000 --> 00:13:09,000 So, we write ?protein as a variable and put the other variables inside COUNT.
75 00:13:09,000 --> 00:13:21,000 We do this because we want to count the names per protein.
76 00:13:21,000 --> 00:13:28,000 Now, if we run it, we are told that the query is not well formed.
77 00:13:28,000 --> 00:13:41,000 What we should do is look at the original variables kept after SELECT outside the aggregate function. Here, we only have ?protein.
78 00:13:41,000 --> 00:14:24,000 We now write GROUP BY ?protein and the query works.
79 00:14:24,000 --> 00:14:30,000 This takes time. Good. It took 34 seconds.
80 00:14:30,000 --> 00:14:44,000 Generally, if a query needs more than 60 seconds to produce results, a timeout occurs and the query stops without returning anything.
81 00:14:44,000 --> 00:14:58,000 Now we see the number of names (labels and aliases) in Russian for each item. But these numbers are not sorted.
82 00:14:58,000 --> 00:15:10,000 We can sort these numbers using ORDER BY at the end of the query.
83 00:15:10,000 --> 00:16:58,000 After ORDER BY, we put ?count for ascending order or DESC(?count) for descending order. Be careful to write ?count and not ?data.
84 00:16:58,000 --> 00:17:51,000 Now, this data can be viewed in several formats: histogram, scatter plot, etc. Just click on the eye icon and choose the appropriate format.
85 00:17:51,000 --> 00:18:02,000 However, you must enable the item labels for this option to be available.
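
Editor's sketch: the aggregate steps from this part (COUNT with AS, then GROUP BY and ORDER BY) could be written along these lines. Counting ?data inside COUNT is an assumption, since the transcript only says the other variables are placed inside the function. wd:Q8054 for "protein" is also assumed, and the function names are hypothetical.

```python
# Shared WHERE body: Russian labels and aliases of proteins.
# Assumption: wd:Q8054 is the Wikidata item for "protein".
PATTERN = "\n".join([
    "  ?protein wdt:P31 wd:Q8054 .",
    "  { ?protein rdfs:label ?data . }",
    "  UNION",
    "  { ?protein skos:altLabel ?data . }",
    '  FILTER(LANG(?data) = "ru")',
])


def total_count_query():
    """Count all result rows; the aggregate must be named with AS."""
    return "SELECT (COUNT(*) AS ?count)\nWHERE {\n" + PATTERN + "\n}"


def names_per_protein_query(descending=True):
    """Count names per protein; every non-aggregated variable is grouped."""
    order = "DESC(?count)" if descending else "?count"
    return (
        "SELECT ?protein (COUNT(?data) AS ?count)\n"
        "WHERE {\n" + PATTERN + "\n}\n"
        "GROUP BY ?protein\n"   # ?protein is the only variable outside COUNT
        f"ORDER BY {order}"     # sort on the computed ?count, not on ?data
    )


if __name__ == "__main__":
    print(total_count_query())
    print()
    print(names_per_protein_query())
```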
86 00:18:02,000 --> 00:19:00,000 I'm going to add the SERVICE wikibase:label line as well as a ?proteinLabel variable, which I add after both SELECT and GROUP BY.
87 00:19:00,000 --> 00:19:12,000 Hopefully the query takes less than 60 seconds, so that we get results and avoid a timeout.
88 00:19:12,000 --> 00:20:02,000 You can click here to see when the service last updated its data. This lets you know whether the results reflect the current state of Wikidata.
89 00:20:02,000 --> 00:20:18,000 Well, we can clearly see that what we feared has happened. We got a timeout. No problem. That's how it works.
90 00:20:18,000 --> 00:20:46,000 I'm going to revert the query to how it was before this step.
91 00:20:46,000 --> 00:20:52,000 Sometimes you will need to integrate automatically updated results from the Wikidata Query Service into an application or system.
92 00:20:52,000 --> 00:21:18,000 This can be done by clicking on Code. This generates code allowing the results to be included in an HTML page or in a Java, Python, JavaScript or R program.
93 00:21:18,000 --> 00:21:31,000 You can also download the results in JSON, TSV or CSV format by clicking Download.
94 00:21:31,000 --> 00:21:43,000 You can also click Link to generate a link to the results without the query field.
95 00:21:43,000 --> 00:22:02,000 If you want to generate a link to the query, you can click on the link icon on the left. This will generate a URL to the query.
96 00:22:02,000 --> 00:22:21,000 Those who are not familiar with Wikidata and SPARQL can draw inspiration from examples by clicking on the folder icon.
97 00:22:21,000 --> 00:23:08,000 You can search for the example you want to see and adapt it by adding conditions and labels, or changing the language or an item if necessary.
98 00:23:08,000 --> 00:23:36,000 Regarding aggregate functions, COUNT, which we used earlier, is not the only one. We also have MAX (maximum), MIN (minimum) and other functions.
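
Editor's sketch: for the integration scenario mentioned above, a program can send the query to the service's SPARQL endpoint and read the JSON results. The code below is a minimal standard-library version and is not the code that the Code button generates. The User-Agent string is a placeholder to replace with your own, and the 70-second client timeout is an arbitrary choice made to sit slightly above the 60-second limit mentioned in the video.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"
# Placeholder value; replace it with a string describing your own tool.
USER_AGENT = "wdqs-tutorial-sketch/0.1 (example script)"


def build_request(query):
    """Prepare a GET request asking for SPARQL results in JSON."""
    url = ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
    headers = {
        "Accept": "application/sparql-results+json",
        "User-Agent": USER_AGENT,
    }
    return urllib.request.Request(url, headers=headers)


def rows_from_json(payload):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in payload["results"]["bindings"]
    ]


def run_query(query, timeout=70):
    """Send the query over the network and return the flattened rows."""
    with urllib.request.urlopen(build_request(query), timeout=timeout) as response:
        return rows_from_json(json.load(response))


if __name__ == "__main__":
    # Assumption: wd:Q8054 is the Wikidata item for "protein".
    for row in run_query("SELECT ?protein WHERE { ?protein wdt:P31 wd:Q8054 . } LIMIT 5"):
        print(row["protein"])
```

Only run_query touches the network; build_request and rows_from_json can be checked offline.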
99 00:23:36,000 --> 00:23:59,000 We can thus calculate the minimum number of inhabitants per country for a given continent. The principle is still based on the use of GROUP BY and the definition of variables.
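
Editor's sketch: the query for this last example is not shown in the transcript. One possible reading, the smallest (or largest) country population found on each continent, is sketched below. The identifiers are assumptions that do not come from the video: P30 (continent), P1082 (population) and Q6256 (country). The function name is hypothetical.

```python
def population_extreme_query(function="MIN"):
    """Return a query giving the MIN or MAX country population per continent."""
    if function not in ("MIN", "MAX"):
        raise ValueError("function must be 'MIN' or 'MAX'")
    alias = "?minPopulation" if function == "MIN" else "?maxPopulation"
    return "\n".join([
        f"SELECT ?continent ({function}(?population) AS {alias})",
        "WHERE {",
        "  ?country wdt:P31 wd:Q6256 ;",      # assumption: Q6256 = country
        "           wdt:P30 ?continent ;",    # assumption: P30 = continent
        "           wdt:P1082 ?population .", # assumption: P1082 = population
        "}",
        "GROUP BY ?continent",                # the only variable outside the aggregate
    ])


if __name__ == "__main__":
    print(population_extreme_query("MIN"))
```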