English subtitles for clip: File:WikiConference RU - Wikidata Query Service Tutorial in Tunisian - Part 1.webm

From Wikimedia Commons, the free media repository
1
00:00:00,000 --> 00:00:06,000
Hello everyone. Today, we will demonstrate the Wikidata query service

2
00:00:06,000 --> 00:00:12,000
which allows you to extract data from Wikidata.

3
00:00:12,000 --> 00:00:16,000
But, before we begin, let's see what Wikidata is.

4
00:00:16,000 --> 00:00:24,000
Wikidata is a structured database that assigns an item to each concept, whether a human,

5
00:00:24,000 --> 00:00:30,000
a building, a country, etc. Each item is then described

6
00:00:30,000 --> 00:00:36,000
by statements

7
00:00:36,000 --> 00:00:42,000
in the form of triples: subject - predicate - object.

8
00:00:42,000 --> 00:00:48,000
This is equivalent to the subject - verb - object structure of a sentence.

9
00:00:48,000 --> 00:01:00,000
These statements can be the name (label), the description, or the alternative names (aliases) of the item in multiple languages.

10
00:01:00,000 --> 00:01:06,000
These statements can also be its class or its country of origin.

11
00:01:06,000 --> 00:01:17,000
These statements can even be external identifiers in other databases, a link to the item's Wikipedia page, etc.

12
00:01:17,000 --> 00:01:28,000
This organization of Wikidata makes data extraction and analysis very easy.

13
00:01:28,000 --> 00:01:33,000
We can ask Wikidata to give us details we need.

14
00:01:33,000 --> 00:01:39,000
This is done using the Wikidata Query Service.

15
00:01:39,000 --> 00:01:53,000
This service is accessible at https://query.wikidata.org.

16
00:01:53,000 --> 00:01:59,000
When you open the URL, you will see this interface.

17
00:01:59,000 --> 00:02:06,000
You will find a field to write the query (the question in SPARQL).

18
00:02:06,000 --> 00:02:13,000
You will also find a query assistant which will make this task easier for you.

19
00:02:13,000 --> 00:02:19,000
But, we are not going to use the query assistant in this tutorial.

20
00:02:19,000 --> 00:02:26,000
Okay, let's try something simple.

21
00:02:26,000 --> 00:02:32,000
To begin, you need to write SELECT.

22
00:02:32,000 --> 00:02:43,000
Then, you must define the name of the variable (the information) you want to extract.

23
00:02:43,000 --> 00:02:48,000
To define a variable, you must write its name preceded by a question mark (?).

24
00:02:48,000 --> 00:03:07,000
Then you write WHERE followed by an opening and a closing brace.

25
00:03:07,000 --> 00:03:12,000
Then you define the conditions that must be met by the variable in the form of triples just like in Wikidata.

26
00:03:12,000 --> 00:03:27,000
We put the ?protein variable as the subject. Then, we choose a particular predicate (the type of relationship).

27
00:03:27,000 --> 00:03:36,000
We write wdt: then we press Ctrl+Space. Next, we start typing the name of the property.

28
00:03:36,000 --> 00:03:48,000
We choose the “instance of” property (P31).

29
00:03:48,000 --> 00:03:56,000
For the object, we write wd: then we press Ctrl+Space. After this, we write "protein" to find the corresponding item.

30
00:03:56,000 --> 00:04:03,000
Then, we choose the item and end the condition with a period (.). Finally, we run the query by clicking the blue button.

31
00:04:03,000 --> 00:04:21,000
The query can also be started by pressing Ctrl+Enter. Let's go.
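
As a rough sketch, the query dictated so far would look something like this. The item ID wd:Q8054 for "protein" is an assumption: the narration only says the item is picked from the autocomplete list, so it is worth checking in the editor. The wd: and wdt: prefixes are predeclared by query.wikidata.org.

```sparql
# Select every item that is an instance of (P31) protein.
# wd:Q8054 is assumed to be the "protein" item; confirm it with Ctrl+Space.
SELECT ?protein
WHERE {
  ?protein wdt:P31 wd:Q8054 .
}
```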

32
00:04:21,000 --> 00:04:33,000
That's it. We found 986793 proteins in eight seconds.

33
00:04:33,000 --> 00:04:39,000
You might say that this number is very large. You can limit the number of results

34
00:04:39,000 --> 00:04:48,000
by writing LIMIT followed by the sample size. For this example, we will restrict the number to 500.

35
00:04:48,000 --> 00:04:52,000
We obtain 500 results.
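
A minimal sketch of the limited query, assuming the same hypothetical item ID wd:Q8054 for "protein". LIMIT goes after the closing brace of WHERE.

```sparql
# Same query, capped at 500 result rows.
# wd:Q8054 is an assumed ID for "protein"; verify it in the editor.
SELECT ?protein
WHERE {
  ?protein wdt:P31 wd:Q8054 .
}
LIMIT 500
```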

36
00:04:52,000 --> 00:05:01,000
But, as you can see, the results are displayed using their language-independent Wikidata IDs.

37
00:05:01,000 --> 00:05:06,000
So how can we display item names in a given language?

38
00:05:06,000 --> 00:05:15,000
We write SERVICE then Ctrl+Space and we will find a line which begins with SERVICE wikibase:label.

39
00:05:15,000 --> 00:05:26,000
We eliminate everything that is not part of the chosen line.

40
00:05:26,000 --> 00:05:32,000
This line gives you the names of the items in a given language.

41
00:05:32,000 --> 00:05:38,000
Let's say we choose Russian: we put "ru" here. That's all.

42
00:05:38,000 --> 00:05:45,000
We run the query and see what happens. We find that nothing changes

43
00:05:45,000 --> 00:06:02,000
because we did not add a variable for the name. So, we need to add ?proteinLabel (variable name + Label) as a variable.

44
00:06:02,000 --> 00:06:18,000
That's all! We get the name of each protein in Russian.
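
The label step could be sketched as follows. The item ID wd:Q8054 is again an assumption, and the narration does not say whether LIMIT 500 is still in place at this point; it is kept here to keep the query fast. The wikibase: and bd: prefixes are predeclared by query.wikidata.org.

```sparql
# ?proteinLabel is filled in by the label service because its name is
# the variable name (?protein) followed by "Label".
SELECT ?protein ?proteinLabel
WHERE {
  ?protein wdt:P31 wd:Q8054 .   # assumed ID for "protein"
  SERVICE wikibase:label { bd:serviceParam wikibase:language "ru" . }
}
LIMIT 500
```

When an item has no Russian label, the service falls back to showing its ID in the label column.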

45
00:06:18,000 --> 00:06:30,000
Now, as you saw on Wikidata, there are several statement types defined per language.

46
00:06:30,000 --> 00:06:43,000
We have the label, description and aliases. How can we extract them?

47
00:06:43,000 --> 00:06:48,000
This time, we do not use SERVICE wikibase:label. Instead, we write a condition that begins with ?protein.

48
00:06:48,000 --> 00:06:59,000
Then, we write rdfs:label. Then, we define a variable for the name, called ?data.

49
00:06:59,000 --> 00:07:12,000
So, we write rdfs:label if we need the label, schema:description to get the description

50
00:07:12,000 --> 00:07:27,000
and skos:altLabel for aliases. We will choose rdfs:label for this example.

51
00:07:27,000 --> 00:07:38,000
We run the query. We find that the labels are not filtered by language.

52
00:07:38,000 --> 00:07:42,000
I want only labels in a given language (Russian) to be displayed.

53
00:07:42,000 --> 00:07:46,000
I will now start writing FILTER.

54
00:07:46,000 --> 00:08:04,000
Then, I will write (LANG(?data) = "ru"). That's all.

55
00:08:04,000 --> 00:08:18,000
Thus, we obtain only the labels in Russian.
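
A sketch of the filtered query described above, assuming wd:Q8054 is the "protein" item. Swapping rdfs:label for schema:description or skos:altLabel would return descriptions or aliases instead.

```sparql
# Fetch labels directly and keep only those tagged as Russian.
SELECT ?protein ?data
WHERE {
  ?protein wdt:P31 wd:Q8054 .   # assumed ID for "protein"
  ?protein rdfs:label ?data .
  FILTER(LANG(?data) = "ru")
}
```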

56
00:08:18,000 --> 00:08:24,000
Suppose you don't want only the labels. You also want the aliases.

57
00:08:24,000 --> 00:08:35,000
You must specify that the ?data variable satisfies one of two conditions.

58
00:08:35,000 --> 00:08:58,000
Normally, each condition is written on its own line and is combined with the other conditions by a logical AND.

59
00:08:58,000 --> 00:09:05,000
However, in our situation, we need two conditions linked by a logical OR.

60
00:09:05,000 --> 00:09:35,000
In such a situation, we use UNION. Thus, we will find the labels (rdfs:label) and the aliases (skos:altLabel) at the same time.

61
00:09:35,000 --> 00:10:04,000
Let's see what this gives. We find that some items now appear with two names.
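
One way to write the logical OR with UNION is sketched below. Each alternative goes in its own pair of braces; the item ID wd:Q8054 is an assumption, and the exact layout used in the video may differ.

```sparql
# ?data is bound either to a label or to an alias, then filtered to Russian.
SELECT ?protein ?data
WHERE {
  ?protein wdt:P31 wd:Q8054 .   # assumed ID for "protein"
  { ?protein rdfs:label ?data . }
  UNION
  { ?protein skos:altLabel ?data . }
  FILTER(LANG(?data) = "ru")
}
```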

62
00:10:04,000 --> 00:10:25,000
Now, you might want to display all variables without naming each one after SELECT.

63
00:10:25,000 --> 00:10:46,000
We could simply put an asterisk (*) after SELECT. That's all!

64
00:10:46,000 --> 00:11:02,000
I don't know if the query will take a long time to give its results.

65
00:11:02,000 --> 00:11:16,000
It's good! Here are the results with identifiers as well as protein labels and aliases on Wikidata!
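
With the asterisk, the query might look like this sketch (wd:Q8054 is an assumed ID for "protein"). SELECT * returns every variable bound in the WHERE block, here ?protein and ?data.

```sparql
# The asterisk stands for all variables used in the WHERE block.
SELECT *
WHERE {
  ?protein wdt:P31 wd:Q8054 .   # assumed ID for "protein"
  { ?protein rdfs:label ?data . }
  UNION
  { ?protein skos:altLabel ?data . }
  FILTER(LANG(?data) = "ru")
}
```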

66
00:11:16,000 --> 00:11:26,000
Anyway, I now want to determine the number of result rows.

67
00:11:26,000 --> 00:11:31,000
It should be 928 rows, but it would be interesting to see how to do this using SPARQL.

68
00:11:31,000 --> 00:11:34,000
It's simple. We use an aggregate function called COUNT.

69
00:11:34,000 --> 00:11:44,000
Aggregate functions generally allow you to generate statistics based on the results of a SPARQL query.

70
00:11:44,000 --> 00:11:54,000
So, we write COUNT(*) and run the query. We get nothing because we did not define the calculated expression as a variable.

71
00:11:54,000 --> 00:12:06,000
So we put the expression in parentheses and bind it to a variable using AS followed by the variable name (?count).

72
00:12:06,000 --> 00:12:36,000
We run it: 928. This is the expected result.
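
The counting step can be sketched like this, assuming wd:Q8054 for "protein". A bare COUNT(*) after SELECT is not valid; the aggregate has to be wrapped in parentheses and given a name with AS.

```sparql
# Count all result rows (labels and aliases in Russian).
SELECT (COUNT(*) AS ?count)
WHERE {
  ?protein wdt:P31 wd:Q8054 .   # assumed ID for "protein"
  { ?protein rdfs:label ?data . }
  UNION
  { ?protein skos:altLabel ?data . }
  FILTER(LANG(?data) = "ru")
}
```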

73
00:12:36,000 --> 00:12:52,000
Now, we want to calculate the number of names (labels and aliases) for each protein.

74
00:12:52,000 --> 00:13:09,000
So, we keep ?protein as a plain variable after SELECT and pass the other variables to COUNT as arguments.

75
00:13:09,000 --> 00:13:21,000
We do this since we want to calculate the number of names based on the proteins.

76
00:13:21,000 --> 00:13:28,000
Now, if we run it, we get an error saying that the query is malformed.

77
00:13:28,000 --> 00:13:41,000
What we should do is identify the variables that remain after SELECT outside the aggregate function. Here, we only have ?protein.

78
00:13:41,000 --> 00:14:24,000
We now write GROUP BY ?protein and the query becomes valid.
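
A sketch of the grouped query, under the assumptions that wd:Q8054 is the "protein" item and that ?data is the variable passed to COUNT. Every variable that appears after SELECT outside an aggregate must also be listed after GROUP BY.

```sparql
# Number of Russian names (labels plus aliases) per protein.
SELECT ?protein (COUNT(?data) AS ?count)
WHERE {
  ?protein wdt:P31 wd:Q8054 .   # assumed ID for "protein"
  { ?protein rdfs:label ?data . }
  UNION
  { ?protein skos:altLabel ?data . }
  FILTER(LANG(?data) = "ru")
}
GROUP BY ?protein
```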

79
00:14:24,000 --> 00:14:30,000
This takes time. Good, it took 34 seconds.

80
00:14:30,000 --> 00:14:44,000
In general, if the query needs more than 60 seconds to produce results, a timeout occurs and the query fails.

81
00:14:44,000 --> 00:14:58,000
Now we see the number of names (labels and aliases) in Russian for each item. But these numbers are not sorted.

82
00:14:58,000 --> 00:15:10,000
We can sort these numbers using ORDER BY at the end of the query.

83
00:15:10,000 --> 00:16:58,000
After ORDER BY, we can put ?count if the order is ascending or DESC(?count) if the order is descending. Be careful to write ?count and not ?data.
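
For instance, a descending sort could be sketched as below (wd:Q8054 and the use of ?data inside COUNT are assumptions). ORDER BY comes after GROUP BY and refers to the aggregated variable ?count.

```sparql
# Proteins with the most Russian names first.
SELECT ?protein (COUNT(?data) AS ?count)
WHERE {
  ?protein wdt:P31 wd:Q8054 .   # assumed ID for "protein"
  { ?protein rdfs:label ?data . }
  UNION
  { ?protein skos:altLabel ?data . }
  FILTER(LANG(?data) = "ru")
}
GROUP BY ?protein
ORDER BY DESC(?count)
```

Replacing DESC(?count) with ?count gives the ascending order.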

84
00:16:58,000 --> 00:17:51,000
Now, this data can be viewed in several formats: Histogram, Scatterplot, etc. Just click on the eye icon and choose the appropriate format.

85
00:17:51,000 --> 00:18:02,000
However, you must include the labels of the items for this option to be available.

86
00:18:02,000 --> 00:19:00,000
I'm going to add the SERVICE wikibase:label line as well as a ?proteinLabel variable that I'm going to add after both SELECT and GROUP BY.
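
The change described here might look like the following sketch, assuming wd:Q8054 for "protein" and ?data as the COUNT argument. Because ?proteinLabel appears after SELECT outside the aggregate, it has to be added to GROUP BY as well.

```sparql
# Same grouped count, with a Russian label column for the charts.
SELECT ?protein ?proteinLabel (COUNT(?data) AS ?count)
WHERE {
  ?protein wdt:P31 wd:Q8054 .   # assumed ID for "protein"
  { ?protein rdfs:label ?data . }
  UNION
  { ?protein skos:altLabel ?data . }
  FILTER(LANG(?data) = "ru")
  SERVICE wikibase:label { bd:serviceParam wikibase:language "ru" . }
}
GROUP BY ?protein ?proteinLabel
ORDER BY DESC(?count)
```

This version does more work than the previous one, so it may well exceed the time limit.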

87
00:19:00,000 --> 00:19:12,000
Hopefully the query takes less than 60 seconds so that we get results and avoid a timeout.

88
00:19:12,000 --> 00:20:02,000
You can click here to see when the system updated its data. This allows you to know if the results reflect the current state of Wikidata.

89
00:20:02,000 --> 00:20:18,000
Well, we can clearly see that what we feared happened. We had a timeout. No problem. That's how it works.

90
00:20:18,000 --> 00:20:46,000
I'm going to revert the query to how it was before this step.

91
00:20:46,000 --> 00:20:52,000
Sometimes you will need to integrate automatically updated results from the Wikidata Query Service into an application or system.

92
00:20:52,000 --> 00:21:18,000
This can be done by clicking Code, which generates code for including the results in an HTML page or in a Java, Python, JavaScript or R program.

93
00:21:18,000 --> 00:21:31,000
You can also download the results in JSON, TSV or CSV format by clicking Download.

94
00:21:31,000 --> 00:21:43,000
You can also click Link to generate a link to the results without the query field.

95
00:21:43,000 --> 00:22:02,000
If you want to generate a link to the query, you can click on the link icon on the left. This will generate a URL to the query.

96
00:22:02,000 --> 00:22:21,000
Those who are not familiar with Wikidata and SPARQL can draw inspiration from the examples by clicking on the folder icon.

97
00:22:21,000 --> 00:23:08,000
You can search for the example you want to see and adapt it by adding conditions and labels or changing the language or an item if necessary.

98
00:23:08,000 --> 00:23:36,000
As for aggregate functions, COUNT, which we used earlier, is not the only one. We also have MAX (maximum), MIN (minimum) and other functions.

99
00:23:36,000 --> 00:23:59,000
We can thus calculate the minimum number of inhabitants per country for a given continent. The principle is still based on the use of GROUP BY and the definition of variables.
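
As an illustration only, one reading of this example is the smallest country population on each continent. The identifiers below are assumptions that are not given in the narration (wd:Q6256 for "country", wdt:P30 for "continent", wdt:P1082 for "population") and should be checked with the editor's autocomplete.

```sparql
# Smallest country population per continent (hypothetical example).
SELECT ?continent (MIN(?population) AS ?minPopulation)
WHERE {
  ?country wdt:P31 wd:Q6256 .        # instance of: country (assumed ID)
  ?country wdt:P30 ?continent .      # continent (assumed property)
  ?country wdt:P1082 ?population .   # population (assumed property)
}
GROUP BY ?continent
```

Replacing MIN with MAX would give the largest population instead.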