English subtitles for clip: File:WikiConference RU - Wikidata Query Service Tutorial in Tunisian - Part 2.webm

1
00:00:00,000 --> 00:00:10,000
Hello everyone. In the first video, we covered the basic principles of the Wikidata Query Service.

2
00:00:10,000 --> 00:00:18,000
Now we will look at more advanced techniques for writing SPARQL queries.

3
00:00:18,000 --> 00:00:30,000
I told you earlier that you can view results in several formats. An example is the geographic map.

4
00:00:30,000 --> 00:01:06,000
Let's start by writing a query. Let's extract the geographic coordinates of a city.

5
00:01:06,000 --> 00:01:44,000
Let us specify that the cities we want to visualize are found in Russia.

6
00:01:44,000 --> 00:02:02,000
We run the query. It returns several cities.

7
00:02:02,000 --> 00:02:14,000
You can click on the eye to choose the geographic map as a viewing mode.

8
00:02:14,000 --> 00:02:32,000
If we press “Geographic map” in the drop-down list, we find the map. We wait a bit.

9
00:02:32,000 --> 00:03:00,000
The map could not be displayed. This is probably because the labels of the Wikidata items are missing.

10
00:03:00,000 --> 00:03:36,000
We will now add ?ville rdfs:label ?villeLabel (for the labels) and FILTER(LANG(?villeLabel)="ru") (so as to keep only the labels in Russian). We only keep ?villeLabel and ?coord after SELECT.

11
00:03:36,000 --> 00:03:56,000
We get almost the same result, now with the Russian labels.

12
00:03:56,000 --> 00:04:16,000
We now see whether the geographic map will be displayed or not. Apparently not.

13
00:04:16,000 --> 00:04:36,000
The problem is the number of cities. What we are going to do is limit the number of cities to 500.
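The query built up to this point can be sketched as follows. This is a minimal sketch, not the exact query typed in the video: the helper name russian_cities_query is hypothetical, wdt:P625 (coordinate location) is an assumption since the property is never named in the subtitles, and wd:Q515 (city) and wd:Q159 (Russia) are taken from later in this tutorial.

```python
def russian_cities_query(limit=500):
    """Assemble the SPARQL text for cities in Russia with Russian labels and coordinates."""
    lines = [
        "SELECT ?villeLabel ?coord WHERE {",
        "  ?ville wdt:P31 wd:Q515 .",        # instance of: city
        "  ?ville wdt:P17 wd:Q159 .",        # country: Russia
        "  ?ville wdt:P625 ?coord .",        # coordinate location (assumed property)
        "  ?ville rdfs:label ?villeLabel .", # label in any language
        '  FILTER(LANG(?villeLabel) = "ru")',  # keep only the Russian labels
        "}",
        "LIMIT " + str(limit),               # cap the results so the map can render
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Print the text so it can be pasted into the Wikidata Query Service editor.
    print(russian_cities_query())
```

The limit is a parameter so that the value 500 used here can be lowered if the map still fails to render.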

14
00:04:36,000 --> 00:04:46,000
We check if the geographic map can be generated. That works.

15
00:04:46,000 --> 00:04:56,000
We see that some of the places shown are not in Russia. This is an error in the Wikidata data that needs to be corrected.

16
00:04:56,000 --> 00:05:06,000
This proves that SPARQL is not only useful for data extraction but also for validation of Wikidata structured data.

17
00:05:06,000 --> 00:05:32,000
This way, you can play with the visualizations to choose the most suitable format for the data.

18
00:05:32,000 --> 00:05:50,000
Let's move on. Suppose you want to extract cities that have geographic coordinates without having to know those coordinates.

19
00:05:50,000 --> 00:06:13,000
Just put [] as an object instead of ?coord.
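As a rough illustration, the [] object can be used as in the sketch below. The helper name and the use of wdt:P625 for the coordinates are assumptions; the subtitles do not name the property.

```python
def cities_with_coordinates_query():
    """SPARQL text that matches cities having some coordinates, without binding them."""
    lines = [
        "SELECT ?ville WHERE {",
        "  ?ville wdt:P31 wd:Q515 .",  # instance of: city
        "  ?ville wdt:P17 wd:Q159 .",  # country: Russia
        "  ?ville wdt:P625 [] .",      # [] matches any value and binds no variable
        "}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(cities_with_coordinates_query())
```

Because [] binds no variable, ?coord no longer appears in the SELECT clause.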

20
00:06:13,000 --> 00:06:19,000
It works.

21
00:06:19,000 --> 00:06:40,000
You want to do the opposite and find cities without geographic coordinates. Just write FILTER NOT EXISTS and put the condition in curly brackets.
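A minimal sketch of the opposite query, under the same assumptions (hypothetical helper name, wdt:P625 assumed for the coordinates):

```python
def cities_without_coordinates_query():
    """SPARQL text that keeps only cities for which no coordinate statement exists."""
    lines = [
        "SELECT ?ville WHERE {",
        "  ?ville wdt:P31 wd:Q515 .",  # instance of: city
        "  ?ville wdt:P17 wd:Q159 .",  # country: Russia
        "  FILTER NOT EXISTS { ?ville wdt:P625 [] . }",  # condition goes inside the braces
        "}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(cities_without_coordinates_query())
```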

22
00:06:40,000 --> 00:07:01,000
Let's return to our previous query. I want to set the default visualization type for my results.

23
00:07:01,000 --> 00:07:12,000
Just write #defaultView: followed by the visualization type (In our situation, Map) just before the query.

24
00:07:12,000 --> 00:07:21,000
Now, if I run the query, the visualization that will be displayed by default is the geographic map of our results.
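The default view is just a comment on the first line of the query text. The helper below is a hypothetical illustration that prepends it to any query string; the sample query passed to it is shortened and uses wdt:P625 as an assumed coordinate property.

```python
def with_default_view(query, view="Map"):
    """Prepend the #defaultView comment so the result opens in the given view."""
    return "#defaultView:" + view + "\n" + query


if __name__ == "__main__":
    sample = "SELECT ?ville ?coord WHERE { ?ville wdt:P17 wd:Q159 ; wdt:P625 ?coord . } LIMIT 500"
    print(with_default_view(sample))
```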

25
00:07:21,000 --> 00:07:49,000
Let's move on to another thing. You see that we put each condition in a separate line.

26
00:07:49,000 --> 00:08:04,000
What the Wikidata Query Service does is check the conditions one after another.

27
00:08:04,000 --> 00:08:16,000
You definitely want to save time. You can move a closed condition (having a particular element as its object) into a separate subquery.

28
00:08:16,000 --> 00:08:35,000
What will happen is that the query service will extract the results according to the subquery and then it will check if the results of the subquery satisfy the conditions of the query.

29
00:08:35,000 --> 00:09:08,000
In some cases, you may find several closed conditions in the same query. Just put the most selective condition in the subquery.

30
00:09:08,000 --> 00:09:34,000
How can we do it? It's simple. We put WITH {} before WHERE. Between the braces, we put a SPARQL query which includes the closed condition.

31
00:09:34,000 --> 00:10:05,000
For our situation, we can move LIMIT 500 to the subquery since we want to limit the number of cities.

32
00:10:05,000 --> 00:10:17,000
After closing the WITH brace, we write AS then % and the name of the subquery (%villelist).

33
00:10:17,000 --> 00:10:38,000
Now we need to call the results of the subquery in the main query. We simply put INCLUDE and the name of the subquery as the first condition of the main query.
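Putting the WITH, AS and INCLUDE steps together, the rewritten query could look like the sketch below. The subquery name %villelist comes from the tutorial; the helper name, the choice of which conditions go into the subquery, and wdt:P625 for the coordinates are assumptions.

```python
def named_subquery_query(limit=500):
    """SPARQL text using a named subquery (WITH ... AS %villelist) and INCLUDE."""
    lines = [
        "SELECT ?villeLabel ?coord",
        "WITH {",
        "  SELECT ?ville WHERE {",
        "    ?ville wdt:P31 wd:Q515 .",  # closed condition: object is a fixed item
        "    ?ville wdt:P17 wd:Q159 .",  # closed condition: country is Russia
        "  }",
        "  LIMIT " + str(limit),         # the limit now applies to the list of cities
        "} AS %villelist",
        "WHERE {",
        "  INCLUDE %villelist",          # first condition: reuse the subquery results
        "  ?ville wdt:P625 ?coord .",    # assumed coordinate property
        "  ?ville rdfs:label ?villeLabel .",
        '  FILTER(LANG(?villeLabel) = "ru")',
        "}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(named_subquery_query())
```

Note that WITH ... AS and INCLUDE are extensions offered by the engine behind the Wikidata Query Service, not part of standard SPARQL, so this text may not run on other endpoints.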

34
00:10:38,000 --> 00:10:46,000
You see that the query time is greatly reduced with almost the same result. That's all.

35
00:10:46,000 --> 00:11:06,000
You can of course define further subqueries with additional WITH clauses. That is up to you.

36
00:11:06,000 --> 00:11:28,000
Now I'm going to show you something completely different. I'm going to define a new query.

37
00:11:28,000 --> 00:12:04,000
I will state that ?ville is a city (wdt:P31 wd:Q515) and that it is located in Russia (wdt:P17 wd:Q159). We run the query.

38
00:12:04,000 --> 00:12:40,000
I'll let you see how long the query takes. The city condition is probably not the right one. We remove it and rerun the query. We find almost 413,000 results.

39
00:12:40,000 --> 00:12:48,000
What we're going to do is we're going to click on one of the results to see how the cities are defined. This is one of the methods to refine a query.

40
00:12:48,000 --> 00:13:10,000
We can clearly see that the item has a different type: city or town (Q7930989) instead of city (Q515). I copy the class ID.

41
00:13:10,000 --> 00:13:40,000
Good. I can change the interface to Russian so that you can follow. So, this is indeed another class to consider: city or town (Q7930989). That's why there were no results.

42
00:13:40,000 --> 00:14:16,000
What we're going to do is add that ?ville is a city or town (wdt:P31 wd:Q7930989). I'm going to rerun the query and I want you to look at the query time. We see that it is 0.4 seconds.

43
00:14:16,000 --> 00:14:24,000
Now we're going to change the order of the conditions and compare the query time.

44
00:14:24,000 --> 00:15:06,000
We see that it took 0.8 seconds (double). This suggests that the order of the conditions (especially the order of the predicates) affects query time.
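The two variants being compared differ only in the order of their conditions, as in this sketch. The helper name is hypothetical; the two conditions are the ones stated in the tutorial. The timings of 0.4 and 0.8 seconds are those observed in the video and will vary from run to run.

```python
TYPE_CONDITION = "?ville wdt:P31 wd:Q7930989 ."  # instance of: city or town
COUNTRY_CONDITION = "?ville wdt:P17 wd:Q159 ."   # country: Russia


def ordered_query(conditions):
    """Build a SELECT query whose conditions appear in the given order."""
    body = "\n".join("  " + condition for condition in conditions)
    return "SELECT ?ville WHERE {\n" + body + "\n}"


# Same conditions, two different orders, to be timed one after the other.
type_first = ordered_query([TYPE_CONDITION, COUNTRY_CONDITION])
country_first = ordered_query([COUNTRY_CONDITION, TYPE_CONDITION])

if __name__ == "__main__":
    print(type_first)
    print()
    print(country_first)
```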

45
00:15:06,000 --> 00:15:33,000
One explanation is how heavily the property in question is used on Wikidata (wdt:P31 is among the most used properties). A second likely explanation is the number of results matching each condition.

46
00:15:33,000 --> 00:16:00,000
So, you have to experiment to find the best order for the conditions of a SPARQL query and save time.

47
00:16:00,000 --> 00:16:36,000
We covered almost everything. Let's try something else.

48
00:16:36,000 --> 00:17:15,000
Sometimes we need to use federated queries, where we query Wikidata and another resource at the same time. The other resource can be, for example, Wikipedia's MediaWiki API or an external resource such as a given ontology.

49
00:17:15,000 --> 00:18:06,000
In this context, I created a tool called Structured Categories. It generates SPARQL queries that find the most common statements of the Wikidata items included in a given Wikipedia category.

50
00:18:06,000 --> 00:18:56,000
I can show you how this works by browsing a French Wikipedia category. There is a link that appears and takes me to the structured description of the category.

51
00:18:56,000 --> 00:19:26,000
The results of a SPARQL query are displayed. To see the query, there is a link at the bottom left that you can click on.

52
00:19:26,000 --> 00:19:38,000
As you see, we used the MediaWiki API service (SERVICE wikibase:mwapi) to extract data from Wikipedia. This service can be replaced by the URI of an external resource to access an ontology or database.

53
00:19:38,000 --> 00:19:54,000
Then, we defined the conditions to extract the data from Wikipedia's MediaWiki API as if querying it with SPARQL.

54
00:19:54,000 --> 00:19:58,000
Then, we selected the Wikipedia pages in the category with matching Wikidata elements and limited the query to 200 results.

55
00:19:58,000 --> 00:20:07,000
Finally, we determined the most frequently recurring Wikidata statements among the 200 selected Wikidata items.
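The three steps just described can be approximated by the sketch below. It is not the query generated by Structured Categories, which is not reproduced in these subtitles. It assumes the MediaWiki API service is the one the Wikidata Query Service exposes as wikibase:mwapi, that the category title contains no double quotes, and that only direct (wdt:) statements are counted. The helper name and the sample category title are hypothetical.

```python
def category_statements_query(category_title, endpoint="fr.wikipedia.org", limit=200):
    """SPARQL text: list category members via the MediaWiki API, then count their statements."""
    lines = [
        "SELECT ?property ?value (COUNT(?item) AS ?count) WHERE {",
        "  {",
        "    SELECT ?item WHERE {",
        "      SERVICE wikibase:mwapi {",
        '        bd:serviceParam wikibase:endpoint "' + endpoint + '" ;',
        '                        wikibase:api "Generator" ;',
        '                        mwapi:generator "categorymembers" ;',
        '                        mwapi:gcmtitle "' + category_title + '" .',
        "        ?item wikibase:apiOutputItem mwapi:item .",  # matching Wikidata item
        "      }",
        "    }",
        "    LIMIT " + str(limit),  # keep at most this many pages of the category
        "  }",
        "  ?item ?property ?value .",  # every statement of each selected item
        "  FILTER(STRSTARTS(STR(?property), STR(wdt:)))",  # direct statements only
        "}",
        "GROUP BY ?property ?value",
        "ORDER BY DESC(?count)",  # most frequent statements first
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical category title, used only for illustration.
    print(category_statements_query("Catégorie:Scientifique tunisien"))
```

Changing the endpoint argument to another language edition, for example "ru.wikipedia.org", points the same query at that Wikipedia.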

56
00:20:07,000 --> 00:20:18,000
For example, we found that the items in the Tunisian scientists category are mostly subcategories or people born in Tunisia, etc.

57
00:20:18,000 --> 00:20:55,000
You can change the language to Russian instead of French. This would be very useful for helping people who are new to a given language make sense of its Wikipedia category graph.

58
00:20:55,000 --> 00:21:20,000
I personally believe that federated queries represent the future of the Wikidata Query Service and can help generate new and unexpected applications of our database: interaction with other resources, cross-validation and cross-enrichment.

59
00:21:20,000 --> 00:21:39,000
That's all I have to present for this tutorial. Just a word about our team. We represent a research unit which works mainly on semantic technologies in Tunisia.

60
00:21:39,000 --> 00:22:02,000
We work at the University of Sfax. Our unit is called Data Engineering and Semantics and it was created in 2021. We work in several areas, including the semantic web, open science and Wikimedia projects.

61
00:22:02,000 --> 00:22:16,000
We are among the first to work on research related to Wikimedia projects and we are very open to any collaboration with the community.

62
00:22:16,000 --> 00:22:55,000
Anyone who wishes to work with us is encouraged to contact us through the director of our unit Dr. Mohamed Ben Aouicha through his email (mohamed.benaouicha@fss.usf.tn). Thank you and good day.