English subtitles for clip: File:WikiConference RU - Wikidata Query Service Tutorial in Tunisian - Part 2.webm
1 00:00:00,000 --> 00:00:10,000 Hello everyone. In the first video, we illustrated the principles of the Wikidata Query Service.
2 00:00:10,000 --> 00:00:18,000 Now we will look at more advanced techniques for formulating SPARQL queries.
3 00:00:18,000 --> 00:00:30,000 I told you earlier that you can view results in several formats. One example is the geographic map.
4 00:00:30,000 --> 00:01:06,000 Let's start by writing a query. Let's extract the geographic coordinates of a city.
5 00:01:06,000 --> 00:01:44,000 Let us specify that the cities we want to visualize are located in Russia.
6 00:01:44,000 --> 00:02:02,000 We run the query. We find several cities.
7 00:02:02,000 --> 00:02:14,000 You can click on the eye icon to choose the geographic map as the viewing mode.
8 00:02:14,000 --> 00:02:32,000 If we select “Geographic map” in the drop-down list, we get the map. We wait a bit.
9 00:02:32,000 --> 00:03:00,000 The map could not be displayed. This is probably due to the missing labels of the Wikidata items.
10 00:03:00,000 --> 00:03:36,000 We will now add ?ville rdfs:label ?villeLabel (for the labels) and FILTER(LANG(?villeLabel)="ru") (so as to keep only the labels in Russian). We only keep ?villeLabel and ?coord after SELECT.
11 00:03:36,000 --> 00:03:56,000 We get almost the same result, now with the Russian labels.
12 00:03:56,000 --> 00:04:16,000 Let's see whether the geographic map will be displayed or not. Apparently not.
13 00:04:16,000 --> 00:04:36,000 The problem is the number of locations. We are going to limit the number of locations to 500.
14 00:04:36,000 --> 00:04:46,000 We check whether the geographic map can be generated. It works.
15 00:04:46,000 --> 00:04:56,000 We see that some locations are not in Russia. This is an error in Wikidata that needs to be corrected.
16 00:04:56,000 --> 00:05:06,000 This shows that SPARQL is useful not only for data extraction but also for validating Wikidata's structured data.
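For reference, the query built up in the cues above might look like the following sketch. The subtitles do not reproduce the full query text, so two things here are assumptions: the coordinate property is taken to be wdt:P625 (coordinate location), and no city-type condition is included because the one used at this point in the video is not stated. The country condition wdt:P17 wd:Q159 is the one quoted later in the tutorial. The sketch relies on the prefixes that the Wikidata Query Service predefines (wd:, wdt:, rdfs:).

```sparql
# Sketch only: wdt:P625 and the missing type condition are assumptions.
SELECT ?villeLabel ?coord WHERE {
  ?ville wdt:P17 wd:Q159 .            # country: Russia
  ?ville wdt:P625 ?coord .            # geographic coordinates
  ?ville rdfs:label ?villeLabel .     # labels in every language
  FILTER(LANG(?villeLabel) = "ru")    # keep only the Russian labels
}
LIMIT 500                             # keep the map small enough to render
```

Running this in the query service and choosing the map view from the eye icon should plot one point per returned coordinate.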
17 00:05:06,000 --> 00:05:32,000 This way, you can play with the visualizations to choose the most suitable format for the data.
18 00:05:32,000 --> 00:05:50,000 Let's move on. Suppose you want to extract cities that have geographic coordinates without needing to retrieve those coordinates.
19 00:05:50,000 --> 00:06:13,000 Just put [] as the object instead of ?coord.
20 00:06:13,000 --> 00:06:19,000 It works.
21 00:06:19,000 --> 00:06:40,000 Suppose you want to do the opposite and find cities without geographic coordinates. Just write FILTER NOT EXISTS and put the condition in curly brackets.
22 00:06:40,000 --> 00:07:01,000 Let's return to our previous query. I want to set the default visualization type for my results.
23 00:07:01,000 --> 00:07:12,000 Just write #defaultView: followed by the visualization type (in our case, Map) just before the query.
24 00:07:12,000 --> 00:07:21,000 Now, if I run the query, the visualization displayed by default is the geographic map of our results.
25 00:07:21,000 --> 00:07:49,000 Let's move on to something else. You see that we put each condition on a separate line.
26 00:07:49,000 --> 00:08:04,000 The Wikidata query service will check the conditions sequentially.
27 00:08:04,000 --> 00:08:16,000 You will certainly want to save time. You can move a closed condition (one that has a specific item as its object) into a separate subquery.
28 00:08:16,000 --> 00:08:35,000 The query service will first extract the results of the subquery and then check whether those results satisfy the remaining conditions of the main query.
29 00:08:35,000 --> 00:09:08,000 In some cases, you may find several closed conditions in the same query. Just put the most selective condition in the subquery.
30 00:09:08,000 --> 00:09:34,000 How can we do it? It's simple. We put WITH {} before WHERE. Between the braces, we put a SPARQL query which includes the closed condition.
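The techniques from this part can be combined in one sketch: the #defaultView:Map directive, the [] placeholder, and a named subquery introduced with WITH. The AS %villelist and INCLUDE parts are the ones walked through in the cues that follow. As before, wdt:P625 (coordinate location) is an assumption, since the subtitles do not name the coordinate property, and the exact split of conditions between subquery and main query in the video is not shown.

```sparql
#defaultView:Map
SELECT ?villeLabel ?coord
WITH {
  # Named subquery: the closed condition (fixed item as object) plus the limit.
  SELECT ?ville WHERE {
    ?ville wdt:P17 wd:Q159 ;    # country: Russia
           wdt:P625 [] .        # has coordinates; the value is not bound here
  }
  LIMIT 500
} AS %villelist
WHERE {
  INCLUDE %villelist .                # start from the subquery results
  ?ville wdt:P625 ?coord .            # now retrieve the coordinates
  ?ville rdfs:label ?villeLabel .
  FILTER(LANG(?villeLabel) = "ru")    # keep only the Russian labels
}
```

Replacing the `wdt:P625 []` pattern in the subquery with `FILTER NOT EXISTS { ?ville wdt:P625 [] }` would select items without coordinates instead; in that case the `?coord` pattern and the map view no longer apply and should be removed.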
31 00:09:34,000 --> 00:10:05,000 In our case, we can move LIMIT 500 to the subquery since we want to limit the number of cities.
32 00:10:05,000 --> 00:10:17,000 After closing the WITH brace, we write AS, then % and the name of the subquery (%villelist).
33 00:10:17,000 --> 00:10:38,000 Now we need to use the results of the subquery in the main query. We simply put INCLUDE followed by the name of the subquery as the first condition of the main query.
34 00:10:38,000 --> 00:10:46,000 You see that the query time is greatly reduced with almost the same result. That's all.
35 00:10:46,000 --> 00:11:06,000 You can define further subqueries using additional WITH clauses if you wish.
36 00:11:06,000 --> 00:11:28,000 Now I'm going to show you something entirely different. I'm going to define a new query.
37 00:11:28,000 --> 00:12:04,000 I specify that ?ville is a city (wdt:P31 wd:Q515) and that it is located in Russia (wdt:P17 wd:Q159). We run the query.
38 00:12:04,000 --> 00:12:40,000 Note the duration of the query. The city condition is probably inadequate. We remove it and rerun the query. We find almost 413,000 results.
39 00:12:40,000 --> 00:12:48,000 We click on one of the results to see how the cities are defined. This is one way to refine a query.
40 00:12:48,000 --> 00:13:10,000 We can clearly see that the item is an instance of a different class: town-city (Q7930989) instead of city (Q515). I copy the class ID.
41 00:13:10,000 --> 00:13:40,000 Good. I can change the interface to Russian so that you can follow. So, this is indeed another class to consider: town-city (Q7930989). That's why there were no results.
42 00:13:40,000 --> 00:14:16,000 We add that ?ville is a town-city (wdt:P31 wd:Q7930989). I'm going to rerun the query, and I want you to note the query time. We see that it is 0.4 seconds.
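Written out, the refined query from the last cue would be roughly the following. Both conditions and their identifiers are the ones quoted in the subtitles; only the layout and the choice to select just ?ville are assumptions.

```sparql
SELECT ?ville WHERE {
  ?ville wdt:P31 wd:Q7930989 .   # instance of the class copied from the item page
  ?ville wdt:P17 wd:Q159 .       # country: Russia
}
```

Swapping the two lines inside the braces gives the reordered variant whose run time is compared next.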
43 00:14:16,000 --> 00:14:24,000 Now we change the order of the conditions and compare the query time.
44 00:14:24,000 --> 00:15:06,000 We see that it took 0.8 seconds (twice as long). This shows that the order of the conditions (especially the order of the predicates) affects query time.
45 00:15:06,000 --> 00:15:33,000 This is due to how heavily the property in question is used on Wikidata (wdt:P31 is among the most used properties). The second likely explanation is the number of results matching each condition.
46 00:15:33,000 --> 00:16:00,000 So, you have to experiment to find the best order of conditions in a SPARQL query and save time.
47 00:16:00,000 --> 00:16:36,000 We have covered almost everything. Let's try something else.
48 00:16:36,000 --> 00:17:15,000 Sometimes we need to use federated queries, where we query Wikidata and another resource at the same time. This is interesting when the other resource is Wikipedia's MediaWiki API or an external resource such as a given ontology.
49 00:17:15,000 --> 00:18:06,000 In this context, I created a tool called Structured Categories, which generates SPARQL queries that find the most common statements of the Wikidata items included in a given Wikipedia category.
50 00:18:06,000 --> 00:18:56,000 I can show you how this works by browsing a French Wikipedia category. A link appears and takes me to the structured description of the category.
51 00:18:56,000 --> 00:19:26,000 The results of a SPARQL query are displayed. To see the query, there is a link at the bottom left that you can click on.
52 00:19:26,000 --> 00:19:38,000 As you see, we used SERVICE wikibase:mwapi to extract data from Wikipedia. This wikibase:mwapi can be replaced by the URI of an external resource to access an ontology or database.
53 00:19:38,000 --> 00:19:54,000 Then, we defined the conditions to extract the data from Wikipedia's MediaWiki API as if we were querying it with SPARQL.
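A federated query of the kind described here could be sketched as follows. This is not the query generated by the Structured Categories tool, which the subtitles do not reproduce. On the Wikidata Query Service the MediaWiki API is reached through SERVICE wikibase:mwapi; the category title is a hypothetical example, and the limit of 200 members and the counting of direct (wdt:) statements follow the steps described in the surrounding cues.

```sparql
# Sketch only: the category title below is a hypothetical example.
SELECT ?property ?value (COUNT(?item) AS ?count) WHERE {
  {
    # Step 1: category members from French Wikipedia that have a Wikidata item.
    SELECT ?item WHERE {
      SERVICE wikibase:mwapi {
        bd:serviceParam wikibase:endpoint "fr.wikipedia.org" ;
                        wikibase:api "Generator" ;
                        mwapi:generator "categorymembers" ;
                        mwapi:gcmtitle "Catégorie:Scientifique tunisien" ;
                        mwapi:gcmlimit "max" .
        ?item wikibase:apiOutputItem mwapi:item .
      }
      FILTER(BOUND(?item))   # skip pages without a Wikidata item
    }
    LIMIT 200
  }
  # Step 2: direct statements of those items, counted per property and value.
  ?item ?property ?value .
  FILTER(STRSTARTS(STR(?property), STR(wdt:)))
}
GROUP BY ?property ?value
ORDER BY DESC(?count)
LIMIT 50
```

Changing the `wikibase:endpoint` value and the category title would point the same sketch at another language edition.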
54 00:19:54,000 --> 00:19:58,000 Then, we selected the Wikipedia pages in the category that have matching Wikidata items and limited the query to 200 results.
55 00:19:58,000 --> 00:20:07,000 Finally, we determined the most frequent Wikidata statements among the 200 selected Wikidata items.
56 00:20:07,000 --> 00:20:18,000 For example, we found that the members of the Tunisian scientists category are subcategories or people born in Tunisia, etc.
57 00:20:18,000 --> 00:20:55,000 You can change the language to Russian instead of French. This would be very useful for helping newcomers to a given language make sense of the Wikipedia category graph.
58 00:20:55,000 --> 00:21:20,000 I personally believe that federated queries represent the future of Wikidata's query service and can help generate new and unexpected applications of our database: interaction with other resources, cross-validation and cross-enrichment.
59 00:21:20,000 --> 00:21:39,000 That's all I have to present for this tutorial. Just a word about our team. We are a research unit that works mainly on semantic technologies in Tunisia.
60 00:21:39,000 --> 00:22:02,000 We work at the University of Sfax. Our unit is called Data Engineering and Semantics and it was created in 2021. We work on several areas including the semantic web, open science and Wikimedia projects.
61 00:22:02,000 --> 00:22:16,000 We are among the first to work on research related to Wikimedia projects and we are very open to any collaboration with the community.
62 00:22:16,000 --> 00:22:55,000 Anyone who wishes to work with us is encouraged to contact us through the director of our unit, Dr. Mohamed Ben Aouicha, at his email address (mohamed.benaouicha@fss.usf.tn). Thank you and good day.