Commons:GLAMwiki Toolset Project/NARA analytics pilot
The NARA analytics pilot is a pilot project to produce the first image usage analytics of a GLAM upload. This is a quick and dirty approach done at the Zürich Hackathon 2014. Beware, the data and analytics didn't go through any qualilty control so it might be full of errors.
History
[edit]At the moment the Wikimedia Foundation collects page view statistics, but no image view statistics. This means we don't know how many views our images and other media files get, we try to derive it based on page views. A long time ago someone enabled logging of image views for NARA images. The data has been collecting for several years, but nothing has been done with the data yet. Unfortunately the data was rotated (old files deleted) so the dataset we worked on was from 2014-02-05 to 2014-05-09. The logs are sampled 1:10 so as a rule of thumb you night to multiple the numbers with 10. We made an overview of requirements for usage and re-usage statistics for GLAM content so we know what questions we want to answer.
Steps
[edit]- udplog collects NARA data. The Raw data format is at wikitech:Analytics/Data_access#Request_logs
- This data is imported to Hadoop on stat1002.eqiad.wmnet to the table webrequest_glam_nara on database Otto (thanks Andrew)
- We do a query to clean up the data and to get the same image names (multiple url's for the same image, I hope we solved the fileviews problem)
CREATE TABLE webrequest_glam_nara_cleaned(
dt string COMMENT 'from deserializer',
country_code string COMMENT 'from deserializer',
uri string COMMENT 'from deserializer',
content_type string COMMENT 'from deserializer',
referer string COMMENT 'from deserializer')
--PARTITIONED BY (
-- year int,
-- month int,
-- day int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
INSERT INTO TABLE
webrequest_glam_nara_cleaned
SELECT
dt, country_code, regexp_replace(uri, 'https?://upload.wikimedia.org/wikipedia/commons(/thumb)?/(\\w/\\w\\w)/([^\/]+)(.*)', '//upload.wikimedia.org/wikipedia/commons/$2/$3'), content_type, referer
FROM
webrequest_glam_nara
WHERE
year=2014
AND
(uri LIKE 'http://upload.wikimedia.org/%' OR uri like 'https://upload.wikimedia.org/%');
First results
[edit]Total number of views
[edit]SELECT COUNT(*) FROM webrequest_glam_nara_cleaned;
Over the time period we had a total of 21,547,979 hits. So, taking into account the 1:10 sampling, this means roughly 200 million hits in 3 months. We could split this out by day and make a histogram
Views per day
[edit]"Give me a daily overviewcount of the views my media files had (FileViews)"
See dataset and visualisation.
There is a spike on March 24. Further analysis shows that the biggest referral on that day is Dorothy Height. Turns out this lady was featured on a Google Doodle on that day.
Views per month
[edit]"Give me a monthly overview of the views my media files had (FileViews)"
Period | Views |
---|---|
2014-02 (from 05) | 5,394,192 |
2014-03 | 7,222,593 |
2014-04 | 6,848,673 |
2014-05 (end 09) | 2,082,521 |
The most popular images
[edit]SELECT
count(*) as freq, uri
FROM
webrequest_glam_nara_cleaned
GROUP BY
uri
ORDER BY
freq desc
LIMIT 20;
-
611595
-
354823
-
347238
-
345355
-
342920
-
295592
-
286936
-
233973
-
233663
-
230181
-
226200
-
189118
-
183370
-
181799
-
179708
-
168704
-
163783
-
159375
-
154633
-
145038
Most popular images by month
[edit]- February 2014
-
150216
-
100254
-
95545
-
90382
-
89720
-
89329
-
72311
-
64554
-
59062
-
55868
-
55429
-
54199
-
48359
-
43553
-
42421
-
41714
-
39363
-
38651
-
38400
-
35389
- March 2014
-
199393
-
119917
-
119841
-
118791
-
118756
-
92688
-
92311
-
88618
-
78705
-
77044
-
70191
-
63332
-
60841
-
60171
-
58744
-
55575
-
54875
-
52928
-
52724
-
50844
- April 2014
-
198674
-
132821
-
106091
-
103559
-
102527
-
101886
-
97024
-
84510
-
77049
-
70581
-
70290
-
63691
-
63044
-
59402
-
52567
-
51400
-
49990
-
48758
-
48550
-
46740
Imageviews by country
[edit]SELECT
country_code as country, count(*) as freq
FROM
webrequest_glam_nara_cleaned
GROUP BY
country_code
ORDER BY
freq desc
LIMIT 2000;
We could make a heatmap of this one.
country | freq |
---|---|
US | 10806780 |
GB | 1338132 |
XX | 1050605 |
DE | 879286 |
CA | 822878 |
JP | 566449 |
FR | 479325 |
AU | 459854 |
IN | 370386 |
NL | 293125 |
RU | 235243 |
MX | 226716 |
IT | 217636 |
CN | 199184 |
ES | 194763 |
BR | 144575 |
PH | 134343 |
SE | 126166 |
PL | 113083 |
EU | 109736 |
CH | 104411 |
AT | 103840 |
IE | 93439 |
CO | 81917 |
NZ | 81675 |
SG | 79750 |
NO | 78881 |
TR | 75130 |
Imageviews per country per month
[edit]Give me a monthly overview of the provenance of the users that see my media files
- February 2014
country | freq |
---|---|
US | 2710846 |
GB | 343203 |
XX | 248388 |
DE | 224909 |
CA | 203417 |
JP | 175721 |
FR | 126888 |
AU | 103175 |
IN | 96543 |
MX | 65935 |
NL | 65767 |
IT | 60654 |
ES | 55110 |
RU | 50584 |
CN | 40986 |
PH | 39484 |
SE | 30751 |
BR | 30243 |
EU | 28404 |
CH | 25599 |
- March 2014
country | freq |
---|---|
US | 3540291 |
GB | 465823 |
XX | 351699 |
DE | 303557 |
CA | 273932 |
JP | 189155 |
FR | 163550 |
AU | 152841 |
IN | 123431 |
NL | 105033 |
RU | 91192 |
MX | 84487 |
CN | 73168 |
IT | 68658 |
ES | 65726 |
PH | 51155 |
BR | 46257 |
SE | 43836 |
PL | 41642 |
EU | 37951 |
- April 2014
country | freq |
---|---|
US | 3476374 |
GB | 404892 |
XX | 344029 |
CA | 269814 |
DE | 267976 |
AU | 158220 |
JP | 154575 |
FR | 145757 |
IN | 117831 |
NL | 96214 |
RU | 74868 |
IT | 67890 |
CN | 64992 |
MX | 58800 |
ES | 57575 |
BR | 47004 |
SE | 39736 |
PL | 36828 |
PH | 33647 |
EU | 33524 |
Top referers
[edit]SELECT
referer, count(*) as freq
FROM
webrequest_glam_nara_cleaned
GROUP BY
referer
ORDER BY
freq
DESC
LIMIT 20;
Top external referers
[edit]SELECT
referer, count(*) as freq
FROM
webrequest_glam_nara_cleaned
WHERE
referer not like '%wiki%'
GROUP BY
referer
ORDER BY
freq
DESC
LIMIT 20;
referer | freq |
---|---|
https://www.google.com/ | 136984 |
http://www.google.com/ | 47612 |
http://www.google.com/blank.html | 33039 |
http://hainn12.blogspot.com/ | 12769 |
https://www.google.co.uk/ | 11585 |
http://www.reddit.com/ | 7940 |
https://www.google.ca/ | 7713 |
https://www.google.co.in/ | 4959 |
http://www.google.co.uk/ | 4165 |
https://www.google.fr/ | 3816 |
https://www.google.com.au/ | 3708 |
https://www.google.com.mx/ | 3228 |
https://www.google.es/ | 3182 |
http://hainn8x12.blogspot.com/ | 3128 |
https://www.google.it/ | 2687 |
https://www.google.co.jp/ | 2646 |
https://www.google.de/ | 2565 |
http://www.google.de/ | 2508 |
http://www.google.co.uk/blank.html | 2460 |
(removed the none one)
Top external referers (without Google)
[edit]SELECT
referer, count(*) as freq
FROM
webrequest_glam_nara_cleaned
WHERE
referer not like '%wiki%' and referer not like '%google%'
GROUP BY
referer
ORDER BY
freq
DESC
LIMIT 20;
Source code
[edit]See https://github.com/Commonists/limn-glam for a visualization code base which was based on now deprecated Limn.