Commons talk:Structured data/Modeling/Source

From Wikimedia Commons, the free media repository

These earlier notes and resources may be inspiring:

Multichill (talk) 18:36, 12 September 2019 (UTC)

Source from Wikimania 2019


Copy here for future reference.

Get the data! If we look at 100,000 random images, what is in the source field?

Immediate source of image

  • Own work --> only for photos of people & places ?
    • distinguish photos/scans of their own artwork
    • created digital original drawings/artwork
    • created diagrams -- software used ?
Easy enough to create a Q-item for "original creation by uploader" (now created as original creation by uploader (Q66458942)) as value for a master "Source of file": property ---- NB: Temp property P828 used for this role. Will need to be replaced.
A bot might identify cases that look dubious, and mark them with a qualifier
BUT -- if we have this model with a top-level statement we can't have any second level of qualifiers to clarify the nature of statements being made in first-level qualifiers -- eg one might want "applies to part", or "sourcing circumstances", or to distinguish immediate vs ultimate source URL
  • From the internet
    Q-item for "file available on the internet", with qualifiers specifying detailed provenance
    Q-item for "user modification of file available on the internet"
    • Which property will point to this Q item? A new property: Source, taking a value indicating the nature of the source, with qualifiers adding further info. A "source" statement of this kind should become mandatory, with a limited closed vocabulary of possible values. Make upload wizard enforce the making of a choice.


    • Commons best practice: URL for image + URL for description by source --> two qualifiers for this ?
      ADDED: The "description by source" URL might be well handled by described at URL (P973) as a separate main statement.
      ISSUE: We might *only* have the description page URL -- and it might no longer exist. So we might need to specify that an image used to exist at a particular institution (or website), but not be able to say what the URL used to be.
    • In practice might have:
      • Some url
      • Some url with a description
      • Some url with a source site (url + Flickr, or Europeana, Internet Archive Books)
        • Identifier properties are subclass of source url
          • --> Q. Do we want to start minting new properties for identifiers from such sites, or just use URLs as per others. What are pros and cons ? Is this a workaround for not being able to find URLs that start with .... in SPARQL (because the indexing isn't there?)
    • What sorts of free text do we find in the source fields ?
      -- maybe this is the last 20% we should try to capture, after we've got the easiest 80%. But how to assess/record completeness of extraction from the source field ?
  • Sources which are offline, but eg which have been scanned
    • eg images from art books --> full bibliographic
  • See also other version section for derived works
    • Q-item as top-level value to indicate "derived from file or files on Commons"  ?
      • Comment: "Other version" is only relevant if we host the work(s) that the file was derived from. But we may not. eg scan of a page from a book, diagrams based on a diagram in a book (simple enough so no copyright), photograph of a copyright-expired painting, a photo of a dress based on a Mondrian painting --> "based on" property ?

Also: some operations -- rotation, colour modification, cropping, etc may have been undertaken by user prior to upload.

-- so distinguish "scan of image" from "user-modified scan of image" in top-level source statement ?
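A rough way to start on the "what is in the source field?" survey is to bucket raw source-field text before modeling anything. The sketch below is only an illustration: the bucket names, the regular expressions and the sample strings are assumptions made here, not anything agreed at the workshop, and real source fields will be far messier.

```python
import re
from collections import Counter

# Assumed patterns: the {{own}} template or the literal words "own work".
OWN_WORK = re.compile(r"\{\{\s*own(\s+work)?\s*\}\}|\bown work\b", re.IGNORECASE)
# Any http(s) link anywhere in the field counts as "url".
URL = re.compile(r"https?://\S+", re.IGNORECASE)


def classify_source(text: str) -> str:
    """Return a rough bucket for the raw text of a source field."""
    text = text.strip()
    if not text:
        return "empty"
    if OWN_WORK.search(text):
        return "own_work"
    if URL.search(text):
        return "url"
    # Everything else is the free-text remainder discussed above.
    return "free_text"


def tally(fields):
    """Count how many source fields fall into each bucket."""
    return Counter(classify_source(f) for f in fields)


# Hypothetical sample values, for illustration only.
sample = ["{{own}}", "Own work", "https://example.org/photo/1", "Scan from a book", ""]
print(tally(sample))
```

The share of files landing in "free_text" would give one crude measure of how much of the source field has not yet been extracted.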

Source of things shown within the image


(eg : a photo of a 2D collage of objects)

Esp. important because these things may have different copyright status
-- qualifiers below "depicts" statement ?
-- how to indicate things if there is no obvious Q-item for something in the image, but nevertheless one wants to identify it & record information relating to it? Should a "depicts" = "somevalue" statement be created to record information about particular parts of the image ?

Will often be handled by the Q-items for the value(s) of the depicts statements

Other

  • copyright checkers may be closely tied to source: should the statements be similarly related -- or is it enough to put verification info as a qualifier or reference on the copyright status ? Will SDC even have/display references ?

Metadata has provenance too


eg {{BL cat credit}}

-- on Wikidata we would indicate this in references, statement by statement. But will Commons have references?

End of copy. Multichill (talk) 18:36, 12 September 2019 (UTC)

Simple own work source


@Jheald: you and some others worked on this during Wikimania, right? I would like to focus on a specific case to see if we can solve that: Own work uploads. Like for example the files uploaded as part of Wiki Loves Monuments. Would it be as simple as "new property: Source of file" -> original creation by uploader (Q66458942)? Combined with author and license it would mean we can start converting some data. Multichill (talk) 18:02, 17 September 2019 (UTC)

Hi @Multichill: thanks for pinging me. Yes, exactly. The strong conclusion I got from that workshop was the usefulness of a top-level property "Nature of source of material", taking as values a very small number of different generic types of origin that a Commons file could have. All Commons files would eventually be expected to have a statement of this kind. For material with some kinds of origin (eg "taken from the internet"), one would then expect further statements to give details of where from and when, etc. But the simplest case would be own work, for which I created the value original creation by uploader (Q66458942); as used eg on File:Petra_Al-Kaznah_by_Night.jpg, using a has cause (P828) property as a stop-gap until the new property was proposed and created. Ideally, the statement should also have a reference (eg imported from: file description page, with date) -- statements like this need provenance, I think: we should say where they have come from, if we're doing a full-scale roll-out (other values might be eg "declared by author via Upload Wizard", etc).
Unfortunately I see that d:User:MisterSynergy has since deleted Q66458942, but I've asked him to restore it. Jheald (talk) 15:47, 18 September 2019 (UTC)
@Multichill: One question that might be worth a thought is whether the property should be just "Nature of source of material", or whether it would make sense to also combine in "Nature of material" -- so whether values should just be "original creation by uploader", or whether it would make sense for the value to be eg "original photograph by uploader" / "original drawing by uploader" / "original sculpture made and photographed by uploader" etc. I come and go between which of the two I prefer. On the one hand there is a certain discipline in trying to identify conceptual orthogonality and then represent it with orthogonal properties. On the other hand, the more specific declarations about the nature of the material may bring out more honest statements, and the greater specificity and concreteness may be easier for some people. In practical terms, by making all of the latter classes subclasses of "original creation by uploader", the same "nature of source" information would be easy enough to extract either way, under either approach, whether for templates or querying or whatever. I oscillate as to which of the two approaches would be better to go for. Jheald (talk) 08:51, 19 September 2019 (UTC)
@Jheald: I let this sink in a bit. If we look at the current situation, we care on Commons about the immediate source: taken and uploaded yourself, transfer from some other wiki, taken from Flickr, from some museum website, etc.
Right now, that's what I would like to model. I would probably like to call the property "source of file" to keep it generic as we do right now. Once we want to model immediate source and underlying source, we can just use some qualifiers. That way in easy situations we just have a clear statement, but we also keep the ability to model more complex situations. Do you agree? I'm probably just going to propose a new property to complete the basic information properties that are currently mandatory ({{No source}}, {{No author}} & {{No license}}). Multichill (talk) 17:46, 3 October 2019 (UTC)

Property proposal


See d:Wikidata:Property proposal/Source of file. Multichill (talk) 16:29, 13 October 2019 (UTC)

We now have source of file (P7482). Multichill (talk) 09:25, 27 October 2019 (UTC)
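For the simple own-work case discussed above, the statement is just source of file (P7482) with the value original creation by uploader (Q66458942). A minimal sketch of that statement in the Wikibase JSON statement shape, as I understand that serialization, could look like this. The helper name `item_statement` is made up for illustration, and nothing here talks to the API or adds the reference Jheald suggests.

```python
import json

SOURCE_OF_FILE = "P7482"   # source of file
OWN_WORK = "Q66458942"     # original creation by uploader


def item_statement(prop: str, qid: str, rank: str = "normal") -> dict:
    """Build one statement whose value is an item (hypothetical helper)."""
    if not (qid.startswith("Q") and qid[1:].isdigit()):
        raise ValueError(f"not an item id: {qid!r}")
    return {
        "type": "statement",
        "rank": rank,
        "mainsnak": {
            "snaktype": "value",
            "property": prop,
            "datavalue": {
                "type": "wikibase-entityid",
                "value": {
                    "entity-type": "item",
                    "numeric-id": int(qid[1:]),
                    "id": qid,
                },
            },
        },
    }


own_work = item_statement(SOURCE_OF_FILE, OWN_WORK)
print(json.dumps(own_work, indent=2))
```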

Files from the internet


@Jheald: maybe you can describe your proposal on how to model files found on the internet? I proposed:

I think your proposal is to do:

Correct? Multichill (talk) 09:48, 27 October 2019 (UTC)

It was mentioned elsewhere that described at URL (P973) is probably better than URL (P2699) because it's more specific and the link is usually not a deeplink to the file, but to a page containing the file. It was also suggested that Commons compatible image available at URL (P4765) could be added in some cases to deeplink to the file.
I'm not getting any input so I'm just going to go ahead and implement the second proposal. I just created file available on the internet (Q74228490) for this. Multichill (talk) 15:27, 9 November 2019 (UTC)
Ok, test edit. I did the same thing on the other Geograph files in Category:Dornoch Firth. What do you think? (@Jheald: ). Multichill (talk) 21:03, 9 November 2019 (UTC)
Looks good, especially P7384. I'm starting to get used to Commons-style "statement groups".
It's just that somevalue/unknown isn't exactly the best supported feature around Wikidata and even more so here. Jura1 (talk) 00:09, 10 November 2019 (UTC)
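To make the internet-file model concrete, here is a sketch of source of file (P7482) = file available on the internet (Q74228490) with described at URL (P973) as a qualifier, plus an optional P4765 deeplink. Treating P4765 as a qualifier on the same statement is an assumption made for this sketch, the helper names are hypothetical, and the URLs are placeholders rather than real files.

```python
def item_snak(prop: str, qid: str) -> dict:
    """Snak pointing at an item (hypothetical helper)."""
    return {
        "snaktype": "value",
        "property": prop,
        "datavalue": {
            "type": "wikibase-entityid",
            "value": {"entity-type": "item", "numeric-id": int(qid[1:]), "id": qid},
        },
    }


def url_snak(prop: str, url: str) -> dict:
    """Snak holding a URL; URL values are serialized as plain strings."""
    return {
        "snaktype": "value",
        "property": prop,
        "datatype": "url",
        "datavalue": {"type": "string", "value": url},
    }


def internet_source_statement(description_url: str, file_url=None) -> dict:
    """Source statement for a file taken from a web page."""
    qualifiers = {"P973": [url_snak("P973", description_url)]}  # described at URL
    order = ["P973"]
    if file_url is not None:
        # Assumption: the direct file link goes on the same statement.
        qualifiers["P4765"] = [url_snak("P4765", file_url)]
        order.append("P4765")
    return {
        "type": "statement",
        "rank": "normal",
        "mainsnak": item_snak("P7482", "Q74228490"),
        "qualifiers": qualifiers,
        "qualifiers-order": order,
    }


# Placeholder URLs, for illustration only.
stmt = internet_source_statement(
    "https://example.org/photo/1", file_url="https://example.org/photo/1.jpg"
)
print(stmt["qualifiers-order"])
```

Keeping the closed-vocabulary item as the main value and pushing the URLs into qualifiers is what lets simple queries group files by kind of source without parsing URLs.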

Scanned Files


@Multichill, Jura1, Jheald, and Schlurcher: I was thinking about modeling of files (graphics or text) scanned from books. For example, my recent upload File:Chwała olimpijczykom - s.087a- Urszula Stępińska.tif, where I scanned a photo from the book Glory to the Olympians, 1939-1945 (Q97940059). I think the best way to model that would be:

source of file: Glory to the Olympians, 1939-1945 (normal rank)
    page(s): 87
published in: Glory to the Olympians, 1939-1945 (normal rank)
    page(s): 87

The published in (P1433)=Glory to the Olympians, 1939-1945 (Q97940059) statement is to indicate that the photograph was published in that book, but it might have been published in other (earlier) books, which would be listed in additional published in (P1433) statements. I do not know if it is worth adding information about who scanned it; it is usually not relevant, except that on Commons you might want to look for other scans by the same person or ask them to rescan at higher resolution, etc. If we want to add an optional qualifier like that we might need to propose a new property, as I do not see anything relevant. Other files like my 2007 upload File:Lokajski - Ślub powstańczej pary (1944).jpg might get these statements:

source of file: file available on the internet (normal rank)
published in: Powstanie Warszawskie w ilustracji - wydanie specjalne (normal rank)
    page(s): 46

since I do know where it was published but can no longer find the location of the source website. Does that sound reasonable? --Jarekt (talk) 18:15, 3 August 2020 (UTC)

I missed the ping. I liked the fact that source of file (P7482) has a limited number of options. I don't think changing this for scans is a good plan. Multichill (talk) 20:25, 9 August 2020 (UTC)

Additional URL types


What would people think about using the source of file property, as documented here, with additional URL types for further source linking? In particular, I would like to include the link to the IIIF manifest and direct file location for uploads. This could look like this:

It seems like if we have this data to add, this would probably be the best place to add it. Thoughts? Pinging Multichill, Jheald, Jarekt. Dominic (talk) 16:10, 13 July 2021 (UTC)

I am OK with this, as long as there is the basic
source of file: file available on the internet (normal rank)
    described at URL: https://ark.digitalcommonwealth.org/ark:/50959/z603tb264
part. One thing I would change would be to replace the generic URL (P2699) with the more specific Commons compatible image available at URL (P4765). --Jarekt (talk) 01:14, 14 July 2021 (UTC)
Okay, thanks. I've seen that property, but when I read the scope discussion it seems like it's for a different purpose (as currently envisioned). Its constraints currently limit use to Wikibase items, for example. But if you think it's better, that is fine with me. Dominic (talk) 21:40, 14 July 2021 (UTC)

How to tag user-created maps on Commons?


See Commons:Village_pump/Technical#Structured_data_for_user-created_maps?

DOI


Some images (such as File:ETH-BIB-Spiegel b. Bern, Wabern, Bern-Weissenbühl, Liebefeld, Blick nach Südsüdwesten (SSW)-LBS R1-941228.tif) can be identified uniquely by a DOI (10.3932/ethz-a-000283338). Should such images be tagged with DOI (P356) or source of file (P7482)? --1-Byte (talk) 11:18, 27 October 2024 (UTC)

I would say the DOI is not the source of the file, but an identifier. So I would suggest DOI (P356) --> 10.3932/ethz-a-000283338. The source could also be set, in the same way as for any other file available via a URL. --Schlurcher (talk) 15:14, 27 October 2024 (UTC)
So similar to how URN-NBN (P4109) is handled for File:Skeppet Skuldas undergång - SMV - SVA BB 5303 26.wav --Schlurcher (talk) 08:18, 29 October 2024 (UTC)[reply]
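As a sketch of the suggestion above, the DOI goes into its own DOI (P356) statement as a plain identifier string, separate from any source of file (P7482) statement. The helper names below are hypothetical and the syntax check is deliberately loose (it only looks for the "10." prefix and a slash); resolving through https://doi.org/ is the standard DOI resolver.

```python
def doi_statement(doi: str) -> dict:
    """Build a DOI (P356) statement holding the identifier as a string."""
    doi = doi.strip()
    # Loose sanity check only; this is not a full DOI syntax validator.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"does not look like a DOI: {doi!r}")
    return {
        "type": "statement",
        "rank": "normal",
        "mainsnak": {
            "snaktype": "value",
            "property": "P356",
            "datatype": "external-id",
            "datavalue": {"type": "string", "value": doi},
        },
    }


def doi_url(doi: str) -> str:
    """Resolver link for a DOI, usable as a description-page URL."""
    return "https://doi.org/" + doi.strip()


stmt = doi_statement("10.3932/ethz-a-000283338")
print(stmt["mainsnak"]["datavalue"]["value"], doi_url("10.3932/ethz-a-000283338"))
```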