Commons:Requests for comment/Technical needs survey/TimedText
TimedText
Description of the problems
- Problem description:
- need an easy/user-friendly way to categorise timedtext. beneficial for categorising based on languages, quality of transcript, etc.
- need an easy way to check all timedtext pages associated with a file. something similar to https://commons.wikimedia.org/w/index.php?oldid=828200732#L-166 .
- need a more intuitive way of going to the associated file on a timedtext page. currently it's by ctrl+click the file (the mediaplayer box), or open up the popup and click the circle i. i needed this so much that i wrote a script before i learnt the ctrl+click trick https://commons.wikimedia.org/w/index.php?oldid=828200732#L-159.
- a way to assess the quality of timedtext (similar to wikisource?). incomplete, transcribed, non-synchronised, proofread, verified...?--RoyZuo (talk) 23:59, 31 December 2023 (UTC)
- a tool/interface that helps transcription, something like https://www.nikse.dk/subtitleedit/online .--RoyZuo (talk) 07:07, 2 January 2024 (UTC)
- Proposal type: feature request
- Proposed solution:
- Phabricator ticket:
- Further remarks:
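As a rough illustration of problems 2 and 3 in the list above (finding every TimedText page for a file, and getting from a TimedText page back to its file), the sketch below parses a title of the form TimedText:<file>.<lang>.<format>, as seen in TimedText:Sandbox.webm.en.srt, and builds query parameters for the standard MediaWiki API module list=allpages. The helper names, the namespace number 102 for TimedText on Commons, and the accepted formats (srt, vtt) are assumptions made for illustration; none of them is stated in this discussion.

```python
import re
from typing import Dict, NamedTuple


class TimedTextTitle(NamedTuple):
    file_title: str  # e.g. "File:Sandbox.webm"
    language: str    # e.g. "en"
    fmt: str         # e.g. "srt"


# Assumed title shape: TimedText:<file name>.<language code>.<format>
_TITLE_RE = re.compile(
    r"^TimedText:(?P<file>.+)\.(?P<lang>[A-Za-z][A-Za-z0-9-]*)\.(?P<fmt>srt|vtt)$"
)


def parse_timedtext_title(title: str) -> TimedTextTitle:
    """Split a TimedText page title into file title, language and format."""
    match = _TITLE_RE.match(title)
    if match is None:
        raise ValueError(f"not a recognised TimedText title: {title!r}")
    return TimedTextTitle("File:" + match["file"], match["lang"], match["fmt"])


def build_timedtext_query(file_name: str, namespace: int = 102) -> Dict[str, str]:
    """Build list=allpages parameters for all TimedText pages of one file.

    namespace=102 is an assumption about the Commons configuration;
    pass a different number if the wiki uses another namespace ID.
    """
    if file_name.startswith("File:"):
        file_name = file_name[len("File:"):]
    return {
        "action": "query",
        "list": "allpages",
        "apnamespace": str(namespace),
        # The trailing dot avoids matching files whose names merely
        # start with the same characters.
        "apprefix": file_name + ".",
        "aplimit": "max",
        "format": "json",
    }
```

The parameters returned by build_timedtext_query could be sent to the wiki's api.php with any HTTP client; the request itself is left out so that the sketch stays self-contained.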
Discussion
- Oppose You did not explain why this would be useful and why there are these needs. Also 4 can already be done via file categories. Opposing for now since this so far doesn't seem to be anywhere near the most important issues and can to a large degree already be done; very many other issues would be more important and haven't been listed here. --Prototyperspective (talk) 11:17, 1 January 2024 (UTC)
- can you point to me an english timedtext that's incomplete, and an english timedtext that's been proofread, based on your claim that "4 can already be done via file categories"? RoyZuo (talk) 11:46, 1 January 2024 (UTC)
- I said it can already be done, not that it is already being done and I would encourage such to be done, especially if machine translation / auto-caption tools are leveraged for WMC multilingualism (which could be very impactful). However, I can also point you to an example: Category:Videos by Terra X with English subtitle file unchecked – these need proofreading (see the cats above for more). I think people usually just upload timedtexts that are already complete but a new category for incomplete ones would be useful.
- 1. also is already being done with cats like "…with subtitles in English". Prototyperspective (talk) 16:13, 1 January 2024 (UTC)
- as i've tested at TimedText:Sandbox.webm.en.srt, timedtext pages can be categorised in the same way as other pages, but HotCat doesn't work on timedtext pages, so it's cumbersome. which is why i said we "need an easy/user-friendly way to categorise timedtext". the most basic solution is to make HotCat work on timedtext pages.
- but traditional categorisation method is inferior to the assessment structure in wikisource, which i think is a lot easier to use (just clicking the coloured dots) and provides a standard classification.
- then this reminded me of the need to have a transcription tool, because transcribing audio/video is different from a text. transcribing audio/video requires pausing the playback and setting timestamps.--RoyZuo (talk) 07:07, 2 January 2024 (UTC)
- Regarding 3. A patch for this is already coming —TheDJ (talk • contribs) 10:19, 2 January 2024 (UTC)
- @TheDJ: might be a problem though: the tab is only available on timedtext but not timedtext talk pages.--RoyZuo (talk) 18:00, 21 November 2024 (UTC)
- Regarding 4. You can always just use talk pages. Just like Wikipedia uses talk pages for wiki project assessments. —TheDJ (talk • contribs) 10:17, 2 January 2024 (UTC)
- TheDJ, consider ASR-generated captions. How to indicate to the viewer in the player as long as they aren't fully proofread, that these are automated captions (and therefore might be wrong in some places)? How to efficiently proofread them using the crowd? Love Bawolff's idea below. -- Rillke(q?) 19:03, 4 February 2024 (UTC)
- I think we’d better focus on ASR subtitles before this. My point was more that there are some workarounds right now, yet no one is proofreading to begin with. It’s generally not a good idea to add complexity to the software before there is a well-identified use and need. And I don’t think we have seen that yet, or there would be templates on talk pages with proofreading state. So in my opinion it would be better to work on more fundamental problems before we tackle proofreading with additional software complexity. —TheDJ (talk • contribs) 21:26, 4 February 2024 (UTC)
- Regarding 5. Heavily suggest that this is a case of "external specialised services are better than build and maintain our own service". We used to have Amara integration and for the few years that that worked, it was pretty ok. Finding a good online editor, hosting it on Toolforge and adding a few integrations is going to be way more maintainable than trying to ram yet another component into Mediawiki. —TheDJ (talk • contribs) 10:16, 2 January 2024 (UTC)
- that a tool is needed doesn't mean it has to be embedded in mediawiki. an external tool hosted on toolforge like croptool is good enough, but we don't have such a tool now.
- making it easy to transcribe audio/video is good for linguistic diversity. RoyZuo (talk) 08:47, 13 January 2024 (UTC)
- Comment In order to avoid Special:AbuseFilter/103, designed to "Enforce syntactical valid Timed Text" for new and anonymous users, one must ensure their Timed Text is syntactically valid by reading MediaWiki:Abusefilter-warning-invalid-timed-text and en:SubRip. — 🇺🇦Jeff G. ツ please ping or talk to me🇺🇦 11:37, 28 January 2024 (UTC)
- I feel like TimedText should integrate with the Translate extension. Bawolff (talk) 18:41, 3 February 2024 (UTC)
- That'd be nice regarding chunking and tracking the proofreading state of each chunk. -- Rillke(q?) 19:03, 4 February 2024 (UTC)
- TimedText is currently not in a state that would encourage contributions in the same easy way you can improve Wikipedia articles. I'd love to have a way to edit the caption line shown in the player while watching the video. Additionally, for speech, an ASR (like Whisper) would be helpful to me so the timing and most of the content is already correctly done. None of this has to be provided by MediaWiki, but a seamless integration in the user interface of Wikimedia Commons would be great. -- Rillke(q?) 19:03, 4 February 2024 (UTC)
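To make the earlier comment about syntactically valid Timed Text more concrete, here is a minimal sketch of a SubRip (.srt) sanity check. It only tests the basic cue layout (a numeric index line, a timing line of the form HH:MM:SS,mmm --> HH:MM:SS,mmm, then at least one text line). It is an illustration under those assumptions and does not reproduce the actual rules of Special:AbuseFilter/103, which are not quoted in this discussion. The function name srt_problems is hypothetical.

```python
import re
from typing import List

# Timing line as used by SubRip: comma before the milliseconds.
_TIMING_RE = re.compile(
    r"^(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})$"
)


def _to_ms(h: str, m: str, s: str, ms: str) -> int:
    """Convert timestamp parts to milliseconds."""
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def srt_problems(text: str) -> List[str]:
    """Return a list of human-readable problems; empty if none were found."""
    problems: List[str] = []
    # Cues are separated by one or more blank lines.
    blocks = [b for b in re.split(r"\n\s*\n", text.strip()) if b.strip()]
    for n, block in enumerate(blocks, start=1):
        lines = block.splitlines()
        if len(lines) < 3:
            problems.append(f"cue {n}: expected index, timing and text lines")
            continue
        if not lines[0].strip().isdigit():
            problems.append(f"cue {n}: first line is not a numeric index")
        match = _TIMING_RE.match(lines[1].strip())
        if match is None:
            problems.append(f"cue {n}: malformed timing line")
        else:
            parts = match.groups()
            if _to_ms(*parts[:4]) >= _to_ms(*parts[4:]):
                problems.append(f"cue {n}: end time is not after start time")
    return problems
```

A check like this could run in an external editor before saving, so that contributors see formatting problems without first triggering the filter warning.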
Votes
- yes.--RoyZuo (talk) 12:01, 23 January 2024 (UTC)
- Oppose per discussion above: Support addressing general subject and related issues but opposing the particular solutions proposed here at this point (e.g. one can already categorize the files rather than the transcripts). Something that's very much needed is machine transcription and translation which then only need to be checked & edited rather than encouraging huge time sinks in a time of stagnating WMC contributor counts. --Prototyperspective (talk) 11:17, 28 January 2024 (UTC)
- i dont agree that categories that serve as indicators of the transcripts' characteristics (quality, language, source...) should be on the files. they should be directly on the transcripts. RoyZuo (talk) 12:32, 5 February 2024 (UTC)
- Why though? These categories could be hidden if they clutter the cats and they could be used for what's displayed when clicking the CC button. The transcripts are not useful to navigate separate from the videos. For example when checking them one has to watch the video or listen to the video-audio or audio too. Assuming that it's indeed for some reasons better than categories on the video, at this point it's not that needed, partly because another solution already exists – thus I think it wouldn't be a top-priority issue. Prototyperspective (talk) 12:54, 5 February 2024 (UTC)
- why should the categories be on files and not timedtexts? RoyZuo (talk) 13:13, 5 February 2024 (UTC)
- @RZuo: Categories are not syntactically valid subtitles. — 🇺🇦Jeff G. ツ please ping or talk to me🇺🇦 14:05, 5 February 2024 (UTC)
- I don't understand Jeff's reply so to answer your question: mainly because that is already possible (and is standard procedure that's already implemented in some cases). Prototyperspective (talk) 15:36, 5 February 2024 (UTC)
- @Prototyperspective: Do you have an example? — 🇺🇦Jeff G. ツ please ping or talk to me🇺🇦 15:46, 5 February 2024 (UTC)
- Yes, linked in the third comment under "Discussion"; but only mentioned that – more importantly this is already possible vs requiring technical changes. Prototyperspective (talk) 15:54, 5 February 2024 (UTC)
- @Prototyperspective: The categories in the wikitext at TimedText:Sandbox.webm.en.srt don't actually display linked category names at the bottom of the page, even though they do categorize the page. — 🇺🇦Jeff G. ツ please ping or talk to me🇺🇦 16:46, 5 February 2024 (UTC)
- That is not the third comment from top. Prototyperspective (talk) 16:48, 5 February 2024 (UTC)
- @Prototyperspective: There is one labeled comment. What timestamp applies to what you are writing about? — 🇺🇦Jeff G. ツ please ping or talk to me🇺🇦 17:01, 5 February 2024 (UTC)
- Support devoting resources to timed text in general. Points 4 and 5 and a well-integrated ASR (without me having to open zillions of browser tabs) appear interesting to me. Accessibility requirements might be put in place by legislators, as we have seen with GDPR regarding privacy in the past. -- Rillke(q?) 19:14, 4 February 2024 (UTC)