User:Fæ/sandbox4
Geograph mapping proposal for automated tidy-up of raw HTML imports on Geograph image pages
Please see Commons:Bots/Work_requests#Geograph_raw_html_tidy-up for the context of this sandbox analysis. --Fæ (talk) 08:02, 16 September 2012 (UTC)
Mapping example taken from: http://commons.wikimedia.org/w/index.php?title=File:Disused_railway_building_and_platform_-_geograph.org.uk_-_1989471.jpg&oldid=77328823
- Source text
{{en|1=Disused railway building and platform, CTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" id="geograph"> <head> <title>Disused railway building and platform:: OS grid SE8279 :: Geograph Britain and Ireland - photograph every grid square!</title> <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" /> <meta name="description" content="SE8279 :: Disused railway building and platform, near to Low Marishes, North Yorkshire, Great Britain" /> <meta name="ICBM" content="54.202974525125, -0.74288542171216"/> <meta name="DC.title" content="Geograph:: Disused railway building and platform:: OS grid SE8279"/> <meta name="pinterest" content="nopin" /> <link rel="stylesheet" type="text/css" title="Monitor" href="http://s0.geograph.org.uk/templates/basic/css/basic.v7593.css" media="screen" /> <link rel="shortcut icon" type="image/x-icon" href="http://s0.geograph.org.uk/favicon.ico"/> <link rel="alternate" type="application/vnd.google-earth.kml+xml" href="/photo/1989471.kml"/> <link rel="search" type="application/opensearchdescription+xml" title="Geograph Britain and Ireland search" href="/stuff/osd.xml" /> <script type="text/javascript" src="http://s0.geograph.org.uk/js/geograph.v7508.js"></script> </head> <body> <div id="header_block"> <div id="header"> <h1 onclick="document.location='/';"><a title="Geograph home page" href="/">Geograph - photograph every grid square</a></h1> </div> </div> <div class="content_photowhite" id="maincontent_block"><div id="maincontent"> <div style="float:right; position:relative; width:5em; height:4em;"></div> <div style="float:right; position:relative; width:2.5em; height:1em;"></div> <div itemscope itemtype="schema.org/Photograph"><meta itemprop="isFamilyFriendly" content="true"/> <h2><a title="Grid Reference SE8279 :: 22 images" href="/gridref/SE8279">SE8279</a> : Disused railway building and platform</h2> 
<h3 itemprop="contentLocation"><span title="about 2 km from">near to Low Marishes, North Yorkshire, Great Britain. }}
- Pseudocode
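- Remove all text up to the description meta tag
- Keep the content of the description meta tag and indent it as a Description line
- Use the content of the ICBM meta tag to create an ICBM line (this could be processed as a microformat)
- Trim the text after the ICBM meta tag up to the contentLocation element
- Use the title and text of the contentLocation element to create a Location line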
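The steps above could be sketched in Python roughly as follows. This is only an illustration, not the bot's actual code: the function name `tidy_geograph_description`, the regular expressions, the plain-text output layout and the abridged `SAMPLE` string are all assumptions made for this example, and real imports may be truncated at other points than the one shown.

```python
import re

# Abridged copy of the raw import from the mapping example (assumption:
# other affected pages follow the same layout).
SAMPLE = (
    '{{en|1=Disused railway building and platform, CTYPE html PUBLIC '
    '"-//W3C//DTD XHTML 1.0 Strict//EN"> <head> '
    '<meta name="description" content="SE8279 :: Disused railway building '
    'and platform, near to Low Marishes, North Yorkshire, Great Britain" /> '
    '<meta name="ICBM" content="54.202974525125, -0.74288542171216"/> '
    '</head> <h3 itemprop="contentLocation"><span title="about 2 km from">'
    'near to Low Marishes, North Yorkshire, Great Britain. }}'
)


def tidy_geograph_description(raw: str) -> str:
    """Reduce a raw Geograph HTML import to the 'Data from Geograph' lines.

    Returns the input unchanged if none of the expected tags are found.
    """
    lines = ["Data from Geograph:"]

    # Steps 1-2: drop everything before the description meta tag, keep its content.
    desc = re.search(r'<meta\s+name="description"\s+content="([^"]*)"', raw)
    if desc:
        lines.append(f"Description: {desc.group(1).strip()}")

    # Step 3: the ICBM meta tag holds "latitude, longitude".
    icbm = re.search(r'<meta\s+name="ICBM"\s+content="([^"]*)"', raw)
    if icbm:
        lines.append(f"ICBM: {icbm.group(1).strip()}")

    # Steps 4-5: skip to contentLocation; keep the span title and the text
    # that follows it, stopping at the next tag or the closing braces.
    loc = re.search(
        r'itemprop="contentLocation"[^>]*>\s*<span\s+title="([^"]*)"\s*>([^<}]*)',
        raw,
    )
    if loc:
        lines.append(f"Location: ({loc.group(1).strip()}) {loc.group(2).strip()}")

    # Nothing recognised: leave the text alone rather than blank it.
    return "\n".join(lines) if len(lines) > 1 else raw


print(tidy_geograph_description(SAMPLE))
```

Returning the text untouched when no tag matches is a deliberately cautious choice for a bot run, so that pages with a different layout are skipped rather than emptied.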
- Output text
{{en|1=Disused railway building and platform
, CTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" id="geograph">
<head>
<title>Disused railway building and platform:: OS grid SE8279 :: Geograph Britain and Ireland - photograph every grid square!</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<meta name="description" content="
- Data from Geograph:
- Description: SE8279 :: Disused railway building and platform, near to Low Marishes, North Yorkshire, Great Britain
" />
<meta name="ICBM" content="
- ICBM: 54.202974525125, -0.74288542171216
"/> <meta name="DC.title" content="Geograph:: Disused railway building and platform:: OS grid SE8279"/>
<meta name="pinterest" content="nopin" />
<link rel="stylesheet" type="text/css" title="Monitor" href="http://s0.geograph.org.uk/templates/basic/css/basic.v7593.css" media="screen" />
<link rel="shortcut icon" type="image/x-icon" href="http://s0.geograph.org.uk/favicon.ico"/>
<link rel="alternate" type="application/vnd.google-earth.kml+xml" href="/photo/1989471.kml"/>
<link rel="search" type="application/opensearchdescription+xml" title="Geograph Britain and Ireland search" href="/stuff/osd.xml" />
<script type="text/javascript" src="http://s0.geograph.org.uk/js/geograph.v7508.js"></script>
</head>
<body>
<div id="header_block">
<div id="header">
<h1 onclick="document.location='/';"><a title="Geograph home page" href="/">Geograph - photograph every grid square</a></h1>
</div>
</div>
<div class="content_photowhite" id="maincontent_block"><div id="maincontent">
<div style="float:right; position:relative; width:5em; height:4em;"></div>
<div style="float:right; position:relative; width:2.5em; height:1em;"></div>
<div itemscope itemtype="schema.org/Photograph"><meta itemprop="isFamilyFriendly" content="true"/>
<h2><a title="Grid Reference SE8279 :: 22 images" href="/gridref/SE8279">SE8279</a> : Disused railway building and platform</h2>
<h3 itemprop="contentLocation"><span title="
- Location: (about 2 km from)
">near to Low Marishes, North Yorkshire, Great Britain.
}}
- Final result
- Data from Geograph:
- Description: SE8279 :: Disused railway building and platform, near to Low Marishes, North Yorkshire, Great Britain
- ICBM: 54.202974525125, -0.74288542171216
- Location: (about 2 km from) near to Low Marishes, North Yorkshire, Great Britain.
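The ICBM line in the result is a latitude, longitude pair in decimal degrees, so it could be processed further as a geo microformat, as the pseudocode suggests. A minimal sketch, assuming the value always has the form "lat, lon"; the helper name `icbm_to_geo` is hypothetical and not part of any existing bot:

```python
def icbm_to_geo(icbm: str) -> str:
    """Wrap an ICBM 'lat, lon' string in geo microformat markup."""
    # Exactly two comma-separated parts are expected; anything else
    # raises ValueError from the unpacking below.
    lat_text, lon_text = (part.strip() for part in icbm.split(","))

    # Validate the numbers, but keep the original digits in the output
    # so no precision is lost by float formatting.
    lat, lon = float(lat_text), float(lon_text)
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError(f"coordinates out of range: {icbm!r}")

    return (
        f'<span class="geo"><span class="latitude">{lat_text}</span>; '
        f'<span class="longitude">{lon_text}</span></span>'
    )


print(icbm_to_geo("54.202974525125, -0.74288542171216"))
```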