Web scraping: Difference between revisions

Revision as of 17:00, 16 February 2024

1,564 bytes added, 3 months ago

Minor edit (m) by PureFox (9,476 edits): →{{header|Wren}}: Changed to Wren S/H

Revision as of 21:24, 11 May 2023 by PSNOW123 (→{{header|Java}})
Latest revision as of 17:00, 16 February 2024 by PureFox, m (→{{header|Wren}}: Changed to Wren S/H)
(5 intermediate revisions by 4 users not shown)
Line 1,032:
=={{header|Java}}==
The http://tycho.usno.navy.mil/cgi-bin/timer.pl address is no longer available, although the parsing of the text is incredibly simple.
<syntaxhighlight lang="java">
String scrapeUTC() throws URISyntaxException, IOException {
    String address = "http://tycho.usno.navy.mil/cgi-bin/timer.pl";
    URL url = new URI(address).toURL();
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()))) {
        Pattern pattern = Pattern.compile("^.+? UTC");
        Matcher matcher;
        String line;
        while ((line = reader.readLine()) != null) {
            matcher = pattern.matcher(line);
            if (matcher.find())
                return matcher.group().replaceAll("<.+?>", "");
        }
    }
    return null;
}
</syntaxhighlight>
I'm using a cached page and get the following output.
<pre>
Jun. 25, 17:59:15 UTC
</pre>
<br />
Alternately, using Java 8, with the new web address given in the task description.
<syntaxhighlight lang="java">
import java.io.BufferedReader;
import java.io.IOException;
Line 1,043 ⟶ 1,066:
import java.net.URL;

public final class ~~WebTime~~WebScraping {

    public static void main(String[] aArgs) {
Line 2,215 ⟶ 2,238:
    End If
Next</syntaxhighlight>

=={{header|V (Vlang)}}==
<syntaxhighlight lang="Zig">
import net.http
import net.html

fn main() {
    resp := http.get("https://www.utctime.net") or {println(err) exit(-1)}
    html_doc := html.parse(resp.body)
    utc := html_doc.get_tag("table").str().split("UTC</td><td>")[1].split("</td>")[0]
    rfc_850 := html_doc.get_tag("table").str().split("RFC 850</td><td>")[1].split("</td>")[0]
    println(utc)
    println(rfc_850)
}
</syntaxhighlight>
{{out}}
<pre>
2023-06-06T12:08:01Z
Tuesday, 06-Jun-23 12:08:01 UTC
</pre>

=={{header|Wren}}==
Line 2,221 ⟶ 2,265:
An embedded program so we can ask the C host to download the page for us. This task's talk page is being used for this purpose as the original URL no longer works.
The code is based in part on the C example though, as we don't have regex, we use our Pattern module to identify the first occurrence of a UTC date/time after the site notice.
<syntaxhighlight lang="~~ecmascript~~wren">/* ~~web_scraping~~Web_scraping.wren */

import "./pattern" for Pattern
Line 2,231 ⟶ 2,275:
var CURLOPT_WRITEDATA = 10001

var BUFSIZE = 16384 * 4

foreign class Buffer {
Line 2,262 ⟶ 2,306:
var html = buffer.value
var ix = html.indexOf("(UTC)")
ix = html.indexOf("(UTC)", ix + 1) // skip the site notice
if (ix == -1) {
    System.print("UTC time not found.")
Line 2,271 ⟶ 2,316:
<br>
We now embed this in the following C program, compile and run it.
<syntaxhighlight lang="c">/* gcc ~~web_scraping~~Web_scraping.c -o ~~web_scraping~~Web_scraping -lcurl -lwren -lm */

#include <stdio.h>
Line 2,427 ⟶ 2,472:
    WrenVM* vm = wrenNewVM(&config);
    const char* module = "main";
    const char* fileName = "~~web_scraping~~Web_scraping.wren";
    char *script = readFile(fileName);
    WrenInterpretResult result = wrenInterpret(vm, module, script);
Line 2,447 ⟶ 2,492:
{{out}}
<pre>
~~20:59, 30 May 2020~~ 20:53, 20 August 2008
</pre>
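
Editorial note on the Java entry in this diff: because the tycho.usno.navy.mil address is no longer available, the parsing step it describes (match everything up to the first " UTC" on a line, then strip HTML tags) can be checked offline. The following is a minimal sketch, not part of either revision. The class name `UtcLineDemo`, the method `extractUtc`, and the sample page text are hypothetical stand-ins; only the two regular expressions are taken from the entry itself.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class UtcLineDemo {

    // Same pattern as the Java entry: lazily match from line start to the first " UTC".
    private static final Pattern UTC_LINE = Pattern.compile("^.+? UTC");

    // Returns the first line fragment ending in " UTC" with tags removed, or null if none.
    static String extractUtc(String html) {
        for (String line : html.split("\\R")) {       // split on any line terminator
            Matcher matcher = UTC_LINE.matcher(line);
            if (matcher.find()) {
                // Same tag-stripping step as the entry: drop anything shaped like <...>.
                return matcher.group().replaceAll("<.+?>", "");
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Hypothetical sample text standing in for the unavailable page (assumption).
        String sample = "<html><body>\n"
                + "<BR>Jun. 25, 17:59:15 UTC\t\tUniversal Time\n"
                + "<BR>Jun. 25, 01:59:15 PM EDT\tEastern Time\n"
                + "</body></html>";
        System.out.println(extractUtc(sample));
    }
}
```

Separating the extraction from the network read keeps the regex logic testable against saved text, which is presumably what "using a cached page" amounts to in the entry.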