Web scraping: Difference between revisions

88 bytes added

Older edit: Revision as of 14:38, 6 June 2023 by Wutang (minor edit, no edit summary)
Newer edit: Revision as of 17:00, 16 February 2024 by PureFox (minor edit: →{{header|Wren}}: Minor changes (including code to skip the site notice) and rerun.)
Line 2,265:
An embedded program so we can ask the C host to download the page for us. This task's talk page is being used for this purpose as the original URL no longer works.
The code is based in part on the C example though, as we don't have regex, we use our Pattern module to identify the first occurrence of a UTC date/time after the site notice.
<syntaxhighlight lang="ecmascript">/* ~~web_scraping~~Web_scraping.wren */
import "./pattern" for Pattern

Line 2,275:
var CURLOPT_WRITEDATA = 10001
var BUFSIZE = 16384 * 4
foreign class Buffer {

Line 2,306:
var html = buffer.value
var ix = html.indexOf("(UTC)")
ix = html.indexOf("(UTC)", ix + 1) // skip the site notice
if (ix == -1) {
    System.print("UTC time not found.")

Line 2,315 ⟶ 2,316:
<br>
We now embed this in the following C program, compile and run it.
<syntaxhighlight lang="c">/* gcc ~~web_scraping~~Web_scraping.c -o ~~web_scraping~~Web_scraping -lcurl -lwren -lm */
#include <stdio.h>

Line 2,471 ⟶ 2,472:
    WrenVM* vm = wrenNewVM(&config);
    const char* module = "main";
    const char* fileName = "~~web_scraping~~Web_scraping.wren";
    char *script = readFile(fileName);
    WrenInterpretResult result = wrenInterpret(vm, module, script);

Line 2,491 ⟶ 2,492:
{{out}}
<pre>
~~20:59, 30 May 2020~~20:53, 20 August 2008
</pre>
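For readers following the change in the Line 2,306 hunk, the same "skip the first (UTC), take the second" idea can be sketched on the C side. This is only an illustrative sketch, not code from either revision: the helper names find_nth and second_utc_timestamp are hypothetical, and it assumes the timestamp has the form "HH:MM, D Month YYYY (UTC)" with no other colon between the time and the marker.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: pointer to the n-th (1-based) occurrence of needle
   in haystack, or NULL if there are fewer than n occurrences. */
const char *find_nth(const char *haystack, const char *needle, int n) {
    assert(haystack != NULL && needle != NULL && needle[0] != '\0');
    const char *p = haystack;
    while ((p = strstr(p, needle)) != NULL) {
        if (--n == 0) return p;
        p += 1; /* resume one character on, like indexOf("(UTC)", ix + 1) */
    }
    return NULL;
}

/* Hypothetical helper: copies the text "HH:MM, D Month YYYY" that precedes
   the second "(UTC)" marker into out. Returns 1 on success, 0 otherwise. */
int second_utc_timestamp(const char *html, char *out, size_t out_size) {
    assert(html != NULL && out != NULL && out_size > 0);

    /* The first marker is assumed to belong to the site notice. */
    const char *marker = find_nth(html, "(UTC)", 2);
    if (marker == NULL) return 0;

    /* Walk back to the nearest ':'; assumed to be the one in HH:MM. */
    const char *colon = marker;
    while (colon > html && *colon != ':') colon--;
    if (*colon != ':' || colon - html < 2) return 0;
    if (!isdigit((unsigned char)colon[-1]) ||
        !isdigit((unsigned char)colon[-2])) return 0;

    const char *start = colon - 2;
    const char *end = marker;
    while (end > start && end[-1] == ' ') end--; /* trim space before "(UTC)" */

    size_t len = (size_t)(end - start);
    if (len >= out_size) return 0; /* caller's buffer is too small */
    memcpy(out, start, len);
    out[len] = '\0';
    return 1;
}
```

A caller would pass the downloaded page text and a small buffer, for example second_utc_timestamp(html, buf, sizeof buf), and treat a return value of 0 the same way the Wren script treats ix == -1.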