WiktionaryDumps to words: Difference between revisions

Revision as of 09:30, 17 February 2024

222 bytes added, 3 months ago

m

→‎{{header|Wren}}: Minor tidy and rerun

PureFox


Revision as of 17:45, 13 January 2022: PureFox (Added Wren)
Latest revision as of 09:30, 17 February 2024: PureFox m (→{{header|Wren}}: Minor tidy and rerun)
(3 intermediate revisions by 2 users not shown)
Line 4:
Make a file that can be useful with [https://en.wikipedia.org/wiki/Spell_checker spell checkers] like [https://fr.wikipedia.org/wiki/Ispell Ispell] and [https://en.wikipedia.org/wiki/GNU_Aspell Aspell].
Use the [https://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-pages-articles.xml.bz2 wiktionary dump] (input) to create a file equivalent ~~than~~to [https://manpages.ubuntu.com/manpages/bionic/man5/spanish.5.html "/usr/share/dict/spanish"] (output).
The input file is an XML dump of the Wiktionary that is a bz2'ed file of about 800MB.
The output file should be a file similar ~~than~~to "/usr/share/dict/spanish", ~~which contains one word of a given language by line in a simple text file~~a simple text file each line of which is one word in the given language.
An example of such a file is available in Ubuntu with the package '''wspanish'''.
=={{header|C}}==
<~~lang~~syntaxhighlight ~~C~~lang="c">#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
Line 205:
return 0;
}</~~lang~~syntaxhighlight>
{{out}}
Line 229:
=={{header|Java}}==
<~~lang~~syntaxhighlight lang="java">import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.SAXException;
Line 288:
}
}
}</~~lang~~syntaxhighlight>
{{out}}
Line 311:
=={{header|Julia}}==
Uses Regex and a state variable instead of XML parsing. Default setting prints the first 80 French words found.
<~~lang~~syntaxhighlight lang="julia">using CodecBzip2
function getwords(io::IO, output::IO; languagemark = "==French==", maxwords = 80)
Line 341:
getwords(stream, stdout) # or open a file to write to and use its IO handle instead of stdout
</~~lang~~syntaxhighlight>{{out}}
<pre>
gratis
Line 430:
Using the library [http://erratique.ch/software/xmlm xmlm]:
<~~lang~~syntaxhighlight lang="ocaml">let () =
let i = Xmlm.make_input ~strip:true (`Channel stdin) in
let title = ref "" in
Line 463:
then print_endline !title
end
done</~~lang~~syntaxhighlight>
{{out}}
Line 485:
=={{header|Perl}}==
{{trans|Raku}}
<~~lang~~syntaxhighlight lang="perl"># 20211214 Perl programming solution
use strict;
Line 529:
}
}
)</~~lang~~syntaxhighlight>
{{out}}
<pre>
Line 542:
Does not rely on wget/bzcat etc. Downloads in 16K or so blocks, unpacks one block at a time in memory, terminates properly when 5 or more words are found.<br>
Tested on Windows, should be fine on Linux as long as you can provide a suitable bz2.so
<!--<~~lang~~syntaxhighlight ~~Phix~~lang="phix">(notonline)-->
<span style="color: #008080;">constant</span> <span style="color: #000000;">url</span> <span style="color: #0000FF;">=</span> <span style="color: #008000;">"https://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-pages-articles.xml.bz2"</span>
Line 631:
<span style="color: #7060A8;">curl_easy_cleanup</span><span style="color: #0000FF;">(</span><span style="color: #000000;">curl</span><span style="color: #0000FF;">)</span>
<span style="color: #7060A8;">printf</span><span style="color: #0000FF;">(</span><span style="color: #000000;">1</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"Total downloaded: %s\n"</span><span style="color: #0000FF;">,{</span><span style="color: #000000;">file_size_k</span><span style="color: #0000FF;">(</span><span style="color: #000000;">tbr</span><span style="color: #0000FF;">)})</span>
<!--</~~lang~~syntaxhighlight>-->
{{out}}
<pre>
Line 644:
=={{header|Raku}}==
I misunderstood the data format and now just copy verbatim from Julia entry the processing logics ..
<syntaxhighlight lang="raku" ~~perl6~~line># 20211209 Raku programming solution
use LWP::Simple;
Line 710:
my $ua = CustomLWP.new: URL => $URL ;
$ua.CustomRequest>>.say</~~lang~~syntaxhighlight>
{{out}}
<pre>
Line 727:
An embedded program so we can use libcurl and libbzip2.
Rather than downloading the full 800MB .bz2 file and then decompressing it, we abort the download after receiving no more than the first 512 KB and then decompress that ignoring the resultant BZ_UNEXPECTED_EOF error. This turns out to be enough to find the first 2622 French words.
<~~lang~~syntaxhighlight ~~ecmascript~~lang="wren">/* ~~wiktionary_dumps_to_words~~WiktionaryDumps_to_words.wren */
import "./pattern" for Pattern
Line 789:
gotTextLast = false
}
}</~~lang~~syntaxhighlight>
<br>
We now embed this script in the following C program, build and run.
<~~lang~~syntaxhighlight lang="c">/* gcc ~~wiktionary_dumps_to_words~~WiktionaryDumps_to_words.c -o ~~wiktionary_dumps_to_words~~WiktionaryDumps_to_words -lcurl -lbz2 -lwren -lm */
#include <stdio.h>
Line 988:
WrenVM* vm = wrenNewVM(&config);
const char* module = "main";
const char* fileName = "~~wiktionary_dumps_to_words~~WiktionaryDumps_to_words.wren";
char *script = readFile(fileName);
WrenInterpretResult result = wrenInterpret(vm, module, script);
Line 1,004:
free(script);
return 0;
}</~~lang~~syntaxhighlight>
{{out}}
Line 1,030:
fable
a-
~~abaca~~
~~abada~~
~~abalone~~
~~abandon~~
</pre>
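
The approach that several entries above describe (decompress only a leading part of the dump, then scan it with a regex and a state variable rather than a full XML parser) can be sketched in Python as follows. This is a minimal illustration and not a transcription of any entry shown in this diff: the function names <code>decompress_prefix</code> and <code>words_for_language</code> are hypothetical, the <code>==French==</code> marker and the limit of 80 words are borrowed from the Julia signature, and the rule that skips titles containing a colon (namespace pages) is an assumption.

```python
import bz2
import re
from typing import Iterable, Iterator

# Matches a page title on a single line of the XML dump.
TITLE_RE = re.compile(r"<title>(.*?)</title>")


def decompress_prefix(chunk: bytes) -> str:
    """Decompress whatever a (possibly truncated) bz2 prefix allows.

    An incremental decompressor returns the data it can recover and
    simply waits for more input instead of failing on a short stream.
    """
    raw = bz2.BZ2Decompressor().decompress(chunk)
    # Truncation may cut a multi-byte character, so drop undecodable bytes.
    return raw.decode("utf-8", errors="ignore")


def words_for_language(
    lines: Iterable[str],
    language_mark: str = "==French==",
    max_words: int = 80,
) -> Iterator[str]:
    """Yield page titles whose text contains language_mark (state machine)."""
    title = None  # state: title of the page currently being scanned
    emitted = 0
    for line in lines:
        match = TITLE_RE.search(line)
        if match:
            title = match.group(1)  # a new page starts here
            continue
        if title is not None and language_mark in line:
            # Assumption: titles with a colon are namespace pages, not words.
            if ":" not in title:
                yield title
                emitted += 1
                if emitted >= max_words:
                    return
            title = None  # report each page at most once
```

A possible use, assuming <code>data</code> holds the first few hundred kilobytes of the dump: <code>for word in words_for_language(decompress_prefix(data).splitlines()): print(word)</code>. Only pages that lie entirely inside fully received bz2 blocks can be recovered this way.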