WiktionaryDumps to words: Difference between revisions
Revision as of 03:12, 21 December 2020
WiktionaryDumps to words is a draft programming task. It is not yet considered ready to be promoted as a complete task, for reasons that should be found in its talk page.
- NOTE
- Please help address the issues with this task on the discussion page. If you add another language, be aware that the task may change in the future, and that you will then need to update your example.
Use the [https://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-pages-articles.xml.bz2 wiktionary dump] (input) to create a file equivalent to [http://manpages.ubuntu.com/manpages/bionic/man5/french.5.html "/usr/share/dict/french"] (output). The dump is a large bzip2-compressed XML file of about 800 MB. The "/usr/share/dict/french" file is a plain-text file containing one French word per line. It is available on Ubuntu in the package wfrench.
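To make the target format concrete, here is a minimal sketch of a post-processing step that turns a stream of candidate titles into a one-word-per-line listing. The class name `WordList` and the method `toDictFormat` are hypothetical, and the sorting and de-duplication are assumptions about what "equivalent" should mean here; the task description does not require them.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the task: formats candidate titles
// as a /usr/share/dict-style listing (one word per line).
class WordList {

    // Assumption: the listing is duplicate-free and sorted in String's
    // natural order (the order a TreeSet<String> iterates in).
    static String toDictFormat(Collection<String> titles) {
        TreeSet<String> words = new TreeSet<>();
        for (String t : titles) {
            String w = t.trim();
            if (!w.isEmpty()) {          // skip blank input lines
                words.add(w);
            }
        }
        StringBuilder out = new StringBuilder();
        for (String w : words) {
            out.append(w).append('\n');  // one word per line
        }
        return out.toString();
    }

    // Reads UTF-8 titles from standard input, writes the listing to standard output.
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8));
        List<String> titles = in.lines().collect(Collectors.toList());
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.print(toDictFormat(titles));
    }
}
```

Under these assumptions, piping the output of any of the examples on this page through `java WordList` would yield a sorted file with no repeated entries.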
Java
<lang java>import java.util.regex.Pattern;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class MyHandler extends DefaultHandler {
    private static final String TITLE = "title";
    private static final String TEXT = "text";

    // A level-2 heading on a line of its own marks the French section.
    // MULTILINE lets ^ and $ match at every line boundary.
    private static final Pattern FRENCH =
            Pattern.compile("^==French==$", Pattern.MULTILINE);

    // SAX may deliver the text of one element in several characters()
    // calls, so the chunks are collected here and examined in endElement().
    private final StringBuilder buffer = new StringBuilder();
    private String title = "";

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        buffer.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        switch (qName) {
            case TITLE:
                title = buffer.toString();
                break;
            case TEXT:
                if (FRENCH.matcher(buffer).find()) {
                    System.out.println(title);
                }
                break;
        }
        buffer.setLength(0);
    }
}

public class WiktoWords {
    public static void main(String[] args) {
        try {
            SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();
            saxParser.parse(new InputSource(System.in), new MyHandler());
        } catch (Exception e) {
            // Report the failure instead of exiting silently.
            System.err.println("Parsing failed: " + e.getMessage());
            System.exit(1);
        }
    }
}</lang>
- Output:
<pre>
$ javac WiktoWords.java
$ wget --quiet https://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-pages-articles.xml.bz2 -O - | bzcat | \
  java WiktoWords
hélice
pingouin
égoïsme
écholocation
nitroglycérine
croque-mitaine
</pre>
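Deciding whether a page belongs in the word list comes down to finding a level-2 `==French==` heading somewhere in its wikitext. The sketch below isolates that test; the class name `FrenchHeading` and the sample page are made up for illustration. It also shows a pitfall worth knowing: `String.matches` must match the whole input, and `.` does not cross line breaks by default, so a pattern such as `(.*)\n==French==\n(.*)` fails on any realistic multi-line page.

```java
import java.util.regex.Pattern;

// Illustrative helper (hypothetical name): detects a French section in wikitext.
class FrenchHeading {

    // MULTILINE makes ^ and $ match at line boundaries, so the heading is
    // found anywhere in the page, including on its first or last line.
    private static final Pattern FRENCH =
            Pattern.compile("^==French==$", Pattern.MULTILINE);

    static boolean hasFrenchSection(String wikitext) {
        return FRENCH.matcher(wikitext).find();
    }

    public static void main(String[] args) {
        // Made-up sample page with an English and a French section.
        String page = "==English==\n===Noun===\nfoo\n\n==French==\n===Verb===\nbar\n";

        // Line-anchored search: prints "true".
        System.out.println(hasFrenchSection(page));

        // Whole-string match where '.' stops at newlines: prints "false".
        System.out.println(page.matches("(.*)\n==French==\n(.*)"));
    }
}
```

Anchoring on whole lines also keeps deeper headings such as `===French===` and inline mentions of `==French==` from counting as a French section.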
OCaml
Using the library xmlm:
<lang ocaml>let () =
  let i = Xmlm.make_input ~strip:true (`Channel stdin) in
  let title = ref "" in
  (* Stack of the currently open element names, innermost first. *)
  let tag_path = ref [] in
  let push_tag tag = tag_path := tag :: !tag_path in
  let pop_tag () =
    match !tag_path with [] -> () | _ :: tl -> tag_path := tl in
  let last_tag_is tag =
    match !tag_path with [] -> false | hd :: _ -> hd = tag in
  (* Compile the pattern once; in Str, ^ and $ match at line boundaries. *)
  let reg = Str.regexp "^==French==$" in
  (* Search the whole text, not only its first characters. *)
  let contains_french s =
    try ignore (Str.search_forward reg s 0); true
    with Not_found -> false in
  while not (Xmlm.eoi i) do
    match Xmlm.input i with
    | `Dtd _ -> ()
    | `El_start ((_, tag_name), _) -> push_tag tag_name
    | `El_end -> pop_tag ()
    | `Data s ->
        if last_tag_is "title" then title := s;
        if last_tag_is "text" && contains_french s then
          print_endline !title
  done</lang>
- Output:
<pre>
wget --quiet https://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-pages-articles.xml.bz2 -O - | bzcat | \
  ocaml str.cma -I $(ocamlfind query xmlm) xmlm.cma to_words.ml
livrer
observateur
qui a bu boira
quelque chose
grande parure
obiit
pleuvoir
voir
...
</pre>