User:Bukzor: Difference between revisions

Line 19:
#*The mediawiki API is pretty straightforward. I feel done with that part.
# grab the HTML for those pages, put them into a DOM
#*I'm having trouble getting any of the builtin html or xml parsers to give me a DOM. [http://docs.python.org/library/htmlparser.html htmlparser] is just a ghetto little state machine, and the xml parsers are too strict (  is an 'unknown entity').
#*I've posted a stackoverflow question on this subject [http://stackoverflow.com/questions/2676872/how-to-parse-malformed-html-in-python-using-standard-libraries here]. --Bukzor 16:31, 20 April 2010 (UTC)
#*Despite everyone agreeing that Python doesn't have a builtin HTML->DOM parser, I've parsed the site A-Z using ElementTree with minimal effort. I had to fix a bunch of invalid HTML though. Look at my edits for the previous couple of days for details.
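The entity complaint above can be worked around without leaving the standard library. Below is a minimal sketch (modern Python 3) of the approach: resolve HTML named entities to literal characters before handing the markup to the strict XML parser, leaving the five XML-predefined entities alone so the parser still sees valid XML. It assumes the scraped snippets are otherwise well-formed; the helper name `resolve_entities` is mine, not something from the page.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# Entities the XML parser already understands; leave them escaped.
XML_ENTITIES = {'amp', 'lt', 'gt', 'quot', 'apos'}

def resolve_entities(markup):
    """Replace HTML named entities (e.g. &nbsp;) with literal characters,
    keeping the five XML-predefined entities untouched."""
    def repl(match):
        name = match.group(1)
        if name in XML_ENTITIES or name not in name2codepoint:
            return match.group(0)  # leave it for the XML parser
        return chr(name2codepoint[name])
    return re.sub(r'&(\w+);', repl, markup)

snippet = '<p>Hello&nbsp;world &amp; friends</p>'
root = ET.fromstring(resolve_entities(snippet))
```

After the substitution, ElementTree parses the fragment without the 'unknown entity' error, and `root.text` contains a literal non-breaking space where `&nbsp;` was.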
Line 26:
#* I note that plenty of the Python solutions are using plain <pre> tags which are being skipped in my current scheme. I'll have to add some code to detect this...
# automate feeding that code through pylint
#* The current Ubuntu pylint (0.18) throws up on 'import curses' for unknown reasons, but installing the latest version (0.20) allows me to pylint all of the scraped snippets as a whole. Collectively, they're rated at -1.58/10 (that's negative). I hope to get that up to 10/10 someday. --[[User:Bukzor|Bukzor]] 05:02, 26 April 2010 (UTC)
# save a report of pages->scores
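The last two steps (feed snippets through pylint, save a pages->scores report) can be sketched roughly as below. This is my own illustration, not the page's actual code: the helper names `parse_score`, `score_snippet`, and `save_report` are hypothetical, it assumes `pylint` is on the PATH and prints its usual "rated at N/10" line, and `subprocess.run(..., capture_output=True)` needs Python 3.7+.

```python
import re
import subprocess

# pylint's report ends with a line like "Your code has been rated at -1.58/10".
SCORE_RE = re.compile(r'rated at (-?\d+\.\d+)/10')

def parse_score(report_text):
    """Pull the numeric score out of pylint's textual report, or None."""
    match = SCORE_RE.search(report_text)
    return float(match.group(1)) if match else None

def score_snippet(path):
    """Run pylint on one scraped snippet (assumes pylint is installed)."""
    result = subprocess.run(['pylint', path], capture_output=True, text=True)
    return parse_score(result.stdout)

def save_report(scores, out_path):
    """Write a tab-separated page -> score report, worst scores first."""
    with open(out_path, 'w') as out:
        for page, score in sorted(scores.items(), key=lambda kv: kv[1]):
            out.write('%s\t%s\n' % (page, score))
```

Scoring snippets one file at a time (rather than all at once, as above) would make the per-page report straightforward, at the cost of one pylint invocation per page.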
 