Python HTML parsing fails in document with JavaScript


I'm trying to use Python to parse HTML (although, strictly speaking, the server claims it's XHTML), and every parser I have tried (ElementTree, minidom, and lxml) fails. When I look at where the problem is, it's inside a script tag:

    

I see what the problem is: the ampersand needs to be escaped. But it's inside a JavaScript script tag, so it can't be escaped, because that would break the code.

What's going on here? How is inline JavaScript able to break my parse, and what can I do about it?

Update: per request, here is the code I used with lxml.

    >>> from lxml import etree
    >>> tree = etree.parse("http://192.168.1.185/site.html")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "lxml.etree.pyx", line 3299, in lxml.etree.parse (src/lxml/lxml.etree.c:72655)
      File "parser.pxi", line 1791, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:106263)
      File "parser.pxi", line 1817, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:106564)
      File "parser.pxi", line 1721, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:105561)
      File "parser.pxi", line 1122, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:100456)
      File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:94543)
      File "parser.pxi", line 690, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:96003)
      File "parser.pxi", line 620, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:95050)
    lxml.etree.XMLSyntaxError: xmlParseEntityRef: no name, line 77, column 22

The lxml documentation says, at the beginning of chapter 9, that it "provides a very simple and powerful API for parsing XML and HTML", so I expected it not to throw an exception.

Parsing HTML is a very dirty business. Bad HTML is ubiquitous, and both script sections and the various templating languages throw monkey wrenches into the works.

However, you are also using XML-oriented parsers for the job. These are strict, and thus much more, well, completely unlikely to put up with anything but valid input. Most HTML, including most XHTML, is not strictly valid.
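The strictness is easy to reproduce with the standard library alone. The page from the question is not shown, so the sketch below uses a made-up XHTML fragment (`BROKEN_XHTML`) and a hypothetical helper (`try_strict_parse`). It shows `xml.etree.ElementTree` rejecting a bare `&&` inside a script tag, and accepting the same markup once the ampersands are escaped:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment standing in for the real page: the bare "&&"
# inside <script> is fine for browsers but is not well-formed XML.
BROKEN_XHTML = """<html>
<head>
<script type="text/javascript">
if (a && b) { run(); }
</script>
</head>
<body><p>Hello</p></body>
</html>"""


def try_strict_parse(markup):
    """Return None if the markup is well-formed XML, else the error text."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as exc:
        return str(exc)
    return None


# The strict parser stops at the unescaped ampersand.
error = try_strict_parse(BROKEN_XHTML)
print("unescaped:", error)

# With the ampersands escaped, the same document is well-formed XML.
escaped = BROKEN_XHTML.replace("&&", "&amp;&amp;")
print("escaped:", try_strict_parse(escaped))
```

An XML parser has no notion that script content is special, so it treats `&` as the start of an entity reference wherever it appears.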

So, use a parser designed to tolerate certain classes of HTML brokenness:

    import lxml.html
    d = lxml.html.parse(URL)

And you should be off to the races.
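To see a lenient parser cope with the same kind of input without installing anything, here is a minimal sketch. It swaps in the standard library's `html.parser` rather than `lxml.html`, and the sample markup and the `TextCollector` class are made up for illustration. Like other HTML-aware parsers, it treats script content as raw text, so the bare `&&` is harmless:

```python
from html.parser import HTMLParser

# Hypothetical sample: a script body containing an unescaped "&&".
SAMPLE = """<html>
<head>
<script type="text/javascript">
if (a && b) { run(); }
</script>
</head>
<body><p>Hello</p></body>
</html>"""


class TextCollector(HTMLParser):
    """Collect script bodies and visible text separately."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            # Script content arrives as raw text, ampersands included.
            self.scripts.append(data)
        elif data.strip():
            self.text.append(data.strip())


collector = TextCollector()
collector.feed(SAMPLE)
collector.close()

print(collector.text)
print("".join(collector.scripts).strip())
```

For real scraping work, `lxml.html` as shown above is the more convenient choice, since it builds a full tree you can query.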
