Question
I would like to extract the element selected by the XPath //DIV[@id="ps-content"] from this web page: http://www.amazon.com/dp/1449319432 (saved as a local file).
I would like to do it with a single command line, using one of the better parsers such as Saxon-PE or BaseX.
So far, the shortest solution I (seem to) have found takes these two lines:
java -jar tagsoup-1.2.1.jar <page.html >page.xhtml
java -cp saxon9pe.jar net.sf.saxon.Query -s:"page.xhtml" -qs:"//DIV[@id='ps-content']"
but all it returns is the following, which is not the block of HTML I expected:
<?xml version="1.0" encoding="UTF-8"?>
I have two questions:
- What is wrong with my command lines? Why don't they return the block of HTML selected by my XPath?
- Since Saxon-PE has TagSoup support built in (see http://saxonica.com/documentation9.4-demo/html/extensions/functions/parse-html.html), how can I merge my two lines into a single one?
Answer 1:
I found the correct command-line to launch the query without TagSoup:
java -cp saxon9pe.jar net.sf.saxon.Query -s:"test.xhtm" -qs:"//*:div[@id='ps-content']"
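The change from //DIV to //*:div plausibly matters for two reasons: XPath name tests are case-sensitive, and an unprefixed name does not match an element that sits in a namespace. The sketch below illustrates this in Python rather than Saxon. It uses a made-up one-line XHTML string instead of the real page, and it assumes the converted file puts its elements in the XHTML namespace with lowercase names, which is what the *:div wildcard suggests.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical stand-in for the converted file (not the real Amazon page).
SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<div id="ps-content"><p>product details</p></div>'
    '</body></html>'
)


def count_matches(path):
    """Return how many elements of SAMPLE the given ElementTree path selects."""
    root = ET.fromstring(SAMPLE)
    return len(root.findall(path))


# Upper-case name, no namespace: selects nothing.
upper_case = count_matches(".//DIV[@id='ps-content']")
# Lower-case name, still no namespace: selects nothing either.
no_namespace = count_matches(".//div[@id='ps-content']")
# Lower-case name qualified with the XHTML namespace: selects the div.
qualified = count_matches(".//{%s}div[@id='ps-content']" % XHTML_NS)

print(upper_case, no_namespace, qualified)
```

ElementTree has no *:div syntax, so the last path spells the namespace out explicitly; in XQuery the wildcard prefix plays the same role.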
Note that swapping the two kinds of quotes like this does not work (on Windows 7):
java -cp saxon9pe.jar net.sf.saxon.Query -s:"test.xhtm" -qs:'//*:div[@id="ps-content"]'
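A likely explanation, offered as an assumption rather than something verified on Windows 7: the Windows command line treats only double quotes as quoting characters, so the single quotes are passed through as part of the query while the inner double quotes are consumed. If the double quotes have to stay inside the query, they would need a backslash escape. The Python sketch below uses subprocess.list2cmdline, a helper in CPython's standard library that applies the Microsoft C runtime quoting rules, to show what the escaped argument looks like.

```python
import subprocess

# The argument exactly as the XQuery processor should receive it.
query_arg = '-qs://*:div[@id="ps-content"]'

# list2cmdline builds a Windows-style command line from an argument list,
# backslash-escaping any embedded double quotes.
escaped = subprocess.list2cmdline([query_arg])

print(escaped)
```

The printed form is the one to try on the command line in place of the single-quoted version; whether a given shell passes it through unchanged still needs checking on the target machine.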
Does anyone know how to add the TagSoup preprocessing step to the same command line?
Answer 2:
My latest failed attempts to integrate TagSoup into the same command line:
...\SaxonPE9-4-0-7J>java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query --xqueryVersion:3.0 -qs:"saxon:parse-html(fn:unparsed-text('page.html'))//*:div[@id='ps-content']"
Error on line 1 column 17
XPST0017 XQuery static error near #...:unparsed-text('page.html'))//#:
System function unparsed-text#1 is not available with this host language/version
Static error(s) in query
...\SaxonPE9-4-0-7J>java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query --xqueryVersion:3.0 -qs:"fn:parse-html(fn:unparsed-text('page.html'))//*:div[@id='ps-content']"
Error on line 1 column 14
XPST0017 XQuery static error near #...:unparsed-text('page.html'))//#:
System function unparsed-text#1 is not available with this host language/version
Static error(s) in query
...\SaxonPE9-4-0-7J>java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query --xqueryVersion:3.0 -qs:"saxon:parse-html(collection('page.html;unparsed=yes'))//*:div[@id='ps-content']"
Error on line 1 of *module with no systemId*:
FODC0002: The file or directory file:/C:/Users/diego/Downloads/SaxonPE9-4-0-7J/page.html;unparsed=yes does not exist
Query processing failed: Run-time errors were reported
...\SaxonPE9-4-0-7J>java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query --xqueryVersion:3.0 -qs:"saxon:parse-html(collection('page.html';'unparsed=yes'))//*:div[@id='ps-content']"
Error on line 1 column 39
XPST0003 XQuery syntax error near #...ion('page.html';'unparsed=yes'#:
expected ")", found ";"
Static error(s) in query
Source: https://stackoverflow.com/questions/17013688/how-to-extract-an-xpath-from-an-html-page-with-saxon-pe-commandline