jtidy | 易学教程

how to extract data using jtidy and xpath

Question: I have to extract the company name and face value from the page at http://money.rediff.com/companies/20-microns-ltd/15110088 and noticed that this task can be accomplished with the XPath API. Since this is an HTML page, I am using the JTidy parser. This is the XPath for the face value I have to extract: /html/body/div[4]/div[6]/div[9]/div/table/tbody/tr[4]/td[2] This is my code: URL oracle = new URL("http://money.rediff.com/companies/20-microns-ltd/15110088"); URLConnection yc = oracle.openConnection();
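A positional path copied from a browser (such as the `/html/body/div[4]/.../tbody/tr[4]/td[2]` above) depends on the exact tree the parser builds, and browsers add an implicit `tbody` that may not exist in the DOM another parser produces. A minimal sketch of an alternative is to locate the cell by its label text. Everything in `SAMPLE` is hypothetical markup (the real page layout is not known here), the class and method names are made up for illustration, and the JDK's own XML parser stands in for JTidy so the example runs without extra libraries. With JTidy, the `Document` would come from `tidy.parseDOM(...)` instead.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class FaceValueXPath {
    // Hypothetical, already well-formed markup standing in for the tidied page.
    static final String SAMPLE =
        "<html><body><table>"
        + "<tr><td>Company</td><td>20 Microns Ltd</td></tr>"
        + "<tr><td> Face Value </td><td> 5.00 </td></tr>"
        + "</table></body></html>";

    // Returns the text of the cell that follows the cell whose text equals `label`.
    // Assumes `label` contains no single quote, since it is spliced into the expression.
    static String valueFor(String xhtml, String label) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            String cell = "//td[normalize-space(.)='" + label + "']/following-sibling::td[1]";
            // normalize-space trims the cell text and collapses inner whitespace.
            return xpath.evaluate("normalize-space(" + cell + ")", doc);
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(valueFor(SAMPLE, "Face Value")); // prints 5.00
    }
}
```

If no cell matches the label, `valueFor` returns an empty string rather than failing, so callers may want to check for that case.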

JTidy can't handle HTML tags inside script element

Question: (This is a follow-up to a problem I had a few days ago, where JTidy was reporting 3 errors inside a 300k HTML document, but not reporting where. After some grinding on the problem, I found what appears to be causing the error, and I have a strong suspicion of why, but I haven't decided what to do about it yet.) Here is a small standalone HTML expression that causes JTidy to report an error: <html> <body> Some text. <script type="text/javascript"> var foo = "Press <u>ESC</u> to continue"; <

XPath How to retrieve the value of a table cell from html document

Question: I have an HTML document, and somewhere inside it is the table below. I can get the table rows as Java DOM objects. What is not clear to me is how to extract the value of a table cell when the value is a string, and also when it is a binary resource. I am using code like: XPath xpath; XPathExpression expr; NodeList nodes=null; // Use XPath to obtain whatever you want from the (X)HTML try{ xpath = XPathFactory.newInstance().newXPath(); //<table class="data"> NodeList list = doc
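One way to read both kinds of cell, sketched under assumptions: a text cell is read through its string value, while a cell holding a binary resource is assumed to reference it through an `img` element, so what the DOM can give you is the `src` attribute (the bytes would then be fetched separately). The `SAMPLE` markup, the two-column layout, and the names `TableCells` and `secondColumn` are hypothetical. Only the `class="data"` table comes from the question. The JDK parser is used here in place of JTidy so the sketch is self-contained.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class TableCells {
    // Hypothetical table: one text cell and one cell that embeds an image.
    static final String SAMPLE =
        "<html><body><table class=\"data\">"
        + "<tr><td>Name</td><td> Widget </td></tr>"
        + "<tr><td>Photo</td><td><img src=\"/img/widget.png\" alt=\"\"/></td></tr>"
        + "</table></body></html>";

    // For each row, returns the image src of the second cell if it has one,
    // otherwise the cell's whitespace-normalized text.
    static List<String> secondColumn(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList rows = (NodeList) xpath.evaluate(
                "//table[@class='data']//tr", doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < rows.getLength(); i++) {
                // Relative expressions are evaluated against the current row node.
                String src = xpath.evaluate("td[2]//img/@src", rows.item(i));
                String text = xpath.evaluate("normalize-space(td[2])", rows.item(i));
                values.add(src.isEmpty() ? text : src);
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("could not read table", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(secondColumn(SAMPLE));
    }
}
```

Evaluating relative paths against each row keeps the row loop simple and avoids rebuilding a long absolute expression per cell.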

jTidy pretty print custom HTML tag

Question: I'm trying to use JTidy to pretty-print well-formed HTML generated by the user: <div class="component-holder ng-binding ng-scope ui-draggable ui-draggable-handle" data-component="cronos-datasource" id="cronos-datasource-817277"> <datasource name="" entity="" key="" endpoint="" rows-per-page=""> <i class="cpn cpn-datasource"></i> </datasource> </div> This is my config: Tidy tidy = new Tidy(); tidy.setXHTML(true); tidy.setIndentContent(true); tidy.setPrintBodyOnly(true); tidy.setTidyMark

How do I make JTidy make HTML documents well-formed?

Question: I'm using JTidy v. r938. I'm using this code to attempt to clean up a page … final Tidy tidy = new Tidy(); tidy.setQuiet(false); tidy.setShowWarnings(true); tidy.setShowErrors(0); tidy.setMakeClean(true); Document document = tidy.parseDOM(conn.getInputStream(), null); But when I parse this URL -- http://www.chicagoreader.com/chicago/EventSearch?narrowByDate=This+Week&eventCategory=93922&keywords=&page=1, things aren't getting cleaned up. For example, the META tags on the page, like <META http

JTidy Java API to convert HTML to XHTML

Question: I am using JTidy to convert from HTML to XHTML, but I found this tag in my XHTML file . Can I prevent it? This is my code: //from html to xhtml try { fis = new FileInputStream(htmlFileName); } catch (java.io.FileNotFoundException e) { System.out.println("File not found: " + htmlFileName); } Tidy tidy = new Tidy(); tidy.setShowWarnings(false); tidy.setXmlTags(false); tidy.setInputEncoding("UTF-8"); tidy.setOutputEncoding("UTF-8"); tidy.setXHTML(true);// tidy.setMakeClean(true); Document

JTidy reports “3 errors were found!”… but does not say what they are

Question: I have a large block of programmatically generated HTML. I ran it through Tidy (version r938) with the following Java code: StringReader inStr = new StringReader(htmlInput); StringWriter outStr = new StringWriter(); Tidy tidy = new Tidy(); tidy.setXHTML(true); tidy.parseDOM(inStr, outStr); I get the following output: InputStream: Document content looks like HTML 4.01 Transitional 247 warnings, 3 errors were found! This document has errors that must be fixed before using HTML Tidy to generate

how to take title text from any web page in java

Question: I am using Java to fetch the title text from a web page. I have fetched an image from the web page by tag name as follows: int i=1; InputStream in=new URL("www.yahoo.com").openStream(); org.w3c.dom.Document doc= new Tidy().parseDOM(in, null); NodeList img=doc.getElementsByTagName("img"); ArrayList<String> list=new ArrayList<String>(); list.add(img.item(i).getAttributes().getNamedItem("src").getNodeValue()); It is working, but I want to fetch the title tag from the web page (www.yahoo.com) using the same code as
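The same `getElementsByTagName` call used for `img` can be pointed at `title`. The difference is that the title's text lives in a child text node rather than in an attribute. The following is only a sketch: `SAMPLE`, `TitleText`, `parse`, and `titleOf` are hypothetical names, and the JDK parser replaces JTidy so the code runs on its own. It reads the first child's node value instead of calling `getTextContent`, on the assumption that a DOM Level 3 method may not be supported by every DOM implementation.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class TitleText {
    // Hypothetical page standing in for the tidied document.
    static final String SAMPLE =
        "<html><head><title> Example page </title></head>"
        + "<body><img src=\"a.png\" alt=\"\"/></body></html>";

    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    // Returns the trimmed text of the first <title> element, or "" if there is none.
    static String titleOf(Document doc) {
        NodeList titles = doc.getElementsByTagName("title");
        if (titles.getLength() == 0) {
            return "";
        }
        Node text = titles.item(0).getFirstChild(); // the text node inside <title>
        return (text == null || text.getNodeValue() == null)
            ? "" : text.getNodeValue().trim();
    }

    public static void main(String[] args) {
        System.out.println(titleOf(parse(SAMPLE))); // prints Example page
    }
}
```

Because `titleOf` only takes an `org.w3c.dom.Document`, it should also accept the document returned by `parseDOM` in the question's code, though that combination is not tested here.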

How to best use JTidy with a Spring servlet container?

Question: I have a Java servlet container using the Spring Framework. Pages are generated from JSPs using Spring to wire everything up. The resulting HTML sent to the user isn't as, well, tidy as I'd like. I'd like to send the HTML to Tidy right before it's sent to the client browser. I'll set it up to work in development and be turned off in production; it's a winner, from my point of view, as it'll gain me more ease of maintenance. Suggestions on how to make that work cleanly in Spring? Answer 1: Why do

jTidy and TagSoup documentation

Question: I'm looking for documentation (official documentation, if possible) for the TagSoup and jTidy libraries. I want to use these libraries to manipulate HTML "tag soup" files that include XML tags from different namespaces mixed in among HTML (HTML, XHTML, or HTML5) tags. I have tested HTMLCleaner, NekoHTML, and Jericho, but I can't find documentation for jTidy and TagSoup beyond the simplest examples of cleaning a file. I need documentation about manipulating contents, replacing tags, extracting info, etc...