jtidy

How to extract data using JTidy and XPath

不羁岁月 submitted on 2019-12-25 01:08:02
Question: I have to extract the company name and face value from http://money.rediff.com/companies/20-microns-ltd/15110088. I noticed that this task can be accomplished using the XPath API. Since this is an HTML page, I am using the JTidy parser. This is the XPath for the face value that I have to extract: /html/body/div[4]/div[6]/div[9]/div/table/tbody/tr[4]/td[2] This is my code: URL oracle = new URL("http://money.rediff.com/companies/20-microns-ltd/15110088"); URLConnection yc = oracle.openConnection();
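The lookup described in this question can be sketched with the JDK's own XPath API. This is only a minimal sketch under stated assumptions: the class name, the sample markup and its values are hypothetical (the real page layout is not shown here), and the JDK's DocumentBuilder stands in for the org.w3c.dom.Document that a tidying parser would return. It selects the cell by its row label instead of by absolute position, since paths copied from a browser often contain tbody elements that may not be present in the parsed DOM.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class FaceValueXPath {

    // Hypothetical, well-formed stand-in for the tidied page (values are invented).
    static final String SAMPLE =
        "<html><body><div><table>"
        + "<tr><td>Company</td><td>Example Ltd</td></tr>"
        + "<tr><td>Face Value</td><td> 10 </td></tr>"
        + "</table></div></body></html>";

    // Parses a well-formed XHTML string into a W3C DOM document.
    static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xhtml)));
    }

    // Returns the text of the second cell in the row whose first cell equals the label.
    // The label is concatenated into the expression, so it must not contain quotes.
    static String valueForLabel(Document doc, String label) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        String expr = "//tr[td[1][normalize-space()='" + label + "']]/td[2]";
        return xpath.evaluate(expr, doc).trim();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE);
        System.out.println(valueForLabel(doc, "Company"));
        System.out.println(valueForLabel(doc, "Face Value"));
    }
}
```

If no row matches the label, evaluate returns an empty string, so callers may want to treat an empty result as "not found".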

JTidy can't handle HTML tags inside script element

天涯浪子 submitted on 2019-12-24 09:27:33
Question: (This is a follow-up to a problem I had a few days ago, where JTidy was reporting 3 errors inside a 300k HTML document, but not reporting where. After some grinding on the problem, I found what appears to be causing the error, and I have a strong suspicion of why, but I haven't decided what to do about it yet.) Here is a small standalone HTML expression that causes JTidy to report an error: <html> <body> Some text. <script type="text/javascript"> var foo = "Press <u>ESC</u> to continue"; <

XPath: how to retrieve the value of a table cell from an HTML document

允我心安 submitted on 2019-12-23 18:22:45
Question: I have an HTML document, and somewhere inside it is the table below. I can get the table rows as Java DOM objects. What is not clear to me is how to extract the value of a table cell when the value is a string, and also when it is a binary resource. I am using code like: XPath xpath; XPathExpression expr; NodeList nodes=null; // Use XPath to obtain whatever you want from the (X)HTML try{ xpath = XPathFactory.newInstance().newXPath(); //<table class="data"> NodeList list = doc
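One possible way to tell the two kinds of cell apart is sketched below. It assumes, purely for illustration, that a "binary resource" shows up as an img element inside the cell; the class name, helper names and sample table are hypothetical, and the JDK's DocumentBuilder is used in place of whatever produced the asker's doc. For a text cell it returns the XPath string value; for a resource cell it returns the src attribute, which would still have to be resolved and downloaded separately to obtain the bytes.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TableCellValues {

    // Hypothetical sample; only the class="data" attribute comes from the question.
    static final String SAMPLE =
        "<html><body><table class=\"data\">"
        + "<tr><td> Name </td><td><img src=\"logo.png\" alt=\"logo\"/></td></tr>"
        + "</table></body></html>";

    // Describes one cell: the image reference if it holds an img, otherwise its text.
    static String describeCell(Node cell, XPath xpath) throws Exception {
        Node img = (Node) xpath.evaluate(".//img", cell, XPathConstants.NODE);
        if (img != null) {
            return "resource:" + ((Element) img).getAttribute("src");
        }
        return "text:" + xpath.evaluate("normalize-space(.)", cell);
    }

    // Collects a description of every cell in the table with class "data".
    static List<String> readCells(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xhtml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList cells = (NodeList) xpath.evaluate(
            "//table[@class='data']//td", doc, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < cells.getLength(); i++) {
            result.add(describeCell(cells.item(i), xpath));
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(readCells(SAMPLE));
    }
}
```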

jTidy pretty print custom HTML tag

北慕城南 submitted on 2019-12-23 07:28:13
Question: I'm trying to use JTidy to pretty-print well-formed HTML generated by the user: <div class="component-holder ng-binding ng-scope ui-draggable ui-draggable-handle" data-component="cronos-datasource" id="cronos-datasource-817277"> <datasource name="" entity="" key="" endpoint="" rows-per-page=""> <i class="cpn cpn-datasource"></i> </datasource> </div> This is my config: Tidy tidy = new Tidy(); tidy.setXHTML(true); tidy.setIndentContent(true); tidy.setPrintBodyOnly(true); tidy.setTidyMark

How do I make JTidy make HTML documents well-formed?

╄→尐↘猪︶ㄣ submitted on 2019-12-20 23:30:17
Question: I'm using JTidy v. r938. I'm using this code to attempt to clean up a page … final Tidy tidy = new Tidy(); tidy.setQuiet(false); tidy.setShowWarnings(true); tidy.setShowErrors(0); tidy.setMakeClean(true); Document document = tidy.parseDOM(conn.getInputStream(), null); But when I parse this URL -- http://www.chicagoreader.com/chicago/EventSearch?narrowByDate=This+Week&eventCategory=93922&keywords=&page=1, things aren't getting cleaned up. For example, the META tags on the page, like <META http

JTidy Java API to convert HTML to XHTML

时光总嘲笑我的痴心妄想 submitted on 2019-12-19 03:57:15
Question: I am using JTidy to convert from HTML to XHTML, but I found this entity in my XHTML file: &nbsp;. Can I prevent it? This is my code: //from html to xhtml try { fis = new FileInputStream(htmlFileName); } catch (java.io.FileNotFoundException e) { System.out.println("File not found: " + htmlFileName); } Tidy tidy = new Tidy(); tidy.setShowWarnings(false); tidy.setXmlTags(false); tidy.setInputEncoding("UTF-8"); tidy.setOutputEncoding("UTF-8"); tidy.setXHTML(true);// tidy.setMakeClean(true); Document

JTidy reports “3 errors were found!”… but does not say what they are

六眼飞鱼酱① submitted on 2019-12-12 03:05:40
Question: I have a large block of programmatically generated HTML. I ran it through Tidy (version r938) with the following Java code: StringReader inStr = new StringReader(htmlInput); StringWriter outStr = new StringWriter(); Tidy tidy = new Tidy(); tidy.setXHTML(true); tidy.parseDOM(inStr, outStr); I get the following output: InputStream: Document content looks like HTML 4.01 Transitional 247 warnings, 3 errors were found! This document has errors that must be fixed before using HTML Tidy to generate

How to get the title text from any web page in Java

让人想犯罪 __ submitted on 2019-12-10 11:17:42
Question: I am using Java to fetch the title text from a web page. I have fetched an image from the web page using the tag name as follows: int i=1; InputStream in=new URL("www.yahoo.com").openStream(); org.w3c.dom.Document doc= new Tidy().parseDOM(in, null); NodeList img=doc.getElementsByTagName("img"); ArrayList<String> list=new ArrayList<String>(); list.add(img.item(i).getAttributes().getNamedItem("src").getNodeValue()); It is working, but I want to fetch the title tag from the web page (www.yahoo.com) using the same code as
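The same getElementsByTagName approach can be pointed at the title element. The sketch below is an assumption-laden illustration, not the accepted answer: the class name, helper and sample markup are hypothetical, and the JDK's DocumentBuilder is used here so the example runs without extra libraries. It reads the title's first child text node through getNodeValue, mirroring how the question reads the src attribute value.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class PageTitle {

    // Hypothetical, well-formed stand-in for a tidied page.
    static final String SAMPLE =
        "<html><head><title> Example page </title></head>"
        + "<body><img src=\"a.png\" alt=\"a\"/></body></html>";

    // Parses a well-formed XHTML string into a W3C DOM document.
    static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xhtml)));
    }

    // Returns the trimmed title text, "" for an empty title, or null if there is no title element.
    static String titleOf(Document doc) {
        NodeList titles = doc.getElementsByTagName("title");
        if (titles.getLength() == 0) {
            return null;
        }
        Node text = titles.item(0).getFirstChild();
        return text == null ? "" : text.getNodeValue().trim();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(titleOf(parse(SAMPLE)));
    }
}
```

Checking getLength() before calling item(0) avoids a NullPointerException on pages that have no title element.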

How to best use JTidy with a Spring servlet container?

喜你入骨 submitted on 2019-12-10 05:49:45
Question: I have a Java servlet container using the Spring Framework. Pages are generated from JSPs using Spring to wire everything up. The resulting HTML sent to the user isn't as, well, tidy as I'd like. I'd like to send the HTML to Tidy right before it's sent to the client browser. I'll set it up to work in development and be turned off in production; it's a winner, from my point of view, as it'll gain me more ease of maintenance. Suggestions on how to make that work cleanly in Spring? Answer 1: Why do

jTidy and TagSoup documentation

泪湿孤枕 submitted on 2019-12-10 04:24:46
Question: I'm looking for documentation (official documentation, if possible) for the TagSoup and jTidy libraries. I want to use these libraries to manipulate HTML "tag soup" files that include XML tags with different namespaces mixed in between HTML (HTML, XHTML or HTML5) tags. I have tested HTMLCleaner, NekoHTML and Jericho, but I can't find documentation for jTidy and TagSoup apart from the simplest examples of cleaning a file. I need documentation about manipulating contents, replacing tags, extracting info, etc...