Can you provide examples of parsing HTML?

走了就别回头了 2020-11-22 13:49

How do you parse HTML with a variety of languages and parsing libraries?


When answering:

Individual comments will be linked to in answers to questions

29 Answers
  • 2020-11-22 14:29

    language: Ruby
    library: Nokogiri

    #!/usr/bin/env ruby
    
    require "nokogiri"
    require "open-uri"
    
    # URI.open is required on Ruby 3.0+, where Kernel#open no longer accepts URLs
    doc = Nokogiri::HTML(URI.open('http://www.example.com'))
    hrefs = doc.search('a').map{ |n| n['href'] }
    
    puts hrefs
    

    Which outputs:

    /
    /domains/
    /numbers/
    /protocols/
    /about/
    /go/rfc2606
    /about/
    /about/presentations/
    /about/performance/
    /reports/
    /domains/
    /domains/root/
    /domains/int/
    /domains/arpa/
    /domains/idn-tables/
    /protocols/
    /numbers/
    /abuse/
    http://www.icann.org/
    mailto:iana@iana.org?subject=General%20website%20feedback
    

    This is a minor spin on the one above, resulting in an output that is usable for a report. I only return the first and last elements in the list of hrefs:

    #!/usr/bin/env ruby
    
    require "nokogiri"
    require "open-uri"
    
    # URI.open is required on Ruby 3.0+, where Kernel#open no longer accepts URLs
    doc = Nokogiri::HTML(URI.open('http://nokogiri.org'))
    hrefs = doc.search('a[href]').map{ |n| n['href'] }
    
    puts hrefs
      .each_with_index                     # add an array index
      .minmax{ |a,b| a.last <=> b.last }   # find the first and last element
      .map{ |h,i| '%3d %s' % [1 + i, h ] } # format the output
    
    Which outputs:

      1 http://github.com/tenderlove/nokogiri
    100 http://yokolet.blogspot.com
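
The first-and-last report step above can also be sketched in Python, independent of any parsing library. This is only an illustration: the function name `first_and_last` and the sample hrefs are hypothetical, and the list of hrefs is assumed to have been extracted already.

```python
def first_and_last(hrefs):
    """Return the first and last hrefs, each prefixed with its 1-based position."""
    if not hrefs:
        return []
    # Pair every href with its position, starting at 1 as in the Ruby output.
    numbered = list(enumerate(hrefs, start=1))
    ends = [numbered[0], numbered[-1]]
    # Right-align the position in a 3-character column, like '%3d %s' in Ruby.
    return ["%3d %s" % (position, href) for position, href in ends]


# Hypothetical sample data standing in for hrefs scraped from a page.
sample = ["/", "/domains/", "/numbers/"]
report = first_and_last(sample)
for line in report:
    print(line)
```

Picking the two ends by index avoids sorting or comparing every element, which is all the report needs when the list is already in document order.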
    
  • 2020-11-22 14:31

    Language: JavaScript
    Library: jQuery

    $.each($('a[href]'), function(){
        console.debug(this.href);
    });
    

    (This uses Firebug's console.debug for output.)

    To load and parse any HTML page:

    $.get('http://stackoverflow.com/', function(page){
        $(page).find('a[href]').each(function(){
            console.debug(this.href);
        });
    });
    

    I used the chained .each() method for this one; I think it reads more cleanly when chaining methods.

  • 2020-11-22 14:31

    Language: Java
    Library: jsoup

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    
    public class HtmlTest {
        // Jsoup.parse(String) declares no checked exceptions, so no throws clause is needed
        public static void main(final String[] args) {
            final Document document = Jsoup.parse("<html><body><ul><li><a href=\"http://google.com\">google</li><li><a HREF=\"http://reddit.org\" target=\"_blank\">reddit</a></li><li><a name=\"nothing\">nothing</a><li></ul></body></html>");
            final Elements links = document.select("a[href]");
            for (final Element element : links) {
                System.out.println(element.attr("href"));
            }
        }
    }
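
For comparison, here is a rough Python sketch that runs the same sample markup through the standard library's `html.parser`. It is not a jsoup equivalent, only an illustration that a lenient parser copes with the unclosed tags and the uppercase `HREF`; the class name `HrefCollector` is made up for this example.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so HREF matches "href".
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value is not None:
                self.hrefs.append(value)


# Same markup as the Java example, including its unclosed tags.
SAMPLE = (
    '<html><body><ul><li><a href="http://google.com">google</li>'
    '<li><a HREF="http://reddit.org" target="_blank">reddit</a></li>'
    '<li><a name="nothing">nothing</a><li></ul></body></html>'
)

collector = HrefCollector()
collector.feed(SAMPLE)
collector.close()
print(collector.hrefs)  # ['http://google.com', 'http://reddit.org']
```

The anchor that only has a `name` attribute is skipped, which mirrors what the `a[href]` selector does in the Java code.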
    
  • 2020-11-22 14:33

    language: Python
    library: BeautifulSoup

    from bs4 import BeautifulSoup  # Beautiful Soup 4; the older BeautifulSoup 3 module is Python 2 only
    
    html = "<html><body>"
    for link in ("foo", "bar", "baz"):
        html += '<a href="http://%s.com">%s</a>' % (link, link)
    html += "</body></html>"
    
    soup = BeautifulSoup(html, "html.parser")
    links = soup.find_all('a', href=True)  # find <a> with a defined href attribute
    print(links)
    

    output:

    [<a href="http://foo.com">foo</a>,
     <a href="http://bar.com">bar</a>,
     <a href="http://baz.com">baz</a>]
    

    also possible:

    for link in links:
        print(link['href'])
    

    output:

    http://foo.com
    http://bar.com
    http://baz.com
    
  • 2020-11-22 14:33

    Language: Python
    Library: HTQL

    import htql
    
    page = "<a href=a.html>1</a><a href=b.html>2</a><a href=c.html>3</a>"
    query = "<a>:href,tx"
    
    for url, text in htql.HTQL(page, query):
        print(url, text)
    

    Simple and intuitive.
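
For readers without HTQL installed, the following is a minimal sketch of the same idea using only Python's standard library. It assumes, as the loop above suggests, that the query yields each link's href together with its text; the class name `LinkTextCollector` is hypothetical and this is not how HTQL itself works.

```python
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Collect (href, text) pairs for <a> elements that carry an href."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text while inside an <a> that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.pairs.append((self._href, "".join(self._text)))
            self._href = None


page = "<a href=a.html>1</a><a href=b.html>2</a><a href=c.html>3</a>"
collector = LinkTextCollector()
collector.feed(page)
collector.close()
pairs = collector.pairs

for url, text in pairs:
    print(url, text)
```

It is considerably longer than the one-line query, which is the trade-off for avoiding a third-party dependency.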

  • 2020-11-22 14:34

    language: Ruby
    library: Hpricot

    #!/usr/bin/ruby
    
    require 'hpricot'
    
    html = '<html><body>'
    ['foo', 'bar', 'baz'].each {|link| html += "<a href=\"http://#{link}.com\">#{link}</a>" }
    html += '</body></html>'
    
    doc = Hpricot(html)
    doc.search('//a').each {|elm| puts elm.attributes['href'] }
    