Scala and HTML parsing

北海茫月 2020-12-14 08:53

How do you load an HTML DOM document into Scala? The XML singleton had errors when trying to load the xmlns tags.

import java.net._
import java.io._
import s         


        
5 answers
  • 2020-12-14 09:05

    I just tried to use this answer with Scala 2.8.1 and ended up using the work from:

    http://www.hars.de/2009/01/html-as-xml-in-scala.html

    The interesting bit that I needed was:

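    // TagSoup's JAXP factory produces a SAX parser that tolerates malformed HTML
    val parserFactory = new org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl
    val parser = parserFactory.newSAXParser()
    // The InputSource is given the URL of the page to fetch
    val source = new org.xml.sax.InputSource("http://www.scala-lang.org")
    // NoBindingFactoryAdapter turns the SAX events into scala.xml nodes
    val adapter = new scala.xml.parsing.NoBindingFactoryAdapter
    adapter.loadXML(source, parser)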
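
    Once the page is loaded, the result can be queried like any other scala.xml tree. The following is only a minimal sketch, assuming TagSoup and the scala-xml module are on the classpath. The object name TagSoupExample and the helpers parseHtml and links are hypothetical names chosen for illustration, and the input is read from a string rather than a URL so the example does not need network access.

```scala
import java.io.StringReader
import org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl
import org.xml.sax.InputSource
import scala.xml.Node
import scala.xml.parsing.NoBindingFactoryAdapter

// Hypothetical helper object, not part of TagSoup or scala-xml.
object TagSoupExample {
  // Parse possibly malformed HTML from a string into a scala.xml tree.
  def parseHtml(html: String): Node = {
    val parser = new SAXFactoryImpl().newSAXParser()
    val source = new InputSource(new StringReader(html))
    new NoBindingFactoryAdapter().loadXML(source, parser)
  }

  // Collect the href attribute of every <a> element in the tree.
  def links(doc: Node): Seq[String] =
    (doc \\ "a").map(a => (a \ "@href").text)
}
```

    For example, TagSoupExample.links(TagSoupExample.parseHtml(html)) should return the link targets even when html contains unclosed tags, because TagSoup repairs the markup before scala.xml sees it.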
    
  • 2020-12-14 09:05

    Scala Scraper

    I recommend Scala Scraper, which lets you parse HTML elegantly like this:

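    import net.ruippeixotog.scalascraper.browser.JsoupBrowser
    import net.ruippeixotog.scalascraper.dsl.DSL._
    import net.ruippeixotog.scalascraper.dsl.DSL.Extract._

    // Parse elements from files, URLs or plain strings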
    val browser = JsoupBrowser()
    val doc = browser.parseFile("core/src/test/resources/example.html")
    val doc2 = browser.get("http://example.com")
    val doc3 = browser.parseString("<html><h1>parse me</h1></html>")
    
    // Extract the text inside the element with id "header"
    doc >> text("#header")
    
    // Extract the <span> elements inside #menu
    val items = doc >> elementList("#menu span")
    
    // From each item, extract all the text inside their <a> elements
    items.map(_ >> allText("a"))
    

    The examples are taken from the Scala Scraper readme.

  • 2020-12-14 09:08

    Try using scala.xml.parsing.XhtmlParser instead.
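
    A minimal sketch of that suggestion, assuming the scala-xml module is on the classpath and the input is already well-formed XHTML (as far as I know, XhtmlParser is an XML parser that knows the XHTML entities, so it will not repair broken markup the way TagSoup does). The object name XhtmlExample and its helpers are hypothetical names used only for illustration.

```scala
import scala.io.Source
import scala.xml.NodeSeq
import scala.xml.parsing.XhtmlParser

// Hypothetical helper object, not part of scala-xml.
object XhtmlExample {
  // Parse a well-formed XHTML string into a scala.xml NodeSeq.
  def parse(xhtml: String): NodeSeq =
    XhtmlParser(Source.fromString(xhtml))

  // Collect the text of every <p> element in the document.
  def paragraphs(doc: NodeSeq): Seq[String] =
    (doc \\ "p").map(_.text)
}
```

    The input may carry an xmlns declaration on the root element, for example <html xmlns="http://www.w3.org/1999/xhtml">, which is the kind of markup the question mentions.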

  • 2020-12-14 09:08
    /* 
    Copyright (c) 2008 Florian Hars, BIK Aschpurwis+Behrens GmbH, Hamburg 
    Copyright (c) 2002-2008 EPFL, Lausanne, unless otherwise specified. 
    All rights reserved. 
    
    This software was developed by the Programming Methods Laboratory of the 
    Swiss Federal Institute of Technology (EPFL), Lausanne, Switzerland. 
    
    Permission to use, copy, modify, and distribute this software in source 
    or binary form for any purpose with or without fee is hereby granted, 
    provided that the following conditions are met: 
    
    1. Redistributions of source code must retain the above copyright 
      notice, this list of conditions and the following disclaimer. 
    
    2. Redistributions in binary form must reproduce the above copyright 
      notice, this list of conditions and the following disclaimer in the 
      documentation and/or other materials provided with the distribution. 
    
    3. Neither the name of the EPFL nor the names of its contributors 
      may be used to endorse or promote products derived from this 
      software without specific prior written permission. 
    
    
     THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND 
     ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 
     IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 
     ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE 
     FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 
     DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 
     SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 
     CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 
     LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY 
     OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF 
     SUCH DAMAGE. 
    */ 
    
    package tagsoup 
    
    import org.xml.sax.InputSource 
    import javax.xml.parsers.SAXParser 
    import org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl 
    import scala.xml.parsing.FactoryAdapter 
    import scala.xml._ 
    
    class TagSoupFactoryAdapter extends FactoryAdapter { 
    
      val parserFactory = new SAXFactoryImpl 
      parserFactory.setNamespaceAware(false) 
    
      val emptyElements = Set("area", "base", "br", "col", "hr", "img", 
                          "input", "link", "meta", "param") 
    
      /** Tests if an XML element contains text. 
       * @return true if element named <code>localName</code> contains text. 
       */ 
      def nodeContainsText(localName: String) = !(emptyElements contains localName) 
    
      /** Creates an element node. */
      def createNode(pre: String, label: String, attrs: MetaData,
                     scpe: NamespaceBinding, children: List[Node]): Elem =
        Elem(pre, label, attrs, scpe, children: _*)
    
      /** Creates a text node. */
      def createText(text: String) =
        Text(text)
    
      /** Ignore Processing Instructions 
      */ 
      def createProcInstr(target: String, data: String) = Nil 
    
      /** load XML document 
       * @param source 
       * @return a new XML document object 
       */ 
      override def loadXML(source: InputSource) = { 
        val parser: SAXParser = parserFactory.newSAXParser() 
    
        scopeStack.push(TopScope) 
        parser.parse(source, this) 
        scopeStack.pop 
        rootElem 
      } 
    
    }
    

    See: How to use TagSoup with Scala XML

  • 2020-12-14 09:10

    This may help you: "Processing real world HTML as if it were XML in Scala"
