Web Scraping with Scala [closed]

遥遥无期

3 answers

情深已故

2021-01-30 13:56

I'd recommend Goose: https://github.com/jiminoc/goose

It's not as general-purpose as you might need, but if you are scraping article content from popular sites, it may work out of the box. It also gives you a framework to build on if you want to extend its code to cover other sites.

星月不相逢

2021-01-30 14:08
I don't have a Scala-specific recommendation, but for the JVM in general I've had good success with:
- JSoup: you can use CSS selectors to "scrape" the document. Really nice to work with.
- Tagsoup: use it to convert your input HTML to XML, then use XML processors to "scrape".
The Tagsoup route actually works quite well with Scala since Scala's built-in XML "dsl" is pretty concise (if you can forgive its perf issues and occasional API weirdness). Also, Tagsoup will handle nearly any garbage document you give it. It also has niceties like built-in understanding of many HTML entities that other SAXParsers will choke on as being undeclared.

tl;dr - JSoup + CSS selectors if possible, otherwise Tagsoup + Scala XML. If slow is OK, run Tagsoup first, then JSoup on the result.
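For what it's worth, here is a minimal sketch of the JSoup + CSS selector route from Scala. It assumes the jsoup library (`org.jsoup:jsoup`) is on the classpath and Scala 2.13 or later for `scala.jdk.CollectionConverters`. The sample markup, the `div.post` selector, and the names `JsoupSelectorSketch` and `postLinks` are made up for illustration.

```scala
// Assumption: jsoup (org.jsoup:jsoup) is available on the classpath.
import org.jsoup.Jsoup
import scala.jdk.CollectionConverters._

object JsoupSelectorSketch {
  // Hypothetical sample markup standing in for a fetched page.
  val html: String =
    """<html><body>
      |  <div class="post"><a href="/a">First</a></div>
      |  <div class="post"><a href="/b">Second</a></div>
      |  <div class="ad"><a href="/x">Ignore me</a></div>
      |</body></html>""".stripMargin

  // Returns (link text, href) for every anchor inside a div with class "post".
  def postLinks(source: String): List[(String, String)] = {
    val doc = Jsoup.parse(source)
    // Elements is a java.util.List, so asScala exposes it as a Scala collection.
    doc.select("div.post a[href]").asScala.toList
      .map(a => (a.text(), a.attr("href")))
  }

  def main(args: Array[String]): Unit =
    postLinks(html).foreach { case (text, href) => println(s"$text -> $href") }
}
```

The selector does the filtering, so the Scala side is just a `map` over the matched elements.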
醉酒成梦

2021-01-30 14:19
First, there is a plethora of HTML scraping libraries on the JVM; all you need to do is pimp one of them (the "pimp my library" pattern).

The four I have used are:
- HtmlUnit - Will emulate the browser and even run Javascript
- Jericho - Preserves formatting and ideal if you want to edit the scraped HTML
- NekoHtml
- JSoup -- ~~does not work with Scala~~. Might work
I have used Selenium, but never for scraping. Scala has a wrapper around Selenium.

I would recommend pimping an existing Java library over some half-baked Scala lib.
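To illustrate what "pimping" a Java library looks like, here is a rough sketch using implicit classes. To stay dependency-free it enriches the JDK's own `org.w3c.dom` API as a stand-in for one of the scraping libraries above, so the input has to be well-formed XML; real-world HTML would need one of the lenient parsers listed. The names `DomEnrichmentSketch`, `RichNodeList`, `RichDocument`, `elements`, `tags`, and `parse` are hypothetical.

```scala
import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.{Document, Element, NodeList}

object DomEnrichmentSketch {
  // Enrichment: exposes an index-based Java NodeList as a Scala List of Elements.
  implicit class RichNodeList(private val nodes: NodeList) extends AnyVal {
    def elements: List[Element] =
      (0 until nodes.getLength).toList
        .map(i => nodes.item(i))
        .collect { case e: Element => e } // keep only Element nodes
  }

  // Enrichment: adds a Scala-friendly lookup-by-tag-name method to Document.
  implicit class RichDocument(private val doc: Document) extends AnyVal {
    def tags(name: String): List[Element] =
      doc.getElementsByTagName(name).elements
  }

  // Parses a well-formed XML string with the JDK's built-in DOM parser.
  def parse(xml: String): Document = {
    val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
    builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
  }

  def main(args: Array[String]): Unit = {
    val doc = parse("<ul><li>one</li><li>two</li></ul>")
    doc.tags("li").map(_.getTextContent).foreach(println)
  }
}
```

The Java classes are untouched; the implicit classes only add methods at the call site, which is why the same trick should carry over to whichever Java scraping library you pick.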