Web page recommender system

Submitted by 做~自己de王妃 on 2019-12-02 19:49:27

As Thomas Jungblut said, one could write several books on your questions ;-) I will try to give you a list of brief pointers - but be aware that there is no ready-to-use, off-the-shelf solution ...

  1. Crawling the internet: There are plenty of toolkits for doing this, like Scrapy for Python, crawler4j and Heritrix for Java, or WWW::Robot for Perl. For extracting the actual content from web pages, have a look at boilerpipe.

    http://scrapy.org/

    http://crawler.archive.org/

    http://code.google.com/p/crawler4j/

    https://metacpan.org/module/WWW::Robot

    http://code.google.com/p/boilerpipe/

  2. First of all, you can often use collaborative filtering instead of content-based approaches. But if you want good coverage, especially in the long tail, there is no way around analyzing the text. One thing to look at is topic modelling, e.g. LDA. Several LDA approaches are implemented in Mallet, Apache Mahout, and Vowpal Wabbit. For indexing, search, and text processing, have a look at Lucene. It is an awesome, mature piece of software.

    http://mallet.cs.umass.edu/

    http://mahout.apache.org/

    http://hunch.net/~vw/

    http://lucene.apache.org/

  3. Besides Apache Mahout, which also contains things like LDA (see above), clustering, and text processing, there are other toolkits available if you want to focus on collaborative filtering: LensKit, which is also implemented in Java, and MyMediaLite (disclaimer: I am the main author), which is implemented in C# but also has a Java port.

    http://lenskit.grouplens.org/

    http://ismll.de/mymedialite

    https://github.com/jcnewell/MyMediaLiteJava
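
To make the content-based idea from point 2 a bit more concrete, here is a minimal Python sketch of page-to-page similarity using TF-IDF weights and cosine similarity. This is a much simpler technique than LDA and it uses none of the toolkits listed above. The function names (`tfidf_vectors`, `cosine`, `most_similar`) and the toy pages are made up for illustration, and whitespace tokenization is a simplifying assumption.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each page id to a sparse {term: tf-idf weight} dict."""
    # Naive whitespace tokenization (assumption; real text needs more care).
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many pages does each term occur?
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    vectors = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        vectors[doc_id] = {
            term: count * math.log(n_docs / df[term]) for term, count in tf.items()
        }
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(doc_id, vectors, k=3):
    """Return up to k (page id, similarity) pairs, most similar first."""
    scores = [
        (other, cosine(vectors[doc_id], vec))
        for other, vec in vectors.items()
        if other != doc_id
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Hypothetical toy corpus standing in for extracted page text.
pages = {
    "p1": "python web crawler tutorial",
    "p2": "writing a web crawler in python",
    "p3": "chocolate cake recipe",
}
vectors = tfidf_vectors(pages)
ranking = most_similar("p1", vectors)
```

Because this needs no user behaviour at all, it also works for pages in the long tail that nobody has clicked yet. On a real crawl you would let something like Lucene do the tokenization and indexing.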

This should be a good read: Google News Personalization: Scalable Online Collaborative Filtering

It focuses on collaborative filtering rather than content-based recommendations, but it touches on some very interesting points such as scalability, item churn, algorithms, system setup, and evaluation.

Mahout has very good collaborative filtering techniques, which is what you describe as using the behaviour of the users (click, read, etc.), and you could introduce some content-based logic using the rescorer classes.
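
As a rough illustration of what collaborative filtering on user behaviour means, here is a minimal item-based sketch in plain Python. It is not Mahout's implementation or API. The click data and the names `item_similarities` and `recommend` are hypothetical, and cosine similarity on binary click sets is just one of several possible similarity measures.

```python
import math
from collections import defaultdict


def item_similarities(clicks):
    """Cosine similarity between pages, based on which users clicked them."""
    users_by_item = defaultdict(set)
    for user, items in clicks.items():
        for item in items:
            users_by_item[item].add(user)

    sims = defaultdict(dict)
    items = list(users_by_item)
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            common = len(users_by_item[a] & users_by_item[b])
            if common:
                s = common / math.sqrt(len(users_by_item[a]) * len(users_by_item[b]))
                sims[a][b] = s
                sims[b][a] = s
    return sims


def recommend(user, clicks, sims, k=5):
    """Score unseen pages by their summed similarity to the user's clicked pages."""
    seen = clicks[user]
    scores = defaultdict(float)
    for item in seen:
        for other, s in sims[item].items():
            if other not in seen:
                scores[other] += s
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical click log: user -> set of pages the user clicked.
clicks = {
    "alice": {"news/python", "news/java"},
    "bob": {"news/python", "news/java", "news/scala"},
    "carol": {"news/java", "news/scala"},
    "dave": {"news/cooking"},
}
sims = item_similarities(clicks)
recs = recommend("alice", clicks, sims)
```

The all-pairs loop is quadratic in the number of pages, so this only illustrates the idea. Making it scale is exactly the kind of work that the libraries mentioned here take care of.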

You might also want to have a look at Myrrix, which is in some ways the evolution of the Taste (aka recommendations) portion of Mahout. It also allows applying content-based logic on top of collaborative filtering using the rescorer classes.
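
The general idea behind rescoring can be sketched in a few lines of Python: take the scores produced by collaborative filtering, then filter or boost candidates based on what is known about their content. This is only a conceptual sketch, not the Mahout or Myrrix API. The `rescore` function, its parameters, the topic labels, and the boost factor are all assumptions made for illustration.

```python
def rescore(cf_scores, page_topics, preferred_topics, blocked_topics, boost=1.5):
    """Apply content-based rules on top of collaborative filtering scores.

    cf_scores: list of (page, score) pairs from a collaborative filter.
    page_topics: dict mapping page -> set of topic labels (hypothetical metadata).
    """
    rescored = []
    for page, score in cf_scores:
        topics = page_topics.get(page, set())
        if topics & blocked_topics:
            continue  # filter the page out entirely
        if topics & preferred_topics:
            score *= boost  # favour pages on topics the user prefers
        rescored.append((page, score))
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored


# Hypothetical collaborative filtering output and page metadata.
cf_scores = [("page_a", 0.9), ("page_b", 0.7), ("page_c", 0.5)]
page_topics = {"page_a": {"sports"}, "page_b": {"tech"}, "page_c": {"gossip"}}
result = rescore(
    cf_scores,
    page_topics,
    preferred_topics={"tech"},
    blocked_topics={"gossip"},
)
```

Here `page_c` is dropped because of its blocked topic, and `page_b` overtakes `page_a` once the boost is applied.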

If you are interested in Mahout, the Mahout in Action book would be the best place to start.
