Implementing PageRank using MapReduce

淺唱寂寞╮ 提交于 2019-12-02 17:46:07

Here is a pseudocode:

map( key: [url, pagerank], value: outlink_list )
    for each outlink in outlink_list
        emit( key: outlink, value: pagerank/size(outlink_list) )

    emit( key: url, value: outlink_list )

reducer( key: url, value: list_pr_or_urls )
    outlink_list = []
    pagerank = 0

    for each pr_or_urls in list_pr_or_urls
        if is_list( pr_or_urls )
            outlink_list = pr_or_urls
        else
            pagerank += pr_or_urls

    pagerank = 1 - DAMPING_FACTOR + ( DAMPING_FACTOR * pagerank )

    emit( key: [url, pagerank], value: outlink_list )

It is important that in the reduce you should output outlinks and not inlinks, as some articles on the intenret suggests. This way the consecutive iterations will also have outlinks as input of the mapper.

Pay attention that multiple outlinks with the same address from the same page count as one. Also, don't count loops (page linking to itself).

The damping factor is traditionally 0.85, although you can play around with other values, too.

We iteratively evaluate PR. PR(x) = Sum(PR(a)*weight(a), a in in_links) by

map ((url,PR), out_links) //PR = random at start
for link in out_links
   emit(link, ((PR/size(out_links)), url))

reduce(url, List[(weight, url)):
   PR =0
   for v in weights
       PR = PR + v
   Set urls = all urls from list

   emit((url, PR), urls)

so the output equals input and we can do this until coverage.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!