Implementing PageRank using MapReduce

长发绾君心 2021-01-31 22:42

I'm trying to get my head around an issue with the theory of implementing PageRank with MapReduce.

I have the following simple scenario with three nodes: A B C.

3 Answers
  • 2021-01-31 23:00

    A detailed explanation with Python code, by Michael Nielsen.

  • 2021-01-31 23:15

    Here is some pseudocode:

    map( key: [url, pagerank], value: outlink_list )
        for each outlink in outlink_list
            emit( key: outlink, value: pagerank/size(outlink_list) )
    
        emit( key: url, value: outlink_list )
    
    reducer( key: url, value: list_pr_or_urls )
        outlink_list = []
        pagerank = 0
    
        for each pr_or_urls in list_pr_or_urls
            if is_list( pr_or_urls )
                outlink_list = pr_or_urls
            else
                pagerank += pr_or_urls
    
        pagerank = 1 - DAMPING_FACTOR + ( DAMPING_FACTOR * pagerank )
    
        emit( key: [url, pagerank], value: outlink_list )
    

    It is important that the reducer outputs outlinks and not inlinks, as some articles on the internet suggest. This way, consecutive iterations will also have outlinks as the mapper input.

    Note that multiple outlinks to the same address from the same page count as one. Also, don't count self-loops (a page linking to itself).
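
As a rough illustration of that clean-up step, here is a minimal Python sketch. The helper name `clean_outlinks` and the sample links are hypothetical, not part of the pseudocode above; it assumes links are plain strings that can be compared directly.

```python
def clean_outlinks(url, raw_links):
    """Return the outlinks of `url` without duplicates or self-links.

    The original order of first appearance is kept.
    """
    seen = set()
    cleaned = []
    for link in raw_links:
        # Skip self-loops and links we have already recorded.
        if link == url or link in seen:
            continue
        seen.add(link)
        cleaned.append(link)
    return cleaned


# Hypothetical page "A" that links to B twice and to itself once.
print(clean_outlinks("A", ["B", "C", "B", "A"]))
```

Running this before the first map step means `size(outlink_list)` already reflects the de-duplicated count.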

    The damping factor is traditionally 0.85, although you can play around with other values, too.
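
The pseudocode above could be sketched in plain Python roughly as follows. This is only an in-memory simulation under stated assumptions: the `grouped` dictionary stands in for the shuffle phase of a real MapReduce framework, the function names are made up for illustration, and the three-page graph is a hypothetical example.

```python
from collections import defaultdict

DAMPING_FACTOR = 0.85


def map_page(url, pagerank, outlinks):
    """Emit a rank share for every outlink, plus the page's own outlink list."""
    for outlink in outlinks:
        yield outlink, pagerank / len(outlinks)
    yield url, list(outlinks)


def reduce_page(url, values):
    """Sum the incoming shares, recover the outlink list, apply damping."""
    outlinks = []
    total = 0.0
    for value in values:
        if isinstance(value, list):
            outlinks = value
        else:
            total += value
    pagerank = 1 - DAMPING_FACTOR + DAMPING_FACTOR * total
    return url, pagerank, outlinks


def run_iteration(pages):
    """pages maps url -> (pagerank, outlinks); the result has the same shape."""
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for url, (pagerank, outlinks) in pages.items():
        for key, value in map_page(url, pagerank, outlinks):
            grouped[key].append(value)
    result = {}
    for url, values in grouped.items():
        url, pagerank, outlinks = reduce_page(url, values)
        result[url] = (pagerank, outlinks)
    return result


# Hypothetical graph: A -> B, C; B -> C; C -> A
pages = {"A": (1.0, ["B", "C"]), "B": (1.0, ["C"]), "C": (1.0, ["A"])}
pages = run_iteration(pages)
print(pages)
```

After one pass on this graph the ranks come out at about 1.0 for A, 0.575 for B and 1.425 for C, and each page still carries its outlinks, so the result can be fed straight into the next iteration.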

  • 2021-01-31 23:16

    We evaluate PR iteratively, with PR(x) = Sum(PR(a)*weight(a), a in in_links) and weight(a) = 1/size(out_links of a), as follows:

    map((url, PR), out_links)  // PR = random at start
        for link in out_links
            emit(link, (PR / size(out_links), url))
    
    reduce(url, List[(weight, source_url)]):
        PR = 0
        for (weight, source_url) in list
            PR = PR + weight
        Set urls = all source urls from list
    
        emit((url, PR), urls)
    

    so the output has the same form as the input, and we can repeat this until convergence.
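
A minimal Python sketch of the "repeat until convergence" loop might look like this. It makes several assumptions not stated above: the out-link lists are kept alongside the ranks between passes, every linked page also appears as a key, convergence is measured as the summed absolute change in rank, and the names `step`, `iterate_until_converged`, `tolerance` and `max_iterations` are hypothetical. No damping is applied, matching the formula in this answer.

```python
def step(ranks, out_links):
    """One pass: every page splits its rank evenly among its out-links."""
    new_ranks = {url: 0.0 for url in ranks}
    for url, links in out_links.items():
        for link in links:
            new_ranks[link] += ranks[url] / len(links)
    return new_ranks


def iterate_until_converged(out_links, tolerance=1e-6, max_iterations=1000):
    """Repeat `step` until the total rank change drops below `tolerance`."""
    # Start from a uniform distribution instead of random values.
    ranks = {url: 1.0 / len(out_links) for url in out_links}
    for iteration in range(1, max_iterations + 1):
        new_ranks = step(ranks, out_links)
        change = sum(abs(new_ranks[url] - ranks[url]) for url in ranks)
        ranks = new_ranks
        if change < tolerance:
            break
    return ranks, iteration


# Hypothetical graph: A -> B, C; B -> C; C -> A
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks, iterations = iterate_until_converged(graph)
print(ranks, iterations)
```

On this small graph the ranks settle near 0.4 for A and C and 0.2 for B. The `max_iterations` cap is there because, without damping, some graphs (for example a pure two-page cycle started from unequal ranks) never settle.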
