Question

I'm using Spark for fun and to learn new things about MapReduce. I'm trying to write a program that suggests new friendships (i.e., a sort of recommendation system). A friendship between two individuals is suggested if they are not connected yet and have many friends in common.

The friendship text file has a structure similar to the following:

1   2,4,11,12,15
2   1,3,4,5,9,10
3   2,5,11,15,20,21
4   1,2,3
5   2,3,4,15,16
...

where the syntax is: ID_SRC1<TAB>ID_DST1,ID_DST2,....

The program should output (print or write to a text file) lines with the syntax ID_SRC1<TAB>ID_SUGG1,ID_SUGG2,.... Of course, the program must suggest a friendship only if the two individuals share a minimum number of friends, let's say 3 in our case.

I've written my program, but I'd like to see better and more powerful solutions from you. I think my code can be improved a lot, since it takes a long time to produce output from an input file of only 4.2 MB.

My code is below:

from pyspark import SparkContext, SparkConf

def linesToDataset(line):
    (src, dst_line) = line.split('\t')
    src = int(src.strip())

    dst_list_string = dst_line.split(',')
    dst_list = [int(x.strip()) for x in dst_list_string if x != '']

    return (src, dst_list)  

def filterPairs(x):
     # don't take into account pairs of a same node and pairs of already friends
    if (x[0][0] != x[1][0]) and (not x[0][0] in x[1][1]) and (not x[1][0] in x[0][1]):
        shared = len(list(set(x[0][1]).intersection(set(x[1][1]))))
        return (x[0][0], [x[1][0], shared])

def mapFinalDataset(elem):
    recommendations = []
    src = elem[0]
    dst_commons = elem[1]
    for pair in dst_commons:
        if pair[1] >= 3: # 3 is the minimum number of friends in common
            recommendations.append(pair[0])
    return (src, recommendations)

def main():
    conf = SparkConf().setAppName("Recommendation System").setMaster("local[4]")
    sc = SparkContext(conf=conf)
    rdd = sc.textFile("data.txt")

    dataset = rdd.map(linesToDataset)

    cartesian = dataset.cartesian(dataset)
    filteredDatasetRaw = cartesian.map(filterPairs)
    filteredDataset = filteredDatasetRaw.filter(lambda x: x != None)
#   print filteredDataset.take(10)

    groupedDataset = filteredDataset.groupByKey().mapValues(list)
#   print groupedDataset.take(10)

    finalDataset = groupedDataset.map(mapFinalDataset)
    output = finalDataset.take(100)
    for (k, v) in output:
        if len(v) > 0:
            print(str(k) + ': ' + str(v))

    sc.stop()


if __name__ == "__main__":
    main()

Answer 1:

Of course, "better" is a matter of opinion.

I would argue that the strategy I am about to propose is better in terms of performance and readability, though that is subjective. The main reason is that I avoid the cartesian product and replace it with a join.

Alternative strategy

Description

The strategy I propose is based on the fact that the basic data line

1   2,4,11,12,15

can be thought of as a list of "friendship suggestions". Everyone on the right-hand side shares friend 1, so this line tells me: "2 should be friends with 4, 11, 12, 15", "4 should be friends with 2, 11, 12, 15", and so on.

Therefore, the gist of my implementation is:

Turn each line into a list of suggestions (foo should be friends with bar)
Group suggestions by person, keeping duplicates (foo should be friends with bar, baz, bar)
Count the number of duplicates (foo should be friends with bar (2 suggestions), baz (1 suggestion))
Remove existing relationships
Filter out suggestions that occur too rarely
Print the result
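
To make these steps concrete before bringing Spark into it, here is a minimal plain-Python sketch run on the five sample lines from the question. The function name `suggest_friends` and the `min_common` parameter are made up for this illustration, and the sketch assumes a friend list never contains the same ID twice.

```python
from collections import Counter, defaultdict
from itertools import permutations

# The five sample lines from the question, already parsed.
friends = {
    1: [2, 4, 11, 12, 15],
    2: [1, 3, 4, 5, 9, 10],
    3: [2, 5, 11, 15, 20, 21],
    4: [1, 2, 3],
    5: [2, 3, 4, 15, 16],
}

def suggest_friends(friends, min_common=3):
    """Hypothetical helper: count common friends pair by pair, then filter."""
    counts = defaultdict(Counter)
    # Every ordered pair inside one friend list is one "suggestion".
    for friend_list in friends.values():
        for a, b in permutations(friend_list, 2):
            counts[a][b] += 1
    result = {}
    for person, counter in counts.items():
        # People without a line of their own have no known friends here.
        known = set(friends.get(person, []))
        keep = sorted(c for c, n in counter.items()
                      if n >= min_common and c not in known)
        if keep:
            result[person] = keep
    return result

suggestions = suggest_friends(friends)
print(suggestions)
```

On this sample, only 2 and 15 are suggested to each other: they appear together in the friend lists of 1, 3 and 5, and 15 is not in 2's own list.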

Implementation

As I'm more of a Java/Scala guy, pardon the Scala, but it should map fairly easily to Python.

First, create the basic friendship data from your text file:

def parseLine(line: String): (Int, Array[String]) = {
  (Integer.parseInt(line.substring(0, line.indexOf("\t"))), line.substring(line.indexOf("\t")+1).split(","))
}
def toIntegerArray(strings: Array[String]): Array[Int] = { 
  strings.filter({ x => !x.isEmpty() }).map({ x => Integer.parseInt(x) }) 
}
// The friendships that exist
val alreadyFriendsRDD = sc.textFile("src/data.txt", 4)
        // Parse file : (id of the person, [Int] of friends)
        .map { parseLine }
        .mapValues( toIntegerArray );

Then convert them to suggestions:

// If person 1 is friends with 2 and 4, this means we should suggest 2 to become friends with 4 , and vice versa
def toSymetricalPairs(suggestions: Array[Int]): TraversableOnce[(Int, Int)] = {
  suggestions.combinations(2)
             .map { suggestion => (suggestion(0), suggestion(1)) }
             .flatMap { suggestion => Iterator(suggestion, (suggestion._2, suggestion._1)) }
}
val suggestionsRDD = alreadyFriendsRDD
  .map( x => x._2 )
  // Then we create suggestions from the friends Array
  .flatMap( toSymetricalPairs )

Once you have an RDD of suggestions, regroup them:

import scala.collection.mutable

def mergeSuggestion(suggestions: mutable.HashMap[Int, Int], newSuggestion: Int): mutable.HashMap[Int, Int] = {
  suggestions.get(newSuggestion) match {
    case None => suggestions.put(newSuggestion, 1)
    case Some(x) => suggestions.put(newSuggestion, x + 1)
  }
  suggestions
}
def mergeSuggestions(suggestions: mutable.HashMap[Int, Int], toMerge: mutable.HashMap[Int, Int]) = {
  val keySet = suggestions.keySet ++: toMerge.keySet
  keySet.foreach { key =>
    suggestions.get(key) match {
      case None => suggestions.put(key, toMerge.getOrElse(key, 0))
      case Some(x) => suggestions.put(key, x + toMerge.getOrElse(key, 0))
    }
  }
  suggestions
}

def filterRareSuggestions(suggestions: mutable.HashMap[Int, Int]): scala.collection.Set[Int] = {
  suggestions.filter(p => p._2 >= 3).keySet
}

// We reduce to an RDD of suggestion counts by person
val suggestionsByPersonRDD = suggestionsRDD.combineByKey(
    // createCombiner receives the first suggestion seen for a person,
    // so the map must start with that suggestion counted once
    (firstSuggestion: Int) => mutable.HashMap[Int, Int](firstSuggestion -> 1),
    // For every further suggestion, merge it into the map
    mergeSuggestion,
    // When merging two maps, sum the suggestion counts
    mergeSuggestions
    )
    // We restrict to suggestions that occur at least 3 times
    .mapValues( filterRareSuggestions )
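
For readers following along in PySpark, the three functions that `combineByKey` expects could be sketched with `collections.Counter` as below. The function names are made up for this illustration, and `suggestions_rdd` in the trailing comment is a hypothetical RDD of (person, suggestion) pairs. The point to watch is that the first function receives the first value seen for a key, so that value has to be counted too.

```python
from collections import Counter

def create_combiner(first_suggestion):
    # Called with the first suggestion seen for a person: count it once.
    return Counter([first_suggestion])

def merge_value(counts, suggestion):
    # Called for each further suggestion for the same person.
    counts[suggestion] += 1
    return counts

def merge_combiners(left, right):
    # Called to merge two partial counters for the same person.
    left.update(right)  # Counter.update adds counts rather than replacing them
    return left

def filter_rare(counts, min_common=3):
    # Keep only suggestions seen at least min_common times.
    return {s for s, n in counts.items() if n >= min_common}

# Hypothetical usage, assuming suggestions_rdd holds (person, suggestion) pairs:
# suggestions_by_person = (suggestions_rdd
#     .combineByKey(create_combiner, merge_value, merge_combiners)
#     .mapValues(filter_rare))
```

Using `Counter` keeps the merge step to a single `update` call instead of a hand-written loop over the key set.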

Finally, filter the suggestions by taking already existing friendships into account:

val suggestionsCleanedRDD = suggestionsByPersonRDD
  // We co-locate the suggestions with the known friends
  .join(alreadyFriendsRDD)
  // We clean the suggestions by removing the known friends
  .mapValues { case (suggestions, alreadyKnownFriendsByPerson) =>
    suggestions -- alreadyKnownFriendsByPerson
  }

Which outputs, for example:

(49831,Set(49853, 49811, 49837, 49774))
(49835,Set(22091, 20569, 29540, 36583, 31122, 3004, 10390, 4113, 1137, 15064, 28563, 20596, 36623))
(49839,Set())
(49843,Set(49844))

Meaning 49831 should be friends with 49853, 49811, 49837, 49774.
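
Since the question uses PySpark, here is a rough sketch of how the same idea might look in Python. It is a variation rather than a line-by-line translation of the Scala above: it counts each (person, candidate) pair with `reduceByKey` and removes existing friendships with `subtractByKey`. The names `parse_line`, `candidate_pairs` and `suggest_with_spark` are made up for this illustration, `sc` is assumed to be an existing SparkContext, and friend lists are assumed to contain no duplicate IDs.

```python
from itertools import permutations
from operator import add

def parse_line(line):
    # "ID<TAB>ID1,ID2,..." -> (ID, [ID1, ID2, ...])
    src, _, dst = line.partition("\t")
    return int(src), [int(x) for x in dst.split(",") if x.strip()]

def candidate_pairs(friend_list):
    # Two people in the same friend list share that list's owner as a friend.
    return [((a, b), 1) for a, b in permutations(friend_list, 2)]

def suggest_with_spark(sc, path, min_common=3):
    # Assumption: sc is an already created SparkContext.
    friends = sc.textFile(path).map(parse_line).cache()
    # Existing friendships, keyed the same way as the candidate pairs.
    existing = friends.flatMap(lambda kv: [((kv[0], f), None) for f in kv[1]])
    # Number of common friends for every (person, candidate) pair.
    counts = friends.flatMap(lambda kv: candidate_pairs(kv[1])).reduceByKey(add)
    return (counts
            .filter(lambda kv: kv[1] >= min_common)
            .subtractByKey(existing)
            .map(lambda kv: kv[0])  # (person, candidate)
            .groupByKey()
            .mapValues(sorted))
```

Calling `suggest_with_spark(sc, "data.txt").take(10)` should then return (person, sorted list of suggestions) tuples, with no cartesian product involved.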