How to assign unique contiguous numbers to elements in a Spark RDD

無奈伤痛 2020-12-04 14:00

I have a dataset of (user, product, review), and want to feed it into mllib's ALS algorithm.

The algorithm needs users and products to be numbers, whil

5 Answers
  • 2020-12-04 14:44

    Starting with Spark 1.0 there are two methods you can use to solve this easily:

    • RDD.zipWithIndex works just like Seq.zipWithIndex: it adds contiguous (Long) numbers. This needs to count the elements in each partition first, so your input will be evaluated twice. Cache your input RDD if you want to use this.
    • RDD.zipWithUniqueId also gives you unique Long IDs, but they are not guaranteed to be contiguous. (They will only be contiguous if each partition has the same number of elements.) The upside is that this does not need to know anything about the input, so it will not cause double-evaluation.
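    To make the difference between the two numbering schemes concrete, here is a minimal plain-Python model. It is not Spark itself: partitions are modelled as a list of lists, and the helper names zip_with_index and zip_with_unique_id are hypothetical. The i * n + k formula follows the documented behaviour of zipWithUniqueId (items in the k-th of n partitions get k, n+k, 2n+k, ...).

```python
from itertools import accumulate

def zip_with_index(partitions):
    # First pass: count the elements in each partition
    # (this is the extra evaluation of the input mentioned above).
    counts = [len(p) for p in partitions]
    # Each partition starts at the sum of the counts of the partitions before it.
    offsets = [0] + list(accumulate(counts))[:-1]
    return [[(x, off + i) for i, x in enumerate(p)]
            for p, off in zip(partitions, offsets)]

def zip_with_unique_id(partitions):
    n = len(partitions)
    # Item i of partition k gets ID i * n + k, so no counting pass is needed.
    return [[(x, i * n + k) for i, x in enumerate(p)]
            for k, p in enumerate(partitions)]

# Three partitions of unequal size (illustrative data).
parts = [["a", "b", "c"], ["d"], ["e", "f"]]
indexed = zip_with_index(parts)      # IDs 0..5, contiguous
unique = zip_with_unique_id(parts)   # IDs 0, 3, 6, 1, 2, 5: unique, with a gap at 4
```

    With unequal partition sizes the second scheme leaves gaps, which is why it is only contiguous when every partition holds the same number of elements.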
  • 2020-12-04 14:45

    monotonically_increasing_id() appears to be the answer, but unfortunately it won't work for ALS, since it produces 64-bit numbers and ALS expects 32-bit ones (see my comment below radek1st's answer for details).

    The solution I found is to use zipWithIndex(), as mentioned in Darabos' answer. Here's how to implement it:

    If you already have a single-column DataFrame with your distinct users called userids, you can create a lookup table (LUT) as follows:

    # PySpark code
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType
    schema = StructType([StructField("userid", StringType(), True), StructField("user_als_id", IntegerType(), True)])
    user_als_id_LUT = sqlContext.createDataFrame(userids.rdd.map(lambda x: x[0]).zipWithIndex(), schema)
    

    Now you can:

    • Use this LUT to get ALS-friendly integer IDs to provide to ALS
    • Use this LUT to do a reverse-lookup when you need to go back from ALS ID to the original ID

    Do the same for items, obviously.
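    The forward and reverse lookups can be sketched in plain Python, with dictionaries standing in for the LUT DataFrame. The sample reviews and the helper build_lut are made up for illustration; in Spark the same mapping would be applied with joins against the LUT.

```python
INT_MAX = 2**31 - 1  # largest value that fits in a 32-bit signed integer

def build_lut(values):
    """Return (forward, reverse) maps between distinct values and contiguous IDs."""
    forward = {v: i for i, v in enumerate(sorted(set(values)))}
    # The largest assigned ID is len(forward) - 1; it has to fit in 32 bits.
    assert len(forward) - 1 <= INT_MAX, "too many distinct values for 32-bit IDs"
    reverse = {i: v for v, i in forward.items()}
    return forward, reverse

# Hypothetical (user, product, review) rows.
reviews = [("alice", "book", 5.0), ("bob", "pen", 3.0), ("alice", "pen", 4.0)]

user_to_id, id_to_user = build_lut(u for u, _, _ in reviews)
item_to_id, id_to_item = build_lut(p for _, p, _ in reviews)

# Forward lookup: integer triples in the shape ALS expects.
ratings = [(user_to_id[u], item_to_id[p], r) for u, p, r in reviews]

# Reverse lookup: from an ALS ID back to the original ID.
original_user = id_to_user[ratings[1][0]]
```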

  • 2020-12-04 14:46

    For a similar example use case, I just hashed the string values. See http://blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/

    def nnHash(tag: String) = tag.hashCode & 0x7FFFFF
    var tagHashes = postIDTags.map(_._2).distinct.map(tag =>(nnHash(tag),tag))
    

    It sounds like you're already doing something like this, although hashing can be easier to manage.
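    One caveat worth checking before relying on hashes as IDs: distinct strings can hash to the same value, so uniqueness is not guaranteed. The sketch below re-implements Java's String.hashCode in plain Python (an emulation that assumes characters in the Basic Multilingual Plane, not the JVM itself) and applies the same 0x7FFFFF mask as nnHash above.

```python
def java_string_hash(s):
    # Emulates Java's String.hashCode: h = 31 * h + char, kept to 32 bits.
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h

def nn_hash(tag):
    # Same mask as nnHash above: keeps the low 23 bits, so the result is non-negative.
    return java_string_hash(tag) & 0x7FFFFF

# "Aa" and "BB" are a well-known pair of strings with equal Java hash codes.
collision = nn_hash("Aa") == nn_hash("BB")
```

    If collisions matter for your data, compare the number of distinct hashes against the number of distinct tags before using them as IDs.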

    Matei suggested an approach here for emulating zipWithIndex on an RDD, which amounts to assigning IDs within each partition that are going to be globally unique: https://groups.google.com/forum/#!topic/spark-users/WxXvcn2gl1E

  • 2020-12-04 14:51

    Another easy option, if you are using DataFrames and are only concerned about uniqueness, is to use the function monotonicallyIncreasingId:

    import org.apache.spark.sql.functions.monotonicallyIncreasingId 
    val newDf = df.withColumn("uniqueIdColumn", monotonicallyIncreasingId)
    

    Edit: monotonicallyIncreasingId was deprecated in Spark 2.0; use monotonically_increasing_id instead.

  • 2020-12-04 15:01

    People have already recommended monotonically_increasing_id(), and mentioned the problem that it creates Longs, not Ints.

    However, in my experience (caveat - Spark 1.6) - if you use it on a single partition (repartition to 1 first), no partition prefix is added to the IDs, and the number can be safely cast to Int. Obviously, you need to have fewer than Integer.MAX_VALUE rows.
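    The reason this works can be modelled in a few lines of plain Python. The Spark documentation describes the current implementation of monotonically_increasing_id as putting the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits. The function monotonic_id below is a hypothetical stand-in for that layout, not Spark code, and the layout is an implementation detail that could change.

```python
INT_MAX = 2**31 - 1  # Integer.MAX_VALUE on the JVM

def monotonic_id(partition_id, row_in_partition):
    # Partition ID in the upper bits, record number in the lower 33 bits.
    return (partition_id << 33) + row_in_partition

# With a single partition (ID 0) the IDs are just 0, 1, 2, ...
single_partition_ids = [monotonic_id(0, row) for row in range(3)]

# The first row of the second partition already overflows a 32-bit Int.
first_id_in_partition_1 = monotonic_id(1, 0)
fits_in_int = first_id_in_partition_1 <= INT_MAX
```

    So the cast to Int is only safe while all rows sit in partition 0 and the row count stays within the Int range.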
