Question
I have an RDD of dictionaries, and I'd like to get an RDD of just the distinct elements. However, when I try to call
rdd.distinct()
PySpark gives me the following error
TypeError: unhashable type: 'dict'
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:342)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/02/19 16:55:56 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/local/Cellar/apache-spark/1.6.0/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
process()
File "/usr/local/Cellar/apache-spark/1.6.0/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/usr/local/Cellar/apache-spark/1.6.0/libexec/python/lib/pyspark.zip/pyspark/rdd.py", line 2346, in pipeline_func
File "/usr/local/Cellar/apache-spark/1.6.0/libexec/python/lib/pyspark.zip/pyspark/rdd.py", line 2346, in pipeline_func
File "/usr/local/Cellar/apache-spark/1.6.0/libexec/python/lib/pyspark.zip/pyspark/rdd.py", line 317, in func
File "/usr/local/Cellar/apache-spark/1.6.0/libexec/python/lib/pyspark.zip/pyspark/rdd.py", line 1776, in combineLocally
File "/usr/local/Cellar/apache-spark/1.6.0/libexec/python/lib/pyspark.zip/pyspark/shuffle.py", line 238, in mergeValues
d[k] = comb(d[k], v) if k in d else creator(v)
TypeError: unhashable type: 'dict'
I do have a key inside of the dict that I could use as the distinct element, but the documentation doesn't give any clues on how to solve this problem.
EDIT: The content is made up of strings, arrays of strings, and a dictionary of numbers
EDIT 2: Example of a dictionary... I'd like dicts with equal "data_fingerprint" values to be considered equal:
{"id":"4eece341","data_fingerprint":"1707db7bddf011ad884d132bf80baf3c"}
Thanks
Answer 1:
As @zero323 pointed out in his comment, you have to decide how to compare dictionaries, as they are not hashable. One way would be to sort the keys (since they are not in any particular order), for example in lexicographic order, and then build a string of the form:
def dict_to_string(d):
    # Sort the keys so that equal dicts always serialize to the same string
    return '|'.join('{}|{}'.format(k, d[k]) for k in sorted(d))  # 'key1|value1|key2|value2|...|keyn|valuen'
If you have nested unhashable objects, you have to do this recursively.
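If the values themselves contain lists or nested dicts (as in the question's edit), a rough recursive sketch could look like the following; to_hashable_string is a hypothetical helper name, and it assumes values are strings, numbers, lists/tuples, or dicts:
def to_hashable_string(obj):
    # Recursively flatten nested dicts and lists into one deterministic string
    if isinstance(obj, dict):
        return '|'.join('{}|{}'.format(k, to_hashable_string(obj[k])) for k in sorted(obj))
    if isinstance(obj, (list, tuple)):
        return '|'.join(to_hashable_string(v) for v in obj)
    return str(obj)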
Now you can just transform your RDD into an RDD of pairs, with that string (or some hash of it) as the key:
pairs = dictRDD.map(lambda d: (dict_to_string(d), d))
To get what you want, you just have to reduce by key as follows:
distinctDicts = pairs.reduceByKey(lambda val1, val2: val1).values()
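As a quick sanity check with hypothetical in-memory data (assuming a SparkContext sc), two dicts with the same content but keys written in a different order collapse to a single element, because the keys are sorted before serializing:
data = [{"a": 1, "b": 2}, {"b": 2, "a": 1}]  # same content, different key order
pairs = sc.parallelize(data).map(lambda d: (dict_to_string(d), d))
print(pairs.reduceByKey(lambda val1, val2: val1).values().collect())
# [{'a': 1, 'b': 2}]  -- one survivor; which of the two equal dicts is kept is arbitrary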
Answer 2:
Since your data provides a unique key, you can simply do something like this:
(rdd
.keyBy(lambda d: d.get("data_fingerprint"))
.reduceByKey(lambda x, y: x)
.values())
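For instance, using the sample record from the question plus a hypothetical second record sharing the same fingerprint (and assuming a SparkContext sc), only one dict per data_fingerprint survives; which one is kept is arbitrary:
rdd = sc.parallelize([
    {"id": "4eece341", "data_fingerprint": "1707db7bddf011ad884d132bf80baf3c"},
    {"id": "58aa11c2", "data_fingerprint": "1707db7bddf011ad884d132bf80baf3c"},  # hypothetical duplicate
])
print(rdd
      .keyBy(lambda d: d.get("data_fingerprint"))
      .reduceByKey(lambda x, y: x)
      .values()
      .collect())
# prints a single dict for this fingerprint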
There are at least two problems with Python dictionaries which make them bad candidates for hashing:
- mutability - which makes any hashing tricky
- arbitrary order of keys
A while ago there was a PEP proposing frozendicts (PEP 0416), but it was ultimately rejected.
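To see the underlying limitation in plain Python (no Spark involved): hashing a dict raises the same TypeError, while an immutable view such as frozenset(d.items()) can serve as a hashable stand-in, but only when every value is itself hashable (which is not the case for the nested lists and dicts mentioned in the question's edit):
d = {"id": "4eece341", "data_fingerprint": "1707db7bddf011ad884d132bf80baf3c"}
try:
    hash(d)
except TypeError as e:
    print(e)  # unhashable type: 'dict'
print(hash(frozenset(d.items())))  # works here because both values are strings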
Source: https://stackoverflow.com/questions/35509919/how-can-i-get-a-distinct-rdd-of-dicts-in-pyspark