How to determine if object is a valid key-value pair in PySpark

If I have an RDD, how do I tell whether the data is in key:value format? Is there a way to find this out, something like how type(object) tells me an object's type? I tried
1 Answer

    Python is a dynamically typed language and PySpark doesn't use any special type for key-value pairs. The only requirement for an object to be considered valid data for PairRDD operations is that it can be unpacked as follows:

    k, v = kv
    
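    If you want to check this at runtime, one option is to simply attempt that unpacking on a sampled element. Here is a minimal sketch (the helper name is_pairlike is mine, not part of the PySpark API; note that any two-element iterable, even a two-character string, would pass this check):

    def is_pairlike(rdd):
        # Hypothetical helper: verify the (key, value) unpacking contract
        # on the first element of the RDD.
        head = rdd.take(1)
        if not head:
            return False  # empty RDD, nothing to inspect
        try:
            k, v = head[0]
        except (TypeError, ValueError):
            return False
        return True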

    Typically you would use a two-element tuple due to its semantics (an immutable object of fixed size) and its similarity to Scala Product classes. But this is just a convention, and nothing stops you from doing something like this:

    key_value.py

    class KeyValue(object):
        """A two-field container that unpacks like a (key, value) pair."""
        def __init__(self, k, v):
            self.k = k
            self.v = v
        def __iter__(self):
            # Yield the key first and the value second so that
            # ``k, v = key_value`` unpacking works.
            for x in [self.k, self.v]:
                yield x

    from operator import add
    from key_value import KeyValue

    rdd = sc.parallelize(
        [KeyValue("foo", 1), KeyValue("foo", 2), KeyValue("bar", 0)])

    rdd.reduceByKey(add).collect()
    ## [('bar', 0), ('foo', 3)]
    

    and make an arbitrary class behave like a key-value pair. So, once again, if something can be correctly unpacked as a pair of objects then it is a valid key-value pair. Implementing the __len__ and __getitem__ magic methods should work as well. Probably the most elegant way to handle this is to use namedtuples.
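    For instance, a minimal namedtuple-based sketch (the field names here are my own choice):

    from collections import namedtuple
    from operator import add

    # A namedtuple with exactly two fields unpacks cleanly into (key, value).
    KV = namedtuple("KV", ["k", "v"])

    rdd = sc.parallelize([KV("foo", 1), KV("foo", 2), KV("bar", 0)])
    rdd.reduceByKey(add).collect()
    ## e.g. [('bar', 0), ('foo', 3)]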

    Also note that rdd.take(1) returns a list of length 1, so type(rdd.take(1)) is always list, no matter what the RDD contains.
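    To inspect the type of an element rather than of that list, index into it first:

    type(rdd.take(1))     ## <class 'list'>, regardless of the RDD's contents
    type(rdd.take(1)[0])  ## e.g. <class 'key_value.KeyValue'> for the RDD above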
