Spark map vs. mapPartitions: differences and use cases

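The main differences between map and mapPartitions:

1) map: the function is called once per element and processes a single element each time.

2) mapPartitions: the function is called once per partition and receives an iterator over all of that partition's elements, so it processes a batch of data per call.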
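The two calling conventions can be sketched in plain Python, without a Spark cluster. Here `partitions` is a hypothetical stand-in for an RDD split into two partitions, and `double`, `double_partition` and `sum_partition` are illustrative names for functions one would pass to `rdd.map(...)` or `rdd.mapPartitions(...)`. This only models the behaviour; it is not how Spark executes the job.

```python
def double(x):
    # Shape of a function passed to rdd.map: one element in, one element out.
    return x * 2


def double_partition(iterator):
    # Shape of a function passed to rdd.mapPartitions:
    # an iterator over one partition in, an iterable of results out.
    for x in iterator:
        yield x * 2


def sum_partition(iterator):
    # A mapPartitions function may emit fewer (or more) items than it receives.
    yield sum(iterator)


# Hypothetical stand-in for an RDD with two partitions.
partitions = [[1, 2, 3], [4, 5]]

map_result = [double(x) for part in partitions for x in part]
map_partitions_result = [y for part in partitions for y in double_partition(iter(part))]
partition_sums = [y for part in partitions for y in sum_partition(iter(part))]
```

With real RDDs the equivalent calls would be `rdd.map(double)`, `rdd.mapPartitions(double_partition)` and `rdd.mapPartitions(sum_partition)`.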

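Pros and cons of mapPartitions:

Pros: it is fast when per-element setup is expensive. The function is called once per partition and handles all of that partition's data in a single call, so any extra object it needs can be created once and reused for every element. For example, when writing the data in an RDD to a database over JDBC, map has to create a connection for every element, whereas mapPartitions creates one connection per partition. In such cases mapPartitions is far more efficient than map.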
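The saving on setup cost can be made concrete with a small plain-Python sketch. `FakeConnection` is a hypothetical stand-in for a real database connection that merely counts how often it is opened, and `partitions` again models an RDD with two partitions; none of these names come from Spark or from a database driver.

```python
class FakeConnection:
    """Hypothetical stand-in for an expensive resource such as a DB connection."""
    opened = 0

    def __init__(self):
        FakeConnection.opened += 1

    def write(self, record):
        return "wrote %s" % record


def write_with_map(record):
    # map style: a new connection for every single element.
    conn = FakeConnection()
    return conn.write(record)


def write_with_map_partitions(iterator):
    # mapPartitions style: one connection shared by the whole partition.
    conn = FakeConnection()
    for record in iterator:
        yield conn.write(record)


partitions = [["a", "b", "c"], ["d", "e"]]

FakeConnection.opened = 0
map_out = [write_with_map(r) for part in partitions for r in part]
opened_by_map = FakeConnection.opened

FakeConnection.opened = 0
mp_out = [r for part in partitions for r in write_with_map_partitions(iter(part))]
opened_by_map_partitions = FakeConnection.opened
```

In this model the map style opens one connection per element (five in total), while the mapPartitions style opens one per partition (two in total), with identical output.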

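Cons: it is more prone to running out of memory. The function is handed a whole partition in one call; when the partition is large (for example, 1 million records) and the function holds those records or their results in memory at the same time, the executor may not have enough memory and has no way to free any until the call finishes, which can lead to an OOM (out of memory) error. map runs into this far less often.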
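How much of a partition sits in memory depends largely on how the partition function is written. The sketch below, in plain Python, contrasts a function that builds a full list before returning with a generator that yields results one at a time. `counting_source` is a hypothetical helper that records how many input records have been pulled so far. Whether the lazy form is enough to avoid an OOM in a real job also depends on the downstream operations, so treat this as an illustration rather than a guarantee.

```python
def counting_source(n, counter):
    # Hypothetical partition iterator that counts how many records were pulled.
    for i in range(n):
        counter["pulled"] += 1
        yield i


def transform_eager(iterator):
    # Builds the complete result list first: memory grows with partition size.
    results = []
    for x in iterator:
        results.append(x * 2)
    return results


def transform_lazy(iterator):
    # Yields each result as soon as it is computed: one record at a time.
    for x in iterator:
        yield x * 2


eager_counter = {"pulled": 0}
eager_out = iter(transform_eager(counting_source(1000, eager_counter)))
eager_first = next(eager_out)
pulled_before_first_eager = eager_counter["pulled"]

lazy_counter = {"pulled": 0}
lazy_out = transform_lazy(counting_source(1000, lazy_counter))
lazy_first = next(lazy_out)
pulled_before_first_lazy = lazy_counter["pulled"]
```

The eager version has consumed all 1000 records before producing its first result, while the lazy version has consumed only one.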

How to fix out-of-memory errors in mapPartitions():

Split the data into more partitions:
repartition(100).mapPartitions(xx)
Give each executor more memory:
--executor-memory 8g
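
The effect of repartitioning is simple arithmetic, sketched below. Both helper names are hypothetical, and the calculation assumes records end up spread evenly across partitions, which is only approximately true in practice.

```python
def records_per_partition(total_records, num_partitions):
    # Ceiling division: the largest partition size under an even split.
    return (total_records + num_partitions - 1) // num_partitions


def partitions_needed(total_records, max_records_per_partition):
    # Smallest partition count that keeps every partition under the cap.
    return (total_records + max_records_per_partition - 1) // max_records_per_partition


# 1 million records over 100 partitions: about 10,000 records per function call.
per_call = records_per_partition(1000000, 100)

# To keep each call at or below 25,000 records, at least 40 partitions are needed.
needed = partitions_needed(1000000, 25000)
```

A result like `needed` is the kind of value one would pass to `repartition(...)` before calling `mapPartitions`.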

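Code demo_1.py: the Personas object is initialized only once per partition

import ast

from pyspark import SparkConf, SparkContext

# Personas, input_path and output_path are assumed to be defined elsewhere.


def spark_get_order_personal_res():
    spark_conf = SparkConf()
    spark_conf.setAppName("xxx")
    spark_context = SparkContext(conf=spark_conf)

    def trans_feature_to_personal(partition):
        # Initialize the Personas object only once per partition.
        ps = Personas()
        for line in partition:
            # Assumes each line is a plain Python literal "(cid, fmap)";
            # literal_eval parses it without executing arbitrary code.
            cid, fmap = ast.literal_eval(line.strip())
            try:
                ps_result = ps(features_map=fmap)
                yield ps_result
            except Exception:
                # Mark failed records with None; they are filtered out below.
                yield None

    # saveAsTextFile returns nothing, so the result is not assigned.
    (spark_context.textFile(input_path)
        .repartition(40)
        .mapPartitions(trans_feature_to_personal)
        .filter(lambda k: k is not None)
        .saveAsTextFile(output_path))
    spark_context.stop()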

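Code demo_2.py: the database connection is opened only once per partition

import json

import MySQLdb


def call_sql(mysql_hydra_cur, sql, params):
    # Run a parameterized query and return the row count and all rows.
    count = mysql_hydra_cur.execute(sql, params)
    data = mysql_hydra_cur.fetchall()
    return count, data


def deal_partitions(partition):
    # Open the database connection only once per partition.
    mysql_hydra = MySQLdb.connect(
        host='xxx',
        user='xxx',
        passwd='xxx',
        db='xxx')
    mysql_hydra_cur = mysql_hydra.cursor()
    try:
        for a in partition:
            a = json.loads(a)
            cid = a['cid']
            # Pass cid as a query parameter instead of formatting it into the SQL string.
            sql = "select created_time,phone,identity from api_credit where cid=%s"
            ans = call_sql(mysql_hydra_cur, sql, (cid,))
            yield ans
    finally:
        # Release the connection once the partition has been consumed.
        mysql_hydra_cur.close()
        mysql_hydra.close()


if __name__ == "__main__":
    # sc, input_path and output_path are assumed to be defined elsewhere.
    sc.textFile(input_path).repartition(40).mapPartitions(deal_partitions).saveAsTextFile(output_path)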

Source: CSDN

Author: 追梦杏花天影

Link: https://blog.csdn.net/u010569893/article/details/96480858
