pyspark.RDD.reduceByKey
RDD.reduceByKey(func, numPartitions=None, partitionFunc=<function portable_hash>)
Merge the values for each key using an associative and commutative reduce function.

This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce.

Output will be partitioned with numPartitions partitions, or at the default parallelism level if numPartitions is not specified. The default partitioner is hash partitioning.

New in version 1.6.0.

Parameters
    func : function
        the reduce function
    numPartitions : int, optional
        the number of partitions in the new RDD
    partitionFunc : function, optional, default portable_hash
        function to compute the partition index
 
Returns

    RDD
        a new RDD containing the keys and the aggregated result for each key

Examples

>>> from operator import add
>>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
>>> sorted(rdd.reduceByKey(add).collect())
[('a', 2), ('b', 1)]
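
The doctest below is not part of the original documentation; it is a small sketch, assuming the same sc SparkContext used in the example above, that illustrates how numPartitions and partitionFunc shape the output RDD. The lambda k: 0 partitioner is a hypothetical stand-in for portable_hash that routes every key to partition 0.

>>> from operator import add
>>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)], 4)
>>> # Request two output partitions for the reduced RDD
>>> counts = rdd.reduceByKey(add, numPartitions=2)
>>> counts.getNumPartitions()
2
>>> sorted(counts.collect())
[('a', 2), ('b', 1)]
>>> # Hypothetical custom partitioner: every key maps to partition index 0
>>> skewed = rdd.reduceByKey(add, 2, partitionFunc=lambda k: 0)
>>> [len(part) for part in skewed.glom().collect()]
[2, 0]

Because the reduce function must be associative and commutative, the per-mapper combining step produces the same final result regardless of how records are split across partitions.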