Aggregate duplicate records by maintaining the order and also include duplicate records

Submitted by ぐ巨炮叔叔 on 2020-02-04 03:57:53

Question


I am trying to solve an interesting problem. It's easy to just do a groupBy for aggregations like sum, count, etc., but this problem is slightly different. Let me explain:

This is my list of tuples:

val repeatSmokers: List[(String, String, String, String, String, String)] =
  List(
    ("ID76182", "sachin", "kita MR.", "56308", "1990", "300"),
    ("ID76182", "KOUN", "Jana MR.", "56714", "1990", "100"),
    ("ID76182", "GANGS", "SKILL", "27539", "1990", "255"),
    ("ID76182", "GANGS", "SKILL", "27539", "1990", "110"),
    ("ID76182", "SEMI", "GAUTAM A MR.", "45873", "1990", "20"),
    ("ID76182", "SEMI", "GAUTAM A MR.", "45873", "1990", "6750"),
    ("ID76182", "DOWNES", "RYAN", "47542", "1990", "2090"),
    ("ID76182", "DRAGON", "WARS", "49337", "1990", "200"),
    ("ID76182", "HULK", "PAIN MR.", "47542", "1990", "280"),
    ("ID76182", "JAMES", "JIM", "30548", "1990", "300"),
    ("ID76182", "KIMMELSHUE", "RUTH", "55345", "1990", "2600"),
    ("ID76182", "DRAGON", "WARS", "49337", "1990", "370"),
    ("ID76182", "COOPER", "ANADA", "45873", "1990", "2600"),
    ("ID76182", "SEMI", "GAUTAM A MR.", "45873", "1990", "2600"),
    ("ID76182", "HULK", "PAIN MR.", "47542", "1990", "256")
  )

The schema for these records is (IDNumber, first_name, last_name, test_code, year, amount). From these records, I want only the repeated ones. A unique combination in the above list is defined by the name and test_code together, e.g. (sachin, kita MR., 56308). That means if the same name and test_code combination repeats, it is a repeat-smoker record. For simplicity you can assume test_code alone is the unique value: if it repeats, it's a repeat-smoker record.

Below is the exact expected output:

ID76182,27539,1990,255,1 
ID76182,27539,1990,365,2
ID76182,45873,1990,20,1 
ID76182,45873,1990,6770,2 
ID76182,45873,1990,9370,3
ID76182,49337,1990,200,1
ID76182,49337,1990,570,2
ID76182,47542,1990,280,1
ID76182,47542,1990,536,2

Finally, the challenging part here is to maintain the order, aggregate the sum on every repeated smoker record, and also add the number of occurrences.

For example, this record's schema is: ID76182,47542,1990,536,2

IDNumber,test_code,year,amount,occurences

Since it occurred twice, we see 2 above.

Note:

The output can be a list or any collection, but it should be in the same format mentioned above.


Answer 1:


So here is some code in Scala, though it is really Java code just written in Scala:

import java.util.{ArrayList, LinkedHashMap}
import scala.collection.JavaConverters._


type RawRecord = (String, String, String, String, String, String)
type Record = (String, String, String, String, Int, Int)
type RecordKey = (String, String, String, String)
type Output = (String, String, String, String, Int, Int, Int)
val keyF: Record => RecordKey = r => (r._1, r._2, r._3, r._4)
val repeatSmokersRaw: List[RawRecord] =
  List(
    ("ID76182", "sachin", "kita MR.", "56308", "1990", "300"),
    ("ID76182", "KOUN", "Jana MR.", "56714", "1990", "100"),
    ("ID76182", "GANGS", "SKILL", "27539", "1990", "255"),
    ("ID76182", "GANGS", "SKILL", "27539", "1990", "110"),
    ("ID76182", "SEMI", "GAUTAM A MR.", "45873", "1990", "20"),
    ("ID76182", "SEMI", "GAUTAM A MR.", "45873", "1990", "6750"),
    ("ID76182", "DOWNES", "RYAN", "47542", "1990", "2090"),
    ("ID76182", "DRAGON", "WARS", "49337", "1990", "200"),
    ("ID76182", "HULK", "PAIN MR.", "47542", "1990", "280"),
    ("ID76182", "JAMES", "JIM", "30548", "1990", "300"),
    ("ID76182", "KIMMELSHUE", "RUTH", "55345", "1990", "2600"),
    ("ID76182", "DRAGON", "WARS", "49337", "1990", "370"),
    ("ID76182", "COOPER", "ANADA", "45873", "1990", "2600"),
    ("ID76182", "SEMI", "GAUTAM A MR.", "45873", "1990", "2600"),
    ("ID76182", "HULK", "PAIN MR.", "47542", "1990", "256")
  )
val repeatSmokers = repeatSmokersRaw.map(r => (r._1, r._2, r._3, r._4, r._5.toInt, r._6.toInt))

val acc = new LinkedHashMap[RecordKey, (ArrayList[Output], Int, Int)]
repeatSmokers.foreach(r => {
  val key = keyF(r)
  var cur = acc.get(key)
  if (cur == null) {
    cur = (new ArrayList[Output](), 0, 0)
  }
  val nextCnt = cur._2 + 1
  val sum = cur._3 + r._6
  val output = (r._1, r._2, r._3, r._4, r._5, sum, nextCnt)
  cur._1.add(output)
  acc.put(key, (cur._1, nextCnt, sum))
})
val result = acc.values().asScala.filter(p => p._2 > 1).flatMap(p => p._1.asScala)
// or if you are clever you can merge filter and flatMap as
// val result = acc.values().asScala.flatMap(p => if (p._1.size > 1) p._1.asScala else Nil)

println(result.mkString("\n"))

It prints

(ID76182,GANGS,SKILL,27539,1990,255,1)
(ID76182,GANGS,SKILL,27539,1990,365,2)
(ID76182,SEMI,GAUTAM A MR.,45873,1990,20,1)
(ID76182,SEMI,GAUTAM A MR.,45873,1990,6770,2)
(ID76182,SEMI,GAUTAM A MR.,45873,1990,9370,3)
(ID76182,DRAGON,WARS,49337,1990,200,1)
(ID76182,DRAGON,WARS,49337,1990,570,2)
(ID76182,HULK,PAIN MR.,47542,1990,280,1)
(ID76182,HULK,PAIN MR.,47542,1990,536,2)

The main trick in this code is to use Java's LinkedHashMap as the accumulator collection, because it preserves insertion order. An additional trick is to store lists inside it (since Java collections are used anyway, an ArrayList serves as the inner accumulator, but you can use anything you like). The idea is to build a map of key => list of smokers and, for each key, additionally store the current counter and current sum so that "aggregated" smoker records can be appended to the list. Once the map is built, iterate over it to filter out keys that accumulated fewer than 2 records, then flatten the map of lists into a single list. This is the point where using LinkedHashMap matters: insertion order is preserved during iteration.
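The insertion-order guarantee is the whole point of choosing LinkedHashMap over a plain HashMap; a minimal sketch (with made-up keys, not from the answer) demonstrates it:

```scala
import java.util.LinkedHashMap
import scala.collection.JavaConverters._

val linked = new LinkedHashMap[String, Int]()
// insert in a deliberately non-alphabetical key order
linked.put("c", 1)
linked.put("a", 2)
linked.put("b", 3)
// iteration follows insertion order, not key order
assert(linked.keySet().asScala.toList == List("c", "a", "b"))
```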




Answer 2:


Here is a functional way of solving this problem:

For this input:

val repeatSmokers: List[(String, String, String, String, String, String)] =
  List(
    ("ID76182", "sachin", "kita MR.", "56308", "1990", "300"),
    ("ID76182", "KOUN", "Jana MR.", "56714", "1990", "100"),
    ("ID76182", "GANGS", "SKILL", "27539", "1990", "255"),
    ("ID76182", "GANGS", "SKILL", "27539", "1990", "110"),
    ("ID76182", "SEMI", "GAUTAM A MR.", "45873", "1990", "20"),
    ("ID76182", "SEMI", "GAUTAM A MR.", "45873", "1990", "6750"),
    ("ID76182", "DOWNES", "RYAN", "47542", "1990", "2090"),
    ("ID76182", "DRAGON", "WARS", "49337", "1990", "200"),
    ("ID76182", "HULK", "PAIN MR.", "47542", "1990", "280"),
    ("ID76182", "JAMES", "JIM", "30548", "1990", "300"),
    ("ID76182", "KIMMELSHUE", "RUTH", "55345", "1990", "2600"),
    ("ID76182", "DRAGON", "WARS", "49337", "1990", "370"),
    ("ID76182", "COOPER", "ANADA", "45873", "1990", "2600"),
    ("ID76182", "SEMI", "GAUTAM A MR.", "45873", "1990", "2600"),
    ("ID76182", "HULK", "PAIN MR.", "47542", "1990", "256")
  )

With a case class representing the record:

case class Record(
    id: String,
    fname: String,
    lname: String,
    code: String,
    year: String,
    amount: String)

We can run the following:

val result = repeatSmokers
  .map(recordTuple => Record.tupled(recordTuple))
  .zipWithIndex
  .groupBy { case (record, order) => (record.fname, record.lname, record.code) }
  .flatMap {

    case (_, List(singleRecord)) => Nil // get rid of non-repeat records

    case (key, records) => {

      val firstKeyIdx = records.head._2

      val amounts = records.map {
        case (record, order) => record.amount.toInt
      }.foldLeft(List[Int]()) {
        case (Nil, addAmount) => List(addAmount)
        case (previousAmounts :+ lastAmount, addAmount) =>
          previousAmounts :+ lastAmount :+ (lastAmount + addAmount)
      }

      records
        .zip(amounts)
        .zipWithIndex
        .map {
          case (((rec, order), amount), idx) =>
            val serializedRecord =
              List(rec.id, rec.code, rec.year, amount, idx + 1)
            (serializedRecord.mkString(","), (firstKeyIdx, idx))
        }
    }
  }
  .toList
  .sortBy { case (serializedRecord, finalOrder) => finalOrder }
  .map { case (serializedRecord, finalOrder) => serializedRecord }

This produces:

ID76182,27539,1990,255,1
ID76182,27539,1990,365,2
ID76182,45873,1990,20,1
ID76182,45873,1990,6770,2
ID76182,45873,1990,9370,3
ID76182,49337,1990,200,1
ID76182,49337,1990,570,2
ID76182,47542,1990,280,1
ID76182,47542,1990,536,2

Some explanation:

A pretty nice way to instantiate a case class from a tuple (this creates a List of Records from the list of tuples):

.map(recordTuple => Record.tupled(recordTuple))
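As a quick illustration of Record.tupled (Scala 2, where the auto-generated case-class companion extends Function6 and therefore has a tupled method; Scala 3 handles this differently):

```scala
case class Record(id: String, fname: String, lname: String,
                  code: String, year: String, amount: String)

// the 6-tuple matches the case class constructor arguments positionally
val t = ("ID76182", "GANGS", "SKILL", "27539", "1990", "255")
val rec = Record.tupled(t)
assert(rec == Record("ID76182", "GANGS", "SKILL", "27539", "1990", "255"))
```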

Each record is tupled with its global index as (Record, index), in order to be able to work with orderings later:

.zipWithIndex

We then group using the key you required:

.groupBy { case (record, order) => (record.fname, record.lname, record.code) }

Then, for each key/value pair resulting from the grouping stage, we output a list of records (or an empty list if the value is a single record). Hence the flatMap, which flattens the lists that are produced.

Here is the part that gets rid of single records:

case (_, List(singleRecord)) => Nil
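The List(singleRecord) pattern matches exactly a one-element list; a tiny sketch of the same idea on plain Ints (made-up data, not from the answer):

```scala
// groups with exactly one element are dropped, everything else is kept as-is
def keepRepeats(group: List[Int]): List[Int] = group match {
  case List(_)  => Nil
  case records  => records
}
assert(keepRepeats(List(42)) == Nil)
assert(keepRepeats(List(1, 2)) == List(1, 2))
```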

The other case deals with the creation of cumulative amounts (a List[Int]). (Note for Spark developers: this groupBy does preserve the order of value elements within a given key.)

val amounts = records.map {
    case (record, order) => record.amount.toInt
  }.foldLeft(List[Int]()) {
    case (Nil, addAmount) => List(addAmount)
    case (previousAmounts :+ lastAmount, addAmount) =>
      previousAmounts :+ lastAmount :+ (lastAmount + addAmount)
  }
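The foldLeft above builds the list of running sums. As a side note (an alternative, not part of the answer), the same prefix sums can be produced with scanLeft, which emits every intermediate accumulator:

```scala
// amounts for the (SEMI, GAUTAM A MR., 45873) key, taken from the sample data
val amounts = List(20, 6750, 2600)
// scanLeft(0)(_ + _) yields 0 plus every prefix sum; drop the initial 0
val running = amounts.scanLeft(0)(_ + _).tail
assert(running == List(20, 6770, 9370))
```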

These amounts are zipped back with the records in order to replace each record's amount with the corresponding accumulated amount. This is also where records are serialized to the final desired format:

records
    .zip(amounts)
    .zipWithIndex
    .map {
      case (((rec, order), amount), idx) =>
        val serializedRecord =
          List(rec.id, rec.code, rec.year, amount, idx + 1).mkString(
            ",")
        (serializedRecord, (firstKeyIdx, idx))
    }

The previous part also zipped each record with its index. Each serialized record is paired with a tuple (firstKeyIdx, idx), which is then used to order the records as needed: first by order of appearance of the key (firstKeyIdx), then, for records coming from the same key, by the "nested" order idx:

.sortBy { case (serializedRecord, finalOrder) => finalOrder }
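This works because Scala provides a lexicographic Ordering for tuples: pairs are compared by their first component, then by their second. A small sketch with made-up data:

```scala
// (payload, (firstKeyIdx, idx)) pairs in scrambled order
val keyed = List(("third", (1, 0)), ("second", (0, 1)), ("first", (0, 0)))
// sortBy on an (Int, Int) key orders by firstKeyIdx, then idx
val sorted = keyed.sortBy { case (_, order) => order }.map(_._1)
assert(sorted == List("first", "second", "third"))
```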



Answer 3:


Here is a functional/recursive way of solving this problem, based on @SergGr's solution which rightfully introduced the LinkedHashMap.

Given this input:

val repeatSmokers: List[(String, String, String, String, String, Int)] =
  List(
    ("ID76182", "sachin", "kita MR.", "56308", "1990", 300),
    ("ID76182", "KOUN", "Jana MR.", "56714", "1990", 100),
    ("ID76182", "GANGS", "SKILL", "27539", "1990", 255),
    ("ID76182", "GANGS", "SKILL", "27539", "1990", 110),
    ("ID76182", "SEMI", "GAUTAM A MR.", "45873", "1990", 20),
    ("ID76182", "SEMI", "GAUTAM A MR.", "45873", "1990", 6750),
    ("ID76182", "DOWNES", "RYAN", "47542", "1990", 2090),
    ("ID76182", "DRAGON", "WARS", "49337", "1990", 200),
    ("ID76182", "HULK", "PAIN MR.", "47542", "1990", 280),
    ("ID76182", "JAMES", "JIM", "30548", "1990", 300),
    ("ID76182", "KIMMELSHUE", "RUTH", "55345", "1990", 2600),
    ("ID76182", "DRAGON", "WARS", "49337", "1990", 370),
    ("ID76182", "COOPER", "ANADA", "45873", "1990", 2600),
    ("ID76182", "SEMI", "GAUTAM A MR.", "45873", "1990", 2600),
    ("ID76182", "HULK", "PAIN MR.", "47542", "1990", 256)
  )

First, prepare the data this way:

case class Record(
  id: String, fname: String, lname: String,
  code: String, year: String, var amount: Int
)

case class Key(fname: String, lname: String, code: String)

val preparedRecords: List[(Key, Record)] = repeatSmokers.map {
  case recordTuple @ (_, fname, lname, code, _, _) =>
    (Key(fname, lname, code), Record.tupled(recordTuple))
}

Then, aggregate recursively, preserving insertion order with a mutable LinkedHashMap:

import scala.collection.mutable.LinkedHashMap

def aggregateDuplicatesWithOrder(
    remainingRecords: List[(Key, Record)],
    processedRecords: LinkedHashMap[Key, List[Record]]
): LinkedHashMap[Key, List[Record]] =
  remainingRecords match {

    case (key, record) :: newRemainingRecords => {

      processedRecords.get(key) match {
        case Some(recordList :+ lastRecord) => {
          record.amount = record.amount + lastRecord.amount
          processedRecords.update(key, recordList :+ lastRecord :+ record)
        }
        case None => processedRecords(key) = List(record)
      }

      aggregateDuplicatesWithOrder(newRemainingRecords, processedRecords)
    }

    case Nil => processedRecords
  }

val result = aggregateDuplicatesWithOrder(
  preparedRecords, LinkedHashMap[Key, List[Record]]()
).values.flatMap {
  case _ :: Nil => Nil
  case records =>
    records.zipWithIndex.map { case (rec, idx) =>
      List(rec.id, rec.code, rec.year, rec.amount, idx + 1).mkString(",")
    }
}


Source: https://stackoverflow.com/questions/48782282/aggregate-duplicate-records-by-maintaining-the-order-and-also-include-duplicate
