StackOverflowError when operating with a large number of columns in Spark

长发绾君心 2021-01-03 00:33

I have a wide dataframe (130000 rows x 8700 columns), and when I try to sum all columns I'm getting the following error:

Exception in thread "main" j

1 answer
  •  囚心锁ツ
    2021-01-03 01:26

    You can use a different reduction method that produces a balanced binary tree of depth O(log(n)) instead of a degenerate linearized BinaryExpression chain of depth O(n):

    def balancedReduce[X](list: List[X])(op: (X, X) => X): X = list match {
      case Nil => throw new IllegalArgumentException("Cannot reduce empty list")
      case List(x) => x
      case xs => {
        // Split the list in half and combine the two recursively reduced halves
        val (as, bs) = xs.splitAt(xs.size / 2)
        op(balancedReduce(as)(op), balancedReduce(bs)(op))
      }
    }
    

    Now in your code, you can replace

    colsList.reduce(_ + _)
    

    by

    balancedReduce(colsList)(_ + _)
    

    A little example to further illustrate what happens with the BinaryExpressions, compilable without any dependencies:

    sealed trait FormalExpr
    case class BinOp(left: FormalExpr, right: FormalExpr) extends FormalExpr {
      override def toString: String = {
        val lStr = left.toString.split("\n").map("  " + _).mkString("\n")
        val rStr = right.toString.split("\n").map("  " + _).mkString("\n")
        s"BinOp(\n${lStr}\n${rStr}\n)"
      }
    }
    case object Leaf extends FormalExpr
    
    val leafs = List.fill[FormalExpr](16){Leaf}
    
    println(leafs.reduce(BinOp(_, _)))
    println(balancedReduce(leafs)(BinOp(_, _)))
    

    This is what the ordinary reduce does (and this is what essentially happens in your code):

    BinOp(
      BinOp(
        BinOp(
          BinOp(
            BinOp(
              BinOp(
                BinOp(
                  BinOp(
                    BinOp(
                      BinOp(
                        BinOp(
                          BinOp(
                            BinOp(
                              BinOp(
                                BinOp(
                                  Leaf
                                  Leaf
                                )
                                Leaf
                              )
                              Leaf
                            )
                            Leaf
                          )
                          Leaf
                        )
                        Leaf
                      )
                      Leaf
                    )
                    Leaf
                  )
                  Leaf
                )
                Leaf
              )
              Leaf
            )
            Leaf
          )
          Leaf
        )
        Leaf
      )
      Leaf
    )
    

    This is what balancedReduce produces:

    BinOp(
      BinOp(
        BinOp(
          BinOp(
            Leaf
            Leaf
          )
          BinOp(
            Leaf
            Leaf
          )
        )
        BinOp(
          BinOp(
            Leaf
            Leaf
          )
          BinOp(
            Leaf
            Leaf
          )
        )
      )
      BinOp(
        BinOp(
          BinOp(
            Leaf
            Leaf
          )
          BinOp(
            Leaf
            Leaf
          )
        )
        BinOp(
          BinOp(
            Leaf
            Leaf
          )
          BinOp(
            Leaf
            Leaf
          )
        )
      )
    )
    

    The linearized chain has depth O(n), and when Catalyst tries to process it recursively, it blows the stack. This should not happen with the shallow, balanced tree of depth O(log(n)): for 8700 columns, that is a depth of 14 instead of roughly 8700.
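
    To make the difference concrete for the size in the question, here is a minimal, dependency-free sketch that builds both trees for 8700 leaves and measures their depth. The names `Expr`, `depth`, `linearDepth` and `balancedDepth` are made up for this illustration and have nothing to do with Spark's own classes; the depth is computed level by level with a loop, so that the measurement itself does not recurse down the deep chain.

    ```scala
    // Toy expression tree, analogous to the FormalExpr example above.
    sealed trait Expr
    case object Leaf extends Expr
    final case class BinOp(left: Expr, right: Expr) extends Expr

    // Same balanced reduction as in the answer.
    def balancedReduce[X](list: List[X])(op: (X, X) => X): X = list match {
      case Nil      => throw new IllegalArgumentException("Cannot reduce empty list")
      case x :: Nil => x
      case xs =>
        val (as, bs) = xs.splitAt(xs.size / 2)
        op(balancedReduce(as)(op), balancedReduce(bs)(op))
    }

    // Depth in edges from the root to the deepest leaf.
    // Walks the tree one level at a time instead of recursing.
    def depth(root: Expr): Int = {
      var level: List[Expr] = List(root)
      var d = 0
      while (level.nonEmpty) {
        // Replace the current level by all children of its nodes.
        level = level.flatMap {
          case BinOp(l, r) => List(l, r)
          case Leaf        => Nil
        }
        if (level.nonEmpty) d += 1
      }
      d
    }

    val leaves: List[Expr] = List.fill[Expr](8700)(Leaf)

    val linear: Expr   = leaves.reduce(BinOp(_, _))          // left-deep chain
    val balanced: Expr = balancedReduce(leaves)(BinOp(_, _)) // balanced tree

    val linearDepth   = depth(linear)   // 8699
    val balancedDepth = depth(balanced) // 14
    ```

    Any recursive traversal needs roughly one stack frame (or more) per level, so the first tree needs thousands of nested calls, while the second needs only a handful.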

    And while we are talking about asymptotic runtimes: why are you appending to a mutable colsList? This needs O(n^2) time. Why not simply call toList on the output of .columns?
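
    A small sketch of that last point, without Spark. It assumes the list in the question was built by repeated `:+` on a `var`; `columnNames` is a hypothetical stand-in for the `Array[String]` that `.columns` returns.

    ```scala
    // Hypothetical stand-in for the Array[String] returned by .columns.
    val columnNames: Array[String] = Array.tabulate(8700)(i => s"c$i")

    // Quadratic: every `:+` on an immutable List copies the list built so far.
    var appended = List.empty[String]
    for (name <- columnNames) appended = appended :+ name

    // Linear: convert the array in a single pass.
    val direct: List[String] = columnNames.toList
    ```

    Both variants produce the same list; only the amount of work differs.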
