Serialization Exception on spark

后端 未结 2 1188
盖世英雄少女心
盖世英雄少女心 2021-02-09 07:12

I meet a very strange problem on Spark about serialization. The code is as below:

class PLSA(val sc : SparkContext, val numOfTopics : Int) extends Serializable
{         


        
相关标签:
2条回答
  • 2021-02-09 08:13

    Anonymous functions serialize their containing class. When you map {doc => DocumentParameter(doc, numOfTopics)}, the only way it can give that function access to numOfTopics is to serialize the PLSA class. And that class can't actually be serialized, because (as you can see from the stacktrace) it contains the SparkContext which isn't serializable (Bad Things would happen if individual cluster nodes had access to the context and could e.g. create new jobs from within a mapper).

    In general, try to avoid storing the SparkContext in your classes (edit: or at least, make sure it's very clear what kind of classes contain the SparkContext and what kind don't); it's better to pass it as a (possibly implicit) parameter to individual methods that need it. Alternatively, move the function {doc => DocumentParameter(doc, numOfTopics)} into a different class from PLSA, one that really can be serialized.

    (As multiple people have suggested, it's possible to keep the SparkContext in the class but marked as @transient so that it won't be serialized. I don't recommend this approach; it means the class will "magically" change state when serialized (losing the SparkContext), and so you might end up with NPEs when you try to access the SparkContext from inside a serialized job. It's better to maintain a clear distinction between classes that are only used in the "control" code (and might use the SparkContext) and classes that are serialized to run on the cluster (which must not have the SparkContext)).

    0 讨论(0)
  • 2021-02-09 08:14

    This is indeed a weird one, but I think I can guess the problem. But first, you have not provided the bare minimum to solve the problem (I can guess, because I've seen 100s of these before). Here are some problems with your question:

    def infer(document: RDD[Document], numOfTopics: Int): RDD[DocumentParameter] = {
      val docs = documents.map(doc => DocumentParameter(doc, numOfTopics))
    }
    

    This method doesn't return RDD[DocumentParameter] it returns Unit. You must have copied and pasted code incorrectly.

    Secondly you haven't provided the entire stack trace? Why? There is no reason NOT to provide the full stack trace, and the full stack trace with message is necessary to understand the error - one needs the whole error to understand what the error is. Usually a not serializable exception tells you what is not serializable.

    Thirdly you haven't told us where method infer is, are you doing this in a shell? What is the containing object/class/trait etc of infer?

    Anyway, I'm going guess that by passing in the Int your causing a chain of things to get serialized that you don't expect, I can't give you any more information than that until you provide the bare minimum code so we can fully understand your problem.

    0 讨论(0)
提交回复
热议问题