Spark Data Mining: Understanding App Descriptions with Latent Semantic Analysis (LSA) (1)

1 Introduction

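Structured data is fairly straightforward to process, whereas unstructured data such as text and speech is more challenging. For text, the most mature technology today is the search engine, which quickly finds documents that contain given keywords. Sometimes, however, people want documents about a particular concept and do not care whether those documents contain any specific keyword. What can be done in that case?
Latent Semantic Analysis (LSA) is a natural language processing and information retrieval technique for better understanding the relationships between the terms and the documents in a corpus. It tries to extract a set of concepts from the corpus. Each concept corresponds to a group of terms and usually to a topic discussed in the corpus. Setting the data aside for a moment, each concept consists of three attributes: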

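The relevance of each document to the concept
The relevance of each term to the concept
An importance score indicating how much of the variation (variance) in the dataset the concept describes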

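For example, LSA might find a concept that correlates strongly with the terms "stocks" and "stock trading" and also with a series of articles on internet finance. By keeping only the most important concepts, LSA discards some of the noise in the data. This compact representation is useful in many settings, such as computing term-term, document-document, and term-document similarity. The concept scores give a deeper understanding of the corpus than simply counting terms or co-occurring terms. These similarity measures help with synonym queries, clustering documents by topic, tagging documents, and similar tasks.
The main technique behind LSA is singular value decomposition (SVD). First build a term-document importance matrix (usually a TF-IDF matrix), then use SVD to factor it into a product of three matrices that approximates the original: U S V^T. With documents as rows, as in the Spark code later in this article, U relates documents to concepts, the diagonal of S gives the importance of each concept, and V relates terms to concepts.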
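As a rough illustration of the similarity computations mentioned above, the sketch below compares documents by the cosine of their concept-space vectors. The class name, the method names, and all numbers are made up for illustration; in practice each vector would hold one document's weights on the top concepts, scaled by the concept importance scores.

```java
final class ConceptSimilarity {

    /** Multiplies each concept weight by that concept's importance score. */
    static double[] scale(double[] weights, double[] importance) {
        if (weights.length != importance.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double[] scaled = new double[weights.length];
        for (int i = 0; i < weights.length; i++) {
            scaled[i] = weights[i] * importance[i];
        }
        return scaled;
    }

    /** Cosine similarity of two vectors; returns 0 if either vector is all zeros. */
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] importance = {3.0, 1.0}; // hypothetical concept importance scores
        // Hypothetical document-concept weights for three documents
        double[] docA = scale(new double[]{0.8, 0.1}, importance);
        double[] docB = scale(new double[]{0.7, 0.2}, importance);
        double[] docC = scale(new double[]{0.1, 0.9}, importance);
        System.out.println("A vs B: " + cosine(docA, docB));
        System.out.println("A vs C: " + cosine(docA, docC));
    }
}
```

Documents A and B load mostly on the first concept, so their cosine comes out higher than that of A and C. The same function works for term vectors.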
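The rest of this article walks through the full pipeline: after crawling app information from Wandoujia (豌豆荚), how to read the data with Spark, segment the text, remove noise words, convert the data to numeric form, and finally compute the SVD, along with how to interpret and use the results.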

2 Dataset (Wandoujia app data)

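The crawler is not the focus of this article; interested readers can look at nlp-spider, an open-source crawler built by the author. This article uses only the data crawled from Wandoujia's top-level category "金融理财" (finance and wealth management). Only three fields are extracted: package_name (the package name), description (the app description), and categories (the category names). For example: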

com.zmfz.app  "影视制片过程管理系统，对演员，设备，道具，剧本进行分类管理"   [{level: 1, name: "金融理财"},{level: 2, name: "记账"}]
cn.fa.creditcard  "办信用卡，方便快捷"  [{level: 1, name: "金融理财"},{level: 2, name: "银行"}]
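The categories field in the samples above is not strict JSON, because its keys are unquoted. One possible way to pull out the category names is a regular expression, sketched below. This is only an assumption about how such parsing could be done, not the author's JsonUtil; the class name and method name are hypothetical, and the sample string uses ASCII stand-ins for the Chinese category names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CategoryNames {

    // Matches name: "..." and captures the quoted value
    private static final Pattern NAME = Pattern.compile("name\\s*:\\s*\"([^\"]*)\"");

    /** Returns every category name found in the raw categories field, in order. */
    static List<String> extract(String categories) {
        List<String> names = new ArrayList<>();
        Matcher matcher = NAME.matcher(categories);
        while (matcher.find()) {
            names.add(matcher.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        // ASCII stand-in for a categories field in the crawled data
        String raw = "[{level: 1, name: \"Finance\"},{level: 2, name: \"Bank\"}]";
        System.out.println(extract(raw));
    }
}
```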

3 Data cleaning

public static void clearWandoujiaAppData(
      String categoryFile,     // file listing the categories to keep
      String filePath,         // file holding the crawled data
      String fieldsTerminated  // field separator used in the data file
) {
  // LOGGER, fileEncoding, isForClass and isContainsChinese belong to the enclosing class (not shown)
  List<String> changeLines;
  File wdj = new File(filePath);
  if (!wdj.exists()) {
      LOGGER.error("file:" + wdj.getAbsolutePath() + " does not exist, please check!");
      return;
  }
  try {
      List<String> categories = FileUtils.readLines(new File(categoryFile), fileEncoding);
      List<String> lines = FileUtils.readLines(wdj, fileEncoding);
      changeLines = new ArrayList<String>(lines.size()*2);
      for (String line : lines) {
          String[] cols = StringUtils.split(line, fieldsTerminated);
          // Drop records that do not have exactly three fields
          if (cols.length != 3) {
              LOGGER.warn("line:" + line + ", format error!");
              continue;
          }
          // Drop descriptions that are blank, garbled, contain no Chinese, or are too short
          if (StringUtils.isBlank(cols[1])){
              LOGGER.warn("line:" + line + ", content all blank!");
              continue;
          }
          if (StringUtils.contains(cols[1], "?????")){
              LOGGER.warn("line:" + line + ", content contains garbled characters!");
              continue;
          }
          if (!isContainsChinese(cols[1])){
              LOGGER.warn("line:" + line + ", content contains no Chinese characters!");
              continue;
          }
          if (cols[1].length() <= 10){
              LOGGER.warn("line:" + line + ", content too short!");
              continue;
          }

          List<String> cates = JsonUtil.jsonParseAppCategories(cols[2], "name");
          // Keep only apps under the top-level category "金融理财" (finance and wealth management)
          if (cates.contains("金融理财")) {
              if (isForClass) {
                  // For classification: emit one record per wanted sub-category
                  for (String cate : cates) {
                      if (StringUtils.equals(cate, "金融理财"))
                          continue;
                      else {
                          if (categories.contains(cate)) {
                              String[] newLines = new String[]{cols[0], StringUtils.trim(cols[1]), cate};
                              changeLines.add(StringUtil.mkString(newLines, fieldsTerminated));
                          }
                      }
                  }
              } else {
                  // Otherwise keep only the package name and the description
                  String[] newLines = new String[]{cols[0], cols[1]};
                  changeLines.add(StringUtil.mkString(newLines, fieldsTerminated));
              }
          }
      }
      // Write the cleaned records next to the input file, with a ".clear" suffix
      FileUtils.writeLines(new File(wdj.getParent(), wdj.getName() + ".clear"), changeLines);
  } catch (IOException e) {
      LOGGER.error("failed to clean file:" + wdj.getAbsolutePath(), e);
  }
}

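The code above removes unneeded records and keeps only the finance and wealth management data. The classes it uses come from the following libraries: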

FileUtils: commons-io
StringUtils: commons-lang
JsonUtil: fastjson
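The cleaning code also calls isContainsChinese, which is not shown in this article. A minimal sketch of such a check, assuming it only needs to detect at least one Han character, could look like this. The class and method names here are hypothetical, and the Han script also covers characters used in Japanese and other languages.

```java
final class ChineseCheck {

    /** Returns true if the text contains at least one code point in the Han script. */
    static boolean containsChinese(String text) {
        if (text == null) {
            return false;
        }
        return text.codePoints()
                .anyMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN);
    }

    public static void main(String[] args) {
        // "\u4e2d\u6587" is the word for "Chinese (language)", written with Unicode escapes
        System.out.println(containsChinese("\u4e2d\u6587 app"));
        System.out.println(containsChinese("credit card app"));
    }
}
```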

4 Word segmentation

Segmentation mainly relies on HanLP (https://github.com/hankcs/HanLP), a natural language processing toolkit. The key code is shown below:

public static List<String> segContent(String content) {
    // sw is the stop-word helper held by the enclosing class (not shown)
    List<String> words = new ArrayList<String>(content.length());
    List<Term> terms = HanLP.segment(content);
    for (Term term : terms) {
        String word = term.word;
        // Skip words shorter than two characters or containing Latin letters
        if (word.length() < 2 || word.matches(".*[a-zA-Z].*"))
            continue;
        String nature = term.nature.name();
        // Part-of-speech filter: keep only tags starting with a, g, n or v
        if (nature.startsWith("a") ||
                nature.startsWith("g") ||
                nature.startsWith("n") ||
                nature.startsWith("v")
                ) {
            // Remove stop words
            if (!sw.isStopWord(word))
                words.add(word);
        }

    }
    return words;
}
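The word-level checks in segContent can be tried without HanLP. The sketch below isolates them, using a plain HashSet as a stand-in for the stop-word helper sw, whose real implementation is not shown in this article. The class name, method name, and sample words are hypothetical, and the part-of-speech filter is left out because it needs HanLP's tags.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class TokenFilter {

    private final Set<String> stopWords;

    TokenFilter(Set<String> stopWords) {
        this.stopWords = new HashSet<>(stopWords);
    }

    /** Keeps a token only if it has at least two characters, no Latin letters, and is not a stop word. */
    boolean keep(String word) {
        if (word.length() < 2) {
            return false;
        }
        if (word.matches(".*[a-zA-Z].*")) {
            return false;
        }
        return !stopWords.contains(word);
    }

    public static void main(String[] args) {
        // "\u6211\u4eec" is a two-character Chinese pronoun, used here as a sample stop word
        TokenFilter filter = new TokenFilter(new HashSet<>(Arrays.asList("\u6211\u4eec")));
        // "\u7406\u8d22" is a two-character finance term that should pass the filter
        for (String word : Arrays.asList("\u7406\u8d22", "\u6211\u4eec", "app", "\u7406")) {
            System.out.println(word + " -> " + filter.keep(word));
        }
    }
}
```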

5 Loading the data with Spark and computing the SVD

SVD needs a matrix as input. This article uses a TF-IDF matrix, which can be understood as follows:

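TF: term frequency, how often a term occurs in a document = number of occurrences of the term / total number of terms in the document
IDF: inverse document frequency = 1 / document frequency = total number of documents / number of documents that contain the term

TF-IDF = TF * log(IDF)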
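To make the formula concrete, here is a small worked sketch using the natural logarithm. The class name, method names, and the counts are made up for illustration.

```java
final class TfIdf {

    /** Term frequency: occurrences of the term divided by the document length. */
    static double tf(int termCount, int docLength) {
        if (docLength <= 0) {
            throw new IllegalArgumentException("docLength must be positive");
        }
        return (double) termCount / docLength;
    }

    /** Log of the inverse document frequency: log(total documents / documents containing the term). */
    static double logIdf(long numDocs, long docsWithTerm) {
        if (docsWithTerm <= 0) {
            throw new IllegalArgumentException("docsWithTerm must be positive");
        }
        return Math.log((double) numDocs / docsWithTerm);
    }

    /** TF-IDF weight of one term in one document. */
    static double tfIdf(int termCount, int docLength, long numDocs, long docsWithTerm) {
        return tf(termCount, docLength) * logIdf(numDocs, docsWithTerm);
    }

    public static void main(String[] args) {
        // Hypothetical corpus of 10 documents; the term occurs once in a 4-term document
        System.out.println("rare term:   " + tfIdf(1, 4, 10, 1));
        System.out.println("common term: " + tfIdf(1, 4, 10, 5));
        System.out.println("in all docs: " + tfIdf(1, 4, 10, 10));
    }
}
```

A term that appears in every document gets weight 0, because log(1) = 0, while rarer terms get larger weights.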

The full computation is given below; the code is commented throughout:

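import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.{Matrix, SingularValueDecomposition, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

import scala.collection.mutable
import scala.collection.mutable.ArrayBuffer

object SVDComputer {
  val rootDir = "your_data_dir"
  // Runs locally for testing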
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("SVDComputer").setMaster("local[4]")
    val sc = new SparkContext(conf)
    val rawData = sc.textFile(rootDir + "/your_file_name")
    val tables = rawData.map {
      line =>
        val cols = line.split("your_field_separator")
        val appId = cols(0)
        val context = cols(1)
        (appId, context.split(" "))
    }
    val numDocs = tables.count()
    // Count how often each term occurs in each document -> used to compute tf
    val dtf = docTermFreqs(tables.values)
    val docIds = tables.keys.zipWithIndex().map{case (key, value) => (value, key)}.collect().toMap
    dtf.cache()
    // Count the number of distinct documents each term occurs in -> used to compute idf
    val termFreq = dtf.flatMap(_.keySet).map((_, 1)).reduceByKey(_ + _)

    // Compute idf
    val idfs = termFreq.map {
      case (term, count) => (term, math.log(numDocs.toDouble/count))
    }.collect().toMap

    // Encode terms as integer ids, because Spark vectors are indexed by Int, not by String
    val termIds = idfs.keys.zipWithIndex.toMap
    val idTerms = termIds.map{case (term, id) => (id -> term)}
    val bIdfs = sc.broadcast(idfs).value
    val bTermIds = sc.broadcast(termIds).value
    // Compute tf-idf from the term frequencies (dtf) and the inverse document frequencies (idfs)

    val vecs = buildIfIdfMatrix(dtf, bIdfs, bTermIds)
    val mat = new RowMatrix(vecs)
    val svd = mat.computeSVD(1000, computeU = true)

    println("Singular values: " + svd.s)
    val topConceptTerms = topTermsInTopConcepts(svd, 10, 10, idTerms)
    val topConceptDocs = topDocsInTopConcepts(svd, 10, 10, docIds)
    for ((terms, docs) <- topConceptTerms.zip(topConceptDocs)) {
      println("Concept terms: " + terms.map(_._1).mkString(", "))
      println("Concept docs: " + docs.map(_._1).mkString(", "))
      println()
    }
    //dtf.take(10).foreach(println)
  }

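  /**
    * Counts, for every document, how many times each term occurs.
    * @param lemmatized documents as arrays of segmented terms
    * @return one term -> count map per document
    */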
  def  docTermFreqs(lemmatized: RDD[Array[String]]):
        RDD[mutable.HashMap[String, Int]] = {
    val dtf = lemmatized.map(terms => {
      val termFreqs = terms.foldLeft(new mutable.HashMap[String, Int]){
        (map, term) => {
          map += term -> (map.getOrElse(term, 0) + 1)
          map
        }
      }
      termFreqs
    })
    dtf
  }

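  /**
    * Builds the tf-idf matrix, one sparse row vector per document.
    * @param termFreq per-document term counts
    * @param bIdfs idf value of each term
    * @param bTermIds integer id of each term
    * @return RDD of sparse tf-idf vectors
    */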
  def buildIfIdfMatrix(termFreq: RDD[mutable.HashMap[String, Int]],
                       bIdfs: Map[String, Double],
                       bTermIds: Map[String, Int]) = {
    termFreq.map {
      tf =>
        val docTotalTerms = tf.values.sum
        // First drop terms that have no integer id
        val termScores = tf.filter {
          case (term, freq) => bTermIds.contains(term)
        }.map {
          case (term, freq) => (bTermIds(term),
            bIdfs(term) * freq / docTotalTerms)
        }.toSeq
        Vectors.sparse(bTermIds.size, termScores)
    }
  }


  def topTermsInTopConcepts(svd: SingularValueDecomposition[RowMatrix, Matrix], numConcepts: Int,
                            numTerms: Int, termIds: Map[Int, String]): Seq[Seq[(String, Double)]] = {
    val v = svd.V
    val topTerms = new ArrayBuffer[Seq[(String, Double)]]()
    // v.toArray is in column-major order, so column i (concept i) is one contiguous slice
    val arr = v.toArray
    for (i <- 0 until numConcepts) {
      val offs = i * v.numRows
      val termWeights = arr.slice(offs, offs + v.numRows).zipWithIndex
      val sorted = termWeights.sortBy(-_._1)
      topTerms += sorted.take(numTerms).map{case (score, id) => (termIds(id), score)}
    }
    topTerms
  }

  def topDocsInTopConcepts(svd: SingularValueDecomposition[RowMatrix, Matrix], numConcepts: Int,
                           numDocs: Int, docIds: Map[Long, String]): Seq[Seq[(String, Double)]] = {
    val u  = svd.U
    val topDocs = new ArrayBuffer[Seq[(String, Double)]]()
    for (i <- 0 until numConcepts) {
      // Use zipWithIndex so the row ids match docIds, which main builds with zipWithIndex
      val docWeights = u.rows.map(_.toArray(i)).zipWithIndex
      topDocs += docWeights.top(numDocs).map{case (score, id) => (docIds(id), score)}
    }
    topDocs
  }

}
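The slicing in topTermsInTopConcepts relies on the matrix being flattened in column-major order, so that each concept's term weights sit next to each other in the array. The standalone sketch below shows the same top-N selection on a tiny hand-made matrix. The class name, method name, and the numbers are hypothetical.

```java
import java.util.Arrays;

final class TopTerms {

    /**
     * Returns the row indices of the n largest entries in one column of a
     * matrix stored as a column-major array with numRows rows.
     */
    static int[] topRowsInColumn(double[] values, int numRows, int col, int n) {
        final int offset = col * numRows; // start of this column in the flat array
        Integer[] rows = new Integer[numRows];
        for (int r = 0; r < numRows; r++) {
            rows[r] = r;
        }
        // Sort row indices by their weight in this column, largest first
        Arrays.sort(rows, (a, b) -> Double.compare(values[offset + b], values[offset + a]));
        int k = Math.min(n, numRows);
        int[] top = new int[k];
        for (int i = 0; i < k; i++) {
            top[i] = rows[i];
        }
        return top;
    }

    public static void main(String[] args) {
        // Hypothetical 3 terms x 2 concepts matrix, column-major:
        // concept 0 -> {0.1, 0.9, 0.5}, concept 1 -> {0.7, 0.2, 0.3}
        double[] v = {0.1, 0.9, 0.5, 0.7, 0.2, 0.3};
        System.out.println(Arrays.toString(topRowsInColumn(v, 3, 0, 2)));
        System.out.println(Arrays.toString(topRowsInColumn(v, 3, 1, 2)));
    }
}
```

The returned row indices play the role of term ids, which the Scala code then maps back to term strings.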

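Running the code above produces the following output, which lists the ten most relevant terms and the ten most relevant documents for each of the first 10 concepts: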

Concept terms: 彩票, 记账, 开奖, 理财, 中奖, 大方, 大乐透, 竞彩, 收入, 开发
Concept docs: com.payegis.mobile.energy, audaque.SuiShouJie, com.cyht.dcjr, com.xlltkbyyy.finance, com.zscfappview.jinzheng.wenjiaosuo, com.wukonglicai.app, com.goldheadline.news, com.rytong.bank_cgb.enterprise, com.xfzb.yyd, com.chinamworld.bfa

Concept terms: 茂日, 厕所, 洗浴, 围脖, 邮局, 乐得, 大王, 艺龙, 开开, 茶馆
Concept docs: ylpad.ylpad, com.xh.xinhe, com.jumi, com.zjzx.licaiwang168, com.ss.app, com.yingdong.zhongchaoguoding, com.noahwm.android, com.ylink.MGessTrader_QianShi, com.ssc.P00120, com.monyxApp

Concept terms: 彩票, 开奖, 投注, 中奖, 双色球, 福彩, 号码, 彩民, 排列, 大乐透
Concept docs: ssq.random, com.wukonglicai.app, com.cyht.dcjr, com.tyun.project.app104, com.kakalicai.lingqian, com.wutong, com.icbc.android, com.mzmoney, com.homelinkLicai.activity, com.pingan.lifeinsurance

Concept terms: 开户, 证券, 行情, 股票, 交易, 炒股, 资讯, 基金, 期货, 东兴
Concept docs: com.byp.byp, com.ea.view, com.hmt.jinxiangApp, cn.com.ifsc.yrz, com.cgbsoft.financial, com.eeepay.bpaybox.home.htf, com.gy.amobile.person, wmy.android, me.xiaoqian, cn.eeeeeke.iehejdleieiei

Concept terms: 贷款, 彩票, 开户, 抵押, 证券, 信用, 银行, 申请, 小额, 房贷
Concept docs: com.silupay.silupaymr, com.zscfandroid_guoxinqihuo, com.jin91.preciousmetal, com.manqian.youdan.activity, com.zbar.lib.yijiepay, com.baobei.system, com.caimi.moneymgr, com.thinkive.mobile.account_yzhx, com.qianduan.app, com.bocop.netloan

Concept terms: 支付, 理财, 银行, 信用卡, 刷卡, 金融, 收益, 商户, 硬件, 收款
Concept docs: com.unicom.wopay, com.hexun.futures, com.rapidvalue.android.expensetrackerlite, OTbearStockJY.namespace, gupiao.caopanshou.bigew, com.yucheng.android.yiguan, com.wzlottery, com.zscfappview.shanghaizhongqi, com.wareone.tappmt, com.icbc.echannel

Concept terms: 行情, 理财, 投资, 比特币, 黄金, 汇率, 资讯, 原油, 财经, 贵金属
Concept docs: com.rytong.bankps, com.souyidai.investment.android, com.css.sp2p.invest.activity, com.lotterycc.android.lottery77le, com.sub4.caogurumen, com.feifeishucheng.canuciy, com.hundsun.zjfae, cn.cctvvip, com.mr.yironghui.activity, org.zywx.wbpalmstar.widgetone.uex11328838

Concept terms: 信用卡, 行情, 刷卡, 硬件, 汇率, 比特币, 支付, 交易, 商户, 黄金
Concept docs: com.unicom.wopay, com.hexun.futures, gupiao.caopanshou.bigew, com.rytong.bankps, com.souyidai.investment.android, com.feifeishucheng.canuciy, com.css.sp2p.invest.activity, org.zywx.wbpalmstar.widgetone.uex11328838, com.net.caishi.caishilottery, com.lotterycc.android.lottery77le

Concept terms: 行情, 刷卡, 硬件, 记账, 贷款, 交易, 资讯, 支付, 易贷, 比特币
Concept docs: com.unicom.wopay, com.shengjingbank.mobile.cust, com.rytong.bankps, com.souyidai.investment.android, com.silupay.silupaymr, aolei.sjcp, com.css.sp2p.invest.activity, com.megahub.brightsmart.fso.mtrader.activity, com.manqian.youdan.activity, gupiao.caopanshou.bigew

Concept terms: 刷卡, 硬件, 支付, 汇率, 换算, 贷款, 收款, 商户, 货币, 易贷
Concept docs: com.unicom.wopay, OTbearStockJY.namespace, com.yucheng.android.yiguan, com.wzlottery, com.zscfappview.shanghaizhongqi, com.junanxinnew.anxindainew, com.bitjin.newsapp, com.feifeishucheng.canuciy, org.zywx.wbpalmstar.widgetone.uexYzxShubang, com.qsq.qianshengqian

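The results above are acceptable, and their quality is partly limited by the small corpus. Each concept is fairly focused on a single topic; for example, the first concept is about lottery (彩票) and related terms. Specific applications are not covered here.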