The TF-IDF (Term Frequency-Inverse Document Frequency) Algorithm

Posted by 一个人想着一个人 on 2020-03-01 04:36:25

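I. Introduction

  1. TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and text mining.

  2. TF-IDF is a statistical measure of how important a word is to one document within a document collection or corpus.

  3. A word's importance increases with the number of times it appears in the document, but decreases as the word becomes more frequent across the corpus.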

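II. Term Frequency (TF)

  Term frequency is the number of times a given term appears in a given document. This count is usually normalized so that it is not biased toward long documents (the same term is likely to have a higher raw count in a long document than in a short one, regardless of how important the term is).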

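  Formula:

    $$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$

  $n_{i,j}$ is the number of times term $t_i$ appears in document $d_j$; the denominator is the total number of occurrences of all terms in document $d_j$.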
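
  As a small illustration of the normalization described above, the following sketch computes term frequencies for one tokenized document using plain Scala collections (no Spark). The object name `TermFrequencySketch`, the function `termFrequency`, and the toy sentence are hypothetical and are not part of the article's code.

```scala
object TermFrequencySketch {

  /** Normalized term frequency: occurrences of each term divided by
    * the total number of tokens in the document.
    */
  def termFrequency(tokens: Seq[String]): Map[String, Double] = {
    val total = tokens.size.toDouble
    tokens.groupBy(identity).map { case (term, occurrences) =>
      term -> occurrences.size / total
    }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical toy document, not taken from the article's data files
    val doc = "spark is fast and spark is general".split(" ").toSeq
    termFrequency(doc).toSeq.sortBy(_._1).foreach(println)
  }
}
```

  Because every count is divided by the same total, the frequencies of one document always sum to 1, which makes documents of different lengths comparable.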

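III. Inverse Document Frequency (IDF)

  IDF measures how important a term is in general. The IDF of a particular term is obtained by dividing the total number of documents by the number of documents that contain the term, and then taking the logarithm of that quotient.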

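  Formula:

    $$\mathrm{idf}_{i} = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$

  $|D|$: the total number of documents in the corpus

  $|\{j : t_i \in d_j\}|$: the number of documents that contain the term $t_i$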
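
  The IDF definition above can be sketched with plain Scala collections as follows. The text does not specify the base of the logarithm, so the natural logarithm used here is an assumption; the names `InverseDocumentFrequencySketch` and `inverseDocumentFrequency` and the toy corpus are hypothetical as well.

```scala
object InverseDocumentFrequencySketch {

  /** idf(t) = ln(|D| / number of documents containing t),
    * computed for every term that occurs somewhere in the corpus.
    * The natural logarithm is an assumption; the base is not fixed by the text.
    */
  def inverseDocumentFrequency(corpus: Seq[Seq[String]]): Map[String, Double] = {
    val numDocs = corpus.size.toDouble
    corpus
      .flatMap(_.distinct) // count each term at most once per document
      .groupBy(identity)
      .map { case (term, docs) => term -> math.log(numDocs / docs.size) }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical toy corpus of three tokenized documents
    val corpus = Seq(
      Seq("spark", "is", "fast"),
      Seq("hadoop", "is", "mature"),
      Seq("spark", "runs", "on", "hadoop")
    )
    inverseDocumentFrequency(corpus).toSeq.sortBy(_._1).foreach(println)
  }
}
```

  A term that appears in every document gets an IDF of 0, so it contributes nothing to the final weight, while a term found in a single document gets the largest value.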

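IV. TF-IDF

  Formula: TF-IDF = TF * IDF

  Property: a high term frequency within a particular document, combined with a low document frequency of that term across the whole corpus, yields a high TF-IDF weight. TF-IDF therefore tends to filter out common words and keep the important ones.

  Idea: if a word or phrase appears frequently in one article (high TF) and rarely in other articles, it is considered to discriminate well between categories and is well suited for classification.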
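
  To make the filtering effect concrete, here is a minimal sketch that multiplies TF by IDF for one document of a tiny in-memory corpus. It assumes the document is itself a member of the corpus (so every term has a document frequency of at least 1) and uses the natural logarithm; `TfIdfSketch`, `tfIdf`, and the sample sentences are hypothetical names and data, not the article's.

```scala
object TfIdfSketch {

  /** TF-IDF weights for the terms of `doc`, relative to `corpus`.
    * Assumes `doc` is one of the documents in `corpus`, so df >= 1.
    */
  def tfIdf(doc: Seq[String], corpus: Seq[Seq[String]]): Map[String, Double] = {
    val total   = doc.size.toDouble
    val numDocs = corpus.size.toDouble
    doc.groupBy(identity).map { case (term, occurrences) =>
      val tf  = occurrences.size / total          // normalized term frequency
      val df  = corpus.count(_.contains(term))    // documents containing the term
      val idf = math.log(numDocs / df)            // natural log is an assumption
      term -> tf * idf
    }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical toy corpus: "is" occurs in every document
    val corpus = Seq(
      Seq("spark", "is", "fast"),
      Seq("hadoop", "is", "mature"),
      Seq("flink", "is", "new")
    )
    tfIdf(corpus.head, corpus).toSeq.sortBy(-_._2).foreach(println)
  }
}
```

  In this toy corpus the word "is" appears in all three documents, so its IDF is ln(3/3) = 0 and its weight is 0, whereas "spark" and "fast" appear in only one document and keep a positive weight.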

V. Code Implementation

package big.data.analyse.tfidf

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

/**
  * Created by zhen on 2019/05/28.
  */
object TF_IDF {
  /**
    * Set the log level
    */
  Logger.getLogger("org").setLevel(Level.WARN)
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("TF_IDF")
      .master("local[2]")
      .config("spark.sql.warehouse.dir", "file:///D://warehouse").getOrCreate()
    val sc = spark.sparkContext
    /**
      * Compute TF for the target document
      */
    val tf = sc.textFile("src/big/data/analyse/tfidf/TF.txt")
      .map(row => row.replace(",", " ").replace(".", " ").replace("  ", " ")) // data cleaning
      .flatMap(row => row.split(" ")) // split into words
      .map(row => (row, 1.0))
      .reduceByKey(_+_) // occurrences of each word

    val tfSize = tf.map(row => row._2).sum() // total number of words

    val tfed = tf.map(row => (row._1, row._2 / tfSize)) // term frequency
    println("TF:")
    tfed.foreach(println)

    /**
      * Compute IDF: each file contributes 1.0 per distinct word it contains
      */
    val idf_0 = tf.map(row => (row._1, 1.0))
    println("加载IDF1文件数据。。。") // "Loading IDF1 file data..."
    val idf_1 = sc.textFile("src/big/data/analyse/tfidf/IDF1.txt")
      .map(row => row.replace(",", " ").replace(".", " ").replace("  ", " "))
      .flatMap(row => row.split(" "))
      .map(row => (row, 1.0))
      .reduceByKey(_+_)
      .map(row => (row._1, 1.0))

    println("加载IDF2文件数据。。。") // "Loading IDF2 file data..."
    val idf_2 = sc.textFile("src/big/data/analyse/tfidf/IDF2.txt")
      .map(row => row.replace(",", " ").replace(".", " ").replace("  ", " "))
      .flatMap(row => row.split(" "))
      .map(row => (row, 1.0))
      .reduceByKey(_+_)
      .map(row => (row._1, 1.0))

    /**
      * Merge the corpus data.
      * Note: this stores the raw ratio |D| / df with |D| = 3 files; the logarithm
      * from the IDF formula is not applied, and the printed results reflect that.
      */
    val idf = idf_0.union(idf_1).union(idf_2)
      .reduceByKey(_+_) // number of files containing each word
      .map(row => (row._1, 3 / row._2))
    println("IDF:")
    idf.foreach(println)

    /**
      * Join TF with IDF and compute TF-IDF
      */
    println("TF-IDF:")
    tfed.join(idf).map(row => (row._1, "%.4f".format(row._2._1 * row._2._2)))
      .foreach(println)

    spark.stop()
  }
}

VI. Results

TF:
(GraphX,0.011494252873563218)
(are,0.011494252873563218)
(learning,0.011494252873563218)
(Python,0.011494252873563218)
(provides,0.011494252873563218)
(is,0.022988505747126436)
(Please,0.011494252873563218)
(higher-level,0.011494252873563218)
(general,0.011494252873563218)
(Security,0.034482758620689655)
(R,0.011494252873563218)
(fast,0.011494252873563218)
(SQL,0.022988505747126436)
(Apache,0.011494252873563218)
(Java,0.011494252873563218)
(data,0.011494252873563218)
(attack,0.011494252873563218)
(This,0.011494252873563218)
(cluster,0.011494252873563218)
(graph,0.011494252873563218)
(execution,0.011494252873563218)
(MLlib,0.011494252873563218)
(Scala,0.011494252873563218)
(computing,0.011494252873563218)
(downloading,0.011494252873563218)
(Streaming,0.011494252873563218)
(supports,0.022988505747126436)
(engine,0.011494252873563218)
(set,0.011494252873563218)
(running,0.011494252873563218)
(Spark,0.08045977011494253)
(you,0.011494252873563218)
(Overview,0.011494252873563218)
(general-purpose,0.011494252873563218)
(rich,0.011494252873563218)
(APIs,0.011494252873563218)
(vulnerable,0.011494252873563218)
(that,0.011494252873563218)
(a,0.022988505747126436)
(high-level,0.011494252873563218)
(processing,0.022988505747126436)
(OFF,0.011494252873563218)
(before,0.011494252873563218)
(including,0.011494252873563218)
(could,0.011494252873563218)
(optimized,0.011494252873563218)
(in,0.022988505747126436)
(to,0.011494252873563218)
(see,0.011494252873563218)
(graphs,0.011494252873563218)
(of,0.011494252873563218)
(also,0.011494252873563218)
(by,0.022988505747126436)
(structured,0.011494252873563218)
(tools,0.011494252873563218)
(It,0.022988505747126436)
(for,0.034482758620689655)
(mean,0.011494252873563218)
(an,0.011494252873563218)
(machine,0.011494252873563218)
(and,0.06896551724137931)
(system,0.011494252873563218)
(default,0.022988505747126436)
加载IDF1文件数据。。。
加载IDF2文件数据。。。
IDF:
(running,1.5)
(For,3.0)
(visit,3.0)
(The,3.0)
(you,1.0)
(website,1.5)
(than,3.0)
(7,3.0)
(PATH,3.0)
(that,1.0)
(was,1.5)
(a,1.0)
(main,3.0)
(old,3.0)
(high-level,1.5)
(be,1.5)
(quick,3.0)
(processing,1.5)
(could,1.5)
(all,3.0)
(augmenting,3.0)
(optimized,1.5)
(Downloads,3.0)
(follow,3.0)
(applications,3.0)
(classpath,3.0)
(structured,1.5)
(like,1.5)
(along,3.0)
(support,3.0)
(Spark’s,1.5)
(If,3.0)
(but,3.0)
(and,1.0)
(reference,3.0)
(1,3.0)
(g,3.0)
(system,1.5)
(your,3.0)
(10,3.0)
(It’s,3.0)
(are,1.0)
(learning,1.5)
(download,1.5)
(its,3.0)
(After,3.0)
(Building,3.0)
(can,1.5)
(Security,1.5)
(have,3.0)
(runs,3.0)
(6,3.0)
(build,3.0)
(0,1.5)
(SQL,1.0)
(with,1.5)
(locally,3.0)
(projects,3.0)
(their,3.0)
(Get,3.0)
(UNIX-like,3.0)
(This,1.0)
(,1.5)
(first,3.0)
(documentation,3.0)
(Since,3.0)
(still,3.0)
(Downloading,3.0)
(packaged,3.0)
(better,3.0)
(However,3.0)
(switch,3.0)
(hood,3.0)
(Linux,3.0)
(Streaming,1.5)
(supports,1.5)
(PyPI,3.0)
((2,3.0)
(vulnerable,1.5)
(RDD,3.0)
(Dataset,3.0)
(package,3.0)
(this,3.0)
(under,3.0)
(Python,1.0)
(provides,1.0)
(API,1.5)
(higher-level,1.5)
(introduction,3.0)
(Apache,1.5)
(will,1.5)
(Java,1.0)
(2,1.5)
(data,1.5)
(as,3.0)
(YARN,3.0)
(installed,3.0)
(pointing,3.0)
(optimizations,3.0)
(get,3.0)
(cluster,1.5)
(tutorial,3.0)
(graph,1.5)
(easy,3.0)
(execution,1.5)
(MLlib,1.5)
(We,3.0)
(you’d,3.0)
(supported,3.0)
(downloading,1.5)
(shell,3.0)
(handful,3.0)
(1+,3.0)
(Users,3.0)
(engine,1.5)
(version,1.5)
(11,3.0)
(set,1.5)
(performance,3.0)
(rich,1.5)
(systems,3.0)
(replaced,3.0)
(Spark,1.0)
(project,3.0)
(Overview,1.5)
(APIs,1.5)
(Mac,3.0)
(or,1.5)
(popular,3.0)
(Support,3.0)
(richer,3.0)
(downloads,3.0)
(OFF,1.5)
(future,3.0)
(detailed,3.0)
(GraphX,1.5)
(removed,3.0)
(4,3.0)
(installation,3.0)
(Please,1.5)
(is,1.0)
(guide,3.0)
(recommend,3.0)
(R,1.5)
(general,1.5)
(JAVA_HOME,3.0)
(fast,1.5)
(include,3.0)
(need,3.0)
(one,3.0)
(attack,1.5)
(how,3.0)
(uses,3.0)
(compatible,3.0)
(information,3.0)
(we,3.0)
(interactive,3.0)
(—,3.0)
(using,1.5)
(Note,1.5)
(7+/3,3.0)
(java,3.0)
(pre-packaged,3.0)
(Scala,1.0)
(any,1.5)
(computing,1.5)
(variable,3.0)
(users,3.0)
(from,1.5)
(has,3.0)
(won’t,3.0)
(through,3.0)
(at,3.0)
(more,3.0)
(3,3.0)
(versions,3.0)
(of,1.0)
(tools,1.5)
(8+,3.0)
(by,1.0)
(mean,1.5)
(RDDs,3.0)
((e,3.0)
(It,1.5)
(for,1.0)
(To,3.0)
(were,3.0)
(both,3.0)
(an,1.0)
(12,3.0)
(which,3.0)
(machine,1.5)
(libraries,3.0)
(introduce,3.0)
(environment,3.0)
((in,3.0)
(programming,3.0)
(See,3.0)
(use,1.5)
(default,1.5)
(the,1.5)
(write,3.0)
(highly,3.0)
(release,3.0)
(Resilient,3.0)
(interface,3.0)
(strongly-typed,3.0)
(about,3.0)
(run,3.0)
(general-purpose,1.5)
(5,3.0)
(Distributed,3.0)
(on,3.0)
(You,3.0)
(source,3.0)
(Scala),3.0)
(show,3.0)
(then,3.0)
(before,1.0)
(including,1.5)
(to,1.0)
(in,1.0)
(client,3.0)
(see,1.5)
(HDFS,1.5)
(graphs,1.5)
(Hadoop’s,3.0)
(also,1.5)
(“Hadoop,3.0)
(binary,3.0)
(x),3.0)
(free”,3.0)
(Maven,3.0)
(coordinates,3.0)
(Windows,3.0)
(deprecated,3.0)
(install,3.0)
((RDD),3.0)
(4+,3.0)
(page,3.0)
(OS),3.0)
(Hadoop,1.5)
TF-IDF:
(you,0.0115)
(that,0.0115)
(a,0.0230)
(high-level,0.0172)
(processing,0.0345)
(could,0.0172)
(optimized,0.0172)
(structured,0.0172)
(and,0.0690)
(system,0.0172)
(are,0.0115)
(learning,0.0172)
(Security,0.0517)
(SQL,0.0230)
(This,0.0115)
(Streaming,0.0172)
(supports,0.0345)
(vulnerable,0.0172)
(Spark,0.0805)
(Overview,0.0172)
(APIs,0.0172)
(OFF,0.0172)
(of,0.0115)
(tools,0.0172)
(by,0.0230)
(mean,0.0172)
(It,0.0345)
(for,0.0345)
(an,0.0115)
(machine,0.0172)
(default,0.0345)
(Python,0.0115)
(provides,0.0115)
(higher-level,0.0172)
(Apache,0.0172)
(GraphX,0.0172)
(Please,0.0172)
(is,0.0230)
(R,0.0172)
(general,0.0172)
(fast,0.0172)
(attack,0.0172)
(Java,0.0115)
(Scala,0.0115)
(computing,0.0172)
(data,0.0172)
(cluster,0.0172)
(graph,0.0172)
(execution,0.0172)
(MLlib,0.0172)
(downloading,0.0172)
(engine,0.0172)
(set,0.0172)
(rich,0.0172)
(general-purpose,0.0172)
(before,0.0115)
(including,0.0172)
(to,0.0115)
(in,0.0230)
(see,0.0172)
(graphs,0.0172)
(also,0.0172)

Process finished with exit code 0