text-processing

How to remove OCR artifacts from text?

ⅰ亾dé卋堺 submitted on 2020-01-13 11:29:10
Question: OCR-generated texts sometimes come with artifacts, such as this one:

    Diese grundsätzliche V e r b o r g e n h e i t Gottes, die sich n u r dem N a c h f o l g e r ö f f n e t , ist m i t d e m Messiasgeheimnis gemeint

While it is not unusual for the spacing between letters to be used as emphasis (probably due to early printing press limitations), it is unfavorable for retrieval tasks. How can one turn the above text into a more canonical form, like: Diese grundsätzliche Verborgenheit
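One possible starting point, sketched in Python under a simplifying assumption: a letter-spaced word is a run of three or more single letters, each separated by exactly one space. The names `SPACED_RUN` and `collapse_letter_spacing` are made up for this illustration. Note the limitation: two adjacent spaced words (such as "m i t d e m" in the sample) are fused into one token, so a dictionary-based word segmenter would still be needed to split them again.

```python
import re

# A run of three or more single letters separated by single spaces.
# [^\W\d_] matches any Unicode letter, including ä, ö, ü and ß.
SPACED_RUN = re.compile(r"(?<!\w)(?:[^\W\d_] ){2,}[^\W\d_](?!\w)")


def collapse_letter_spacing(text: str) -> str:
    """Join letter-spaced runs such as 'n u r' into 'nur'."""
    return SPACED_RUN.sub(lambda m: m.group(0).replace(" ", ""), text)


sample = "Diese grundsätzliche V e r b o r g e n h e i t Gottes"
print(collapse_letter_spacing(sample))
```

Ordinary words are left alone because the pattern only matches letters that stand by themselves between spaces.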

Text Summarization Evaluation - BLEU vs ROUGE

穿精又带淫゛_ submitted on 2020-01-11 16:37:14
Question: With the results of two different summary systems (sys1 and sys2) and the same reference summaries, I evaluated them with both BLEU and ROUGE. The problem is: all ROUGE scores of sys1 were higher than those of sys2 (ROUGE-1, ROUGE-2, ROUGE-3, ROUGE-4, ROUGE-L, ROUGE-SU4, ...), but the BLEU score of sys1 was lower than the BLEU score of sys2 (by quite a lot). So my question is: both ROUGE and BLEU are based on n-grams to measure the similarity between the system summaries and the human summaries. So why
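One difference worth keeping in mind here is that BLEU is built on n-gram precision (combined with a brevity penalty), while ROUGE-N is built on n-gram recall. The sketch below is not a BLEU or ROUGE implementation; it only computes clipped n-gram precision and recall for a single candidate and reference, to show that the two can rank systems differently. The function names and the toy sentences are invented for the illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap(candidate: str, reference: str, n: int = 1):
    """Return (precision, recall) of clipped n-gram matches."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    matched = sum((cand & ref).values())  # clipped overlap
    precision = matched / max(sum(cand.values()), 1)
    recall = matched / max(sum(ref.values()), 1)
    return precision, recall


reference = "the cat sat on the mat"
sys1 = "the cat sat on the mat today in the sun"  # long: covers everything
sys2 = "the cat sat"                              # short: nothing wrong, little covered

print(overlap(sys1, reference))  # high recall, lower precision
print(overlap(sys2, reference))  # high precision, lower recall
```

A long summary tends to gain on the recall side and lose on the precision side, so it is possible for one system to lead on recall-oriented scores and trail on precision-oriented ones.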

Add a prefix string to beginning of each line

帅比萌擦擦* submitted on 2020-01-08 11:23:21
Question: I have a file as below:

    line1
    line2
    line3

And I want to get:

    prefixline1
    prefixline2
    prefixline3

I could write a Ruby script, but I would rather not have to. The prefix will contain /. It is a path, /opt/workdir/ for example.

Answer 1:

    # If you want to edit the file in-place
    sed -i -e 's/^/prefix/' file
    # If you want to create a new file
    sed -e 's/^/prefix/' file > file.new

If prefix contains /, you can use any other character not in prefix as the delimiter, or escape the /, so the sed command becomes 's#^#
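For comparison, the same transformation can be sketched in Python, where the prefix is an ordinary string and a / inside it needs no escaping. The function name `add_prefix` is made up for this example, and the sketch works on a string rather than editing a file in place.

```python
def add_prefix(text: str, prefix: str) -> str:
    """Prepend prefix to every line of text, keeping the line endings."""
    return "".join(prefix + line for line in text.splitlines(keepends=True))


content = "line1\nline2\nline3\n"
print(add_prefix(content, "/opt/workdir/"), end="")
```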

Apache Spark - Scala - HashMap (k, HashMap[String, Double](v1, v2,..)) to ((k,v1),(k,v2),…)

自古美人都是妖i submitted on 2020-01-06 16:18:51
Question: I have:

    val vector: RDD[(String, HashMap[String,Double])] = [("a", {("x",1.0),("y", 2.0),...})]

I want to get:

    RDD[(String,(String,Double))] = [("a",("x",1.0)), ("a", ("y", 2.0)), ...]

How can this be done with flatMap? Better solutions are welcome!

Answer 1: Try:

    vector.flatMapValues(_.toSeq)

Source: https://stackoverflow.com/questions/38507249/apache-spark-scala-hashmap-k-hashmapstring-doublev1-v2-to-k-v1
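To make the effect of flatMapValues concrete, here is a plain-Python illustration of the same reshaping on an ordinary list. It does not use Spark; `flat_map_values` is a hypothetical helper written only to mirror the semantics, with dicts standing in for the HashMap values.

```python
def flat_map_values(pairs):
    """Expand (key, mapping) pairs into (key, (inner_key, value)) pairs."""
    return [(key, item) for key, mapping in pairs for item in mapping.items()]


vector = [("a", {"x": 1.0, "y": 2.0}), ("b", {"z": 3.0})]
print(flat_map_values(vector))
```

Each outer key is repeated once per entry of its inner map, which is what the requested output shape describes.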

Perl: With Text::CSV can I write out a hash ref?

自闭症网瘾萝莉.ら submitted on 2020-01-05 07:12:55
Question: I have a Perl script that reads in a CSV file, changes the column names of the original, adds new ones (the output CSV column names are stored in the array header_line), adds new field values for each row read, and then writes out a new CSV file. Thanks to a comment by @harleypig on my last question, I'd like to use: $csv_i->column_names( @header_line); $row = $csv_i->getline_hr($fh_i) because this lets me easily access row fields using meaningful names rather than magic numbers. For example:

sed/awk/perl: find a regex, copy 5 columns of this line and paste to it at the beginning of the next lines

↘锁芯ラ submitted on 2020-01-05 06:33:26
Question: I have the following lines: 057 - - No adod3 stptazlqn 10 753 tlm 10 027 stp 10 021 12 - - No azad1 bbcz 30 12 03085 - - No azad1 azad1222 xxaz 1 12 azzst 1 12 hss 2 12 What I need to do is: find lines starting with a number [0-9], copy the first 5 columns separated by a space ' ', and paste them at the beginning of the following lines that do not start with a number. 057 - - No adod3 stptazlqn 10 753 057 - - No adod3 tlm 10 027 057 - - No adod3 stp 10 021 12 - - No azad1 12 - - No azad1 bbcz 30 12 03085 - - No azad1 azad1222
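The three steps in the question can be sketched as follows. This is a Python illustration rather than the sed/awk/perl solution the asker wants; the function name `fill_down` and the sample rows (taken from the start of the data above, with line breaks assumed) are for demonstration only.

```python
def fill_down(lines, n_cols=5):
    """Prefix lines that do not start with a digit with the first
    n_cols columns of the most recent line that does."""
    prefix = ""
    result = []
    for line in lines:
        if line[:1].isdigit():
            # Remember the leading columns of this "header" line.
            prefix = " ".join(line.split()[:n_cols])
            result.append(line)
        else:
            result.append(f"{prefix} {line}" if prefix else line)
    return result


rows = [
    "057 - - No adod3 stptazlqn 10 753",
    "tlm 10 027",
    "stp 10 021",
    "12 - - No azad1",
    "bbcz 30 12",
]
for row in fill_down(rows):
    print(row)
```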

Using `awk` to print number of lines in file in the BEGIN section

一曲冷凌霜 submitted on 2020-01-04 04:31:49
Question: I am trying to write an awk script that, before anything else is done, tells the user how many lines are in the file. I know how to do this in the END section but am unable to do so in the BEGIN section. I have searched SE and Google but have only found half a dozen ways to do this in the END section or as part of a bash script, not how to do it before any processing has taken place at all. I was hoping for something like the following: #!/usr/bin/awk -f BEGIN{ print "There are a total of " **TOTAL LINES

Matlab tex file specify column delimiters

大憨熊 submitted on 2020-01-03 03:31:06
Question: I am creating a tex file in Matlab. The end goal is to create a PDF using LaTeX. I have been using the following website to check that my LaTeX is correct: latex generator. Everything is fine apart from when I have a number that contains commas, for example 5,236,012. The issue arises when I copy the data from the tex file. The column delimiter is set to commas; how can I change this to semicolons?

Answer 1: Use strrep - %%// input_filepath and output_filepath are the filepaths of the %%// input and output

count number of distinct words

旧城冷巷雨未停 submitted on 2020-01-02 18:36:29
Question: I am trying to count the number of distinct words in a text, using Java. A word can be a unigram, bigram or trigram noun. These three have already been found using the Stanford POS tagger, but I am not able to count the words whose frequency is greater than or equal to one, two, three, four and five.

Answer 1: I might not be understanding correctly, but if all you need to do is count the number of distinct words in a given text depending on where/how you are getting the words
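As a rough illustration of the counting step (in Python rather than the Java the asker uses), assuming the unigram, bigram and trigram nouns have already been extracted into a list of strings: the function name `distinct_word_stats` and the sample terms are hypothetical.

```python
from collections import Counter


def distinct_word_stats(terms, thresholds=(1, 2, 3, 4, 5)):
    """Return the number of distinct terms and, for each threshold k,
    how many distinct terms occur at least k times."""
    counts = Counter(terms)
    at_least = {k: sum(1 for c in counts.values() if c >= k) for k in thresholds}
    return len(counts), at_least


terms = ["data", "data mining", "data", "text", "data mining", "data"]
distinct, at_least = distinct_word_stats(terms)
print(distinct, at_least)
```

In Java the same idea maps onto a `HashMap<String, Integer>` of term frequencies followed by one pass over its values per threshold.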