text-processing | 易学教程

How to remove OCR artifacts from text?

Question: OCR-generated texts sometimes come with artifacts, such as this one:

Diese grundsätzliche V e r b o r g e n h e i t Gottes, die sich n u r dem N a c h f o l g e r ö f f n e t , ist m i t d e m Messiasgeheimnis gemeint

While it is not unusual for spacing between letters to be used as emphasis (probably due to early printing press limitations), it is unfavorable for retrieval tasks. How can one turn the above text into a more canonical form, like:

Diese grundsätzliche Verborgenheit
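One possible heuristic, sketched in Python: treat any run of three or more single letters separated by single spaces as one letter-spaced word and join it. The function name `collapse_letter_spacing` and the three-letter threshold are assumptions made for illustration. The sketch cannot tell where one spaced-out word ends and the next begins, so adjacent emphasized words such as `N a c h f o l g e r ö f f n e t` come out fused; splitting them again would need a dictionary lookup.

```python
import re

# A run of at least three single letters, each separated by exactly one space.
# [^\W\d_] matches any Unicode letter (word character minus digits/underscore).
_SPACED_RUN = re.compile(r"\b(?:[^\W\d_] ){2,}[^\W\d_]\b")

def collapse_letter_spacing(text: str) -> str:
    """Join letter-spaced runs such as 'n u r' into 'nur'."""
    return _SPACED_RUN.sub(lambda m: m.group(0).replace(" ", ""), text)

sample = "Diese grundsätzliche V e r b o r g e n h e i t Gottes, die sich n u r dem"
print(collapse_letter_spacing(sample))
```

Requiring at least three letters keeps ordinary neighbours such as two one-letter tokens from being merged by accident, at the cost of missing two-letter emphasized words.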

Text Summarization Evaluation - BLEU vs ROUGE

Question: With the results of two different summarization systems (sys1 and sys2) and the same reference summaries, I evaluated them with both BLEU and ROUGE. The problem is: all ROUGE scores of sys1 were higher than those of sys2 (ROUGE-1, ROUGE-2, ROUGE-3, ROUGE-4, ROUGE-L, ROUGE-SU4, ...), but the BLEU score of sys1 was considerably lower than the BLEU score of sys2. So my question is: both ROUGE and BLEU are based on n-grams to measure the similarity between the system summaries and the human summaries. So why
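A relevant difference is that BLEU is built on n-gram precision while ROUGE-N is built on n-gram recall, so the two can rank systems in opposite order. The Python sketch below computes only clipped unigram precision and unigram recall; it is a simplification, not either metric in full (real BLEU combines several n-gram orders and applies a brevity penalty). The function name and the example sentences are made up for illustration.

```python
from collections import Counter

def unigram_precision_recall(candidate: str, reference: str):
    """Return (clipped unigram precision, unigram recall) for one pair."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    # Counter intersection clips each word's count to the smaller of the two.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values())  # BLEU-style: divide by candidate length
    recall = overlap / sum(ref.values())      # ROUGE-style: divide by reference length
    return precision, recall

reference = "the cat sat on the mat"
long_summary = "the cat sat on the mat in the house today"  # covers everything, adds extras
short_summary = "the cat sat"                                # nothing wrong, but incomplete

print(unigram_precision_recall(long_summary, reference))
print(unigram_precision_recall(short_summary, reference))
```

In this toy case the long summary has the higher recall and the short one the higher precision, which is one way a system can lead on ROUGE and trail on BLEU.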

Add a prefix string to beginning of each line

Question: I have a file as below:

line1
line2
line3

And I want to get:

prefixline1
prefixline2
prefixline3

I could write a Ruby script, but it is better if I do not need to. prefix will contain /. It is a path, /opt/workdir/ for example.

Answer 1:

    # If you want to edit the file in-place
    sed -i -e 's/^/prefix/' file

    # If you want to create a new file
    sed -e 's/^/prefix/' file > file.new

If prefix contains /, you can use any other character not in prefix as the delimiter, or escape the /, so the sed command becomes 's#^#
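For comparison, the same per-line prefixing can be sketched in Python, where a `/` in the prefix needs no escaping because no regex delimiter is involved. The helper name `add_prefix` is an assumption for illustration, and the sketch works on a string rather than editing a file in place.

```python
def add_prefix(text: str, prefix: str) -> str:
    """Prepend prefix to every line, keeping the original line endings."""
    return "".join(prefix + line for line in text.splitlines(keepends=True))

print(add_prefix("line1\nline2\nline3\n", "/opt/workdir/"), end="")
```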

Apache Spark - Scala - HashMap (k, HashMap[String, Double](v1, v2,..)) to ((k,v1),(k,v2),…)

Question: I got:

val vector: RDD[(String, HashMap[String,Double])] = [("a", {("x",1.0),("y", 2.0),...}]

I want to get:

RDD[String,(String,Double)] = [("a",("x",1.0)), ("a", ("y", 2.0)), ...]

How can it be done with flatMap? Better solutions are welcome!

Answer 1: Try:

vector.flatMapValues(_.toSeq)

Source: https://stackoverflow.com/questions/38507249/apache-spark-scala-hashmap-k-hashmapstring-doublev1-v2-to-k-v1
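To make the effect of `flatMapValues` concrete, here is a plain-Python sketch of the same transformation on an ordinary list. It does not use Spark; `flat_map_values` is a hypothetical helper that only mimics the semantics (keep the key, emit one pair per element produced from the value).

```python
def flat_map_values(pairs, f):
    """For each (key, value), emit (key, item) for every item in f(value)."""
    return [(key, item) for key, value in pairs for item in f(value)]

vector = [("a", {"x": 1.0, "y": 2.0})]

# Turning each inner map into a sequence of (name, number) pairs plays the
# role of `_.toSeq` in the Scala answer.
result = flat_map_values(vector, lambda m: list(m.items()))
print(result)
```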

Perl: With Text::CSV can I write out a hash ref?

Question: I have a Perl script that reads in a CSV file, changes the column names of the original, adds new ones (the output CSV column names are stored in the array header_line), adds new field values for each row read, and then writes out a new CSV file. Thanks to a comment by @harleypig on my last question, I'd like to use: $csv_i->column_names( @header_line); $row = $csv_i->getline_hr($fh_i) because this lets me easily access row fields using meaningful names rather than magic numbers. For example:

sed/awk/perl: find a regex, copy 5 columns of this line and paste to it at the beginning of the next lines

Question: I have the following lines: 057 - - No adod3 stptazlqn 10 753 tlm 10 027 stp 10 021 12 - - No azad1 bbcz 30 12 03085 - - No azad1 azad1222 xxaz 1 12 azzst 1 12 hss 2 12 What I need to do is: find lines starting with a digit [0-9], copy the first 5 space-separated columns, and paste them at the beginning of the following lines that do not start with a digit. Expected output: 057 - - No adod3 stptazlqn 10 753 057 - - No adod3 tlm 10 027 057 - - No adod3 stp 10 021 12 - - No azad1 12 - - No azad1 bbcz 30 12 03085 - - No azad1 azad1222
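The three steps can be sketched in Python as a single pass that remembers the most recent prefix. The function name `carry_prefix` is an assumption, and so are the line breaks in the sample input: the excerpt shows the data flattened onto one line, so the split below is inferred from the expected output and covers only its unambiguous first part.

```python
def carry_prefix(lines, n_cols=5):
    """Prepend the first n_cols columns of the last digit-led line
    to each following line that does not start with a digit."""
    prefix = ""
    out = []
    for line in lines:
        if line[:1].isdigit():
            # Remember the leading columns of this line for later lines.
            prefix = " ".join(line.split()[:n_cols])
            out.append(line)
        else:
            out.append(f"{prefix} {line}" if prefix else line)
    return out

sample = [
    "057 - - No adod3 stptazlqn 10 753",
    "tlm 10 027",
    "stp 10 021",
    "12 - - No azad1",
    "bbcz 30 12",
]
for row in carry_prefix(sample):
    print(row)
```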

Using `awk` to print number of lines in file in the BEGIN section

Question: I am trying to write an awk script that, before anything else is done, tells the user how many lines are in the file. I know how to do this in the END section but am unable to do so in the BEGIN section. I have searched SE and Google but have only found half a dozen ways to do this in the END section or as part of a bash script, not how to do it before any processing has taken place at all. I was hoping for something like the following: #!/usr/bin/awk -f BEGIN{ print "There are a total of " **TOTAL LINES
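The underlying constraint is that the total is only known after the input has been read once, so reporting it first implies an extra pass. The Python sketch below illustrates that two-pass idea on a seekable stream; it is not awk and not an answer from the original thread, and the name `count_then_process` and the output format are assumptions.

```python
import io

def count_then_process(stream):
    """Count all lines first, rewind, then process with the total known."""
    total = sum(1 for _ in stream)  # first pass: count only
    stream.seek(0)                  # rewind for the real processing pass
    report = [f"There are a total of {total} lines"]
    for number, line in enumerate(stream, start=1):
        text = line.rstrip("\n")
        report.append(f"{number}/{total}: {text}")
    return report

for row in count_then_process(io.StringIO("a\nb\nc\n")):
    print(row)
```

A pipe or other non-seekable input cannot be rewound, so this approach assumes a regular file or an in-memory buffer.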

Matlab tex file specify column delimiters

Question: I am creating a tex file in Matlab. The end goal is to create a PDF using LaTeX. I have been using the following website to check that my LaTeX is correct: latex generator. Everything is fine apart from when I have a number that contains commas, for example 5,236,012. The issue comes when I copy the data from the tex file. The column delimiter is set to commas; how can I change this to semicolons? Answer 1: Use strrep - %%// input_filepath and output_filepath are the filepaths of the %%// input and output

count number of distinct words

Question: I am trying to count the number of distinct words in a text, using Java. A word can be a unigram, bigram or trigram noun. These three have already been extracted with the Stanford POS tagger, but I am not able to calculate how many words have a frequency greater than or equal to one, two, three, four and five. Answer 1: I might not be understanding correctly, but if all you need to do is count the number of distinct words in a given text depending on where/how you are getting the words
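The counting step itself can be sketched as a frequency table followed by one tally per threshold. The sketch is in Python rather than the asker's Java, assumes the unigram, bigram and trigram nouns have already been extracted into a list of strings, and uses made-up sample terms; `distinct_counts_by_threshold` is a hypothetical name.

```python
from collections import Counter

def distinct_counts_by_threshold(terms, thresholds=(1, 2, 3, 4, 5)):
    """Map each threshold t to the number of distinct terms seen at least t times."""
    freq = Counter(terms)
    return {t: sum(1 for count in freq.values() if count >= t) for t in thresholds}

# Each element is one extracted term; multi-word terms stay a single string.
terms = ["neural network", "network", "neural network", "data", "network", "network"]
print(distinct_counts_by_threshold(terms))
```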