text-processing

count number of distinct words

我们两清 posted on 2020-01-02 18:35:21
Question: I am trying to count the number of distinct words in a text using Java. A word can be a unigram, bigram, or trigram noun. I have already extracted all three kinds with the Stanford POS tagger, but I am not able to find the words whose frequency is greater than or equal to one, two, three, four, and five, along with their counts.
Answer 1: I might not be understanding correctly, but if all you need to do is count the number of distinct words in a given text depending on where/how you are getting the words
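
The counting step described in the question can be sketched as follows. This is a minimal Python illustration rather than the Java the asker is using, and it assumes the noun n-grams have already been extracted into a flat list; the function name `threshold_counts` and the sample terms are hypothetical.

```python
from collections import Counter

def threshold_counts(terms, thresholds=(1, 2, 3, 4, 5)):
    """Return (number of distinct terms, {k: distinct terms with frequency >= k})."""
    freq = Counter(terms)
    at_least = {k: sum(1 for n in freq.values() if n >= k) for k in thresholds}
    return len(freq), at_least

# Hypothetical sample: unigram and bigram nouns as a tagger might emit them.
terms = ["dog", "dog", "dog", "cat", "cat", "guide dog", "guide dog", "bird"]
distinct, at_least = threshold_counts(terms)
print(distinct)   # 4
print(at_least)   # {1: 4, 2: 3, 3: 1, 4: 0, 5: 0}
```

The same idea maps onto a Java `HashMap<String, Integer>` of term frequencies followed by one pass over its values per threshold.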

Bash: any command to replace strings in text files?

冷暖自知 posted on 2020-01-02 07:10:25
Question: I have a hierarchy of directories containing many text files. I would like to find every occurrence of a particular text string in those files and replace it with another string. For example, I may want to replace every occurrence of the string "Coke" with "Pepsi". Does anyone know how to do this? I am wondering whether there is some Bash command that can do this without having to load all these files into an editor or write a more complex script. I found this
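
For comparison with a shell one-liner, the same recursive replacement can be sketched in Python. This is not the Bash command the question asks for; it assumes UTF-8 text files, and the function name `replace_in_tree` and the default `*.txt` glob are illustrative choices, not part of the original question.

```python
from pathlib import Path

def replace_in_tree(root, old, new, pattern="*.txt"):
    """Replace every occurrence of `old` with `new` in files under `root`.

    Only files whose names match `pattern` are touched. Returns the list of
    files that were actually rewritten.
    """
    changed = []
    for path in sorted(Path(root).rglob(pattern)):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8")
        if old in text:
            path.write_text(text.replace(old, new), encoding="utf-8")
            changed.append(path)
    return changed
```

A call such as `replace_in_tree("some_dir", "Coke", "Pepsi")` rewrites files in place, so it is worth trying on a copy of the directory first.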

Split text on paragraphs where paragraph delimiters are non-standard

梦想的初衷 posted on 2020-01-01 05:34:11
Question: If I have text with standard paragraph formatting (a blank line followed by an indent), such as Text 1, it's easy enough to extract the paragraphs using text.split("\n\n"). Text 1: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus sit amet sapien velit, ac sodales ante. Integer mattis eros non turpis interdum et auctor enim consectetur, etc. Praesent molestie suscipit bibendum. Donec justo purus, venenatis eget convallis sed, feugiat vitae velit, etc. But what if I have text with
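
The excerpt is cut off before it describes the non-standard delimiter, so the following is only a sketch of one common variation: "blank" lines that contain stray spaces or tabs, which `text.split("\n\n")` does not treat as paragraph breaks. The function name `split_paragraphs` and the sample strings are hypothetical.

```python
import re

def split_paragraphs(text):
    """Split on runs of blank lines, where a blank line may hold spaces or tabs."""
    parts = re.split(r"\n\s*\n", text.strip())
    # Drop the leading indent of each paragraph and any empty pieces.
    return [p.strip() for p in parts if p.strip()]

sample = "Lorem ipsum dolor sit amet.\n \n    Praesent molestie suscipit bibendum."
print(sample.split("\n\n"))      # one piece: the separator line contains a space
print(split_paragraphs(sample))  # two paragraphs
```

Other delimiters (for example indentation alone, with no blank line) would need a different pattern.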

Using SQL to determine word count stats of a text field

妖精的绣舞 posted on 2019-12-27 11:05:04
Question: I've recently been working on some database search functionality and wanted to get some information such as the average number of words per document (e.g. a text field in the database). The only approach I have found so far (without processing in a language of choice outside the DB) is:
SELECT AVG(LENGTH(content) - LENGTH(REPLACE(content, ' ', '')) + 1) FROM documents
This seems to work* but do you have other suggestions? I'm currently using MySQL 4 (and hope to move to version 5 for this app soon), but am also
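
The query's expression counts spaces and adds one. A small Python sketch makes its behaviour on edge cases easier to see; the function names and sample documents here are hypothetical, and the comparison with whitespace splitting is an illustration, not a suggestion taken from the original thread.

```python
def sql_style_word_count(content):
    """Mirror LENGTH(content) - LENGTH(REPLACE(content, ' ', '')) + 1."""
    return len(content) - len(content.replace(" ", "")) + 1

def split_word_count(content):
    """Count whitespace-separated tokens, ignoring repeated or edge spaces."""
    return len(content.split())

docs = ["one two three", "one  two", ""]
for d in docs:
    print(repr(d), sql_style_word_count(d), split_word_count(d))
# 'one two three' 3 3
# 'one  two' 3 2
# '' 1 0
```

The two counts agree on single-spaced text and diverge on doubled spaces and empty fields, which is the kind of case where the SQL average would drift upward.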

if IP address exist in URL return something else return something

╄→尐↘猪︶ㄣ posted on 2019-12-25 12:08:36
Question: How do I check whether an IP address exists in a URL using Matlab? Is there any function that can be used to check for an IP address?
data = ['http://95.154.196.187/broser/6716804bc5a91f707a34479012dad47c/',
        'http://95.154.196.187/broser/',
        'http://paypal.com.cgi-bin-websc5.b4d80a13c0a2116480.ee0r-cmd-login-submit-dispatch-']
def IP_exist(data):
    # Attempt from the question: flags any URL that contains a digit,
    # not only URLs whose host is an IP address.
    for b in data:
        containsdigit = any(a.isdigit() for a in b)
        if containsdigit:
            print("1")
        else:
            print("0")
Answer 1: With regexp, you can either use 'tokens'
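
Since the asker's own attempt is written in Python, here is a hedged Python sketch of the check. It does not use the `regexp` approach from the answer; instead it parses the host with the standard library and validates it with `ipaddress`. The function name `host_is_ip` is hypothetical.

```python
import ipaddress
from urllib.parse import urlsplit

def host_is_ip(url):
    """Return True when the host part of `url` is a literal IPv4 or IPv6 address."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return False
    return True

data = [
    "http://95.154.196.187/broser/6716804bc5a91f707a34479012dad47c/",
    "http://95.154.196.187/broser/",
    "http://paypal.com.cgi-bin-websc5.b4d80a13c0a2116480.ee0r-cmd-login-submit-dispatch-",
]
flags = [1 if host_is_ip(u) else 0 for u in data]
print(flags)  # [1, 1, 0]
```

Checking only the host avoids false positives from digits elsewhere in the URL, such as the hash-like path segment in the first entry.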

How to compare number of lines of two files using Awk

≯℡__Kan透↙ posted on 2019-12-25 04:44:13
Question: I am new to awk and need to compare the number of lines in two files. The script should return true if lines(f1) == (lines(f2)+1), otherwise false. How can I do that? Best regards
Answer 1: If it has to be awk: awk 'NR==FNR{x++} END{ if(x!=FNR){exit 1} }' file1 file2 The variable x is incremented for each line of file1, so it ends up holding the number of lines in file1, while FNR holds the number of lines in file2. At the end, the two are compared and the script exits with status 0 or 1. See an example: user@host:~$ awk 'NR==FNR{x++} END{ if(x!=FNR
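
For readers who want the condition exactly as stated in the question, lines(f1) == lines(f2) + 1, a Python sketch of the same comparison follows. It is not an awk solution; the function names are hypothetical, and a final line without a trailing newline is counted as a line.

```python
def count_lines(path):
    """Count lines by iterating over the file in binary mode."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)

def one_more_line(path1, path2):
    """Return True when the first file has exactly one more line than the second."""
    return count_lines(path1) == count_lines(path2) + 1
```

In a shell script this could be wrapped so that the process exits with status 0 or 1, mirroring the awk answer's convention.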

How can I get words after and before a specific token?

空扰寡人 posted on 2019-12-25 03:53:38
Question: I am currently working on a project that simply creates basic corpus databases and tokenizes texts, but I am stuck on one point. Assume that we have the following:
import os, re
texts = []
for i in os.listdir(somedir):  # somedir contains text files which contain very large plain texts.
    with open(os.path.join(somedir, i), 'r') as f:  # join with somedir: listdir returns bare file names
        texts.append(f.read())
Now I want to find the word before and after a token.
myToken = 'blue'
found = []
for i in texts:
    fnd = re.findall('[a-zA-Z0-9]+ %s [a-zA-Z0-9]+|\. %s [a-zA
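
One way to approach the "word before and after" lookup without a single large regular expression is to tokenize first and then index into the word list. The sketch below is an assumption about what the asker wants, not a completion of the truncated pattern; `neighbours` and the sample text are hypothetical, and sentence boundaries are ignored.

```python
import re

def neighbours(text, token):
    """Return (previous word, next word) pairs for each occurrence of `token`.

    A missing neighbour at the start or end of the text is reported as None.
    """
    words = re.findall(r"[A-Za-z0-9]+", text)
    pairs = []
    for i, word in enumerate(words):
        if word == token:
            before = words[i - 1] if i > 0 else None
            after = words[i + 1] if i + 1 < len(words) else None
            pairs.append((before, after))
    return pairs

sample = "The sky is blue today. blue whales swim. Deep blue"
print(neighbours(sample, "blue"))
# [('is', 'today'), ('today', 'whales'), ('Deep', None)]
```

For very large texts, the word list could be produced lazily with `re.finditer` and a sliding window of three tokens instead of materializing every word.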

Multiple regex replacements based on lists in multiple files

烈酒焚心 posted on 2019-12-25 02:46:48
Question: I have a folder containing multiple text files that I need to process and format using multiple replacement lists that look like this:
old string1~new string1
old string2~new string2
etc~blah
I run each replacement pair from the replacement lists on each line of those text files. I currently have a set of Python scripts that performs this operation. What I wonder is whether switching to sed or awk would make the code simpler and more maintainable. Would it be a better solution, or should I
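
The asker's existing Python scripts are not shown, so the following is only a sketch of how the `old~new` list format could be parsed and applied. It treats each pair as a literal string rather than a regular expression; `parse_pairs` and `apply_pairs` are hypothetical names.

```python
def parse_pairs(list_text, sep="~"):
    """Parse lines of the form 'old~new' into (old, new) tuples."""
    pairs = []
    for line in list_text.splitlines():
        if sep in line:
            old, new = line.split(sep, 1)
            pairs.append((old, new))
    return pairs

def apply_pairs(line, pairs):
    """Apply every replacement pair to `line`, in list order."""
    for old, new in pairs:
        line = line.replace(old, new)
    return line

pairs = parse_pairs("old string1~new string1\nold string2~new string2\netc~blah")
print(apply_pairs("old string1, old string2, etc", pairs))
# new string1, new string2, blah
```

If the left-hand sides are meant to be patterns, as the title suggests, `line.replace(old, new)` could be swapped for `re.sub(old, new, line)`. Either way, list order matters when one replacement's output can match a later pair.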

Insert specific lines from file before first occurrence of pattern using Sed

前提是你 posted on 2019-12-24 11:41:38
Question: I want to insert a range of lines from a file, say something like 210,221r, before the first occurrence of a pattern in a bunch of other files. As I am clearly not a GNU sed expert, I cannot figure out how to do this. I tried sed '0,/pattern/{210,221r file }' bunch_of_files but apparently file is read from line 210 to EOF.
Answer 1: Try this: sed -r 's/(FIND_ME)/PUT_BEFORE\1/' test.text
-r enables extended regular expressions
the string you are looking for ("FIND_ME") is inside parentheses, which
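
The intended behaviour (take lines 210 to 221 of one file and place them before the first line matching a pattern in another) can be sketched in Python on lists of lines. This is not a sed solution; the function name `insert_before_first` and its parameters are hypothetical, with the line range taken from the question.

```python
import re

def insert_before_first(target_lines, source_lines, pattern, first=210, last=221):
    """Insert source lines first..last (1-based, inclusive) before the first match.

    Returns a new list; if no line matches `pattern`, an unchanged copy is returned.
    """
    block = source_lines[first - 1:last]
    for i, line in enumerate(target_lines):
        if re.search(pattern, line):
            return target_lines[:i] + block + target_lines[i:]
    return list(target_lines)
```

Reading each target file with `readlines()`, calling this function, and writing the result back would apply it to the "bunch of other files" from the question.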

Output of ZipArchive() in tree format

我只是一个虾纸丫 posted on 2019-12-24 08:49:49
Question: Using PHP, I have a list of files that I get via new ZipArchive(); that is, the files are inside a zip archive. The file list is:
docs/
docs/INSTALL.html
docs/auth_api.html
docs/corners_right.gif
docs/corners_right.png
docs/COPYING
docs/corners_left.png
docs/bg_header.gif
docs/CHANGELOG.html
docs/coding-guidelines.html
docs/hook_system.html
docs/FAQ.html
docs/site_logo.gif
docs/AUTHORS
docs/README.html
docs/corners_left.gif
docs/stylesheet.css
docs/New Folder/
docs/New Folder/New Text Document.txt
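
Turning a flat list of archive paths into a tree comes down to splitting each path on `/` and nesting the parts. The sketch below shows the idea in Python rather than PHP, using a few of the paths from the question; `build_tree` and `render` are hypothetical names, and an empty directory is not distinguished from a file.

```python
def build_tree(paths):
    """Nest path components into dictionaries: {"docs": {"AUTHORS": {}, ...}}."""
    tree = {}
    for path in paths:
        node = tree
        for part in path.strip("/").split("/"):
            node = node.setdefault(part, {})
    return tree

def render(tree, indent=0):
    """Return the tree as a list of lines, two spaces of indent per level."""
    lines = []
    for name in sorted(tree):
        lines.append("  " * indent + name)
        lines.extend(render(tree[name], indent + 1))
    return lines

paths = [
    "docs/",
    "docs/INSTALL.html",
    "docs/AUTHORS",
    "docs/New Folder/",
    "docs/New Folder/New Text Document.txt",
]
print("\n".join(render(build_tree(paths))))
```

The same nesting works in PHP with associative arrays built from `explode('/', $name)` on each entry name.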