text-parsing

Parsing / Extracting Text from String in Rails?

白昼怎懂夜的黑 posted on 2019-12-04 19:53:45
I have a string in Rails, e.g. "This is a Twitter message. #books War & Peace by Leo Tolstoy. I love this book!", and I want to parse the text and extract only certain phrases, like "War & Peace by Leo Tolstoy". Is this a matter of using Regex and lifting the text between "#books" to "."? What if there's no structure to the message, like: "This is a Twitter message #books War & Peace by Leo Tolstoy I love this book!" or "This is a Twitter message. I love the book War & Peace by Leo Tolstoy #books" How can I reliably pull the phrase "War & Peace by Leo Tolstoy" without knowing the phrase ex
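
The regex idea raised in the question can be sketched as follows. This is a minimal illustration in Python rather than Rails, and the helper name `text_after_hashtag` is hypothetical. It only handles the case where the phrase follows the hashtag, and it shows why the unstructured variants are harder: with no closing period, the match runs to the end of the message.

```python
import re

# Hypothetical helper: return the text that follows a hashtag, up to the next
# sentence-ending period, or up to the end of the string when there is none.
def text_after_hashtag(message, tag="#books"):
    pattern = re.escape(tag) + r"\s+(.*?)(?:\.(?:\s|$)|$)"
    match = re.search(pattern, message)
    return match.group(1).strip() if match else None

structured = ("This is a Twitter message. #books War & Peace by Leo Tolstoy. "
              "I love this book!")
unstructured = ("This is a Twitter message #books War & Peace by Leo Tolstoy "
                "I love this book!")
tag_last = "This is a Twitter message. I love the book War & Peace by Leo Tolstoy #books"

print(text_after_hashtag(structured))    # the phrase alone
print(text_after_hashtag(unstructured))  # the phrase plus the trailing sentence
print(text_after_hashtag(tag_last))      # None: nothing follows the hashtag
```

A pattern like this cannot isolate the title in the second and third messages; that would need extra knowledge, such as a list of known titles.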

Reading parameters from a text file into the workspace

故事扮演 posted on 2019-12-04 18:19:20
I have a file which has the following information:
    % ---------------------- location details --------------------------
    %
    % lat : latitude [minimum = -90, maximum = 90, unit =
    % degrees north]
    % lon : longitude [ minimum = -360, maximum = 360, unit =
    % deg east]
    % z: altitude (above sea level, m)
    %---------------------------------------------------------------
    % location:
    lat = 54.35
    lon = -2.9833
This is a small section of the file. I would like to read some of this information into MATLAB, where the information can then be used to perform some calculations. The part of the file that I would
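
One way to approach this kind of file is to skip the `%` comment lines and match the remaining `name = value` lines. The sketch below is in Python rather than MATLAB. It assumes each parameter sits on its own line, which the excerpt does not confirm, and the function name `read_parameters` is hypothetical.

```python
import re

# Assumed layout: '%' lines are comments; parameters are 'name = number' lines.
SAMPLE = """\
% ---------------------- location details --------------------------
% lat : latitude [minimum = -90, maximum = 90, unit = degrees north]
% lon : longitude [minimum = -360, maximum = 360, unit = deg east]
% location:
lat = 54.35
lon = -2.9833
"""

NUMBER = r"[-+]?\d+(?:\.\d*)?(?:[eE][-+]?\d+)?"
PARAM_LINE = re.compile(r"(\w+)\s*=\s*(" + NUMBER + r")\s*$")

def read_parameters(text):
    """Collect numeric 'name = value' pairs, ignoring comments and blanks."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):
            continue
        match = PARAM_LINE.match(line)
        if match:
            params[match.group(1)] = float(match.group(2))
    return params

print(read_parameters(SAMPLE))
```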

Compute ngrams for each row of text data in R

六眼飞鱼酱① posted on 2019-12-04 17:47:59
I have a data column of the following format:
    Text
    Hello world
    Hello
    How are you today
    I love stackoverflow
    blah blah blahdy
I would like to compute the 3-grams for each row in this dataset, perhaps by using the tau package's textcnt() function. However, when I tried it, it gave me one numeric vector with the n-grams for the entire column. How can I apply this function to each observation in my data separately?
Is this what you're after?
    library("RWeka")
    library("tm")
    TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
    # Using Tyler's method of making the 'Text'
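
The per-row idea is independent of the R packages mentioned: apply the n-gram function to each row separately instead of to the whole column. Here is a minimal sketch in Python (not R). The helper name `word_ngrams` is hypothetical, and the row boundaries of the sample data are an assumption.

```python
def word_ngrams(text, n=3):
    """Return the word n-grams of one row; rows shorter than n give []."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

# Row boundaries below are assumed from the flattened sample in the question.
rows = [
    "Hello world",
    "Hello",
    "How are you today",
    "I love stackoverflow",
    "blah blah blahdy",
]

# One list of trigrams per observation, instead of one vector for the column.
trigrams_per_row = [word_ngrams(row) for row in rows]

for row, grams in zip(rows, trigrams_per_row):
    print(repr(row), "->", grams)
```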

Resume/CV Parsing in PHP [closed]

蹲街弑〆低调 posted on 2019-12-04 15:24:34
Question: It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. Closed 8 years ago. We are developing a requirement-based social media site using LAMP. For that we want to do resume/CV parsing in PHP. We were able to parse the email ID and phone number, but are not sure how to parse the other information like

Can I control the way the CountVectorizer vectorizes the corpus in scikit learn?

安稳与你 posted on 2019-12-04 13:03:02
Question: I am working with a CountVectorizer from scikit-learn, and I'm possibly attempting to do some things that the object was not made for, but I'm not sure. In terms of getting counts for occurrence:
    vocabulary = ['hi', 'bye', 'run away!']
    corpus = ['run away!']
    cv = CountVectorizer(vocabulary=vocabulary)
    X = cv.fit_transform(corpus)
    print X.toarray()
gives:
    [[0 0 0 0]]
What I'm realizing is that the CountVectorizer will break the corpus into what I believe are unigrams:
    vocabulary = ['hi', 'bye'
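
The zero counts can be illustrated without scikit-learn. By default, CountVectorizer lowercases the text and extracts single-word tokens with the pattern `(?u)\b\w\w+\b`, so the document never yields a token equal to 'run away!'. The sketch below only mimics that default tokenization with the standard `re` module; it is not the library's implementation, and the function name `default_style_tokens` is hypothetical.

```python
import re

# Same regular expression as CountVectorizer's default token_pattern.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def default_style_tokens(doc):
    """Lowercase the document and split it into word tokens (unigrams)."""
    return re.findall(DEFAULT_TOKEN_PATTERN, doc.lower())

vocabulary = ['hi', 'bye', 'run away!']
tokens = default_style_tokens('run away!')

# No token contains a space or '!', so the multi-word entry is never counted.
counts = [tokens.count(term) for term in vocabulary]

print(tokens)
print(counts)
```

Matching a phrase such as 'run away' would need at least a wider `ngram_range`, and matching the punctuation would also need a different `token_pattern` or a custom `tokenizer`.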

Java: How read a File line by line by ignoring “\n”

99封情书 posted on 2019-12-04 09:40:07
Question: I'm trying to read a tab-separated text file line by line. The lines are separated by a carriage return plus line feed ("\r\n"), and a line feed ("\n") is allowed within the tab-separated text fields. Since I want to read the file line by line, I want my program to ignore a standalone "\n". Unfortunately, BufferedReader treats both as line separators. How can I modify my code to ignore the standalone "\n"?
    try {
        BufferedReader in = new BufferedReader(new FileReader(flatFile));
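
The underlying technique is to read raw characters and split on "\r\n" yourself instead of relying on a line reader. Below is a minimal sketch of that idea in Python rather than Java. The generator name `read_crlf_records` is hypothetical, and the stream must be opened without newline translation (`newline=""` in Python).

```python
import io

def read_crlf_records(stream, chunk_size=4096):
    # Yield records terminated by CR LF; a lone LF stays inside the record.
    buffer = ""
    for chunk in iter(lambda: stream.read(chunk_size), ""):
        buffer += chunk
        # Everything before the last separator is complete; keep the rest,
        # which also covers a CR LF pair split across two chunks.
        *complete, buffer = buffer.split("\r\n")
        yield from complete
    if buffer:
        yield buffer  # final record without a trailing CR LF

data = "a\tb\nstill field two\r\nc\td\r\n"
records = list(read_crlf_records(io.StringIO(data, newline="")))
print(records)
```

For a real file, `open(path, newline="")` keeps the original line endings so the same generator applies.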

Removing brackets and quotes from print in Python 2.7

天大地大妈咪最大 posted on 2019-12-04 06:50:44
Question: I am trying to remove the brackets from the output of a print statement in Python 2.7. I tried suggestions from various forums, but they didn't work as expected, so I finally decided to ask here myself. Code:
    with open('buttonpress_and_commandstarted','r') as buttonpress_commandstarted:
        for line in buttonpress_commandstarted:
            button_press_command_time = ''
            if os.path.getsize('buttonpress_and_commandstarted') > 0:
                button_press_command_time = line.split()[2]
            else:
                print " > Cannot get time stamp as the

ColdFusion extract values from text file

故事扮演 posted on 2019-12-04 06:28:01
Question: The technical details: I want to extract values from a text file containing parameter names and values. For each line that starts with "request.config." (there are empty lines, lines with comments, etc. which I don't want to extract anything from) I want to extract the parameter name and its value (here my_param_1 and some random string): request.config.my_param_1 = "some random string"; I thought the best way to do this might be using a regex, but how can I do this? I thought there would be something like a regular expression that would
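
A regular expression with two capture groups covers the described case. The sketch below is in Python rather than ColdFusion. It assumes values are always double-quoted strings followed by a semicolon, and the names `CONFIG_LINE` and `extract_config` as well as the sample comment syntax are hypothetical.

```python
import re

# Group 1: parameter name after 'request.config.'; group 2: quoted value.
CONFIG_LINE = re.compile(r'^\s*request\.config\.(\w+)\s*=\s*"([^"]*)"\s*;')

def extract_config(text):
    """Return {name: value} for lines starting with 'request.config.'."""
    values = {}
    for line in text.splitlines():
        match = CONFIG_LINE.match(line)
        if match:  # empty lines and comment lines simply do not match
            values[match.group(1)] = match.group(2)
    return values

sample = '''
// a comment line (syntax assumed for illustration)
request.config.my_param_1 = "some random string";

request.config.my_param_2 = "another value";
'''

print(extract_config(sample))
```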

How can I extract/parse tabular data from a text file in Perl?

百般思念 posted on 2019-12-04 05:32:37
I am looking for something like HTML::TableExtract, just not for HTML input, but for plain text input that contains "tables" formatted with indentation and spacing. Data could look like this: Here is some header text. Column One Column Two Column Three a b a b c Some more text Another Table Another Column abdbdbdb aaaa
I'm not aware of any packaged solution, but something not very flexible is fairly simple to do, assuming you can do two passes over the file (the following is a partially Perlish pseudocode example). Assumption: data may contain spaces and is NOT quoted à la CSV if there's a space - if

Howto clean comments from raw sql file

馋奶兔 posted on 2019-12-04 01:54:35
I have a problem with cleaning comments and empty lines from an existing SQL file. The file has over 10k lines, so cleaning it manually is not an option. I have a little Python script, but I have no idea how to handle comments inside multi-line inserts. Code:
    f = file( 'file.sql', 'r' )
    t = filter( lambda x: not x.startswith('--') \
        and not x.isspace() , f.readlines() )
    f.close()
    t #<- here the cleaned data should be
How it should work:
This should be cleaned:
    -- normal sql comment
This should stay as it is:
    CREATE FUNCTION func1(a integer) RETURNS void
    LANGUAGE plpgsql AS $$
    BEGIN
    --
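
One possible approach is to track whether the current line is inside a `$$ ... $$` body and only drop comment and blank lines outside it. This is a minimal sketch under several assumptions: only the plain `$$` delimiter is used (no `$tag$` quoting), comments to remove start the line, and everything inside a function body should be kept. The function name `strip_sql_comments` is hypothetical.

```python
def strip_sql_comments(lines):
    """Drop '--' comment lines and blank lines outside $$-quoted bodies."""
    cleaned = []
    in_dollar_body = False
    for line in lines:
        is_comment = line.lstrip().startswith("--")
        is_blank = not line.strip()
        if not in_dollar_body and (is_comment or is_blank):
            continue
        cleaned.append(line)
        # An odd number of '$$' on a line opens or closes a quoted body.
        if line.count("$$") % 2 == 1:
            in_dollar_body = not in_dollar_body
    return cleaned

sql = [
    "-- normal sql comment\n",
    "\n",
    "CREATE FUNCTION func1(a integer) RETURNS void\n",
    "    LANGUAGE plpgsql AS $$\n",
    "BEGIN\n",
    "    -- comment inside the body\n",
    "END;\n",
    "$$;\n",
    "-- trailing comment\n",
]

print("".join(strip_sql_comments(sql)))
```

With a file, the same function can be fed `open('file.sql').readlines()`. String literals that contain `--` or `$$` are not handled by this sketch.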