text-parsing | 易学教程

Parsing / Extracting Text from String in Rails?

I have a string in Rails, e.g. "This is a Twitter message. #books War & Peace by Leo Tolstoy. I love this book!", and I want to parse the text and extract only certain phrases, like "War & Peace by Leo Tolstoy". Is this a matter of using a regex and lifting the text between "#books" and "."? What if there's no structure to the message, as in "This is a Twitter message #books War & Peace by Leo Tolstoy I love this book!" or "This is a Twitter message. I love the book War & Peace by Leo Tolstoy #books"? How can I reliably pull the phrase "War & Peace by Leo Tolstoy" without knowing the phrase ex
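
For the structured case only, where the phrase sits between the hashtag and the next period, a regular expression may be enough. The following is a minimal sketch in Python rather than Ruby; the function name `text_after_hashtag` is hypothetical, and the pattern assumes the phrase itself contains no period.

```python
import re

def text_after_hashtag(message, tag):
    """Return the text between '#tag' and the next period, or None."""
    # Assumption: the wanted phrase contains no '.' of its own.
    pattern = re.compile(r"#" + re.escape(tag) + r"\s+([^.]+)\.")
    match = pattern.search(message)
    return match.group(1).strip() if match else None

structured = ("This is a Twitter message. #books War & Peace by Leo Tolstoy. "
              "I love this book!")
unstructured = ("This is a Twitter message #books War & Peace by Leo Tolstoy "
                "I love this book!")

print(text_after_hashtag(structured, "books"))    # the phrase is found
print(text_after_hashtag(unstructured, "books"))  # no closing period: None
```

The second call returns `None`, which illustrates the limit of this approach: without a delimiter, pattern matching alone cannot tell where the phrase ends.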

Reading parameters from a text file into the workspace

I have a file which has the following information:

    % ---------------------- location details --------------------------
    %
    % lat : latitude [minimum = -90, maximum = 90, unit =
    % degrees north]
    % lon : longitude [ minimum = -360, maximum = 360, unit =
    % deg east]
    % z: altitude (above sea level, m)
    %---------------------------------------------------------------
    % location:
    lat = 54.35
    lon = -2.9833

This is a small section of the file. I would like to read some of this information into MATLAB, where the information can then be used to perform some calculations. The part of the file that I would
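
One way to approach this kind of file is to skip the `%` comment lines and collect every `name = number` pair from the remaining lines. The sketch below is in Python rather than MATLAB; the function name `read_parameters` and the exact line layout of the sample are assumptions, since the excerpt does not show where the line breaks fall.

```python
import re

# A name, an equals sign, and a signed decimal number.
ASSIGNMENT = re.compile(r"([A-Za-z_]\w*)\s*=\s*([-+]?\d+(?:\.\d+)?)")

def read_parameters(text):
    """Collect `name = number` pairs from lines that are not '%' comments."""
    params = {}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("%"):
            continue  # skip blank lines and comment lines
        for name, value in ASSIGNMENT.findall(stripped):
            params[name] = float(value)
    return params

# Assumed layout: comments start with '%', values sit on their own lines.
sample = """\
% lat : latitude [minimum = -90, maximum = 90, unit =
% degrees north]
% location:
lat = 54.35
lon = -2.9833
"""

print(read_parameters(sample))
```

Because `findall` scans the whole line, the same function also handles both assignments appearing on a single line.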

Compute ngrams for each row of text data in R

I have a data column of the following format: Text Hello world Hello How are you today I love stackoverflow blah blah blahdy I would like to compute the 3-grams for each row in this dataset by perhaps using the tau package's textcnt() function. However, when I tried it, it gave me one numeric vector with the ngrams for the entire column. How can I apply this function to each observation in my data separately?

Is this what you're after?

    library("RWeka")
    library("tm")
    TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
    # Using Tyler's method of making the 'Text'
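
The underlying idea, applying the n-gram function to each row instead of to the whole column, can be sketched independently of R. The Python below is an illustration only: `word_ngrams` is a hypothetical helper, it computes word-level 3-grams, and the row boundaries in `rows` are a guess because the excerpt flattens the column into one line.

```python
def word_ngrams(text, n=3):
    """Return the word n-grams of `text`; empty if it has fewer than n words."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

# Assumed row boundaries for the flattened 'Text' column.
rows = [
    "Hello world",
    "Hello",
    "How are you today",
    "I love stackoverflow",
    "blah blah blahdy",
]

# One list of 3-grams per row, rather than one result for the whole column.
per_row = [word_ngrams(row) for row in rows]
for row, grams in zip(rows, per_row):
    print(repr(row), "->", grams)
```

Rows with fewer than three words yield an empty list, so the output keeps one entry per observation.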

Resume/CV Parsing in PHP [closed]

Note: this question was closed 8 years ago as ambiguous, vague, incomplete, overly broad, or rhetorical.

We are developing a requirement-based social media site using LAMP, and we want to do resume/CV parsing in PHP. We were able to parse the email ID and phone number, but we are not sure how to parse the other information like

Can I control the way the CountVectorizer vectorizes the corpus in scikit learn?

I am working with a CountVectorizer from scikit-learn, and I'm possibly attempting to do some things that the object was not made for, but I'm not sure. In terms of getting counts for occurrence:

    vocabulary = ['hi', 'bye', 'run away!']
    corpus = ['run away!']
    cv = CountVectorizer(vocabulary=vocabulary)
    X = cv.fit_transform(corpus)
    print X.toarray()

gives:

    [[0 0 0 0]]

What I'm realizing is that the CountVectorizer will break the corpus into what I believe are unigrams:

    vocabulary = ['hi', 'bye'

Java: How read a File line by line by ignoring “\n”

I'm trying to read a tab-separated text file line by line. The lines are separated by a carriage return plus line feed ("\r\n"), and a lone line feed ("\n") is allowed within the tab-separated text fields. Since I want to read the file line by line, I want my program to ignore a standalone "\n". Unfortunately, BufferedReader treats both as line separators. How can I modify my code to ignore the standalone "\n"?

    try {
        BufferedReader in = new BufferedReader(new FileReader(flatFile));
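
The general technique is to stop relying on the reader's built-in line splitting and split on "\r\n" explicitly. The sketch below shows the idea in Python rather than Java; `read_crlf_records` is a hypothetical name, and it reads the whole stream into memory, which assumes the file is of moderate size.

```python
import io

def read_crlf_records(stream):
    """Return records terminated by CRLF, keeping any lone LF inside a field."""
    data = stream.read()
    records = data.split("\r\n")
    if records and records[-1] == "":
        records.pop()  # drop the empty piece after a trailing terminator
    return records

# For a real file, open it with newline="" so Python does not translate
# line endings: open(path, newline="").
sample = io.StringIO("id\tnote\r\n1\tline one\nline two\r\n", newline="")

records = read_crlf_records(sample)
rows = [record.split("\t") for record in records]
print(rows)
```

The second record keeps its embedded "\n" inside the `note` field, and only "\r\n" ends a record.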

Removing brackets and quotes from print in Python 2.7

I am trying to remove the brackets from the output of a print statement in Python 2.7. I tried suggestions from various forums, but they didn't work as expected, so I finally thought of asking here myself. Code:

    with open('buttonpress_and_commandstarted','r') as buttonpress_commandstarted:
        for line in buttonpress_commandstarted:
            button_press_command_time = ''
            if os.path.getsize('buttonpress_and_commandstarted') > 0:
                button_press_command_time = line.split()[2]
            else:
                print " > Cannot get time stamp as the

ColdFusion extract values from text file

The technical details: I want to extract values from a text file containing parameter names and values. For each line that starts with "request.config." (there are empty lines, lines with comments, etc., which I don't want to extract anything from), I want to extract these values (shown in bold in the original):

    request.config. my_param_1 = "some random string" ;

I thought the best way to do this might be using a regex, but how can I do this? I thought there would be something like a regular expression that would
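
A single anchored pattern with two capture groups covers this case: one group for the parameter name and one for the quoted value. The sketch is in Python rather than ColdFusion; `extract_config`, `my_param_2`, and the comment syntax in the sample are hypothetical, and the pattern assumes double-quoted values with no embedded quotes.

```python
import re

# Lines of the form: request.config.<name> = "<value>";
# Leading whitespace and the trailing semicolon are treated as optional.
CONFIG_LINE = re.compile(r'^\s*request\.config\.(\w+)\s*=\s*"([^"]*)"\s*;?\s*$')

def extract_config(text):
    """Map parameter names to values for every matching line; skip the rest."""
    values = {}
    for line in text.splitlines():
        match = CONFIG_LINE.match(line)
        if match:
            values[match.group(1)] = match.group(2)
    return values

sample = '''
// a comment line (hypothetical syntax)
request.config.my_param_1 = "some random string";

request.config.my_param_2 = "another value";
other.setting = "ignored";
'''

print(extract_config(sample))
```

Empty lines, comments, and any line not starting with `request.config.` simply fail to match and are ignored.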

How can I extract/parse tabular data from a text file in Perl?

I am looking for something like HTML::TableExtract, just not for HTML input but for plain-text input that contains "tables" formatted with indentation and spacing. Data could look like this: Here is some header text. Column One Column Two Column Three a b a b c Some more text Another Table Another Column abdbdbdb aaaa

Not aware of any packaged solution, but something not very flexible is fairly simple to do, assuming you can do two passes over the file (the following is a partially Perlish pseudocode example). Assumption: data may contain spaces and is NOT quoted à la CSV if there's a space - if
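
The answer's pseudocode is cut off, so the following is only one plausible reading of the two-pass idea, written in Python rather than Perl. The first pass finds character positions that are blank in every row; the second pass slices each row at those positions. The name `split_aligned_rows` and the `min_gap` parameter (a separator must be at least two blank positions wide, so that single spaces inside a value survive) are assumptions, not part of the original answer.

```python
def split_aligned_rows(lines, min_gap=2):
    """Split space-aligned rows into cells using gaps shared by all rows."""
    if not lines:
        return []
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]

    # Pass 1: a position is blank if it holds a space in every row.
    blank = [all(row[i] == " " for row in padded) for i in range(width)]

    spans = []    # (start, end) character ranges of the columns
    start = None  # start of the column currently being collected
    gap = 0       # length of the current run of blank positions
    for i in range(width + 1):
        is_blank = i == width or blank[i]
        if is_blank:
            gap += 1
            if start is not None and (gap >= min_gap or i == width):
                spans.append((start, i - gap + 1))
                start = None
        else:
            if start is None:
                start = i
            gap = 0

    # Pass 2: slice every row at the column spans.
    return [[row[a:b].strip() for a, b in spans] for row in padded]

# Build an aligned sample: the left column is padded to 12 characters.
cells = [("Column One", "Column Two"), ("a", "b"), ("c d", "e")]
lines = [left.ljust(12) + right for left, right in cells]

table = split_aligned_rows(lines)
print(table)
```

This only works when each table is passed in on its own and its columns are separated by at least `min_gap` spaces in every row; locating the table within surrounding free text is a separate problem.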

Howto clean comments from raw sql file

I have a problem with cleaning comments and empty lines from an existing SQL file. The file has over 10k lines, so cleaning it manually is not an option. I have a little Python script, but I have no idea how to handle comments inside multi-line inserts. Code:

    f = file( 'file.sql', 'r' )
    t = filter( lambda x: not x.startswith('--') \
                and not x.isspace() , f.readlines() )
    f.close()
    t #<- here the cleaned data should be

How it should work. This should be cleaned:

    -- normal sql comment

This should stay as it is:

    CREATE FUNCTION func1(a integer) RETURNS void LANGUAGE plpgsql AS $$ BEGIN --
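
One way to keep comments inside function bodies is to track whether the current line falls between a pair of `$$` markers and to filter only outside them. The sketch below uses Python 3 and makes several assumptions: `clean_sql` is a hypothetical name, only untagged `$$` quoting is handled (not `$tag$`), everything inside a body is left untouched, and the function body in the sample is invented because the original example is cut off.

```python
def clean_sql(lines):
    """Drop blank lines and '--' comment lines, except inside $$ ... $$ bodies."""
    cleaned = []
    in_body = False
    for line in lines:
        is_comment = line.lstrip().startswith("--")
        is_blank = line.strip() == ""
        if not in_body and (is_comment or is_blank):
            continue
        cleaned.append(line)
        # An odd number of $$ markers on a kept line toggles the state.
        if line.count("$$") % 2 == 1:
            in_body = not in_body
    return cleaned

sample = [
    "-- normal sql comment\n",
    "\n",
    "CREATE FUNCTION func1(a integer) RETURNS void\n",
    "LANGUAGE plpgsql AS $$\n",
    "BEGIN\n",
    "    -- comment inside the body (hypothetical)\n",
    "    NULL;\n",
    "END;\n",
    "$$;\n",
    "-- trailing comment\n",
]

print("".join(clean_sql(sample)))
```

Unlike the original filter, this version also removes indented `--` lines outside a body; it does not touch trailing comments on code lines or `/* ... */` blocks.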