tokenize

C++ tokenize a string using a regular expression

Submitted by 夙愿已清 on 2019-11-27 14:52:17
Question: I'm trying to teach myself some C++ from scratch at the moment. I'm well-versed in Python, Perl and JavaScript, but have only encountered C++ briefly, in a classroom setting, in the past. Please excuse the naivete of my question. I would like to split a string using a regular expression, but I have not had much luck finding a clear, definitive, efficient and complete example of how to do this in C++. In Perl this action is common, and thus can be accomplished in a trivial manner: /home/me$ cat

Is it a Lexer's Job to Parse Numbers and Strings?

Submitted by 做~自己de王妃 on 2019-11-27 13:28:59
Is it a lexer's job to parse numbers and strings? This may or may not sound dumb, given the fact that I'm asking whether a lexer should parse input. However, I'm not sure whether that's in fact the lexer's job or the parser's job, because in order to lex properly, the lexer needs to parse the string/number in the first place, so it would seem that code would be duplicated if the parser does this too. Is it indeed the lexer's job? Or should the lexer simply break up a string like 123.456 into the strings 123, ., 456 and let the parser figure out the rest? Doing this wouldn't be so

Read input line by line from a file, tokenize using strtok(), and write the output to an output file

Submitted by 别来无恙 on 2019-11-27 12:28:24
Question: What I am trying to do is read a file LINE BY LINE, tokenize each line, and write the result to an output file. What I have been able to do is read the first line of the file, but my problem is that I am unable to read the next line to tokenize it, so that it can be saved as a second line in the output file. This is what I could do so far for reading the first line of the file: #include <iostream> #include <string> //string library #include <fstream> //I/O stream input and output library using namespace std;

Tokenize a string and include delimiters in C++

Submitted by 南笙酒味 on 2019-11-27 11:08:15
Question: I'm tokenizing with the following, but I'm unsure how to include the delimiters with it. void Tokenize(const string str, vector<string>& tokens, const string& delimiters) { int startpos = 0; int pos = str.find_first_of(delimiters, startpos); string strTemp; while (string::npos != pos || string::npos != startpos) { strTemp = str.substr(startpos, pos - startpos); tokens.push_back(strTemp.substr(0, strTemp.length())); startpos = str.find_first_not_of(delimiters, pos); pos = str.find_first_of

How does a parser (for example, HTML) work?

Submitted by 穿精又带淫゛_ on 2019-11-27 09:40:58
Question: For argument's sake, let's assume an HTML parser. I've read that it tokenizes everything first, and then parses it. What does tokenize mean? Does the parser read every character, building up a multidimensional array to store the structure? For example, does it read a < and then begin to capture the element, and then once it meets a closing > (outside of an attribute), is it pushed onto an array stack somewhere? I'm interested for the sake of knowing (I'm curious). If I were to read through

Python - RegEx for splitting text into sentences (sentence-tokenizing) [duplicate]

Submitted by 微笑、不失礼 on 2019-11-27 07:07:30
This question already has an answer here: Python split text on sentences 10 answers I want to make a list of sentences from a string and then print them out. I don't want to use NLTK to do this. So it needs to split on a period at the end of a sentence and not at decimals, abbreviations, titles of names, or if the sentence contains a .com. This is my attempt at a regex that doesn't work. import re text = """\ Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9

Parsing pipe delimited string into columns?

Submitted by 我与影子孤独终老i on 2019-11-27 07:02:44
Question: I have a column with pipe-separated values such as: '23|12.1| 450|30|9|78|82.5|92.1|120|185|52|11' I want to parse this column to fill a table with 12 corresponding columns: month1, month2, month3...month12. So month1 will have the value 23, month2 the value 12.1, etc. Is there a way to parse it by a loop or delimiter instead of having to separate one value at a time using substr? Thanks. Answer 1: You can use regexp_substr (10g+): SQL> SELECT regexp_substr('23|12.1| 450|30|9|', '[^|]+', 1, 1) c1

Looking for a clear definition of what a “tokenizer”, “parser” and “lexer” are and how they are related to each other and used?

Submitted by 强颜欢笑 on 2019-11-27 05:50:39
I am looking for a clear definition of what a "tokenizer", "parser" and "lexer" are and how they are related to each other (e.g., does a parser use a tokenizer, or vice versa)? I need to create a program that will go through C/H source files to extract data declarations and definitions. I have been looking for examples and can find some info, but I am really struggling to grasp the underlying concepts like grammar rules, parse trees and abstract syntax trees, and how they interrelate. Eventually these concepts need to be stored in an actual program, but 1) what do they look like, 2) are

C++ Templates Angle Brackets Pitfall - What is the C++11 fix?

Submitted by 社会主义新天地 on 2019-11-27 05:17:29
In C++11, this is now valid syntax: vector<vector<float>> MyMatrix; whereas previously, it had to be written like this (notice the space): vector<vector<float> > MyMatrix; My question is: what is the fix that the standard uses to allow the first version? Could it be as simple as making > a token instead of >>? If that's not it, what does not work with this approach? I consider forms like myTemplate< x>>3 > a non-problem, since you can disambiguate them by writing myTemplate<(x>>3)>. It's fixed by adding a special case to the parsing rules when parsing template arguments. C++11 14.2/3:

How do I tokenize a string sentence in NLTK?

Submitted by ☆樱花仙子☆ on 2019-11-27 04:34:58
I am using NLTK, and I want to create my own custom texts just like the default ones in nltk.books. However, I've only got as far as building a list by hand, like my_text = ['This', 'is', 'my', 'text']. I'd like to find some way to input my "text" as: my_text = "This is my text, this is a nice way to input text." Which method, Python's or NLTK's, allows me to do this? And more importantly, how can I dismiss punctuation symbols? This is actually on the main page of nltk.org: >>> import nltk >>> sentence = """At eight o'clock on Thursday morning ... Arthur didn't feel very good.""" >>> tokens = nltk.word