tokenize

C++ tokenize a string using a regular expression

Submitted by 夙愿已清 on 2019-11-27 14:52:17
Question: I'm trying to teach myself some C++ from scratch at the moment. I'm well-versed in Python, Perl and JavaScript, but have only encountered C++ briefly, in a classroom setting, in the past. Please excuse the naivete of my question. I would like to split a string using a regular expression, but I have not had much luck finding a clear, definitive, efficient and complete example of how to do this in C++. In Perl this action is common, and thus can be accomplished in a trivial manner: /home/me$ cat

Is it a Lexer's Job to Parse Numbers and Strings?

Submitted by 做~自己de王妃 on 2019-11-27 13:28:59
Is it a lexer's job to parse numbers and strings? This may or may not sound dumb, given the fact that I'm asking whether a lexer should parse input. However, I'm not sure whether that's in fact the lexer's job or the parser's job, because in order to lex properly, the lexer needs to parse the string/number in the first place, so it would seem that code would be duplicated if the parser does this too. Is it indeed the lexer's job? Or should the lexer simply break up a string like 123.456 into the strings 123, ., 456 and let the parser figure out the rest? Doing this wouldn't be so

Read input line by line from a file, tokenize using strtok(), and write the output to an output file

Submitted by 别来无恙 on 2019-11-27 12:28:24
Question: What I am trying to do is read a file LINE BY LINE, tokenize each line, and write the result to an output file. What I have been able to do is read the first line of the file, but my problem is that I am unable to read the next line to tokenize it, so that it can be saved as a second line in the output file. This is what I could do so far for reading the first line of the file: #include <iostream> #include <string> //string library #include <fstream> //I/O stream input and output library using namespace std;

Tokenize a string and include delimiters in C++

Submitted by 南笙酒味 on 2019-11-27 11:08:15
Question: I'm tokenizing with the following, but I'm unsure how to include the delimiters with it. void Tokenize(const string str, vector<string>& tokens, const string& delimiters) { int startpos = 0; int pos = str.find_first_of(delimiters, startpos); string strTemp; while (string::npos != pos || string::npos != startpos) { strTemp = str.substr(startpos, pos - startpos); tokens.push_back(strTemp.substr(0, strTemp.length())); startpos = str.find_first_not_of(delimiters, pos); pos = str.find_first_of

How does a parser (for example, HTML) work?

Submitted by 穿精又带淫゛_ on 2019-11-27 09:40:58
Question: For argument's sake, let's assume an HTML parser. I've read that it tokenizes everything first, and then parses it. What does tokenize mean? Does the parser read every character, building up a multidimensional array to store the structure? For example, does it read a < and then begin to capture the element, and then once it meets a closing > (outside of an attribute), is it pushed onto an array stack somewhere? I'm interested for the sake of knowing (I'm curious). If I were to read through

Python - RegEx for splitting text into sentences (sentence-tokenizing) [duplicate]

Submitted by 微笑、不失礼 on 2019-11-27 07:07:30
This question already has an answer here: Python split text on sentences 10 answers I want to make a list of sentences from a string and then print them out. I don't want to use NLTK to do this. So it needs to split on a period at the end of a sentence and not at decimals, abbreviations, titles of names, or if the sentence contains a .com. This is my attempt at a regex that doesn't work. import re text = """\ Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9

Parsing pipe delimited string into columns?

Submitted by 我与影子孤独终老i on 2019-11-27 07:02:44
Question: I have a column with pipe-separated values such as: '23|12.1| 450|30|9|78|82.5|92.1|120|185|52|11' I want to parse this column to fill a table with 12 corresponding columns: month1, month2, month3...month12. So month1 will have the value 23, month2 the value 12.1, etc. Is there a way to parse it by a loop or delimiter instead of having to separate one value at a time using substr? Thanks. Answer 1: You can use regexp_substr (10g+): SQL> SELECT regexp_substr('23|12.1| 450|30|9|', '[^|]+', 1, 1) c1

Looking for a clear definition of what a “tokenizer”, “parser” and “lexer” are and how they are related to each other and used?

Submitted by 强颜欢笑 on 2019-11-27 05:50:39
I am looking for a clear definition of what a "tokenizer", "parser" and "lexer" are and how they are related to each other (e.g., does a parser use a tokenizer, or vice versa)? I need to create a program that will go through C/H source files to extract data declarations and definitions. I have been looking for examples and can find some info, but I am really struggling to grasp the underlying concepts like grammar rules, parse trees and abstract syntax trees, and how they interrelate. Eventually these concepts need to be stored in an actual program, but 1) what do they look like, 2) are

C++ Templates Angle Brackets Pitfall - What is the C++11 fix?

Submitted by 社会主义新天地 on 2019-11-27 05:17:29
In C++11, this is now valid syntax: vector<vector<float>> MyMatrix; whereas previously, it had to be written like this (notice the space): vector<vector<float> > MyMatrix; My question is: what is the fix that the standard uses to allow the first version? Could it be as simple as making > a token instead of >>? If that's not it, what does not work with this approach? I consider forms like myTemplate< x>>3 > a non-problem, since you can disambiguate them by writing myTemplate<(x>>3)>. It's fixed by adding a special case to the parsing rules when parsing template arguments. C++11 14.2/3:

How do I tokenize a string sentence in NLTK?

Submitted by ☆樱花仙子☆ on 2019-11-27 04:34:58
I am using NLTK, and I want to create my own custom texts just like the default ones in nltk.books. However, I've only got as far as building a list by hand, like my_text = ['This', 'is', 'my', 'text']. I'd like to find some way to input my "text" as: my_text = "This is my text, this is a nice way to input text." Which method, Python's or NLTK's, allows me to do this? And more importantly, how can I dismiss punctuation symbols? This is actually on the main page of nltk.org: >>> import nltk >>> sentence = """At eight o'clock on Thursday morning ... Arthur didn't feel very good.""" >>> tokens = nltk.word