tokenize

Splitting strings in Python

时光毁灭记忆、已成空白 submitted on 2019-12-08 19:18:55
Question: I have a string like this: this is [bracket test] "and quotes test " . I'm trying to write something in Python to split it up by spaces while ignoring spaces inside the square brackets and the quotes. The result I'm looking for is: ['this', 'is', 'bracket test', 'and quotes test ']

Answer 1: Here's a simplistic solution that works with your test input:

    import re
    re.findall(r'\[[^\]]*\]|"[^"]*"|\S+', s)

This returns anything that matches either an open bracket followed by zero or more non-close-bracket characters and a closing bracket, a double-quoted string, or a run of one or more non-whitespace characters.
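For reference, a runnable version of that answer (the variable s is taken from the question); note that the regex keeps the brackets and quotes, so a strip pass afterwards produces exactly the list the asker wanted:

    import re

    s = 'this is [bracket test] "and quotes test "'
    tokens = re.findall(r'\[[^\]]*\]|"[^"]*"|\S+', s)
    print(tokens)
    # ['this', 'is', '[bracket test]', '"and quotes test "']

    # Strip the surrounding delimiters to get the asker's desired output:
    print([t.strip('[]"') for t in tokens])
    # ['this', 'is', 'bracket test', 'and quotes test ']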

Python 2 newline tokens in tokenize module

放肆的年华 submitted on 2019-12-08 17:45:36
Question: I am using the tokenize module in Python and wonder why there are two different newline tokens: NEWLINE = 4 and NL = 54. Any examples of code that would produce both tokens would be appreciated.

Answer 1: According to the Python documentation, tokenize.NL is the token value used to indicate a non-terminating newline. The NEWLINE token indicates the end of a logical line of Python code; NL tokens are generated when a logical line of code is continued over multiple physical lines. More here: https://docs.python.org/2
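A minimal example that produces both token types, as the asker requested (Python 3 syntax shown here; the two token types behave the same way in Python 2):

    import io
    import tokenize

    # One logical line spread over two physical lines: the break inside
    # the parentheses emits NL, the end of each statement emits NEWLINE.
    src = "x = (1 +\n     2)\ny = 3\n"
    for tok in tokenize.generate_tokens(io.StringIO(src).readline):
        if tok.type in (tokenize.NEWLINE, tokenize.NL):
            print(tokenize.tok_name[tok.type], tok.start)
    # NL (1, 8)
    # NEWLINE (2, 7)
    # NEWLINE (3, 5)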

RegEx Tokenizer: split text into words, digits, punctuation, and spacing (do not delete anything)

天大地大妈咪最大 submitted on 2019-12-08 17:39:45
Question: I almost found the answer to this question in this thread (samplebias's answer); however, I need to split a phrase into words, digits, punctuation marks, and spaces/tabs. I also need this to preserve the order in which each of these things occurs (which the code in that thread already does). So, what I've found is something like this:

    from nltk.tokenize import *
    txt = "Today it's 07.May 2011. Or 2.999."
    regexp_tokenize(txt, pattern=r'\w+([.,]\w+)*|\S+')
    ['Today', 'it', "'s", '07.May', '2011', '.', 'Or', '2.999', '.']
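One way to keep the spacing as well, sketched with plain re rather than NLTK: add a branch for whitespace runs and shrink the catch-all to a single character, so nothing is ever swallowed or dropped:

    import re

    txt = "Today it's 07.May 2011. Or 2.999."
    # words/numbers (with optional internal . or ,), runs of whitespace,
    # or any single remaining non-whitespace character
    tokens = re.findall(r"\w+(?:[.,]\w+)*|\s+|\S", txt)
    print(tokens)
    assert "".join(tokens) == txt  # nothing was deleted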

Issues with Pointer Arithmetic - Trying to tokenize input String

回眸只為那壹抹淺笑 submitted on 2019-12-08 11:24:22
Question: I am currently working on a program that lets a user enter a string, which is then tokenized; the tokens are printed to the screen using an array of pointers. It is "supposed" to do this by calling my tokenize function, which reads the input string until the first separator (' ', ',', '.', '?', '!'). It then changes that separator in my string to the NUL character ('\0') and returns a pointer to the next character in the string. In main, after the string has been input, it should keep
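The contract being described is essentially strtok with an explicit remainder; a Python analogue of that logic (slices standing in for the C pointers, all names my own) may make the intended flow clearer:

    SEPARATORS = set(" ,.?!")

    def tokenize(s):
        """Return (token, rest): token ends at the first separator;
        rest begins just past it, like the returned C pointer."""
        for i, ch in enumerate(s):
            if ch in SEPARATORS:
                return s[:i], s[i + 1:]
        return s, ""

    rest = "Hello, world! How are you?"
    while rest:
        token, rest = tokenize(rest)
        if token:  # adjacent separators produce empty tokens; skip them
            print(token)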

Java : The constructor JSONTokener(InputStreamReader) is undefined

南笙酒味 submitted on 2019-12-08 09:37:03
Question: I have a rather strange issue with Java: I'm getting an error on some machines only, and I would like to know if there is any way I can avoid it. This is the line of code concerned:

    JSONTokener jsonTokener = new JSONTokener(new InputStreamReader(is, "UTF-8"));

This is the error I get on some machines:

    The file *.java could not be compiled. Error raised is: The constructor JSONTokener(InputStreamReader) is undefined

Answer 1: Check the classpath on the machines where this error occurs. This could

Tokenizer with Pygments in Python

大城市里の小女人 submitted on 2019-12-08 07:50:58
Question: I want to create a tokenizer for source files (e.g. Java or C++) in Python. I came across Pygments, and in particular these lexers, but I could not find examples in the documentation or online for how to use a lexer. I am wondering whether it is possible to use Pygments in Python to get the tokens and their positions for a given source file. I am struggling with the very basics here, so if someone could offer even a small chunk of code detailing the above it would be much appreciated. Answer 1:
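Answer 1 is cut off above, but the Pygments calls involved are short; a minimal sketch that prints every token with its character offset (get_tokens_unprocessed yields (index, tokentype, value) tuples):

    from pygments.lexers import CppLexer

    code = "int add(int a, int b) { return a + b; }"
    for index, tok_type, value in CppLexer().get_tokens_unprocessed(code):
        # index is the character offset of the token in the source string
        print(index, tok_type, repr(value))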

Simple tokenizer for C++ in Python

这一生的挚爱 submitted on 2019-12-08 07:50:50
Question: I am struggling to find a Python library or script to tokenize C++ (to find specific tokens such as function definition names, variable names, keywords, etc.). I have managed to find keywords, whitespace, etc. using something like this, but I found it quite a challenge for function/class definition names. I was hoping to use a pre-existing script; I explored Pygments with no success. Its lexer seems amazing for what I want, but I have no idea how to use it from Python and also get positions for each token
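Building on the same Pygments API, specific token kinds can be filtered by type; Pygments tags the name in a C/C++ function definition as Token.Name.Function, so a sketch for pulling out names and keywords could look like this:

    from pygments.lexers import CppLexer
    from pygments.token import Keyword, Name

    code = "int square(int x) { return x * x; }"
    for index, tok_type, value in CppLexer().get_tokens_unprocessed(code):
        if tok_type in Name or tok_type in Keyword:  # subtype membership test
            print(index, tok_type, value)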

Analyzer to autocomplete names

独自空忆成欢 submitted on 2019-12-08 06:32:47
Question: I want to be able to autocomplete names. For example, if we have the name John Smith, I want a search for Jo, Sm, or John Sm to return the document. In addition, I do not want jo sm to match the document. I currently have this analyzer:

    return array(
        'settings' => array(
            'index' => array(
                'analysis' => array(
                    'analyzer' => array(
                        'autocomplete' => array(
                            'tokenizer' => 'autocompleteEngram',
                            'filter' => array('lowercase', 'whitespace')
                        )
                    ),
                    'tokenizer' => array(
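The tokenizer definition is cut off above, but assuming autocompleteEngram is an edge_ngram tokenizer, the matching behavior is easiest to see by generating the grams by hand; an illustration in Python (not Elasticsearch itself):

    # Edge n-grams per word: every prefix of each lowercased token.
    def edge_ngrams(text, min_gram=1, max_gram=10):
        grams = []
        for word in text.lower().split():
            for n in range(min_gram, min(max_gram, len(word)) + 1):
                grams.append(word[:n])
        return grams

    print(edge_ngrams("John Smith"))
    # ['j', 'jo', 'joh', 'john', 's', 'sm', 'smi', 'smit', 'smith']

A query analyzed the same way shows why Jo and Sm match, and hints at the reported problem: unless the query side is analyzed differently, lowercase prefixes such as jo sm produce the very same grams.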

Regex to find tokens - Java Scanner or another alternative

一曲冷凌霜 submitted on 2019-12-08 05:40:34
Question: Hi, I'm trying to write a class that turns some text into well-defined tokens. The strings are somewhat similar to code, like: (brown) "fox" 'c'; . What I would like to get (either as tokens from Scanner or as an array after splitting; I think either would work just fine) is ( , brown , ) , "fox" , 'c' , ; separately, as they are potential tokens, which include: quoted text with ' and "; numbers with or without a decimal point; parentheses, braces, semicolons, equals, sharp, ||, <=, &&. Currently I'm
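The usual shape of such a tokenizer is one big alternation with the longest or most specific branches first; a sketch of that idea in Python (the same pattern transfers to Java's Pattern almost verbatim, and the token classes listed are the ones from the question):

    import re

    TOKEN = re.compile(r"""
        "[^"]*"          |  # double-quoted text
        '[^']*'          |  # single-quoted text
        \d+(?:\.\d+)?    |  # number, optional decimal point
        [A-Za-z_]\w*     |  # bare words such as brown
        \|\||&&|<=       |  # multi-character operators before single ones
        [()\[\]{};=\#]      # single-character punctuation
    """, re.VERBOSE)

    line = '''(brown) "fox" 'c';'''
    print([m.group() for m in TOKEN.finditer(line)])
    # ['(', 'brown', ')', '"fox"', "'c'", ';']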

Challenge: Regex-only tokenizer for shell-assignment-like config lines

折月煮酒 submitted on 2019-12-08 05:04:34
Question: I asked the original question here, and got a practical response with mixed Ruby and regular expressions. Now the purist in me wants to know: can this be done with regular expressions alone? My gut says it can. There's an ABNF floating around for bash 2.0, though it doesn't include string escapes. The spec: given an input line that is either (1) a variable ("key") assignment from a bash-flavored script or (2) a key-value setting from a typical configuration file like postgresql.conf, this regex (or
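For the assignment half of that spec, a regex alone can get surprisingly far; a minimal sketch (my own illustration under that spec, not the asker's final answer, and it accepts optional spaces around = for the postgresql.conf case even though bash itself forbids them):

    import re

    LINE = re.compile(r"""^\s*
        (?P<key>[A-Za-z_]\w*)\s*=\s*
        (?P<value>'(?:[^'\\]|\\.)*'     # single-quoted, with escapes
                 |"(?:[^"\\]|\\.)*"     # double-quoted, with escapes
                 |[^\s#]+)              # bare word up to space or comment
    """, re.VERBOSE)

    m = LINE.match("shared_buffers = '128MB'  # tune to taste")
    print(m.group("key"), m.group("value"))
    # shared_buffers '128MB'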