tokenize

How to stop BERT from breaking apart specific words into word-pieces

ぃ、小莉子 submitted on 2021-01-28 06:06:29
Question: I am using a pre-trained BERT model to tokenize text into meaningful tokens. However, the text contains many domain-specific words that I don't want the BERT tokenizer to break into word-pieces. Is there any solution? For example:

    tokenizer = BertTokenizer('bert-base-uncased-vocab.txt')
    tokens = tokenizer.tokenize("metastasis")

creates tokens like this: ['meta', '##sta', '##sis']. However, I want to keep whole words as one token, like this: ['metastasis'].

Answer 1: You are free to add new tokens to the…
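The answer is cut off above. A minimal sketch of what registering new whole-word tokens with a HuggingFace tokenizer typically looks like; the from_pretrained checkpoint name and the embedding-resize step are assumptions, not part of the original answer:

    # Sketch: register a domain word as a whole token so WordPiece leaves it intact.
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    num_added = tokenizer.add_tokens(["metastasis"])  # returns how many tokens were new

    # If you also feed the ids to a model, its embedding matrix must grow to match.
    model = BertModel.from_pretrained("bert-base-uncased")
    model.resize_token_embeddings(len(tokenizer))

    print(tokenizer.tokenize("metastasis"))  # ['metastasis']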

How to query fields with path_hierarchy analyzer in elasticsearch?

耗尽温柔 submitted on 2021-01-28 06:06:10
Question: I have configured a path_analyzer in Elasticsearch using the configuration below:

    PUT /elastic_course
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "path_analyzer": { "tokenizer": "path_tokenizer" },
            "reverse_path_analyzer": { "tokenizer": "path_tokenizer" }
          },
          "tokenizer": {
            "path_tokenizer": {
              "type": "path_hierarchy",
              "delimiter": "/",
              "replacement": "-"
            },
            "reverse_path_tokenizer": {
              "type": "path_hierarchy",
              "delimiter": "/",
              "replacement": "-"
            }
          }
        }
      },
      "mappings": {
        "book": {
          "properties": {…
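The mappings are truncated above, so the field name below is hypothetical. A minimal sketch of querying a path_hierarchy-analyzed field with the 8.x-style Python Elasticsearch client: at index time the tokenizer emits one token per path level, so an exact term match on any ancestor prefix finds all documents under it.

    # Sketch, assuming a hypothetical "path" field mapped with path_analyzer.
    # Because the tokenizer's "replacement" is "-", a path like /books/fiction
    # is indexed as the tokens "-books" and "-books-fiction", so the
    # (unanalyzed) term query must use "-" between levels.
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")
    resp = es.search(
        index="elastic_course",
        query={"term": {"path": "-books-fiction"}},
    )
    for hit in resp["hits"]["hits"]:
        print(hit["_source"])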

Best way to parse a custom filter syntax

拟墨画扇 submitted on 2021-01-27 22:40:11
Question: I have a program that allows the user to enter a filter in a textbox in the column header of a DataGridView. This text is then parsed into a list of FilterOperations. Currently I tokenize the string and then build the list in a huge for-loop. Which design patterns could I use to get rid of the huge for-construct? Are there any other actions I can take to improve the design? In its current state it is hard to add support for another operator or datatype, or to build something else than the…
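No answer is included in this excerpt. One common way to dissolve a giant loop full of branches is a dispatch table, essentially a lightweight strategy pattern: each operator maps to a function, so adding an operator means adding one table entry rather than another branch. A minimal Python sketch, with the token shape and the FilterOperation fields invented for illustration:

    # Sketch: dispatch table instead of a big for/switch construct.
    import operator
    from dataclasses import dataclass

    OPS = {
        "=":  operator.eq,
        "!=": operator.ne,
        ">":  operator.gt,
        "<":  operator.lt,
    }

    @dataclass
    class FilterOperation:
        field: str
        op: str
        value: str

        def matches(self, row: dict) -> bool:
            # Look up the comparison function by its operator token.
            return OPS[self.op](row[self.field], self.value)

    def parse(tokens: list) -> FilterOperation:
        # tokens like ["price", ">", "100"]; supporting a new operator
        # only requires a new OPS entry, not another loop branch.
        field, op, value = tokens
        if op not in OPS:
            raise ValueError(f"unknown operator: {op}")
        return FilterOperation(field, op, value)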

How to ignore punctuation in-between words using word_tokenize in NLTK?

和自甴很熟 submitted on 2021-01-04 06:41:40
Question: I'm looking to have NLTK's word_tokenize ignore punctuation in-between words, i.e., not split on it. If I have a sentence:

    test = 'Should I trade on the S&P? This works with a phone number 333-445-6635 and email test@testing.com'

word_tokenize splits "S&P?" into 'S', '&', 'P', '?'. Is there a way to have this library ignore punctuation between words or letters? Expected output: 'S&P', '?'.

Answer 1: Let me know how this works with your sentences. I added an additional test with a bunch of punctuation. The…
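The answer above is truncated. A minimal sketch in the same spirit, using NLTK's RegexpTokenizer so that &, -, @, ., and ' are kept when they sit between word characters; the exact pattern is an assumption, not the answerer's:

    from nltk.tokenize import RegexpTokenizer

    # A token is a \w+ run, optionally glued to further \w+ runs by internal
    # punctuation; anything else (like a trailing '?') falls through as a
    # single non-space character.
    tokenizer = RegexpTokenizer(r"\w+(?:[&@.'-]\w+)*|\S")

    test = 'Should I trade on the S&P? This works with a phone number 333-445-6635 and email test@testing.com'
    print(tokenizer.tokenize(test))
    # ['Should', 'I', 'trade', 'on', 'the', 'S&P', '?', ...,
    #  '333-445-6635', ..., 'test@testing.com']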

Wordpiece tokenization versus conventional lemmatization?

蹲街弑〆低调 submitted on 2021-01-02 06:28:10
Question: I'm looking at NLP preprocessing. At some point I want to implement a context-sensitive word embedding as a way of discerning word sense, and I was thinking of using the output from BERT to do so. I noticed BERT uses WordPiece tokenization (for example, "playing" -> "play" + "##ing"). Right now I have my text preprocessed using a standard tokenizer that splits on spaces and some punctuation, followed by a lemmatizer ("playing" -> "play"). I'm wondering what the benefit of WordPiece…
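The question breaks off here. A quick side-by-side of the two preprocessing styles being compared, using the metastasis example from the first entry on this page; requires the transformers and nltk packages (plus the wordnet corpus), and the exact WordPiece split depends on the model's vocabulary:

    from transformers import BertTokenizer
    from nltk.stem import WordNetLemmatizer

    # WordPiece: out-of-vocabulary words are split into subword units.
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    print(tokenizer.tokenize("metastasis"))  # e.g. ['meta', '##sta', '##sis']

    # Lemmatization: inflected forms are mapped to a dictionary headword.
    print(WordNetLemmatizer().lemmatize("playing", pos="v"))  # 'play'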

How to encode string in tf.data.Dataset?

我们两清 submitted on 2020-12-15 01:56:22
Question: I am trying to encode a string in a TensorFlow dataset in order to use it to train a pretrained RoBERTa model. The training_dataset is a tf.data.Dataset made from a pandas DataFrame (shown as a screenshot in the original post). I used this DataFrame to construct the tf.data.Dataset using:

    features = ['OptionA', 'OptionB', 'OptionC']
    training_dataset = (
        tf.data.Dataset.from_tensor_slices(
            (
                tf.cast(train_split[features].values, tf.string),
                tf.cast(train_split['Answer'].values, tf.int32),
            )
        )
    )

Now I want to…
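The question is truncated here. The usual way to run a Python tokenizer inside a tf.data pipeline is tf.py_function, since HuggingFace tokenizers operate on Python strings rather than tensors. A minimal sketch, assuming the roberta-base tokenizer and a maximum sequence length of 64 (both assumptions):

    import tensorflow as tf
    from transformers import RobertaTokenizer

    tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

    def encode(texts, label):
        # texts arrives as a tensor of byte strings; decode before tokenizing.
        enc = tokenizer(
            [t.decode("utf-8") for t in texts.numpy()],
            padding="max_length", truncation=True, max_length=64,
            return_tensors="tf",
        )
        return enc["input_ids"], label

    def tf_encode(texts, label):
        ids, lbl = tf.py_function(encode, [texts, label], [tf.int32, tf.int32])
        ids.set_shape([3, 64])  # three option columns per example (assumption)
        lbl.set_shape([])
        return ids, lbl

    encoded_dataset = training_dataset.map(tf_encode)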
