nlp

Keep punctuation and casing in gensim wikicorpus text

Submitted by 孤人 on 2021-01-29 15:33:55
Question: I have a Wiki dump as an xml.bz2 file and want to convert it to txt for later processing with BERT. The goal is to have each separate sentence on a new line and an empty line between articles (requirements of BERT training). I tried to follow this post (How to get the wikipedia corpus text with punctuation by using gensim wikicorpus?) and did a lot of research of my own. This is what I have so far:

from __future__ import print_function
import sys
from gensim.corpora import WikiCorpus
from
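A minimal sketch of one way to keep punctuation and casing: recent gensim versions let you pass a custom tokenizer_func to WikiCorpus, so a plain whitespace split (with lower=False) preserves the raw text. The dump filename is hypothetical, and sentence-per-line output for BERT would still need a separate sentence splitter (e.g. NLTK's sent_tokenize):

from gensim.corpora import WikiCorpus

def keep_all_tokenizer(text, token_min_len, token_max_len, lower):
    # Split only on whitespace so punctuation and casing survive,
    # instead of gensim's default lowercasing, alpha-only tokenizer.
    return text.split()

wiki = WikiCorpus(
    'enwiki-latest-pages-articles.xml.bz2',  # hypothetical dump path
    dictionary={},                 # skip the (slow) vocabulary build
    tokenizer_func=keep_all_tokenizer,
    lower=False,
    token_min_len=1,
    token_max_len=1024,
)

with open('wiki.txt', 'w', encoding='utf-8') as out:
    for tokens in wiki.get_texts():
        out.write(' '.join(tokens) + '\n\n')  # blank line between articles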

What is the workers parameter in word2vec in NLP?

Submitted by 天大地大妈咪最大 on 2021-01-29 14:57:52
Question: In the code below, I don't understand the meaning of the workers parameter.

model = Word2Vec(sentences, size=300000, window=2, min_count=5, workers=4)

Answer 1: workers = use this many worker threads to train the model (faster training on multicore machines). If your system has 2 cores and you specify workers=2, the data will be trained in two parallel streams. (Note: gensim's actual default is workers=3, not 1.) Answer 2: As others have mentioned, workers controls the number of independent threads
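A small sketch showing the parameter in context (gensim 4.x renamed size to vector_size; the toy corpus is made up):

from gensim.models import Word2Vec

# Toy corpus; a real corpus would be an iterable of tokenised sentences.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["dogs", "and", "cats", "are", "pets"],
] * 1000

# workers=4 trains with four threads; throughput scales with CPU cores
# because the hot loops release the GIL.
model = Word2Vec(sentences, vector_size=100, window=2, min_count=5, workers=4)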

How to train a custom NER in Spacy with a single-word data set?

Submitted by 故事扮演 on 2021-01-29 13:22:26
Question: I am trying to train a custom NER in Spacy with the new entity 'ANIMAL', but I have a data set of single words, such as:

TRAIN_DATA = [
    ("Whale_ Blue", {"entities": [(0, 11, LABEL)]}),
    ("Shark_ whale", {"entities": [(0, 12, LABEL)]}),
    ("Elephant_ African", {"entities": [(0, 17, LABEL)]}),
    ("Elephant_ Indian", {"entities": [(0, 16, LABEL)]}),
    ("Giraffe_ male", {"entities": [(0, 13, LABEL)]}),
    ("Mule", {"entities": [(0, 4, LABEL)]}),
    ("Camel", {"entities": [(0, 5, LABEL)]}),
    ("Horse", {"entities": [(0, 5, LABEL)]})
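A minimal spaCy v3-style training loop for such data might look like the sketch below (only a subset of the examples is shown; note that a model trained only on isolated words sees no sentence context, so mixing in full-sentence examples usually helps):

import random
import spacy
from spacy.training import Example

LABEL = "ANIMAL"
TRAIN_DATA = [
    ("Whale_ Blue", {"entities": [(0, 11, LABEL)]}),
    ("Mule", {"entities": [(0, 4, LABEL)]}),
    ("Camel", {"entities": [(0, 5, LABEL)]}),
]

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label(LABEL)

optimizer = nlp.initialize()
for epoch in range(30):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer, losses=losses)
    print(epoch, losses)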

CUDA out of memory

Submitted by 心不动则不痛 on 2021-01-29 12:58:23
Question: I am getting an error when trying to run a BERT model for an NER task: "CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 3.82 GiB total capacity; 2.58 GiB already allocated; 25.38 MiB free; 6.33 MiB cached)". I have also tried reducing the batch size to 1.

epochs = 10
max_grad_norm = 1.0
for _ in trange(epochs, desc="Epoch"):
    # TRAIN loop
    model.train()
    tr_loss = 0
    nb_tr_examples, nb_tr_steps = 0, 0
    for step, batch in enumerate(train_dataloader):
        # add batch to gpu
        batch = tuple(t.to
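One common workaround, sketched below under stated assumptions: keep batch_size=1 but accumulate gradients over several steps to emulate a larger batch. model, optimizer, train_dataloader and max_grad_norm are taken from the question; the (input_ids, mask, labels) batch layout and the loss-first model output are assumptions about the setup:

import torch

device = torch.device("cuda")
accumulation_steps = 8  # emulate an effective batch of 8 with batch_size=1

model.train()
optimizer.zero_grad()
for step, batch in enumerate(train_dataloader):
    # Assumed batch layout: (input_ids, attention_mask, labels).
    input_ids, input_mask, labels = tuple(t.to(device) for t in batch)
    outputs = model(input_ids, attention_mask=input_mask, labels=labels)
    loss = outputs[0] / accumulation_steps  # scale so gradients average out
    loss.backward()
    if (step + 1) % accumulation_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        optimizer.step()
        optimizer.zero_grad()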

Training a model to identify names appearing in a sentence

Submitted by 时间秒杀一切 on 2021-01-29 11:08:20
Question: I have a dataset containing the names of about 238,583 people. The names can contain more than one word, for example: Willie Enriquez, James J Johnson, D.J. Khaled. My problem is to identify these names when they appear in a sentence. I am trying to create a machine learning model that can identify whether the input is a name or not. My trouble is figuring out the input and output of this model. Since I have a bunch of names, I can train a model that recognises a name when the input is a name, but
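If the 238,583 names are available as a list, a dictionary-matching baseline needs no training at all. A sketch using spaCy's PhraseMatcher (the three names below stand in for the full list):

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # case-insensitive match

# A few names standing in for the 238,583-name dataset.
names = ["Willie Enriquez", "James J Johnson", "D.J. Khaled"]
matcher.add("PERSON_NAME", [nlp.make_doc(n) for n in names])

doc = nlp("I listened to D.J. Khaled on the drive home.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # -> D.J. Khaled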

NER training using Spacy

Submitted by 做~自己de王妃 on 2021-01-29 11:02:12
Question: When training an empty NER model, should I include only labeled data (data that necessarily contains at least one entity), or should I also include data that does not contain any label at all (in this case, teaching the model that in some circumstances these words do not have any label)? Answer 1: If you look at the commonly used training data for NER (you can find links at http://nlpprogress.com/english/named_entity_recognition.html ), you'll see that most/every example has at least one
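For illustration, mixed training data in spaCy's format could look like the sketch below, where an example with an empty entities list acts as a negative example (all of its tokens are treated as outside any entity); the texts and labels are invented:

# Labeled example plus a "negative" example with no entities at all,
# which teaches the model that these tokens carry no label.
TRAIN_DATA = [
    ("Acme Corp hired ten engineers.", {"entities": [(0, 9, "ORG")]}),
    ("The meeting was moved to Friday.", {"entities": []}),  # no entities
]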

How to get word counts from a DataFrame based on conditions

Submitted by 核能气质少年 on 2021-01-29 08:46:39
Question: I have the following two dataframes, badges and comments. I have created a list of 'gold users' from the badges dataframe whose Class=1. Here Name means the name of the badge and Class means the level of the badge (1=Gold, 2=Silver, 3=Bronze). I have already done the text preprocessing on comments['Text'] and now want to find the count of the top 10 words for gold users in comments['Text']. I tried the given code but am getting the error "KeyError: "None of [Index(['1532', '290', '1946', '1459', '6094',
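A sketch of the filter-then-count logic, with hypothetical column names (UserId, Name, Class, Text) standing in for the real schema:

from collections import Counter
import pandas as pd

# Hypothetical frames matching the question's description.
badges = pd.DataFrame({"UserId": [1, 2, 3],
                       "Name": ["Gold Badge", "Silver Badge", "Gold Badge"],
                       "Class": [1, 2, 1]})
comments = pd.DataFrame({"UserId": [1, 2, 3],
                         "Text": ["great answer thanks",
                                  "needs more detail",
                                  "great question great answer"]})

# Users holding a Class=1 (gold) badge, then their comment text.
gold_users = badges.loc[badges["Class"] == 1, "UserId"]
gold_text = comments.loc[comments["UserId"].isin(gold_users), "Text"]

# Top 10 words across all gold-user comments.
counts = Counter(" ".join(gold_text).split())
print(counts.most_common(10))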

How to tune the parameters for gensim `LdaMulticore` in Python

Submitted by 夙愿已清 on 2021-01-29 08:24:41
Question: I was running the gensim LdaMulticore package for topic modelling in Python. I tried to understand the meaning of the parameters within LdaMulticore and found a website that provides some explanations of how to use them. As a non-expert, I have some difficulty understanding these intuitively. I also referred to some other materials on the web, but I think this page gives relatively full explanations of every parameter. From that page: chunksize: Number of documents to be used in
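For reference, a runnable sketch wiring the most commonly tuned parameters together (toy documents; note that alpha='auto' is only supported by the single-core LdaModel, not LdaMulticore):

from gensim import corpora
from gensim.models import LdaMulticore

# Toy documents; real input would be the preprocessed corpus.
docs = [
    ["topic", "model", "gensim", "lda"],
    ["latent", "dirichlet", "allocation", "topic"],
    ["gensim", "multicore", "training", "lda"],
]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

lda = LdaMulticore(
    corpus=corpus,
    id2word=dictionary,
    num_topics=2,     # number of latent topics to learn
    chunksize=2000,   # documents held in memory per training chunk
    passes=10,        # full sweeps over the whole corpus
    iterations=50,    # max inference iterations per document
    workers=3,        # worker processes; rule of thumb: cores - 1
    random_state=42,
)
print(lda.print_topics())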

Intent classification with large number of intent classes

Submitted by 眉间皱痕 on 2021-01-29 08:21:46
Question: I am working on a data set of approximately 3000 questions and I want to perform intent classification. The data set is not labelled yet, but from the business perspective there's a requirement to identify approximately 80 different intent classes. Let's assume my training data has an approximately equal number of examples of each class and is not majorly skewed towards some of the classes. I intend to convert the text to word2vec or GloVe vectors and then feed them into my classifier. I am familiar with
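A baseline sketch of the averaged-embedding approach: represent each question as the mean of its word vectors and fit a linear classifier over the intent classes. The random vectors and three-intent toy data below are stand-ins for real word2vec/GloVe lookups and the ~80 labelled classes:

import numpy as np
from sklearn.linear_model import LogisticRegression

DIM = 50

# Stand-in for real word2vec/GloVe lookups: token -> vector.
rng = np.random.default_rng(0)
vectors = {w: rng.normal(size=DIM) for w in
           ["reset", "password", "cancel", "order", "track", "my"]}

def embed(text):
    # Mean of the in-vocabulary word vectors; zeros if none are known.
    toks = [vectors[t] for t in text.lower().split() if t in vectors]
    return np.mean(toks, axis=0) if toks else np.zeros(DIM)

texts = ["reset my password", "cancel my order", "track my order",
         "password reset", "order cancel", "order track"]
labels = [0, 1, 2, 0, 1, 2]  # three of the ~80 intents, for illustration

X = np.vstack([embed(t) for t in texts])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict([embed("please reset the password")]))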

How to zero-pad on both sides and one-hot encode sequences in Keras?

Submitted by 别说谁变了你拦得住时间么 on 2021-01-29 08:18:17
Question: I have text data as follows.

X_train_orignal = np.array(['OC(=O)C1=C(Cl)C=CC=C1Cl',
                            'OC(=O)C1=C(Cl)C=C(Cl)C=C1Cl',
                            'OC(=O)C1=CC=CC(=C1Cl)Cl',
                            'OC(=O)C1=CC(=CC=C1Cl)Cl',
                            'OC1=C(C=C(C=C1)[N+]([O-])=O)[N+]([O-])=O'])

As is evident, different sequences have different lengths. How can I zero-pad the sequences on both sides to some maximum length, and then convert each sequence into a one-hot encoding based on its characters? Attempt: I used the following Keras API, but it doesn't
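Keras' pad_sequences pads only one side at a time (padding='pre' or 'post'), so a small helper that splits the padding between both sides, followed by to_categorical, is one way to do it. A sketch under that assumption:

import numpy as np
from tensorflow.keras.utils import to_categorical

X = ['OC(=O)C1=C(Cl)C=CC=C1Cl',
     'OC(=O)C1=C(Cl)C=C(Cl)C=C1Cl']

# Character vocabulary; index 0 is reserved for the padding symbol.
chars = sorted(set(''.join(X)))
char_to_idx = {c: i + 1 for i, c in enumerate(chars)}

max_len = max(len(s) for s in X)

def pad_both_sides(seq, max_len):
    # Split the required padding between the left and right sides.
    total = max_len - len(seq)
    left = total // 2
    return [0] * left + seq + [0] * (total - left)

encoded = [pad_both_sides([char_to_idx[c] for c in s], max_len) for s in X]
one_hot = to_categorical(np.array(encoded), num_classes=len(chars) + 1)
print(one_hot.shape)  # (num_samples, max_len, vocab_size)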