nlp

How to perform entity linking to a local knowledge graph?

Submitted by 倾然丶 夕夏残阳落幕 on 2021-02-04 16:22:36
Question: I'm building my own knowledge base from scratch, using articles found online. I am trying to map the entities from my scraped SPO triples (the subject, and potentially the object) to my own record of entities, which consists of listed companies scraped from another website. I've researched most of the libraries, but their methods focus on linking entities to large knowledge bases such as Wikipedia or YAGO, and I'm not sure how to apply those techniques to my own knowledge base.
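One simple way to link mentions to a small, local entity list is fuzzy string matching against the canonical names. Below is a minimal sketch using only the standard library's difflib; the company names and the 0.6 cutoff are illustrative assumptions, not part of the original question.

import difflib

# Toy local knowledge base: canonical company names scraped elsewhere (illustrative)
kb_entities = ["Apple Inc.", "Alphabet Inc.", "Microsoft Corporation", "Tesla, Inc."]

def link_entity(mention, candidates, cutoff=0.6):
    """Return the best-matching KB entity for a surface mention, or None."""
    lowered = [c.lower() for c in candidates]
    matches = difflib.get_close_matches(mention.lower(), lowered, n=1, cutoff=cutoff)
    if not matches:
        return None
    return candidates[lowered.index(matches[0])]  # map back to the canonical spelling

print(link_entity("apple", kb_entities))            # -> Apple Inc.
print(link_entity("Mircosoft Corp", kb_entities))   # tolerates misspelled mentions

For a larger entity list, a dedicated fuzzy-matching library or an alias table built from the scraped company records would scale better than pairwise string comparison.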

Python NLP British English vs American English

Submitted by 我的梦境 on 2021-02-04 13:48:26
Question: I'm currently working on NLP in Python. My corpus contains both British and American English (realise/realize), and I'm thinking of converting the British spellings to American. However, I did not find a good tool/package to do that. Any suggestions? Answer 1: I've not been able to find a package either, but try this (note that I've had to trim the us2gb dictionary substantially for it to fit within the Stack Overflow character limit - you'll have to rebuild this yourself). # Based on Shengy's code: #
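The rest of the answer is cut off, but the approach it describes is a word-level substitution driven by a British-to-American dictionary. A minimal sketch of that idea follows; the four dictionary entries are purely illustrative stand-ins for the much larger mapping the answer refers to.

import re

# Tiny illustrative mapping; the real dictionary referenced in the answer is far larger
gb2us = {
    "realise": "realize",
    "colour": "color",
    "organise": "organize",
    "analyse": "analyze",
}

# One regex that matches any British spelling as a whole word
pattern = re.compile(r"\b(" + "|".join(map(re.escape, gb2us)) + r")\b")

def americanize(text):
    """Replace British spellings with their American equivalents."""
    return pattern.sub(lambda m: gb2us[m.group(1)], text)

print(americanize("I realise the colour is off."))  # -> I realize the color is off.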

How do I count all occurrences of a phrase in a text file using regular expressions?

Submitted by 半城伤御伤魂 on 2021-01-29 22:44:21
Question: I am reading in multiple files from a directory and attempting to count how many times a specific phrase (in this instance "at least") occurs in each file (not just whether it occurs, but how many times per text file). My code is as follows: import glob import os path = 'D:/Test' k = 0 for filename in glob.glob(os.path.join(path, '*.txt')): if filename.endswith('.txt'): f = open(filename) data = f.read() data.split() data.lower() S = re.findall(r' at least ', data, re.MULTILINE) count
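The preview above is cut off, and the visible code has a few problems: re is never imported, and the results of data.split() and data.lower() are discarded because string methods return new strings rather than modifying in place. A minimal repaired sketch of what the loop appears to be attempting (the path and phrase are the asker's; everything else is a suggested rewrite):

import glob
import os
import re

path = 'D:/Test'

for filename in glob.glob(os.path.join(path, '*.txt')):
    with open(filename) as f:
        data = f.read().lower()                     # assign the result; str methods don't mutate
    count = len(re.findall(r'\bat least\b', data))  # \b word boundaries instead of literal spaces
    print(filename, count)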

word2vec cosine similarity greater than 1 for Arabic text

Submitted by 自作多情 on 2021-01-29 22:01:22
Question: I have trained a word2vec model with gensim and I am retrieving the nearest neighbors for some words in the corpus. Here are the similarity scores: top neighbors for الاحتلال: الاحتلال: 1.0000001192092896 الاختلال: 0.9541053175926208 الاهتلال: 0.872565507888794 الاحثلال: 0.8386293649673462 الاكتلال: 0.8209128379821777 It is odd to get a similarity greater than 1. I cannot apply any stemming to my text because it contains many OCR spelling mistakes (I got the text from OCR-ed documents).
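A self-similarity of 1.0000001 is almost certainly floating-point rounding on near-unit vectors rather than a modelling problem. The sketch below computes cosine similarity with explicit normalisation and clips the result to [-1, 1]; the gensim lookup is commented out and assumed, since the trained model is not available here.

import numpy as np

def cosine(u, v):
    """Cosine similarity with explicit normalisation, clipped to [-1, 1]."""
    sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.clip(sim, -1.0, 1.0))

# With the trained gensim model (assumed): vec = model.wv['الاحتلال']; cosine(vec, vec)
u = np.random.rand(100).astype(np.float32)
print(cosine(u, u))   # never exceeds 1.0 because of the clip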

AttributeError: 'Tensor' object has no attribute '_keras_history' using CRF

Submitted by 牧云@^-^@ on 2021-01-29 21:42:09
Question: I know there are a number of questions about this problem, and I have read some of them, but none of the answers worked for me. I am trying to build a model with the following architecture. The code is as follows: token_inputs = Input((32,), dtype=tf.int32, name='input_ids') mask_inputs = Input((32,), dtype=tf.int32, name='attention_mask') seg_inputs = Input((32,), dtype=tf.int32, name='token_type_ids') seq_out, _ = bert_model([token_inputs, mask_inputs, seg_inputs]) bd = Bidirectional(LSTM(units=50,
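The snippet is cut off before the CRF layer, but errors about a missing '_keras_history' attribute typically mean a tensor handed to a Keras layer or Model was produced by a raw TensorFlow op rather than by a Keras layer. A common workaround is to wrap the offending op in a Lambda layer; the sketch below illustrates only that pattern, not the asker's full BERT + CRF model.

import tensorflow as tf
from tensorflow.keras.layers import Input, Lambda, Dense
from tensorflow.keras.models import Model

x = Input((32,), dtype=tf.float32)
# raw = tf.expand_dims(x, -1)                      # raw TF op: its output carries no Keras history
raw = Lambda(lambda t: tf.expand_dims(t, -1))(x)   # wrapped in Lambda, so Keras can track it
out = Dense(1)(raw)

model = Model(x, out)
model.summary()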

Create SavedModel for BERT

Submitted by 怎甘沉沦 on 2021-01-29 18:20:39
Question: I'm using this Colab for a BERT model. In the last cells, in order to make predictions, we have: def getPrediction(in_sentences): labels = ["Negative", "Positive"] input_examples = [run_classifier.InputExample(guid="", text_a = x, text_b = None, label = 0) for x in in_sentences] # here, "" is just a dummy label input_features = run_classifier.convert_examples_to_features(input_examples, label_list, MAX_SEQ_LENGTH, tokenizer) predict_input_fn = run_classifier.input_fn_builder(features=input_features,
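The getPrediction snippet is cut off, but exporting the Colab's Estimator-based BERT classifier as a SavedModel generally comes down to defining a serving input function and calling export_saved_model. The sketch below assumes TF 1.x (as in the BERT Colab) and that the estimator and MAX_SEQ_LENGTH already exist from earlier cells; the feature names mirror what run_classifier typically produces, which is an assumption to verify against the Colab.

import tensorflow as tf

MAX_SEQ_LENGTH = 128  # assumed; must match the value used when training the estimator

def serving_input_fn():
    # Placeholders for the features the BERT classifier expects at serving time
    input_ids = tf.placeholder(tf.int32, [None, MAX_SEQ_LENGTH], name='input_ids')
    input_mask = tf.placeholder(tf.int32, [None, MAX_SEQ_LENGTH], name='input_mask')
    segment_ids = tf.placeholder(tf.int32, [None, MAX_SEQ_LENGTH], name='segment_ids')
    label_ids = tf.placeholder(tf.int32, [None], name='label_ids')
    features = {'input_ids': input_ids, 'input_mask': input_mask,
                'segment_ids': segment_ids, 'label_ids': label_ids}
    return tf.estimator.export.ServingInputReceiver(features, features)

# estimator is the tf.estimator.Estimator built earlier in the Colab (assumed to exist)
# estimator.export_saved_model('bert_saved_model', serving_input_fn)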

Slice JSON File into Different Time Intercepts with Python

Submitted by 心不动则不痛 on 2021-01-29 15:53:09
Question: For a current research project, I am trying to slice a JSON file into different time intercepts. Based on the "Date" field, I want to analyse the content of the JSON file by quarter, i.e. 01 January - 31 March, 01 April - 30 June, etc. The code would ideally have to pick the oldest date in the file and add quarterly time intercepts on top of that. I have researched this but not found any helpful methods yet. Is there any smart way to include this in the code? The JSON file has the
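The question is cut off, but for slicing records by quarter starting from the oldest date, pandas makes this straightforward once the "Date" field is parsed. The sketch below guesses at the data shape (a list of objects with a "Date" key), which is an assumption since the JSON structure is not shown.

import json
import pandas as pd

# Hypothetical structure: a list of objects, each carrying a "Date" field (assumption)
raw = '[{"Date": "2020-01-15", "text": "a"}, {"Date": "2020-05-02", "text": "b"}]'
df = pd.DataFrame(json.loads(raw))
df['Date'] = pd.to_datetime(df['Date'])

# Option 1: calendar quarters (Jan-Mar, Apr-Jun, ...)
df['quarter'] = df['Date'].dt.to_period('Q')

# Option 2: rolling quarters counted from the oldest date in the file
start = df['Date'].min()
df['quarter_from_start'] = (df['Date'] - start).dt.days // 91   # approx. 91-day quarters

for q, chunk in df.groupby('quarter'):
    print(q, len(chunk))   # analyse each quarter's slice separately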

Dataframe aggregation of n-grams, their frequencies, and associating the entries of other columns with them using R

Submitted by 人盡茶涼 on 2021-01-29 15:48:18
Question: I am trying to aggregate a dataframe based on 1-gram frequency (this can be extended to n-grams by changing n in the code below) and associate other columns with it. My approach is shown below. Are there any shortcuts/alternatives for producing the table shown at the very end of this question from the dataframe given below? The code and results follow. The chunk below sets up the environment, loads the libraries, and reads the dataframe: # Clear variables in the working environment