huggingface-transformers

Pretraining a language model on a small custom corpus

烂漫一生 posted on 2020-07-21 07:55:47
Question: I was curious whether it is possible to use transfer learning for text generation by re-training (or further pre-training) a model on a specific kind of text. For example, given a pre-trained BERT model and a small corpus of medical (or any other domain-specific) text, build a language model that is able to generate medical text. The assumption is that you do not have a large amount of medical text, which is why you have to use transfer learning. Putting it as a pipeline, I would describe this as: Using a pre-trained BERT

Hugging-Face Transformers: Loading model from path error

别来无恙 posted on 2020-07-10 10:28:16
Question: I am pretty new to Hugging Face transformers. I am facing the following issue when I try to load the xlm-roberta-base model from a given path:

    >> tokenizer = AutoTokenizer.from_pretrained(model_path)
    >> Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/user/anaconda3/lib/python3.7/site-packages/transformers/tokenization_auto.py", line 182, in from_pretrained
        return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
      File "/home/user
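
The traceback is cut off in this excerpt, so the actual cause is not visible. One cheap first check when loading from a local path is whether the directory holds the files the loader looks for. The sketch below uses only the standard library; the helper name `missing_files` is hypothetical, and the file names in `EXPECTED_FILES` are an assumption about what an xlm-roberta-base checkpoint directory usually contains, not something stated in the question.

```python
from pathlib import Path
from typing import Iterable, List

# Assumption: files a local xlm-roberta-base directory is usually expected to hold.
EXPECTED_FILES = ("config.json", "sentencepiece.bpe.model")


def missing_files(model_path: str, expected: Iterable[str] = EXPECTED_FILES) -> List[str]:
    """Return the expected file names that are absent from model_path."""
    directory = Path(model_path)
    if not directory.is_dir():
        # The path is not a directory at all, so every expected file is missing.
        return list(expected)
    return [name for name in expected if not (directory / name).is_file()]
```

An empty list means the directory at least looks complete under these assumptions; a non-empty list names the files to look into before calling `from_pretrained(model_path)` again.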

Pytorch error “RuntimeError: index out of range: Tried to access index 512 out of table with 511 rows”

折月煮酒 posted on 2020-06-29 03:43:43
Question: I have sentences that I vectorize using the sentence_vector() method of the BiobertEmbedding Python module (https://pypi.org/project/biobert-embedding/). For some groups of sentences I have no problem, but for others I get the following error message:

    File "/home/nobunaga/.local/lib/python3.6/site-packages/biobert_embedding/embedding.py", line 133, in sentence_vector
      encoded_layers = self.eval_fwdprop_biobert(tokenized_text)
    File "/home/nobunaga/.local/lib/python3.6/site-packages/biobert
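
The error in the title points at an index that falls outside an embedding table, which for BERT-style models often means a tokenized sentence is longer than the model's maximum number of positions. The following is a minimal sketch of keeping token lists within such a limit, assuming a limit of 512 positions that includes the `[CLS]` and `[SEP]` markers. The helper names are hypothetical and are not part of biobert_embedding.

```python
from typing import List

# Assumption: BERT-style limit on sequence length, special tokens included.
MAX_POSITIONS = 512


def truncate_tokens(tokens: List[str], max_positions: int = MAX_POSITIONS) -> List[str]:
    """Wrap tokens in [CLS] ... [SEP], cutting them so the total fits max_positions."""
    budget = max_positions - 2  # reserve two slots for [CLS] and [SEP]
    if budget < 1:
        raise ValueError("max_positions must be at least 3")
    return ["[CLS]"] + tokens[:budget] + ["[SEP]"]


def split_into_windows(tokens: List[str], max_positions: int = MAX_POSITIONS) -> List[List[str]]:
    """Split a long token list into consecutive windows that each fit the limit."""
    budget = max_positions - 2
    if budget < 1:
        raise ValueError("max_positions must be at least 3")
    return [
        truncate_tokens(tokens[start:start + budget], max_positions)
        for start in range(0, len(tokens), budget)
    ]
```

Truncation discards the tail of a long sentence, while windowing keeps all tokens at the cost of producing several vectors per sentence; which one fits depends on how the sentence vectors are used downstream.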

Save model wrapped in Keras

偶尔善良 posted on 2020-06-27 14:57:20
Question: Sorry for my naive question, but I am trying to save my Keras model, in which I use the TFBertModel() function as a hidden layer. To do that I use the save() function provided by the tf.keras package, but I got this error:

    ---------------------------------------------------------------------------
    NotImplementedError                       Traceback (most recent call last)
    <ipython-input-13-3b315f7219da> in <module>()
    ----> 1 model.save('model_weights.h5')

    8 frames
    /tensorflow-2.1.0/python3.6/tensorflow_core/python

How to use Hugging Face Transformers library in Tensorflow for text classification on custom data?

谁都会走 posted on 2020-05-29 03:28:32
Question: I am trying to do binary text classification on custom data (in CSV format) using the different transformer architectures that the Hugging Face Transformers library offers. I am using this TensorFlow blog post as a reference. I am loading the custom dataset into tf.data.Dataset format using the following code:

    def get_dataset(file_path, **kwargs):
        dataset = tf.data.experimental.make_csv_dataset(
            file_path,
            batch_size=5,  # Artificially small to make examples easier to show.
            na_value="",
            num

How to reconstruct text entities with Hugging Face's transformers pipelines without IOB tags?

旧时模样 posted on 2020-05-15 05:13:10
Question: I've been looking to use Hugging Face's pipelines for NER (named entity recognition). However, the pipeline returns the entity labels in inside-outside-beginning (IOB) format but without the IOB labels, so I'm not able to map its output back to my original text. Moreover, the outputs are masked in BERT tokenization format (the default model is BERT-large). For example:

    from transformers import pipeline
    nlp_bert_lg = pipeline('ner')
    print(nlp_bert_lg('Hugging Face is a French
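
One way to approach the reconstruction described above is to glue WordPiece fragments back together and merge neighbouring tokens that carry the same label. The sketch below is plain Python and assumes each pipeline item is a dict with a `word` and an `entity` key, with `##` marking a continuation piece. The function name `group_entities` and that input shape are assumptions for illustration, not the library's documented output.

```python
from typing import Dict, List


def group_entities(tokens: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Merge WordPiece fragments and adjacent same-label tokens into entities.

    Assumes items look like {"word": str, "entity": str} and that "##" marks
    a continuation of the previous word (BERT WordPiece convention).
    """
    entities: List[Dict[str, str]] = []
    for tok in tokens:
        word, label = tok["word"], tok["entity"]
        if word.startswith("##") and entities:
            # Continuation piece: append to the previous word without a space.
            entities[-1]["word"] += word[2:]
        elif entities and entities[-1]["entity"] == label:
            # New word with the same label as the previous token: same entity.
            entities[-1]["word"] += " " + word
        else:
            entities.append({"word": word, "entity": label})
    return entities
```

On a hand-written input such as `[{"word": "Hu", "entity": "I-ORG"}, {"word": "##gging", "entity": "I-ORG"}, {"word": "Face", "entity": "I-ORG"}]`, this yields a single `I-ORG` entity with the word `Hugging Face`. A known limitation: without token positions, two distinct entities of the same type that are separated only by unlabeled words would be merged, so this is a starting point rather than a complete solution.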

BERT token importance measuring issue. Grad is none

我只是一个虾纸丫 posted on 2020-04-30 06:36:26
Question: I am trying to measure token importance for BERT by comparing the gradient values of the token embeddings. To get the gradient, I copied the forward method of BertModel from huggingface transformers 2.8.0 (https://github.com/huggingface/transformers/blob/11c3257a18c4b5e1a3c1746eefd96f180358397b/src/transformers/modeling_bert.py) and changed it a bit. Code:

    embedding_output = self.embeddings(
        input_ids=input_ids,
        position_ids=position_ids,
        token_type_ids=token_type_ids,
        inputs_embeds=inputs_embeds
    )
    embedding