huggingface-transformers

Pretraining a language model on a small custom corpus

Question: I was curious whether it is possible to use transfer learning in text generation, and re-train/pre-train a model on a specific kind of text. For example, given a pre-trained BERT model and a small corpus of medical (or any "type" of) text, make a language model that is able to generate medical text. The assumption is that you do not have a huge amount of medical text, which is why you have to use transfer learning. Putting it as a pipeline, I would describe this as: Using a pre-trained BERT

Hugging-Face Transformers: Loading model from path error

Question: I am pretty new to Hugging Face Transformers. I am facing the following issue when I try to load the xlm-roberta-base model from a given path:

    >>> tokenizer = AutoTokenizer.from_pretrained(model_path)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/user/anaconda3/lib/python3.7/site-packages/transformers/tokenization_auto.py", line 182, in from_pretrained
        return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
      File "/home/user

Pytorch error “RuntimeError: index out of range: Tried to access index 512 out of table with 511 rows”

Question: I have sentences that I vectorize using the sentence_vector() method of the BiobertEmbedding Python module (https://pypi.org/project/biobert-embedding/). For some groups of sentences I have no problem, but for others I get the following error message:

    File "/home/nobunaga/.local/lib/python3.6/site-packages/biobert_embedding/embedding.py", line 133, in sentence_vector
      encoded_layers = self.eval_fwdprop_biobert(tokenized_text)
    File "/home/nobunaga/.local/lib/python3.6/site-packages/biobert
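
An error of this kind usually suggests that the tokenized input is longer than the model's position-embedding table. A minimal sketch of one workaround, truncating the token list before the forward pass, might look like the following. The helper name truncate_tokens, the default limit of 512, and the assumption that the tokens do not yet include [CLS] and [SEP] are all illustrative and are not taken from the biobert-embedding package.

```python
def truncate_tokens(tokens, max_positions=512, cls="[CLS]", sep="[SEP]"):
    """Clip a token list so it fits in `max_positions` slots.

    Hypothetical helper: assumes `tokens` holds no special tokens yet
    and that the model reserves one slot each for [CLS] and [SEP].
    """
    budget = max_positions - 2  # room left after the two special tokens
    return [cls] + list(tokens[:budget]) + [sep]


# Made-up input: 1000 identical word pieces standing in for a long sentence.
long_input = ["word"] * 1000
clipped = truncate_tokens(long_input)
print(len(clipped))
```

Truncation discards everything past the limit, so for long documents it may be preferable to split the text into several chunks and embed each one separately.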

Save model wrapped in Keras

Question: Sorry for my naive question, but I am trying to save my Keras model () in which I use the TFBertModel() function as a hidden layer. To do that I use the save() function provided by the tf.keras package. But I got this error:

    ---------------------------------------------------------------------------
    NotImplementedError                       Traceback (most recent call last)
    <ipython-input-13-3b315f7219da> in <module>()
    ----> 1 model.save('model_weights.h5')

    8 frames
    /tensorflow-2.1.0/python3.6/tensorflow_core/python

How to use Hugging Face Transformers library in Tensorflow for text classification on custom data?

Question: I am trying to do binary text classification on custom data (in CSV format) using the different transformer architectures that the Hugging Face Transformers library offers. I am using this TensorFlow blog post as a reference. I am loading the custom dataset into tf.data.Dataset format using the following code:

    def get_dataset(file_path, **kwargs):
        dataset = tf.data.experimental.make_csv_dataset(
            file_path,
            batch_size=5,  # Artificially small to make examples easier to show.
            na_value="",
            num

How to reconstruct text entities with Hugging Face's transformers pipelines without IOB tags?

Question: I've been looking to use Hugging Face's pipelines for NER (named entity recognition). However, the pipeline returns the entity labels in inside-outside-beginning (IOB) format but without the IOB labels, so I'm not able to map its output back to my original text. Moreover, the outputs are masked in BERT tokenization format (the default model is BERT-large). For example:

    from transformers import pipeline
    nlp_bert_lg = pipeline('ner')
    print(nlp_bert_lg('Hugging Face is a French
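
One way to map such output back to readable entities is to glue WordPiece continuations (tokens starting with "##") onto the preceding piece and to join adjacent tokens that share a label. The sketch below illustrates that idea in plain Python. It assumes each token is a dict with "word", "entity", and "index" keys; that shape, the function name group_entities, and the sample values are illustrative assumptions, not the pipeline's confirmed output.

```python
def group_entities(tokens):
    """Merge WordPiece tokens into (text, label) pairs.

    Assumed input shape: dicts with "word", "entity" and "index",
    where "index" is the token's position in the tokenized sentence.
    """
    groups = []
    for tok in tokens:
        word, label, idx = tok["word"], tok["entity"], tok["index"]
        is_piece = word.startswith("##")
        text = word[2:] if is_piece else word
        prev = groups[-1] if groups else None
        adjacent = prev is not None and idx == prev["end"] + 1
        if adjacent and is_piece:
            # Continuation of the previous word: join without a space
            # and keep the label of the word's first piece.
            prev["text"] += text
            prev["end"] = idx
        elif adjacent and label == prev["label"]:
            # Next word of the same entity: join with a space.
            prev["text"] += " " + text
            prev["end"] = idx
        else:
            # Gap in positions or a new label: start a new entity.
            groups.append({"text": text, "label": label, "end": idx})
    return [(g["text"], g["label"]) for g in groups]


# Made-up sample tokens, for illustration only.
sample = [
    {"word": "Hu", "entity": "I-ORG", "index": 1},
    {"word": "##gging", "entity": "I-ORG", "index": 2},
    {"word": "Face", "entity": "I-ORG", "index": 3},
    {"word": "New", "entity": "I-LOC", "index": 10},
    {"word": "York", "entity": "I-LOC", "index": 11},
]
print(group_entities(sample))
```

Because the labels carry no B-/I- distinction here, two different entities of the same type that sit directly next to each other would be merged into one; position gaps are the only boundary signal this sketch uses.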

BERT token importance measuring issue. Grad is none

Question: I am trying to measure token importance for BERT by comparing the gradient values of the token embeddings. To get the gradient, I copied the forward method of BertModel from Hugging Face Transformers 2.8.0 (https://github.com/huggingface/transformers/blob/11c3257a18c4b5e1a3c1746eefd96f180358397b/src/transformers/modeling_bert.py) and changed it a bit. Code:

    embedding_output = self.embeddings(
        input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds
    )
    embedding