Question
I am trying to do binary text classification on custom data (in CSV format) using the different transformer architectures that the Hugging Face 'Transformers' library offers. I am using this TensorFlow blog post as a reference.
I am loading the custom dataset into 'tf.data.Dataset' format using the following code:
def get_dataset(file_path, **kwargs):
    dataset = tf.data.experimental.make_csv_dataset(
        file_path,
        batch_size=5,  # Artificially small to make examples easier to show.
        na_value="",
        num_epochs=1,
        ignore_errors=True,
        **kwargs)
    return dataset
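For context, a hypothetical call to this helper would look like the following (the file name and the label column name are placeholders for your own CSV):

# Hypothetical usage: "train.csv" and the "target" column are placeholders.
train_data = get_dataset("train.csv", label_name="target")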
After this, I tried tokenizing with the 'glue_convert_examples_to_features' method as below:
train_dataset = glue_convert_examples_to_features(
    examples=train_data,
    tokenizer=tokenizer,
    task=None,
    label_list=['0', '1'],
    max_length=128
)
which throws the error "UnboundLocalError: local variable 'processor' referenced before assignment" at:
if is_tf_dataset:
    example = processor.get_example_from_tensor_dict(example)
    example = processor.tfds_map(example)
In all the examples I have seen, a pre-defined task such as 'mrpc' is used, for which a GLUE processor exists; since I pass task=None, no processor is ever assigned. The error is raised at line 85 in the source code.
Can anyone help me solve this issue with custom data?
Answer 1:
I had the same starting problem.
This Kaggle submission helped me a lot. There you can see how to tokenize the data according to the chosen pre-trained model:
from tqdm import tqdm

def tokenize_sentences(sentences, tokenizer, max_seq_len=128):
    tokenized_sentences = []
    for sentence in tqdm(sentences):
        tokenized_sentence = tokenizer.encode(
            sentence,                 # Sentence to encode.
            add_special_tokens=True,  # Add '[CLS]' and '[SEP]'.
            max_length=max_seq_len,   # Truncate all sentences.
        )
        tokenized_sentences.append(tokenized_sentence)
    return tokenized_sentences
tokenizer = BertTokenizer.from_pretrained(bert_model_name, do_lower_case=True)
train_ids = tokenize_sentences(your_sentence_list, tokenizer)
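The later steps also need train_masks and train_labels, which the snippet above does not produce: train_labels is simply the list of 0/1 labels from your CSV, and the masks can be derived from the padded token IDs. As a rough sketch (the pad length of 128 and the variable names are my own assumptions):

from tensorflow.keras.preprocessing.sequence import pad_sequences

# Sketch: pad every sequence of token IDs to max_seq_len and mark real tokens with 1.
train_ids = pad_sequences(train_ids, maxlen=128, dtype="long",
                          padding="post", truncating="post", value=0)
train_masks = [[int(token_id != 0) for token_id in seq] for seq in train_ids]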
Furthermore, I looked into the source of glue_convert_examples_to_features. There you can see how a tf.data.Dataset compatible with the BERT model can be created. I created a function for this:
def create_dataset(ids, masks, labels):
    def gen():
        for i in range(len(ids)):  # iterate over the ids passed in, not a global variable
            yield (
                {
                    "input_ids": ids[i],
                    "attention_mask": masks[i]
                },
                labels[i],
            )

    return tf.data.Dataset.from_generator(
        gen,
        ({"input_ids": tf.int32, "attention_mask": tf.int32}, tf.int64),
        (
            {
                "input_ids": tf.TensorShape([None]),
                "attention_mask": tf.TensorShape([None])
            },
            tf.TensorShape([None]),
        ),
    )
train_dataset = create_dataset(train_ids, train_masks, train_labels)
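Note that the generator yields one example at a time, so the dataset has to be batched (and ideally shuffled) before it is passed to fit(). The buffer size, batch size, and the validation split names below are placeholders:

# Assumed step: shuffle and batch the per-example dataset before training.
train_dataset = train_dataset.shuffle(1000).batch(32)
val_dataset = create_dataset(val_ids, val_masks, val_labels).batch(32)  # hypothetical validation split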
I then use the dataset like this:
from transformers import TFBertForSequenceClassification, BertConfig
model = TFBertForSequenceClassification.from_pretrained(
    bert_model_name,
    config=BertConfig.from_pretrained(bert_model_name, num_labels=20)
)
# Prepare training: Compile tf.keras model with optimizer, loss and learning rate schedule
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.CategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
# Train and evaluate using tf.keras.Model.fit()
history = model.fit(train_dataset, epochs=1, steps_per_epoch=115, validation_data=val_dataset, validation_steps=7)
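After training, the model can be run on a dataset of the same structure. The sketch below (val_dataset is the hypothetical validation set from above) only illustrates that the model outputs logits, which still need an argmax to become class predictions:

import numpy as np

# Sketch: classify one batch and compare predictions with the labels.
for batch_inputs, batch_labels in val_dataset.take(1):
    logits = model(batch_inputs)[0]          # first output element holds the logits
    predictions = np.argmax(logits, axis=-1)
    print(predictions, batch_labels.numpy())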
Source: https://stackoverflow.com/questions/59978959/how-to-use-hugging-face-transformers-library-in-tensorflow-for-text-classificati