Question
I would like to get BERT embeddings using TensorFlow Hub. I found it very easy to get ELMo embeddings, and my steps are below. Could anyone explain how to get BERT embeddings on a Windows machine? I found this but couldn't get it to work on a Windows machine.
Go to https://tfhub.dev/google/elmo/3 and download the module.
Unzip it twice until you see "tfhub_module.pb", then pass the path of that folder to hub.Module to get the embeddings:
import tensorflow as tf
import tensorflow_hub as hub
elmo = hub.Module("C:/Users/nnnn/Desktop/BERT/elmo/3.tar/3", trainable=True)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    abc1 = sess.run(elmo(x, signature="default", as_dict=True)["default"])
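The "unzip it twice" step above can probably also be done from Python. This is only a minimal sketch, assuming the download is a gzip-compressed tar archive (which the "3.tar" folder name in the path suggests). The archive name "3.tar.gz" and the helper extract_module are hypothetical, and the demo builds a stand-in archive so it runs without a real download.

```python
import os
import tarfile
import tempfile


def extract_module(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return the extracted names."""
    # Mode "r:gz" undoes the gzip layer and the tar layer in one pass,
    # which is what the two manual unzip steps do.
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(dest_dir)
    return sorted(os.listdir(dest_dir))


# Demo with a stand-in archive (assumption: not a real TF Hub module).
work_dir = tempfile.mkdtemp()
placeholder = os.path.join(work_dir, "tfhub_module.pb")
with open(placeholder, "wb") as f:
    f.write(b"placeholder")

archive_path = os.path.join(work_dir, "3.tar.gz")  # hypothetical file name
with tarfile.open(archive_path, "w:gz") as archive:
    archive.add(placeholder, arcname="tfhub_module.pb")

module_dir = os.path.join(work_dir, "3")
os.makedirs(module_dir)
extracted = extract_module(archive_path, module_dir)
print(extracted)  # ['tfhub_module.pb']
```

The folder that ends up containing "tfhub_module.pb" (module_dir here) is the path that would be passed to hub.Module.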
Update 1
The problems I am facing are listed below; I will add them one by one. This page has the complete notebook from the same author.
- When I try "import tokenization", I get an error: ModuleNotFoundError: No module named 'tokenization'. How do I get rid of it? Do I need to download tokenization.py and refer to it? Please clarify.
Update 2
I was able to get it to work. The code, with comments, is below.
# Manually copy the code from https://github.com/google-research/bert/blob/master/tokenization.py into a new file called C:\\Users\\nn\\Desktop\\BERT\\tokenization.py
# (for some reason a direct download doesn't work)
#https://github.com/vineetm/tfhub-bert/blob/master/bert_tfhub.ipynb
#https://stackoverflow.com/questions/44891069/how-to-import-python-file
import sys
import os
print (sys.path)
script_dir = "C:\\Users\\nn\\Desktop\\BERT"
# Add the absolute directory path containing your
# module to the Python path
sys.path.append(os.path.abspath(script_dir))
import tokenization
import tensorflow_hub as hub
import tensorflow as tf
#download https://tfhub.dev/google/bert_cased_L-12_H-768_A-12/1 and unzip twice
def create_tokenizer(vocab_file='C:\\Users\\nn\\Desktop\\BERT\\bert_cased_L-12_H-768_A-12\\bert_cased_L-12_H-768_A-12~\\assets\\vocab.txt', do_lower_case=False):
    return tokenization.FullTokenizer(vocab_file=vocab_file, do_lower_case=do_lower_case)
tokenizer = create_tokenizer()
def convert_sentence_to_features(sentence, tokenizer, max_seq_len):
    tokens = ['[CLS]']
    tokens.extend(tokenizer.tokenize(sentence))
    # Truncate so that there is still room for the closing [SEP] token
    if len(tokens) > max_seq_len-1:
        tokens = tokens[:max_seq_len-1]
    tokens.append('[SEP]')
    segment_ids = [0] * len(tokens)
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_mask = [1] * len(input_ids)
    # Zero-pad ids, mask and segment ids up to max_seq_len
    zero_mask = [0] * (max_seq_len-len(tokens))
    input_ids.extend(zero_mask)
    input_mask.extend(zero_mask)
    segment_ids.extend(zero_mask)
    return input_ids, input_mask, segment_ids
def convert_sentences_to_features(sentences, tokenizer, max_seq_len=20):
    all_input_ids = []
    all_input_mask = []
    all_segment_ids = []
    for sentence in sentences:
        input_ids, input_mask, segment_ids = convert_sentence_to_features(sentence, tokenizer, max_seq_len)
        all_input_ids.append(input_ids)
        all_input_mask.append(input_mask)
        all_segment_ids.append(segment_ids)
    return all_input_ids, all_input_mask, all_segment_ids
#BERT_URL = 'https://tfhub.dev/google/bert_cased_L-12_H-768_A-12/1'
BERT_URL ='C:\\Users\\nn\\Desktop\\BERT\\bert_cased_L-12_H-768_A-12\\bert_cased_L-12_H-768_A-12~'
module = hub.Module(BERT_URL)
sess = tf.Session()
sess.run(tf.global_variables_initializer())
input_ids = tf.placeholder(dtype=tf.int32, shape=[None, None])
input_mask = tf.placeholder(dtype=tf.int32, shape=[None, None])
segment_ids = tf.placeholder(dtype=tf.int32, shape=[None, None])
bert_inputs = dict(
    input_ids=input_ids,
    input_mask=input_mask,
    segment_ids=segment_ids)
bert_outputs = module(bert_inputs, signature="tokens", as_dict=True)
sentences = ['New Delhi is the capital of India', 'The capital of India is Delhi']
input_ids_vals, input_mask_vals, segment_ids_vals = convert_sentences_to_features(sentences, tokenizer, 10)  # 10 is the max_seq_len parameter
out = sess.run(bert_outputs, feed_dict={input_ids: input_ids_vals, input_mask: input_mask_vals, segment_ids: segment_ids_vals})
out['sequence_output'].shape
out['pooled_output'].shape
out.keys()
type(out['pooled_output'])
x1 = out['sequence_output'][0, :, :]
x2 = out['sequence_output'][1, :, :]
# Each sentence has 7 words; even after adding the [CLS] and [SEP] tokens the length is 9.
# The max_seq_len parameter is 10, so why are the last rows of x1 and x2 not the same?
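On the last-row question, a likely explanation (hedged, not verified against this exact module) is that BERT still computes an output vector for the padded position. input_mask only stops other positions from attending to the padding; the padded position itself still attends to the real tokens, so its vector depends on the sentence. The usual workaround is to ignore rows where input_mask is 0. Below is a minimal pure-Python sketch of that idea, assuming a toy whitespace tokenizer and toy vocabulary instead of the real WordPiece tokenizer. pad_features and real_token_rows are hypothetical helpers, and fake_output stands in for out['sequence_output'][0].

```python
def pad_features(tokens, vocab, max_seq_len):
    """Add [CLS]/[SEP], truncate, and zero-pad, like convert_sentence_to_features."""
    tokens = ['[CLS]'] + tokens[:max_seq_len - 2] + ['[SEP]']
    input_ids = [vocab[token] for token in tokens]
    input_mask = [1] * len(input_ids)
    padding = [0] * (max_seq_len - len(input_ids))
    return input_ids + padding, input_mask + padding


def real_token_rows(sequence_output, input_mask):
    """Keep only the rows whose mask entry is 1, i.e. drop padded positions."""
    return [row for row, keep in zip(sequence_output, input_mask) if keep == 1]


# Toy whitespace tokenizer and vocabulary; id 0 is reserved for padding.
words = 'New Delhi is the capital of India'.split()
vocab = {token: i + 1 for i, token in enumerate(['[CLS]', '[SEP]'] + words)}

input_ids, input_mask = pad_features(words, vocab, max_seq_len=10)
print(input_mask)  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

# Stand-in for out['sequence_output'][0]: 10 rows, one per position.
fake_output = [[float(position)] for position in range(10)]
kept = real_token_rows(fake_output, input_mask)
print(len(kept))  # 9
```

With the real outputs, the same filtering would be applied to x1 and x2 using input_mask_vals[0] and input_mask_vals[1], so that only the 9 real-token rows are compared.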
Source: https://stackoverflow.com/questions/58961467/tensorflow-hub-to-pull-bert-embedding-on-windows-machine