Question
I made a TensorFlow CsvDataset, and I'm trying to tokenize the data as follows:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
from tensorflow import keras
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer

os.chdir('/home/nicolas/Documents/Datasets')
fname = 'rotten_tomatoes_reviews.csv'

def preprocess(target, inputs):
    tok = Tokenizer(num_words=5_000, lower=True)
    tok.fit_on_texts(inputs)
    vectors = tok.texts_to_sequences(inputs)
    return vectors, target

dataset = tf.data.experimental.CsvDataset(filenames=fname,
                                          record_defaults=[tf.int32, tf.string],
                                          header=True).map(preprocess)
Running this gives the following error:
ValueError: len requires a non-scalar tensor, got one of shape Tensor("Shape:0", shape=(0,), dtype=int32)
What I've tried: just about anything in the realm of possibilities. Note that everything runs if I remove the preprocessing step.
What the data looks like:
(<tf.Tensor: shape=(), dtype=int32, numpy=1>,
<tf.Tensor: shape=(), dtype=string, numpy=b" Some movie critic review...">)
Answer 1:
First of all, let's identify the problems in your code:

1. The first problem, which is also the reason for the given error, is that the fit_on_texts method accepts a list of texts, not a single text string. So it should be: tok.fit_on_texts([inputs]).

2. After fixing that and running the code again, you would get another error: AttributeError: 'Tensor' object has no attribute 'lower'. This is because the elements in the dataset are Tensor objects, and the map function must be able to handle them; however, the Tokenizer class is not designed to handle Tensor objects (there is a fix for this problem, but I won't address it here because of the next problem).

3. The biggest problem is that each time the map function, i.e. preprocess, is called, a new instance of the Tokenizer class is created and fit on a single text document. However, as the name of the fit_on_texts method suggests, it is designed to be applied once to all the text documents (see the short sketch after this list). In other words, fitting it on a single text document does not make sense, simply because you can't build a vocabulary from only one example (if there were a partial-fit method, it could perhaps be done that way). Therefore, you can't use the tf.keras.preprocessing.text.Tokenizer class here, i.e. it's not applicable in this specific data pipeline.
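For reference, here is a minimal sketch of how Tokenizer is normally used (the example reviews are made up): it is fit once, eagerly, on a plain Python list containing all the texts, which is exactly what a per-element map call cannot do:

from tensorflow.keras.preprocessing.text import Tokenizer

texts = ["A wonderful film.", "A dull, lifeless movie."]  # made-up examples
tok = Tokenizer(num_words=5_000, lower=True)
tok.fit_on_texts(texts)                  # fit once, on the whole corpus
vectors = tok.texts_to_sequences(texts)  # e.g. [[1, 2, 3], [1, 4, 5, 6]]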
So, what should we do? As with almost all models that deal with text data, we first need to convert the texts into numerical features, i.e. encode them, and for that we need a vocabulary set (a dictionary of tokens). The steps to take are therefore:

1. If a pre-built vocabulary is available, skip to the next step. Otherwise, tokenize all the text data and build the vocabulary.

2. Encode the text data using that vocabulary.
For the first step, we use tfds.features.text.Tokenizer to tokenize the text data and build the vocabulary by iterating over the dataset.

For the second step, we use tfds.features.text.TokenTextEncoder to encode the text data using the vocabulary built in the previous step. Note that this step uses the map method; however, map runs its function in graph mode, where eager-only operations such as .numpy() are not available, so we wrap our encode function in tf.py_function to make it usable with map.
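To make the graph-mode point concrete, here is a small standalone sketch (with a toy in-memory dataset, unrelated to the actual pipeline) showing why tf.py_function is needed:

import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices([b"great fun", b"a total bore"])

def count_tokens(text):
    # Inside `tf.py_function` the argument is an EagerTensor,
    # so calling `.numpy()` works here.
    return len(text.numpy().split())

# This would fail: inside `map`, `text` is a symbolic graph tensor
# with no `.numpy()` method:
#   ds.map(lambda text: len(text.numpy().split()))

# Wrapping the eager Python code in `tf.py_function` makes it usable with `map`.
ds = ds.map(lambda text: tf.py_function(count_tokens, inp=[text], Tout=tf.int32))
print(list(ds.as_numpy_iterator()))  # e.g. [2, 3]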
Here is the code (please also read the comments in the code: they cover additional points that are not directly part of the answer but are useful in practice):
import tensorflow as tf
import tensorflow_datasets as tfds
from collections import Counter

fname = "rotten_tomatoes_reviews.csv"
dataset = tf.data.experimental.CsvDataset(filenames=fname,
                                          record_defaults=[tf.int32, tf.string],
                                          header=True)

# Create a tokenizer instance to tokenize text data.
tokenizer = tfds.features.text.Tokenizer()

# Find unique tokens in the dataset.
lowercase = True  # set this to `False` if case-sensitivity is important.
vocabulary = Counter()
for _, text in dataset:
    if lowercase:
        text = tf.strings.lower(text)
    tokens = tokenizer.tokenize(text.numpy())
    vocabulary.update(tokens)

# Select the most common tokens as the final vocabulary set.
# Note: if you want all the tokens to be included,
# set `vocab_size = len(vocabulary)` instead.
vocab_size = 5000
vocabulary, _ = zip(*vocabulary.most_common(vocab_size))

# Create an encoder instance given our vocabulary set.
encoder = tfds.features.text.TokenTextEncoder(vocabulary,
                                              lowercase=lowercase,
                                              tokenizer=tokenizer)

# Set this to a non-zero integer if you want the texts
# to be truncated when they have more than `max_len` tokens.
max_len = None

def encode(target, text):
    text_encoded = encoder.encode(text.numpy())
    if max_len:
        text_encoded = text_encoded[:max_len]
    return text_encoded, target

# Wrap the `encode` function inside `tf.py_function` so that
# it can be used with the `map` method.
def encode_pyfn(target, text):
    text_encoded, target = tf.py_function(encode,
                                          inp=[target, text],
                                          Tout=(tf.int32, tf.int32))
    # (optional) Set the shapes for efficiency.
    text_encoded.set_shape([None])
    target.set_shape([])
    return text_encoded, target

# Apply encoding and then padding.
# Note: if you want the sequences in all the batches
# to have the same length, set the `padded_shapes` argument accordingly.
dataset = dataset.map(encode_pyfn).padded_batch(batch_size=3,
                                                padded_shapes=([None,], []))

# Important note: this dataset would probably be used as input to a model
# which uses an Embedding layer. Therefore, don't forget that you
# should set the vocabulary size for that layer properly, i.e. the
# current value of `vocab_size` does not include the padding (added
# by the `padded_batch` method) or the OOV token (added by the encoder).
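Following that last note, here is a rough sketch of how the Embedding layer could be sized; the + 2 accounts for the padding id (0) and the single OOV bucket added by the encoder, and the rest of the model is purely illustrative:

embedding_dim = 64  # illustrative choice
model = tf.keras.Sequential([
    # `vocab_size` known tokens + 1 padding id + 1 OOV id.
    tf.keras.layers.Embedding(input_dim=vocab_size + 2, output_dim=embedding_dim),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])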
Side note for future readers: notice that the order of the arguments, i.e. target, text, and their data types are based on the OP's dataset. Adapt them as needed for your own dataset/task (although at the end, i.e. return text_encoded, target, we swapped the order to match the (inputs, targets) format expected by the fit method).
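Assuming a model along the lines of the sketch above, feeding this dataset to fit would then look roughly like this (the optimizer and loss choices are only illustrative):

model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(dataset, epochs=3)  # each batch yields (padded token ids, targets)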
Source: https://stackoverflow.com/questions/61445913/how-do-i-preprocess-and-tokenize-a-tensorflow-csvdataset-inside-the-map-method