How do I preprocess and tokenize a TensorFlow CsvDataset inside the map method?


Question


I made a TensorFlow CsvDataset, and I'm trying to tokenize the data as such:

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
from tensorflow import keras
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
os.chdir('/home/nicolas/Documents/Datasets')

fname = 'rotten_tomatoes_reviews.csv'


def preprocess(target, inputs):
    tok = Tokenizer(num_words=5_000, lower=True)
    tok.fit_on_texts(inputs)
    vectors = tok.texts_to_sequences(inputs)
    return vectors, target


dataset = tf.data.experimental.CsvDataset(filenames=fname,
                                          record_defaults=[tf.int32, tf.string],
                                          header=True).map(preprocess)

Running this gives the following error:

ValueError: len requires a non-scalar tensor, got one of shape Tensor("Shape:0", shape=(0,), dtype=int32)

What I've tried: just about anything in the realm of possibilities. Note that everything runs if I remove the preprocessing step.

What the data looks like:

(<tf.Tensor: shape=(), dtype=int32, numpy=1>,
 <tf.Tensor: shape=(), dtype=string, numpy=b" Some movie critic review...">)

Answer 1:


First, let's identify the problems in your code:

  • The first problem, which is also the reason for the given error, is that the fit_on_texts method accepts a list of texts, not a single text string. Therefore, it should be: tok.fit_on_texts([inputs]). (A short sketch of the intended usage follows this list.)

  • After fixing that and running the code again, you would get another error: AttributeError: 'Tensor' object has no attribute 'lower'. This is because the elements of the dataset are Tensor objects, and the map function must be able to handle them; however, the Tokenizer class is not designed to handle Tensor objects (there is a fix for this problem, but I won't address it here because of the next problem).

  • The biggest problem is that each time the map function, i.e. preprocess, is called, a new instance of the Tokenizer class is created and fit on a single text document. However, as the name of the fit_on_texts method suggests, it is designed to be applied to all the text documents, once. Fitting it on a single document makes no sense, simply because you can't build a vocabulary from only one example (if there were a partial-fit method, it could perhaps be done that way). Therefore, you can't use the tf.keras.preprocessing.text.Tokenizer class here, i.e. it's not applicable in this specific data pipeline.
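
To make the intended usage concrete, here is a minimal standalone sketch (outside any tf.data pipeline): Tokenizer is fit exactly once on a plain Python list of strings, not per-element on scalar Tensors. The example texts below are made up for illustration.

from tensorflow.keras.preprocessing.text import Tokenizer

texts = ["A wonderful, heartfelt movie.",    # hypothetical reviews,
         "A dull and predictable sequel."]   # not from the actual dataset

tok = Tokenizer(num_words=5_000, lower=True)
tok.fit_on_texts(texts)                  # expects a list of strings, called once
vectors = tok.texts_to_sequences(texts)  # e.g. [[1, 2, 3, 4], [1, 5, 6, 7, 8]]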


So, what should we do? In almost all models that deal with text data, we first need to convert the texts into numerical features, i.e. encode them. To perform the encoding, we first need a vocabulary set or a dictionary of tokens. Therefore, the steps we should take are as follows:

  1. If there is a pre-built vocabulary available, then skip to the next step. Otherwise, tokenize all the text data first and build the vocabulary.

  2. Encode the text data using the vocabulary set.

For the first step, we use tfds.features.text.Tokenizer to tokenize the text data and build the vocabulary by iterating over the dataset.

For the second step, we use tfds.features.text.TokenTextEncoder to encode the text data with the vocabulary set built in the previous step. Note that this step uses the map method; however, since map traces its function in graph mode, we wrap our encode function in tf.py_function so that it can be used with map.

Here is the code (please read the comments in the code for additional points; I have not covered them in the answer text because they are not directly relevant, but they are useful and practical):

import tensorflow as tf
import tensorflow_datasets as tfds
from collections import Counter

fname = "rotten_tomatoes_reviews.csv"
dataset = tf.data.experimental.CsvDataset(filenames=fname,
                                          record_defaults=[tf.int32, tf.string],
                                          header=True)

# Create a tokenizer instance to tokenize text data.
tokenizer = tfds.features.text.Tokenizer()

# Find unique tokens in the dataset.
lowercase = True  # set this to `False` if case-sensitivity is important.
vocabulary = Counter()
for _, text in dataset:
    if lowercase:
        text = tf.strings.lower(text)
    tokens = tokenizer.tokenize(text.numpy())
    vocabulary.update(tokens)

# Select the most common tokens as final vocabulary set.
# Note: if you want all the tokens to be included,
# set `vocab_size = len(vocabulary)` instead.
vocab_size = 5000
vocabulary, _ = zip(*vocabulary.most_common(vocab_size))

# Create an encoder instance given our vocabulary set.
encoder = tfds.features.text.TokenTextEncoder(vocabulary,
                                              lowercase=lowercase,
                                              tokenizer=tokenizer)

# Set this to a non-zero integer if you want the texts
# to be truncated when they have more than `max_len` tokens.
max_len = None

def encode(target, text):
    text_encoded = encoder.encode(text.numpy())
    if max_len:
        text_encoded = text_encoded[:max_len]
    return text_encoded, target

# Wrap `encode` function inside `tf.py_function` so that
# it could be used with `map` method.
def encode_pyfn(target, text):
    text_encoded, target = tf.py_function(encode,
                                          inp=[target, text],
                                          Tout=(tf.int32, tf.int32))

    # (optional) Set the shapes for efficiency.
    text_encoded.set_shape([None])
    target.set_shape([])

    return text_encoded, target

# Apply encoding and then padding.
# Note: if you want the sequences in all the batches 
# to have the same length, set `padded_shapes` argument accordingly.
dataset = dataset.map(encode_pyfn).padded_batch(batch_size=3,
                                                padded_shapes=([None,], []))

# Important note: this dataset would probably be fed to a model that
# uses an Embedding layer. Therefore, don't forget to set the vocabulary
# size of that layer properly: the current value of `vocab_size` does not
# account for the padding index (added by the `padded_batch` method) or
# the OOV token (added by the encoder). A short sketch follows below.
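
As a complement to the note above, here is a minimal sketch of a model consuming this dataset, assuming a simple binary-classification architecture (the choice of layers and their sizes is illustrative, not part of the question). The + 2 on the embedding input dimension accounts for the padding index 0 and the OOV token.

model = tf.keras.Sequential([
    # `vocab_size + 2` covers the padding index (0) and the OOV token.
    tf.keras.layers.Embedding(input_dim=vocab_size + 2, output_dim=64),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(dataset, epochs=3)  # (text_encoded, target) pairs match `fit`'s expected format

If you want to sanity-check the pipeline before training, iterating over dataset.take(1) and printing the shapes of one batch is a quick way to confirm the padding and the (features, labels) ordering.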

Side note for future readers: the order of the arguments, i.e. target, text, and their data types are based on the OP's dataset; adapt them as needed for your own dataset/task. Note that at the end, i.e. return text_encoded, target, the order is swapped so that it matches the (features, labels) format expected by the fit method.



Source: https://stackoverflow.com/questions/61445913/how-do-i-preprocess-and-tokenize-a-tensorflow-csvdataset-inside-the-map-method
