Tensorflow input pipeline where multiple rows correspond to a single observation?

Submitted by 社会主义新天地 on 2021-02-08 05:44:24

Question


So I've just started using Tensorflow, and I'm struggling to properly understand input pipelines.

The problem I'm working on is sequence classification. I'm trying to read in a CSV file with shape (100000, 4). The first 3 columns are features, and the 4th column is the label. BUT - the data represents sequences of length 10, i.e. rows 1-10 are sequence 1, rows 11-20 are sequence 2, etc. This also means each label is repeated 10 times.

So at some point in this input pipeline, I'll need to reshape my feature tensor with something like tf.reshape(features, [batch_size_, rows_per_ob, input_dim]), and take only every 10th row of my label tensor, like label[::rows_per_ob].
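
Concretely, here's a minimal sketch of those two operations (the placeholder shapes are just for illustration):

import tensorflow as tf

batch_size_, rows_per_ob, input_dim = 5, 10, 3

# A flat batch of batch_size_ * rows_per_ob rows, input_dim features each.
flat_features = tf.placeholder(tf.float32, [batch_size_ * rows_per_ob, input_dim])
flat_labels = tf.placeholder(tf.float32, [batch_size_ * rows_per_ob])

# Group every rows_per_ob consecutive rows into one observation...
features = tf.reshape(flat_features, [batch_size_, rows_per_ob, input_dim])
# ...and keep one label per observation (every 10th row).
labels = flat_labels[::rows_per_ob]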

Another thing I should point out is that my actual dataset is in the billions of rows so I have to think about performance.

I've put together the code below from the documentation and other posts on here, but I don't think I fully understand it, because I'm seeing the following error:

INFO:tensorflow:Error reported to Coordinator: , Attempting to use uninitialized value input_producer_2/limit_epochs/epochs

There seems to be an out of range error.

I also can't figure out what to do with these batches once I get them working. Initially, I thought I would reshape them and then just feed them in via "feed_dict", but then I read that this is really bad and that I should be using a tf.data.Dataset object. But I'm not sure how to feed these batches into a Dataset. I'm also not entirely sure when the optimal time in this process to reshape my data would be.

And a final point of confusion - when you use an Iterator with a Dataset object, I see that we use the get_next() method. Does this mean that each element in the Dataset represents a full batch of data? And does this then mean that if we want to change the batch size, we need to rebuild the entire Dataset object?
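
For what it's worth, here's a toy sketch of what I think the pattern is - get_next() returning one full batch per call, with the batch size fed through a placeholder to make_initializable_iterator so the graph doesn't need rebuilding (not sure if this is idiomatic):

import tensorflow as tf

batch_size_t = tf.placeholder(tf.int64, shape=[])

# Toy dataset: the integers 0..99.
dataset = tf.data.Dataset.range(100)
dataset = dataset.batch(batch_size_t)

iterator = dataset.make_initializable_iterator()
next_batch = iterator.get_next()  # one full batch per sess.run call

with tf.Session() as sess:
    sess.run(iterator.initializer, feed_dict={batch_size_t: 5})
    print(sess.run(next_batch))  # [0 1 2 3 4]
    # Re-initialize with a different batch size - no graph rebuild needed.
    sess.run(iterator.initializer, feed_dict={batch_size_t: 8})
    print(sess.run(next_batch))  # [0 1 2 3 4 5 6 7]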

I'm really struggling to fit all the pieces together. If anyone has any pointers for me, it would be very much appreciated! Thanks!

# import
import tensorflow as tf

# constants
filename = "tensorflow_test_data.csv"
num_rows = 100000
rows_per_ob = 10
batch_size_ = 5
num_epochs_ = 2
num_batches = int(num_rows * num_epochs_ / batch_size_ / rows_per_ob)

# read csv line
def read_from_csv(filename_queue):
    reader = tf.TextLineReader(skip_header_lines=1)
    _, value = reader.read(filename_queue)
    record_defaults = [[0.0], [0.0], [0.0], [0.0]]
    a, b, c, d = tf.decode_csv(value, record_defaults=record_defaults)
    features = tf.stack([a, b, c])
    return features, d

def input_pipeline(filename=filename, batch_size=batch_size_, num_epochs=num_epochs_):
    filename_queue = tf.train.string_input_producer([filename],
                                                    num_epochs=num_epochs,
                                                    shuffle=False)
    x, y = read_from_csv(filename_queue)
    x_batch, y_batch = tf.train.batch([x, y],
                                      batch_size = batch_size * rows_per_ob,
                                      num_threads=1,
                                      capacity=10000)
    return x_batch, y_batch

###
x, y = input_pipeline(filename, batch_size=batch_size_,
                      num_epochs = num_epochs_)

# I imagine using lists is wrong here - this was more just for me to
# see the output
x_list = []
y_list = []
with tf.Session() as sess:
    # num_epochs in string_input_producer creates a *local* variable
    # ("...limit_epochs/epochs"); leaving it uninitialized produces
    # exactly the error reported above.
    sess.run([tf.global_variables_initializer(),
              tf.local_variables_initializer()])
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    for _ in range(num_batches):
        x_batch, y_batch = sess.run([x, y])
        x_list.append(x_batch)
        y_list.append(y_batch)
    coord.request_stop()
    coord.join(threads)

Answer 1:


You can express the entire pipeline using tf.data.Dataset objects, which might make things slightly easier:

dataset = tf.data.TextLineDataset(filename)

# Skip the header line.
dataset = dataset.skip(1)

# Combine 10 lines into a single observation.   
dataset = dataset.batch(rows_per_ob)

def parse_observation(line_batch):
  record_defaults = [[0.0], [0.0], [0.0], [0.0]]
  a, b, c, d = tf.decode_csv(line_batch, record_defaults=record_defaults)
  # Stack along axis 1 so each observation has shape [rows_per_ob, 3].
  features = tf.stack([a, b, c], axis=1)
  label = d[-1]  # Take the label from the last row.
  return features, label

# Parse each observation into a `rows_per_ob x 3` matrix of features and a
# scalar label.
dataset = dataset.map(parse_observation)

# Batch multiple observations.
dataset = dataset.batch(batch_size)

# Optionally add a prefetch for performance.
dataset = dataset.prefetch(1)
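
Since the question mentions billions of rows: an optional performance tweak, not in the original answer, is to run the parse step on multiple threads via map's num_parallel_calls argument (available from TF 1.4), i.e. replace the plain dataset.map call above with:

# The value 4 is an arbitrary example; tune it for your hardware.
dataset = dataset.map(parse_observation, num_parallel_calls=4)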

To use the values from the dataset, you can make a tf.data.Iterator to get the next element as a pair of tf.Tensor objects, then use these as the input to your model.

iterator = dataset.make_one_shot_iterator()

features_batch, label_batch = iterator.get_next()

# Use the `features_batch` and `label_batch` tensors as the inputs to
# the model, rather than fetching them and feeding them via the `Session`
# interface.
train_op = build_model(features_batch, label_batch)
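
A minimal sketch of the loop that drives this (build_model stands in for your own model code; the one-shot iterator raises tf.errors.OutOfRangeError once the input is exhausted):

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    while True:
        try:
            # Each run pulls the next batch through the iterator automatically.
            sess.run(train_op)
        except tf.errors.OutOfRangeError:
            break  # reached the end of the dataset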


Source: https://stackoverflow.com/questions/49899526/tensorflow-input-pipeline-where-multiple-rows-correspond-to-a-single-observation
