Question
So I've just started using TensorFlow, and I'm struggling to properly understand input pipelines.
The problem I'm working on is sequence classification. I'm trying to read in a CSV file with shape (100000, 4). The first 3 columns are features; the 4th column is the label. BUT - the data represents sequences of length 10, i.e. rows 1-10 are sequence 1, rows 11-20 are sequence 2, etc. This also means each label is repeated 10 times.
So at some point in this input pipeline, I'll need to reshape my feature tensor like tf.reshape(features, [batch_size_, rows_per_ob, input_dim]), and take only every 10th row of my label tensor, like label[::rows_per_ob].
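To make that concrete, here's a toy NumPy sketch of the shapes I'm after (illustrative values only - 2 observations of 10 rows with 3 features each):

import numpy as np

batch_size_, rows_per_ob, input_dim = 2, 10, 3
# 20 raw CSV rows of 3 features each.
raw = np.arange(batch_size_ * rows_per_ob * input_dim,
                dtype=np.float32).reshape(-1, input_dim)
labels = np.repeat([0.0, 1.0], rows_per_ob)           # each label repeated 10x
x = raw.reshape(batch_size_, rows_per_ob, input_dim)  # -> (2, 10, 3)
y = labels[::rows_per_ob]                             # -> (2,)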
Another thing I should point out is that my actual dataset is in the billions of rows, so I have to think about performance.
I've put together the code below from the documentation and other posts on here, but I don't think I fully understand it, because I'm seeing the following error:
INFO:tensorflow:Error reported to Coordinator: , Attempting to use uninitialized value input_producer_2/limit_epochs/epochs
There seems to be an out of range error.
I also can't figure out what to do with these batches once I get them working. Initially, I thought I would reshape them and then just feed them in via feed_dict, but then I read that this is really bad and that I should be using a tf.data.Dataset object instead. But I'm not sure how to feed these batches into a Dataset. I'm also not entirely sure when the optimal time in this process to reshape my data would be.
And a final point of confusion - when you use an Iterator with a Dataset object, I see that we use the get_next() method. Does this mean that each element in the Dataset represents a full batch of data? And does that mean that if we want to change the batch size, we need to rebuild the entire Dataset object?
I'm really struggling to fit all the pieces together. If anyone has any pointers for me, it would be very much appreciated! Thanks!
# import
import tensorflow as tf
# constants
filename = "tensorflow_test_data.csv"
num_rows = 100000
rows_per_ob = 10
batch_size_ = 5
num_epochs_ = 2
num_batches = int(num_rows * num_epochs_ / batch_size_ / rows_per_ob)
# read csv line
def read_from_csv(filename_queue):
    reader = tf.TextLineReader(skip_header_lines=1)
    _, value = reader.read(filename_queue)
    record_defaults = [[0.0], [0.0], [0.0], [0.0]]
    a, b, c, d = tf.decode_csv(value, record_defaults=record_defaults)
    features = tf.stack([a, b, c])
    return features, d
def input_pipeline(filename=filename, batch_size=batch_size_, num_epochs=num_epochs_):
    filename_queue = tf.train.string_input_producer([filename],
                                                    num_epochs=num_epochs,
                                                    shuffle=False)
    x, y = read_from_csv(filename_queue)
    x_batch, y_batch = tf.train.batch([x, y],
                                      batch_size=batch_size * rows_per_ob,
                                      num_threads=1,
                                      capacity=10000)
    return x_batch, y_batch
###
x, y = input_pipeline(filename, batch_size=batch_size_,
                      num_epochs=num_epochs_)
# I imagine using lists is wrong here - this was more just for me to
# see the output
x_list = []
y_list = []
with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    for _ in range(num_batches):
        x_batch, y_batch = sess.run([x, y])
        x_list.append(x_batch)
        y_list.append(y_batch)
    coord.request_stop()
    coord.join(threads)
Answer 1:
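First, the error itself: tf.train.string_input_producer(..., num_epochs=...) creates a local variable, and local variables have to be initialized before the queue runners start. In your original code, the following one-line addition should clear the "uninitialized value ... limit_epochs/epochs" error (a sketch, keeping the rest of your session code unchanged):

with tf.Session() as sess:
    # Initialize the epoch counter created by string_input_producer.
    sess.run(tf.local_variables_initializer())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    # ... rest of the loop as before ...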
You can express the entire pipeline using tf.data.Dataset
objects, which might make things slightly easier:
dataset = tf.data.TextLineDataset(filename)
# Skip the header line.
dataset = dataset.skip(1)
# Combine 10 lines into a single observation.
dataset = dataset.batch(rows_per_ob)
def parse_observation(line_batch):
    record_defaults = [[0.0], [0.0], [0.0], [0.0]]
    a, b, c, d = tf.decode_csv(line_batch, record_defaults=record_defaults)
    # Stack along axis 1 so each observation has shape [rows_per_ob, 3].
    features = tf.stack([a, b, c], axis=1)
    label = d[-1]  # Take the label from the last row of the group.
    return features, label
# Parse each group of lines into a `rows_per_ob x 3` matrix of features and a
# scalar label.
dataset = dataset.map(parse_observation)
# Batch multiple observations together.
dataset = dataset.batch(batch_size_)
# Optionally add a prefetch for performance.
dataset = dataset.prefetch(1)
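Since the real dataset runs to billions of rows, the parsing step can also be parallelized; a minimal sketch of the variant (assuming TensorFlow 1.4+; the thread count of 4 is just an illustrative value):

# In place of the plain map() call above: parse groups of lines on 4 threads.
dataset = dataset.map(parse_observation, num_parallel_calls=4)

Combined with the prefetch(1) already in the pipeline, this lets input parsing overlap with training.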
To use the values from the dataset, you can make a tf.data.Iterator
to get the next element as a pair of tf.Tensor
objects, then use these as the input to your model.
iterator = dataset.make_one_shot_iterator()
features_batch, label_batch = iterator.get_next()
# Use the `features_batch` and `label_batch` tensors as the inputs to
# the model, rather than fetching them and feeding them via the `Session`
# interface.
train_op = build_model(features_batch, label_batch)
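If you do want to pull batches out into Python (for example to inspect them, as in your original loop), each sess.run of the get_next() tensors yields one full batch, and the idiomatic stopping condition is to catch tf.errors.OutOfRangeError rather than counting batches by hand. A minimal sketch:

with tf.Session() as sess:
    try:
        while True:
            # One batch per run: features of shape (batch_size_, rows_per_ob, 3)
            # and labels of shape (batch_size_,).
            x_batch, y_batch = sess.run([features_batch, label_batch])
    except tf.errors.OutOfRangeError:
        pass  # The one-shot iterator is exhausted after a single pass.

On your final question: each element of the batched dataset is indeed a full batch, but you don't need to rebuild everything to change the batch size. One option (a sketch, not the only way) is to batch with a placeholder and use an initializable iterator:

batch_size_ph = tf.placeholder(tf.int64, shape=[])
# `parsed` stands for the dataset right after the map() step above
# (a hypothetical name for this sketch).
dataset = parsed.batch(batch_size_ph).prefetch(1)
iterator = dataset.make_initializable_iterator()
features_batch, label_batch = iterator.get_next()
with tf.Session() as sess:
    # Re-running the initializer with a different value restarts the
    # pipeline with a new batch size.
    sess.run(iterator.initializer, feed_dict={batch_size_ph: batch_size_})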
Source: https://stackoverflow.com/questions/49899526/tensorflow-input-pipeline-where-multiple-rows-correspond-to-a-single-observation