How to input a list of lists with different sizes in tf.data.Dataset

一整个雨季 2020-12-03 01:49

I have a long list of lists of integers (representing sentences, each of a different length) that I want to feed using the tf.data library. Each list (of the lists of list)

4 Answers
  • 2020-12-03 01:54

    For those working with TensorFlow 2: I found that the following works directly with ragged tensors, which should be much faster than a generator as long as the entire dataset fits in memory.

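    import tensorflow as tf
    
    # Each sentence is wrapped in an extra list, so every element is 2-D.
    t = [[[4, 2]],
         [[3, 4, 5]]]
    
    rt = tf.ragged.constant(t)
    dataset = tf.data.Dataset.from_tensor_slices(rt)
    
    # Each element of the dataset is a RaggedTensor holding one sentence.
    for x in dataset:
        print(x)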
    

    produces

    <tf.RaggedTensor [[4, 2]]>
    <tf.RaggedTensor [[3, 4, 5]]>
    

    For some reason, when I tried this it was very particular about each individual array having at least 2 dimensions, hence the extra level of nesting in t.
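
    If a later stage needs a dense tensor rather than a ragged one, the ragged tensor can be padded on demand. The following is only a small sketch, assuming TensorFlow 2.x and the plain two-level list from the question (not the extra-nested form above); it uses the `RaggedTensor.row_lengths()` and `RaggedTensor.to_tensor()` methods:

```python
import tensorflow as tf

t = [[4, 2], [3, 4, 5]]
rt = tf.ragged.constant(t)

# Number of real elements in each row, before any padding.
lengths = rt.row_lengths()

# Dense copy of the data; short rows are filled with default_value.
padded = rt.to_tensor(default_value=0)

print(lengths.numpy().tolist())
print(padded.numpy().tolist())
```

    Keeping `lengths` around makes it possible to tell real values from padding afterwards.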

  • 2020-12-03 02:03

    In addition to @mrry's answer, the following code also works if you would like to create (images, labels) pairs:

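    import tensorflow as tf  # TF 1.x API (tf.Session, make_one_shot_iterator)
    
    # `images` and `labels` are assumed to be equal-length Python lists of arrays.
    # zip() pairs them up; itertools.izip_longest exists only in Python 2 and
    # would pad the shorter list with None.
    dataset = tf.data.Dataset.from_generator(lambda: zip(images, labels),
                                             output_types=(tf.float32, tf.float32),
                                             output_shapes=(tf.TensorShape([None, None, 3]),
                                                            tf.TensorShape([None])))
    
    iterator = dataset.make_one_shot_iterator()
    next_element = iterator.get_next()
    
    with tf.Session() as sess:
        image, label = sess.run(next_element)  # ==> shape: [320, 420, 3], [20]
        image, label = sess.run(next_element)  # ==> shape: [1280, 720, 3], [40]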
    
  • 2020-12-03 02:04

    You can use tf.data.Dataset.from_generator() to convert any iterable Python object (like a list of lists) into a Dataset:

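    import tensorflow as tf  # TF 1.x API (tf.Session, make_one_shot_iterator)
    
    t = [[4, 2], [3, 4, 5]]
    
    # output_shapes=[None] declares each element as a 1-D tensor of unknown length.
    dataset = tf.data.Dataset.from_generator(lambda: t, tf.int32, output_shapes=[None])
    
    iterator = dataset.make_one_shot_iterator()
    next_element = iterator.get_next()
    
    with tf.Session() as sess:
        print(sess.run(next_element))  # ==> '[4, 2]'
        print(sess.run(next_element))  # ==> '[3, 4, 5]'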
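
    The snippet above relies on the session-based TF 1.x API. As a rough sketch of the same idea under TensorFlow 2 (assuming version 2.4 or later, where `from_generator` accepts `output_signature`), the dataset can be iterated eagerly, and `padded_batch` can pad the variable-length elements into one batch:

```python
import tensorflow as tf

t = [[4, 2], [3, 4, 5]]

# shape=(None,) declares each element as a 1-D tensor of unknown length.
dataset = tf.data.Dataset.from_generator(
    lambda: t,
    output_signature=tf.TensorSpec(shape=(None,), dtype=tf.int32),
)

# In eager mode the dataset is a plain Python iterable.
elements = [x.numpy().tolist() for x in dataset]
print(elements)

# padded_batch pads every element to the longest one in the batch (with 0 here).
batch = next(iter(dataset.padded_batch(2)))
print(batch.numpy().tolist())
```

    Because padding happens per batch, each batch is only as wide as its own longest element.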
    
  • 2020-12-03 02:15

    As far as I know, regular (dense) TensorFlow tensors cannot have a varying number of elements along a given dimension.

    However, a simple solution is to pad the nested lists with trailing zeros (where necessary):

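    import tensorflow as tf
    
    t = [[4, 2], [3, 4, 5]]
    # Pad every list with trailing zeros up to the length of the longest one.
    max_length = max(len(lst) for lst in t)
    t_pad = [lst + [0] * (max_length - len(lst)) for lst in t]
    print(t_pad)
    dataset = tf.data.Dataset.from_tensor_slices(t_pad)
    print(dataset)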
    

    Outputs:

    [[4, 2, 0], [3, 4, 5]]
    <TensorSliceDataset shapes: (3,), types: tf.int32>
    

    The zeros shouldn't be a big problem for the model, provided 0 is not also used as a real value in the data: semantically they are just padding at the end of each shorter list.
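
    If the model does need to tell padding from real values, one option is to record the original lengths while padding. This is a minimal pure-Python sketch; the helper names `pad_with_lengths` and `padding_mask` are made up for illustration and are not part of TensorFlow:

```python
def pad_with_lengths(sequences, pad_value=0):
    """Pad each sequence to the longest length and return the original lengths too."""
    lengths = [len(seq) for seq in sequences]
    max_length = max(lengths, default=0)
    padded = [list(seq) + [pad_value] * (max_length - len(seq)) for seq in sequences]
    return padded, lengths


def padding_mask(lengths):
    """Build a 0/1 mask: 1 marks a real element, 0 marks padding."""
    max_length = max(lengths, default=0)
    return [[1] * n + [0] * (max_length - n) for n in lengths]


t = [[4, 2], [3, 4, 5]]
t_pad, lengths = pad_with_lengths(t)
mask = padding_mask(lengths)

print(t_pad)
print(lengths)
print(mask)
```

    The padded lists and the lengths can then be fed together, for example with `tf.data.Dataset.from_tensor_slices((t_pad, lengths))`, so that each dataset element carries its own true length.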
