dataset_bucket_by_sequence_length {tfdatasets}    R Documentation

A transformation that buckets elements in a Dataset by length

Description

A transformation that groups the elements of a Dataset into buckets by sequence length, then pads and batches the elements within each bucket.

Usage

dataset_bucket_by_sequence_length(
  dataset,
  element_length_func,
  bucket_boundaries,
  bucket_batch_sizes,
  padded_shapes = NULL,
  padding_values = NULL,
  pad_to_bucket_boundary = FALSE,
  no_padding = FALSE,
  drop_remainder = FALSE,
  name = NULL
)

Arguments

dataset

A tf_dataset

element_length_func

A function that maps an element of the Dataset to a tf$int32 scalar giving the element's length. The length determines which bucket the element goes into.

bucket_boundaries

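Integers, the upper length boundaries of the buckets. Each boundary is exclusive: an element whose length equals a boundary goes into the next bucket.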

bucket_batch_sizes

Integers, the batch size for each bucket. Must have length length(bucket_boundaries) + 1.

padded_shapes

Nested structure of tf.TensorShape (as returned by tensorflow::shape()) to pass to tf.data.Dataset.padded_batch. If not provided, dataset.output_shapes is used, which results in variable-length dimensions being padded out to the maximum length in each batch.

padding_values

Values to pad with, passed to tf.data.Dataset.padded_batch. Defaults to padding with 0.

pad_to_bucket_boundary

Logical. If FALSE, dimensions with unknown size are padded to the maximum length in the batch. If TRUE, dimensions with unknown size are padded to the bucket boundary minus 1 (i.e., the maximum length in each bucket), and the caller must ensure that the source Dataset does not contain any elements with length longer than max(bucket_boundaries).

no_padding

Logical. If TRUE, the batch features are not padded; in that case the features must either be of type tf.sparse.SparseTensor or all have the same shape.

drop_remainder

(Optional.) A logical scalar indicating whether the last batch should be dropped if it has fewer than batch_size elements. The default is to keep the smaller batch.

name

(Optional.) A name for the tf.data operation.

Details

Elements of the Dataset are grouped together by length and are then padded and batched.

This is useful for sequence tasks in which the elements have variable length. Grouping together elements of similar length reduces the total fraction of padding in a batch, which increases training step efficiency.
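To make the effect on padding concrete, the following base R sketch counts padded cells for the six sequence lengths used in the Examples section, once batched in arrival order and once batched by bucket. It does not call TensorFlow: the helper padding_cells() is hypothetical and only models padding each batch to the length of its longest element, and each bucket here happens to hold exactly one batch of 2.

```r
# Sequence lengths of the six elements used in the Examples section.
lengths <- c(1, 4, 3, 5, 8, 2)

# Hypothetical helper: number of padded cells when every batch is padded
# to the length of its longest element.
padding_cells <- function(batches) {
  sum(vapply(batches, function(b) sum(max(b) - b), numeric(1)))
}

# Batches of 2 in arrival order: (1, 4), (3, 5), (8, 2).
unbucketed <- split(lengths, rep(1:3, each = 2))

# Batches of 2 after grouping by the buckets [0, 3), [3, 5), [5, Inf):
# (1, 2), (4, 3), (5, 8).
bucketed <- split(lengths, findInterval(lengths, c(3, 5)))

padding_unbucketed <- padding_cells(unbucketed)  # 3 + 2 + 6 = 11
padding_bucketed <- padding_cells(bucketed)      # 1 + 1 + 3 = 5
```

Under these assumptions, bucketing reduces the padding from 11 cells to 5 for the same six elements.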

The example in the Examples section buckets the input data into three buckets, [0, 3), [3, 5), and [5, Inf), based on sequence length, with a batch size of 2 for each bucket.
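The mapping from element length to bucket can be sketched in base R, independently of TensorFlow. This is only an illustration of the bucket intervals described above, not the package's implementation; the variable names lengths and bucket_id are chosen here for illustration.

```r
bucket_boundaries <- c(3, 5)
bucket_batch_sizes <- c(2, 2, 2)

# One batch size is needed per bucket, including the open-ended last bucket.
stopifnot(length(bucket_batch_sizes) == length(bucket_boundaries) + 1)

# Lengths of the six elements used in the Examples section.
lengths <- c(1, 4, 3, 5, 8, 2)

# findInterval() uses left-closed intervals, so a length equal to a
# boundary falls into the next bucket; adding 1 gives 1-based bucket ids.
bucket_id <- findInterval(lengths, bucket_boundaries) + 1
bucket_id
# [1] 1 2 2 3 3 1
```

Lengths 3 and 5 sit exactly on a boundary and land in the second and third bucket respectively, matching the half-open intervals [3, 5) and [5, Inf).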

Examples

## Not run: 
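library(tensorflow)
library(tfdatasets)

# Build a dataset of six variable-length integer sequences.
dataset <- list(c(0),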
                c(1, 2, 3, 4),
                c(5, 6, 7),
                c(7, 8, 9, 10, 11),
                c(13, 14, 15, 16, 17, 18, 19, 20),
                c(21, 22)) %>%
  lapply(as.array) %>% lapply(as_tensor, "int32") %>%
  lapply(tensors_dataset) %>%
  Reduce(dataset_concatenate, .)

dataset %>%
  dataset_bucket_by_sequence_length(
    element_length_func = function(elem) tf$shape(elem)[1],
    bucket_boundaries = c(3, 5),
    bucket_batch_sizes = c(2, 2, 2)
  ) %>%
  as_array_iterator() %>%
  iterate(print)
#      [,1] [,2] [,3] [,4]
# [1,]    1    2    3    4
# [2,]    5    6    7    0
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
# [1,]    7    8    9   10   11    0    0    0
# [2,]   13   14   15   16   17   18   19   20
#      [,1] [,2]
# [1,]    0    0
# [2,]   21   22

## End(Not run)

[Package tfdatasets version 2.17.0 Index]