dataset_bucket_by_sequence_length {tfdatasets} | R Documentation |
A transformation that buckets elements in a Dataset by length
Description
A transformation that buckets elements in a Dataset by length
Usage
dataset_bucket_by_sequence_length(
dataset,
element_length_func,
bucket_boundaries,
bucket_batch_sizes,
padded_shapes = NULL,
padding_values = NULL,
pad_to_bucket_boundary = FALSE,
no_padding = FALSE,
drop_remainder = FALSE,
name = NULL
)
Arguments
dataset |
A dataset |
element_length_func |
function from an element in the dataset to its length, which determines the bucket the element goes into. |
bucket_boundaries |
integers, upper length boundaries of the buckets. |
bucket_batch_sizes |
integers, batch size per bucket. Length should be one more than the length of bucket_boundaries, since the last bucket has no upper boundary. |
padded_shapes |
Nested structure of |
padding_values |
Values to pad with, passed to
|
pad_to_bucket_boundary |
bool, if |
no_padding |
boolean, indicates whether to pad the batch features (features
need to be either of type |
drop_remainder |
(Optional.) A logical scalar, representing whether the last batch should be dropped in the case it has fewer elements than the batch size. |
name |
(Optional.) A name for the tf.data operation. |
Details
Elements of the Dataset are grouped together by length and then are padded and batched.
This is useful for sequence tasks in which the elements have variable length. Grouping together elements that have similar lengths reduces the total fraction of padding in a batch, which increases training step efficiency.
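The padding saving can be checked with a little arithmetic. The following is a minimal base R sketch, not the tfdatasets implementation: it assumes six hypothetical sequence lengths, a batch size of 2, and bucket boundaries of 3 and 5, chosen so that each bucket happens to hold exactly one batch. The helper `padding_fraction()` is a name introduced here for illustration only.

```r
# Sequence lengths of six hypothetical elements, in arrival order.
lengths <- c(1, 4, 3, 5, 8, 2)

# Fraction of padded cells when each group of lengths is padded
# to the longest member of that group.
padding_fraction <- function(batches) {
  padded <- sum(vapply(batches, function(b) length(b) * max(b), numeric(1)))
  real <- sum(unlist(batches))
  (padded - real) / padded
}

# Plain batching: consecutive pairs in arrival order.
in_order <- split(lengths, ceiling(seq_along(lengths) / 2))

# Bucketed batching: group by bucket [0, 3), [3, 5), [5, Inf) first.
bucketed <- split(lengths, findInterval(lengths, c(3, 5)))

frac_in_order <- padding_fraction(in_order)  # 11 / 34
frac_bucketed <- padding_fraction(bucketed)  # 5 / 28
```

Under these assumptions, batching in arrival order pads 11 of 34 cells, while batching within buckets pads 5 of 28.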
Below is an example that bucketizes the input data into 3 buckets, [0, 3), [3, 5), and [5, Inf), based on sequence length, with a batch size of 2.
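As a rough illustration of how lengths map onto those three buckets, independent of TensorFlow, the assignment can be sketched in base R with `findInterval()`. This is an assumption about the semantics suggested by the interval notation (each boundary is an exclusive upper bound), not code taken from tfdatasets; `bucket_id` and `bucket_labels` are illustrative names.

```r
bucket_boundaries <- c(3, 5)

# Lengths of the six vectors used in the Examples section.
lengths <- c(1, 4, 3, 5, 8, 2)

# findInterval() returns 0 for values below the first boundary,
# so add 1 to get 1-based bucket ids.
bucket_id <- findInterval(lengths, bucket_boundaries) + 1

bucket_labels <- c("[0, 3)", "[3, 5)", "[5, Inf)")
by_bucket <- split(lengths, bucket_labels[bucket_id])
print(by_bucket)
```

Note that a length equal to a boundary falls into the next bucket up: length 3 lands in [3, 5) and length 5 lands in [5, Inf).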
Examples
## Not run:
library(tensorflow)
library(tfdatasets)

# Build a dataset of six variable-length int32 vectors
dataset <- list(c(0),
                c(1, 2, 3, 4),
                c(5, 6, 7),
                c(7, 8, 9, 10, 11),
                c(13, 14, 15, 16, 17, 18, 19, 20),
                c(21, 22)) %>%
  lapply(as.array) %>%
  lapply(as_tensor, "int32") %>%
  lapply(tensors_dataset) %>%
  Reduce(dataset_concatenate, .)

# Bucket by length with boundaries 3 and 5, two elements per batch
dataset %>%
  dataset_bucket_by_sequence_length(
    element_length_func = function(elem) tf$shape(elem)[1],
    bucket_boundaries = c(3, 5),
    bucket_batch_sizes = c(2, 2, 2)
  ) %>%
  as_array_iterator() %>%
  iterate(print)
#      [,1] [,2] [,3] [,4]
# [1,]    1    2    3    4
# [2,]    5    6    7    0
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
# [1,]    7    8    9   10   11    0    0    0
# [2,]   13   14   15   16   17   18   19   20
#      [,1] [,2]
# [1,]    0    0
# [2,]   21   22
## End(Not run)