Wednesday, November 13, 2024

NumPy-style broadcasting for R TensorFlow users

NumPy-style broadcasting for R TensorFlow users

We develop, train, and deploy TensorFlow models from R. But that doesn’t mean we don’t make use of documentation, blog posts, and examples written in Python. We look up specific functionality in the official TensorFlow API docs; we get inspiration from other people’s code.

Depending on how comfortable you are with Python, there’s a problem. For example: You’re supposed to know how broadcasting works. And perhaps, you’d say you’re vaguely familiar with it: So when arrays have different shapes, some elements get duplicated until their shapes match and … and isn’t R vectorized anyway?

While such a global notion may work in general, like when skimming a blog post, it’s not enough to understand, say, examples in the TensorFlow API docs. In this post, we’ll try to arrive at a more exact understanding, and check it on concrete examples.

Speaking of examples, here are two motivating ones.

Broadcasting in action

The first uses TensorFlow’s matmul to multiply two tensors. Would you like to guess the result – not the numbers, but how it comes about in general? Does this even run without error – shouldn’t matrices be two-dimensional (rank-2 tensors, in TensorFlow speak)?

a <- tf$constant(keras::array_reshape(1:12, dim = c(2, 2, 3)))
a 
# tf.Tensor(
# [[[ 1.  2.  3.]
#   [ 4.  5.  6.]]
# 
#  [[ 7.  8.  9.]
#   [10. 11. 12.]]], shape=(2, 2, 3), dtype=float64)

b <- tf$constant(keras::array_reshape(101:106, dim = c(1, 3, 2)))
b  
# tf.Tensor(
# [[[101. 102.]
#   [103. 104.]
#   [105. 106.]]], shape=(1, 3, 2), dtype=float64)

c <- tf$matmul(a, b)

Second, here is a “real example” from a TensorFlow Probability (TFP) github issue. (Translated to R, but keeping the semantics).
In TFP, we can have batches of distributions. That, per se, is not surprising. But look at this:

library(tfprobability)
d <- tfd_normal(loc = c(0, 1), scale = matrix(1.5:4.5, ncol = 2, byrow = TRUE))
d
# tfp.distributions.Normal("Normal", batch_shape=[2, 2], event_shape=[], dtype=float64)

We create a batch of four normal distributions: each with a different scale (1.5, 2.5, 3.5, 4.5). But wait: there are only two location parameters given. So what are their scales, respectively?
Thankfully, TFP developers Brian Patton and Chris Suter explained how it works: TFP actually does broadcasting – with distributions – just like with tensors!

We get back to both examples at the end of this post. Our main focus will be to explain broadcasting as done in NumPy, as NumPy-style broadcasting is what numerous other frameworks have adopted (e.g., TensorFlow).

Before though, let’s quickly review a few basics about NumPy arrays: How to index or slice them (indexing normally referring to single-element extraction, while slicing would yield – well – slices containing several elements); how to parse their shapes; some terminology and related background.
Though not complicated per se, these are the kinds of things that can be confusing to infrequent Python users; yet they’re often a prerequisite to successfully making use of Python documentation.

Stated upfront, we’ll really restrict ourselves to the basics here; for example, we won’t touch advanced indexing which – just like lots more –, can be looked up in detail in the NumPy documentation.

Few facts about NumPy

Basic slicing

For simplicity, we’ll use the terms indexing and slicing more or less synonymously from now on. The basic device here is a slice, namely, a start:stop structure indicating, for a single dimension, which range of elements to include in the selection.

In contrast to R, Python indexing is zero-based, and the end index is exclusive:

c(4L, 1L))
a
# tf.Tensor(
# [[1.]
#  [1.]
#  [1.]
#  [1.]], shape=(4, 1), dtype=float32)

b <- tf$constant(c(1, 2, 3, 4))
b
# tf.Tensor([1. 2. 3. 4.], shape=(4,), dtype=float32)

a + b
# tf.Tensor(
# [[2. 3. 4. 5.]
# [2. 3. 4. 5.]
# [2. 3. 4. 5.]
# [2. 3. 4. 5.]], shape=(4, 4), dtype=float32)

And second, when we add tensors with shapes (3, 3) and (3,), the 1-d tensor should get added to every row (not every column):

a <- tf$constant(matrix(1:9, ncol = 3, byrow = TRUE), dtype = tf$float32)
a
# tf.Tensor(
# [[1. 2. 3.]
#  [4. 5. 6.]
#  [7. 8. 9.]], shape=(3, 3), dtype=float32)

b <- tf$constant(c(100, 200, 300))
b
# tf.Tensor([100. 200. 300.], shape=(3,), dtype=float32)

a + b
# tf.Tensor(
# [[101. 202. 303.]
#  [104. 205. 306.]
#  [107. 208. 309.]], shape=(3, 3), dtype=float32)

Now back to the initial matmul example.

Back to the puzzles

The documentation for matmul says,

The inputs must, following any transpositions, be tensors of rank >= 2 where the inner 2 dimensions specify valid matrix multiplication dimensions, and any further outer dimensions specify matching batch size.

So here (see code just below), the inner two dimensions look good – (2, 3) and (3, 2) – while the one (one and only, in this case) batch dimension shows mismatching values 2 and 1, respectively.
A case for broadcasting thus: Both “batches” of a get matrix-multiplied with b.

a <- tf$constant(keras::array_reshape(1:12, dim = c(2, 2, 3)))
a 
# tf.Tensor(
# [[[ 1.  2.  3.]
#   [ 4.  5.  6.]]
# 
#  [[ 7.  8.  9.]
#   [10. 11. 12.]]], shape=(2, 2, 3), dtype=float64)

b <- tf$constant(keras::array_reshape(101:106, dim = c(1, 3, 2)))
b  
# tf.Tensor(
# [[[101. 102.]
#   [103. 104.]
#   [105. 106.]]], shape=(1, 3, 2), dtype=float64)

c <- tf$matmul(a, b)
c
# tf.Tensor(
# [[[ 622.  628.]
#   [1549. 1564.]]
# 
#  [[2476. 2500.]
#   [3403. 3436.]]], shape=(2, 2, 2), dtype=float64) 

Let’s quickly check this really is what happens, by multiplying both batches separately:

tf$matmul(a[1, , ], b)
# tf.Tensor(
# [[[ 622.  628.]
#   [1549. 1564.]]], shape=(1, 2, 2), dtype=float64)

tf$matmul(a[2, , ], b)
# tf.Tensor(
# [[[2476. 2500.]
#   [3403. 3436.]]], shape=(1, 2, 2), dtype=float64)

Is it too weird to be wondering if broadcasting would also happen for matrix dimensions? E.g., could we try matmuling tensors of shapes (2, 4, 1) and (2, 3, 1), where the 4 x 1 matrix would be broadcast to 4 x 3? – A quick test shows that no.

To see how really, when dealing with TensorFlow operations, it pays off overcoming one’s initial reluctance and actually consult the documentation, let’s try another one.

In the documentation for matvec, we are told:

Multiplies matrix a by vector b, producing a * b.
The matrix a must, following any transpositions, be a tensor of rank >= 2, with shape(a)[-1] == shape(b)[-1], and shape(a)[:-2] able to broadcast with shape(b)[:-1].

In our understanding, given input tensors of shapes (2, 2, 3) and (2, 3), matvec should perform two matrix-vector multiplications: once for each batch, as indexed by each input’s leftmost dimension. Let’s check this – so far, there is no broadcasting involved:

# two matrices
a <- tf$constant(keras::array_reshape(1:12, dim = c(2, 2, 3)))
a
# tf.Tensor(
# [[[ 1.  2.  3.]
#   [ 4.  5.  6.]]
# 
#  [[ 7.  8.  9.]
#   [10. 11. 12.]]], shape=(2, 2, 3), dtype=float64)

b = tf$constant(keras::array_reshape(101:106, dim = c(2, 3)))
b
# tf.Tensor(
# [[101. 102. 103.]
#  [104. 105. 106.]], shape=(2, 3), dtype=float64)

c <- tf$linalg$matvec(a, b)
c
# tf.Tensor(
# [[ 614. 1532.]
#  [2522. 3467.]], shape=(2, 2), dtype=float64)

Doublechecking, we manually multiply the corresponding matrices and vectors, and get:

tf$linalg$matvec(a[1,  , ], b[1, ])
# tf.Tensor([ 614. 1532.], shape=(2,), dtype=float64)

tf$linalg$matvec(a[2,  , ], b[2, ])
# tf.Tensor([2522. 3467.], shape=(2,), dtype=float64)

The same. Now, will we see broadcasting if b has just a single batch?

b = tf$constant(keras::array_reshape(101:103, dim = c(1, 3)))
b
# tf.Tensor([[101. 102. 103.]], shape=(1, 3), dtype=float64)

c <- tf$linalg$matvec(a, b)
c
# tf.Tensor(
# [[ 614. 1532.]
#  [2450. 3368.]], shape=(2, 2), dtype=float64)

Multiplying every batch of a with b, for comparison:

tf$linalg$matvec(a[1,  , ], b)
# tf.Tensor([ 614. 1532.], shape=(2,), dtype=float64)

tf$linalg$matvec(a[2,  , ], b)
# tf.Tensor([[2450. 3368.]], shape=(1, 2), dtype=float64)

It worked!

Now, on to the other motivating example, using tfprobability.

Broadcasting everywhere

Here again is the setup:

library(tfprobability)
d <- tfd_normal(loc = c(0, 1), scale = matrix(1.5:4.5, ncol = 2, byrow = TRUE))
d
# tfp.distributions.Normal("Normal", batch_shape=[2, 2], event_shape=[], dtype=float64)

What is going on? Let’s inspect location and scale separately:

d$loc
# tf.Tensor([0. 1.], shape=(2,), dtype=float64)

d$scale
# tf.Tensor(
# [[1.5 2.5]
#  [3.5 4.5]], shape=(2, 2), dtype=float64)

Just focusing on these tensors and their shapes, and having been told that there’s broadcasting going on, we can reason like this: Aligning both shapes on the right and extending loc’s shape by 1 (on the left), we have (1, 2) which may be broadcast with (2,2) – in matrix-speak, loc is treated as a row and duplicated.

Meaning: We have two distributions with mean \(0\) (one of scale \(1.5\), the other of scale \(3.5\)), and also two with mean \(1\) (corresponding scales being \(2.5\) and \(4.5\)).

Here’s a more direct way to see this:

d$mean()
# tf.Tensor(
# [[0. 1.]
#  [0. 1.]], shape=(2, 2), dtype=float64)

d$stddev()
# tf.Tensor(
# [[1.5 2.5]
#  [3.5 4.5]], shape=(2, 2), dtype=float64)

Puzzle solved!

Summing up, broadcasting is simple “in theory” (its rules are), but may need some practicing to get it right. Especially in conjunction with the fact that functions / operators do have their own views on which parts of its inputs should broadcast, and which shouldn’t. Really, there is no way around looking up the actual behaviors in the documentation.

Hopefully though, you’ve found this post to be a good start into the topic. Maybe, like the author, you feel like you might see broadcasting going on anywhere in the world now. Thanks for reading!

Related Articles

Latest Articles