A matsindf_apply primer

Matthew Kuperus Heun

2019-02-05

Introduction

matsindf_apply is a powerful and versatile function that enables analysis of data frames by applying a function (FUN) in helpful ways. It is called matsindf_apply because it can apply FUN to a matsindf data frame, a data frame whose individual entries are matrices. (A matsindf data frame can be created by calling collapse_to_matrices, as demonstrated below.)

But matsindf_apply can apply FUN across much more: data frames of single numbers, lists of matrices, lists of single numbers, and individual numbers. This vignette demonstrates matsindf_apply, starting with simple examples and proceeding to sophisticated analyses.

The basics

The basis of all analyses conducted with matsindf_apply is a function (FUN) to be applied across data. FUN must return its results as a named list of variables. Here is an example function that both adds and subtracts its arguments, a and b, and returns a list containing its results, c and d.

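library(matsbyname)  # provides sum_byname and difference_byname
library(matsindf)    # provides matsindf_apply
example_fun <- function(a, b){
  return(list(c = sum_byname(a, b), d = difference_byname(a, b)))
}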

As with lapply and its siblings, the additional arguments to matsindf_apply include the data over which FUN is to be applied. In the simplest case, these data are supplied as named arguments in the ... argument of matsindf_apply and are passed to FUN according to their names. In this case, the output of matsindf_apply is the named list returned by FUN.

matsindf_apply(FUN = example_fun, a = 2, b = 1)
#> $c
#> [1] 3
#> 
#> $d
#> [1] 1

Passing an additional argument (z = 2) causes the familiar unused argument error, because example_fun does not have a z argument.

tryCatch(
  matsindf_apply(FUN = example_fun, a = 2, b = 1, z = 2),
  error = function(e){e}
)
#> <simpleError in FUN(...): unused argument (z = 2)>

Failing to pass a needed argument (b = 1) causes the familiar argument X is missing error, because example_fun requires a value for b.

tryCatch(
  matsindf_apply(FUN = example_fun, a = 2),
  error = function(e){e}
)
#> <simpleError in sum_byname(a, b): argument "b" is missing, with no default>

(If example_fun tolerated a missing argument, no such error would be created.)
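
The name-based matching described above behaves much like base R's do.call. The following is a minimal sketch of that idea, not the package's implementation. add_sub is a hypothetical stand-in for example_fun that uses plain arithmetic instead of the matsbyname functions.

```r
# Hypothetical stand-in for example_fun, using base arithmetic.
add_sub <- function(a, b) {
  list(c = a + b, d = a - b)
}

# do.call() matches list elements to formal arguments by name,
# so the order of the elements does not matter.
res <- do.call(add_sub, list(b = 1, a = 2))

# An element with no matching formal argument triggers an error,
# captured here as a message string instead of stopping the script.
err <- tryCatch(
  do.call(add_sub, list(a = 2, b = 1, z = 2)),
  error = function(e) conditionMessage(e)
)
```

Here res holds c = 3 and d = 1, and err holds the text of the unused-argument error.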

Alternatively, arguments to FUN can be given in a named list to the first argument to matsindf_apply (.dat). When a value is assigned to .dat, the return value from matsindf_apply contains all named variables in .dat (in this case both a and b) in addition to the results provided by FUN (in this case both c and d).

matsindf_apply(list(a = 2, b = 1), FUN = example_fun)
#> $a
#> [1] 2
#> 
#> $b
#> [1] 1
#> 
#> $c
#> [1] 3
#> 
#> $d
#> [1] 1

Extra variables are tolerated in .dat, because .dat is considered to be a store of data from which variables can be drawn as needed.

matsindf_apply(list(a = 2, b = 1, z = 42), FUN = example_fun)
#> $a
#> [1] 2
#> 
#> $b
#> [1] 1
#> 
#> $z
#> [1] 42
#> 
#> $c
#> [1] 3
#> 
#> $d
#> [1] 1

In contrast, named arguments to ... are specified by the user, so including an extra variable is considered an error, as shown above.
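
The idea of .dat as a store of data can be sketched in base R as follows. This is an illustration under simplifying assumptions, not the package's code: apply_from_store and add_sub are hypothetical names, and the helper assumes every argument of FUN is present in the store.

```r
# Hypothetical stand-in for example_fun, using base arithmetic.
add_sub <- function(a, b) list(c = a + b, d = a - b)

# Hypothetical helper: draw only the arguments FUN needs from the store,
# call FUN, and return the store followed by the results.
apply_from_store <- function(store, FUN) {
  needed <- names(formals(FUN))        # "a" and "b" for add_sub
  result <- do.call(FUN, store[needed])
  c(store, result)                     # extra items such as z pass through
}

out <- apply_from_store(list(a = 2, b = 1, z = 42), add_sub)
names(out)
```

Because only the needed items are handed to FUN, the extra item z causes no unused-argument error and is simply carried through to the output.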

Some details

If a named argument is supplied by both .dat and ..., the argument in ... takes precedence, overriding the argument in .dat.

matsindf_apply(list(a = 2, b = 1), FUN = example_fun, a = 10)
#> $a
#> [1] 10
#> 
#> $b
#> [1] 1
#> 
#> $c
#> [1] 11
#> 
#> $d
#> [1] 9

When supplying both .dat and ..., ... can contain named strings which are interpreted as mappings from item names in .dat to arguments in the signature of FUN. In the example below, a = "z" indicates that argument a to FUN should be supplied by item z in .dat.

matsindf_apply(list(a = 2, b = 1, z = 42),
               FUN = example_fun, a = "z")
#> $a
#> [1] 2
#> 
#> $b
#> [1] 1
#> 
#> $z
#> [1] 42
#> 
#> $c
#> [1] 43
#> 
#> $d
#> [1] 41
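
The two rules in this section (values in ... override .dat, and strings in ... name items in .dat) can be sketched in base R. The helper below is hypothetical and simplified: it treats every character value in ... as a lookup, which may be cruder than what matsindf_apply actually does.

```r
# Hypothetical stand-in for example_fun, using base arithmetic.
add_sub <- function(a, b) list(c = a + b, d = a - b)

# Hypothetical helper: start from the store, then let `...` take over.
# A character value in `...` is looked up in the store;
# any other value is used directly.
resolve_args <- function(store, FUN, ...) {
  dots <- list(...)
  args <- store[intersect(names(formals(FUN)), names(store))]
  for (nm in names(dots)) {
    val <- dots[[nm]]
    args[[nm]] <- if (is.character(val)) store[[val]] else val
  }
  do.call(FUN, args)
}

mapped     <- resolve_args(list(a = 2, b = 1, z = 42), add_sub, a = "z")
overridden <- resolve_args(list(a = 2, b = 1), add_sub, a = 10)
```

mapped contains c = 43 and d = 41, and overridden contains c = 11 and d = 9, matching the results shown above.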

If the same name appears in both .dat and the output of FUN, a name collision occurs in the output of matsindf_apply, and a warning is issued.

tryCatch(
  matsindf_apply(list(a = 2, b = 1, c = 42), FUN = example_fun),
  warning = function(w){w}
)
#> <simpleWarning in matsindf_apply(list(a = 2, b = 1, c = 42), FUN = example_fun): name collision in matsindf_apply: c>
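
A collision of this kind can be detected by intersecting the two sets of names. The sketch below shows the check in base R. The variable names and the warning text are illustrative assumptions, not the package's internals.

```r
store  <- list(a = 2, b = 1, c = 42)  # inputs, including an item named c
result <- list(c = 3, d = 1)          # what a function like example_fun returns

# Names present on both sides would be ambiguous in the combined output.
clash <- intersect(names(store), names(result))

# Raise a warning when a clash exists, capturing its message as a string.
msg <- tryCatch(
  {
    if (length(clash) > 0) {
      warning("name collision: ", paste(clash, collapse = ", "))
    }
    NULL
  },
  warning = function(w) conditionMessage(w)
)
```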

matsindf_apply and data frames

.dat can be a list (as shown in several examples above), but it can also be a data frame.

df <- data.frame(a = 2:4, b = 1:3)
matsindf_apply(df, FUN = example_fun)
#>   a b c d
#> 1 2 1 3 1
#> 2 3 2 5 1
#> 3 4 3 7 1
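
Conceptually, the data frame case applies FUN once per row and binds the results on as new columns. A rough base-R equivalent is sketched below. add_sub is a hypothetical stand-in for example_fun, and this is not how the package itself is implemented.

```r
df <- data.frame(a = 2:4, b = 1:3)

# Hypothetical stand-in for example_fun, using base arithmetic.
add_sub <- function(a, b) list(c = a + b, d = a - b)

# Map() calls add_sub once per row, pairing df$a[i] with df$b[i].
per_row <- Map(add_sub, df$a, df$b)

# Stack the per-row results and attach them as new columns.
new_cols <- do.call(rbind, lapply(per_row, as.data.frame))
out <- cbind(df, new_cols)
out
```

The resulting data frame has columns a, b, c, and d, with the same values as the matsindf_apply output above.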

Furthermore, matsindf_apply works with a matsindf data frame, a data frame wherein each entry is a matrix. To demonstrate use of matsindf_apply with a matsindf data frame, we’ll construct a simple one (midf) using functions in this package.

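# Load the packages that supply %>%, mutate, case_when, group_by, and spread
library(dplyr)
library(tidyr)
# Create a tidy data frame containing data for matrices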
tidy <- data.frame(Year = rep(c(rep(2017, 4), rep(2018, 4)), 2),
                   matnames = c(rep("U", 8), rep("V", 8)),
                   matvals = c(1:4, 11:14, 21:24, 31:34),
                   rownames = c(rep(c(rep("p1", 2), rep("p2", 2)), 2), 
                                rep(c(rep("i1", 2), rep("i2", 2)), 2)),
                   colnames = c(rep(c("i1", "i2"), 4), 
                                rep(c("p1", "p2"), 4))) %>%
  mutate(
    rowtypes = case_when(
      matnames == "U" ~ "product",
      matnames == "V" ~ "industry", 
      TRUE ~ NA_character_
    ),
    coltypes = case_when(
      matnames == "U" ~ "industry",
      matnames == "V" ~ "product",
      TRUE ~ NA_character_
    )
  )

tidy
#>    Year matnames matvals rownames colnames rowtypes coltypes
#> 1  2017        U       1       p1       i1  product industry
#> 2  2017        U       2       p1       i2  product industry
#> 3  2017        U       3       p2       i1  product industry
#> 4  2017        U       4       p2       i2  product industry
#> 5  2018        U      11       p1       i1  product industry
#> 6  2018        U      12       p1       i2  product industry
#> 7  2018        U      13       p2       i1  product industry
#> 8  2018        U      14       p2       i2  product industry
#> 9  2017        V      21       i1       p1 industry  product
#> 10 2017        V      22       i1       p2 industry  product
#> 11 2017        V      23       i2       p1 industry  product
#> 12 2017        V      24       i2       p2 industry  product
#> 13 2018        V      31       i1       p1 industry  product
#> 14 2018        V      32       i1       p2 industry  product
#> 15 2018        V      33       i2       p1 industry  product
#> 16 2018        V      34       i2       p2 industry  product

# Convert to a matsindf data frame
midf <- tidy %>% 
  group_by(Year, matnames) %>% 
  collapse_to_matrices(rowtypes = "rowtypes", coltypes = "coltypes") %>% 
  spread(key = "matnames", value = "matvals")

# Take a look at the midf data frame and some of the matrices it contains.
midf
#>   Year              U              V
#> 1 2017     1, 3, 2, 4 21, 23, 22, 24
#> 2 2018 11, 13, 12, 14 31, 33, 32, 34
midf$U[[1]]
#>    i1 i2
#> p1  1  2
#> p2  3  4
#> attr(,"rowtype")
#> [1] "product"
#> attr(,"coltype")
#> [1] "industry"
midf$V[[1]]
#>    p1 p2
#> i1 21 22
#> i2 23 24
#> attr(,"rowtype")
#> [1] "industry"
#> attr(,"coltype")
#> [1] "product"

With midf in hand, we can demonstrate use of tidyverse-style functional programming to perform matrix algebra within a data frame. The functions of the matsbyname package (such as difference_byname below) can be used for this purpose.

result <- midf %>% 
  mutate(
    W = difference_byname(transpose_byname(V), U)
  )
result
#>   Year              U              V              W
#> 1 2017     1, 3, 2, 4 21, 23, 22, 24 20, 19, 21, 20
#> 2 2018 11, 13, 12, 14 31, 33, 32, 34 20, 19, 21, 20
result$W[[1]]
#>    i1 i2
#> p1 20 21
#> p2 19 20
#> attr(,"rowtype")
#> [1] "product"
#> attr(,"coltype")
#> [1] "industry"
result$W[[2]]
#>    i1 i2
#> p1 20 21
#> p2 19 20
#> attr(,"rowtype")
#> [1] "product"
#> attr(,"coltype")
#> [1] "industry"
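
As a cross-check, the 2017 value of W can be reproduced with base R operators. This sketch rebuilds the 2017 U and V matrices by hand (without the rowtype and coltype attributes). It works only because the row and column names of t(V) and U are already in the same order. Base subtraction does not align by name.

```r
# 2017 matrices, entered row by row to match the printed output above.
U <- matrix(1:4, nrow = 2, byrow = TRUE,
            dimnames = list(c("p1", "p2"), c("i1", "i2")))
V <- matrix(21:24, nrow = 2, byrow = TRUE,
            dimnames = list(c("i1", "i2"), c("p1", "p2")))

# Transpose V so that products are in rows, then subtract U element-wise.
W <- t(V) - U
W
```

W has rows p1 and p2, columns i1 and i2, and the values 20, 21, 19, 20 (reading row by row), in agreement with result$W[[1]].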

This way of performing matrix calculations works equally well within a 2-row matsindf data frame (as shown above) or a 1000-row matsindf data frame.

Programming with matsindf_apply

Users can write their own functions using matsindf_apply. A flexible calc_W function can be written as follows.

calc_W <- function(.DF = NULL, U = "U", V = "V", W = "W"){
  # The inner function does all the work.
  W_func <- function(U_mat, V_mat){
    # When we get here, U_mat and V_mat will be single matrices or single numbers, 
    # not a column in a data frame or an item in a list.
    # Calculate W_mat from the inputs U_mat and V_mat.
    W_mat <- difference_byname(transpose_byname(V_mat), U_mat)
    # Return a named list.
    list(W_mat) %>% magrittr::set_names(W)
  }
  # The body of the main function consists of a call to matsindf_apply
  # that specifies the inner function
  matsindf_apply(.DF, FUN = W_func, U_mat = U, V_mat = V)
}

This style of writing matsindf_apply functions is incredibly versatile, leveraging the capabilities of both the matsindf and matsbyname packages. (Indeed, the Recca package uses matsindf_apply heavily and is built upon the functions in the matsindf and matsbyname packages.)

Functions written like calc_W can operate in ways similar to matsindf_apply itself. To demonstrate, we’ll use calc_W in all the ways that matsindf_apply can be used, going in the reverse order to our demonstration of the capabilities of matsindf_apply above.

calc_W can be used as a specialized mutate function that operates on matsindf data frames.

midf %>% calc_W()
#>   Year              U              V              W
#> 1 2017     1, 3, 2, 4 21, 23, 22, 24 20, 19, 21, 20
#> 2 2018 11, 13, 12, 14 31, 33, 32, 34 20, 19, 21, 20

The added column could be given a different name from the default (“W”) using the W argument.

midf %>% calc_W(W = "W_prime")
#>   Year              U              V        W_prime
#> 1 2017     1, 3, 2, 4 21, 23, 22, 24 20, 19, 21, 20
#> 2 2018 11, 13, 12, 14 31, 33, 32, 34 20, 19, 21, 20
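
The renaming works because the inner function names its result with the string passed in the W argument. The same idiom can be written in base R with stats::setNames, as in this small sketch. name_result is a hypothetical helper introduced only for illustration.

```r
# Hypothetical helper: wrap a value in a list whose single name
# is supplied by the caller, as W_func does with the W argument.
name_result <- function(value, name = "W") {
  stats::setNames(list(value), name)
}

default_name <- names(name_result(20))
custom_name  <- names(name_result(20, name = "W_prime"))
```

default_name is "W" and custom_name is "W_prime", so the caller controls the name under which the result appears.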

As with matsindf_apply, column names in midf can be mapped to the arguments of calc_W by the arguments to calc_W.

midf %>% 
  rename(X = U, Y = V) %>% 
  calc_W(U = "X", V = "Y")
#>   Year              X              Y              W
#> 1 2017     1, 3, 2, 4 21, 23, 22, 24 20, 19, 21, 20
#> 2 2018 11, 13, 12, 14 31, 33, 32, 34 20, 19, 21, 20

calc_W can operate on lists of single matrices, too. This approach works, because the default values for the U and V arguments to calc_W are “U” and “V”, respectively. The input list members (in this case midf$U[[1]] and midf$V[[1]]) are returned with the output.

calc_W(list(U = midf$U[[1]], V = midf$V[[1]]))
#> $U
#>    i1 i2
#> p1  1  2
#> p2  3  4
#> attr(,"rowtype")
#> [1] "product"
#> attr(,"coltype")
#> [1] "industry"
#> 
#> $V
#>    p1 p2
#> i1 21 22
#> i2 23 24
#> attr(,"rowtype")
#> [1] "industry"
#> attr(,"coltype")
#> [1] "product"
#> 
#> $W
#>    i1 i2
#> p1 20 21
#> p2 19 20
#> attr(,"rowtype")
#> [1] "product"
#> attr(,"coltype")
#> [1] "industry"

It may be clearer to name the arguments as required by the calc_W function without wrapping in a list first, as shown below. But in this approach, the input matrices are not returned with the output.

calc_W(U = midf$U[[1]], V = midf$V[[1]])
#> $W
#>    i1 i2
#> p1 20 21
#> p2 19 20
#> attr(,"rowtype")
#> [1] "product"
#> attr(,"coltype")
#> [1] "industry"

calc_W can operate on data frames containing single numbers.

data.frame(U = c(1, 2), V = c(3, 4)) %>% calc_W()
#>   U V W
#> 1 1 3 2
#> 2 2 4 2

Finally, calc_W can be applied to single numbers, and the result is a 1x1 matrix.

calc_W(U = 2, V = 3)
#> $W
#>      [,1]
#> [1,]    1

Conclusion

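This vignette demonstrated use of the versatile matsindf_apply function. Inputs to matsindf_apply can be single numbers, lists of single numbers or matrices, data frames of single numbers, or matsindf data frames.

matsindf_apply can be used for programming, and functions constructed as demonstrated above share characteristics with matsindf_apply: they can act as specialized mutate functions on data frames, and they can be applied to lists of matrices or to single numbers or matrices supplied as named arguments.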