Nothing’s ever perfect, and data isn’t either. One type of “imperfection” is missing data, where some features are unobserved for some subjects. (A topic for another post.) Another is censored data, where an event whose characteristics we want to measure does not occur in the observation interval. The example in Richard McElreath’s Statistical Rethinking is time to adoption of cats in an animal shelter. If we fix an interval and only observe wait times for those cats that actually did get adopted, our estimate will end up too optimistic: we fail to take into account the cats that weren’t adopted during this interval and thus would have contributed wait times longer than the complete interval.
In this post, we use a slightly less emotional example which nonetheless may be of interest, especially to R package developers: time to completion of R CMD check, collected from CRAN and provided by the parsnip package as check_times. Here, the censored portion consists of those checks that errored out for whatever reason, i.e., checks that did not complete.
Why do we care about the censored portion? In the cat adoption scenario, this is pretty obvious: we want a realistic estimate for any unknown cat, not just for those cats that will turn out to be “lucky”. How about check_times? Well, if your submission is one of those that errored out, you still care about how long you wait, so even though their percentage is low (< 1%) we don’t want to simply exclude them. Also, there is the possibility that the failing checks would have taken longer, had they run to completion, due to some intrinsic difference between the two groups. Conversely, if failures were random, the longer-running checks would have had a greater chance of getting hit by an error. So here too, excluding the censored data may result in bias.
How can we model durations for that censored portion, where the “true duration” is unknown? Taking one step back, how can we model durations in general? Making as few assumptions as possible, the maximum entropy distribution for displacements (in space or time) is the exponential. Thus, for the checks that actually did complete, durations are assumed to be exponentially distributed.
For the others, all we know is that in a virtual world where the check completed, it would have taken at least as long as the observed duration. This quantity can be modeled by the exponential complementary cumulative distribution function (CCDF). Why? A cumulative distribution function (CDF) gives the probability of observing a value lower than or equal to some reference point; e.g., “the probability of durations <= 255 is 0.9”. Its complement, 1 – CDF, then gives the probability that a value will exceed that reference point.
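As a quick numerical illustration (the rate used here is made up for demonstration purposes), base R’s pexp() gives both quantities for an exponential distribution:
# toy example: an exponential with an assumed mean duration of 100 seconds
rate <- 1 / 100
pexp(255, rate)                      # CDF: P(duration <= 255), roughly 0.92
pexp(255, rate, lower.tail = FALSE)  # CCDF: P(duration > 255), roughly 0.08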
Let’s see this in action.
The data
The following code works with the current stable releases of TensorFlow and TensorFlow Probability, which are 1.14 and 0.7, respectively. If you don’t have tfprobability installed, get it from Github:
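A call along these lines should do it (the repository path is assumed here):
devtools::install_github("rstudio/tfprobability")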
These are the libraries we need. As of TensorFlow 1.14, we call tf$compat$v2$enable_v2_behavior() to run with eager execution.
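Here is a sketch of the library calls, inferred from the functions used throughout this post; the exact set (in particular, the packages providing tidy() and hdi()) is an assumption:
library(tensorflow)
tf$compat$v2$enable_v2_behavior()
library(tfprobability)

library(survival)    # Surv()
library(parsnip)     # surv_reg() and the check_times data
library(rsample)     # initial_split(), training(), testing()
library(broom)       # tidy()
library(tidyverse)   # dplyr, tidyr, purrr, tibble, ggplot2
library(HDInterval)  # hdi()
library(gridExtra)   # grid.arrange()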
Besides the check durations we want to model, check_times reports various features of the package in question, such as number of imported packages, number of dependencies, size of code and documentation files, etc. The status variable indicates whether the check completed or errored out.
df <- check_times %>% select(-package)
glimpse(df)
Observations: 13,626
Variables: 24
$ authors <int> 1, 1, 1, 1, 5, 3, 2, 1, 4, 6, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1,…
$ imports <dbl> 0, 6, 0, 0, 3, 1, 0, 4, 0, 7, 0, 0, 0, 0, 3, 2, 14, 2, 2, 0…
$ suggests <dbl> 2, 4, 0, 0, 2, 0, 2, 2, 0, 0, 2, 8, 0, 0, 2, 0, 1, 3, 0, 0,…
$ depends <dbl> 3, 1, 6, 1, 1, 1, 5, 0, 1, 1, 6, 5, 0, 0, 0, 1, 1, 5, 0, 2,…
$ Roxygen <dbl> 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0,…
$ gh <dbl> 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0,…
$ rforge <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ descr <int> 217, 313, 269, 63, 223, 1031, 135, 344, 204, 335, 104, 163,…
$ r_count <int> 2, 20, 8, 0, 10, 10, 16, 3, 6, 14, 16, 4, 1, 1, 11, 5, 7, 1…
$ r_size <dbl> 0.029053, 0.046336, 0.078374, 0.000000, 0.019080, 0.032607,…
$ ns_import <dbl> 3, 15, 6, 0, 4, 5, 0, 4, 2, 10, 5, 6, 1, 0, 2, 2, 1, 11, 0,…
$ ns_export <dbl> 0, 19, 0, 0, 10, 0, 0, 2, 0, 9, 3, 4, 0, 1, 10, 0, 16, 0, 2…
$ s3_methods <dbl> 3, 0, 11, 0, 0, 0, 0, 2, 0, 23, 0, 0, 2, 5, 0, 4, 0, 0, 0, …
$ s4_methods <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ doc_count <int> 0, 3, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
$ doc_size <dbl> 0.000000, 0.019757, 0.038281, 0.000000, 0.007874, 0.000000,…
$ src_count <int> 0, 0, 0, 0, 0, 0, 0, 2, 0, 5, 3, 0, 0, 0, 0, 0, 0, 54, 0, 0…
$ src_size <dbl> 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000,…
$ data_count <int> 2, 0, 0, 3, 3, 1, 10, 0, 4, 2, 2, 146, 0, 0, 0, 0, 0, 10, 0…
$ data_size <dbl> 0.025292, 0.000000, 0.000000, 4.885864, 4.595504, 0.006500,…
$ testthat_count <int> 0, 8, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 0, 0,…
$ testthat_size <dbl> 0.000000, 0.002496, 0.000000, 0.000000, 0.000000, 0.000000,…
$ check_time <dbl> 49, 101, 292, 21, 103, 46, 78, 91, 47, 196, 200, 169, 45, 2…
$ status <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
Of these 13,626 observations, just 103 are censored:
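A quick count by status confirms this:
# counts of censored (0) vs. completed (1) checks
table(df$status)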
0 1
103 13523
For better readability, we’ll work with a subset of the columns. We use surv_reg to help us find a useful and interesting subset of predictors:
survreg_fit <-
surv_reg(dist = "exponential") %>%
set_engine("survreg") %>%
fit(Surv(check_time, status) ~ .,
data = df)
tidy(survreg_fit)
# A tibble: 23 x 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 3.86 0.0219 176. 0. NA NA
2 authors 0.0139 0.00580 2.40 1.65e- 2 NA NA
3 imports 0.0606 0.00290 20.9 7.49e-97 NA NA
4 suggests 0.0332 0.00358 9.28 1.73e-20 NA NA
5 depends 0.118 0.00617 19.1 5.66e-81 NA NA
6 Roxygen 0.0702 0.0209 3.36 7.87e- 4 NA NA
7 gh 0.00898 0.0217 0.414 6.79e- 1 NA NA
8 rforge 0.0232 0.0662 0.351 7.26e- 1 NA NA
9 descr 0.000138 0.0000337 4.10 4.18e- 5 NA NA
10 r_count 0.00209 0.000525 3.98 7.03e- 5 NA NA
11 r_size 0.481 0.0819 5.87 4.28e- 9 NA NA
12 ns_import 0.00352 0.000896 3.93 8.48e- 5 NA NA
13 ns_export -0.00161 0.000308 -5.24 1.57e- 7 NA NA
14 s3_methods 0.000449 0.000421 1.06 2.87e- 1 NA NA
15 s4_methods -0.00154 0.00206 -0.745 4.56e- 1 NA NA
16 doc_count 0.0739 0.0117 6.33 2.44e-10 NA NA
17 doc_size 2.86 0.517 5.54 3.08e- 8 NA NA
18 src_count 0.0122 0.00127 9.58 9.96e-22 NA NA
19 src_size -0.0242 0.0181 -1.34 1.82e- 1 NA NA
20 data_count 0.0000415 0.000980 0.0423 9.66e- 1 NA NA
21 data_size 0.0217 0.0135 1.61 1.08e- 1 NA NA
22 testthat_count -0.000128 0.00127 -0.101 9.20e- 1 NA NA
23 testthat_size 0.0108 0.0139 0.774 4.39e- 1 NA NA
It seems that if we choose imports, depends, r_size, doc_size, ns_import and ns_export, we end up with a mix of (comparatively) powerful predictors from different semantic spaces and of different scales.
Before pruning the dataframe, we save away the target variable. In our model and training setup, it is convenient to have censored and uncensored data stored separately, so here we create two target matrices instead of one:
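Here is a sketch of that step; the names check_time_c and check_time_nc match how the targets are referenced later when setting up the optimization target:
# target durations for censored (status == 0) and non-censored (status == 1) checks
check_time_c <- df %>% filter(status == 0) %>% select(check_time) %>% as.matrix()
check_time_nc <- df %>% filter(status == 1) %>% select(check_time) %>% as.matrix()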
Now we can zoom in on the variables of interest, setting up one dataframe for the censored data and one for the uncensored data. All predictors are normalized to avoid overflow during sampling. We add a column of 1s for use as an intercept.
df <- df %>% select(status,
depends,
imports,
doc_size,
r_size,
ns_import,
ns_export) %>%
mutate_at(.vars = 2:7, .funs = function(x) (x - min(x))/(max(x)-min(x))) %>%
add_column(intercept = rep(1, nrow(df)), .before = 1)
# dataframe of predictors for censored data
df_c <- df %>% filter(status == 0) %>% select(-status)
# dataframe of predictors for non-censored data
df_nc <- df %>% filter(status == 1) %>% select(-status)
That’s it for preparations. But of course we’re curious. Do check times look different? Do predictors – the ones we chose – look different?
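The tables below can be computed along these lines (the quantile levels shown here are an assumption; only the overall pattern matters):
probs <- c(0.1, 0.25, 0.5, 0.75, 0.9, 1)
# check times for completed vs. errored-out checks
quantile(check_times$check_time[check_times$status == 1], probs)
quantile(check_times$check_time[check_times$status == 0], probs)
# the same comparison works per predictor, e.g. for depends:
quantile(check_times$depends[check_times$status == 1], probs)
quantile(check_times$depends[check_times$status == 0], probs)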
Comparing a few meaningful percentiles for both classes, we see that durations for uncompleted checks are higher than those for completed checks throughout, apart from the 100th percentile. Given the enormous difference in sample size, it’s not surprising that the maximum duration is higher for completed checks. Otherwise though, doesn’t it look like the errored-out package checks “were going to take longer”?
completed | 36 | 54 | 79 | 115 | 211 | 1343 |
not completed | 42 | 71 | 97 | 143 | 293 | 696 |
How about the predictors? We don’t see any differences for depends, the number of package dependencies (apart from, again, the higher maximum reached for packages whose check completed):
completed | 0 | 1 | 1 | 2 | 4 | 12 |
not completed | 0 | 1 | 1 | 2 | 4 | 7 |
But for all others, we see the same pattern as reported above for check_time. The number of imported packages is higher for censored data at all percentiles besides the maximum:
completed | 0 | 0 | 2 | 4 | 9 | 43 |
not completed | 0 | 1 | 5 | 8 | 12 | 22 |
Same for ns_export, the estimated number of exported functions or methods:
completed | 0 | 1 | 2 | 8 | 26 | 2547 |
not completed | 0 | 1 | 5 | 13 | 34 | 336 |
As well as for ns_import, the estimated number of imported functions or methods:
completed | 0 | 1 | 3 | 6 | 19 | 312 |
not completed | 0 | 2 | 5 | 11 | 23 | 297 |
Same pattern for r_size, the size on disk of files in the R directory:
completed | 0.005 | 0.015 | 0.031 | 0.063 | 0.176 | 3.746 |
not completed | 0.008 | 0.019 | 0.041 | 0.097 | 0.217 | 2.148 |
And finally, we see it for doc_size too, where doc_size is the size of .Rmd and .Rnw files:
completed | 0.000 | 0.000 | 0.000 | 0.000 | 0.023 | 0.988 |
not completed | 0.000 | 0.000 | 0.000 | 0.011 | 0.042 | 0.114 |
Given our task at hand – modeling check durations taking into account censored as well as uncensored data – we won’t dwell on differences between the two groups any longer; nonetheless, we thought it interesting to relate these numbers.
So now, back to work. We need to create a model.
The model
As explained in the introduction, durations of completed checks are modeled using an exponential PDF. This is as straightforward as adding tfd_exponential() to the model function, tfd_joint_distribution_sequential(). For the censored portion, we need the exponential CCDF, which, as of today, is not easily added to the model. What we can do, though, is calculate its value ourselves and add it to the “main” model likelihood. We’ll see this below when setting up the optimization target; for now it means the model definition ends up straightforward, as it covers only the non-censored data. It consists of just said exponential PDF and the priors for the regression parameters.
As for the latter, we use 0-centered Gaussian priors for all parameters. Standard deviations of 1 turned out to work well. As the priors are all the same, instead of listing a bunch of tfd_normals, we can create them all at once as
tfd_sample_distribution(tfd_normal(0, 1), sample_shape = 7)
Mean check time is modeled as an affine combination of the six predictors and the intercept. Here then is the complete model, instantiated using the uncensored data only:
model <- function(data) {
tfd_joint_distribution_sequential(
list(
tfd_sample_distribution(tfd_normal(0, 1), sample_shape = 7),
function(betas)
tfd_independent(
tfd_exponential(
rate = 1 / tf$math$exp(tf$transpose(
tf$matmul(tf$cast(data, betas$dtype), tf$transpose(betas))))),
reinterpreted_batch_ndims = 1)))
}
m <- model(df_nc %>% as.matrix())
As always, we test whether samples from that model have the expected shapes:
samples <- m %>% tfd_sample(2)
samples
[[1]]
tf.Tensor(
[[ 1.4184642 0.17583323 -0.06547955 -0.2512014 0.1862184 -1.2662812
1.0231884 ]
[-0.52142304 -1.0036682 2.2664437 1.29737 1.1123234 0.3810004
0.1663677 ]], shape=(2, 7), dtype=float32)
[[2]]
tf.Tensor(
[[4.4954767 7.865639 1.8388556 ... 7.914391 2.8485563 3.859719 ]
[1.549662 0.77833986 0.10015647 ... 0.40323067 3.42171 0.69368565]], shape=(2, 13523), dtype=float32)
This looks fine: We have a list of length two, one element for each distribution in the model. For both tensors, dimension 1 reflects the batch size (which we arbitrarily set to 2 in this test), while dimension 2 is 7 (the number of normal priors) for the first tensor and 13523 (the number of durations predicted) for the second.
How likely are these samples?
m %>% tfd_log_prob(samples)
tf.Tensor([-32464.521 -7693.4023], shape=(2,), dtype=float32)
Here too, the shape is correct, and the values look reasonable.
The next thing to do is define the target we want to optimize.
Optimization target
Abstractly, the thing to maximize is the log probability of the data – that is, the measured durations – under the model.
Now here the data comes in two parts, and the target does as well. First, we have the non-censored data, for which
m %>% tfd_log_prob(list(betas, tf$cast(target_nc, betas$dtype)))
will calculate the log probability. Second, to obtain log probability for the censored data we write a custom function that calculates the log of the exponential CCDF:
get_exponential_lccdf <- function(betas, data, target) {
e <- tfd_independent(tfd_exponential(rate = 1 / tf$math$exp(tf$transpose(tf$matmul(
tf$cast(data, betas$dtype), tf$transpose(betas)
)))),
reinterpreted_batch_ndims = 1)
cum_prob <- e %>% tfd_cdf(tf$cast(target, betas$dtype))
tf$math$log(1 - cum_prob)
}
Both parts are combined in a little wrapper function that allows us to compare training including and excluding the censored data. We won’t do that in this post, but you might be interested to do it with your own data, especially if the ratio of censored and uncensored parts is a little less imbalanced.
get_log_prob <-
function(target_nc,
censored_data = NULL,
target_c = NULL) {
log_prob <- function(betas) {
log_prob <-
m %>% tfd_log_prob(list(betas, tf$cast(target_nc, betas$dtype)))
potential <-
if (!is.null(censored_data) && !is.null(target_c))
get_exponential_lccdf(betas, censored_data, target_c)
else
0
log_prob + potential
}
log_prob
}
log_prob <-
get_log_prob(
check_time_nc %>% tf$transpose(),
df_c %>% as.matrix(),
check_time_c %>% tf$transpose()
)
Sampling
With model and target defined, we’re ready to do sampling.
n_chains <- 4
n_burnin <- 1000
n_steps <- 1000
# keep track of some diagnostic output, acceptance and step size
trace_fn <- function(state, pkr) {
list(
pkr$inner_results$is_accepted,
pkr$inner_results$accepted_results$step_size
)
}
# get shape of initial values
# to start sampling without producing NaNs, we will feed the algorithm
# tf$ones_like(initial_betas)
# instead
initial_betas <- (m %>% tfd_sample(n_chains))[[1]]
For the number of leapfrog steps and the step size, experimentation showed that a combination of 64 / 0.1 yielded reasonable results:
hmc <- mcmc_hamiltonian_monte_carlo(
target_log_prob_fn = log_prob,
num_leapfrog_steps = 64,
step_size = 0.1
) %>%
mcmc_simple_step_size_adaptation(target_accept_prob = 0.8,
num_adaptation_steps = n_burnin)
run_mcmc <- function(kernel) {
kernel %>% mcmc_sample_chain(
num_results = n_steps,
num_burnin_steps = n_burnin,
current_state = tf$ones_like(initial_betas),
trace_fn = trace_fn
)
}
# important for performance: run HMC in graph mode
run_mcmc <- tf_function(run_mcmc)
res <- hmc %>% run_mcmc()
samples <- res$all_states
Results
Before we inspect the chains, here is a quick look at the proportion of accepted steps and the per-parameter mean step size:
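These can be extracted from the quantities collected by trace_fn; a sketch, assuming they come back as res$trace:
# proportion of accepted proposals, averaged over samples and chains
res$trace[[1]] %>% as.array() %>% mean()
# mean step size after adaptation
res$trace[[2]] %>% as.array() %>% mean()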
0.995
0.004953894
We also store away effective sample sizes and the rhat metrics for later addition to the synopsis.
effective_sample_size <- mcmc_effective_sample_size(samples) %>%
as.matrix() %>%
apply(2, mean)
potential_scale_reduction <- mcmc_potential_scale_reduction(samples) %>%
as.numeric()
We then convert the samples tensor to an R array for use in postprocessing.
# 7-item list (one item per parameter), where each item has dim (1000, 4)
samples <- as.array(samples) %>% array_branch(margin = 3)
How well did the sampling work? The chains mix well, but for some parameters, autocorrelation is still pretty high.
prep_tibble <- function(samples) {
as_tibble(samples,
.name_repair = ~ c("chain_1", "chain_2", "chain_3", "chain_4")) %>%
add_column(sample = 1:n_steps) %>%
gather(key = "chain", value = "value",-sample)
}
plot_trace <- function(samples) {
prep_tibble(samples) %>%
ggplot(aes(x = sample, y = value, color = chain)) +
geom_line() +
theme_light() +
theme(
legend.position = "none",
axis.title = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank()
)
}
plot_traces <- function(samples) {
plots <- purrr::map(samples, plot_trace)
do.call(grid.arrange, plots)
}
plot_traces(samples)
Now for a synopsis of posterior parameter statistics, including the usual per-parameter sampling indicators effective sample size and rhat.
all_samples <- map(samples, as.vector)
means <- map_dbl(all_samples, mean)
sds <- map_dbl(all_samples, sd)
hpdis <- map(all_samples, ~ hdi(.x) %>% t() %>% as_tibble())
summary <- tibble(
mean = means,
sd = sds,
hpdi = hpdis
) %>% unnest() %>%
add_column(param = colnames(df_c), .before = 1) %>%
add_column(
n_effective = effective_sample_size,
rhat = potential_scale_reduction
)
summary
# A tibble: 7 x 7
param mean sd lower upper n_effective rhat
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 intercept 4.05 0.0158 4.02 4.08 508. 1.17
2 depends 1.34 0.0732 1.18 1.47 1000 1.00
3 imports 2.89 0.121 2.65 3.12 1000 1.00
4 doc_size 6.18 0.394 5.40 6.94 177. 1.01
5 r_size 2.93 0.266 2.42 3.46 289. 1.00
6 ns_import 1.54 0.274 0.987 2.06 387. 1.00
7 ns_export -0.237 0.675 -1.53 1.10 66.8 1.01
From the diagnostics and trace plots, the model seems to work reasonably well, but as there is no straightforward error metric involved, it’s hard to know if actual predictions would even land in an appropriate range.
To make sure they do, we inspect predictions from our model as well as from surv_reg. This time, we also split the data into training and test sets. Here first are the predictions from surv_reg:
train_test_split <- initial_split(check_times, strata = "status")
check_time_train <- training(train_test_split)
check_time_test <- testing(train_test_split)
survreg_fit <-
surv_reg(dist = "exponential") %>%
set_engine("survreg") %>%
fit(Surv(check_time, status) ~ depends + imports + doc_size + r_size +
ns_import + ns_export,
data = check_time_train)
tidy(survreg_fit)
# A tibble: 7 x 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 4.05 0.0174 234. 0. NA NA
2 depends 0.108 0.00701 15.4 3.40e-53 NA NA
3 imports 0.0660 0.00327 20.2 1.09e-90 NA NA
4 doc_size 7.76 0.543 14.3 2.24e-46 NA NA
5 r_size 0.812 0.0889 9.13 6.94e-20 NA NA
6 ns_import 0.00501 0.00103 4.85 1.22e- 6 NA NA
7 ns_export -0.000212 0.000375 -0.566 5.71e- 1 NA NA
For the MCMC model, we re-train on just the training set and obtain the parameter summary. The code is analogous to the above and not shown here.
We can now predict on the test set, for simplicity just using the posterior means:
df <- check_time_test %>% select(
depends,
imports,
doc_size,
r_size,
ns_import,
ns_export) %>%
add_column(intercept = rep(1, nrow(check_time_test)), .before = 1)
mcmc_pred <- df %>% as.matrix() %*% summary$mean %>% exp() %>% as.numeric()
mcmc_pred <- check_time_test %>% select(check_time, status) %>%
add_column(.pred = mcmc_pred)
ggplot(mcmc_pred, aes(x = check_time, y = .pred, color = factor(status))) +
geom_point() +
coord_cartesian(ylim = c(0, 1400))
This looks good!
Wrapup
We’ve shown how to model censored data – or rather, a frequent subtype thereof involving durations – using tfprobability. The check_times data from parsnip were a fun choice, but this modeling technique may be even more useful when censoring is more substantial. Hopefully this post has provided some guidance on how to handle censored data in your own work. Thanks for reading!