Census-level data refers to a data set wherein there is one row per policy. Exposure-level data expands census-level data such that there is one record per policy per observation period. Observation periods could be any meaningful period of time such as a policy year, policy month, calendar year, calendar quarter, calendar month, etc.
A common step in experience studies is converting census-level data
into exposure-level data. The expose()
family of functions
assists with this task. Specifically, the expose()
family:
NA
for all periods except the last.
To get started, we’re going to use a toy census data frame from the actxps package that contains 3 policies: one active, one that terminated due to death, and one that terminated due to surrender.
toy_census contains the 4 columns necessary to compute exposures:
pol_num: a unique identifier for individual policies
status: the policy status
issue_date: issue date
term_date: termination date, if any. Otherwise NA
library(actxps)
library(dplyr)
toy_census
#> pol_num status issue_date term_date
#> 1 1 Active 2010-01-01 <NA>
#> 2 2 Death 2011-05-27 2020-09-14
#> 3 3 Surrender 2009-11-10 2022-02-25
Let’s assume we’re performing an experience study as of 2022-12-31 and we’re interested in policy year exposures. Here’s what we should expect for our 3 policies.
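As a rough cross-check on the expected number of policy-year rows, the sketch below counts policy anniversaries in base R. The helper count_policy_years() is hypothetical and not part of actxps; it assumes one row is created for every policy year that begins on or before the earlier of the termination date and the study end date.

```r
# Hypothetical helper (not an actxps function): count the policy years that
# begin on or before the earlier of the termination date and the study end.
count_policy_years <- function(issue_date, term_date, end_date) {
  issue_date <- as.Date(issue_date)
  # A missing term_date means the policy is still active at the study end
  last_date <- min(c(as.Date(term_date), as.Date(end_date)), na.rm = TRUE)
  # One element per policy anniversary from issue through last_date
  length(seq(issue_date, last_date, by = "year"))
}

study_end <- "2022-12-31"
expected_rows <- c(
  policy_1 = count_policy_years("2010-01-01", NA, study_end),
  policy_2 = count_policy_years("2011-05-27", "2020-09-14", study_end),
  policy_3 = count_policy_years("2009-11-10", "2022-02-25", study_end)
)
expected_rows
```

Under these assumptions the counts come out to 13, 10, and 13 rows for policies 1, 2, and 3.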
To calculate exposures, we pass our data to the expose()
function and we specify a study end_date
.
exposed_data <- expose(toy_census, end_date = "2022-12-31")
This creates an exposed_df
object, which is a type of
data frame with some additional attributes related to the experience
study.
is_exposed_df(exposed_data)
#> [1] TRUE
Let’s examine what happened to each policy.
Policy 1: As expected, there are 13 rows for this
policy. New columns were added for the policy year
(pol_yr
), anniversary (pol_date_yr
), and
exposure. All exposures are 100% since this policy was active for all 13
years.
When the data is printed, additional attributes from the
exposed_df
class are displayed.
exposed_data |> filter(pol_num == 1)
#> Exposure data
#>
#> Exposure type: policy_year
#> Target status:
#> Study range: 1900-01-01 to 2022-12-31
#>
#> # A tibble: 13 × 8
#> pol_num status issue_date term_date pol_yr pol_date_yr pol_date_yr_…¹ expos…²
#> <int> <fct> <date> <date> <int> <date> <date> <dbl>
#> 1 1 Active 2010-01-01 NA 1 2010-01-01 2010-12-31 1
#> 2 1 Active 2010-01-01 NA 2 2011-01-01 2011-12-31 1
#> 3 1 Active 2010-01-01 NA 3 2012-01-01 2012-12-31 1
#> 4 1 Active 2010-01-01 NA 4 2013-01-01 2013-12-31 1
#> 5 1 Active 2010-01-01 NA 5 2014-01-01 2014-12-31 1
#> 6 1 Active 2010-01-01 NA 6 2015-01-01 2015-12-31 1
#> 7 1 Active 2010-01-01 NA 7 2016-01-01 2016-12-31 1
#> 8 1 Active 2010-01-01 NA 8 2017-01-01 2017-12-31 1
#> 9 1 Active 2010-01-01 NA 9 2018-01-01 2018-12-31 1
#> 10 1 Active 2010-01-01 NA 10 2019-01-01 2019-12-31 1
#> 11 1 Active 2010-01-01 NA 11 2020-01-01 2020-12-31 1
#> 12 1 Active 2010-01-01 NA 12 2021-01-01 2021-12-31 1
#> 13 1 Active 2010-01-01 NA 13 2022-01-01 2022-12-31 1
#> # … with abbreviated variable names ¹pol_date_yr_end, ²exposure
Policy 2: There are 10 rows for this policy. The
first 9 periods show the policy in an active status
and the
termination date (term_date
) is set to NA
. The
last period includes the final status of “Death” and the actual
termination date. The last exposure is less than one because roughly a
third of a year elapsed between the last anniversary date on 2020-05-27
and the termination date on 2020-09-14.
exposed_data |> filter(pol_num == 2)
#> Exposure data
#>
#> Exposure type: policy_year
#> Target status:
#> Study range: 1900-01-01 to 2022-12-31
#>
#> # A tibble: 10 × 8
#> pol_num status issue_date term_date pol_yr pol_date_yr pol_date_yr…¹ expos…²
#> <int> <fct> <date> <date> <int> <date> <date> <dbl>
#> 1 2 Active 2011-05-27 NA 1 2011-05-27 2012-05-26 1
#> 2 2 Active 2011-05-27 NA 2 2012-05-27 2013-05-26 1
#> 3 2 Active 2011-05-27 NA 3 2013-05-27 2014-05-26 1
#> 4 2 Active 2011-05-27 NA 4 2014-05-27 2015-05-26 1
#> 5 2 Active 2011-05-27 NA 5 2015-05-27 2016-05-26 1
#> 6 2 Active 2011-05-27 NA 6 2016-05-27 2017-05-26 1
#> 7 2 Active 2011-05-27 NA 7 2017-05-27 2018-05-26 1
#> 8 2 Active 2011-05-27 NA 8 2018-05-27 2019-05-26 1
#> 9 2 Active 2011-05-27 NA 9 2019-05-27 2020-05-26 1
#> 10 2 Death 2011-05-27 2020-09-14 10 2020-05-27 2021-05-26 0.304
#> # … with abbreviated variable names ¹pol_date_yr_end, ²exposure
Policy 3: There are 13 rows for this policy. The
first 12 periods show the policy in an active status
and
the termination date (term_date
) is set to NA
.
The last period includes the final status of “Surrender” and the actual
termination date. The last exposure is less than one because roughly
a third of a year elapsed between the last anniversary date on
2021-11-10 and the termination date on 2022-02-25.
exposed_data |> filter(pol_num == 3)
#> Exposure data
#>
#> Exposure type: policy_year
#> Target status:
#> Study range: 1900-01-01 to 2022-12-31
#>
#> # A tibble: 13 × 8
#> pol_num status issue_date term_date pol_yr pol_date_yr pol_date…¹ expos…²
#> <int> <fct> <date> <date> <int> <date> <date> <dbl>
#> 1 3 Active 2009-11-10 NA 1 2009-11-10 2010-11-09 1
#> 2 3 Active 2009-11-10 NA 2 2010-11-10 2011-11-09 1
#> 3 3 Active 2009-11-10 NA 3 2011-11-10 2012-11-09 1
#> 4 3 Active 2009-11-10 NA 4 2012-11-10 2013-11-09 1
#> 5 3 Active 2009-11-10 NA 5 2013-11-10 2014-11-09 1
#> 6 3 Active 2009-11-10 NA 6 2014-11-10 2015-11-09 1
#> 7 3 Active 2009-11-10 NA 7 2015-11-10 2016-11-09 1
#> 8 3 Active 2009-11-10 NA 8 2016-11-10 2017-11-09 1
#> 9 3 Active 2009-11-10 NA 9 2017-11-10 2018-11-09 1
#> 10 3 Active 2009-11-10 NA 10 2018-11-10 2019-11-09 1
#> 11 3 Active 2009-11-10 NA 11 2019-11-10 2020-11-09 1
#> 12 3 Active 2009-11-10 NA 12 2020-11-10 2021-11-09 1
#> 13 3 Surrender 2009-11-10 2022-02-25 13 2021-11-10 2022-11-09 0.296
#> # … with abbreviated variable names ¹pol_date_yr_end, ²exposure
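The partial exposures for policies 2 and 3 can be reproduced with simple day counts. The sketch below is an assumption about the arithmetic, not a description of how actxps computes exposures internally: the hypothetical helper partial_exposure() counts both the anniversary date and the termination date, then divides by the number of days in the policy year.

```r
# Hypothetical helper (not an actxps function): fraction of a period elapsed
# from period_start through term_date, counting both end points.
partial_exposure <- function(period_start, period_end, term_date) {
  period_start <- as.Date(period_start)
  period_end <- as.Date(period_end)
  term_date <- as.Date(term_date)
  days_exposed <- as.numeric(term_date - period_start) + 1
  days_in_period <- as.numeric(period_end - period_start) + 1
  days_exposed / days_in_period
}

# Policy 2: 111 of 365 days in policy year 10
expo_pol_2 <- round(partial_exposure("2020-05-27", "2021-05-26", "2020-09-14"), 3)
# Policy 3: 108 of 365 days in policy year 13
expo_pol_3 <- round(partial_exposure("2021-11-10", "2022-11-09", "2022-02-25"), 3)
c(expo_pol_2, expo_pol_3)
```

With this day-count convention the results are 0.304 and 0.296, matching the last rows printed above.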
The previous section only supplied data and a study
end_date
to expose()
. These are the minimum
required arguments for the function. Optionally, a
start_date
can be supplied that will drop exposure periods
that begin before a specified date.
expose(toy_census, end_date = "2022-12-31", start_date = "2019-12-31")
#> Exposure data
#>
#> Exposure type: policy_year
#> Target status:
#> Study range: 2019-12-31 to 2022-12-31
#>
#> # A tibble: 6 × 8
#> pol_num status issue_date term_date pol_yr pol_date_yr pol_date_…¹ expos…²
#> * <int> <fct> <date> <date> <int> <date> <date> <dbl>
#> 1 1 Active 2010-01-01 NA 11 2020-01-01 2020-12-31 1
#> 2 1 Active 2010-01-01 NA 12 2021-01-01 2021-12-31 1
#> 3 1 Active 2010-01-01 NA 13 2022-01-01 2022-12-31 1
#> 4 2 Death 2011-05-27 2020-09-14 10 2020-05-27 2021-05-26 0.304
#> 5 3 Active 2009-11-10 NA 12 2020-11-10 2021-11-09 1
#> 6 3 Surrender 2009-11-10 2022-02-25 13 2021-11-10 2022-11-09 0.296
#> # … with abbreviated variable names ¹pol_date_yr_end, ²exposure
Most experience studies use the annual exposure method, which allocates a full period of exposure to the termination event of interest in the scope of the study.
The intuition for this approach is simple: let’s assume we have an unrealistically small study with a single data point for one policy over the course of one year. Let’s assume that policy terminated due to surrender halfway through the year.
If we don’t apply the annual exposure method, we would calculate a termination rate as:
\[ q^{surr} = \frac{claims}{exposures} = \frac{1}{0.5} = 200\% \]
A termination rate of 200% doesn’t make any sense. Under the annual exposure method we would see a rate of 100%, which is intuitive.
\[ q^{surr} = \frac{claims}{exposures} = \frac{1}{1} = 100\% \]
The annual exposure method can be applied by passing a character
vector of target statuses to the expose()
function.
Let’s assume we are performing a surrender study.
exposed_data2 <- expose(toy_census, end_date = "2022-12-31",
                        target_status = "Surrender")
Now let’s verify that the exposure on the surrendered policy increased to 100% in the last exposure period.
exposed_data2 |>
  group_by(pol_num) |>
  slice_max(pol_yr)
#> # A tibble: 3 × 8
#> # Groups: pol_num [3]
#> pol_num status issue_date term_date pol_yr pol_date_yr pol_date_…¹ expos…²
#> <int> <fct> <date> <date> <int> <date> <date> <dbl>
#> 1 1 Active 2010-01-01 NA 13 2022-01-01 2022-12-31 1
#> 2 2 Death 2011-05-27 2020-09-14 10 2020-05-27 2021-05-26 0.304
#> 3 3 Surrender 2009-11-10 2022-02-25 13 2021-11-10 2022-11-09 1
#> # … with abbreviated variable names ¹pol_date_yr_end, ²exposure
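One way to picture the adjustment is as an override on the final exposure period. The base R sketch below is illustrative only and is not the package's implementation: it takes the last-period exposures printed earlier for the three toy policies and sets exposure to 1 wherever the status matches the target.

```r
# Last exposure period of each toy policy, before any target status is applied
last_periods <- data.frame(
  pol_num = 1:3,
  status = c("Active", "Death", "Surrender"),
  exposure = c(1, 0.304, 0.296)
)

target_status <- "Surrender"

# Annual exposure method: rows ending in a target status get a full period;
# every other row keeps its original exposure.
last_periods$exposure_annual <- ifelse(
  last_periods$status %in% target_status,
  1,
  last_periods$exposure
)
last_periods
```

Only the surrendered policy changes. The death on policy 2 keeps its partial exposure because "Death" is not a target status in a surrender study.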
The default exposure basis used by expose()
is policy
years. Other exposure periods can be selected with the cal_expo and expo_length arguments.
If cal_expo
is set to TRUE
, calendar year
exposures will be calculated.
Looking at the second policy, we can see that the first year is left-censored because the policy was issued two-fifths of the way through the year, and the last period is right-censored because the policy terminated roughly seven-tenths of the way through the year.
toy_census[2, ] |>
  expose(end_date = "2022-12-31", cal_expo = TRUE, target_status = "Surrender")
#> Exposure data
#>
#> Exposure type: calendar_year
#> Target status: Surrender
#> Study range: 1900-01-01 to 2022-12-31
#>
#> # A tibble: 10 × 7
#> pol_num status issue_date term_date cal_yr cal_yr_end exposure
#> * <int> <fct> <date> <date> <date> <date> <dbl>
#> 1 2 Active 2011-05-27 NA 2011-01-01 2011-12-31 0.6
#> 2 2 Active 2011-05-27 NA 2012-01-01 2012-12-31 1
#> 3 2 Active 2011-05-27 NA 2013-01-01 2013-12-31 1
#> 4 2 Active 2011-05-27 NA 2014-01-01 2014-12-31 1
#> 5 2 Active 2011-05-27 NA 2015-01-01 2015-12-31 1
#> 6 2 Active 2011-05-27 NA 2016-01-01 2016-12-31 1
#> 7 2 Active 2011-05-27 NA 2017-01-01 2017-12-31 1
#> 8 2 Active 2011-05-27 NA 2018-01-01 2018-12-31 1
#> 9 2 Active 2011-05-27 NA 2019-01-01 2019-12-31 1
#> 10 2 Death 2011-05-27 2020-09-14 2020-01-01 2020-12-31 0.705
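The first and last calendar-year exposures can be checked with a day count. This is a sketch under an assumed convention (both end points counted, divided by the number of days in the calendar year), and the helper cal_year_exposure() is hypothetical rather than part of actxps.

```r
# Hypothetical helper (not an actxps function): share of a calendar year
# during which a policy is in force, counting both end points.
cal_year_exposure <- function(year, first_day, last_day) {
  year_start <- as.Date(paste0(year, "-01-01"))
  year_end <- as.Date(paste0(year, "-12-31"))
  # Clamp the in-force window to the calendar year
  first_day <- max(as.Date(first_day), year_start)
  last_day <- min(as.Date(last_day), year_end)
  days_in_force <- as.numeric(last_day - first_day) + 1
  days_in_year <- as.numeric(year_end - year_start) + 1
  days_in_force / days_in_year
}

# 2011: in force from the issue date through year end (219 of 365 days)
expo_2011 <- round(cal_year_exposure(2011, "2011-05-27", "2011-12-31"), 3)
# 2020: in force from 1 January through the death date (258 of 366 days)
expo_2020 <- round(cal_year_exposure(2020, "2020-01-01", "2020-09-14"), 3)
c(expo_2011, expo_2020)
```

These come out to 0.6 and 0.705, matching the first and last rows of the output above. Note that 2020 is a leap year, so its denominator is 366.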
The length of the exposure period can be decreased by passing
"quarter"
, "month"
, or "week"
to
the expo_length
argument. This can be used with policy or
calendar-based exposures.
toy_census[2, ] |>
  expose(end_date = "2022-12-31",
         cal_expo = TRUE,
         expo_length = "quarter",
         target_status = "Surrender")
#> Exposure data
#>
#> Exposure type: calendar_quarter
#> Target status: Surrender
#> Study range: 1900-01-01 to 2022-12-31
#>
#> # A tibble: 38 × 7
#> pol_num status issue_date term_date cal_qtr cal_qtr_end exposure
#> * <int> <fct> <date> <date> <date> <date> <dbl>
#> 1 2 Active 2011-05-27 NA 2011-04-01 2011-06-30 0.385
#> 2 2 Active 2011-05-27 NA 2011-07-01 2011-09-30 1
#> 3 2 Active 2011-05-27 NA 2011-10-01 2011-12-31 1
#> 4 2 Active 2011-05-27 NA 2012-01-01 2012-03-31 1
#> 5 2 Active 2011-05-27 NA 2012-04-01 2012-06-30 1
#> 6 2 Active 2011-05-27 NA 2012-07-01 2012-09-30 1
#> 7 2 Active 2011-05-27 NA 2012-10-01 2012-12-31 1
#> 8 2 Active 2011-05-27 NA 2013-01-01 2013-03-31 1
#> 9 2 Active 2011-05-27 NA 2013-04-01 2013-06-30 1
#> 10 2 Active 2011-05-27 NA 2013-07-01 2013-09-30 1
#> # … with 28 more rows
The following functions are convenience wrappers around
expose()
that can be used to target a specific exposure
type without specifying cal_expo
and
expo_length
.
expose_py = exposures by policy year
expose_pq = exposures by policy quarter
expose_pm = exposures by policy month
expose_pw = exposures by policy week
expose_cy = exposures by calendar year
expose_cq = exposures by calendar quarter
expose_cm = exposures by calendar month
expose_cw = exposures by calendar week
For machine learning feature engineering, the actxps package contains
a function called step_expose()
that is compatible with the
recipes package from tidymodels. This function can be used to apply the
expose()
function within a recipe.
library(recipes)
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#>
#> step
expo_rec <- recipe(status ~ ., toy_census) |>
  step_expose(end_date = "2022-12-31", target_status = "Surrender",
              options = list(expo_length = "month")) |>
  prep()
expo_rec
#> Recipe
#>
#> Inputs:
#>
#> role #variables
#> outcome 1
#> predictor 3
#>
#> Training data contained 3 data points and 1 incomplete row.
#>
#> Operations:
#>
#> Exposed data based on policy months for target status Surrender <none> [trained]
tidy(expo_rec, number = 1)
#> # A tibble: 1 × 4
#> exposure_type target_status start_date end_date
#> <chr> <chr> <date> <chr>
#> 1 policy_month Surrender 1900-01-01 2022-12-31
bake(expo_rec, new_data = NULL)
#> # A tibble: 416 × 7
#> issue_date term_date status pol_mth pol_date_mth pol_date_mth_end exposure
#> <date> <date> <fct> <int> <date> <date> <dbl>
#> 1 2010-01-01 NA Active 1 2010-01-01 2010-01-31 1
#> 2 2010-01-01 NA Active 2 2010-02-01 2010-02-28 1
#> 3 2010-01-01 NA Active 3 2010-03-01 2010-03-31 1
#> 4 2010-01-01 NA Active 4 2010-04-01 2010-04-30 1
#> 5 2010-01-01 NA Active 5 2010-05-01 2010-05-31 1
#> 6 2010-01-01 NA Active 6 2010-06-01 2010-06-30 1
#> 7 2010-01-01 NA Active 7 2010-07-01 2010-07-31 1
#> 8 2010-01-01 NA Active 8 2010-08-01 2010-08-31 1
#> 9 2010-01-01 NA Active 9 2010-09-01 2010-09-30 1
#> 10 2010-01-01 NA Active 10 2010-10-01 2010-10-31 1
#> # … with 406 more rows
As a default, the expose()
functions assume the census
data frame uses the following naming conventions:
pol_num
status
issue_date
term_date
These default names can be overridden using the
col_pol_num
, col_status
,
col_issue_date
, and col_term_date
arguments.
For example, if the policy number column was called id
in our census-level data, we could write:
expose(toy_census, end_date = "2022-12-31",
       target_status = "Surrender",
       col_pol_num = "id")
If the census-level data contains other policy attributes like plan type or policy values, they will be broadcast across all exposure periods. Depending on the nature of the data, this may or may not be desirable. Constant policy attributes like plan type make sense to broadcast, but numeric values may or may not depending on the circumstances.
toy_census2 <- toy_census |>
  mutate(plan_type = c("X", "Y", "Z"),
         policy_value = c(100, 125, 90))
expose(toy_census2, end_date = "2022-12-31",
       target_status = "Surrender")
#> Exposure data
#>
#> Exposure type: policy_year
#> Target status: Surrender
#> Study range: 1900-01-01 to 2022-12-31
#>
#> # A tibble: 36 × 10
#> pol_num status issue_date term_date plan_type policy_value pol_yr pol_date_yr
#> * <int> <fct> <date> <date> <chr> <dbl> <int> <date>
#> 1 1 Active 2010-01-01 NA X 100 1 2010-01-01
#> 2 1 Active 2010-01-01 NA X 100 2 2011-01-01
#> 3 1 Active 2010-01-01 NA X 100 3 2012-01-01
#> 4 1 Active 2010-01-01 NA X 100 4 2013-01-01
#> 5 1 Active 2010-01-01 NA X 100 5 2014-01-01
#> 6 1 Active 2010-01-01 NA X 100 6 2015-01-01
#> 7 1 Active 2010-01-01 NA X 100 7 2016-01-01
#> 8 1 Active 2010-01-01 NA X 100 8 2017-01-01
#> 9 1 Active 2010-01-01 NA X 100 9 2018-01-01
#> 10 1 Active 2010-01-01 NA X 100 10 2019-01-01
#> # … with 26 more rows, and 2 more variables: pol_date_yr_end <date>,
#> # exposure <dbl>
If your experience study requires a numeric feature that varies over
time (e.g., policy values, crediting rates, etc.), you can always attach
it to an exposed_df
object using a join function.
# Illustrative example - assume `values` is a data frame containing the columns pol_num and pol_yr.
exposed_data |>
  left_join(values, by = c("pol_num", "pol_yr"))
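To make the join concrete, here is a small self-contained sketch. Both data frames are hypothetical stand-ins: exposures mimics the key columns of an exposed_df, and values holds a made-up time-varying policy value keyed by policy number and policy year.

```r
library(dplyr)

# Hypothetical stand-in for exposure-level data (one row per policy year)
exposures <- data.frame(
  pol_num = c(1L, 1L, 2L),
  pol_yr = c(1L, 2L, 1L),
  exposure = c(1, 1, 1)
)

# Hypothetical time-varying policy values; policy 2 has no record
values <- data.frame(
  pol_num = c(1L, 1L),
  pol_yr = c(1L, 2L),
  policy_value = c(100, 104)
)

# A left join keeps every exposure row and fills unmatched rows with NA
joined <- exposures |>
  left_join(values, by = c("pol_num", "pol_yr"))
joined
```

A left join preserves the exposure rows, so exposure totals are unchanged as long as values has at most one row per pol_num and pol_yr combination. Duplicate keys in values would duplicate exposure rows, so it is worth checking for them before joining.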
The expose()
family does not support: