The R package case-based-reasoning provides an R interface to case-based reasoning using machine learning.

Install the released version from CRAN with `install.packages("CaseBasedReasoning")`, or install the development version from GitHub:

```
install.packages("devtools")
devtools::install_github("sipemu/case-based-reasoning")
```

This R package provides two methods for case-based reasoning with respect to an endpoint:

- Linear, logistic, and Cox regression
- Proximity and depth measures extracted from a fitted random forest (ranger package)

Besides searching for similar cases, some additional features are included:

- automatic validation of the key variables between the query and the similar cases data set
- checking the proportional hazards assumption for the Cox model
- C++ functions for distance calculation
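To illustrate what a proportional hazards check looks like, here is a minimal sketch that calls `cox.zph()` from the `survival` package directly on the `ovarian` data. Whether the package runs the check internally in exactly this way is an assumption; the sketch only shows the kind of diagnostic involved.

```r
library(survival)

# Treat the categorical covariates as factors
dat <- survival::ovarian
dat$resid.ds <- factor(dat$resid.ds)
dat$rx <- factor(dat$rx)
dat$ecog.ps <- factor(dat$ecog.ps)

# Fit a plain Cox model with survival::coxph
fit <- coxph(Surv(futime, fustat) ~ age + resid.ds + rx + ecog.ps, data = dat)

# Test the proportional hazards assumption based on scaled Schoenfeld residuals;
# small p-values hint at a violation for the corresponding term
zph <- cox.zph(fit)
print(zph)
```

The printed table contains one row per model term plus a `GLOBAL` row for the overall test.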

In the first example, we use the Cox model and the `ovarian` data set from the `survival` package. In the first step, we initialize the R6 data object.

```
library(tidyverse)
library(survival)
library(CaseBasedReasoning)
ovarian$resid.ds <- factor(ovarian$resid.ds)
ovarian$rx <- factor(ovarian$rx)
ovarian$ecog.ps <- factor(ovarian$ecog.ps)
# initialize R6 object
coxBeta <- CoxBetaModel$new(Surv(futime, fustat) ~ age + resid.ds + rx + ecog.ps)
```

After the initialization, we may want to find, for each case in the query data, the most similar cases in the learning data.

```
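n <- nrow(ovarian)
# draw 80% of the rows without replacement as learning data
trainID <- sample(1:n, floor(0.8 * n), replace = FALSE)
testID <- (1:n)[-trainID]
# fit model on the learning data
ovarian[trainID, ] %>%
  coxBeta$fit()
# get the k = 3 most similar learning cases for each query case
ovarian[trainID, ] %>%
  coxBeta$get_similar_cases(queryData = ovarian[testID, ], k = 3) -> matchedData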
```
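Because `sample()` draws randomly, the split above changes from run to run. If you need the same learning and query sets every time, one option is to fix the seed first. The following base R sketch shows this; the seed value is arbitrary and chosen only for illustration.

```r
# Fix the random number generator so the split can be reproduced
set.seed(42)

n <- 26  # number of rows in the ovarian data set
trainID <- sample(seq_len(n), floor(0.8 * n), replace = FALSE)
testID <- setdiff(seq_len(n), trainID)

# The two index sets are disjoint and together cover all rows
length(trainID)
length(testID)
```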

You can then extract the similar cases and the verum data and combine them.

**Note 1:** In the initialization step, all cases with missing values in the variables of `data` and `endPoint` are dropped. So make sure that you handle NAs yourself beforehand.
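The NA handling mentioned in Note 1 can be done in base R before passing data to the model. The sketch below uses a small made-up data frame (the values are illustrative, not taken from `ovarian`) and keeps only the rows that are complete in the model variables.

```r
# Toy data with missing values; column names mirror the ovarian example,
# the values are invented for illustration
dat <- data.frame(
  futime = c(59, 115, NA, 268),
  fustat = c(1, 1, 0, 1),
  age    = c(72.3, NA, 66.5, 74.5)
)

# Variables that enter the model (endpoint and covariates)
modelVars <- c("futime", "fustat", "age")

# Count missing values per variable
colSums(is.na(dat[, modelVars]))

# Keep only the rows without missing values in these variables
datComplete <- dat[complete.cases(dat[, modelVars]), ]
nrow(datComplete)
```

Instead of dropping rows, you could also impute the missing values; which approach is appropriate depends on your data.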

**Note 2:** The `data.table` returned from `coxBeta$get_similar_cases` has four additional columns:

- `caseId`: With this column you can map the similar cases to cases in data. For example, if you chose `k = 3`, the first three elements in the column `caseId` will be `1` (the following three `2`, and so on). This means that these three cases are the three most similar cases to case `1` in the verum data.
- `scDist`: The calculated distance
- `scCaseId`: Grouping number that links a query case with its matched data
- `group`: Indicates whether a row belongs to the matched or the query data

Alternatively, if you are only interested in the distance matrix, proceed as follows:

```
ovarian %>%
  coxBeta$calc_distance_matrix() -> distMatrix
```

`coxBeta$calc_distance_matrix()` calculates the full distance matrix. This matrix has the dimension: cases of data versus cases of query data. If the query data set is not available, this function calculates an n × n distance matrix of all pairs in data. The distance matrix is saved internally in the `CoxBetaModel` object: `coxBeta$distMat`.
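To make the idea of a regression-based distance matrix concrete, the sketch below builds an n × n matrix by hand using only `survival` and base R. It assumes that the distance between two cases is the sum of absolute covariate differences, each weighted by its Cox coefficient. This weighting is an assumption made for illustration and not necessarily the formula used by the package, whose distance calculation is implemented in C++.

```r
library(survival)

dat <- survival::ovarian
dat$resid.ds <- factor(dat$resid.ds)
dat$rx <- factor(dat$rx)
dat$ecog.ps <- factor(dat$ecog.ps)

fit <- coxph(Surv(futime, fustat) ~ age + resid.ds + rx + ecog.ps, data = dat)

# Design matrix of the covariates (factors become dummy columns)
X <- model.matrix(fit)
beta <- coef(fit)

# Scale every column by its coefficient (assumed weighting),
# then compute pairwise L1 distances between all cases
Xw <- sweep(X, 2, beta, `*`)
distSketch <- as.matrix(dist(Xw, method = "manhattan"))
dim(distSketch)
```

With this weighting, covariates with a larger estimated effect on the endpoint contribute more to the distance between two cases.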

In the second example, we present the random forest model for approximating a distance measure, applied to the `ovarian` data set from the `survival` package. This approach offers two ways of calculating distance/similarity (see documentation):

- proximity
- depth

Let’s initialize the R6 data object.

```
library(tidyverse)
library(survival)
library(CaseBasedReasoning)
ovarian$resid.ds <- factor(ovarian$resid.ds)
ovarian$rx <- factor(ovarian$rx)
ovarian$ecog.ps <- factor(ovarian$ecog.ps)

rfSC <- RFModel$new(Surv(futime, fustat) ~ age + resid.ds + rx + ecog.ps)
```

All cases with missing values in the learning and endpoint variables are dropped (`na.omit`), and the reduced data set without missing values is saved internally. A text output reports how many cases were dropped. `character` variables are transformed to `factor`.

Optionally, you may want to adjust some parameters in the fitting step of the random forest algorithm. Possible arguments are `ntree`, `mtry`, and `splitrule`. The documentation of these parameters can be found in the ranger R package. Furthermore, you can choose between two distance measures:

- `Proximity`: the proximity matrix
- `Depth` (default): calculates the average edge length over all trees

This can be done with `rfSC$set_dist(distMethod = "Proximity")`.
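To illustrate what a proximity measure captures, the following sketch computes a proximity matrix directly with `ranger`: two cases are close if they often end up in the same terminal node. The tree count, the `mtry` value, and the conversion `1 - proximity` are illustrative assumptions, and the package's own implementation may differ.

```r
library(survival)
library(ranger)

dat <- survival::ovarian
dat$resid.ds <- factor(dat$resid.ds)
dat$rx <- factor(dat$rx)
dat$ecog.ps <- factor(dat$ecog.ps)

set.seed(1)
# Survival forest; num.trees, mtry, and splitrule are ranger's own argument names
rf <- ranger(Surv(futime, fustat) ~ age + resid.ds + rx + ecog.ps,
             data = dat, num.trees = 100, mtry = 2, splitrule = "logrank")

# Terminal node ID of every case in every tree (rows: cases, columns: trees)
nodes <- predict(rf, data = dat, type = "terminalNodes")$predictions

# Proximity: share of trees in which two cases fall into the same terminal node
n <- nrow(nodes)
prox <- matrix(0, n, n)
for (tree in seq_len(ncol(nodes))) {
  prox <- prox + outer(nodes[, tree], nodes[, tree], "==")
}
prox <- prox / ncol(nodes)

# One simple way to turn the similarity into a distance (an assumption here)
distProx <- 1 - prox
```

Each case has proximity 1 to itself, so the diagonal of `distProx` is zero.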

All other steps (except checking the proportional hazards assumption) are the same as for the Cox model.

**Similar Cases:**

```
n <- nrow(ovarian)
trainID <- sample(1:n, floor(0.8 * n), replace = FALSE)
testID <- (1:n)[-trainID]
# fit model
ovarian[trainID, ] %>%
  rfSC$fit()
# get similar cases
ovarian[trainID, ] %>%
  rfSC$get_similar_cases(queryData = ovarian[testID, ], k = 3) -> matchedData
```

**Distance Matrix Calculation:**

```
ovarian %>%
  rfSC$calc_distance_matrix() -> distMatrix
```

PD Dr. Jürgen Dippon, Institut für Stochastik und Anwendungen, Universität Stuttgart

Dr. Simon Müller, TTI GmbH - MUON-STAT

Dr. Peter Fritz

Professor Dr. Friedel

The work was funded by the Robert Bosch Foundation. Special thanks go to Professor Dr. Friedel (Thoraxchirurgie - Klinik Schillerhöhe).
