---
title: "Categorical Association Measures"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Categorical Association Measures}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(moderncor)
```

The `moderncor_cat()` function provides a unified interface for computing association measures between categorical (factor) variables. All measures require the `DescTools` package.

```{r check-desctool, eval = !requireNamespace("DescTools", quietly = TRUE), echo = FALSE}
message("The DescTools package is not installed. Install it with: install.packages('DescTools')")
```

## Basic Usage

`moderncor_cat()` accepts two vectors of equal length. Each may be a factor, or a character or numeric vector that is treated as categorical:

```{r basic, eval = requireNamespace("DescTools", quietly = TRUE)}
set.seed(42)
x <- factor(sample(c("A", "B", "C"), 100, replace = TRUE))
y <- factor(sample(c("X", "Y"), 100, replace = TRUE))

moderncor_cat(x, y, method = "cramers_v")
```

The output is an S3 object of class `"moderncor_cat"` with the same structure as `moderncor()` output:

- `$estimate`: the association coefficient
- `$statistic`: the chi-square test statistic (for nominal methods)
- `$p.value`: the p-value (for nominal methods; `NULL` for ordinal methods)
- `$n`: the sample size
- `$method_label`: human-readable method name

## Querying Available Methods

```{r available-methods-cat}
available_methods_cat()
```

Methods fall into two categories:

- **Nominal**: for unordered categories (Cramér's V, Phi, Contingency Coefficient, Tschuprow's T)
- **Ordinal**: for ordered categories (Goodman-Kruskal Gamma, Somers' D)

## Nominal Association Measures

Nominal measures are appropriate when categories have no natural ordering. They are all derived from the chi-square statistic of the contingency table, and each returns the p-value of the corresponding chi-square test.

### Cramér's V

Cramér's V is the most widely used measure of nominal association. It ranges from 0 (no association) to 1 (perfect association) and is symmetric:

```{r cramers-v, eval = requireNamespace("DescTools", quietly = TRUE)}
moderncor_cat(x, y, method = "cramers_v")
```

For a 2×2 table, Cramér's V equals the absolute value of the Phi coefficient.
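As a rough illustration of the definition, the sketch below computes Cramér's V by hand in base R as $V = \sqrt{\chi^2 / (n \cdot (\min(r, c) - 1))}$, where $r$ and $c$ are the numbers of rows and columns. The table `tab_v` and the other names are made up for this example, and the sketch is not the package's implementation, which may differ in details such as bias or continuity corrections.

```{r sketch-cramers-v}
# Illustrative 2x3 contingency table (made-up counts)
tab_v <- matrix(c(30, 10, 20,
                  10, 30, 20), nrow = 2, byrow = TRUE)

# Pearson chi-square statistic without continuity correction
chi2_v <- unname(chisq.test(tab_v, correct = FALSE)$statistic)
n_v <- sum(tab_v)

# Normalize by n times (smaller table dimension - 1)
cramers_v_manual <- sqrt(chi2_v / (n_v * (min(dim(tab_v)) - 1)))
cramers_v_manual
```

For a 2×2 table the term `min(dim(tab)) - 1` equals 1, which is why V reduces to the absolute value of Phi there.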

### Phi Coefficient

The Phi coefficient is designed for 2×2 contingency tables. For larger tables it can exceed 1, so prefer Cramér's V in that case:

```{r phi, eval = requireNamespace("DescTools", quietly = TRUE)}
x_bin <- factor(sample(c("Yes", "No"), 100, replace = TRUE))
y_bin <- factor(sample(c("Pass", "Fail"), 100, replace = TRUE))

moderncor_cat(x_bin, y_bin, method = "phi")
```
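To see why Phi is not bounded by 1 on larger tables, the following base R sketch evaluates the unsigned form $\phi = \sqrt{\chi^2 / n}$ on two perfectly associated tables. The helper `phi_manual()` and the tables are hypothetical, used only for illustration; the value reported by `moderncor_cat()` may be signed for 2×2 tables.

```{r sketch-phi-bound}
# Hypothetical helper: unsigned Phi from the chi-square statistic
phi_manual <- function(tab) {
  chi2 <- unname(chisq.test(tab, correct = FALSE)$statistic)
  sqrt(chi2 / sum(tab))
}

# Perfect association: all counts on the diagonal
tab_2x2 <- diag(c(30, 30))
tab_3x3 <- diag(c(30, 30, 30))

phi_2x2 <- phi_manual(tab_2x2)  # equals 1 for the 2x2 table
phi_3x3 <- phi_manual(tab_3x3)  # equals sqrt(2) for the 3x3 table
c(phi_2x2 = phi_2x2, phi_3x3 = phi_3x3)
```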

### Contingency Coefficient

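The contingency coefficient (Pearson's C) is bounded between 0 and $\sqrt{(k-1)/k}$, where $k$ is the smaller of the number of rows and the number of columns of the contingency table. Because this upper bound depends on the table dimensions, C is not comparable across tables of different sizes: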

```{r contingency, eval = requireNamespace("DescTools", quietly = TRUE)}
moderncor_cat(x, y, method = "contingency")
```
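The size-dependent upper bound can be checked with a small base R sketch that uses the textbook formula $C = \sqrt{\chi^2 / (\chi^2 + n)}$ on perfectly associated tables. The helper name `contingency_manual()` is an assumption made for this example and is not part of the package.

```{r sketch-contingency-bound}
# Hypothetical helper: Pearson's contingency coefficient
contingency_manual <- function(tab) {
  chi2 <- unname(chisq.test(tab, correct = FALSE)$statistic)
  sqrt(chi2 / (chi2 + sum(tab)))
}

# Perfect association still yields C < 1, with a different maximum per size
c_2x2 <- contingency_manual(diag(c(30, 30)))      # sqrt(1/2)
c_3x3 <- contingency_manual(diag(c(30, 30, 30)))  # sqrt(2/3)
c(c_2x2 = c_2x2, c_3x3 = c_3x3)
```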

### Tschuprow's T

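Tschuprow's T is similar to Cramér's V but normalizes by the geometric mean of $(r-1)$ and $(c-1)$, where $r$ and $c$ are the numbers of rows and columns, instead of by their minimum. It is symmetric and lies between 0 and 1, but it can reach 1 only for square tables; for non-square tables it is always smaller than Cramér's V: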

```{r tschuprow, eval = requireNamespace("DescTools", quietly = TRUE)}
moderncor_cat(x, y, method = "tschuprow")
```
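The relationship to Cramér's V can be sketched in base R with the textbook formula $T = \sqrt{\chi^2 / (n \sqrt{(r-1)(c-1)})}$. The table and variable names below are invented for illustration, and the package's own computation may differ in details.

```{r sketch-tschuprow}
# Illustrative 2x3 table (made-up counts)
tab_t <- matrix(c(30, 10, 20,
                  10, 30, 20), nrow = 2, byrow = TRUE)
chi2_t <- unname(chisq.test(tab_t, correct = FALSE)$statistic)
n_t <- sum(tab_t)
r_t <- nrow(tab_t)
c_t <- ncol(tab_t)

# Tschuprow's T: geometric mean of (r - 1) and (c - 1) in the denominator
tschuprow_manual <- sqrt(chi2_t / (n_t * sqrt((r_t - 1) * (c_t - 1))))

# Cramér's V: minimum of (r - 1) and (c - 1) in the denominator
v_for_comparison <- sqrt(chi2_t / (n_t * min(r_t - 1, c_t - 1)))

c(tschuprow = tschuprow_manual, cramers_v = v_for_comparison)
```

On this non-square table T is smaller than V; on a square table the two normalizers coincide and the measures agree.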

## Ordinal Association Measures

Ordinal measures are appropriate when categories have a natural ordering (e.g., Likert scales, severity grades). They do not return p-values by default.

### Goodman-Kruskal Gamma

Goodman-Kruskal Gamma ($\gamma$) compares the number of concordant pairs of observations (both variables increase together) with the number of discordant pairs (one variable increases while the other decreases), ignoring tied pairs. It ranges from −1 to 1 and is symmetric:

```{r gamma-data, eval = requireNamespace("DescTools", quietly = TRUE)}
# Simulate ordinal survey data
set.seed(1)
quality  <- factor(sample(c("Low", "Medium", "High"), 100, replace = TRUE,
                           prob = c(0.3, 0.4, 0.3)),
                   levels = c("Low", "Medium", "High"), ordered = TRUE)
satisfaction <- factor(sample(c("Dissatisfied", "Neutral", "Satisfied"), 100,
                               replace = TRUE, prob = c(0.3, 0.4, 0.3)),
                       levels = c("Dissatisfied", "Neutral", "Satisfied"), ordered = TRUE)

moderncor_cat(quality, satisfaction, method = "gamma")
```
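The pair-counting idea behind Gamma can be sketched in base R as $\gamma = (C - D) / (C + D)$, with $C$ concordant and $D$ discordant pairs counted from an ordered contingency table. The function `count_pairs()` and the table `tab_g` are hypothetical and serve only to illustrate the definition.

```{r sketch-gamma-pairs}
# Hypothetical helper: count concordant and discordant pairs in a table
# whose rows and columns are both in increasing category order
count_pairs <- function(tab) {
  nr <- nrow(tab)
  nc <- ncol(tab)
  conc <- 0
  disc <- 0
  for (i in seq_len(nr - 1)) {
    for (j in seq_len(nc)) {
      below <- (i + 1):nr
      # Cells below and to the right are higher on both variables
      if (j < nc) conc <- conc + tab[i, j] * sum(tab[below, (j + 1):nc])
      # Cells below and to the left are higher on rows, lower on columns
      if (j > 1) disc <- disc + tab[i, j] * sum(tab[below, 1:(j - 1)])
    }
  }
  c(concordant = conc, discordant = disc)
}

# Made-up 3x3 table with mass concentrated on the diagonal
tab_g <- matrix(c(20, 10,  5,
                  10, 20, 10,
                   5, 10, 20), nrow = 3, byrow = TRUE)

pairs_g <- count_pairs(tab_g)
gamma_manual <- unname((pairs_g["concordant"] - pairs_g["discordant"]) /
                         (pairs_g["concordant"] + pairs_g["discordant"]))
pairs_g
gamma_manual
```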

### Somers' D

Somers' D is an asymmetric ordinal measure: it treats `x` as the predictor and `y` as the response, so it measures how well `y` can be predicted from `x`, not the reverse. Values range from −1 to 1:

```{r somers-d, eval = requireNamespace("DescTools", quietly = TRUE)}
moderncor_cat(quality, satisfaction, method = "somers_d")
```

Note that swapping `x` and `y` gives a different result:

```{r somers-d-reversed, eval = requireNamespace("DescTools", quietly = TRUE)}
moderncor_cat(satisfaction, quality, method = "somers_d")
```
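The asymmetry comes from the denominator: under the standard definition, $D(y \mid x)$ divides the concordant-minus-discordant count by the number of pairs that are not tied on `x`. The base R sketch below illustrates this on small integer-coded vectors; all names are made up for the example, and it is not the package's implementation.

```{r sketch-somers-d}
# Made-up ordinal data, coded as integers in category order
x_ord <- c(1, 1, 1, 2, 2, 3, 3, 3)
y_ord <- c(1, 1, 2, 1, 2, 2, 2, 2)

# Sign of every pairwise difference; each unordered pair appears twice
dx <- sign(outer(x_ord, x_ord, "-"))
dy <- sign(outer(y_ord, y_ord, "-"))

# Concordant pairs contribute +1, discordant -1, tied pairs 0
conc_minus_disc <- sum(dx * dy) / 2

# Condition on the predictor: drop pairs tied on the predictor only
d_y_given_x <- conc_minus_disc / (sum(dx != 0) / 2)
d_x_given_y <- conc_minus_disc / (sum(dy != 0) / 2)

c(d_y_given_x = d_y_given_x, d_x_given_y = d_x_given_y)
```

The numerator is identical in both directions; only the number of ties on the conditioning variable changes.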

## Pairwise Matrix for Multiple Variables

Pass a `data.frame` of factor columns to compute the association for every pair of columns:

```{r matrix-input, eval = requireNamespace("DescTools", quietly = TRUE)}
df <- data.frame(
  cyl   = factor(mtcars$cyl),
  gear  = factor(mtcars$gear),
  am    = factor(mtcars$am)
)

res_mat <- moderncor_cat(df, method = "cramers_v")
res_mat
```

The result stores the matrix of association coefficients. For nominal methods, the matching matrix of p-values is available in `$p.value`:

```{r matrix-pvalue, eval = requireNamespace("DescTools", quietly = TRUE)}
res_mat$p.value
```

Use `as.data.frame()` to convert to tidy format:

```{r as-data-frame, eval = requireNamespace("DescTools", quietly = TRUE)}
as.data.frame(res_mat)
```
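Conceptually, the pairwise computation is a loop over all column pairs that fills a symmetric matrix. The base R sketch below shows one way to do this for Cramér's V; `cramers_v_sketch()`, `df_sketch`, and `v_mat` are hypothetical names, and the code is an illustration under those assumptions, not how `moderncor_cat()` is implemented.

```{r sketch-pairwise}
# Hypothetical helper: Cramér's V for two categorical vectors
cramers_v_sketch <- function(a, b) {
  tab <- table(a, b)
  # Small expected counts trigger a warning that is irrelevant here
  chi2 <- unname(suppressWarnings(chisq.test(tab, correct = FALSE))$statistic)
  sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1)))
}

df_sketch <- data.frame(
  cyl  = factor(mtcars$cyl),
  gear = factor(mtcars$gear),
  am   = factor(mtcars$am)
)

vars <- names(df_sketch)
# Diagonal is 1: each variable is perfectly associated with itself
v_mat <- matrix(1, length(vars), length(vars), dimnames = list(vars, vars))

# Fill both triangles, since Cramér's V is symmetric
for (pair in combn(vars, 2, simplify = FALSE)) {
  v <- cramers_v_sketch(df_sketch[[pair[1]]], df_sketch[[pair[2]]])
  v_mat[pair[1], pair[2]] <- v
  v_mat[pair[2], pair[1]] <- v
}
round(v_mat, 3)
```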

## Handling Missing Values

The `use` argument controls how missing values are handled, mirroring the interface of `moderncor()`:

- `"complete.obs"` (default): remove all rows with any NA before computing
- `"pairwise.complete.obs"`: remove NAs per pair
- `"everything"`: propagate NAs (returns NA for any pair with missing values)

```{r missing-values, eval = requireNamespace("DescTools", quietly = TRUE)}
x_na <- factor(c("A", "B", NA, "A", "B", "C"))
y_na <- factor(c("X", "Y", "X", NA, "Y", "X"))

moderncor_cat(x_na, y_na, method = "cramers_v", use = "complete.obs")
```
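The difference between listwise and pairwise deletion only shows up with more than two variables. The base R sketch below counts how many rows each strategy would keep; the data frame `df_na` and the helper `n_pairwise()` are invented for illustration and do not call the package.

```{r sketch-missing}
# Made-up data with one NA pattern per column
df_na <- data.frame(
  a = factor(c("A", "B", NA,  "A", "B", "C")),
  b = factor(c("X", "Y", "X", NA,  "Y", "X")),
  c = factor(c("U", NA,  "V", "U", "V", "U"))
)

# Listwise ("complete.obs"): keep rows with no NA in any column
n_listwise <- sum(complete.cases(df_na))

# Pairwise ("pairwise.complete.obs"): keep rows complete for that pair only
n_pairwise <- function(v1, v2) sum(complete.cases(df_na[, c(v1, v2)]))
n_ab <- n_pairwise("a", "b")
n_ac <- n_pairwise("a", "c")
n_bc <- n_pairwise("b", "c")

c(listwise = n_listwise, a_b = n_ab, a_c = n_ac, b_c = n_bc)
```

Pairwise deletion retains more observations per pair, at the cost of each coefficient being based on a different subset of rows.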

## Choosing the Right Method

| Situation | Recommended method |
|---|---|
| Two unordered categorical variables (general) | `cramers_v` |
| Two binary variables (2×2 table) | `phi` |
| Two ordered categorical (Likert) variables | `gamma` |
| Predicting one ordered variable from another | `somers_d` |
| Comparing association across different table sizes | `cramers_v` or `tschuprow` |

For continuous variables, use `moderncor()` instead. See `vignette("introduction")` for a full overview.