---
title: "PONG2 Imputation Workflow"
author: "Norman Lab"
output: 
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
    fig_width: 7
    fig_height: 5
vignette: >
  %\VignetteIndexEntry{PONG2 Imputation Workflow}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
```

## Overview

This vignette walks through KIR allele imputation with the `impute` command in PONG2,
step by step.

The workflow covers:

- Preparing input data (PLINK → chr19 extraction)
- Running basic PONG2 imputation
- Checking SNP overlap with the 1KGP reference panel
- Pre-phasing the KIR region with Eagle2
- Local pre-imputation using minimac4 (`--fill-missing`)
- External pre-imputation via the Michigan Imputation Server
- Interpreting results

---

## Prerequisites

| Requirement | Version | Notes |
|-------------|---------|-------|
| PLINK2 | ≥ 2.0 | Must be in PATH |
| R | ≥ 4.0 | With PONG2 installed |
| minimac4 | ≥ 4.1.6 | Only for `--fill-missing` |
| Eagle2 | ≥ 2.4 | Only for pre-phasing |
| bgzip & tabix | HTSlib | Only for `--fill-missing` |
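
Before starting, it can help to confirm that the tools in the table are reachable from your shell. The helper below is a minimal sketch: `missing_tools` is a hypothetical name, and the executable names in the usage line (`plink2`, `Rscript`, `minimac4`, `eagle`, `bgzip`, `tabix`) are assumptions about how the tools are installed on your system. It checks presence on `PATH` only, not versions.

```bash
# Print the name of every tool in the argument list that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
    fi
  done
}

# Usage (executable names are assumptions; adjust to your installation):
#   missing_tools plink2 Rscript minimac4 eagle bgzip tabix
```

An empty result means every listed tool was found.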

---

## Step 1: Prepare Input Data

PONG2 works best when input files are restricted to chromosome 19 (covering the KIR locus).
Extract chr19 from your full-genome PLINK files:

```bash
plink2 \
  --bfile your_full_genome_prefix \
  --chr 19 \
  --make-bed \
  --out chr19_only
```

This creates `chr19_only.bed`, `chr19_only.bim`, and `chr19_only.fam`.
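
As a quick sanity check on the extracted files, you can count how many variants in the `.bim` file fall inside the KIR region. This is a sketch, not part of PONG2: `count_bim_region` is a hypothetical helper, and the coordinates in the usage line are the hg19 values listed in Step 3.

```bash
# Count variants in a PLINK .bim file that fall inside chrom:start-end.
# .bim columns: chromosome, variant ID, cM, bp position, allele 1, allele 2.
count_bim_region() {
  bim=$1; chrom=$2; start=$3; end=$4
  awk -v c="$chrom" -v s="$start" -v e="$end" '
    { chr = $1; sub(/^chr/, "", chr) }          # accept "19" or "chr19"
    chr == c && $4 >= s && $4 <= e { n++ }
    END { print n + 0 }
  ' "$bim"
}

# Usage (hg19 coordinates):
#   count_bim_region chr19_only.bim 19 55000000 55400000
```

A count near zero suggests the array has little coverage of the KIR region, or that the coordinates belong to a different assembly than the data.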

---

## Step 2: Run Basic PONG2 Imputation

```bash
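# -i: PLINK prefix from Step 1; -o: output directory; -l: KIR locus;
# -a: genome assembly of the input data (hg19 or hg38)
# --filter can be 0.005 or 0.01;
# 0.005 keeps more rare KIR alleles in the output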
pong2 impute \
  -i chr19_only \
  -o results/basic \
  -l KIR3DL1 \
  -a hg38 \
  -t 16 \
  --filter 0.005
```

PONG2 will automatically check the SNP overlap between your data and the 1KGP reference
panel in the KIR region and report the match rate.

---

## Step 3: Check SNP Overlap

> **NOTE: KIR Region SNP Overlap between input data and 1KGP**
>
> Overlap rate is computed between your input data and the 1000 Genomes Project (1KGP)
> reference panel in the KIR region:
>
> | Assembly | KIR Region Coordinates |
> |----------|----------------------|
> | hg19 | chr19:55,000,000–55,400,000 |
> | hg38 | chr19:54,000,000–55,000,000 |
>
> | Overlap Rate | Status | Action |
> |-------------|--------|--------|
> | ≥ 50% | Pass | Proceed with PONG2 directly |
> | < 50% | Fail | Run Eagle2 + pre-imputation first |

If the overlap rate is at least 50%, PONG2 proceeds automatically.
Otherwise, use one of the pre-imputation strategies in Step 4.
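
To estimate the overlap yourself before running PONG2, the sketch below compares positions only. It rests on assumptions that are not confirmed by PONG2: the overlap is taken as the share of reference positions that also appear in the input, alleles are ignored, and the reference positions come from a hypothetical text file with one base-pair position per line. Both inputs are assumed to be restricted to the KIR region on the same assembly, so the number PONG2 reports may differ.

```bash
# Percentage of reference positions (one bp per line in $2) that also occur
# in column 4 of the .bim file ($1).
overlap_percent() {
  awk 'NR == FNR { seen[$4] = 1; next }      # first file: .bim positions
       NF        { total++; if ($1 in seen) hit++ }
       END       { if (total == 0) { print "0.0"; exit }
                   printf "%.1f\n", 100 * hit / total }' "$1" "$2"
}

# Map a percentage onto the Pass/Fail rule from the table above.
overlap_status() {
  awk -v r="$1" 'BEGIN { if (r + 0 >= 50) print "Pass"; else print "Fail" }'
}

# Usage (kir_ref_positions.txt is a hypothetical file name):
#   overlap_status "$(overlap_percent kir_only.bim kir_ref_positions.txt)"
```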

---

## Step 4: Pre-imputation (when SNP overlap < 50%)

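Pre-phasing the KIR region is **required** for local pre-imputation (Option A). For external
pre-imputation (Option B), either upload the pre-phased VCF or let the server phase unphased data.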

### Pre-phase with Eagle2

#### hg19

```bash
eagle \
  --bfile=chr19_only \
  --geneticMapFile=genetic_map_hg19.txt.gz \
  --outPrefix=chr19.phased \
  --chrom=19 \
  --numThreads=20 \
  --bpStart=55000000 \
  --bpEnd=55400000
```

#### hg38

```bash
eagle \
  --bfile=chr19_only \
  --geneticMapFile=genetic_map_hg38.txt.gz \
  --outPrefix=chr19.phased \
  --chrom=19 \
  --numThreads=20 \
  --bpStart=54000000 \
  --bpEnd=55000000
```

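The following steps expect a phased VCF named `chr19.phased.vcf.gz`. Eagle2 writes VCF
output only when its input is VCF/BCF (`--vcf`). With PLINK input (`--bfile`) it writes
Oxford HAPS/SAMPLE files (`chr19.phased.haps.gz` and `chr19.phased.sample`), so either
convert those to VCF or phase from a VCF export of `chr19_only` before continuing.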
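
The two Eagle2 commands differ only in the genetic map and the region boundaries, so the assembly can drive both. The sketch below uses two hypothetical helpers: `kir_region` returns the coordinates from the Step 3 table, and `eagle_cmd` prints the matching command without running it. The genetic map file names follow the pattern used above and may differ on your system.

```bash
# Print "bpStart bpEnd" for the KIR region on the given assembly,
# using the coordinates from the table in Step 3.
kir_region() {
  case $1 in
    hg19) echo "55000000 55400000" ;;
    hg38) echo "54000000 55000000" ;;
    *)    echo "unknown assembly: $1" >&2; return 1 ;;
  esac
}

# Print (do not run) the Eagle2 command for one assembly.
eagle_cmd() {
  region=$(kir_region "$1") || return 1
  printf '%s %s %s\n' \
    "eagle --bfile=chr19_only --geneticMapFile=genetic_map_$1.txt.gz" \
    "--outPrefix=chr19.phased --chrom=19 --numThreads=20" \
    "--bpStart=${region% *} --bpEnd=${region#* }"
}

# Usage: eagle_cmd hg38
```

Printing the command first makes it easy to review the coordinates before committing to a phasing run.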

---

### Option A: Local Pre-imputation with minimac4 (built-in)

Pass the pre-phased VCF directly to PONG2 using `--vcf` and `--fill-missing`.

> **Important:** With `--fill-missing`, `--vcf` is the **only input** PONG2 needs.  
> PLINK files cannot hold phased haplotype data, so the pipeline derives everything
> from the VCF internally. Do **not** supply `-i` together with `--fill-missing`.

```bash
# Set -a to the assembly used for pre-phasing (hg19 in this example)
pong2 impute \
  --vcf chr19.phased.vcf.gz \
  -o results/local_impute \
  -l KIR3DL1 \
  -a hg19 \
  -t 20 \
  --filter 0.005 \
  --fill-missing
```
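
Because `--fill-missing` depends on phased input, it can be worth confirming that the VCF really is phased before starting a run. The check below is a sketch under stated assumptions: `is_phased_vcf` is a hypothetical helper, it treats any genotype containing the unphased separator `/` (including missing `./.` calls) as a failure, and it relies on GT being the first FORMAT subfield, as the VCF specification requires.

```bash
# Succeed only if the gzip/bgzip-compressed VCF has at least one genotype
# and no GT value uses the unphased "/" separator.
is_phased_vcf() {
  gzip -dc "$1" | awk -F '\t' '
    /^#/ { next }                               # skip header lines
    { for (i = 10; i <= NF; i++) {              # sample columns start at 10
        split($i, f, ":"); n++
        if (f[1] ~ /\//) { bad = 1; exit }
      } }
    END { exit (bad || n == 0) ? 1 : 0 }'
}

# Usage:
#   is_phased_vcf chr19.phased.vcf.gz && echo "phased" || echo "not phased"
```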

---

### Option B: External Pre-imputation (recommended for highest accuracy)

Pre-impute your chr19 data using a public server before running PONG2.
This is the approach used in the PONG2 manuscript.

#### Step B1: Export phased VCF

The phased VCF (`chr19.phased.vcf.gz`) can be uploaded as is. To upload unphased data
instead and let the server phase it, export a VCF from PLINK:

```bash
plink2 \
  --bfile chr19_only \
  --export vcf bgz \
  --out chr19_only
tabix -p vcf chr19_only.vcf.gz
```

#### Step B2: Upload to Michigan Imputation Server

- URL: [https://imputationserver.sph.umich.edu/](https://imputationserver.sph.umich.edu/)
- Reference panel: TOPMed r5 (recommended for diverse populations) or 1KGP Phase 3
- Genome build: match your data (hg19 or hg38)
- Chromosome: 19 only
- Phasing: select `Eagle v2.4` if uploading unphased data; skip if already phased
- Submit and wait for email notification (typically hours to days)

#### Step B3: Download and convert imputed VCF to PLINK

```bash
# Unzip results (password provided by server via email)
unzip -P <password> chr19.zip

# Convert imputed VCF to PLINK
plink2 \
  --vcf chr19.dose.vcf.gz dosage=DS \
  --import-dosage-certainty 0.3 \
  --make-bed \
  --out imputed_chr19
```

#### Step B4: Run PONG2 on imputed data

```bash
# Set -a to the genome build of the imputed data
pong2 impute \
  -i imputed_chr19 \
  -o results/final \
  -l KIR3DL1 \
  -a hg38 \
  -t 16 \
  --filter 0.005
```

---

### Option C: Force imputation (not recommended)

Proceed despite a low SNP overlap rate. Use this only when you understand the
implications for accuracy:

```bash
pong2 impute \
  -i chr19_only \
  -o results/forced \
  -l KIR3DL1 \
  -a hg19 \
  --force
```

---

## Step 5: Interpreting Output

After `pong2 impute` completes, results are saved in `<output>/KIR/`:

| File | Description |
|------|-------------|
| `KIR/<locus>.csv` | Predicted KIR alleles per sample (main results) |
| `KIR/<locus>.RData` | Full prediction object including allele probabilities |

### Output CSV format

```
sample.id, KIR3DL1.1, KIR3DL1.2, prob.KIR3DL1.1, prob.KIR3DL1.2
HG00096,   KIR3DL1*001, KIR3DL1*002, 0.98, 0.95
HG00097,   KIR3DL1*005, KIR3DL1*015, 0.87, 0.91
```
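
A common follow-up is to list samples whose calls deserve a second look. The sketch below assumes the column order shown above (sample ID, two alleles, two probabilities) and a single header row. `low_confidence_samples` is a hypothetical helper, and the 0.9 cutoff in the usage line is an arbitrary example, not a PONG2 recommendation.

```bash
# Print sample IDs where either allele probability is below the threshold.
# $1: results CSV, $2: probability threshold.
low_confidence_samples() {
  awk -F ',' -v t="$2" '
    NR == 1 { next }                          # skip header
    { id = $1; gsub(/[ \t"]/, "", id)         # trim padding and quotes
      if ($4 + 0 < t + 0 || $5 + 0 < t + 0) print id }' "$1"
}

# Usage (path follows the output layout described above):
#   low_confidence_samples results/basic/KIR/KIR3DL1.csv 0.9
```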

### Large sample datasets

For datasets with **>2,000 samples**, PONG2 automatically splits prediction into
chunks of 2,000 samples to limit memory use. The results are combined and saved
as a single output file, so no action is required from the user.
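
To make the chunking concrete, the sketch below prints the sample ranges that fixed-size chunks would cover. It illustrates the arithmetic only and is not PONG2's internal code: `chunk_ranges` is a hypothetical helper, and the assumption is that the final chunk simply holds the remainder.

```bash
# Print 1-based "start-end" sample ranges for n samples in chunks of size.
chunk_ranges() {
  n=$1; size=$2; start=1
  while [ "$start" -le "$n" ]; do
    end=$((start + size - 1))
    if [ "$end" -gt "$n" ]; then end=$n; fi   # last chunk holds the remainder
    echo "$start-$end"
    start=$((end + 1))
  done
}

# Usage: chunk_ranges 4500 2000
```

Under that assumption, 4,500 samples would be processed as three chunks: two of 2,000 and one of 500.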

---

## Summary: Which Workflow to Choose?

| Scenario | Recommended approach |
|----------|---------------------|
| SNP overlap ≥ 50% | Run `pong2 impute -i` directly |
| SNP overlap < 50%, quick run needed | Eagle2 → `pong2 impute --vcf --fill-missing` |
| SNP overlap < 50%, highest accuracy | Eagle2 → Michigan Server → `pong2 impute -i` |
| SNP overlap < 50%, reduced accuracy accepted | `pong2 impute -i --force` (not recommended) |

---

## Next Steps

- See the [PONG2-training](https://normanlabucd.github.io/PONG2/articles/PONG2-training.html)
  vignette for custom model training
- Run the complete end-to-end workflow script: [example/full_workflow.sh](https://github.com/NormanLabUCD/PONG2/blob/main/example/full_workflow.sh)
- Report issues: [Open a GitHub issue](https://github.com/NormanLabUCD/PONG2/issues/new)

Happy KIR imputation! 🧬
