Why and how to use the Ease package?

Ease aims to provide, in R, a simple and efficient way to run population genetics simulations involving multiple loci whose epistasis is fully customisable. Although it is specifically suited to modelling multilocus nucleocytoplasmic systems, it can also simulate purely nuclear (i.e. diploid) or purely cytoplasmic (i.e. haploid) genetic models. The simulations are not individual-centred: the transition from one generation to the next is computed matrix-wise from deterministic equations. Rather than describing each individual separately, the simulations only handle genotype frequencies within the population. The frequencies of all possible genotypes, given the loci and alleles defined by the user, are explicitly tracked. The simulations are therefore fast only as long as the number of genotypes is not too large.

Genetic drift, and thus a specific population size, is nevertheless taken into account through a multinomial draw each generation, which adds realism by making the simulations stochastic. In the Ease package, the life cycle of the simulated population is standard ([selection on gamete production] - [gametogenesis (recombination + meiosis + mutation)] - [selection on gametes] - [syngamy] - [selection on individuals] - [drift]) and the population may be either dioecious or hermaphroditic.
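The drift step described above can be sketched in base R. This is only an illustration under assumed values (the population size, the genotype names and the expected frequencies are hypothetical), not the code used internally by Ease:

```r
# Hypothetical expected genotype frequencies after the deterministic steps
# (selection, gametogenesis, syngamy) of one generation
set.seed(42)                              # reproducibility of the sketch
N = 100                                   # assumed population size
expectedFreq = c(AA = 0.25, Aa = 0.50, aa = 0.25)

# Drift: draw N individuals according to the expected frequencies
counts = rmultinom(n = 1, size = N, prob = expectedFreq)[, 1]

# Realised genotype frequencies passed on to the next generation
realisedFreq = counts / N
realisedFreq
```

The smaller N is, the further the realised frequencies tend to deviate from the expected ones.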

NOTE: because selection can only be defined genotype by genotype and haplotype by haplotype, Ease is (for the moment at least) not suitable when multiple loci and alleles generate many genotypes, unless you automate the process yourself. Very complex genetic models, or models involving many loci, are not well suited to simulation with Ease. As a rough guide, if the number of genotypes made possible by the input genome configuration is greater than the desired number of individuals, an individual-centred model is probably more suitable (see the SLiM software: BC Haller, PW Messer (2019). SLiM 3: Forward genetic simulations beyond the Wright–Fisher model. Molecular Biology and Evolution. 36:632.).
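To get a feel for how quickly the number of genotypes grows, the following base R sketch counts haplotypes and genotypes from the number of alleles per locus. The helper countGenomeSize is hypothetical (it is not part of Ease), and it assumes that a genotype is an unordered pair of diploid haplotypes combined with one haploid haplotype:

```r
# Hypothetical helper: number of haplotypes and genotypes implied by a genome
countGenomeSize = function(nAllelesDip, nAllelesHap) {
  nDipHaplo = prod(nAllelesDip)   # diploid haplotypes
  nHapHaplo = prod(nAllelesHap)   # haploid haplotypes
  c(haplotypes = nDipHaplo * nHapHaplo,
    # unordered pairs of diploid haplotypes, times haploid haplotypes
    genotypes = nDipHaplo * (nDipHaplo + 1) / 2 * nHapHaplo)
}

# One diploid and one haploid locus, two alleles each
small = countGenomeSize(nAllelesDip = 2, nAllelesHap = 2)
small   # 4 haplotypes, 6 genotypes

# Four biallelic diploid loci and one biallelic haploid locus
large = countGenomeSize(nAllelesDip = c(2, 2, 2, 2), nAllelesHap = 2)
large   # 32 haplotypes, 272 genotypes
```

With a population of 100 individuals, the second configuration already exceeds the rough threshold given in the note.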

Genome

Definition

A genome is defined by a set of loci, each of which has a list of alleles attached. Each locus and each allele has a unique name, which allows it to be identified unambiguously.

There are two types of loci: diploid and haploid. A genotype is defined as an allelic combination of all the alleles of an individual’s loci and a haplotype as only those alleles that have been inherited together from a single parent. A genotype is therefore made up of two haplotypes. A distinction is also made between diploid (resp. haploid) haplotypes which correspond to allelic combinations taking into account only diploid (resp. haploid) loci. The loci are defined by a list of vectors that enumerates their respective alleles. The order in which the loci are placed is not important in the case of haploid loci. It does matter in the case of diploid loci because recombination is likely to affect the haplotypes. In the Ease package, diploid loci are

When several diploid loci are defined, however, their order in the list matters. The recombination rates between adjacent loci must be given as a vector of recombination rates. For example, if three diploid loci are defined, this vector must be of length 2: its first value is the recombination rate between the first and second loci, and its second value the recombination rate between the second and third loci. As a further example, to define four loci forming two groups of two linked loci located on two different chromosomes, we can set the recombination rate vector to c(0.1, 0.5, 0.1). The first two loci are then fairly tightly linked (recombination rate of 0.1), as are the last two. The recombination rate of 0.5 between the second and third loci, on the other hand, ensures that the two groups segregate independently.
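To see why a rate of 0.5 makes the two groups independent, the recombination rate between non-adjacent loci can be derived from the rates between adjacent loci. The sketch below assumes that crossovers in successive intervals occur independently (no interference). The helper combineRates is hypothetical and not part of Ease:

```r
recRates = c(0.1, 0.5, 0.1)   # between loci 1-2, 2-3 and 3-4

# Two loci separated by two intervals recombine when a crossover occurs
# in exactly one of the two intervals (assuming no interference)
combineRates = function(r1, r2) r1 * (1 - r2) + r2 * (1 - r1)

r13 = combineRates(recRates[1], recRates[2])   # loci 1 and 3
r14 = combineRates(r13, recRates[3])           # loci 1 and 4
c(r13 = r13, r14 = r14)                        # both equal 0.5
```

Any locus of the first group thus recombines freely (rate of 0.5) with any locus of the second group, as expected for loci on different chromosomes.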

To create a haplotype ID, we concatenate all diploid alleles and all haploid alleles separately, then concatenate these two strings by separating them with "||". For example "Ab||CD" corresponds to a haplotype with four loci, two diploid with alleles A and b, and two haploid with alleles C and D. The principle is the same for the genotypes, but the second diploid haplotype is added by separating it from the first by a "/", for example "Ab/ab||CD".
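The naming rule can be reproduced with a few lines of base R. The variable names are purely illustrative, and Ease builds these IDs itself when the genome is created:

```r
dipHaplo1 = c("A", "b")   # alleles of the first diploid haplotype
dipHaplo2 = c("a", "b")   # alleles of the second diploid haplotype
hapHaplo  = c("C", "D")   # alleles of the haploid loci

collapse = function(alleles) paste0(alleles, collapse = "")

# Haplotype ID: diploid part, then "||", then haploid part
haplotypeID = paste0(collapse(dipHaplo1), "||", collapse(hapHaplo))

# Genotype ID: the two diploid haplotypes separated by "/"
genotypeID = paste0(collapse(dipHaplo1), "/", collapse(dipHaplo2),
                    "||", collapse(hapHaplo))

haplotypeID   # "Ab||CD"
genotypeID    # "Ab/ab||CD"
```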

Construction

Each locus is represented by a name and a factor vector that lists its alleles. If one wishes to consider a system with two loci, one diploid and one haploid, each with two alleles (A and a, and B and b respectively), the genome is constructed as follows:

LD = list(dl = as.factor(c("A", "a")))
HL = list(hl = as.factor(c("B", "b")))
genomeObj = setGenome(listHapLoci = HL, listDipLoci = LD)

The haplotypes and genotypes are generated automatically. Their numbers can be retrieved by simply displaying the Genome object created:

genomeObj
#> -=-=-=-=-=-= GENOME OBJECT =-=-=-=-=-=- 
#>  #  1 haploid locus, with 2 allele(s)
#>  #  1 diploid locus, with 2 allele(s)
#>  #  4 haplotypes 
#>  #  6 genotypes 
#> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- 
#> (use print for a list of haplotypes and genotypes)

and an exhaustive list can be displayed using the print method:

print(genomeObj)
#> -=-=-=-=-=-= GENOME OBJECT =-=-=-=-=-=- 
#>               in details 
#> 
#>  #   1  haploid loci: 
#>       - 'hl' with B and b alleles 
#> 
#>  #   1  diploid loci: 
#>       - 'dl' with A and a alleles 
#> 
#>  #   4  haplotypes: 
#> [1]    1) A||B    2) a||B    3) A||b    4) a||b
#> 
#>  #   6  genotypes: 
#> [1]    1) A/A||B    2) A/a||B    3) a/a||B    4) A/A||b    5) A/a||b
#> [6]    6) a/a||b
#> 
#> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

The haplotypes and genotypes are numbered, and this numbering is important when defining the different types of fitness, as described in the Selection section.

Mutation matrix

Definition

A genome necessarily has a mutation matrix attached to it. This mutation matrix is haplotypic: it is a square probability matrix (each of its rows sums to 1) whose size is equal to the number of haplotypes defined in the genome. The user does not provide this matrix directly, as it would be too tedious to define. Instead, the user is asked to either fill in the allelic mutation matrices of each locus (setMutationMatrix function), or give forward and backward mutation rates (setMutationMatrixByRates function).

NOTE: In practice, the mutation matrix is not used as such in the simulations. It is combined with the recombination matrix and with the meiosis matrix, which gives, for each genotype, the probability of producing each haplotype by chromosomal segregation. The matrix product Recombination matrix x Meiosis matrix x Mutation matrix yields a single gametogenesis matrix, which is the one used in the simulations.
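As a toy illustration of this product, consider a single diploid locus with alleles A and a. The matrices below are written by hand under that assumption. They are not extracted from Ease, whose internal matrices may be organised differently:

```r
genotypes  = c("A/A", "A/a", "a/a")
haplotypes = c("A", "a")

# Recombination: identity here, since a single locus cannot recombine
recMat = diag(3)
dimnames(recMat) = list(genotypes, genotypes)

# Meiosis: probability that each genotype produces each haplotype
meiMat = matrix(c(1.0, 0.0,
                  0.5, 0.5,
                  0.0, 1.0),
                nrow = 3, byrow = TRUE,
                dimnames = list(genotypes, haplotypes))

# Mutation: allelic mutation probabilities (each row sums to 1)
mutMat = matrix(c(0.90, 0.10,
                  0.09, 0.91),
                nrow = 2, byrow = TRUE,
                dimnames = list(haplotypes, haplotypes))

# Gametogenesis: genotype -> haplotype probabilities after mutation
gametoMat = recMat %*% meiMat %*% mutMat
gametoMat
```

Each row of gametoMat is a probability distribution over gametes. For instance, a heterozygote A/a produces A gametes with probability 0.5 * 0.9 + 0.5 * 0.09 = 0.495.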

Construction

Definition of the haplotypic mutation matrix by filling in the allelic mutation matrices:

mutMatrixObj = setMutationMatrix(genomeObj = genomeObj,
                                 mutHapLoci = list(matrix(c(0.95, 0.05, 0.03, 0.97), 2, byrow = TRUE)),
                                 mutDipLoci = list(matrix(c(0.9, 0.1, 0.09, 0.91), 2, byrow = TRUE)))
mutMatrixObj
#> -=-=-=- MUTATION MATRIX OBJECT -=-=-=- 
#>  #  Haplotypic mutation matrix: 
#>        A||B   a||B   A||b   a||b
#> A||B 0.8550 0.0950 0.0450 0.0050
#> a||B 0.0855 0.8645 0.0045 0.0455
#> A||b 0.0270 0.0030 0.8730 0.0970
#> a||b 0.0027 0.0273 0.0873 0.8827
#> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- 
#> (use print to access the allelic mutation matrices)
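The haplotypic matrix printed above is consistent with mutations occurring independently at each locus, in which case it is the Kronecker product of the allelic matrices. The following base R sketch reproduces the printed values under that assumption. It does not call Ease and may differ from how the package computes the matrix internally:

```r
# Allelic mutation matrices, as given to setMutationMatrix above
mutHap = matrix(c(0.95, 0.05, 0.03, 0.97), 2, byrow = TRUE)  # locus hl: B, b
mutDip = matrix(c(0.90, 0.10, 0.09, 0.91), 2, byrow = TRUE)  # locus dl: A, a

# Independent mutations: haplotype transition probabilities are products
# of allelic transition probabilities
hapMut = kronecker(mutHap, mutDip)

ids = c("A||B", "a||B", "A||b", "a||b")
dimnames(hapMut) = list(ids, ids)
hapMut
```

For example, the probability of going from A||B to a||b is 0.1 * 0.05 = 0.005, which matches the top-right entry of the printed matrix.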

Definition of the haplotypic mutation matrix by filling in the forward and backward mutation rates:

mutMatrixObj = setMutationMatrixByRates(genomeObj = genomeObj, forwardMut = 1e-2)
mutMatrixObj
#> -=-=-=- MUTATION MATRIX OBJECT -=-=-=- 
#>  #  Haplotypic mutation matrix: 
#>        A||B   a||B   A||b  a||b
#> A||B 0.9801 0.0099 0.0099 1e-04
#> a||B 0.0000 0.9900 0.0000 1e-02
#> A||b 0.0000 0.0000 0.9900 1e-02
#> a||b 0.0000 0.0000 0.0000 1e+00
#> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- 
#> (use print to access the allelic mutation matrices)
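The output above suggests that a forward rate of 0.01 is applied from the first to the second allele of each locus, with no backward mutation when none is given. Under that reading, the matrix can be rebuilt in base R as follows. The helper ratesToMatrix is hypothetical and only handles biallelic loci:

```r
# Hypothetical helper: allelic mutation matrix of a biallelic locus
ratesToMatrix = function(forward, backward = 0) {
  matrix(c(1 - forward, forward,
           backward,    1 - backward),
         nrow = 2, byrow = TRUE)
}

mutDip = ratesToMatrix(forward = 1e-2)   # locus dl: A -> a
mutHap = ratesToMatrix(forward = 1e-2)   # locus hl: B -> b

# Independent mutations at both loci
hapMut = kronecker(mutHap, mutDip)
ids = c("A||B", "a||B", "A||b", "a||b")
dimnames(hapMut) = list(ids, ids)
hapMut
```

Without backward mutation, the haplotype a||b is absorbing: once reached, it cannot mutate back, hence the 1 in the bottom-right corner.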

Selection

Definition

Selection can be defined at three stages of the life cycle: on mature individuals directly, on their gamete production, or on the gametes. For individuals and gamete production, a fitness value is associated with each genotype. For gametes, a fitness value is associated with each haplotype. When defining fitness vectors, it is therefore necessary to know the order of the haplotypes and genotypes (see the Genome section).

A fitness value is any non-negative real number. Fitness values are relative, so if all genotypes have a fitness value of 3, there will be no effect on the dynamics of the model.
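The relative nature of fitness can be checked with a small base R sketch, assuming the usual rule that frequencies after selection are proportional to frequency times fitness. The function applySelection and the frequencies are hypothetical and not part of Ease:

```r
# Hypothetical genotype frequencies before selection
freq = c(0.2, 0.3, 0.5)

# Frequencies after selection: weight by fitness, then renormalise
applySelection = function(freq, fitness) {
  weighted = freq * fitness
  weighted / sum(weighted)
}

# Identical fitness values leave the frequencies unchanged
uniform = applySelection(freq, c(3, 3, 3))

# Multiplying all fitness values by a constant changes nothing either
base   = applySelection(freq, c(1, 1, 0.2))
scaled = applySelection(freq, 5 * c(1, 1, 0.2))

all.equal(uniform, freq)   # TRUE
all.equal(base, scaled)    # TRUE
```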

Construction

In all cases, a Selection object is constructed from a Genome object, which is used to check that the desired selection parameters are compatible with the genome.

Neutral selection

For example, a neutral model (no selection) can be defined with the setSelectNeutral function, which constructs a Selection object in which all fitness values are identical (equal to 1):

selectionObj = setSelectNeutral(genomeObj = genomeObj)

We can then check that no selection has been defined:

selectionObj
#> -=-=-=-=-=- SELECTION OJBECT =-=-=-=-=- 
#>  #  On individuals:  NO 
#>  #  On gametes:  NO 
#>  #  On gamete production:  NO 
#> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- 
#> (use print to access the fitness values)

or with the print method:

print(selectionObj)
#> -=-=-=-=-=- SELECTION OJBECT =-=-=-=-=- 
#>               in details 
#> 
#> No selection defined. 
#> 
#> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Non-neutral selection

Using the example given in the Genome section, one might want to simulate a system of genetic incompatibility in which the derived alleles a and b, when brought together within the same genotype, induce a fitness cost through negative epistasis. This cost, which we will call s, is modulated by a dominance coefficient h, which reduces the cost when the nuclear allele a is heterozygous. The genotypes A/A||B, A/a||B, a/a||B and A/A||b therefore suffer no fitness cost (because they carry at most one of the two incompatible alleles) and have a fitness of 1. The genotype A/a||b, which bears the reduced cost of incompatibility, has a fitness of 1 - h*s, and the genotype a/a||b, which bears the full cost, has a fitness of 1 - s.

s = 0.8
h = 0.5
selectionObj = setSelectOnInds(genomeObj = genomeObj, indFit = c(1, 1, 1, 1, 1 - h*s, 1 - s))

We can then check that selection has been defined:

selectionObj
#> -=-=-=-=-=- SELECTION OJBECT =-=-=-=-=- 
#>  #  On individuals:  YES 
#>  #  On gametes:  NO 
#>  #  On gamete production:  NO 
#> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- 
#> (use print to access the fitness values)

Selection on individuals is not necessarily defined in the same way for a hermaphroditic and for a dioecious population. With hermaphroditism there is no distinction between female and male fitness, so the indFit parameter governs the fitness of all individuals. If the sexes are separate, however, one can either define an individual fitness indFit that applies to both males and females, or specify fitness separately for females and males with the parameters femaleFit and maleFit.

In any case it is good to check with the print method that the fitnesses are those wanted:

print(selectionObj)
#> -=-=-=-=-=- SELECTION OJBECT =-=-=-=-=- 
#>               in details 
#> 
#>        Individuals Female Male
#> A/A||B         1.0    1.0  1.0
#> A/a||B         1.0    1.0  1.0
#> a/a||B         1.0    1.0  1.0
#> A/A||b         1.0    1.0  1.0
#> A/a||b         0.6    0.6  0.6
#> a/a||b         0.2    0.2  0.2
#> 
#> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
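Because fitness vectors are positional, it is easy to put a value in the wrong place. One way to reduce that risk, sketched here in base R, is to build the vector by genotype name and check it before use. The genotype IDs are copied by hand from the print(genomeObj) output of the Genome section:

```r
s = 0.8
h = 0.5

# Genotype IDs, in the order listed by print(genomeObj)
genotypeIDs = c("A/A||B", "A/a||B", "a/a||B", "A/A||b", "A/a||b", "a/a||b")

# Start from a neutral fitness of 1, then set the incompatible genotypes
indFit = setNames(rep(1, length(genotypeIDs)), genotypeIDs)
indFit["A/a||b"] = 1 - h * s   # reduced cost in heterozygotes
indFit["a/a||b"] = 1 - s       # full cost in homozygotes

indFit
```

The values of indFit are the same as the vector c(1, 1, 1, 1, 1 - h*s, 1 - s) used above, so unname(indFit) can be supplied as the indFit parameter.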

Selection can also be defined on gamete production:

selectionObj = setSelectOnGametesProd(genomeObj = genomeObj, indProdFit = c(1, 1, 1, 1, 1 - h*s, 1 - s))

or on the gametes directly:

selectionObj = setSelectOnGametes(genomeObj = genomeObj, femaleFit = c(1, 1, 1 - s, 1 - s))

For both of these forms of selection (on gamete production and on gametes), fitness can be defined either sex by sex or for all gametes at once, as desired.

Last but not least, these different layers of selection can be combined. This is done using the selectionObj parameter that each of the setSelect... functions has (except setSelectNeutral). It is then unnecessary to specify again the genome to which the selection refers. For example, if we want to combine the three types of selection presented here:

s = 0.8
h = 0.5
selectionObj = setSelectOnInds(genomeObj = genomeObj, 
                               indFit = c(1, 1, 1, 1, 1 - h*s, 1 - s))
selectionObj = setSelectOnGametesProd(indProdFit = c(1, 1, 1, 1, 1 - h*s, 1 - s),
                                      selectionObj = selectionObj)
selectionObj = setSelectOnGametes(femaleFit = c(1, 1, 1 - s, 1 - s),
                                  selectionObj = selectionObj)
print(selectionObj)
#> -=-=-=-=-=- SELECTION OJBECT =-=-=-=-=- 
#>               in details 
#> 
#>        Individuals Female Male
#> A/A||B         1.0    1.0  1.0
#> A/a||B         1.0    1.0  1.0
#> a/a||B         1.0    1.0  1.0
#> A/A||b         1.0    1.0  1.0
#> A/a||b         0.6    0.6  0.6
#> a/a||b         0.2    0.2  0.2
#> 
#>  #  On gametes:  
#>      Female gamete Male gamete
#> A||B           1.0           1
#> a||B           1.0           1
#> A||b           0.2           1
#> a||b           0.2           1
#> 
#>  #  On gamete production:  
#>        Female gamete Male gamete
#> A/A||B           1.0         1.0
#> A/a||B           1.0         1.0
#> a/a||B           1.0         1.0
#> A/A||b           1.0         1.0
#> A/a||b           0.6         0.6
#> a/a||b           0.2         0.2
#> 
#> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Ease object

Definition

The Ease class object, which shares its name with the package, gathers all the parameters needed to build the model and run simulations. It takes these parameters as input, deduces the transition matrices used for the simulations and then, through the simulate method, generates the simulation results. The results can be retrieved with the getResults function, and a summary analysis is available through the plot and summary methods. The simulations can also be recorded every x generations, i.e. the genotypic and allelic frequencies are stored for all simulations. This gives a better understanding of the simulation dynamics if needed, but requires longer computation times. These records are accessed through the getRecords method.

There are two ways in which a simulation can stop: either it meets a stop condition, or it reaches a user-defined generation threshold. A stop condition is a vector containing the name(s) of the allele(s) which, when fixed, cause the simulation to stop. Whether one or several stop conditions are defined, they are always gathered in a list, which allows them to be named (this is recommended).
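The logic of named stop conditions can be illustrated in base R, assuming that a condition is met when all the alleles it lists are fixed (frequency of 1). The allele frequencies below are hypothetical, and this check is a sketch rather than the one performed by Ease:

```r
# Named stop conditions: each one lists the allele(s) that must be fixed
stopCondition = list(nucleo = "a", cyto = "b")

# Hypothetical allele frequencies at some generation
alleleFreq = c(A = 0, a = 1, B = 0.4, b = 0.6)

# A condition is met when all of its alleles have a frequency of 1
# (a small tolerance guards against floating-point error)
isMet = vapply(stopCondition,
               function(alleles) all(alleleFreq[alleles] >= 1 - 1e-12),
               logical(1))

names(isMet)[isMet]   # "nucleo"
```

Naming the conditions makes it possible to tell which of them ended a given simulation.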

Construction

We build a model via the setEase function by giving it as parameters the population size N, the threshold of generations not to be exceeded, the type of system, dioecy (dioecy = TRUE) or hermaphroditism (dioecy = FALSE), the rate of self-fertilisation (which will be ignored in dioecy), the list of stop conditions, and then the three objects of the three classes that were presented in the previous sections, i.e., the mutation matrix, the genome, and the selection object.

mod = setEase(N = 100, threshold = 1e6, dioecy = FALSE, selfRate = 0.5,
            stopCondition = list(nucleo = "a", cyto = "b"),
            mutMatrixObj = mutMatrixObj,
            genomeObj = genomeObj,
            selectionObj = selectionObj)

Then simulations can be generated:

mod = simulate(mod, nsim = 50, recording = TRUE, seed = 123)

And the results can be displayed using the plot method:

plot(mod)
