The Latent Dirichlet Allocation model (LDA), first proposed by @blei2003latent, has been extensively used for text mining in multiple fields. @tsai2011tag used LDA to construct clusters of tags that represent the most common topics in blogs. @lee2010empirical compared LDA against three other frequently used text-mining methods: latent semantic analysis, probabilistic latent semantic analysis, and the correlated topic model. The major limitation of LDA identified by these authors was that the method does not consider relationships among topics, as a mixed-membership clustering approach does [@erosheva2005bayesian]. Despite this limitation, LDA continues to be used in multiple disciplines. For instance, @griffiths2004finding used LDA to identify the main scientific topics in a large corpus of articles from the Proceedings of the National Academy of Sciences. In conservation biology, LDA has been used to identify research gaps in the conservation literature [@westgate2015text]. LDA has also been proposed as a promising method for the automatic annotation of remote sensing imagery [@lienou2010semantic]. In marketing, LDA has been used to extract information from product reviews across 15 firms in five markets over four years, enabling the identification of the most important latent dimensions of consumer decision making in each studied market [@tirunillai2014mining]. Finally, in finance, a stock market analysis system based on LDA was used to combine financial news items with stock market data to identify and characterize major events that impact the market. This system was then used to make predictions regarding stock market behavior based on news items identified by LDA [@mahajan2008mining].
Despite its success in text mining across multiple fields, LDA need not be restricted to text mining. More specifically, LDA can be viewed as a mixed-membership model, since each element in the sample can belong to more than one cluster (or state) simultaneously. There are a few examples of LDA being used for purposes other than text mining. For instance, a modified version of LDA has been extensively used on genetic data to identify populations and the admixture probabilities of individuals [@pritchard2000inference]. Similarly, LDA has been used in ecology to identify plant communities from tree data for the eastern United States and from a tropical forest chronosequence [@valle2014decomposing].
The aim of this paper is to present the Rlda package for mixed-membership clustering analysis and to describe this novel Bayesian model for different types of data (i.e., Multinomial, Bernoulli and Binomial), illustrating its use in a diverse set of examples. The innovative features of this model are twofold. First, it generalizes LDA to other types of commonly encountered categorical data. Second, it enables the selection of the optimal number of clusters through a truncated stick-breaking prior that regularizes model results.
This vignette is organized as follows. Sections 2 and 3 describe the mathematical formulation of the Bayesian LDA mixed-membership cluster model. Sections 4 and 5 present examples of the use of the package and the conclusions, respectively.
In the Bayesian LDA mixed-membership cluster model we postulate that each element is allocated to a single cluster, represented by a latent state variable. Specifically, consider a latent matrix \(\mathbf{Z}\) with dimension \(L\times C\), where each row represents a sampling unit (\(l=1,\dots,L\)) and each column a possible state or cluster (\(c=1,\dots,C\)). The data-generating process postulated for this latent matrix is given by:
\begin{equation} \mathbf{Z}_{l\cdot}\sim Multinomial(n_{l},\boldsymbol\theta_{l}) \label{eq:eq0001} \end{equation}
\noindent where \(n_{l}\) is the total number of elements drawn for location \(l\) and \(\boldsymbol\theta_{l}=(\theta_{l1},\dots,\theta_{lC})\) is a vector of parameters representing the probability of allocation to each cluster. Following Occam's razor, we intend to create as few clusters as possible, which is achieved by assuming a truncated stick-breaking prior:
\begin{equation} \theta_{lc}=V_{lc}\displaystyle\prod_{c^{*}=1}^{c-1}(1-V_{lc^{*}}) \label{eq:eq0002} \end{equation}
\noindent where \(V_{lc}\sim Beta(1,\gamma)\) for \(c=1,\dots,C-1\) and \(V_{lC}=1\) by definition. This truncated stick-breaking prior will force the elements to be aggregated in the minimum number of clusters, given that \(\theta_{lc^{*}}\) is stochastically exponentially decreasing.
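The construction above can be illustrated with a minimal base R sketch for a single sampling unit. The truncation level `C` and the value of `gamma` below are illustrative assumptions, not package defaults.

```r
# Truncated stick-breaking prior for one sampling unit (illustrative values)
set.seed(1)
C     <- 10    # truncation level: maximum number of clusters (assumed)
gamma <- 0.1   # with gamma < 1, Beta(1, gamma) concentrates V near 1

# V_c ~ Beta(1, gamma) for c = 1, ..., C-1 and V_C = 1 by definition
V <- c(rbeta(C - 1, shape1 = 1, shape2 = gamma), 1)

# theta_c = V_c * prod_{c* < c} (1 - V_{c*})
theta <- V * c(1, cumprod(1 - V[-C]))

round(theta, 3)   # most of the mass tends to fall on the first clusters
sum(theta)        # equals 1 up to floating-point error
```

Because \(V_{lC}=1\), the products telescope and the elements of `theta` always sum to one, whatever the draws of the other `V` values.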
In the second hierarchical level, we consider a matrix \(\mathbf{Y}\) with dimension equal to \(L\times S\) where each row represents a sampling unit (e.g., locations, firms, individuals, plots) and each column a variable that describes these elements. In the Bayesian LDA model for mixed-membership clustering, after integrating over the latent vector \(\mathbf{Z}_{l\cdot}\), \(Y_{ls}\) can follow one of these distributions:
\begin{equation}
\begin{cases}
\mathbf{Y}_{l\cdot}\sim Multinomial(n_{l},\boldsymbol\theta_{l}^{t}\boldsymbol\Phi)\\
Y_{ls}\sim Bernoulli(\boldsymbol\theta_{l}^{t}\boldsymbol\phi_{s})\\
Y_{ls}\sim Binomial(n_{ls},\boldsymbol\theta_{l}^{t}\boldsymbol\phi_{s})
\end{cases}
\label{eq:eq0003}
\end{equation}
\noindent for \(l=1,\dots,L\) and \(s=1,\dots,S\) possible variables. \(Y_{ls}\) represents a random variable, \(\boldsymbol Y_{l\cdot}\) is a vector with these random variables for location \(l\), \(n_{l}\) is the total number of elements in sampling unit \(l\), \(n_{ls}\) is the total number of elements in sampling unit \(l\) and variable \(s\). In these models, \(\boldsymbol\phi_{s}=(\phi_{1s},\dots,\phi_{Cs})\) is a vector of parameters, while \(\boldsymbol\Phi\) is a \(C\times S\) matrix of parameters, given by:
\[ \mathbf{\Phi}=\begin{bmatrix} \phi_{11} & \phi_{12} & \dots & \phi_{1S} \\ \phi_{21} & \phi_{22} & \dots & \phi_{2S} \\ \vdots & \vdots & \ddots & \vdots \\ \phi_{C1} & \phi_{C2} & \dots & \phi_{CS} \end{bmatrix} \]
In the last step, we specify the priors for \(\phi_{cs}\). For the Multinomial model, we adopt a Dirichlet prior (i.e., \(\boldsymbol\phi_{c}\sim Dirichlet(\boldsymbol\beta)\), where \(\boldsymbol\beta=(\beta_{1},\dots,\beta_{S})\) is the hyperparameter vector). For the Bernoulli and Binomial representations, we assume that \(\phi_{cs}\) comes from a Beta distribution (i.e., \(\phi_{cs}\sim Beta(\alpha_{0},\alpha_{1})\)).
These models are fit using Gibbs Sampling where parameter draws are iteratively made from each full conditional distribution. From a conceptual perspective, all of these models assume the following matrix decomposition:
\begin{equation} \mathbb{E}[\mathbf{Y}_{L\times S}]=\mathbf{K}\circ[\Theta_{L\times C}\Phi_{C\times S}] \label{eq:eq0004} \end{equation}
where \(\mathbf{K}\) is an \(L\times S\) matrix of constants and \(\circ\) is the Hadamard (element-wise) product. Sparseness is ensured by forcing the columns of \(\Theta_{L\times C}\) associated with large \(c\) to be close to zero. For the Multinomial model, each row of \(\mathbf{K}\) contains the total number of elements in that row (i.e., \(n_{l}\)), whereas for the Bernoulli model this matrix is a matrix of ones, so that the Hadamard product leaves \(\Theta\Phi\) unchanged. Finally, for the Binomial model, the \(\mathbf{K}\) matrix has the total number of trials of each binomial distribution (i.e., \(n_{ls}\)). Although there are many ways matrices can be decomposed, the key characteristic of the form of matrix decomposition we choose is that each row of \(\mathbf{\Theta}\) is comprised of probabilities that sum to one. As a result, one can interpret \(\Phi_{C\times S}\) as the matrix that contains the “pure” features of the data, which are then mixed by the matrix \(\Theta_{L\times C}\) and multiplied by \(\mathbf{K}\) to generate the expected data.
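As a small numerical illustration of this decomposition, the base R sketch below builds the expected data for the Multinomial case. All dimensions and values are made up for the example.

```r
# Expected data under the Multinomial model: E[Y] = K o (Theta %*% Phi)
set.seed(2)
L <- 4; C <- 2; S <- 3   # toy dimensions (assumed)

# Theta: each row is a probability vector over the C clusters
Theta <- matrix(runif(L * C), L, C)
Theta <- Theta / rowSums(Theta)

# Phi: for the Multinomial model each row is a probability vector over S variables
Phi <- matrix(runif(C * S), C, S)
Phi <- Phi / rowSums(Phi)

# K: the row total n_l repeated across the S columns
n <- c(10, 20, 30, 40)
K <- matrix(n, nrow = L, ncol = S)

# In R, `*` is the element-wise (Hadamard) product and `%*%` the matrix product
EY <- K * (Theta %*% Phi)
rowSums(EY)   # recovers the row totals n
```

For the Bernoulli case the same code applies with `K` filled with ones and the entries of `Phi` being free probabilities rather than rows that sum to one.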
##Full Conditional Distributions - FCD.

###Bernoulli model.
The probability of community membership status for each element \(W_{ls}\), where \(Z_{lc}=\sum_{s=1}^S I(W_{ls}=c)\), is given by:
\begin{equation} \begin{array}{lll} p(W_{ls}=c^{*}|…)&=&\frac{\theta_{lc^{*}}\phi_{c^{*}s}^{y_{ls}}(1-\phi_{c^{*}s})^{1-y_{ls}}}{\displaystyle\sum_{c=1}^{C}\theta_{lc}\phi_{cs}^{y_{ls}}(1-\phi_{cs})^{1-y_{ls}}} \end{array} \label{eq:eq0005} \end{equation}

Therefore, \(W_{ls}\) can be drawn from a categorical distribution. The FCD for \(V_{lc}\) is given by:
\[ p(V_{lc}|…)=Beta(z_{lc}+1,z_{l(c^{*}>c)}+\gamma) \]
\noindent where \(z_{lc}\) is the total number of elements in location \(l\) classified into cluster \(c\), and \(z_{l(c^{*}>c)}\) is the total number of elements in location \(l\) classified in clusters larger than \(c\). This latter quantity is given by \(z_{l(c^{*}>c)}=\sum_{s=1}^{S}\sum_{c^{*}=c+1}^{C}I(w_{ls}=c^{*})\). Finally, the FCD for \(\phi_{cs}\) is given by:
\[ p(\phi_{cs}|…)=Beta(q_{cs}^{(1)}+\alpha_0,q_{cs}^{(0)}+\alpha_1) \] where \(q_{cs}^{(j)}\) is the number of elements assigned to group \(c\) and for which \(y_{ls}=j\) (i.e., \(q_{cs}^{(j)}=\sum_{l=1}^L I(w_{ls}=c,y_{ls}=j)\)).
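The three full conditionals of the Bernoulli model can be combined into one Gibbs sweep. The following is a minimal base R sketch on simulated toy data, written only to illustrate the formulas above; it is not the implementation used inside the package, and all dimensions, hyperparameter values and starting values are assumptions.

```r
# One Gibbs sweep for the Bernoulli model (illustrative sketch, not package code)
set.seed(3)
L <- 5; S <- 4; C <- 3                       # toy dimensions (assumed)
gamma <- 0.1; alpha0 <- 1; alpha1 <- 1       # hyperparameters (assumed)
y <- matrix(rbinom(L * S, 1, 0.5), L, S)     # toy presence/absence data

# Arbitrary starting values
theta <- matrix(1 / C, L, C)
phi   <- matrix(runif(C * S), C, S)
w     <- matrix(NA_integer_, L, S)

# 1) Draw each W_ls from its categorical full conditional
for (l in 1:L) {
  for (s in 1:S) {
    p <- theta[l, ] * phi[, s]^y[l, s] * (1 - phi[, s])^(1 - y[l, s])
    w[l, s] <- sample.int(C, size = 1, prob = p)   # prob is normalized internally
  }
}

# 2) Draw V_lc from Beta(z_lc + 1, z_l(c* > c) + gamma) and rebuild theta
z <- t(apply(w, 1, tabulate, nbins = C))           # L x C matrix of counts z_lc
for (l in 1:L) {
  z_greater <- rev(cumsum(rev(z[l, ]))) - z[l, ]   # elements in clusters larger than c
  V <- c(rbeta(C - 1, z[l, -C] + 1, z_greater[-C] + gamma), 1)
  theta[l, ] <- V * c(1, cumprod(1 - V[-C]))
}

# 3) Draw phi_cs from Beta(q1 + alpha0, q0 + alpha1)
for (k in 1:C) {
  for (s in 1:S) {
    q1 <- sum(w[, s] == k & y[, s] == 1)
    q0 <- sum(w[, s] == k & y[, s] == 0)
    phi[k, s] <- rbeta(1, q1 + alpha0, q0 + alpha1)
  }
}
```

Repeating these three steps and storing `theta` and `phi` at each iteration would yield the posterior samples; a practical implementation would vectorize the loops or move them to compiled code.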
###Binomial model.

For this model, we have \(n_{ls}\) elements for each sampling unit \(l\) and variable \(s\). The community membership status of the \(i\)-th element is denoted by \(W_{ils}\), where \(Z_{lc}=\sum_{s=1}^S \sum_{i=1}^{n_{ls}}I(W_{ils}=c)\), and its probability is similar to the one for the Bernoulli model:
\begin{equation} \begin{array}{lll} p(W_{ils}=c^{*}|…)&=&\frac{\theta_{lc^{*}}\phi_{c^{*}s}^{x_{ils}}(1-\phi_{c^{*}s})^{1-x_{ils}}}{\displaystyle\sum_{c=1}^{C}\theta_{lc}\phi_{cs}^{x_{ils}}(1-\phi_{cs})^{1-x_{ils}}} \end{array} \label{eq:eq0006} \end{equation}
\noindent where \(x_{ils}\) are binary random variables such that \(\sum_{i=1}^{n_{ls}}x_{ils}=y_{ls}\). Therefore, \(W_{ils}\) can be drawn from a categorical distribution. The FCD for \(\phi_{cs}\) is given by:
\begin{equation} \begin{array}{lll} p(\phi_{cs}|…)= Beta\left(\displaystyle q_{cs}^{(1)}+\alpha_{0},q_{cs}^{(0)}+\alpha_{1}\right) \end{array} \label{eq:eq0007} \end{equation}
\noindent where, similar to the Bernoulli model, \(q_{cs}^{(j)}=\sum_{l=1}^{L} \sum_{i=1}^{n_{ls}}I(x_{ils}=j,w_{ils}=c)\).
Finally, the FCD for \(V_{lc}\) is given by:
\begin{equation} p\left(V_{lc}|…\right) = Beta(z_{lc}+1,z_{l(c^{*}>c)}+\gamma) \label{eq:eq0008} \end{equation}
\noindent where \(z_{lc}\) is the total number of elements in location \(l\) classified into cluster \(c\) and \(z_{l(c^{*}>c)}\) is the total number of elements in location \(l\) classified in clusters larger than \(c\). This latter quantity is given by \(z_{l(c^{*}>c)}=\sum_{s=1}^{S}\sum_{i=1}^{n_{ls}}\sum_{c^{*}=c+1}^{C}I(w_{ils}=c^{*})\).
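To make the element-level notation of the Binomial model concrete, the base R sketch below expands toy counts \(y_{ls}\) out of \(n_{ls}\) trials into binary elements \(x_{ils}\), assigns each element to a cluster at random (a stand-in for a draw of \(W_{ils}\)), and computes the counts \(q_{cs}^{(j)}\) and \(z_{lc}\) used by the full conditionals. All values are assumptions for illustration.

```r
# Element-level bookkeeping for the Binomial model (illustrative sketch)
set.seed(6)
L <- 3; S <- 2; C <- 2
alpha0 <- 1; alpha1 <- 1                                   # assumed hyperparameters
n <- matrix(c(4, 5, 3, 6, 2, 4), L, S)                     # trials n_ls
y <- matrix(rbinom(L * S, size = as.vector(n), prob = 0.4), L, S)  # successes y_ls

# One row per element i, with cells (l, s) in column-major order
cells <- expand.grid(l = 1:L, s = 1:S)
elem  <- cells[rep(seq_len(nrow(cells)), times = as.vector(n)), ]

# Binary x_ils: y_ls ones followed by n_ls - y_ls zeros within each cell
elem$x <- unlist(Map(function(yy, nn) rep(c(1, 0), times = c(yy, nn - yy)),
                     as.vector(y), as.vector(n)))

# Placeholder cluster assignments w_ils (random here, not a real FCD draw)
elem$w <- sample.int(C, nrow(elem), replace = TRUE)

# q_cs^(j): elements in cluster c, variable s, with x_ils = j
count_cs <- function(d) table(factor(d$w, levels = 1:C), factor(d$s, levels = 1:S))
q1 <- count_cs(elem[elem$x == 1, ])
q0 <- count_cs(elem[elem$x == 0, ])

# z_lc: elements of location l assigned to cluster c
z <- table(factor(elem$l, levels = 1:L), factor(elem$w, levels = 1:C))

# Draw phi_cs from Beta(q1 + alpha0, q0 + alpha1); both tables are C x S
phi <- matrix(rbeta(C * S, q1 + alpha0, q0 + alpha1), C, S)
```

Summing `elem$x` within each `(l, s)` cell recovers `y`, which is the constraint \(\sum_{i}x_{ils}=y_{ls}\) stated above.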
###Multinomial model.

For the Multinomial case, if unit \(i\) in location \(l\) is associated with variable \(s\) (i.e., \(x_{il}=s\) such that \(y_{ls}=\sum_{i=1}^{n_{l}}I(x_{il}=s)\)), we have that:
\begin{equation} p(W_{il}=c^{*}|…)=\frac{\theta_{lc^{*}}\phi_{c^{*}s}}{\left(\theta_{l1}\phi_{1s}+\dots+\theta_{lC}\phi_{Cs}\right)} \label{eq:eq0009} \end{equation}
In this equation, \(W_{il}\) is the group assignment of element \(i\) in location \(l\), such that \(Z_{lc}=\sum_{i=1}^{n_l} I(W_{il}=c)\), and it can be sampled from a categorical distribution. Since we assumed a conjugate prior for \(\boldsymbol\phi_{c}\) with \(c\in\{1,\dots,C\}\), the Full Conditional Distribution for this vector of parameters is a straightforward Dirichlet distribution:
\begin{equation} p(\boldsymbol\phi_{c}|…)=Dirichlet([q_{c1}+\beta_{1},\dots,q_{cS}+\beta_{S}]) \label{eq:eq0010} \end{equation}
where \(q_{cs}=\sum_{l=1}^{L}\sum_{i=1}^{n_{l}}I(w_{il}=c,x_{il}=s)\).
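The Dirichlet update can be sketched in base R as follows. Base R has no Dirichlet sampler, so the sketch uses the standard construction of normalizing independent Gamma draws; the helper `rdirichlet1`, the toy assignments and the hyperparameter values are all assumptions made for illustration.

```r
# Dirichlet full conditional for phi_c in the Multinomial model (illustrative sketch)
set.seed(4)
C <- 3; S <- 4
beta <- rep(1, S)                          # assumed Dirichlet hyperparameters

# Toy element-level data pooled over locations:
# x = variable index of each element, w = its current cluster assignment
x <- sample.int(S, 50, replace = TRUE)
w <- sample.int(C, 50, replace = TRUE)

# q[c, s] = number of elements with w = c and x = s
q <- table(factor(w, levels = 1:C), factor(x, levels = 1:S))

# Hypothetical helper: one Dirichlet(a) draw via normalized Gamma variates
rdirichlet1 <- function(a) {
  g <- rgamma(length(a), shape = a, rate = 1)
  g / sum(g)
}

# One draw of phi_c ~ Dirichlet(q_c1 + beta_1, ..., q_cS + beta_S) per cluster
phi <- t(apply(q, 1, function(qc) rdirichlet1(qc + beta)))
rowSums(phi)   # each row sums to 1
```

Each row of `phi` is a probability vector over the \(S\) variables, matching the rows of \(\boldsymbol\Phi\) in the Multinomial model.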
Finally, the FCD for \(V_{lc}\) is given by:
\begin{equation} p(V_{lc}|…)=Beta(z_{lc}+1,z_{l(c^{*}>c)} +\gamma) \label{eq:eq0011} \end{equation}
\noindent where \(z_{l(c^{*}>c)}\) is the total number of elements in observation \(l\) classified in clusters larger than \(c\). This quantity is given by \(z_{l(c^{*}>c)}=\sum_{i=1}^{n_{l}}\sum_{c^{*}=c+1}^{C}I(w_{il}=c^{*})\).
##Examples.

In this section we present some applications of our Rlda package for mixed-membership clustering in marketing and ecology.
###Marketing.

It is well known that attracting a new customer is often considerably costlier than keeping a current customer [@kotler2006principles]. For this reason, firms can better retain their customers if they pay careful attention to their complaints and work to resolve them satisfactorily. Our first application therefore applies the classical LDA for Multinomial data to the field of marketing. Specifically, we are interested in characterizing firms based on their consumers' complaints.
The data come from the 2015 Consumer Complaint Database and consist of complaints received by the Bureau of Consumer Financial Protection in the US regarding financial products and services. In this example, we work only with credit card complaints. This dataset contains the number of complaints for each firm (\(L = 226\)), categorized according to the specific type of issue (\(S = 30\)). Examples of issues include billing disputes, identity theft / fraud, and unsolicited issuance of credit cards. In this case, each sampling unit represents a firm and each variable represents an issue.
The characterization of firms provided by our analysis can be useful to reveal commonalities and differences across different firms. This can then be used by managers to identify and potentially adopt the solutions that are employed by other firms to deal with these issues.
To use the Rlda package for the Multinomial entry, it is first necessary to create a matrix where each cell represents the total number of cases observed for each sampling unit and type of complaint.
library(Rlda)
library(reshape2)
#Read the complaints data
data(complaints)
#Create the abundance matrix: one row per company, one column per issue,
#each cell counting the complaints for that company/issue pair
mat1<- dcast(complaints[ ,c("Company","Issue")],
             Company~Issue, length,
             value.var="Issue")
#Use the company names as row names
rownames(mat1)<- mat1[,1]
#Remove the ID variable
mat1<- mat1[,-1]
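The shape of the matrix expected by the Multinomial model can be seen on a toy example. The sketch below uses a made-up data frame and base R's `table` instead of `dcast`, purely to show the result of the reshaping step; the company and issue labels are invented.

```r
# Toy version of the reshaping step: long complaint records -> count matrix
toy <- data.frame(
  Company = c("A", "A", "B", "B", "B", "C"),
  Issue   = c("Billing", "Fraud", "Billing", "Billing", "Fraud", "Fraud")
)

# Rows are sampling units (companies), columns are variables (issues),
# and each cell counts the complaints for that pair
toy_mat <- unclass(table(toy$Company, toy$Issue))
toy_mat

# Row totals play the role of n_l in the Multinomial model
rowSums(toy_mat)
```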
To use the rlda.multinomial method we need to specify several arguments. We set the maximum number of clusters (n_community) to 30 and the number of Gibbs Sampling iterations (n_gibbs) to 1000. Finally, we ask that the algorithm output the sum of the log-likelihood and the log-prior (ll_prior=TRUE) and we choose not to display the progress bar (display_progress=FALSE):
#Set seed for reproducibility
set.seed(9292)
#Hyperparameters for each prior distribution
#(beta: Dirichlet prior on Phi; gamma: stick-breaking prior on Theta)
beta<- rep(1,ncol(mat1))
gamma<- 0.01
#Execute the LDA for the Multinomial entry
res<- rlda.multinomial(data=mat1, n_community=30, beta=beta, gamma=gamma,
                       n_gibbs=1000, ll_prior=TRUE, display_progress=FALSE)
We can visually evaluate convergence by examining the trace of the log-likelihood plus log-prior across iterations:
#Get the log-likelihood (plus log-prior) trace
ll<- res$logLikelihood
#Plot the trace against the iteration number
plot(ll, type="l", xlab="Iterations",
     ylab="Log(likel.)+log(prior)")
#Vertical reference line at iteration 700
abline(v=700,col='grey')
Samples of our parameter estimates are given in the Theta and Phi matrices, where each row of these matrices contains the result of one Gibbs iteration. Thus, parameter estimates can be obtained by averaging the results in each column of these matrices after discarding the burn-in iterations. This can be quickly done using the function summary, as illustrated below:
#Get the Theta Estimate
Theta<-summary(res)$Theta
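The averaging described above can be sketched by hand on a simulated matrix of draws. The matrix `draws` below is a stand-in with one row per Gibbs iteration and one column per parameter; it is not the internal layout of the fitted object, and the burn-in fraction is an assumption.

```r
# Posterior means from a matrix of Gibbs draws (illustrative sketch)
set.seed(5)
n_gibbs <- 1000
true_values <- c(0.2, 0.5, 0.3)

# Stand-in for posterior samples: rows = iterations, columns = parameters
draws <- matrix(rnorm(n_gibbs * 3,
                      mean = rep(true_values, each = n_gibbs),
                      sd = 0.01),
                nrow = n_gibbs, ncol = 3)

# Discard the first 10% of iterations as burn-in (assumed fraction)
burnin <- 0.1
keep <- seq(floor(burnin * n_gibbs) + 1, n_gibbs)

# Average each column over the retained iterations
post_mean <- colMeans(draws[keep, , drop = FALSE])
round(post_mean, 2)
```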
The Theta matrix has a sparse structure since our truncated stick-breaking prior tends to reduce the total number of dominant clusters. In this matrix, each cell contains the estimated probability of the \(l\)-th firm being allocated to cluster \(c\). A useful way to explore the results from this matrix is using interactive 3D graphics:
library('rgl')
library('car')
#Interactive 3D scatterplot of the first three clusters
scatter3d(x=Theta[,'Cluster1'], y=Theta[,'Cluster2'], z=Theta[,'Cluster3'],
          surface=FALSE, xlab='Cluster 1', ylab='Cluster 2', zlab='Cluster 3',
          labels=rownames(Theta), id.n=20)
###Ecology.

In many ecological studies, it is not possible to determine the total number of individuals per species in each sampling unit. As a result, these data are often presented as or summarized into binary presence/absence matrices (1 and 0, respectively) [@pearce2006modelling].
In this example we used data("presence"), which includes presence/absence information on 13 species at 386 forested locations [@moisen2006predicting]. We analyze these data using the rlda.bernoulli S3 method.
#Load data
data(presence)
#Set seed for reproducibility
set.seed(9842)
#Hyperparameters for each prior distribution
#(gamma: stick-breaking prior; alpha0, alpha1: Beta prior on Phi)
gamma <-0.01
alpha0<-0.01
alpha1<-0.01
#Execute the LDA for the Bernoulli entry
res<-rlda.bernoulli(presence, 10, alpha0, alpha1, gamma,
                    5000, TRUE, FALSE)
We can visually evaluate the cluster distribution across species, after the burn-in phase:
#Posterior summary of Phi after discarding the burn-in
Phi<-summary(res, burnin=0.1, silent=TRUE)$Phi
#Color palette
library(RColorBrewer)
myColor<-brewer.pal(n = 10, name = "RdBu")
#Star plot of Phi
stars(Phi,col.segments=myColor,scale=TRUE,
      draw.segments=TRUE,ncol=4,flip.labels=FALSE,cex=0.6)
The goal of this vignette was to describe the Bayesian LDA model for mixed-membership clustering based on different types of discrete data. We have demonstrated how to use the model for Multinomial data in the marketing example and for Bernoulli (presence/absence) data in the ecological example.
One of the main properties of the model presented here is that it adopts a truncated stick-breaking prior, which enables the selection of the optimal number of clusters by regularizing model results. The next step in the development of this model is to incorporate explanatory variables into our algorithms, which would allow inference on the drivers of the probability of each cluster.
##Appendix.

In this Appendix we provide the derivation for the Full Conditional Distributions associated with each model.

###Bernoulli model.

For \(W_{ls}\):
\[p(W_{ls}=c^{*}|…)=k\times Cat(W_{ls}=c^{*}| \boldsymbol{\theta_{l}}) \times Bernoulli(y_{ls}|\phi_{c^{*}s})\] \[=k\times\theta_{lc^{*}}\times\phi_{c^{*}s}^{y_{ls}}(1-\phi_{c^{*}s})^{1-y_{ls}}\]
Since \(W_{ls}\) is a categorical random variable with support in \(\mathcal{Z}=(1,2,\dots,C)\), the sum of the probabilities over all elements of the support must equal one. Therefore, the normalizing constant \(k\) is given by: \[k=\left[\sum_{c=1}^C \theta_{lc}\times\phi_{cs}^{y_{ls}}(1-\phi_{cs})^{1-y_{ls}}\right]^{-1}\] As a result, \(W_{ls}\) can be sampled from a categorical distribution.
For \(V_{lc}\):
\[p(V_{lc}|…)\propto Binomial(z_{lc}|z_{lc}+z_{l(c^{*}>c)},V_{lc})\times Beta(V_{lc}|1,\gamma)\] \[\propto V_{lc}^{z_{lc}}(1-V_{lc})^{z_{l(c^{*}>c)}}\times(1-V_{lc})^{\gamma-1}\] \[\propto V_{lc}^{(z_{lc}+1)-1}(1-V_{lc})^{(z_{l(c^{*}>c)}+\gamma)-1}\] \[p(V_{lc}|…)=Beta(z_{lc}+1,z_{l(c^{*}>c)}+\gamma)\]
For \(\phi_{cs}\):
\[p(\phi_{cs}|…)\propto [\prod_{l=1}^L Bernoulli(y_{ls}|\phi_{cs})^{I(w_{ls}=c)}]\times Beta(\phi_{cs}|\alpha_0,\alpha_1)\] \[\propto [\prod_{l=1}^L \phi_{cs}^{I(w_{ls}=c,y_{ls}=1)}(1-\phi_{cs})^{I(w_{ls}=c,y_{ls}=0)}] \times \phi_{cs}^{\alpha_0-1}(1-\phi_{cs})^{\alpha_1-1}\] \[\propto \phi_{cs}^{\sum_{l=1}^LI(w_{ls}=c,y_{ls}=1)+\alpha_0-1}(1-\phi_{cs})^{\sum_{l=1}^LI(w_{ls}=c,y_{ls}=0)+\alpha_1-1}\] \[p(\phi_{cs}|…)=Beta(q_{cs}^{(1)}+\alpha_0,q_{cs}^{(0)}+\alpha_1)\]
\noindent where \(q_{cs}^{(j)}=\sum_{l=1}^LI(w_{ls}=c,y_{ls}=j)\).
###Binomial model.

For \(W_{ils}\): \[p(W_{ils}=c^{*}|…)=k\times Cat(W_{ils}=c^{*}|\boldsymbol{\theta_{l}})\times Bernoulli(x_{ils}|\phi_{c^{*}s})\] \[=k\times\theta_{lc^{*}}\times\phi_{c^{*}s}^{x_{ils}}(1-\phi_{c^{*}s})^{1-x_{ils}}\]
Since \(W_{ils}\) is a categorical random variable with support in \(\mathcal{Z}=(1,2,\dots,C)\), the sum of the probabilities over all elements of the support must equal one. Therefore, the normalizing constant \(k\) is given by: \[k=\left[\sum_{c=1}^C \theta_{lc}\times\phi_{cs}^{x_{ils}}(1-\phi_{cs})^{1-x_{ils}}\right]^{-1}\] As a result, \(W_{ils}\) can be sampled from a categorical distribution.
For \(V_{lc}\):
\[p(V_{lc}|…)\propto Binomial(z_{lc}|z_{lc}+z_{l(c^{*}>c)},V_{lc})\times Beta(V_{lc}|1,\gamma)\] \[\propto V_{lc}^{z_{lc}}(1-V_{lc})^{z_{l(c^{*}>c)}}\times(1-V_{lc})^{\gamma-1}\] \[\propto V_{lc}^{(z_{lc}+1)-1}(1-V_{lc})^{(z_{l(c^{*}>c)}+\gamma)-1}\] \[p(V_{lc}|…)=Beta(z_{lc}+1,z_{l(c^{*}>c)}+\gamma)\]
For \(\phi_{cs}\):
\[p(\phi_{cs}|…)\propto [\prod_{l=1}^L \prod_{i=1}^{n_{ls}} Bernoulli(x_{ils}|\phi_{cs})^{I(w_{ils}=c)}]\times Beta(\phi_{cs}|\alpha_0,\alpha_1)\] \[\propto [\prod_{l=1}^L \prod_{i=1}^{n_{ls}} \phi_{cs}^{I(w_{ils}=c,x_{ils}=1)}(1-\phi_{cs})^{I(w_{ils}=c,x_{ils}=0)}] \times \phi_{cs}^{\alpha_0-1}(1-\phi_{cs})^{\alpha_1-1}\] \[\propto \phi_{cs}^{\sum_{l=1}^L \sum_{i=1}^{n_{ls}} I(w_{ils}=c,x_{ils}=1)+\alpha_0-1}(1-\phi_{cs})^{\sum_{l=1}^L \sum_{i=1}^{n_{ls}} I(w_{ils}=c,x_{ils}=0)+\alpha_1-1}\] \[p(\phi_{cs}|…)=Beta(q_{cs}^{(1)}+\alpha_0,q_{cs}^{(0)}+\alpha_1)\] \noindent where \(q_{cs}^{(j)}=\sum_{l=1}^L \sum_{i=1}^{n_{ls}} I(w_{ils}=c,x_{ils}=j)\).
###Multinomial model.

For \(W_{il}\):
\[p(W_{il}=c^*|…)=k\times Cat(W_{il}=c^*|\boldsymbol{\theta_l})\times Cat(x_{il}=s|\boldsymbol{\phi_{c^*}})\] \[=k\times \theta_{lc^*}\phi_{c^*s}\] Since \(W_{il}\) is a categorical random variable with support in \(\mathcal{Z}=(1,2,\dots,C)\), the sum of the probabilities over all elements of the support must equal one. Therefore, the normalizing constant \(k\) is given by:
\[k=\left[\sum_{c=1}^C \theta_{lc}\times\phi_{cs}\right]^{-1}\]
As a result, \(W_{il}\) can be sampled from a categorical distribution.
For \(V_{lc}\):
\[p(V_{lc}|…)\propto Binomial(z_{lc}|z_{lc}+z_{l(c^{*}>c)},V_{lc})\times Beta(V_{lc}|1,\gamma)\] \[\propto V_{lc}^{z_{lc}}(1-V_{lc})^{z_{l(c^{*}>c)}}\times(1-V_{lc})^{\gamma-1}\] \[\propto V_{lc}^{(z_{lc}+1)-1}(1-V_{lc})^{(z_{l(c^{*}>c)}+\gamma)-1}\] \[p(V_{lc}|…)=Beta(z_{lc}+1,z_{l(c^{*}>c)}+\gamma)\]

For \(\boldsymbol{\phi_c}\): \[p(\boldsymbol{\phi_c}|…)\propto [\prod_{l=1}^L \prod_{i=1}^{n_l}Cat(x_{il}|\boldsymbol{\phi_c})^{I(w_{il}=c)}]\times Dirichlet(\boldsymbol{\phi_c}|\boldsymbol{\beta})\] \[\propto [\prod_{l=1}^L \prod_{i=1}^{n_l} \phi_{c1}^{I(x_{il}=1,w_{il}=c)}\times…\times\phi_{cS}^{I(x_{il}=S,w_{il}=c)}]\times\phi_{c1}^{\beta_1-1}\times…\times\phi_{cS}^{\beta_S-1}\] \[\propto \phi_{c1}^{\sum_{l=1}^L \sum_{i=1}^{n_l} I(x_{il}=1,w_{il}=c) +\beta_1-1}\times…\times\phi_{cS}^{\sum_{l=1}^L \sum_{i=1}^{n_l} I(x_{il}=S,w_{il}=c) + \beta_S-1}\] \[p(\boldsymbol{\phi_c}|…)= Dirichlet([q_{c1}+\beta_1,…,q_{cS}+\beta_S])\] \noindent where \(q_{cs}=\sum_{l=1}^L \sum_{i=1}^{n_l} I(x_{il}=s,w_{il}=c)\).