% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/compare.r
\name{newsflow.compare}
\alias{newsflow.compare}
\title{Compare the documents in a dtm with a sliding window over time}
\usage{
newsflow.compare(dtm, dtm.y = NULL, meta = NULL, meta.y = NULL,
  date.var = "date", hour.window = c(-24, 24), group.var = NULL,
  measure = c("cosine", "overlap_pct", "overlap", "crossprod",
  "softcosine", "query_lookup", "query_lookup_pct"), min.similarity = 0,
  n.topsim = NULL, only.from = NULL, only.to = NULL,
  only.complete.window = TRUE, zscore = F, return_as = c("igraph",
  "edgelist", "matrix"), batchsize = 1000, simmat = NULL,
  simmat_thres = NULL, verbose = FALSE)
}
\arguments{
\item{dtm}{A quanteda \link[quanteda]{dfm}. Alternatively, a DocumentTermMatrix from the tm package can be used, but then the meta parameter needs to be specified manually}

\item{dtm.y}{optionally, another dtm. If given, the documents in dtm will be compared to the documents in dtm.y. This cannot be combined with only.from and only.to}

\item{meta}{If dtm is a quanteda dfm, docvars(dtm) is used by default (i.e. when meta is NULL) to obtain the meta data. Otherwise, the meta data.frame has to be given by the user, with the rows of the meta data.frame matching the rows of the dtm (i.e. each row is a document).}

\item{meta.y}{like meta, but for dtm.y (only necessary if dtm.y is used)}

\item{date.var}{The name of the column in meta that specifies the document date. The default is "date". The values should be of type POSIXct.}

\item{hour.window}{A vector of length 2, in which the first and second value determine the left and right side of the window, respectively. For example, c(-10, 36) will compare each document to all documents between the previous 10 and the next 36 hours.}

\item{group.var}{Optionally, the name of the column in meta that specifies a group (e.g., source, sourcetype). If given, only documents within the same group will be compared.}

\item{measure}{The measure used to calculate similarity/distance/adjacency. Currently supports the symmetrical measure "cosine" (cosine similarity), the asymmetrical measures "overlap_pct" (percentage of term scores in the document that also occur in the other document) and "overlap" (like overlap_pct, but as the sum of overlap instead of the percentage), and the symmetrical "softcosine" measure (experimental).
The regular "crossprod" (inner product) is also supported.
If the dtms are prepared with the create_queries function, the special measures "query_lookup" and "query_lookup_pct" can be used.}

\item{min.similarity}{A threshold for similarity. Scores below this value are deleted. Set to 0 by default.}

\item{n.topsim}{An alternative or additional sort of threshold for similarity. Only keep the [n.topsim] highest similarity scores for x. Can return more than [n.topsim] similarity scores in the case of duplicate similarities.}

\item{only.from}{A vector with names/ids of documents (dtm rownames), or a logical vector that matches the rows of the dtm. Use to compare only these documents to other documents.}

\item{only.to}{A vector with names/ids of documents (dtm rownames), or a logical vector that matches the rows of the dtm. Use to compare other documents to only these documents.}

\item{only.complete.window}{If TRUE, only compare articles (x) for which a full window of reference articles (y) is available. Thus, for articles at the start and end of the period covered by the data, where the window given by hour.window is incomplete, there will be no results for x.}

\item{zscore}{If TRUE, transform the similarity scores for each document in dtm to z-scores. The min.similarity filter will then apply to this value.}

\item{return_as}{Determine whether output is returned as an "edgelist", an "igraph" network or a sparse "matrix".}

\item{batchsize}{The size of the batches in which documents are processed. Only used if group.var and/or date.var are used.}

\item{simmat}{If "softcosine" is used, a symmetrical matrix with the similarity scores of terms. If NULL, the cosine similarity of terms in dtm will be used.}

\item{simmat_thres}{If "softcosine" is used, a threshold for the similarity scores of terms.}

\item{verbose}{If TRUE, report progress}
}
\value{
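By default, a network/graph in the \link[igraph]{igraph} class. If return_as is "edgelist" or "matrix", an edgelist or a sparse matrix is returned instead.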
}
\description{
Given a document-term matrix (DTM) with dates for each document, calculates the document similarities over time using a sliding window.
}
\details{
The calculation of document similarity is performed using a vector space model approach. 
Inner-product based similarity measures are used, such as cosine similarity.
It is recommended to weight the DTM beforehand, for instance using term frequency-inverse document frequency (tf-idf).
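
For reference, the cosine similarity between two document vectors \eqn{a} and \eqn{b} (rows of the DTM) follows the standard textbook definition below. The symbols \eqn{a}, \eqn{b} and the term index \eqn{i} are introduced here only for illustration and are not names used by the function:
\deqn{\cos(a, b) = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2} \sqrt{\sum_i b_i^2}}}{cos(a, b) = sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))}
The numerator is the plain inner product of the two vectors, which is what the "crossprod" measure is described as computing without the normalization in the denominator.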
}
\examples{
rnewsflow_dfm 

dtm = quanteda::dfm_tfidf(rnewsflow_dfm)
g = newsflow.compare(dtm, hour.window = c(0.1, 36))

igraph::vcount(g) # number of documents, or vertices
igraph::ecount(g) # number of document pairs, or edges

head(igraph::get.data.frame(g, 'vertices'))
head(igraph::get.data.frame(g, 'edges'))
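
## A variation on the call above, given as a sketch: request an edgelist
## instead of an igraph object and drop weak matches via min.similarity.
## The threshold of 0.4 is an arbitrary value chosen for illustration only,
## and the name `el` is likewise just an example.
el = newsflow.compare(dtm, hour.window = c(0.1, 36),
                      min.similarity = 0.4, return_as = "edgelist")
head(el)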
}
