--- title: "Mathematical Definitions of Sufficient Statistics" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Mathematical Definitions of Sufficient Statistics} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## Overview The **redeem** package models the intensities of interaction formation and dissolution using a log-linear formulation. The intensity (rate) for a dyad $(i,j)$ at time $t$ is: $$\lambda_{i,j}(t) = \exp(s_{i,j}(\mathscr{H}_t)^\top \beta + \alpha_i + \alpha_j + f(t, \gamma))$$ where: - $s_{i,j}(\mathscr{H}_t)$ is a vector of **sufficient statistics** derived from the history of interactions $\mathscr{H}_t$ up to time $t$. - $\alpha_i, \alpha_j$ are actor popularity parameters. - $f(t, \gamma) = \sum_{q=1}^Q \gamma_q \mathbb{I}(c_{q-1} \le t < c_q)$ is the baseline step-function representing temporal variation over change points $0 = c_0 < c_1 < \dots < c_Q$ (with $\gamma_1 = 0$). This vignette provides precise mathematical definitions for all sufficient network statistics implemented in the package, as well as a guide for adding custom statistics. The complete list of available formula terms is documented in the `redeem_terms` reference manual page. For model estimation, see the help pages for the models via `dem` and `rem`. --- ## Statistic Transformations Each sufficient statistic $s_{i,j}(\mathscr{H}_t)$ is defined as a transformed count $f(N(t))$, where $N(t)$ is a raw network statistic. The package supports five standard transformations $f(\cdot)$: | Transformation | Mathematical Definition | | :--- | :--- | | **`identity`** | $f(s) = s$ | | **`log`** | $f(s) = \log(1 + s)$ | | **`recip`** | $f(s) = 1/(1+s)$ | | **`bin`** | $f(s) = \mathbb{I}(s > 0)$ | | **`sig`** | $f(s) = \frac{s}{s + K}$ | --- ## Endogenous Network Statistics These statistics capture structural dependencies within the network evolution. ### 1. Intercept (`Intercept` / `intercept`) A constant term representing the baseline log-intensity. $$s_{i,j}(\mathscr{H}_t) = 1$$ ### 2. Inertia (`inertia` / `number_interaction`) Counts how many times the dyad $(i,j)$ has initiated an interaction in the past. $$s_{i,j}(\mathscr{H}_t) = f(N_{i,j}(t))$$ where $N_{i,j}(t) = \sum_{k: t_k < t} \mathbb{I}(i_k = i, j_k = j)$ (or the windowed version $N_{i,j}^w(t) = \sum_{k: t-w < t_k < t} \mathbb{I}(i_k = i, j_k = j)$). ### 3. Reciprocity (`reciprocity`) Models the tendency to reciprocate past interactions (directed only). $$s_{i,j}(\mathscr{H}_t) = f(N_{j,i}(t))$$ ### 4. Duration (`duration` / `current_interaction`) Measures the dependency on the time since the current interaction started (DEM only). $$s_{i,j}(\mathscr{H}_t) = \begin{cases} f(t - t_{\text{start, } i,j}) & \text{if dyad } (i,j) \text{ is interacting at } t \\ 0 & \text{otherwise} \end{cases}$$ where $t_{\text{start, } i,j}$ is the timestamp of the last formation event. ### 5. Participation Shifts (P-shifts) P-shifts capture the sequential dependencies between consecutive events (REM only). Let the preceding event in the stream be $A \to B$. For a candidate event $i \to j$ at time $t$: - **`psABBA`** (Reciprocation): $s_{i,j}(t) = \mathbb{I}(i = B, j = A)$ - **`psABBY`** (Receiver turn-continuing): $s_{i,j}(t) = \mathbb{I}(i = B, j \ne A, j \ne B)$ - **`psABAY`** (Sender turn-continuing): $s_{i,j}(t) = \mathbb{I}(i = A, j \ne A, j \ne B)$ - **`psABXA`** (Usurpation to sender): $s_{i,j}(t) = \mathbb{I}(i \ne A, i \ne B, j = A)$ - **`psABXB`** (Usurpation to receiver): $s_{i,j}(t) = \mathbb{I}(i \ne A, i \ne B, j = B)$ - **`psABXY`** (Completely new dyad): $s_{i,j}(t) = \mathbb{I}(i \ne A, i \ne B, j \ne A, j \ne B)$ --- ## Triadic Closure and Shared Partners Triadic statistics capture structural closure. They can be calculated over active edges (designated as **`current_`** in DEM) or historical event existence (designated as **`general_`** in REM/DEM). Let $\mathcal{A}_t$ represent the set of interacting dyads (for `current_`) or previously interacted dyads (for `general_`) at time $t$. ### Common Partners (`general_common_partners` / `current_common_partners`) Counts the number of third-party actors $k$ sharing a connection of a specified type with both $i$ and $j$: - **`OSP`** (Outgoing Shared Partner): $s_{i,j}(t) = f(|\{k : (i,k) \in \mathcal{A}_t \land (j,k) \in \mathcal{A}_t\}|)$ - **`ISP`** (Incoming Shared Partner): $s_{i,j}(t) = f(|\{k : (k,i) \in \mathcal{A}_t \land (k,j) \in \mathcal{A}_t\}|)$ - **`OTP`** (Outgoing Two-Path): $s_{i,j}(t) = f(|\{k : (i,k) \in \mathcal{A}_t \land (k,j) \in \mathcal{A}_t\}|)$ - **`ITP`** (Incoming Two-Path): $s_{i,j}(t) = f(|\{k : (k,i) \in \mathcal{A}_t \land (j,k) \in \mathcal{A}_t\}|)$ ### Triangles (`general_triangle` / `current_triangle`) Similar to common partners, but only non-zero if the focal dyad itself is active (directed only): - **`OSP`** Triangle: $s_{i,j}(t) = \mathbb{I}((i,j) \in \mathcal{A}_t) \times f(|\{k : (i,k) \in \mathcal{A}_t \land (j,k) \in \mathcal{A}_t\}|)$ - **`ISP`** Triangle: $s_{i,j}(t) = \mathbb{I}((i,j) \in \mathcal{A}_t) \times f(|\{k : (k,i) \in \mathcal{A}_t \land (k,j) \in \mathcal{A}_t\}|)$ - **`OTP`** Triangle: $s_{i,j}(t) = \mathbb{I}((i,j) \in \mathcal{A}_t) \times f(|\{k : (i,k) \in \mathcal{A}_t \land (k,j) \in \mathcal{A}_t\}|)$ - **`ITP`** Triangle: $s_{i,j}(t) = \mathbb{I}((i,j) \in \mathcal{A}_t) \times f(|\{k : (k,i) \in \mathcal{A}_t \land (j,k) \in \mathcal{A}_t\}|)$ --- ## Degree and Centrality Statistics These statistics capture actor-level activity or popularity. Let $\mathcal{D}_t$ be the network state at time $t$, and $d_{i, \text{out}}(t)$, $d_{i, \text{in}}(t)$ represent the out-degree and in-degree of $i$ in $\mathcal{D}_t$. Let $c_{i, \text{out}}(t)$, $c_{i, \text{in}}(t)$ be the total out-events and in-events involving $i$. ### Degree Statistics (`general_degree_out_sender`, etc.) - **Out-Degree Sender**: $s_{i,j}(t) = f(d_{i, \text{out}}(t))$ - **Out-Degree Receiver**: $s_{i,j}(t) = f(d_{j, \text{out}}(t))$ - **In-Degree Sender**: $s_{i,j}(t) = f(d_{i, \text{in}}(t))$ - **In-Degree Receiver**: $s_{i,j}(t) = f(d_{j, \text{in}}(t))$ - **Degree Sum** (undirected only): $s_{i,j}(t) = f(d_{i}(t) + d_{j}(t))$ - **Degree Absolute Difference** (undirected only): $s_{i,j}(t) = f(\|d_{i}(t) - d_{j}(t)\|)$ ### Count Statistics (`general_count_out_sender`, etc.) Count statistics are identical to degree statistics but use total interaction counts ($c$) rather than binary degrees ($d$): - **Out-Count Sender**: $s_{i,j}(t) = f(c_{i, \text{out}}(t))$ - **Out-Count Receiver**: $s_{i,j}(t) = f(c_{j, \text{out}}(t))$ - **In-Count Sender**: $s_{i,j}(t) = f(c_{i, \text{in}}(t))$ - **In-Count Receiver**: $s_{i,j}(t) = f(c_{j, \text{in}}(t))$ - **Count Sum** (undirected only): $s_{i,j}(t) = f(c_{i}(t) + c_{j}(t))$ - **Count Absolute Difference** (undirected only): $s_{i,j}(t) = f(\|c_{i}(t) - c_{j}(t)\|)$ All of these degree and count statistics can be conveniently specified in model formulas using the `degree()` (or `degrees()`) and `count()` helper functions: - **Out-Degree / Out-Count Sender**: `degree(type = "out_sender")` or `count(type = "out_sender")` - **Out-Degree / Out-Count Receiver**: `degree(type = "out_receiver")` or `count(type = "out_receiver")` - **In-Degree / In-Count Sender**: `degree(type = "in_sender")` or `count(type = "in_sender")` - **In-Degree / In-Count Receiver**: `degree(type = "in_receiver")` or `count(type = "in_receiver")` - **Degree / Count Sum**: `degree(type = "sum")` or `count(type = "sum")` - **Degree / Count Absolute Difference**: `degree(type = "absdiff")` or `count(type = "absdiff")` --- ## Exogenous Statistics ### 1. Dyadic Covariate (`dyadic_cov`) $$s_{i,j}(\mathscr{H}_t) = f(X_{i,j}(t))$$ where $X(t)$ is an $N \times N$ matrix. ### 2. Monadic Covariate (`monadic_cov`) $$s_{i,j}(\mathscr{H}_t) = g(x_i(t), x_j(t))$$ where $x$ is a monadic vector and $g(\cdot)$ is a user-defined function. --- ## Developer Guide: Adding Custom Sufficient Statistics To add a new sufficient statistic (e.g., `my_stat`) to the package, follow this two-step process: ### Step 1: Implement the Update Logic in C++ Define a C++ function in the `src/` directory (e.g., in a new file or in `src/sufficient_statistics.cpp`) with the signature `ValidateFunction`. This function computes the change to the statistic state matrix when an event occurs: ```cpp #include "redeem/sufficient_statistics.h" #include "redeem/extension_api.hpp" arma::uvec stat_my_stat( Data_DEM &object, arma::mat &data, unsigned int &from, unsigned int &to, unsigned int &number_event, unsigned int col_number, std::string transformation, unsigned int type ) { if (from == 0 || to == 0) return arma::uvec(); // 1. Calculate which dyad indices are affected and the change value (val) double val = (type == 1) ? 1.0 : -1.0; arma::uvec affected_indices = { object.current_stats.find_from_to(from, to) }; // 2. Apply the update using the helper (handles transformations internally) apply_update(object, affected_indices, col_number, val, transformation, data); // 3. Return the indices of modified dyads return affected_indices; } // 4. Register the term in the global C++ Registry TERM_REGISTER("my_stat", stat_my_stat); ``` ### Step 2: Write the R Initializer Function In `R/init_terms.R`, create an initialization function named `InitRedeemTerm.my_stat`. This function validates user-provided arguments and sets up initial values: ```r #' @keywords internal InitRedeemTerm.my_stat <- function(arglist, n_nodes, model_type, directed, ...) { # Validate arguments against expected types and models arglist <- check.RedeemTerm(arglist, model_type = model_type, directed = directed, expected = list(transformation = c("identity", "log", "recip", "bin", "sig"), K = "numeric"), defaults = list(transformation = "identity", K = 1) ) # Define initial statistic values at time 0 (optional) eval_at_zero <- matrix(0, n_nodes, n_nodes) # Return the configuration list list( base_name = "my_stat", transformation = arglist$transformation, eval_at_zero = eval_at_zero, event_stream = arglist$event_stream ) } ``` Once these steps are completed, run `devtools::document()` to update namespaces and documentation, and build the package. The new term is now ready for use in `dem()` and `rem()` by specifying it directly in the formula (e.g., `~ my_stat(transformation = "log")`).