Content
Intro to DSEM
This is a short introduction to DSEM, with pictures!
DSEM is used to analyze time series data. Time series data has multiple observations for a given subject on one or multiple variables. For example, if we measure John’s urge to smoke every day for 50 days we will get the kind of data we need. DSEM is also useful for when we have multiple observations for multiple subjects. Thus, it is instructive to think of DSEM as being composed of two models: (1) a within-subject model, describing a subject’s mean and variation around that mean that happens over time and (2) a between-subject model, capturing how these subject-specific characteristics differ between subjects. The number of parameters in a DSEM can vary depending on the substantive question you would like to answer. In this workshop we will go over the most popular DSEM which consists of three parameters. In the figure below you can see a visual representation of the three-parameter DSEM. We will break down each of these parameters below using our running example of smoking urges.
1. Mean urge to smoke (\(\mu\))
The mean describes the average of a subject’s urge to smoke across all 50 days. This can be thought of as a stable point around which any variation happens. Subjects can differ in their mean value: as shown in the left panel of the figure below, the subject in blue has a higher mean urge to smoke. In the right panel, the model is described mathematically. We model a subject’s mean urge to smoke (\(\mu_{i}\)) as the sum of a population-level mean (\(\gamma_{1}\)) and a subject-specific deviation from that population-level mean (\(\upsilon_{1i}\)). In such a hierarchical framework, population-level effects are called fixed effects and the sds (or variances) of subject-specific effects are called random effects. Note that you estimate sds of subject-specific effects, because estimating all separate effects is computationally expensive (read: impossible).
2. Autoregression (AR1; \(\phi\))
How much the urge to smoke on a given day differs from the mean urge across all 50 days is called a deviation. How much these deviations linger over time, that is, for how long subsequent urges will stay above or below the subject’s mean urge, is captured by the autoregressive parameter. More frmally, it captures how deviations for the mean on time t-1 (i.e., \(Urge_{(t-1)i} - \mu_i\)) relate to deviations on time t (i.e., \(Urge_{ti} - \mu_i\) if you restructure the formula in the figure below (and ignore the noise \(\epsilon_{ti}\))). For example, when a subject with a high positive autoregressive parameter (blue in left panel of the figure below) reports an urge to smoke that is higher than their mean urge, you can expect them to stay at that elevated level for longer than someone with a lower autoregressive parameter (red in left panel of figure). In the right panel, the statistical model again shows how a subject’s autoregressive parameter (\(\phi_{i}\)) is composed of a population-level autoregressive parameter (\(\gamma_{2}\)) and a subject-specific deviation (\(\upsilon_{2i}\)).
3. Residual Variance (\(\psi\))
After taking into account the variation in urges that is explained by autoregression, our model still doesn’t fully capture all fluctuations. That is, we still have residual variation left. For example, two people who have the same level of urges on average and the same autoregression, may still differ in their day-to-day residual variation in their urges. In the left panel of the figure below, the two people shown are equal on all other parameters except their residual variance (blue has greater residual variance). As can be seen in the right panel of the figure below, the general structure of the residual variances is the same as for the mean and autoregression: a subject’s residual variance (\(\psi_{i}\)) is composed of a population-level parameter (\(\gamma_{3}\)) and a subject-specific deviation (\(\upsilon_{3i}\)). You can ignore the exponential term for now, we explain it in the Fitting a simple DSEM module.
To get started, go back to the top.
Get started
Before we start coding, let’s take care of some prerequisites. If you do not have R (R Core Team, 2024) and its user-friendly interface RStudio installed, please do so following the steps on this website.
Once installed, you also need to install the RStan package to enable R to use the computing platform.
Exercise 0: Install RStan using the R console.
if(!require(rstan)){install.packages("rstan")}
require(rstan)
The next step is accessing the example data on smoking urges through GitHub and getting a feel for them.
Exercise 1: Load the data. What variables are in the data set? How many subjects? And how many time points?
dat <- read.csv("https://raw.githubusercontent.com/JessicaSchaaf/dsemworkshop/main/Data/Two-Level%20Data.csv",header=T) #load McNeish data
There are 6 variables in the data set, for 100 subjects across 50 time points. These variables are subject’s smoking urges (‘urge’), their depressive symptoms (‘dep’), job stress (‘js’), home stress (‘hs’), subject ids (‘subject’), and time (‘time’).
We are interested in the following research questions
Q1. What are subject’s average smoking urges?
Q2. How do urges linger over time?
Q3. How do these effects differ across subjects? and
Q4. How do urges relate to concurrent depressive symptoms?
Let’s prepare the data to enable you to fit a DSEM to answer the research questions.
Exercise 2: First create an object Nsubj
that
contains the number of subjects and an object Nobs
that
contains the number of observations/time points.
Nsubj <- length(unique(dat$subject))
Nobs <- max(dat$time)
Exercise 3: Create two matrices of dimensions
Nsubj x Nobs
that contain the urges data (call this object
urges
) and the depression data (call this object
dep
).
urges <- matrix(dat$urge, Nsubj, Nobs, byrow=T)
dep <- matrix(dat$dep, Nsubj, Nobs, byrow=T)
Exercise 4: Plot the time series of smoking urges with time on the x-axis and urges, averaged across subjects, on the y-axis.
df = data.frame(time = 1:Nobs,
urges = tapply(dat$urge,dat$time,mean),
urgesSE = tapply(dat$urge,dat$time,sd)/sqrt(Nsubj))
if(!require(ggplot2)){install.packages("ggplot2")}
require(ggplot2)
ggplot(df, aes(x=time, y=urges)) +
geom_line(size=2) +
geom_ribbon(aes(ymin=df$urges - df$urgesSE, ymax=df$urges + df$urgesSE), alpha=0.3) +
labs(x="Time", y="Smoking Urges") +
theme_minimal() + ylim(-0.6,0.6)
To continue to the next module and learn about Stan models go back to the top.
Implementing a DSEM in Stan
Before getting into how to fit models in Stan, first some background on how to construct Stan models. You may or may not wish to alter or fully implement models yourself. Either way, understanding the basics may help understand what is going on behind the scenes. A Stan model is built up using several blocks. A simple model often requires a data block, a model block, and a parameters block.
The data block
In the data block, you specify which data to run the model
on. You provide these data from a different program, for example, R. A
standard DSEM requires three types of data to be transferred from R to
Stan: the number of subjects, the number of time points or observations,
and the time series data. In Stan, for each variable, you also have to
indicate the variable
type and its dimensions. The number of subjects (which we call
N
) is just a single integer, as is the number of
observations (which we call T
). The time series data is a
two-dimensional object of dimensions N x T
. In Stan, this
is implemented as follows:
data {
int<lower=1> N; // number of subjects
int<lower=1> T; // number of observations
array[N] vector[T] Y; // time series data
}
Important to note here is that Stan doesn’t require a certain set of variable names. In other words, you can call variables as you wish. Just be sure to be consistent; otherwise Stan will give you a semantic error (“Identifier ‘…’ not in scope.”).
The parameters block
In the parameters block, you specify which parameters to
sample in the model. In a standard DSEM the parameters consist of the
population-level mean, autoregression, and residual variance (which we
call gammas as is common notation in hierarchical modeling), and the
subject-specific deviations (u's
). They also consist of
parameters needed to construct the covariance matrix of subject-specific
deviations (which we call Omega
). Although this is
necessary to fit the model, we do not go into the details here to avoid
overly mathematical discussion. The population-level parameters
(gamma
) are implemented as a vector of size 3 and the
subject-specific deviations (u
) as a two-dimensional object
of size N x 3
:
parameters {
vector[3] gamma; // population-level effects
array[N] vector[3] u; // subject-specific deviations
}
The model block
In the model block, you specify the model to fit on the data and, as Stan uses Bayesian estimation, the prior distributions for all parameters.
The model
In our case the model is a standard DSEM, which mathematically can be described as follows:
\[\begin{equation} Y(t) = \mu + \delta(t)\label{eq1}\tag{1} \end{equation}\]
where Y
indicates the subject’s smoking urges across
trials t
, \(\mu\)
(mu
) is the mean urge, and \(\delta\) (delta
) is the
deviation of the observed urges from the mean urge on a specific
trial.
The lingering of deviations from the mean urge over time is captured
in an additional parameter \(\phi\)
(phi
), called the autoregressive effect. Mathematically,
this lingering is formulated as
\[\begin{equation} \delta(t) = \phi\delta(t-1) + \epsilon(t)\label{eq2}\tag{2} \end{equation}\]
where \(\epsilon\)
(epsilon
) is the residual after taking lingering of
deviations into account.
In Stan we implement Equation \(\ref{eq2}\) by specifying that the
deviation for subject i
on trial t
(delta[i][t]
) comes from a normal distribution with a mean
equal to the autoregressive effect for that subject
(phi[i] * delta[i][t-1]
). In Stan, we also need to specify
the standard deviation of this normal distribution. Do you remember from
the Intro to DSEM module that one of the parameters of a DSEM
is the residual variance? This is where this residual variance comes
into play. You have just learned from Equation \(\ref{eq2}\) that the residuals are called
\(\epsilon\). From this follows that
the residual variance is var(\(\epsilon_{it}\)), which we call \(\psi\) (psi
). However, as Stan
requires standard deviations instead of variances, we need to take the
squared root of the residual variance (sqrt(psi[i])
).
model {
for (i in 1:N) { // loop across subjects
for (t in 2:T) { // loop across observations
// likelihood of observing data given model
parameters
target += normal_lpdf(delta[i][t] | phi[i] *
delta[i][t-1], sqrt(psi[i]));
}
}
}
Likely, you have never encountered the target +=
notation in the code snippet above. It is just a way of telling Stan to
incrementally update the log likelihood (hence _lpdf
which
stands for log probability density function). It is mathematically
equivalent to
delta[i][t] ~ normal(phi[i] * delta[i][t-1], sqrt(psi[i]))
,
which may look more familiar.
Also in the model block, we specify the likelihood of the
subject-specific deviations. As you may remember from simple, or even
multilevel, regression, deviations are usually assumed normally
distributed with a mean of zero. Here we specify something similar.
Specifically, we tell Stan that subject-specific deviations
(u
) are multivariate normally distributed with means of
zero (rep_vector(0,3)
, which is a vector of three zeros)
and covariance matrix Omega
. This multivariate
implementation enables you to assess whether subject-specific effects
are correlated. For example, whether subjects with a higher mean have
lower residual variance or vice versa.
model {
for (i in 1:N) {
// likelihood of subject-specific deviations
target += multi_normal_lpdf(u[i] | rep_vector(0,3),
Omega);
}
}
Prior distributions
In the model block, you also specify prior distributions for
the parameters in the parameters block. For the
population-level parameters (gamma
), we implement a normal
distribution with a mean of zero and a standard deviation of 3.
model {
// prior on population-level parameters
target += normal_lpdf(gamma | 0, 3);
}
The transformed parameters and generated quantities blocks
You may also add a transformed parameters block (in which you specify parameter transformations), and a generated quantities block (in which you compute additional outputs such as posterior predictives). Importantly, the model block runs locally. So if you want to use variables or parameters in the generated quantities block, you need to specify these in the transformed parameters block (instead of the model block). But that is for later.
Go back to the top for the real fun to start in the next module.
Fitting a simple DSEM
In the Get Started module, you have prepared all necessary
data to fit a DSEM to the example data on smoking urges. Stan requires
data to be formatted in a list. Besides, in the Stan model, we have used
the names N
for the number of subjects, T
for
the number of time points or observations, and Y
for the
time series data. To enable Stan to recognize which data are which, you
need to use these names (N
, T
, and
Y
) in the list. We start simple and only focus on the first
two RQs, that is, mean smoking urge and how urges linger over time.
Exercise 5: Create a list (called datUrges
) that
contains the number of subjects (N
), the number of
observations (T
), and the time series data (Y
)
which in this case is smoking urges.
datUrges <- list(N = Nsubj,
T = Nobs,
Y = urges)
We are now going to fit a DSEM with three parameters: mean, autoregression, and residual variance. As you may have noticed while plotting the data in the previous module, the urges data are centered around zero and have a fairly stable average over time. This indicates the data are detrended. As such, we do not have to worry about trends in our data and can omit a parameter capturing this trend in the DSEM. Beware that your own data may contain trends and that you should either deal with this before modeling (through detrending like in our example) or by including a trend in the DSEM. Code for a DSEM including trend is available here.
We have already programmed the three-parameter DSEM in Stan for you
below. In the Implementing a DSEM in Stan module, we explained
the core of what we implemented. For now, you have to believe that we
coded the model correctly. Remember that in the model gamma
indicates population-level effects. Specifically, gamma[1]
is the population-level mean, gamma[2]
the population-level
autoregression, and gamma[3]
the population-level residual
variance.
Exercise 6: Fit the three-parameter DSEM using the
stan()
function from the rstan package. The argument
model_code
should be set to mymodel, the object in
which we programmed the model code. The data
argument
should contain the data list you created in exercise 5. In the
pars
argument you may specify (using a character string)
which parameters to keep track of. For now, keep track of
population-level effects (gamma
) and standard deviations of
subject-specific effects (tau_Omega
). Set the
iter
argument to 1000 iterations. And for reproducibility,
set the seed
argument to 42.
fitUrges <- stan(model_code = mymodel,
data = datUrges,
pars = c("gamma","tau_Omega"),
iter = 1000,
seed = 42)
Exercise 7: Print the fit results for the population-level
effects using the print()
function.
print(fitUrges, pars="gamma", probs = c(0.025, 0.975), digits = 2)
The output from exercise 7, gives you information on many things. As
Stan uses Bayesian estimation, the mean
indicates the mean
of the posterior distribution of each parameter. This value is often
reported as ‘the best estimate of the parameter’. The standard error of
the mean (se_mean
) indicates how stable the mean is
estimated across iterations, with large values indicating unstable
estimates. The standard deviation (sd
) indicates how
uncertain that mean is (the higher the more uncertain). The following
columns the print()
function returns indicate quantiles.
This is very helpful as you may use these to report credible intervals
(similar to confidence intervals in frequentist statistics). For
instance, the 2.5% and 97.5% quantiles give you the 95% credible
interval. The final two columns can be used to check how well your model
performs and how well parameters are estimated. n_eff
stand
for number of effective samples which indicates how effective sampling
was done. Without going into too much detail, the Rhat
statistic (Gelman
& Rubin, 1992) gives you an indication of model convergence.
Ideally Rhats are 1; values above 1.1 are typically interpreted as
convergence problems.
Exercise 8: Report results on the first two RQs, that is, what the average smoking urge is and how urges linger over time.
If you did everything correctly all effects have converged with R-hat values below 1.1 (the closer to 1 the better). You’ll also find that the average smoking urge is practically zero. This may seem odd at first, but remember that data were detrended (and thus demeaned) beforehand so it makes total sense. The autoregression is small but positive, indicating that experiencing more than average smoking urges yesterday relates to experiencing more than average smoking urges today (and vice versa for less than average urges). Go back to the top to move on to investigating individual differences.
Investigating individual differences
Moving on to answering the third RQ, we’re now not interested in population-level effects, but in individual differences in these effects. Specifically, we test whether subjects differ in their mean smoking urge and how urges linger over time (both answered on the population level in exercise 8).
In exercise 6, you not only tracked population-level effects
(gamma
), but also standard deviations in subject-specific
effects (tau_Omega
) in the pars
argument.
These parameter estimates indicate to what extent subjects differ in
their mean, autoregression, and residual variance.
tau_Omega
contains three values: tau_Omega[1]
is the sd of subject-specific means, tau_Omega[2]
of
subject-specific autoregressions, and tau_Omega[3]
of
subject-specific residual variance.
Exercise 9: Use the print()
function to get fit
results for the standard deviations of subject-specific effects. Then
report on the third RQ. Tip: Compute the 95% credible interval to assess
in which range most of the means and autoregressions fall.
print(fitUrges, pars="tau_Omega", probs = c(0.025, 0.975), digits = 2)
Looking at the estimates of tau_Omega
there are
substantial individual differences. To get a better feel for the scale
of individual differences, you may calculate the 95% credible interval,
and thus assess in which parameter range most of the individuals fall.
To do so, first take the mean of the estimated parameter as indicated by
gamma
in the summary of exercise 7. Then add/subtract 1.96
times the standard deviation (as indicated by tau_Omega
in
the summary of exercise 9) to/from the mean of the estimated parameter.
With respect to the mean, most individuals’ means range between -1.6 and
1.6, whereas most individuals’ autoregressions range between -.09 and
.45.
If you wish to investigate subject-specific effects (instead of their
standard deviation), you need to keep track of those effects while
fitting the model. The subject-specific means are called
mu
, the subject-specific autoregressions phi
,
and the residual variances psi
. Beware that there are
3 times N
subject-specific effects in the model so keeping
track of them makes model fitting take substantially longer. To speed up
estimation, you may run
options(mc.cores = parallel::detectCores())
in the R
console to run sampling chains in parallel.
Exercise 10: Refit the model but now keep track of the three
subject-specific effects. Use the same settings as in exercise 6, except
for the pars
argument.
options(mc.cores = parallel::detectCores())
fitUrges_ind <- stan(model_code = mymodel,
data = datUrges,
pars = c("mu","phi","psi"),
iter = 1000,
seed = 42)
You can extract the posterior samples for easy plotting with the
extract()
function. This returns an object that contains
the posterior samples for the tracked parameters. Beware that the
elements in the object have dimensions
number of samples x number of parameters
for
population-level effects, and
number of samples x number of participants
for
subject-specific effects.
Exercise 11: Extract the posterior samples and then plot the subject-specific estimates to visualize individual differences.
post = extract(fitUrges_ind)
df = data.frame(pp = 1:Nsubj,
param = rep(c("mu","phi","psi"),each=Nsubj),
est = c(apply(post$mu,2,mean),apply(post$phi,2,mean),apply(post$psi,2,mean)))
ggplot(df, aes(x=est, y=pp)) +
geom_point() +
facet_wrap(~ param, scales="free") +
labs(x="Estimate", y="Participant") +
ylim(c(1,Nsubj))
Exercise 12: Compare the plots to your own results in exercise 9. Are there substantial individual differences in mean urges and lingering of urges?
Looking good! Now let’s move on to adding covariates to answer the fourth RQ.
Adding covariates
The fourth RQ concerns a covariate, namely how concurrent depressive
symptoms relate to smoking urges. This covariate needs to be added to
the Stan model. We have already implemented such a covariate in a Stan
model. Yet, it may be helpful to understand what needs to be changed to
include an additional predictor. First, in the data block, we
declared that we have additional data (which we called X
)
to be modeled. Second, in the parameters block, we added a
population-level parameter for the covariate effect (an additional
gamma
). Third, also in the parameters block, we
added subject-specific effects (u
’s) for the covariate.
Beware that effects can only differ between subjects when a predictor
varies from trial-to-trial, not for subject-specific predictors. Fourth,
in the model block, we changed the prior distribution of the
subject-specific effects to indicate that we now have four (instead of
three) deviations per subject. And fifth, also in the model
block, we changed the formula for the actual model, stating that
the mean is now not only determined by the autoregressive effect
(Equation \(\ref{eq2}\) in the
Implementing a DSEM in Stan module), but also by the
covariate.
The complete model is implemented in the object
mymodelcov
below.
Exercise 13: Update your data list datUrges
(created in exercise 5) to include object X
, containing the
depression data.
datUrges <- list(N = Nsubj,
T = Nobs,
Y = urges,
X = dep)
Exercise 14: Fit a DSEM including the depression covariate.
Keep track of the population-level means and sds. Use
seed = 2094
. Report on the fourth RQ, that is, whether
depression relates to urges. Also compare estimates of the other
parameters to the ones you reported in exercise 8 and 9.
fitUrges_dep <- stan(model_code = mymodelcov,
data = datUrges,
pars = c("gamma","tau_Omega"),
iter = 1000,
seed = 2094)
print(fitUrges_dep, pars=c("gamma","tau_Omega"), probs = c(0.025, 0.975), digits = 2)
The answer is yes, more concurrent depressive symptoms relate to more
smoking urges, as indicated by a relatively large positive regression
coefficient under gamma[4]
. Also note that the estimates of
the mean and autoregression (gamma[1]
and
gamma[2]
) have barely changed. The residual variance
(gamma[3]
) dropped substantially, indicating residual
variance is explained by including concurrent depression as a
predictor.
Great! You’re done… for now :) Additional modules will be added soon. Go back to the top.