Using the do-operator to simulate data¶
This tutorial illustrates how to use the underlying PyMC model of a given HSSM model to forward-simulate datasets for numerical studies.
import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import hssm
import pymc as pm
Simulate Data¶
We will simulate a simple DDM dataset using a single, fixed set of parameters.
# Simple model
data = hssm.simulate_data(
model="ddm", theta=dict(v=0.5, a=1.5, z=0.5, t=0.1), size=500
)
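As a quick sanity check (optional, not part of the original workflow), we can peek at the simulated trials; hssm.simulate_data returns one row per trial with rt and response columns.
# Inspect the simulated trials: reaction times (rt) and choices (response, coded -1/1)
print(data.head())
print(data["response"].value_counts())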
Specify HSSM Model¶
We now specify an HSSM model for this data.
# Model 1
hssm_model = hssm.HSSM(model="ddm", data=data)
Model initialized successfully.
Underlying PyMC model¶
pm.model_to_graphviz(hssm_model.pymc_model)
Using the do-operator¶
from pymc.model.transform.conditioning import do
synth_idata, synth_model = hssm_model.sample_do(
    params={"v": 0, "a": 1.5, "z": 0.5, "t": 0.1},
    draws=100,
    return_model=True,
)
Sampling: [rt,response]
Note that this process defines a new PyMC model, in which the parameters we passed via our dictionary are set to Data objects.
pm.model_to_graphviz(synth_model)
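For reference, the do transform imported above lets you perform the same intervention by hand. The sketch below is a rough, minimal equivalent of what sample_do() does; HSSM's actual internals may differ.
# Fix v, a, z and t in the underlying PyMC model via the do-operator
manual_do_model = do(
    hssm_model.pymc_model,
    {"v": 0.0, "a": 1.5, "z": 0.5, "t": 0.1},
)

# Forward-simulate from the intervened model; the synthetic trials land in
# the prior_predictive group of the returned InferenceData
manual_idata = pm.sample_prior_predictive(draws=100, model=manual_do_model)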
synth_idata
- <xarray.Dataset> Size: 405kB Dimensions: (chain: 1, draw: 100, __obs__: 500) Coordinates: * chain (chain) int64 8B 0 * draw (draw) int64 800B 0 1 2 3 4 5 6 7 8 ... 91 92 93 94 95 96 97 98 99 * __obs__ (__obs__) int64 4kB 0 1 2 3 4 5 6 7 ... 493 494 495 496 497 498 499 Data variables: v_mean (chain, draw, __obs__) float64 400kB 0.0 0.0 0.0 ... 0.0 0.0 0.0 Attributes: created_at: 2025-10-17T01:18:54.705842+00:00 arviz_version: 0.22.0 inference_library: pymc inference_library_version: 5.25.1
- <xarray.Dataset> Size: 805kB Dimensions: (chain: 1, draw: 100, __obs__: 500, rt,response_dim: 2) Coordinates: * chain (chain) int64 8B 0 * draw (draw) int64 800B 0 1 2 3 4 5 6 7 ... 93 94 95 96 97 98 99 * __obs__ (__obs__) int64 4kB 0 1 2 3 4 5 ... 494 495 496 497 498 499 * rt,response_dim (rt,response_dim) int64 16B 0 1 Data variables: rt,response (chain, draw, __obs__, rt,response_dim) float64 800kB 1.... Attributes: created_at: 2025-10-17T01:18:54.708836+00:00 arviz_version: 0.22.0 inference_library: pymc inference_library_version: 5.25.1
- <xarray.Dataset> Size: 12kB Dimensions: (__obs__: 500, rt,response_extra_dim_0: 2) Coordinates: * __obs__ (__obs__) int64 4kB 0 1 2 3 4 ... 496 497 498 499 * rt,response_extra_dim_0 (rt,response_extra_dim_0) int64 16B 0 1 Data variables: rt,response (__obs__, rt,response_extra_dim_0) float64 8kB 0... Attributes: created_at: 2025-10-17T01:18:54.709714+00:00 arviz_version: 0.22.0 inference_library: pymc inference_library_version: 5.25.1
- <xarray.Dataset> Size: 32B Dimensions: () Data variables: z float64 8B 0.5 a float64 8B 1.5 t float64 8B 0.1 v float64 8B 0.0 Attributes: created_at: 2025-10-17T01:18:54.710879+00:00 arviz_version: 0.22.0 inference_library: pymc inference_library_version: 5.25.1
Observe two things:
- A closer look at the prior group reveals that no parameters were actually sampled (it only collects the deterministic v_mean, which, arguably, we could skip too).
- The prior_predictive group is where our synthetic data was collected.
Our synthetic data sits in the prior_predictive group because, under the hood, we de facto called the sample_prior_predictive() method on our new do-model.
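If you want the raw arrays rather than a DataFrame, you can index the prior_predictive group directly. A small sketch; the variable name "rt,response" and its dimensions match the output shown above:
# Synthetic trials as a plain array of shape (chain, draw, __obs__, 2),
# where the last axis holds (rt, response)
sim = synth_idata.prior_predictive["rt,response"].values
print(sim.shape)  # (1, 100, 500, 2)

# Reaction times of the first synthetic dataset (chain 0, draw 0)
rt_draw0 = sim[0, 0, :, 0]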
If you prefer to get your outputs as a DataFrame, you can use the predictive_idata_to_dataframe method from the hssm.utils module.
synth_df = hssm.utils.predictive_idata_to_dataframe(
    synth_idata, predictive_group="prior_predictive"
)
synth_df
|       | chain | draw | __obs__ | rt       | response |
|-------|-------|------|---------|----------|----------|
| 0     | 0     | 0    | 0       | 1.446554 | 1.0      |
| 1     | 0     | 0    | 1       | 1.198923 | 1.0      |
| 2     | 0     | 0    | 2       | 0.892631 | -1.0     |
| 3     | 0     | 0    | 3       | 3.382068 | 1.0      |
| 4     | 0     | 0    | 4       | 0.436959 | -1.0     |
| ...   | ...   | ...  | ...     | ...      | ...      |
| 49995 | 0     | 99   | 495     | 3.510932 | 1.0      |
| 49996 | 0     | 99   | 496     | 1.799391 | 1.0      |
| 49997 | 0     | 99   | 497     | 1.187817 | -1.0     |
| 49998 | 0     | 99   | 498     | 0.320621 | 1.0      |
| 49999 | 0     | 99   | 499     | 1.530689 | 1.0      |

50000 rows × 5 columns
From here you may re-attach the original data used to define the hssm_model (e.g., to bring back covariates that you want as part of the model you are testing on your synthetic data).
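For instance, here is a hedged sketch of such a merge, assuming the original data carried a trial-level covariate (a hypothetical column x, which the simple dataset above does not actually contain), using __obs__ as the join key:
# Hypothetical covariate re-attachment: x stands in for any trial-level
# predictor in the original data. __obs__ runs 0..N-1 and matches the row
# order of data, so it can serve as the join key.
covariates = data.reset_index().rename(columns={"index": "__obs__"})[["__obs__", "x"]]
synth_with_covariates = synth_df.merge(covariates, on="__obs__", how="left")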
The sample_do() method can be extremely helpful for setting up numerical experiments, e.g. to test parameter identifiability in scenarios that you wish to control precisely.
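As an illustration (a sketch assuming default sampler settings, not part of the tutorial's code), one could refit a DDM to a single synthetic draw and check whether the posterior recovers the intervened drift rate of 0:
# Take one synthetic dataset (chain 0, draw 0) and refit a DDM to it
one_draw = synth_df.query("draw == 0")[["rt", "response"]].reset_index(drop=True)

recovery_model = hssm.HSSM(model="ddm", data=one_draw)
recovery_idata = recovery_model.sample()

# Does the posterior for v concentrate near the intervened value of 0?
print(az.summary(recovery_idata, var_names=["v", "a", "z", "t"]))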