{
"cells": [
{
"cell_type": "markdown",
"id": "8f226bdf",
"metadata": {},
"source": [
"## 3. Model checking\n",
"\n",
"
\n",
"\n",
"Last time we focussed on building a simple initial model for the Cepheid variable data. In this notebook, we explore the next steps in the workflow, **model implementation** and **model checking**. Along the way, we'll also introduce **prior predictive checks** and **posterior predictive checks**. "
]
},
{
"cell_type": "markdown",
"id": "91802fbe",
"metadata": {},
"source": [
"### Implementation in Stan\n",
"\n",
"
\n",
"\n",
"In the [model building notebook](model_building.ipynb), we developed a simple model to fit the period-luminosity relation of Cepheid variable stars. We can now implement this model in Stan and use it to infer the parameters $\\alpha$ and $\\beta$ that describe the relation.\n",
"\n",
"
\n",
"\n",
"As we've seen before in the [introduction notebook](introduction.ipynb), Stan files consist of different blocks that divide information into `data`, `parameters` and `model`. You can find more detailed information on how to write models in Stan code [here](https://mc-stan.org/docs/2_27/reference-manual/index.html#overview), but for now we can just take a look at this model: `stan/cepheid_v0.stan`."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "4371e844",
"metadata": {
"ExecuteTime": {
"end_time": "2021-08-27T08:19:14.947998Z",
"start_time": "2021-08-27T08:19:14.820155Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"/**\r\n",
" * Initial model for Cepheid variables\r\n",
" * - Single galaxy\r\n",
" * - Shared sigma_m\r\n",
" **/\r\n",
"\r\n",
"data {\r\n",
"\r\n",
" /* usual inputs */\r\n",
" int Nc; // number of Cepheids\r\n",
" vector[Nc] m_obs; // obs apparent mag.\r\n",
" real sigma_m; // mag uncertainty\r\n",
" vector[Nc] log10P; // log10(periods/days)\r\n",
" real z; // redshift of single galaxy\r\n",
"\r\n",
" /* for generated quantities */\r\n",
" int Ngrid;\r\n",
" vector[Ngrid] log10P_grid;\r\n",
"}\r\n",
"\r\n",
"transformed data {\r\n",
"\r\n",
" real dL;\r\n",
" \r\n",
" /* luminosity distance */\r\n",
" dL = (3.0e5 * z) / 70.0; // Mpc\r\n",
" \r\n",
"}\r\n",
"\r\n",
"parameters {\r\n",
"\r\n",
" /* parameters of the P-L relation */\r\n",
" real alpha;\r\n",
" real beta;\r\n",
" \r\n",
"}\r\n",
"\r\n",
"transformed parameters {\r\n",
"\r\n",
" vector[Nc] M_true;\r\n",
" vector[Nc] m_true;\r\n",
"\r\n",
" /* P-L relation */\r\n",
" M_true = alpha + beta * log10P;\r\n",
"\r\n",
" /* convert to m */\r\n",
" m_true = M_true + 5 * log10(dL) + 25;\r\n",
" \r\n",
"}\r\n",
"\r\n",
"model { \r\n",
"\r\n",
" /* likelihood */\r\n",
" m_obs ~ normal(m_true, sigma_m);\r\n",
" \r\n",
"}\r\n",
"\r\n",
"generated quantities {\r\n",
"\r\n",
" vector[Ngrid] line;\r\n",
"\r\n",
" /* generate the posterior for the P-L relation */\r\n",
" line = alpha + beta * log10P_grid;\r\n",
" \r\n",
"}\r\n"
]
}
],
"source": [
"!cat stan/cepheid_v0.stan"
]
},
{
"cell_type": "markdown",
"id": "f87c92ec",
"metadata": {},
"source": [
"A few things to note:\n",
"\n",
"* We could have left the `M` <-> `m` conversion outside of Stan for a more compact program, but it is quite easy to have it inside, meaning that our work in this notebook is simplified.\n",
"* We could cut out the `transformed parameters` block and replace `m_true` with a long expression in the `model` block, but this is easy to read and gives us direct access to the `M_true` and `m_true` distributions, as we will see later.\n",
"* The `generated quantities` block is not part of the statistical model, but is run once for every iteration of the sampler. This can be useful for generating simple outputs, again as we will see later.\n",
"\n",
"
\n",
"\n",
"What about priors??\n",
"\n",
"The more awake amongst you may notice that we have explicitly specified data, parameters and a likelihood in the Stan code, but no prior. A Bayesian analysis requires specification of the priors, so what is going on? By default, Stan will assume a uniform prior over the range of the parameters that you specified. Here, we specify `alpha` and `beta` with no bounds, and so our prior is implicitly uniform over all the real numbers. Does this sound reasonable? In any case, we will run with it for now, and talk about priors more later in this notebook.\n",
"\n",
"
\n",
"\n",
"Next, let's set up our environment and compile our model to get ready to fit some data."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "08961ece",
"metadata": {
"ExecuteTime": {
"end_time": "2021-08-27T08:19:15.978729Z",
"start_time": "2021-08-27T08:19:14.950833Z"
}
},
"outputs": [],
"source": [
"import numpy as np\n",
"from matplotlib import pyplot as plt\n",
"import arviz as av\n",
"from cmdstanpy import CmdStanModel"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "8045500d",
"metadata": {
"ExecuteTime": {
"end_time": "2021-08-27T08:19:15.984474Z",
"start_time": "2021-08-27T08:19:15.980238Z"
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO:cmdstanpy:found newer exe file, not recompiling\n",
"INFO:cmdstanpy:compiled model file: /Users/fran/projects/BayesianWorkflow/src/notebooks/stan/cepheid_v0\n"
]
}
],
"source": [
"cepheid_v0 = CmdStanModel(stan_file=\"stan/cepheid_v0.stan\")"
]
},
{
"cell_type": "markdown",
"id": "ba0697d1",
"metadata": {},
"source": [
"### The first check: fitting simulated data\n",
"\n",
"
\n",
"\n",
"While we developed our initial generative and statistical models with the goal to fit *real* data, in doing so we now have a simple simulation program that contains exactly the same information as is included in our model. We can use this to generate *simulated* data where we *know* the values of `alpha` and `beta`, and can verify that our fit gives us reasonable results. \n",
"\n",
"
\n",
"\n",
"\n",
"\n",
"
\n",
"\n",
"If we are not able to recover the known simulated truth, this means there is some kind of discrepancy between the generative model and the statistical model. This fact can be a **powerful tool for debugging**, especially when working with more complex models, as it allows us to check if our Stan model is doing what we think it should be. *Just because it compiles, it doesn't mean that it is correct!*\n",
"\n",
"
\n",
"\n",
"So, let's use our work in the [model building](model_building.ipynb) to simulate a test dataset using the same assumptions that we did there:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "d6c8aa6f",
"metadata": {
"ExecuteTime": {
"end_time": "2021-08-27T08:19:15.989899Z",
"start_time": "2021-08-27T08:19:15.986986Z"
}
},
"outputs": [],
"source": [
"# known quantities\n",
"z = 0.005\n",
"\n",
"Nc = 10\n",
"P = [71.5, 45.0, 19.9, 30.6, 8.8, 75.2, 21.5, 22.0, 34.5, 87.4] # days\n",
"sigma_m = 0.5\n",
"\n",
"c = 3.0e5 # km/s\n",
"H0 = 70.0 # km/s/Mpc\n",
"\n",
"# high-level parameter values\n",
"alpha_true = -2\n",
"beta_true = -4"
]
},
{
"cell_type": "markdown",
"id": "261fd2fb",
"metadata": {},
"source": [
"You can copy over your simulation code, or load data that you saved last time. I'll load some data that I simulated earlier:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "34580367",
"metadata": {
"ExecuteTime": {
"end_time": "2021-08-27T08:19:15.993587Z",
"start_time": "2021-08-27T08:19:15.991218Z"
}
},
"outputs": [],
"source": [
"m_obs = np.array([21.30057715, 22.89477067, 24.63106411, \n",
" 22.72569458, 26.29936958, 22.34729756, \n",
" 24.19867534, 24.64700615, 23.19478031, 21.94968928])"
]
},
{
"cell_type": "markdown",
"id": "2be0a6f9",
"metadata": {},
"source": [
"We can now define our `data` dict to pass to the Stan model...\n",
"\n",
">Note: The variable names in the Stan file must match the keys of the dict"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "754b2ce8",
"metadata": {
"ExecuteTime": {
"end_time": "2021-08-27T08:19:15.997885Z",
"start_time": "2021-08-27T08:19:15.994863Z"
}
},
"outputs": [],
"source": [
"data = {}\n",
"data[\"Nc\"] = Nc\n",
"data[\"z\"] = z\n",
"data[\"log10P\"] = np.log10(P)\n",
"data[\"m_obs\"] = m_obs\n",
"data[\"sigma_m\"] = sigma_m\n",
"\n",
"# for generated quantities\n",
"data[\"Ngrid\"] = 50\n",
"data[\"log10P_grid\"] = np.linspace(0, 3)"
]
},
{
"cell_type": "markdown",
"id": "f332b982",
"metadata": {},
"source": [
"... and try to fit the simulated data"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "50683587",
"metadata": {
"ExecuteTime": {
"end_time": "2021-08-27T08:19:16.767978Z",
"start_time": "2021-08-27T08:19:15.999304Z"
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO:cmdstanpy:start chain 1\n",
"INFO:cmdstanpy:start chain 2\n",
"INFO:cmdstanpy:start chain 3\n",
"INFO:cmdstanpy:start chain 4\n",
"INFO:cmdstanpy:finish chain 2\n",
"INFO:cmdstanpy:finish chain 1\n",
"INFO:cmdstanpy:finish chain 3\n",
"INFO:cmdstanpy:finish chain 4\n",
"INFO:cmdstanpy:Processing csv files: /var/folders/8d/cyg0_lx54mggm8v350vlx91r0000gn/T/tmpmlkdxck8/cepheid_v0-202108271019-1-hce0fjz2.csv, /var/folders/8d/cyg0_lx54mggm8v350vlx91r0000gn/T/tmpmlkdxck8/cepheid_v0-202108271019-2-ecyi9p7i.csv, /var/folders/8d/cyg0_lx54mggm8v350vlx91r0000gn/T/tmpmlkdxck8/cepheid_v0-202108271019-3-_xo9pdvk.csv, /var/folders/8d/cyg0_lx54mggm8v350vlx91r0000gn/T/tmpmlkdxck8/cepheid_v0-202108271019-4-j47jnkvj.csv\n",
"\n",
"Checking sampler transitions treedepth.\n",
"Treedepth satisfactory for all transitions.\n",
"\n",
"Checking sampler transitions for divergences.\n",
"No divergent transitions found.\n",
"\n",
"Checking E-BFMI - sampler transitions HMC potential energy.\n",
"E-BFMI satisfactory.\n",
"\n",
"Effective sample size satisfactory.\n",
"\n",
"Split R-hat values satisfactory all parameters.\n",
"\n",
"Processing complete, no problems detected.\n"
]
},
{
"data": {
"text/html": [
"
\n", " | Mean | \n", "MCSE | \n", "StdDev | \n", "5% | \n", "50% | \n", "95% | \n", "N_Eff | \n", "N_Eff/s | \n", "R_hat | \n", "
---|---|---|---|---|---|---|---|---|---|
name | \n", "\n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " |
lp__ | \n", "-4.8 | \n", "0.0320 | \n", "1.00 | \n", "-6.9 | \n", "-4.5 | \n", "-3.800 | \n", "1000.0 | \n", "4300.0 | \n", "1.0 | \n", "
alpha | \n", "-1.3 | \n", "0.0340 | \n", "0.83 | \n", "-2.7 | \n", "-1.3 | \n", "0.036 | \n", "590.0 | \n", "2500.0 | \n", "1.0 | \n", "
beta | \n", "-4.5 | \n", "0.0220 | \n", "0.54 | \n", "-5.4 | \n", "-4.5 | \n", "-3.600 | \n", "580.0 | \n", "2500.0 | \n", "1.0 | \n", "
M_true[1] | \n", "-9.7 | \n", "0.0076 | \n", "0.24 | \n", "-10.0 | \n", "-9.7 | \n", "-9.300 | \n", "1016.0 | \n", "4304.0 | \n", "1.0 | \n", "
M_true[2] | \n", "-8.8 | \n", "0.0036 | \n", "0.18 | \n", "-9.1 | \n", "-8.8 | \n", "-8.500 | \n", "2440.0 | \n", "10341.0 | \n", "1.0 | \n", "
... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "
line[46] | \n", "-14.0 | \n", "0.0270 | \n", "0.68 | \n", "-15.0 | \n", "-14.0 | \n", "-13.000 | \n", "621.0 | \n", "2633.0 | \n", "1.0 | \n", "
line[47] | \n", "-14.0 | \n", "0.0290 | \n", "0.72 | \n", "-15.0 | \n", "-14.0 | \n", "-13.000 | \n", "618.0 | \n", "2620.0 | \n", "1.0 | \n", "
line[48] | \n", "-14.0 | \n", "0.0300 | \n", "0.75 | \n", "-16.0 | \n", "-14.0 | \n", "-13.000 | \n", "614.0 | \n", "2603.0 | \n", "1.0 | \n", "
line[49] | \n", "-15.0 | \n", "0.0320 | \n", "0.78 | \n", "-16.0 | \n", "-15.0 | \n", "-13.000 | \n", "612.0 | \n", "2592.0 | \n", "1.0 | \n", "
line[50] | \n", "-15.0 | \n", "0.0330 | \n", "0.81 | \n", "-16.0 | \n", "-15.0 | \n", "-14.000 | \n", "609.0 | \n", "2582.0 | \n", "1.0 | \n", "
73 rows × 9 columns
\n", "