Notebook

Strutural Equation Modelling (SEM) with Jupyter and Anaconda¶

This notebook is intended as an example of how to "do SEM" using different tools that are accessible in a Jupyter IPython Notebook installed with Anaconda. The notebook uses a Python3 kernel and interfaces with several R packages. Alternatives would be to run an R kernel in the notebook directly, or to use an RStudio markdown notebook and the reticulate package to access Python functionality from R.

Stefan Rank (strank at strank.info)

To Do for future revisions:

Integrate instructions on using different hosted / collaborative notebook options for this tutorial:
- CoCalc: https://cocalc.com/
- Google Colaboratory in Google Drive: https://colab.research.google.com/
  - (Python works, conda & R aren't available so most of the meat of this notebook won't work. If you block third-party cookies, you might have to allow cookies from [*.]googleusercontent.com to get output working)
- Microsoft Azure Notebooks: https://notebooks.azure.com/
- Binder: https://mybinder.org/
- A JupyterHub server: https://jupyterhub.readthedocs.io/en/stable/ (if I ever get to installing one)
Turn it into a RISE presentation https://rise.readthedocs.io/en/docs_hot_fixes/ (ideally also using the "Split Cells Notebook" extension for better visual organization)
Integrate other SEM tools available in R:
- lavaan, openmx, semplot (DONE)
- sem (less well documented/supported)
- metasem and metafor for meta-analysis (goes beyond normal SEM)
- simsem's semtools
- bigsem, ctsem https://github.com/cdriveraus/ctsem, gsem, lavaan.survey, regsem, semdiag
Compare with other tools:
- Onyx GUI for SEM http://onyx.brandmaier.de/

For local use, I recommend jupyterlab as the interface, usually acessible as http://localhost:8888/lab

Installing Required (and Optional) Components¶

To run the code in this notebook, a local (or hosted) installation of Anaconda is required. The following installs R components available through Anaconda. We use the --yes option to avoid prompts for confirmation (impossible in the notebook) but that also means you cannot check what will be installed before it proceeds. To avoid this, you could run the installations from an Anaconda prompt or using the Anaconda Navigator.

R Studio is not strictly required, but it is a useful alternative way of using R. Also note that this assumes an Anaconda installation/environment that you have permission to change. On Windows, that means choosing the recommended "for this user only" option on install of Anaconda. Otherwise, these installations require Administrator / root rights.

In [2]:

!conda install --yes r-essentials

Solving environment: ...working... done

# All requested packages already installed.

In [3]:

# !conda install --yes rstudio

Solving environment: ...working... done

# All requested packages already installed.

In [4]:

!conda install --yes r-lavaan

Solving environment: ...working... done

# All requested packages already installed.

Installing and Enabling the Python-R bridge¶

Now we need rpy2 which provides the bridge between this python notebook and R. (Also installing tzlocal due to a current dependency bug in rpy2.)

In [3]:

!conda install --yes tzlocal rpy2

Solving environment: ...working... done

# All requested packages already installed.

Let's enable the rpy2 extension, so that we can then execute R code with the %%R magic command at the top of a cell.

In [1]:

%load_ext rpy2.ipython

If you are on Windows, and you do not get text output from %%R cells showing up in the notebook, but instead in the console window where jupyter is running, this is a bug in rpy2 on Windows, and there's a workaround to capture stdout by running the following cells, see https://github.com/vitorcurtis/RWinOut

In [21]:

%%R
install.packages(c("R.utils"))

In [23]:

!curl -O "https://raw.githubusercontent.com/vitorcurtis/RWinOut/master/RWinOut.py"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  1259  100  1259    0     0  11550      0 --:--:-- --:--:-- --:--:-- 11550

In [2]:

%load_ext RWinOut

The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython

Installing Interface Conveniences¶

And finally we want to see plots directly in the notebook. The simplest way is to request inline plots.

In [3]:

%matplotlib inline

But assuming you are running this in a JuyterLab interface, you might want the ipympl library to get interactive widget plots.

In [41]:

# We install ipympl with pip, as it is not yet readily available with conda:
!pip install ipympl

In [41]:

# nodejs is needed for the interactive features if using JupyterLab,
# the corresponding package for normal notebooks, widgetsnbextension, should already be installed.
!conda install --yes nodejs

In [41]:

# install the jupyterlab extensions:
!jupyter labextension install @jupyter-widgets/jupyterlab-manager
!jupyter labextension install jupyter-matplotlib

Requirement already satisfied: ipympl in c:\users\sr876\appdata\local\continuum\anaconda3\lib\site-packages (0.2.1)
Requirement already satisfied: ipykernel>=4.7 in c:\users\sr876\appdata\local\continuum\anaconda3\lib\site-packages (from ipympl) (5.1.0)
Requirement already satisfied: ipywidgets>=7.0.0 in c:\users\sr876\appdata\local\continuum\anaconda3\lib\site-packages (from ipympl) (7.4.2)
Requirement already satisfied: six in c:\users\sr876\appdata\local\continuum\anaconda3\lib\site-packages (from ipympl) (1.11.0)
Requirement already satisfied: matplotlib>=2.0.0 in c:\users\sr876\appdata\local\continuum\anaconda3\lib\site-packages (from ipympl) (2.2.2)
Requirement already satisfied: ipython>=5.0.0 in c:\users\sr876\appdata\local\continuum\anaconda3\lib\site-packages (from ipykernel>=4.7->ipympl) (7.1.1)
Requirement already satisfied: traitlets>=4.1.0 in c:\users\sr876\appdata\local\continuum\anaconda3\lib\site-packages (from ipykernel>=4.7->ipympl) (4.3.2)
Requirement already satisfied: jupyter-client in c:\users\sr876\appdata\local\continuum\anaconda3\lib\site-packages (from ipykernel>=4.7->ipympl) (5.2.3)
Requirement already satisfied: tornado>=4.2 in c:\users\sr876\appdata\local\continuum\anaconda3\lib\site-packages (from ipykernel>=4.7->ipympl) (5.1.1)
Requirement already satisfied: widgetsnbextension~=3.4.0 in c:\users\sr876\appdata\local\continuum\anaconda3\lib\site-packages (from ipywidgets>=7.0.0->ipympl) (3.4.2)
Requirement already satisfied: nbformat>=4.2.0 in c:\users\sr876\appdata\local\continuum\anaconda3\lib\site-packages (from ipywidgets>=7.0.0->ipympl) (4.4.0)
Requirement already satisfied: numpy>=1.7.1 in c:\users\sr876\appdata\local\continuum\anaconda3\lib\site-packages (from matplotlib>=2.0.0->ipympl) (1.15.3)
Requirement already satisfied: cycler>=0.10 in c:\users\sr876\appdata\local\continuum\anaconda3\lib\site-packages (from matplotlib>=2.0.0->ipympl) (0.10.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in c:\users\sr876\appdata\local\continuum\anaconda3\lib\site-packages (from matplotlib>=2.0.0->ipympl) (2.2.2)
Requirement already satisfied: python-dateutil>=2.1 in c:\users\sr876\appdata\local\continuum\anaconda3\lib\site-packages (from matplotlib>=2.0.0->ipympl) (2.7.5)
Requirement already satisfied: pytz in c:\users\sr876\appdata\local\continuum\anaconda3\lib\site-packages (from matplotlib>=2.0.0->ipympl) (2018.7)
Requirement already satisfied: kiwisolver>=1.0.1 in c:\users\sr876\appdata\local\continuum\anaconda3\lib\site-packages (from matplotlib>=2.0.0->ipympl) (1.0.1)
Requirement already satisfied: colorama; sys_platform == "win32" in c:\users\sr876\appdata\local\continuum\anaconda3\lib\site-packages (from ipython>=5.0.0->ipykernel>=4.7->ipympl) (0.4.0)
Requirement already satisfied: pygments in c:\users\sr876\appdata\local\continuum\anaconda3\lib\site-packages (from ipython>=5.0.0->ipykernel>=4.7->ipympl) (2.2.0)
Requirement already satisfied: setuptools>=18.5 in c:\users\sr876\appdata\local\continuum\anaconda3\lib\site-packages (from ipython>=5.0.0->ipykernel>=4.7->ipympl) (40.5.0)
Requirement already satisfied: decorator in c:\users\sr876\appdata\local\continuum\anaconda3\lib\site-packages (from ipython>=5.0.0->ipykernel>=4.7->ipympl) (4.3.0)
Requirement already satisfied: prompt-toolkit<2.1.0,>=2.0.0 in c:\users\sr876\appdata\local\continuum\anaconda3\lib\site-packages (from ipython>=5.0.0->ipykernel>=4.7->ipympl) (2.0.7)
Requirement already satisfied: jedi>=0.10 in c:\users\sr876\appdata\local\continuum\anaconda3\lib\site-packages (from ipython>=5.0.0->ipykernel>=4.7->ipympl) (0.13.1)
Requirement already satisfied: pickleshare in c:\users\sr876\appdata\local\continuum\anaconda3\lib\site-packages (from ipython>=5.0.0->ipykernel>=4.7->ipympl) (0.7.5)
Requirement already satisfied: backcall in c:\users\sr876\appdata\local\continuum\anaconda3\lib\site-packages (from ipython>=5.0.0->ipykernel>=4.7->ipympl) (0.1.0)
Requirement already satisfied: ipython-genutils in c:\users\sr876\appdata\local\continuum\anaconda3\lib\site-packages (from traitlets>=4.1.0->ipykernel>=4.7->ipympl) (0.2.0)
Requirement already satisfied: jupyter-core in c:\users\sr876\appdata\local\continuum\anaconda3\lib\site-packages (from jupyter-client->ipykernel>=4.7->ipympl) (4.4.0)
Requirement already satisfied: pyzmq>=13 in c:\users\sr876\appdata\local\continuum\anaconda3\lib\site-packages (from jupyter-client->ipykernel>=4.7->ipympl) (17.1.2)
Requirement already satisfied: notebook>=4.4.1 in c:\users\sr876\appdata\local\continuum\anaconda3\lib\site-packages (from widgetsnbextension~=3.4.0->ipywidgets>=7.0.0->ipympl) (5.7.0)
Requirement already satisfied: jsonschema!=2.5.0,>=2.4 in c:\users\sr876\appdata\local\continuum\anaconda3\lib\site-packages (from nbformat>=4.2.0->ipywidgets>=7.0.0->ipympl) (2.6.0)
Requirement already satisfied: wcwidth in c:\users\sr876\appdata\local\continuum\anaconda3\lib\site-packages (from prompt-toolkit<2.1.0,>=2.0.0->ipython>=5.0.0->ipykernel>=4.7->ipympl) (0.1.7)
Requirement already satisfied: parso>=0.3.0 in c:\users\sr876\appdata\local\continuum\anaconda3\lib\site-packages (from jedi>=0.10->ipython>=5.0.0->ipykernel>=4.7->ipympl) (0.3.1)
Requirement already satisfied: Send2Trash in c:\users\sr876\appdata\local\continuum\anaconda3\lib\site-packages (from notebook>=4.4.1->widgetsnbextension~=3.4.0->ipywidgets>=7.0.0->ipympl) (1.5.0)
Requirement already satisfied: prometheus-client in c:\users\sr876\appdata\local\continuum\anaconda3\lib\site-packages (from notebook>=4.4.1->widgetsnbextension~=3.4.0->ipywidgets>=7.0.0->ipympl) (0.4.2)
Requirement already satisfied: jinja2 in c:\users\sr876\appdata\local\continuum\anaconda3\lib\site-packages (from notebook>=4.4.1->widgetsnbextension~=3.4.0->ipywidgets>=7.0.0->ipympl) (2.10)
Requirement already satisfied: nbconvert in c:\users\sr876\appdata\local\continuum\anaconda3\lib\site-packages (from notebook>=4.4.1->widgetsnbextension~=3.4.0->ipywidgets>=7.0.0->ipympl) (5.3.1)
Requirement already satisfied: terminado>=0.8.1 in c:\users\sr876\appdata\local\continuum\anaconda3\lib\site-packages (from notebook>=4.4.1->widgetsnbextension~=3.4.0->ipywidgets>=7.0.0->ipympl) (0.8.1)
Requirement already satisfied: MarkupSafe>=0.23 in c:\users\sr876\appdata\local\continuum\anaconda3\lib\site-packages (from jinja2->notebook>=4.4.1->widgetsnbextension~=3.4.0->ipywidgets>=7.0.0->ipympl) (1.0)
Requirement already satisfied: pandocfilters>=1.4.1 in c:\users\sr876\appdata\local\continuum\anaconda3\lib\site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.4.0->ipywidgets>=7.0.0->ipympl) (1.4.2)
Requirement already satisfied: bleach in c:\users\sr876\appdata\local\continuum\anaconda3\lib\site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.4.0->ipywidgets>=7.0.0->ipympl) (3.0.2)
Requirement already satisfied: testpath in c:\users\sr876\appdata\local\continuum\anaconda3\lib\site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.4.0->ipywidgets>=7.0.0->ipympl) (0.4.2)
Requirement already satisfied: entrypoints>=0.2.2 in c:\users\sr876\appdata\local\continuum\anaconda3\lib\site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.4.0->ipywidgets>=7.0.0->ipympl) (0.2.3)
Requirement already satisfied: mistune>=0.7.4 in c:\users\sr876\appdata\local\continuum\anaconda3\lib\site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.4.0->ipywidgets>=7.0.0->ipympl) (0.8.4)
Requirement already satisfied: webencodings in c:\users\sr876\appdata\local\continuum\anaconda3\lib\site-packages (from bleach->nbconvert->notebook>=4.4.1->widgetsnbextension~=3.4.0->ipywidgets>=7.0.0->ipympl) (0.5.1)
Solving environment: ...working... done

# All requested packages already installed.

jupyter-widgets-jupyterlab-manager-0.38.1.tgz
yarn install v1.9.4
info No lockfile found.
[1/5] Validating package.json...
[2/5] Resolving packages...
warning css-loader > cssnano > autoprefixer > browserslist@1.7.7: Browserslist 2 could fail on reading Browserslist >3.0 config used in other tools.
warning css-loader > cssnano > postcss-merge-rules > browserslist@1.7.7: Browserslist 2 could fail on reading Browserslist >3.0 config used in other tools.
warning css-loader > cssnano > postcss-merge-rules > caniuse-api > browserslist@1.7.7: Browserslist 2 could fail on reading Browserslist >3.0 config used in other tools.
[3/5] Fetching packages...
info fsevents@1.2.4: The platform "win32" is incompatible with this module.
info "fsevents@1.2.4" is an optional dependency and failed compatibility check. Excluding it from installation.
[4/5] Linking dependencies...
warning "@jupyterlab/vdom-extension > @nteract/transform-vdom@1.1.1" has incorrect peer dependency "react@^15.6.1".
[5/5] Building fresh packages...
success Saved lockfile.
Done in 75.75s.
yarn run v1.9.4
$ webpack
Hash: f1eac10162dd2d44bf4d
Version: webpack 4.12.2
Time: 39111ms
Built at: 2018-11-02 20:57:26
                                                             Asset       Size                                  Chunks             Chunk Names
                                         1.055322dcf6c2bb19185f.js    888 KiB                                       1  [emitted]  
                              674f50d287a8c48dc19ba404d20fe713.eot    162 KiB                                          [emitted]  
                              912ec66d7572ff821749319396470bde.svg    434 KiB                                          [emitted]  
                             fee66e712a8a08eef5805a46892932ad.woff   95.7 KiB                                          [emitted]  
                              b06871f281fee6b241d60582ae9369b9.ttf    162 KiB                                          [emitted]  
                                      main.5a992934528990838951.js   54.6 KiB                                    main  [emitted]  main
                                         0.cc98107762fcc28532b3.js    4.5 KiB                                       0  [emitted]  
                                      vega.91b98e783d16fd1b9e23.js  519 bytes                                    vega  [emitted]  vega
    vendors~@jupyter-widgets/controls~vega.3dd933b62461edbc58d8.js   22.6 KiB  vendors~@jupyter-widgets/controls~vega  [emitted]  vendors~@jupyter-widgets/controls~vega
                              vendors~main.44f7c1af2649541ea6c6.js   9.41 MiB                            vendors~main  [emitted]  vendors~main
                              vendors~vega.1e302f886dbe300cf0ef.js   2.76 MiB                            vendors~vega  [emitted]  vendors~vega
                            af7ae505a9eed503f8b8e6982036873e.woff2   75.4 KiB                                          [emitted]  
         vendors~@jupyter-widgets/controls.741a7524652a40694e8a.js    266 KiB       vendors~@jupyter-widgets/controls  [emitted]  vendors~@jupyter-widgets/controls
                                  main.5a992934528990838951.js.map   63.5 KiB                                    main  [emitted]  main
                                     0.cc98107762fcc28532b3.js.map   5.73 KiB                                       0  [emitted]  
                                  vega.91b98e783d16fd1b9e23.js.map  251 bytes                                    vega  [emitted]  vega
vendors~@jupyter-widgets/controls~vega.3dd933b62461edbc58d8.js.map   15.6 KiB  vendors~@jupyter-widgets/controls~vega  [emitted]  vendors~@jupyter-widgets/controls~vega
                          vendors~main.44f7c1af2649541ea6c6.js.map   10.9 MiB                            vendors~main  [emitted]  vendors~main
                          vendors~vega.1e302f886dbe300cf0ef.js.map   2.19 MiB                            vendors~vega  [emitted]  vendors~vega
                                     1.055322dcf6c2bb19185f.js.map   1.04 MiB                                       1  [emitted]  
     vendors~@jupyter-widgets/controls.741a7524652a40694e8a.js.map    303 KiB       vendors~@jupyter-widgets/controls  [emitted]  vendors~@jupyter-widgets/controls
                                                        index.html   1.53 KiB                                          [emitted]  
Entrypoint main = vendors~main.44f7c1af2649541ea6c6.js vendors~main.44f7c1af2649541ea6c6.js.map main.5a992934528990838951.js main.5a992934528990838951.js.map
 [0] multi whatwg-fetch ./build/index.out.js 40 bytes {main} [built]
 [1] vertx (ignored) 15 bytes {main} [optional] [built]
 [4] buffer (ignored) 15 bytes {main} [optional] [built]
 [5] crypto (ignored) 15 bytes {main} [optional] [built]
 [6] readable-stream (ignored) 15 bytes {main} [built]
 [7] supports-color (ignored) 15 bytes {main} [built]
 [8] chalk (ignored) 15 bytes {main} [built]
 [9] fs (ignored) 15 bytes {main} [built]
[10] node-fetch (ignored) 15 bytes {vega} [built]
[11] fs (ignored) 15 bytes {vega} [built]
 [ANye] ./build/index.out.js 35.9 KiB {main} [built]
 [RnhZ] ./node_modules/moment/locale sync ^\.\/.*$ 2.88 KiB {main} [optional] [built]
 [YuTi] (webpack)/buildin/module.js 497 bytes {vendors~main} [built]
 [eTbV] ./node_modules/codemirror/mode sync ^\.\/.*\.js$ 2.78 KiB {0} [built]
 [yLpj] (webpack)/buildin/global.js 489 bytes {vendors~main} [built]
    + 2390 hidden modules

WARNING in jquery
  Multiple versions of jquery found:
    2.2.4 ./~/jupyter-matplotlib/~/jquery from ./~/jupyter-matplotlib\src\mpl_widget.js
    3.3.1 ./~/jquery from ./~/@jupyter-widgets\base\lib\widget.js


WARNING in vega-lite
  Multiple versions of vega-lite found:
    2.5.1 ./~/vega-lite\build\src from ./~/vega-lite\build\src\compile\selection\selection.js
    2.6.0 ./~/vega-lite\build from ./~/vega-lite\build\src\index.js

Check how you can resolve duplicate packages: 
https://github.com/darrenscerri/duplicate-package-checker-webpack-plugin#resolving-duplicate-packages-in-your-bundle

Child html-webpack-plugin for "index.html":
     1 asset
    Entrypoint undefined = index.html
    [KTNU] ./node_modules/html-loader!./templates/partial.html 567 bytes {0} [built]
    [YuTi] (webpack)/buildin/module.js 497 bytes {0} [built]
    [aS2v] ./node_modules/html-webpack-plugin/lib/loader.js!./templates/template.html 1.22 KiB {0} [built]
    [yLpj] (webpack)/buildin/global.js 489 bytes {0} [built]
        + 1 hidden module
Done in 48.33s.

Node v8.9.3

> C:\Users\sr876\AppData\Local\Continuum\anaconda3\npm.CMD pack @jupyter-widgets/jupyterlab-manager
Node v8.9.3

> node C:\Users\sr876\AppData\Local\Continuum\anaconda3\lib\site-packages\jupyterlab\staging\yarn.js install
> node C:\Users\sr876\AppData\Local\Continuum\anaconda3\lib\site-packages\jupyterlab\staging\yarn.js run build

jupyter-matplotlib-0.3.0.tgz
yarn install v1.9.4
info No lockfile found.
[1/5] Validating package.json...
[2/5] Resolving packages...
warning css-loader > cssnano > autoprefixer > browserslist@1.7.7: Browserslist 2 could fail on reading Browserslist >3.0 config used in other tools.
warning css-loader > cssnano > postcss-merge-rules > browserslist@1.7.7: Browserslist 2 could fail on reading Browserslist >3.0 config used in other tools.
warning css-loader > cssnano > postcss-merge-rules > caniuse-api > browserslist@1.7.7: Browserslist 2 could fail on reading Browserslist >3.0 config used in other tools.
[3/5] Fetching packages...
info fsevents@1.2.4: The platform "win32" is incompatible with this module.
info "fsevents@1.2.4" is an optional dependency and failed compatibility check. Excluding it from installation.
[4/5] Linking dependencies...
warning "@jupyterlab/vdom-extension > @nteract/transform-vdom@1.1.1" has incorrect peer dependency "react@^15.6.1".
[5/5] Building fresh packages...
success Saved lockfile.
Done in 107.61s.
yarn run v1.9.4
$ webpack
Hash: f1eac10162dd2d44bf4d
Version: webpack 4.12.2
Time: 36409ms
Built at: 2018-11-02 21:00:13
                                                             Asset       Size                                  Chunks             Chunk Names
                                         1.055322dcf6c2bb19185f.js    888 KiB                                       1  [emitted]  
                              674f50d287a8c48dc19ba404d20fe713.eot    162 KiB                                          [emitted]  
                              912ec66d7572ff821749319396470bde.svg    434 KiB                                          [emitted]  
                             fee66e712a8a08eef5805a46892932ad.woff   95.7 KiB                                          [emitted]  
                              b06871f281fee6b241d60582ae9369b9.ttf    162 KiB                                          [emitted]  
                                      main.5a992934528990838951.js   54.6 KiB                                    main  [emitted]  main
                                         0.cc98107762fcc28532b3.js    4.5 KiB                                       0  [emitted]  
                                      vega.91b98e783d16fd1b9e23.js  519 bytes                                    vega  [emitted]  vega
    vendors~@jupyter-widgets/controls~vega.3dd933b62461edbc58d8.js   22.6 KiB  vendors~@jupyter-widgets/controls~vega  [emitted]  vendors~@jupyter-widgets/controls~vega
                              vendors~main.44f7c1af2649541ea6c6.js   9.41 MiB                            vendors~main  [emitted]  vendors~main
                              vendors~vega.1e302f886dbe300cf0ef.js   2.76 MiB                            vendors~vega  [emitted]  vendors~vega
                            af7ae505a9eed503f8b8e6982036873e.woff2   75.4 KiB                                          [emitted]  
         vendors~@jupyter-widgets/controls.741a7524652a40694e8a.js    266 KiB       vendors~@jupyter-widgets/controls  [emitted]  vendors~@jupyter-widgets/controls
                                  main.5a992934528990838951.js.map   63.5 KiB                                    main  [emitted]  main
                                     0.cc98107762fcc28532b3.js.map   5.73 KiB                                       0  [emitted]  
                                  vega.91b98e783d16fd1b9e23.js.map  251 bytes                                    vega  [emitted]  vega
vendors~@jupyter-widgets/controls~vega.3dd933b62461edbc58d8.js.map   15.6 KiB  vendors~@jupyter-widgets/controls~vega  [emitted]  vendors~@jupyter-widgets/controls~vega
                          vendors~main.44f7c1af2649541ea6c6.js.map   10.9 MiB                            vendors~main  [emitted]  vendors~main
                          vendors~vega.1e302f886dbe300cf0ef.js.map   2.19 MiB                            vendors~vega  [emitted]  vendors~vega
                                     1.055322dcf6c2bb19185f.js.map   1.04 MiB                                       1  [emitted]  
     vendors~@jupyter-widgets/controls.741a7524652a40694e8a.js.map    303 KiB       vendors~@jupyter-widgets/controls  [emitted]  vendors~@jupyter-widgets/controls
                                                        index.html   1.53 KiB                                          [emitted]  
Entrypoint main = vendors~main.44f7c1af2649541ea6c6.js vendors~main.44f7c1af2649541ea6c6.js.map main.5a992934528990838951.js main.5a992934528990838951.js.map
 [0] multi whatwg-fetch ./build/index.out.js 40 bytes {main} [built]
 [1] vertx (ignored) 15 bytes {main} [optional] [built]
 [4] buffer (ignored) 15 bytes {main} [optional] [built]
 [5] crypto (ignored) 15 bytes {main} [optional] [built]
 [6] readable-stream (ignored) 15 bytes {main} [built]
 [7] supports-color (ignored) 15 bytes {main} [built]
 [8] chalk (ignored) 15 bytes {main} [built]
 [9] fs (ignored) 15 bytes {main} [built]
[10] node-fetch (ignored) 15 bytes {vega} [built]
[11] fs (ignored) 15 bytes {vega} [built]
 [ANye] ./build/index.out.js 35.9 KiB {main} [built]
 [RnhZ] ./node_modules/moment/locale sync ^\.\/.*$ 2.88 KiB {main} [optional] [built]
 [YuTi] (webpack)/buildin/module.js 497 bytes {vendors~main} [built]
 [eTbV] ./node_modules/codemirror/mode sync ^\.\/.*\.js$ 2.78 KiB {0} [built]
 [yLpj] (webpack)/buildin/global.js 489 bytes {vendors~main} [built]
    + 2390 hidden modules

WARNING in jquery
  Multiple versions of jquery found:
    2.2.4 ./~/jupyter-matplotlib/~/jquery from ./~/jupyter-matplotlib\src\mpl_widget.js
    3.3.1 ./~/jquery from ./~/@jupyter-widgets\base\lib\widget.js


WARNING in vega-lite
  Multiple versions of vega-lite found:
    2.5.1 ./~/vega-lite\build\src from ./~/vega-lite\build\src\compile\selection\selection.js
    2.6.0 ./~/vega-lite\build from ./~/vega-lite\build\src\index.js

Check how you can resolve duplicate packages: 
https://github.com/darrenscerri/duplicate-package-checker-webpack-plugin#resolving-duplicate-packages-in-your-bundle

Child html-webpack-plugin for "index.html":
     1 asset
    Entrypoint undefined = index.html
    [KTNU] ./node_modules/html-loader!./templates/partial.html 567 bytes {0} [built]
    [YuTi] (webpack)/buildin/module.js 497 bytes {0} [built]
    [aS2v] ./node_modules/html-webpack-plugin/lib/loader.js!./templates/template.html 1.22 KiB {0} [built]
    [yLpj] (webpack)/buildin/global.js 489 bytes {0} [built]
        + 1 hidden module
Done in 46.88s.

Node v8.9.3

> C:\Users\sr876\AppData\Local\Continuum\anaconda3\npm.CMD pack jupyter-matplotlib
Node v8.9.3

> node C:\Users\sr876\AppData\Local\Continuum\anaconda3\lib\site-packages\jupyterlab\staging\yarn.js install
> node C:\Users\sr876\AppData\Local\Continuum\anaconda3\lib\site-packages\jupyterlab\staging\yarn.js run build

Now enable widget-based plots:

In [1]:

%matplotlib widget

Finally, since this is a long document, the following extension adds a table-of-contents sidebar to the JupyterLab interface:

In [ ]:

!jupyter labextension install @jupyterlab/toc

Installing More R Packages¶

Now for some R packages that are not readily available in an Anaconda default install. They might be available through the conda-forge "channel" - however, at the time of writing, I cannot recommend this, as the performance of conda install is abysmal when using R packages from that repository.

When using a server-hosted notebook, some or all of these packages might already be installed.

In [16]:

%%R
install.packages(c("semPlot", "OpenMx", "semTools", "sem", "gpairs", "GGally"))

A note to avoid possible confusion: lavaan provides a function cfa as a convenience for confirmatory factor analysis. There is also an R package called cfa - however, that one is not related to SEM.

Importing and Configuring Python & R Packages¶

First, import all the basic packages in Python, such as pandas, numpy, matplotlib.

Also import seaborn for simple high-level plots with decent looks. It complements the default plotting provided by matplotlib. See here, for a useful brief overview of looking at data using some of seaborn's plot types: https://elitedatascience.com/python-seaborn-tutorial

In [4]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Next, some of the libraries that are statistics-oriented.

In [5]:

import statsmodels.api as sm
import statsmodels.formula.api as smf # the R-like interface for statsmodels
import statsmodels.graphics as smg
import sklearn

Now for loading R packages, we will be using.

In [6]:

%%R
library(lavaan)
library(semPlot)
library(OpenMx)
library(semTools)

In R, there are also several packages providing convenient high-level plots, such as generalized pairs plots.

In [7]:

%%R
library(GGally)
library(gpairs)

Configure Plotting¶

For the purpose of getting nice visual output, we will also set some defaults for plotting libraries.

In [8]:

defaultfigwidth, defaultfigheight = 10, 9

# set a slightly larger default size for plots. default dpi is 100
plt.rcParams['figure.figsize'] = [defaultfigwidth, defaultfigheight]

# enable seaborn's defaults for nicer plots overall:
sns.set(color_codes=True)

There is currently no built-in way to set default dimensions for plots in R. The %R magic command from the rpy2 library accepts width, height, and units parameters like this: %%R -w 10 -h 9 -u in -r 100 but it would be nice to set defaults.

Since this is python, there is a way around that using monkey-patching. Note that this is usually a Bad Idea(TM) and should be avoided if possible. It is also purely cosmetic for the purposes of this notebook, so it can be safely ignored. :)

In [9]:

# these are the defaults we want to set:
default_units = 'in' # inch, to make it more easily comparable to matpplotlib
default_res = 100 # dpi, same as default in matplotlib
default_width = 10
default_height = 9
# try monkey-patching a function in rpy2, so we effectively get these
# default settings for the width, height, and units arguments of the %R magic command
import rpy2
old_setup_graphics = rpy2.ipython.rmagic.RMagics.setup_graphics

def new_setup_graphics(self, args):
    if getattr(args, 'units') is not None:
        if args.units != default_units: # a different units argument was passed, do not apply defaults
            return old_setup_graphics(self, args)
    args.units = default_units
    if getattr(args, 'res') is None:
        args.res = default_res
    if getattr(args, 'width') is None:
        args.width = default_width
    if getattr(args, 'height') is None:
        args.height = default_height        
    return old_setup_graphics(self, args)

rpy2.ipython.rmagic.RMagics.setup_graphics = new_setup_graphics

Loading Data¶

We are borrowing example data from this excellent course offered at Harvard, S090A1: https://canvas.harvard.edu/courses/8737/pages/data

The actual data is from the Zambian Early Childhood Development Project. The full sample has more than 1600 Zambian six-year-olds, from a study led by Günther Fink and Stephanie Zuilkowski.

The data is in the proprietory stata format, so we first need to convert it and import it. We will use pandas.read_stata but this could also be accomplished in R with the foreign package. First, we write a helper function to download the data file if it is not in the current directly. Defining a function will allow us to re-use it later for other datasets.

In [10]:

import os.path
import urllib.request

def downloadIfMissing(filenameData, remoteLocation):
    '''Check if the file exists. If not, try downloading from remoteLocation.'''
    if not os.path.isfile(filenameData):
        with urllib.request.urlopen(remoteLocation) as response:
            with open(filenameData, 'xb') as destinationFile:
                destinationFile.write(response.read())

In [11]:

# make sure we have the small data file available in the current directory, if not, try to download it:
filenameSmallZambiaData = "S090_InClass_Zambia.dta"
downloadIfMissing(filenameSmallZambiaData, "https://canvas.harvard.edu/courses/8737/files/1839865/download")

In [12]:

# read the data into a pandas dataframe
smallZambiaDF = pd.read_stata(filenameSmallZambiaData)
len(smallZambiaDF) # should return 1613

Out[12]:

In [13]:

# make sure we have the full measurement data file available in the current directory, if not, try to download it:
filenameMeasureZambiaData = "S090_InClass_Zambia_Measurement.dta"
downloadIfMissing(filenameMeasureZambiaData, "https://canvas.harvard.edu/courses/8737/files/1994882/download")

In [14]:

# read the data into a pandas dataframe
measureZambiaDF = pd.read_stata(filenameMeasureZambiaData)
len(measureZambiaDF) # should return 1623

Out[14]:

Data Overview¶

We now have the data in a dataframe, let's get an overview of what kind of data we are dealing with.

Now let's have a look at the dataframe.

In [15]:

smallZambiaDF.head() # equivalent to [:5] i.e. first five entries

Out[15]:

	childid	male	urban	ece	reasoning	socemo	vocab	vocabsq	wealth	books
0	101	Male	Urban	ECE	3.0	1.05	18.0	324.0	3.0	No books in home
1	102	Male	Urban	No ECE	4.0	1.00	19.0	361.0	2.0	No books in home
2	103	Female	Urban	No ECE	4.0	1.80	19.0	361.0	3.0	No books in home
3	104	Male	Urban	No ECE	5.0	2.35	12.0	144.0	3.0	No books in home
4	105	Male	Urban	No ECE	5.0	1.50	25.0	625.0	2.0	Books in home

In [10]:

smallZambiaDF.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1613 entries, 0 to 1612
Data columns (total 10 columns):
childid      1613 non-null int16
male         1613 non-null category
urban        1613 non-null category
ece          1613 non-null category
reasoning    1613 non-null float32
socemo       1613 non-null float32
vocab        1613 non-null float32
vocabsq      1613 non-null float32
wealth       1613 non-null float32
books        1613 non-null category
dtypes: category(4), float32(5), int16(1)
memory usage: 53.9 KB

In [11]:

smallZambiaDF.describe(include='all', percentiles=[]) # describe categorical and numerical columns, don't bother with percentiles

Out[11]:

	childid	male	urban	ece	reasoning	socemo	vocab	vocabsq	wealth	books
count	1613.000000	1613	1613	1613	1613.000000	1613.000000	1613.000000	1613.000000	1613.000000	1613
unique	NaN	2	2	2	NaN	NaN	NaN	NaN	NaN	2
top	NaN	Female	Urban	No ECE	NaN	NaN	NaN	NaN	NaN	No books in home
freq	NaN	810	814	1110	NaN	NaN	NaN	NaN	NaN	1171
mean	4158.768754	NaN	NaN	NaN	4.451333	1.647789	21.475512	489.346558	2.957222	NaN
std	2376.382896	NaN	NaN	NaN	2.521065	0.450903	5.307204	216.253510	1.430792	NaN
min	101.000000	NaN	NaN	NaN	0.000000	0.444444	0.000000	0.000000	1.000000	NaN
50%	4312.000000	NaN	NaN	NaN	4.000000	1.611111	22.000000	484.000000	3.000000	NaN
max	8125.000000	NaN	NaN	NaN	10.000000	3.000000	30.000000	900.000000	5.000000	NaN

In [12]:

smallZambiaDF.male.describe()

Out[12]:

count       1613
unique         2
top       Female
freq         810
Name: male, dtype: object

In [13]:

smallZambiaDF.male.value_counts()

Out[13]:

Female    810
Male      803
Name: male, dtype: int64

In [14]:

smallZambiaDF.wealth.value_counts()

Out[14]:

1.0    356
4.0    340
2.0    311
5.0    307
3.0    299
Name: wealth, dtype: int64

In [16]:

# visually check relations between numeric variables
grid = sns.pairplot(smallZambiaDF, hue="ece", height=defaultfigheight/6, kind='scatter')
# we have to use an explicit height per facet-figure here, since a grid of figures doesn't follow the matplotlib default size

C:\Users\sr876\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\stats\stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval

In [17]:

grid = sns.pairplot(smallZambiaDF, hue="ece", height=defaultfigheight/6, kind='reg') # linear regressions on top of scatter

In [18]:

pd.scatter_matrix(smallZambiaDF)
plt.show()

C:\Users\sr876\AppData\Local\Continuum\anaconda3\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: pandas.scatter_matrix is deprecated, use pandas.plotting.scatter_matrix instead
  """Entry point for launching an IPython kernel.

In [19]:

plt.figure()
sns.swarmplot(x="male", y="socemo", data=smallZambiaDF)
plt.show()

In [17]:

%%R -i smallZambiaDF
pairs(smallZambiaDF)

C:\Users\sr876\AppData\Local\Continuum\anaconda3\lib\site-packages\rpy2-2.9.4-py3.7-win-amd64.egg\rpy2\robjects\pandas2ri.py:191: FutureWarning: from_items is deprecated. Please use DataFrame.from_dict(dict(items), ...) instead. DataFrame.from_dict(OrderedDict(items)) may be used to preserve the key order.
  res = PandasDataFrame.from_items(items)

In [18]:

%%R -i smallZambiaDF
ggpairs(smallZambiaDF) # from library GGally

In [33]:

%%R -i smallZambiaDF
gpairs(smallZambiaDF) # from library gpairs

In [20]:

# Calculate correlations
corr = smallZambiaDF.corr()
# Heatmap
plt.figure()
sns.heatmap(corr)
plt.show()

To Do: implement an equivalent to ggpairs in python, similar to scatter_matrix_all found here: http://photo.etangkk.com/Python/blog-04.asp but using seaborn's PairGrid functionality.

In [ ]:

What are our observed variables like? Continuous (e.g. age or height), ordinal (e.g. Likert items), or categorical (e.g. gender) or even dichotomous/binary (e.g. yes/no questions). Is validity (accurately captures the theoretical construct) and reliability (consistent results across time/rater/items) already established or just assumed?

In [ ]:

Data Cleanup and Checking Assumptions¶

SEM maximum likelihood estimator assumes normality of all endogenous variables, and implies they are continuous. But these assumptions can sometimes be relaxed.

TODO: how to check?

Simple Regression¶

First, we do a simple regression. We can use scikit-learn (more oriented towards machine learning, less support for detailed statistics output) or statsmodels (more statistics oriented, with R-like interface) or R.

In [74]:

result = smf.ols(formula='socemo ~ ece', data=smallZambiaDF).fit()

In [25]:

print(result.params)

Intercept     1.613317
ece[T.ECE]    0.110536
dtype: float64

In [26]:

print(result.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 socemo   R-squared:                       0.013
Model:                            OLS   Adj. R-squared:                  0.012
Method:                 Least Squares   F-statistic:                     21.06
Date:                Sun, 18 Nov 2018   Prob (F-statistic):           4.80e-06
Time:                        23:22:31   Log-Likelihood:                -993.01
No. Observations:                1613   AIC:                             1990.
Df Residuals:                    1611   BIC:                             2001.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.6133      0.013    119.945      0.000       1.587       1.640
ece[T.ECE]     0.1105      0.024      4.589      0.000       0.063       0.158
==============================================================================
Omnibus:                       23.699   Durbin-Watson:                   0.838
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               21.012
Skew:                           0.225   Prob(JB):                     2.74e-05
Kurtosis:                       2.667   Cond. No.                         2.42
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [75]:

fig = smg.regressionplots.abline_plot(model_results=result)
fig.axes[0].scatter(smallZambiaDF['ece'], smallZambiaDF['socemo'])

FigureCanvasNbAgg()

Out[75]:

<matplotlib.collections.PathCollection at 0x23e368b1cf8>

Since this is based on a categorical variable (ece), it is effectively a one-way ANOVA, so we can also print it as that.

In [27]:

table = sm.stats.anova_lm(result, typ=2) # Type 2 Anova DataFrame
print(table)

              sum_sq      df          F    PR(>F)
ece         4.229228     1.0  21.060384  0.000005
Residual  323.511946  1611.0        NaN       NaN

In [28]:

pd.get_dummies(smallZambiaDF, columns=['ece',]).head()

Out[28]:

	childid	male	urban	reasoning	socemo	vocab	vocabsq	wealth	books	ece_No ECE	ece_ECE
0	101	Male	Urban	3.0	1.05	18.0	324.0	3.0	No books in home	0	1
1	102	Male	Urban	4.0	1.00	19.0	361.0	2.0	No books in home	1	0
2	103	Female	Urban	4.0	1.80	19.0	361.0	3.0	No books in home	1	0
3	104	Male	Urban	5.0	2.35	12.0	144.0	3.0	No books in home	1	0
4	105	Male	Urban	5.0	1.50	25.0	625.0	2.0	Books in home	1	0

In [30]:

smallZambiaDF['ece'].str.get_dummies()['ECE'].head()

Out[30]:

0    1
1    0
2    0
3    0
4    0
Name: ECE, dtype: int64

In [31]:

smallZambiaDF['socemo'].shape, smallZambiaDF['ece'].str.get_dummies()[['ECE']].shape

Out[31]:

((1613,), (1613, 1))

In [32]:

linreg = sklearn.linear_model.LinearRegression()
linreg.fit(smallZambiaDF['ece'].str.get_dummies()[['ECE']], smallZambiaDF['socemo'])

Out[32]:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [33]:

print(linreg.intercept_, linreg.coef_)

1.613316750660673 [0.11053562]

In [61]:

sns.lmplot(x='ece_ECE', y='socemo', data=pd.get_dummies(smallZambiaDF), height=defaultfigheight)

FigureCanvasNbAgg()

Out[61]:

<seaborn.axisgrid.FacetGrid at 0x16c688009e8>

In [62]:

f = plt.figure()
residaxes = sns.residplot(x='ece_ECE', y='socemo', data=pd.get_dummies(smallZambiaDF))

FigureCanvasNbAgg()

Calculating and plotting the simple linear model with R:

In [30]:

%%R -i smallZambiaDF
model = lm(socemo ~ ece, data = smallZambiaDF)
summary(model)

Call:
lm(formula = socemo ~ ece, data = smallZambiaDF)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.16887 -0.32385 -0.02385  0.32615  1.38668 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.66858    0.01204 138.551  < 2e-16 ***
ece.L        0.07816    0.01703   4.589  4.8e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4481 on 1611 degrees of freedom
Multiple R-squared:  0.0129,	Adjusted R-squared:  0.01229 
F-statistic: 21.06 on 1 and 1611 DF,  p-value: 4.795e-06

In [39]:

%%R -i smallZambiaDF
plot(socemo ~ ece, data = smallZambiaDF)
abline(model)

In [40]:

%%R -i smallZambiaDF
xyplot(socemo ~ ece, data = smallZambiaDF)

Setting up the Model for SEM¶

For comparison let's use SEM to run the same linear regression model as above.

In [121]:

%%R -i smallZambiaDF
semfit <- sem('socemo ~ reasoning', data=smallZambiaDF, meanstructure=TRUE)
summary(semfit)

lavaan 0.6-2 ended normally after 11 iterations

  Optimization method                           NLMINB
  Number of free parameters                          3

  Number of observations                          1613

  Estimator                                         ML
  Model Fit Test Statistic                       0.000
  Degrees of freedom                                 0

Parameter Estimates:

  Information                                 Expected
  Information saturated (h1) model          Structured
  Standard Errors                             Standard

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)
  socemo ~                                            
    reasoning        -0.007    0.004   -1.672    0.094

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)
   .socemo            1.681    0.023   73.853    0.000

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
   .socemo            0.203    0.007   28.399    0.000

In [22]:

dummiedSmallZambiaDF = pd.get_dummies(smallZambiaDF, dtype=np.int8)
# We use int8 instead of uint8 for the dummy variables, because R has no unsigned integers,
# so we would get warnings when passing these new variables into R.
# Further: the new dummy columns use the actual value as part of the column name,
# and therefore might have spaces in the column name.
# this is not great for specifying models, so let us replace the spaces with underscores:
dummiedSmallZambiaDF.columns = dummiedSmallZambiaDF.columns.str.replace(' ', '_')

In [122]:

%%R -i dummiedSmallZambiaDF
semfit <- sem('socemo ~ ece_ECE', data=dummiedSmallZambiaDF, meanstructure=TRUE)
summary(semfit, standardized=TRUE)

lavaan 0.6-2 ended normally after 13 iterations

  Optimization method                           NLMINB
  Number of free parameters                          3

  Number of observations                          1613

  Estimator                                         ML
  Model Fit Test Statistic                       0.000
  Degrees of freedom                                 0
  Minimum Function Value               0.0000000000000

Parameter Estimates:

  Information                                 Expected
  Information saturated (h1) model          Structured
  Standard Errors                             Standard

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  socemo ~                                                              
    ece_ECE           0.111    0.024    4.592    0.000    0.111    0.114

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .socemo            1.613    0.013  120.020    0.000    1.613    3.579

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .socemo            0.201    0.007   28.399    0.000    0.201    0.987

In [125]:

%%R -i dummiedSmallZambiaDF
semfit <- sem('socemo ~ ece_ECE', data=dummiedSmallZambiaDF, estimator='MLMV', meanstructure=TRUE)
summary(semfit, standardized=TRUE)

lavaan 0.6-2 ended normally after 13 iterations

  Optimization method                           NLMINB
  Number of free parameters                          3

  Number of observations                          1613

  Estimator                                         ML      Robust
  Model Fit Test Statistic                       0.000       0.000
  Degrees of freedom                                 0           0
  Minimum Function Value               0.0000000000000
  Scaling correction factor                                     NA
  Shift parameter                                     
    for simple second-order correction (Mplus variant)

Parameter Estimates:

  Information                                 Expected
  Information saturated (h1) model          Structured
  Standard Errors                           Robust.sem

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  socemo ~                                                              
    ece_ECE           0.111    0.024    4.595    0.000    0.111    0.114

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .socemo            1.613    0.013  119.945    0.000    1.613    3.579

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .socemo            0.201    0.006   31.106    0.000    0.201    0.987

In [84]:

%%R -i dummiedSmallZambiaDF
semPaths(lm('socemo ~ ece_ECE', data=dummiedSmallZambiaDF), "std")

In [23]:

%%R -i dummiedSmallZambiaDF
bigmodel <- '
  socemo ~ ece_ECE + wealth
  reasoning ~ ece_ECE + wealth + books_Books_in_home
  socemo ~~ reasoning
'
semfit <- sem(bigmodel, data=dummiedSmallZambiaDF, estimator="MLMV", meanstructure=TRUE)
summary(semfit, standardized=TRUE)

C:\Users\sr876\AppData\Local\Continuum\anaconda3\lib\site-packages\rpy2-2.9.4-py3.7-win-amd64.egg\rpy2\robjects\pandas2ri.py:191: FutureWarning: from_items is deprecated. Please use DataFrame.from_dict(dict(items), ...) instead. DataFrame.from_dict(OrderedDict(items)) may be used to preserve the key order.
  res = PandasDataFrame.from_items(items)

lavaan 0.6-2 ended normally after 31 iterations

  Optimization method                           NLMINB
  Number of free parameters                         10

  Number of observations                          1613

  Estimator                                         ML      Robust
  Model Fit Test Statistic                       6.390       6.305
  Degrees of freedom                                 1           1
  P-value (Chi-square)                           0.011       0.012
  Scaling correction factor                                  1.014
  Shift parameter                                           -0.000
    for simple second-order correction (Mplus variant)

Parameter Estimates:

  Information                                 Expected
  Information saturated (h1) model          Structured
  Standard Errors                           Robust.sem

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  socemo ~                                                              
    ece_ECE           0.100    0.026    3.916    0.000    0.100    0.103
    wealth            0.009    0.008    1.161    0.246    0.009    0.029
  reasoning ~                                                           
    ece_ECE           0.741    0.151    4.922    0.000    0.741    0.136
    wealth            0.208    0.046    4.502    0.000    0.208    0.118
    books_Bks_n_hm    0.296    0.148    2.007    0.045    0.296    0.052

Covariances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
 .socemo ~~                                                             
   .reasoning        -0.079    0.030   -2.677    0.007   -0.079   -0.072

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .socemo            1.589    0.025   64.749    0.000    1.589    3.525
   .reasoning         3.523    0.140   25.208    0.000    3.523    1.398

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .socemo            0.200    0.006   31.148    0.000    0.200    0.986
   .reasoning         6.022    0.173   34.896    0.000    6.022    0.948

In [129]:

%%R
semPaths(semfit, "std")

In [25]:

%%R
semPaths(semfit, "std", curvePivot=TRUE, intercepts=FALSE, layout="tree2", rotation=2, nCharNodes=9, sizeMan=15,
         groups=list(c("wealth", "ece_ECE", "books_Books_in_home"), c("reasoning", "socemo")),
         color=c("lightblue", "purple"))

In [90]:

%%R
fitMeasures(semfit)

                         npar                          fmin 
                        8.000                         0.002 
                        chisq                            df 
                        6.390                         1.000 
                       pvalue                  chisq.scaled 
                        0.011                         6.305 
                    df.scaled                 pvalue.scaled 
                        1.000                         0.012 
         chisq.scaling.factor                baseline.chisq 
                        1.014                       123.037 
                  baseline.df               baseline.pvalue 
                        7.000                         0.000 
        baseline.chisq.scaled            baseline.df.scaled 
                      120.437                         7.000 
       baseline.pvalue.scaled baseline.chisq.scaling.factor 
                        0.000                         1.023 
                          cfi                           tli 
                        0.954                         0.675 
                         nnfi                           rfi 
                        0.675                         0.636 
                          nfi                          pnfi 
                        0.948                         0.135 
                          ifi                           rni 
                        0.956                         0.954 
                   cfi.scaled                    tli.scaled 
                        0.953                         0.673 
                   cfi.robust                    tli.robust 
                           NA                            NA 
                  nnfi.scaled                   nnfi.robust 
                        0.673                            NA 
                   rfi.scaled                    nfi.scaled 
                        0.634                         0.948 
                   ifi.scaled                    rni.scaled 
                        0.948                         0.953 
                   rni.robust                          logl 
                           NA                     -4724.918 
            unrestricted.logl                           aic 
                    -4721.722                      9465.835 
                          bic                        ntotal 
                     9508.922                      1613.000 
                         bic2                         rmsea 
                     9483.508                         0.058 
               rmsea.ci.lower                rmsea.ci.upper 
                        0.022                         0.104 
                 rmsea.pvalue                  rmsea.scaled 
                        0.302                         0.057 
        rmsea.ci.lower.scaled         rmsea.ci.upper.scaled 
                        0.021                         0.103 
          rmsea.pvalue.scaled                  rmsea.robust 
                        0.308                            NA 
        rmsea.ci.lower.robust         rmsea.ci.upper.robust 
                           NA                            NA 
          rmsea.pvalue.robust                           rmr 
                           NA                         0.004 
                   rmr_nomean                          srmr 
                        0.004                         0.016 
                 srmr_bentler           srmr_bentler_nomean 
                        0.016                         0.016 
                  srmr_bollen            srmr_bollen_nomean 
                        0.016                         0.016 
                   srmr_mplus             srmr_mplus_nomean 
                        0.016                         0.016 
                        cn_05                         cn_01 
                      970.642                      1675.747 
                          gfi                          agfi 
                        0.996                         0.941 
                         pgfi                           mfi 
                        0.066                         0.998 
                         ecvi 
                        0.014

In [91]:

%%R
fitMeasures(semfit, c("cfi", "tli", "rmsea", "srmr"))

  cfi   tli rmsea  srmr 
0.954 0.675 0.058 0.016

Now let's try to do the same with OpenMX. Here, we need to first drop variables that we are not actually using in the model, otherwise OpenMX will complain.

In [107]:

smallZambiaDFForOMX = dummiedSmallZambiaDF[["wealth", "ece_ECE", "books_Books_in_home", "reasoning", "socemo"]]

In [168]:

%%R -i smallZambiaDFForOMX
manifests <- names(smallZambiaDFForOMX)
model <- mxModel("zambia simple", type="RAM",
                manifestVars=manifests,
                mxPath(from=c("wealth", "ece_ECE"), to="socemo", arrows=1, free=TRUE),
                mxPath(from=c("wealth", "ece_ECE", "books_Books_in_home"), to="reasoning", arrows=1, free=TRUE),
                mxPath(from="socemo", to="reasoning", arrows=2),
                mxPath(from=manifests, arrows=2, values=.8), # needs starting values or it won't converge
                mxPath(from="one", to=manifests),
                mxData(observed=smallZambiaDFForOMX, type="raw"))
model

MxModel 'zambia simple' 
type : RAM 
$matrices : 'A', 'S', 'F', and 'M' 
$algebras : NULL 
$constraints : NULL 
$intervals : NULL 
$latentVars : none
$manifestVars : 'wealth', 'ece_ECE', 'books_Books_in_home', 'reasoning', and 'socemo' 
$data : 1613 x 5 
$data means : NA
$data type: 'raw' 
$submodels : NULL 
$expectation : MxExpectationRAM 
$fitfunction : MxFitFunctionML 
$compute : NULL 
$independent : FALSE 
$options :  
$output : FALSE

In [169]:

%%R
modelRun <- mxRun(model)

In [172]:

%%R
mxCheckIdentification(model)

$status
[1] TRUE

$jacobian
       zambia simple.A[4,1] zambia simple.A[5,1] zambia simple.A[4,2]
cov1_1                  0.0                  0.0                  0.0
cov2_1                  0.0                  0.0                  0.0
cov3_1                  0.0                  0.0                  0.0
cov4_1                  0.8                  0.0                  0.0
cov5_1                  0.0                  0.8                  0.0
cov2_2                  0.0                  0.0                  0.0
cov3_2                  0.0                  0.0                  0.0
cov4_2                  0.0                  0.0                  0.8
cov5_2                  0.0                  0.0                  0.0
cov3_3                  0.0                  0.0                  0.0
cov4_3                  0.0                  0.0                  0.0
cov5_3                  0.0                  0.0                  0.0
cov4_4                  0.0                  0.0                  0.0
cov5_4                  0.0                  0.0                  0.0
cov5_5                  0.0                  0.0                  0.0
mean1                   0.0                  0.0                  0.0
mean2                   0.0                  0.0                  0.0
mean3                   0.0                  0.0                  0.0
mean4                   0.0                  0.0                  0.0
mean5                   0.0                  0.0                  0.0
       zambia simple.A[5,2] zambia simple.A[4,3] zambia simple.S[1,1]
cov1_1                  0.0                  0.0                    1
cov2_1                  0.0                  0.0                    0
cov3_1                  0.0                  0.0                    0
cov4_1                  0.0                  0.0                    0
cov5_1                  0.0                  0.0                    0
cov2_2                  0.0                  0.0                    0
cov3_2                  0.0                  0.0                    0
cov4_2                  0.0                  0.0                    0
cov5_2                  0.8                  0.0                    0
cov3_3                  0.0                  0.0                    0
cov4_3                  0.0                  0.8                    0
cov5_3                  0.0                  0.0                    0
cov4_4                  0.0                  0.0                    0
cov5_4                  0.0                  0.0                    0
cov5_5                  0.0                  0.0                    0
mean1                   0.0                  0.0                    0
mean2                   0.0                  0.0                    0
mean3                   0.0                  0.0                    0
mean4                   0.0                  0.0                    0
mean5                   0.0                  0.0                    0
       zambia simple.S[2,2] zambia simple.S[3,3] zambia simple.S[4,4]
cov1_1                    0                    0                    0
cov2_1                    0                    0                    0
cov3_1                    0                    0                    0
cov4_1                    0                    0                    0
cov5_1                    0                    0                    0
cov2_2                    1                    0                    0
cov3_2                    0                    0                    0
cov4_2                    0                    0                    0
cov5_2                    0                    0                    0
cov3_3                    0                    1                    0
cov4_3                    0                    0                    0
cov5_3                    0                    0                    0
cov4_4                    0                    0                    1
cov5_4                    0                    0                    0
cov5_5                    0                    0                    0
mean1                     0                    0                    0
mean2                     0                    0                    0
mean3                     0                    0                    0
mean4                     0                    0                    0
mean5                     0                    0                    0
       zambia simple.S[4,5] zambia simple.S[5,5] zambia simple.M[1,1]
cov1_1                    0                    0                    0
cov2_1                    0                    0                    0
cov3_1                    0                    0                    0
cov4_1                    0                    0                    0
cov5_1                    0                    0                    0
cov2_2                    0                    0                    0
cov3_2                    0                    0                    0
cov4_2                    0                    0                    0
cov5_2                    0                    0                    0
cov3_3                    0                    0                    0
cov4_3                    0                    0                    0
cov5_3                    0                    0                    0
cov4_4                    0                    0                    0
cov5_4                    1                    0                    0
cov5_5                    0                    1                    0
mean1                     0                    0                    1
mean2                     0                    0                    0
mean3                     0                    0                    0
mean4                     0                    0                    0
mean5                     0                    0                    0
       zambia simple.M[1,2] zambia simple.M[1,3] zambia simple.M[1,4]
cov1_1                    0                    0                    0
cov2_1                    0                    0                    0
cov3_1                    0                    0                    0
cov4_1                    0                    0                    0
cov5_1                    0                    0                    0
cov2_2                    0                    0                    0
cov3_2                    0                    0                    0
cov4_2                    0                    0                    0
cov5_2                    0                    0                    0
cov3_3                    0                    0                    0
cov4_3                    0                    0                    0
cov5_3                    0                    0                    0
cov4_4                    0                    0                    0
cov5_4                    0                    0                    0
cov5_5                    0                    0                    0
mean1                     0                    0                    0
mean2                     1                    0                    0
mean3                     0                    1                    0
mean4                     0                    0                    1
mean5                     0                    0                    0
       zambia simple.M[1,5]
cov1_1                    0
cov2_1                    0
cov3_1                    0
cov4_1                    0
cov5_1                    0
cov2_2                    0
cov3_2                    0
cov4_2                    0
cov5_2                    0
cov3_3                    0
cov4_3                    0
cov5_3                    0
cov4_4                    0
cov5_4                    0
cov5_5                    0
mean1                     0
mean2                     0
mean3                     0
mean4                     0
mean5                     1

$non_identified_parameters
[1] "None"

In [171]:

%%R
summary(modelRun)

Summary of zambia simple 
 
free parameters:
                   name matrix                 row                 col
1  zambia simple.A[4,1]      A           reasoning              wealth
2  zambia simple.A[5,1]      A              socemo              wealth
3  zambia simple.A[4,2]      A           reasoning             ece_ECE
4  zambia simple.A[5,2]      A              socemo             ece_ECE
5  zambia simple.A[4,3]      A           reasoning books_Books_in_home
6  zambia simple.S[1,1]      S              wealth              wealth
7  zambia simple.S[2,2]      S             ece_ECE             ece_ECE
8  zambia simple.S[3,3]      S books_Books_in_home books_Books_in_home
9  zambia simple.S[4,4]      S           reasoning           reasoning
10 zambia simple.S[4,5]      S           reasoning              socemo
11 zambia simple.S[5,5]      S              socemo              socemo
12 zambia simple.M[1,1]      M                   1              wealth
13 zambia simple.M[1,2]      M                   1             ece_ECE
14 zambia simple.M[1,3]      M                   1 books_Books_in_home
15 zambia simple.M[1,4]      M                   1           reasoning
16 zambia simple.M[1,5]      M                   1              socemo
       Estimate   Std.Error A
1   0.208256249 0.045763284  
2   0.009257539 0.008323664  
3   0.741469780 0.144594365  
4   0.100492479 0.025700813  
5   0.296275774 0.142460515  
6   2.045906814 0.072042943  
7   0.214596238 0.007556487  
8   0.198934598 0.007005001  
9   6.021752810 0.212049316  
10 -0.079150620 0.027477928  
11  0.200411594 0.007057006  
12  2.957222635 0.035614481  
13  0.311841230 0.011534387  
14  0.274023513 0.011105512  
15  3.523063821 0.141593354  
16  1.589071982 0.025607614  

Model Statistics: 
               |  Parameters  |  Degrees of Freedom  |  Fit (-2lnL units)
       Model:             16                   8049              19249.93
   Saturated:             20                   8045                    NA
Independence:             10                   8055                    NA
Number of observations/statistics: 1613/8065

Information Criteria: 
      |  df Penalty  |  Parameters Penalty  |  Sample-Size Adjusted
AIC:       3151.934               19281.93                 19282.28
BIC:     -40198.781               19368.11                 19317.28
To get additional fit indices, see help(mxRefModels)
timestamp: 2018-11-20 01:04:09 
Wall clock time: 0.07181001 secs 
optimizer:  CSOLNP 
OpenMx version number: 2.9.9 
Need help?  See help(mxSummary)

In [114]:

%%R
semPaths(modelRun, "std")

In [173]:

%%R
semPaths(modelRun, "std", curvePivot=TRUE, intercepts=FALSE, layout="tree2", rotation=2, nCharNodes=9, sizeMan=15,
         groups=list(c("wealth", "ece_ECE", "books_Books_in_home"), c("reasoning", "socemo")),
         color=c("lightblue", "purple"))

There's also another package for doing SEM in R, calles sem, but it seems older and less well supported/documented.

A more complex model¶

Research Question: Do ECE participation and book availability fully explain the relations between wealth and children's socio-emotional & reasoning skills in the Zambian context?

In [196]:

%%R -i dummiedSmallZambiaDF
biggermodel <- '
  ece_ECE ~ wealth
  books_Books_in_home ~ wealth
  socemo ~ ece_ECE + books_Books_in_home
  reasoning ~ ece_ECE + books_Books_in_home
  socemo ~~ reasoning
'
semfit2 <- sem(biggermodel, data=dummiedSmallZambiaDF, estimator="MLR", meanstructure=TRUE)
# WLS would correspond to the adf method, but does not run to completion with this dataset
summary(semfit2, standardized=TRUE)

lavaan 0.6-2 ended normally after 35 iterations

  Optimization method                           NLMINB
  Number of free parameters                         15

  Number of observations                          1613

  Estimator                                         ML      Robust
  Model Fit Test Statistic                     105.787      98.505
  Degrees of freedom                                 3           3
  P-value (Chi-square)                           0.000       0.000
  Scaling correction factor                                  1.074
    for the Yuan-Bentler correction (Mplus variant)

Parameter Estimates:

  Information                                 Observed
  Observed information based on                Hessian
  Standard Errors                   Robust.huber.white

Regressions:
                        Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  ece_ECE ~                                                                  
    wealth                 0.114    0.008   14.928    0.000    0.114    0.351
  books_Books_in_home ~                                                      
    wealth                 0.051    0.008    6.560    0.000    0.051    0.162
  socemo ~                                                                   
    ece_ECE                0.093    0.025    3.721    0.000    0.093    0.096
    books_Bks_n_hm         0.067    0.026    2.588    0.010    0.067    0.067
  reasoning ~                                                                
    ece_ECE                0.961    0.143    6.730    0.000    0.961    0.177
    books_Bks_n_hm         0.320    0.148    2.159    0.031    0.320    0.057

Covariances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
 .socemo ~~                                                             
   .reasoning        -0.076    0.030   -2.574    0.010   -0.076   -0.069

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .ece_ECE          -0.025    0.023   -1.089    0.276   -0.025   -0.053
   .books_Bks_n_hm    0.124    0.024    5.262    0.000    0.124    0.279
   .socemo            1.600    0.014  112.146    0.000    1.600    3.555
   .reasoning         4.064    0.076   53.429    0.000    4.064    1.616

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .ece_ECE           0.188    0.005   39.133    0.000    0.188    0.877
   .books_Bks_n_hm    0.194    0.005   38.651    0.000    0.194    0.974
   .socemo            0.200    0.006   31.190    0.000    0.200    0.986
   .reasoning         6.099    0.175   34.873    0.000    6.099    0.964

In [183]:

%%R
fitMeasures(semfit2, c("cfi", "tli", "rmsea", "srmr"))

  cfi   tli rmsea  srmr 
0.882 0.608 0.110 0.046

In [179]:

%%R
semPaths(semfit2, "std", curvePivot=TRUE, intercepts=FALSE, layout="tree2", rotation=2, nCharNodes=9, sizeMan=15,
         groups=list(c("wealth", "ece_ECE", "books_Books_in_home"), c("reasoning", "socemo")),
         color=c("lightblue", "purple"))

High-level tools for SEM in R¶

simsem's semtools, see https://github.com/simsem/semTools/wiki/Functions

In [119]:

%%R
chisqSmallN(semfit)

naive.chisq    k-factor   adj.chisq          df      pvalue 
      0.000       0.999       0.000       0.000       1.000

bigsem, ctsem, gsem, lavaan.survey, regsem, semdiag

Model Fit¶

A caveat regarding fit measures: similar to the arbitraty cutoff for p-values indicating significance, the use of fit indices is controversial.

Usual indices:

chisquare, basic but can be unclear
RMSEA (root mean square error of approximation) based on chisquare but takes degrees of freedom (model parsimony) and sample size into account. Should be close to 0 ( < 0.05-0.08)
CFI/TLI (Bentler comparative fit index, Tucker Lewis index) measure relative fit vs an "independence model" (with covariances 0). Should be close to 1 ( > 0.90-0.95)
SRMR (standardized root mean square residual) measures the difference between observed and predicted correlations. Should be close to 0 ( < 0.06-0.08)

In [148]:

%%R
fitMeasures(semfit, c("cfi", "tli", "rmsea", "srmr"))

  cfi   tli rmsea  srmr 
0.954 0.675 0.058 0.013

Latent Variables and Mediation¶

A simpler model to show mediation.

In [200]:

%%R -i dummiedSmallZambiaDF
simplermodel <- '
  # Here we use a and b and c to keep track of the parameter estimations,
  # so that we can then use it to estimate the indirect and total effects.
  # ece is the supposed Mediator
  ece_ECE ~ a*wealth
  socemo ~ c*wealth + b*ece_ECE
  # also estimate indirect effect:
  indirect := a*b
  # total effect:
  total := c + (a*b)
'
semfit3 <- sem(simplermodel, data=dummiedSmallZambiaDF, estimator="MLR", meanstructure=TRUE)
summary(semfit3, standardized=TRUE)

lavaan 0.6-2 ended normally after 21 iterations

  Optimization method                           NLMINB
  Number of free parameters                          7

  Number of observations                          1613

  Estimator                                         ML      Robust
  Model Fit Test Statistic                       0.000       0.000
  Degrees of freedom                                 0           0
  Minimum Function Value               0.0000000000000
  Scaling correction factor                                     NA
    for the Yuan-Bentler correction (Mplus variant)

Parameter Estimates:

  Information                                 Observed
  Observed information based on                Hessian
  Standard Errors                   Robust.huber.white

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  ece_ECE ~                                                             
    wealth     (a)    0.114    0.008   14.928    0.000    0.114    0.351
  socemo ~                                                              
    wealth     (c)    0.009    0.008    1.166    0.244    0.009    0.029
    ece_ECE    (b)    0.100    0.026    3.920    0.000    0.100    0.103

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .ece_ECE          -0.025    0.023   -1.089    0.276   -0.025   -0.053
   .socemo            1.589    0.024   64.924    0.000    1.589    3.525

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .ece_ECE           0.188    0.005   39.133    0.000    0.188    0.877
   .socemo            0.200    0.006   31.147    0.000    0.200    0.986

Defined Parameters:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
    indirect          0.011    0.003    3.815    0.000    0.011    0.036
    total             0.021    0.007    2.772    0.006    0.021    0.066

In [201]:

%%R
fitMeasures(semfit3, c("cfi", "tli", "rmsea", "srmr"))

  cfi   tli rmsea  srmr 
    1     1     0     0

In [202]:

%%R
semPaths(semfit3, "std", curvePivot=TRUE, intercepts=FALSE, layout="tree2", rotation=2, nCharNodes=9, sizeMan=15,
         groups=list(c("wealth", "ece_ECE"), c("socemo")),
         color=c("lightblue", "purple"))

Other packages that should be useful for mediation analysis, for example to compare results: psych and MBESS, see here http://nickmichalak.com/blog_entries/2018/nrg01/nrg01.html

Comparing two different models¶

The semtools package is helpful for this.

In [206]:

%%R -i dummiedSmallZambiaDF
twostagesemfit1 <- sem.2stage(model=biggermodel, data=dummiedSmallZambiaDF, estimator="MLR", meanstructure=TRUE)
summary(twostagesemfit1, standardized=TRUE)

Chi-squared test(s) results, ADJUSTED for missing data:

Browne (1984) residual-based test statistic:

 chisq     df pvalue 
97.915  3.000  0.000 


Satorra-Bentler (2001) scaled test statistic:

   chisq.naive scaling.factor   chisq.scaled             df         pvalue 
       105.787          0.001     178767.164          3.000          0.000 


Chi-squared test results, UNADJUSTED for missing data:

lavaan 0.6-2 ended normally after 35 iterations

  Optimization method                           NLMINB
  Number of free parameters                         17

  Number of observations                          1613

  Estimator                                         ML
  Model Fit Test Statistic                     105.787
  Degrees of freedom                                 3
  P-value (Chi-square)                           0.000


Parameter Estimates, with SEs (and tests/CIs) ADJUSTED for missing data:

                   lhs op                 rhs    est    se       z pvalue
1              ece_ECE  ~              wealth  0.114 0.008  14.943  0.000
2  books_Books_in_home  ~              wealth  0.051 0.008   6.599  0.000
3               socemo  ~             ece_ECE  0.093 0.024   3.925  0.000
4               socemo  ~ books_Books_in_home  0.067 0.025   2.709  0.007
5            reasoning  ~             ece_ECE  0.961 0.131   7.315  0.000
6            reasoning  ~ books_Books_in_home  0.320 0.138   2.321  0.020
7               socemo ~~           reasoning -0.076 0.028  -2.706  0.007
8              ece_ECE ~~             ece_ECE  0.188 0.008  24.811  0.000
9  books_Books_in_home ~~ books_Books_in_home  0.194 0.008  24.842  0.000
10              socemo ~~              socemo  0.200 0.007  28.354  0.000
11           reasoning ~~           reasoning  6.099 0.223  27.342  0.000
12              wealth ~~              wealth  2.046 0.074  27.537  0.000
13             ece_ECE ~1                     -0.025 0.025  -0.988  0.323
14 books_Books_in_home ~1                      0.124 0.025   4.943  0.000
15              socemo ~1                      1.600 0.016 102.991  0.000
16           reasoning ~1                      4.064 0.086  47.308  0.000
17              wealth ~1                      2.957 0.036  83.034  0.000
   ci.lower ci.upper std.lv std.all std.nox
1     0.099    0.129  0.114   0.351   0.246
2     0.036    0.066  0.051   0.162   0.113
3     0.047    0.140  0.093   0.096   0.096
4     0.019    0.116  0.067   0.067   0.067
5     0.704    1.219  0.961   0.177   0.177
6     0.050    0.591  0.320   0.057   0.057
7    -0.131   -0.021 -0.076  -0.069  -0.069
8     0.173    0.203  0.188   0.877   0.877
9     0.178    0.209  0.194   0.974   0.974
10    0.186    0.214  0.200   0.986   0.986
11    5.662    6.537  6.099   0.964   0.964
12    1.900    2.192  2.046   1.000   2.046
13   -0.074    0.024 -0.025  -0.053  -0.053
14    0.075    0.174  0.124   0.279   0.279
15    1.570    1.631  1.600   3.555   3.555
16    3.895    4.232  4.064   1.616   1.616
17    2.887    3.027  2.957   2.067   2.957

In [207]:

%%R -i dummiedSmallZambiaDF
biggerermodel <- '
  ece_ECE ~ wealth
  books_Books_in_home ~ wealth
  socemo ~ wealth + ece_ECE + books_Books_in_home
  reasoning ~ wealth + ece_ECE + books_Books_in_home
  socemo ~~ reasoning
'
twostagesemfit2 <- sem.2stage(model=biggerermodel, data=dummiedSmallZambiaDF, estimator="MLR", meanstructure=TRUE)
summary(twostagesemfit2, standardized=TRUE)

Chi-squared test(s) results, ADJUSTED for missing data:

Browne (1984) residual-based test statistic:

 chisq     df pvalue 
77.505  1.000  0.000 


Satorra-Bentler (2001) scaled test statistic:

   chisq.naive scaling.factor   chisq.scaled             df         pvalue 
        83.543          0.001     160732.251          1.000          0.000 


Chi-squared test results, UNADJUSTED for missing data:

lavaan 0.6-2 ended normally after 38 iterations

  Optimization method                           NLMINB
  Number of free parameters                         19

  Number of observations                          1613

  Estimator                                         ML
  Model Fit Test Statistic                      83.543
  Degrees of freedom                                 1
  P-value (Chi-square)                           0.000


Parameter Estimates, with SEs (and tests/CIs) ADJUSTED for missing data:

                   lhs op                 rhs    est    se      z pvalue
1              ece_ECE  ~              wealth  0.114 0.008 15.072  0.000
2  books_Books_in_home  ~              wealth  0.051 0.008  6.602  0.000
3               socemo  ~              wealth  0.008 0.009  0.898  0.369
4               socemo  ~             ece_ECE  0.086 0.026  3.334  0.001
5               socemo  ~ books_Books_in_home  0.066 0.025  2.597  0.009
6            reasoning  ~              wealth  0.209 0.047  4.472  0.000
7            reasoning  ~             ece_ECE  0.747 0.141  5.305  0.000
8            reasoning  ~ books_Books_in_home  0.270 0.139  1.947  0.051
9               socemo ~~           reasoning -0.079 0.027 -2.881  0.004
10             ece_ECE ~~             ece_ECE  0.188 0.008 24.832  0.000
11 books_Books_in_home ~~ books_Books_in_home  0.194 0.008 24.832  0.000
12              socemo ~~              socemo  0.200 0.007 28.399  0.000
13           reasoning ~~           reasoning  6.022 0.212 28.399  0.000
14              wealth ~~              wealth  2.046 0.072 28.399  0.000
15             ece_ECE ~1                     -0.025 0.025 -0.995  0.320
16 books_Books_in_home ~1                      0.124 0.025  4.945  0.000
17              socemo ~1                      1.581 0.026 61.422  0.000
18           reasoning ~1                      3.526 0.141 24.952  0.000
19              wealth ~1                      2.957 0.036 83.034  0.000
   ci.lower ci.upper std.lv std.all std.nox
1     0.099    0.129  0.114   0.351   0.246
2     0.036    0.066  0.051   0.162   0.113
3    -0.009    0.024  0.008   0.024   0.017
4     0.035    0.136  0.086   0.088   0.088
5     0.016    0.115  0.066   0.065   0.065
6     0.117    0.300  0.209   0.119   0.083
7     0.471    1.023  0.747   0.138   0.138
8    -0.002    0.542  0.270   0.048   0.048
9    -0.132   -0.025 -0.079  -0.072  -0.072
10    0.173    0.203  0.188   0.877   0.877
11    0.178    0.209  0.194   0.974   0.974
12    0.186    0.213  0.200   0.985   0.985
13    5.606    6.437  6.022   0.951   0.951
14    1.905    2.187  2.046   1.000   2.046
15   -0.073    0.024 -0.025  -0.053  -0.053
16    0.075    0.174  0.124   0.279   0.279
17    1.530    1.631  1.581   3.511   3.511
18    3.249    3.803  3.526   1.401   1.401
19    2.887    3.027  2.957   2.067   2.957

In [209]:

%%R
anova(twostagesemfit1, twostagesemfit2) # equivalent to "likelihood-ratio test"

Difference test for Browne (1984) residual-based statistics:

 chisq     df pvalue 
20.411  2.000  0.000 


Satorra-Bentler (2001) scaled difference test:

   chisq.naive scaling.factor   chisq.scaled             DF         pvalue 
        22.244          0.001      35434.111          2.000          0.000

In [210]:

%%R
miPowerFit(semfit2)

                   lhs op                 rhs group
12              wealth ~~              wealth     1
18             ece_ECE ~~ books_Books_in_home     1
19             ece_ECE ~~              socemo     1
20             ece_ECE ~~           reasoning     1
21 books_Books_in_home ~~              socemo     1
22 books_Books_in_home ~~           reasoning     1
23             ece_ECE  ~ books_Books_in_home     1
24             ece_ECE  ~              socemo     1
25             ece_ECE  ~           reasoning     1
26 books_Books_in_home  ~             ece_ECE     1
27 books_Books_in_home  ~              socemo     1
28 books_Books_in_home  ~           reasoning     1
30              socemo  ~              wealth     1
32           reasoning  ~              wealth     1
33              wealth  ~             ece_ECE     1
34              wealth  ~ books_Books_in_home     1
35              wealth  ~              socemo     1
36              wealth  ~           reasoning     1
                                     mi                   epc target.epc
12  0.000000000000000000000000000003003 -0.000000000000001431    0.20459
18 81.416605542162955089224851690232754  0.042884997752392018    0.02066
19  1.544749222765177121274859928234946 -0.017232019584356661    0.02085
20 21.625988933431568028709079953841865 -0.356307742032899999    0.11651
21  1.544738483979728682626841873570811 -0.039922711700388476    0.02008
22 21.626245636034973074401932535693049 -0.825492627898089082    0.11218
23 81.416625996730118686173227615654469  0.221397869134911884    0.10386
24  0.619563202904253484959440356760751  0.053795614873202932    0.10291
25  9.314088537604437334493923117406666 -0.037919599570400378    0.01842
26 81.416636979079456182262219954282045  0.227984502302336800    0.09628
27 14.613812036635893676361774851102382  0.529474396339441045    0.09908
28 13.254150786988219579143333248794079  0.071090656370003724    0.01773
30  1.544754756539778339785584648780059  0.010424279832891337    0.03147
32 21.626012906364280752313788980245590  0.215543337637906313    0.17583
33  0.000000005867903821677934663066101  0.000135862687082167    0.30877
34  0.000000008669718146489592254175238  0.000371547803528799    0.32069
35  0.855398203706040693994339108030545  0.079649234055344187    0.31774
36 20.934768677467971542682789731770754  0.071299881188622113    0.05687
                  std.epc std.target.epc significant.mi high.power decision.pow
12 -0.0000000000000006995            0.1          FALSE      FALSE            I
18  0.2075578170512830112            0.1           TRUE       TRUE        EPC:M
19 -0.0826332446876676813            0.1          FALSE      FALSE            I
20 -0.3058234554311530795            0.1           TRUE      FALSE            M
21 -0.1988357948464972713            0.1          FALSE      FALSE            I
22 -0.7358929489158683168            0.1           TRUE      FALSE            M
23  0.2131657988650298707            0.1           TRUE       TRUE        EPC:M
24  0.0522764444171291212            0.1          FALSE      FALSE            I
25 -0.2058713928429005968            0.1           TRUE      FALSE            M
26  0.2367888435868654440            0.1           TRUE       TRUE        EPC:M
27  0.5343921132592205359            0.1           TRUE      FALSE            M
28  0.4008673589843934626            0.1           TRUE      FALSE            M
30  0.0331221683287136609            0.1          FALSE       TRUE           NM
32  0.1225841182785387951            0.1           TRUE       TRUE        EPC:M
33  0.0000440015864972028            0.1          FALSE      FALSE            I
34  0.0001158582390606253            0.1          FALSE      FALSE            I
35  0.0250673777159869729            0.1          FALSE       TRUE           NM
36  0.1253687229667201364            0.1           TRUE       TRUE        EPC:M
     se.epc lower.epc upper.epc lower.std.epc upper.std.epc decision.ci
12 0.825817 -1.358348  1.358348      -0.66393       0.66393           I
18 0.004753  0.035067  0.050703       0.16972       0.24539           M
19 0.013865 -0.040037  0.005573      -0.19199       0.02673           I
20 0.076619 -0.482335 -0.230280      -0.41399      -0.19765           M
21 0.032121 -0.092757  0.012912      -0.46198       0.06431           I
22 0.177510 -1.117470 -0.533515      -0.99618      -0.47561           M
23 0.024537  0.181039  0.261757       0.17431       0.25202           M
24 0.068345 -0.058621  0.166212      -0.05697       0.16152           I
25 0.012425 -0.058357 -0.017482      -0.31683      -0.09491           I
26 0.025267  0.186424  0.269545       0.19362       0.27995           M
27 0.138504  0.301655  0.757294       0.30446       0.76433           M
28 0.019527  0.038972  0.103210       0.21975       0.58198           M
30 0.008387 -0.003371  0.024220      -0.01071       0.07696          NM
32 0.046350  0.139305  0.291782       0.07923       0.16594           I
33 1.773612 -2.917197  2.917469      -0.94479       0.94487           I
34 3.990361 -6.563189  6.563932      -2.04657       2.04680           I
35 0.086119 -0.062003  0.221302      -0.01951       0.06965          NM
36 0.015583  0.045668  0.096932       0.08030       0.17044           I

Multi-group analysis¶

Compare a model across groups in the data, for example male vs. female. See http://lavaan.ugent.be/tutorial/groups.html for how to specify path constraints across groups.

In [212]:

%%R -i dummiedSmallZambiaDF
multigroupsemfit <- sem(model=biggermodel, data=dummiedSmallZambiaDF, estimator="MLR", meanstructure=TRUE,
                        group="male_Male")
summary(multigroupsemfit, standardized=TRUE)

lavaan 0.6-2 ended normally after 66 iterations

  Optimization method                           NLMINB
  Number of free parameters                         30

  Number of observations per group         
  1                                                803
  0                                                810

  Estimator                                         ML      Robust
  Model Fit Test Statistic                     112.355     104.373
  Degrees of freedom                                 6           6
  P-value (Chi-square)                           0.000       0.000
  Scaling correction factor                                  1.076
    for the Yuan-Bentler correction (Mplus variant)

Chi-square for each group:

  1                                             70.945      65.905
  0                                             41.410      38.468

Parameter Estimates:

  Information                                 Observed
  Observed information based on                Hessian
  Standard Errors                   Robust.huber.white


Group 1 [1]:

Regressions:
                        Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  ece_ECE ~                                                                  
    wealth                 0.118    0.011   11.214    0.000    0.118    0.372
  books_Books_in_home ~                                                      
    wealth                 0.038    0.011    3.490    0.000    0.038    0.126
  socemo ~                                                                   
    ece_ECE                0.076    0.036    2.104    0.035    0.076    0.078
    books_Bks_n_hm         0.050    0.038    1.321    0.186    0.050    0.049
  reasoning ~                                                                
    ece_ECE                0.862    0.207    4.165    0.000    0.862    0.155
    books_Bks_n_hm         0.353    0.220    1.604    0.109    0.353    0.061

Covariances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
 .socemo ~~                                                             
   .reasoning        -0.079    0.042   -1.891    0.059   -0.079   -0.070

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .ece_ECE          -0.038    0.031   -1.205    0.228   -0.038   -0.082
   .books_Bks_n_hm    0.152    0.034    4.508    0.000    0.152    0.346
   .socemo            1.567    0.020   78.464    0.000    1.567    3.497
   .reasoning         4.155    0.110   37.657    0.000    4.155    1.622

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .ece_ECE           0.184    0.007   26.400    0.000    0.184    0.861
   .books_Bks_n_hm    0.191    0.007   26.100    0.000    0.191    0.984
   .socemo            0.199    0.009   21.330    0.000    0.199    0.991
   .reasoning         6.375    0.257   24.797    0.000    6.375    0.971


Group 2 [0]:

Regressions:
                        Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  ece_ECE ~                                                                  
    wealth                 0.109    0.011    9.902    0.000    0.109    0.330
  books_Books_in_home ~                                                      
    wealth                 0.064    0.011    5.859    0.000    0.064    0.199
  socemo ~                                                                   
    ece_ECE                0.111    0.035    3.208    0.001    0.111    0.115
    books_Bks_n_hm         0.081    0.036    2.257    0.024    0.081    0.081
  reasoning ~                                                                
    ece_ECE                1.056    0.198    5.347    0.000    1.056    0.199
    books_Bks_n_hm         0.302    0.201    1.503    0.133    0.302    0.055

Covariances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
 .socemo ~~                                                             
   .reasoning        -0.068    0.041   -1.667    0.096   -0.068   -0.064

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .ece_ECE          -0.010    0.033   -0.320    0.749   -0.010   -0.023
   .books_Bks_n_hm    0.094    0.033    2.855    0.004    0.094    0.208
   .socemo            1.634    0.020   80.741    0.000    1.634    3.647
   .reasoning         3.971    0.104   38.032    0.000    3.971    1.610

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .ece_ECE           0.192    0.007   28.994    0.000    0.192    0.891
   .books_Bks_n_hm    0.195    0.007   28.525    0.000    0.195    0.961
   .socemo            0.196    0.009   22.897    0.000    0.196    0.979
   .reasoning         5.813    0.235   24.713    0.000    5.813    0.956

Measurement Models, Confirmatory Factor Analysis¶

SEM can be used not only to relate measured variables, as above, but also to relate indicators (items, or full "scales" that combine items) to theorized "latent" variables that cannot be directly observed.

We use the extended version of the zambia dataset here which includes several more variables that are directly observed items. Initially, we'll use 5 items each for a socioemotional scale and a task-orientation scale. Each item is rated on a scale from 1 (never) to 4 (always).

In [215]:

measureZambiaDF.head()

Out[215]:

	childid	male	urban	reas1	reas2	reas3	reas4	reas5	reas6	reas7	...	to12	to13	ece	reasoning	socemo	vocab	vocabsq	taskorient	wealth	books
0	101	Male	Urban	Correctly answered	Correctly answered	Correctly answered	Incorrect	Incorrect	Incorrect	Incorrect	...	2.0	4.0	ECE	3.0	1.05	18.0	324.0	3.615385	3.0	No books in home
1	102	Male	Urban	Correctly answered	Incorrect	Correctly answered	Correctly answered	Incorrect	Incorrect	Correctly answered	...	4.0	4.0	No ECE	4.0	1.00	19.0	361.0	3.769231	2.0	No books in home
2	103	Female	Urban	Correctly answered	Correctly answered	Correctly answered	Incorrect	Incorrect	Correctly answered	Incorrect	...	4.0	4.0	No ECE	4.0	1.80	19.0	361.0	3.769231	3.0	No books in home
3	104	Male	Urban	Correctly answered	Correctly answered	Correctly answered	Incorrect	Incorrect	Incorrect	Correctly answered	...	4.0	3.0	No ECE	5.0	2.35	12.0	144.0	3.692308	3.0	No books in home
4	105	Male	Urban	Correctly answered	Correctly answered	Correctly answered	Incorrect	Incorrect	Correctly answered	Incorrect	...	4.0	3.0	No ECE	5.0	1.50	25.0	625.0	3.461539	2.0	Books in home

5 rows × 84 columns

In [219]:

measureZambiaDF.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1623 entries, 0 to 1622
Data columns (total 84 columns):
childid       1623 non-null int16
male          1623 non-null category
urban         1623 non-null category
reas1         1623 non-null category
reas2         1623 non-null category
reas3         1623 non-null category
reas4         1623 non-null category
reas5         1623 non-null category
reas6         1623 non-null category
reas7         1623 non-null category
reas8         1623 non-null category
reas9         1623 non-null category
reas10        1623 non-null category
se1           1551 non-null category
se2           1461 non-null category
se3           1512 non-null category
se4           1587 non-null category
se5           1482 non-null category
se6           1539 non-null category
se7           1562 non-null category
se8           1567 non-null category
se9           1550 non-null category
se10          1569 non-null category
se11          1571 non-null category
se12          1516 non-null category
se13          1560 non-null category
se14          1532 non-null category
se15          1559 non-null category
se16          1476 non-null category
se17          1581 non-null category
se18          1484 non-null category
se19          1573 non-null category
se20          1569 non-null category
voc1          1623 non-null float32
voc2          1623 non-null float32
voc3          1623 non-null float32
voc4          1623 non-null float32
voc5          1623 non-null float32
voc6          1623 non-null float32
voc7          1623 non-null float32
voc8          1623 non-null float32
voc9          1623 non-null float32
voc10         1623 non-null float32
voc11         1623 non-null float32
voc12         1623 non-null float32
voc13         1623 non-null float32
voc14         1623 non-null float32
voc15         1623 non-null float32
voc16         1623 non-null float32
voc17         1623 non-null float32
voc18         1623 non-null float32
voc19         1623 non-null float32
voc20         1623 non-null float32
voc21         1623 non-null float32
voc22         1623 non-null float32
voc23         1623 non-null float32
voc24         1623 non-null float32
voc25         1623 non-null float32
voc26         1623 non-null float32
voc27         1623 non-null float32
voc28         1623 non-null float32
voc29         1623 non-null float32
voc30         1623 non-null float32
to1           1589 non-null float32
to2           1585 non-null float32
to3           1573 non-null float32
to4           1587 non-null float32
to5           1588 non-null float32
to6           1591 non-null float32
to7           1563 non-null float32
to8           1572 non-null float32
to9           1552 non-null float32
to10          1551 non-null float32
to11          1547 non-null float32
to12          1548 non-null float32
to13          1566 non-null float32
ece           1623 non-null category
reasoning     1623 non-null float32
socemo        1613 non-null float32
vocab         1623 non-null float32
vocabsq       1623 non-null float32
taskorient    1597 non-null float32
wealth        1623 non-null float32
books         1623 non-null category
dtypes: category(34), float32(49), int16(1)
memory usage: 385.5 KB

In [224]:

measureZambiaDF.describe(include="all")[["se1", "se2", "to1", "to2"]]

Out[224]:

	se1	se2	to1	to2
count	1551	1461	1589.000000	1585.000000
unique	4	4	NaN	NaN
top	Sometimes	Sometimes	NaN	NaN
freq	648	704	NaN	NaN
mean	NaN	NaN	3.111391	3.076341
std	NaN	NaN	0.971020	0.950069
min	NaN	NaN	1.000000	1.000000
25%	NaN	NaN	2.000000	3.000000
50%	NaN	NaN	3.000000	3.000000
75%	NaN	NaN	4.000000	4.000000
max	NaN	NaN	4.000000	4.000000

The se variables are not yet numerically coded, so we have to fix that, and prepare dummy-coding if needed.

In [226]:

measureZambiaDF["se1"].unique()

Out[226]:

[Never, Always, Sometimes, Usually, NaN]
Categories (4, object): [Never < Sometimes < Usually < Always]

In [26]:

for colnum in range(1,6):
    measureZambiaDF["se" + str(colnum) + "c"] = measureZambiaDF["se" + str(colnum)].cat.codes

In [27]:

measureZambiaDF["se1c"].unique()

Out[27]:

array([ 0,  3,  1,  2, -1], dtype=int64)

Let's check correlations to make sure our assumptions, i.e. that indicators within a factor are correlated, is correct.

In [28]:

# Calculate correlations
corr = measureZambiaDF[["se1c", "se2c", "se3c", "se4c", "se5c", "to1", "to2", "to3", "to4", "to5", "socemo", "taskorient"]].corr()
# Heatmap
plt.figure()
sns.heatmap(corr)
plt.show()

In [234]:

corr

Out[234]:

	se1c	se2c	se3c	se4c	se5c	to1	to2	to3	to4	to5	socemo	taskorient
se1c	1.000000	0.448828	0.397943	0.315406	0.299793	0.057045	0.055181	0.092964	0.057652	0.004773	0.529947	0.048885
se2c	0.448828	1.000000	0.362351	0.215676	0.299680	0.034002	0.053028	0.043206	-0.022127	-0.008994	0.386499	0.006681
se3c	0.397943	0.362351	1.000000	0.317611	0.332186	0.041432	0.049954	0.101046	0.055205	0.030859	0.525709	0.045439
se4c	0.315406	0.215676	0.317611	1.000000	0.283498	0.033503	0.065899	0.055018	0.037609	0.045378	0.501649	0.027030
se5c	0.299793	0.299680	0.332186	0.283498	1.000000	0.037358	0.061677	0.090891	0.032899	0.062674	0.521056	0.053759
to1	0.057045	0.034002	0.041432	0.033503	0.037358	1.000000	0.689873	0.579799	0.443605	0.475216	0.086210	0.776598
to2	0.055181	0.053028	0.049954	0.065899	0.061677	0.689873	1.000000	0.604540	0.448777	0.448261	0.124137	0.759687
to3	0.092964	0.043206	0.101046	0.055018	0.090891	0.579799	0.604540	1.000000	0.444270	0.430099	0.151745	0.725361
to4	0.057652	-0.022127	0.055205	0.037609	0.032899	0.443605	0.448777	0.444270	1.000000	0.518978	0.106936	0.657686
to5	0.004773	-0.008994	0.030859	0.045378	0.062674	0.475216	0.448261	0.430099	0.518978	1.000000	0.108867	0.673800
socemo	0.529947	0.386499	0.525709	0.501649	0.521056	0.086210	0.124137	0.151745	0.106936	0.108867	1.000000	0.115544
taskorient	0.048885	0.006681	0.045439	0.027030	0.053759	0.776598	0.759687	0.725361	0.657686	0.673800	0.115544	1.000000

In [ ]:

In [29]:

%%R -i measureZambiaDF
model <- '
    to_latent  =~ to1 + to2 + to3 + to4 + to5 
    se_latent  =~ se1c + se2c + se3c + se4c + se5c 
'
fit <- cfa(model, data=measureZambiaDF)
summary(fit, fit.measures=TRUE)

C:\Users\sr876\AppData\Local\Continuum\anaconda3\lib\site-packages\rpy2-2.9.4-py3.7-win-amd64.egg\rpy2\robjects\pandas2ri.py:67: UserWarning: Error while trying to convert the column "se1". Fall back to string conversion. The error is: Error in as.character.factor(x) : malformed factor

  (name, str(e)))
C:\Users\sr876\AppData\Local\Continuum\anaconda3\lib\site-packages\rpy2-2.9.4-py3.7-win-amd64.egg\rpy2\robjects\pandas2ri.py:67: UserWarning: Error while trying to convert the column "se2". Fall back to string conversion. The error is: Error in as.character.factor(x) : malformed factor

  (name, str(e)))
C:\Users\sr876\AppData\Local\Continuum\anaconda3\lib\site-packages\rpy2-2.9.4-py3.7-win-amd64.egg\rpy2\robjects\pandas2ri.py:67: UserWarning: Error while trying to convert the column "se3". Fall back to string conversion. The error is: Error in as.character.factor(x) : malformed factor

  (name, str(e)))
C:\Users\sr876\AppData\Local\Continuum\anaconda3\lib\site-packages\rpy2-2.9.4-py3.7-win-amd64.egg\rpy2\robjects\pandas2ri.py:67: UserWarning: Error while trying to convert the column "se4". Fall back to string conversion. The error is: Error in as.character.factor(x) : malformed factor

  (name, str(e)))
C:\Users\sr876\AppData\Local\Continuum\anaconda3\lib\site-packages\rpy2-2.9.4-py3.7-win-amd64.egg\rpy2\robjects\pandas2ri.py:67: UserWarning: Error while trying to convert the column "se5". Fall back to string conversion. The error is: Converting pandas "Category" series to R factor is only possible when categories are strings.
  (name, str(e)))
C:\Users\sr876\AppData\Local\Continuum\anaconda3\lib\site-packages\rpy2-2.9.4-py3.7-win-amd64.egg\rpy2\robjects\pandas2ri.py:67: UserWarning: Error while trying to convert the column "se6". Fall back to string conversion. The error is: Error in as.character.factor(x) : malformed factor

  (name, str(e)))
C:\Users\sr876\AppData\Local\Continuum\anaconda3\lib\site-packages\rpy2-2.9.4-py3.7-win-amd64.egg\rpy2\robjects\pandas2ri.py:67: UserWarning: Error while trying to convert the column "se7". Fall back to string conversion. The error is: Error in as.character.factor(x) : malformed factor

  (name, str(e)))
C:\Users\sr876\AppData\Local\Continuum\anaconda3\lib\site-packages\rpy2-2.9.4-py3.7-win-amd64.egg\rpy2\robjects\pandas2ri.py:67: UserWarning: Error while trying to convert the column "se8". Fall back to string conversion. The error is: Error in as.character.factor(x) : malformed factor

  (name, str(e)))
C:\Users\sr876\AppData\Local\Continuum\anaconda3\lib\site-packages\rpy2-2.9.4-py3.7-win-amd64.egg\rpy2\robjects\pandas2ri.py:67: UserWarning: Error while trying to convert the column "se9". Fall back to string conversion. The error is: Error in as.character.factor(x) : malformed factor

  (name, str(e)))
C:\Users\sr876\AppData\Local\Continuum\anaconda3\lib\site-packages\rpy2-2.9.4-py3.7-win-amd64.egg\rpy2\robjects\pandas2ri.py:67: UserWarning: Error while trying to convert the column "se10". Fall back to string conversion. The error is: Error in as.character.factor(x) : malformed factor

  (name, str(e)))
C:\Users\sr876\AppData\Local\Continuum\anaconda3\lib\site-packages\rpy2-2.9.4-py3.7-win-amd64.egg\rpy2\robjects\pandas2ri.py:67: UserWarning: Error while trying to convert the column "se11". Fall back to string conversion. The error is: Error in as.character.factor(x) : malformed factor

  (name, str(e)))
C:\Users\sr876\AppData\Local\Continuum\anaconda3\lib\site-packages\rpy2-2.9.4-py3.7-win-amd64.egg\rpy2\robjects\pandas2ri.py:67: UserWarning: Error while trying to convert the column "se12". Fall back to string conversion. The error is: Error in as.character.factor(x) : malformed factor

  (name, str(e)))
C:\Users\sr876\AppData\Local\Continuum\anaconda3\lib\site-packages\rpy2-2.9.4-py3.7-win-amd64.egg\rpy2\robjects\pandas2ri.py:67: UserWarning: Error while trying to convert the column "se13". Fall back to string conversion. The error is: Error in as.character.factor(x) : malformed factor

  (name, str(e)))
C:\Users\sr876\AppData\Local\Continuum\anaconda3\lib\site-packages\rpy2-2.9.4-py3.7-win-amd64.egg\rpy2\robjects\pandas2ri.py:67: UserWarning: Error while trying to convert the column "se14". Fall back to string conversion. The error is: Error in as.character.factor(x) : malformed factor

  (name, str(e)))
C:\Users\sr876\AppData\Local\Continuum\anaconda3\lib\site-packages\rpy2-2.9.4-py3.7-win-amd64.egg\rpy2\robjects\pandas2ri.py:67: UserWarning: Error while trying to convert the column "se15". Fall back to string conversion. The error is: Error in as.character.factor(x) : malformed factor

  (name, str(e)))
C:\Users\sr876\AppData\Local\Continuum\anaconda3\lib\site-packages\rpy2-2.9.4-py3.7-win-amd64.egg\rpy2\robjects\pandas2ri.py:67: UserWarning: Error while trying to convert the column "se16". Fall back to string conversion. The error is: Error in as.character.factor(x) : malformed factor

  (name, str(e)))
C:\Users\sr876\AppData\Local\Continuum\anaconda3\lib\site-packages\rpy2-2.9.4-py3.7-win-amd64.egg\rpy2\robjects\pandas2ri.py:67: UserWarning: Error while trying to convert the column "se17". Fall back to string conversion. The error is: Error in as.character.factor(x) : malformed factor

  (name, str(e)))
C:\Users\sr876\AppData\Local\Continuum\anaconda3\lib\site-packages\rpy2-2.9.4-py3.7-win-amd64.egg\rpy2\robjects\pandas2ri.py:67: UserWarning: Error while trying to convert the column "se18". Fall back to string conversion. The error is: Error in as.character.factor(x) : malformed factor

  (name, str(e)))
C:\Users\sr876\AppData\Local\Continuum\anaconda3\lib\site-packages\rpy2-2.9.4-py3.7-win-amd64.egg\rpy2\robjects\pandas2ri.py:67: UserWarning: Error while trying to convert the column "se19". Fall back to string conversion. The error is: Error in as.character.factor(x) : malformed factor

  (name, str(e)))
C:\Users\sr876\AppData\Local\Continuum\anaconda3\lib\site-packages\rpy2-2.9.4-py3.7-win-amd64.egg\rpy2\robjects\pandas2ri.py:67: UserWarning: Error while trying to convert the column "se20". Fall back to string conversion. The error is: Error in as.character.factor(x) : malformed factor

  (name, str(e)))

lavaan 0.6-2 ended normally after 29 iterations

  Optimization method                           NLMINB
  Number of free parameters                         21

                                                  Used       Total
  Number of observations                          1547        1623

  Estimator                                         ML
  Model Fit Test Statistic                     219.773
  Degrees of freedom                                34
  P-value (Chi-square)                           0.000

Model test baseline model:

  Minimum Function Test Statistic             4175.115
  Degrees of freedom                                45
  P-value                                        0.000

User model versus baseline model:

  Comparative Fit Index (CFI)                    0.955
  Tucker-Lewis Index (TLI)                       0.940

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)             -19814.396
  Loglikelihood unrestricted model (H1)     -19704.510

  Number of free parameters                         21
  Akaike (AIC)                               39670.792
  Bayesian (BIC)                             39783.018
  Sample-size adjusted Bayesian (BIC)        39716.306

Root Mean Square Error of Approximation:

  RMSEA                                          0.059
  90 Percent Confidence Interval          0.052  0.067
  P-value RMSEA <= 0.05                          0.018

Standardized Root Mean Square Residual:

  SRMR                                           0.034

Parameter Estimates:

  Information                                 Expected
  Information saturated (h1) model          Structured
  Standard Errors                             Standard

Latent Variables:
                   Estimate  Std.Err  z-value  P(>|z|)
  to_latent =~                                        
    to1               1.000                           
    to2               0.984    0.030   32.415    0.000
    to3               0.864    0.030   28.936    0.000
    to4               0.620    0.027   23.135    0.000
    to5               0.655    0.028   23.500    0.000
  se_latent =~                                        
    se1c              1.000                           
    se2c              0.996    0.063   15.798    0.000
    se3c              1.034    0.064   16.161    0.000
    se4c              0.645    0.049   13.223    0.000
    se5c              1.220    0.085   14.332    0.000

Covariances:
                   Estimate  Std.Err  z-value  P(>|z|)
  to_latent ~~                                        
    se_latent         0.057    0.016    3.489    0.000

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
   .to1               0.324    0.017   18.665    0.000
   .to2               0.301    0.016   18.284    0.000
   .to3               0.411    0.018   22.537    0.000
   .to4               0.431    0.017   25.277    0.000
   .to5               0.461    0.018   25.161    0.000
   .se1c              0.539    0.028   19.129    0.000
   .se2c              0.761    0.035   21.731    0.000
   .se3c              0.712    0.034   20.787    0.000
   .se4c              0.645    0.026   24.863    0.000
   .se5c              1.749    0.073   23.896    0.000
    to_latent         0.621    0.034   18.044    0.000
    se_latent         0.388    0.034   11.426    0.000

In [241]:

%%R
semPaths(fit, "std", curvePivot=TRUE, intercepts=FALSE, layout="tree2", rotation=2, nCharNodes=9, sizeMan=5, sizeLat=10,
         groups="latents",
         color=c("lightblue", "purple"))

TODO: compare against models (similar to above) that use one latent factor instead of two, or that load some indicators (e.g. se4c) to both.

Combining structural and measurement models for Structural Regression¶

Let's test whether ECE partially explains the relation between wealth and children's socioemotional and task orientation skills.

In [243]:

# we need dummy coding again here, to use ECE
dummiedMeasureZambiaDF = pd.get_dummies(measureZambiaDF, dtype=np.int8)
# We use int8 instead of uint8 for the dummy variables, because R has no unsigned integers,
# so we would get warnings when passing these new variables into R.
# Further: the new dummy columns use the actual value as part of the column name,
# and therefore might have spaces in the column name.
# this is not great for specifying models, so let us replace the spaces with underscores:
dummiedMeasureZambiaDF.columns = dummiedMeasureZambiaDF.columns.str.replace(' ', '_')

In [244]:

%%R -i dummiedMeasureZambiaDF
fullsrmodel <- '
    to_latent =~ to1 + to2 + to3 + to4 + to5 
    se_latent =~ se1c + se2c + se3c + se4c + se5c 
    ece_ECE   ~ wealth
    se_latent ~ ece_ECE + wealth
    to_latent ~ ece_ECE + wealth
'
srsemfit <- sem(model=fullsrmodel, data=dummiedMeasureZambiaDF, estimator="MLR", meanstructure=TRUE)
summary(srsemfit, standardized=TRUE)

lavaan 0.6-2 ended normally after 52 iterations

  Optimization method                           NLMINB
  Number of free parameters                         38

                                                  Used       Total
  Number of observations                          1547        1623

  Estimator                                         ML      Robust
  Model Fit Test Statistic                     242.490     224.444
  Degrees of freedom                                50          50
  P-value (Chi-square)                           0.000       0.000
  Scaling correction factor                                  1.080
    for the Yuan-Bentler correction (Mplus variant)

Parameter Estimates:

  Information                                 Observed
  Observed information based on                Hessian
  Standard Errors                   Robust.huber.white

Latent Variables:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  to_latent =~                                                          
    to1               1.000                               0.788    0.811
    to2               0.984    0.028   35.744    0.000    0.775    0.817
    to3               0.862    0.031   27.367    0.000    0.679    0.727
    to4               0.621    0.033   18.963    0.000    0.490    0.598
    to5               0.654    0.034   19.294    0.000    0.516    0.605
  se_latent =~                                                          
    se1c              1.000                               0.627    0.651
    se2c              0.990    0.059   16.805    0.000    0.620    0.579
    se3c              1.022    0.072   14.228    0.000    0.640    0.603
    se4c              0.641    0.057   11.225    0.000    0.401    0.447
    se5c              1.212    0.100   12.078    0.000    0.760    0.498

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  ece_ECE ~                                                             
    wealth            0.115    0.008   14.756    0.000    0.115    0.353
  se_latent ~                                                           
    ece_ECE           0.125    0.044    2.855    0.004    0.199    0.093
    wealth            0.028    0.014    2.013    0.044    0.045    0.064
  to_latent ~                                                           
    ece_ECE           0.171    0.048    3.540    0.000    0.217    0.101
    wealth            0.103    0.016    6.432    0.000    0.131    0.188

Covariances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
 .to_latent ~~                                                          
   .se_latent         0.042    0.015    2.732    0.006    0.089    0.089

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .to1               2.743    0.054   51.115    0.000    2.743    2.822
   .to2               2.717    0.052   51.998    0.000    2.717    2.861
   .to3               2.702    0.047   57.230    0.000    2.702    2.891
   .to4               3.258    0.040   81.464    0.000    3.258    3.981
   .to5               3.072    0.042   73.536    0.000    3.072    3.601
   .se1c              1.433    0.045   31.959    0.000    1.433    1.488
   .se2c              1.167    0.045   25.922    0.000    1.167    1.090
   .se3c              1.376    0.047   29.391    0.000    1.376    1.296
   .se4c              1.582    0.034   46.726    0.000    1.582    1.762
   .se5c              1.689    0.058   29.171    0.000    1.689    1.107
   .ece_ECE          -0.026    0.023   -1.134    0.257   -0.026   -0.056
   .to_latent         0.000                               0.000    0.000
   .se_latent         0.000                               0.000    0.000

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .to1               0.323    0.022   14.525    0.000    0.323    0.342
   .to2               0.301    0.022   13.493    0.000    0.301    0.333
   .to3               0.412    0.023   17.755    0.000    0.412    0.472
   .to4               0.430    0.023   19.102    0.000    0.430    0.642
   .to5               0.461    0.023   20.300    0.000    0.461    0.634
   .se1c              0.534    0.035   15.350    0.000    0.534    0.576
   .se2c              0.762    0.048   15.995    0.000    0.762    0.665
   .se3c              0.717    0.042   17.175    0.000    0.717    0.636
   .se4c              0.645    0.028   23.178    0.000    0.645    0.800
   .se5c              1.749    0.076   22.919    0.000    1.749    0.752
   .ece_ECE           0.188    0.005   38.582    0.000    0.188    0.876
   .to_latent         0.585    0.032   18.120    0.000    0.941    0.941
   .se_latent         0.386    0.040    9.545    0.000    0.983    0.983

In [245]:

%%R
semPaths(srsemfit, "std", curvePivot=TRUE, intercepts=FALSE, layout="tree2", rotation=2, nCharNodes=9, sizeMan=5, sizeLat=10,
         groups="latents",
         color=c("lightblue", "purple"))

Presenting the Results¶

In [ ]:

Notes on Aspects Skipped¶

Power calculations. Is the number of subjects/observations large enough for the number of parameters/connections to be estimated. In general, SEM needs large sample sizes above 100.
...

In [ ]:

For reference: some models from tutorials with built-in data¶

In [85]:

%%R
model <- ' visual  =~ x1 + x2 + x3 
              textual =~ x4 + x5 + x6
              speed   =~ x7 + x8 + x9 '
fit <- cfa(model, data=HolzingerSwineford1939)

In [86]:

%%R
summary(fit, fit.measures=TRUE)

lavaan 0.6-2 ended normally after 35 iterations

  Optimization method                           NLMINB
  Number of free parameters                         21

  Number of observations                           301

  Estimator                                         ML
  Model Fit Test Statistic                      85.306
  Degrees of freedom                                24
  P-value (Chi-square)                           0.000

Model test baseline model:

  Minimum Function Test Statistic              918.852
  Degrees of freedom                                36
  P-value                                        0.000

User model versus baseline model:

  Comparative Fit Index (CFI)                    0.931
  Tucker-Lewis Index (TLI)                       0.896

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)              -3737.745
  Loglikelihood unrestricted model (H1)      -3695.092

  Number of free parameters                         21
  Akaike (AIC)                                7517.490
  Bayesian (BIC)                              7595.339
  Sample-size adjusted Bayesian (BIC)         7528.739

Root Mean Square Error of Approximation:

  RMSEA                                          0.092
  90 Percent Confidence Interval          0.071  0.114
  P-value RMSEA <= 0.05                          0.001

Standardized Root Mean Square Residual:

  SRMR                                           0.065

Parameter Estimates:

  Information                                 Expected
  Information saturated (h1) model          Structured
  Standard Errors                             Standard

Latent Variables:
                   Estimate  Std.Err  z-value  P(>|z|)
  visual =~                                           
    x1                1.000                           
    x2                0.554    0.100    5.554    0.000
    x3                0.729    0.109    6.685    0.000
  textual =~                                          
    x4                1.000                           
    x5                1.113    0.065   17.014    0.000
    x6                0.926    0.055   16.703    0.000
  speed =~                                            
    x7                1.000                           
    x8                1.180    0.165    7.152    0.000
    x9                1.082    0.151    7.155    0.000

Covariances:
                   Estimate  Std.Err  z-value  P(>|z|)
  visual ~~                                           
    textual           0.408    0.074    5.552    0.000
    speed             0.262    0.056    4.660    0.000
  textual ~~                                          
    speed             0.173    0.049    3.518    0.000

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
   .x1                0.549    0.114    4.833    0.000
   .x2                1.134    0.102   11.146    0.000
   .x3                0.844    0.091    9.317    0.000
   .x4                0.371    0.048    7.779    0.000
   .x5                0.446    0.058    7.642    0.000
   .x6                0.356    0.043    8.277    0.000
   .x7                0.799    0.081    9.823    0.000
   .x8                0.488    0.074    6.573    0.000
   .x9                0.566    0.071    8.003    0.000
    visual            0.809    0.145    5.564    0.000
    textual           0.979    0.112    8.737    0.000
    speed             0.384    0.086    4.451    0.000

In [87]:

%%R
model <- '
  x1 ~ y3
  # measurement model
  #  ind60 =~ x1 + x2 + x3
  #  dem60 =~ y1 + y2 + y3 + y4
  #  dem65 =~ y5 + y6 + y7 + y8
  # regressions
  #  dem60 ~ ind60
  #  dem65 ~ ind60 + dem60
'
fit <- sem(model, data=PoliticalDemocracy)
summary(fit, standardized=TRUE)

lavaan 0.6-2 ended normally after 10 iterations

  Optimization method                           NLMINB
  Number of free parameters                          2

  Number of observations                            75

  Estimator                                         ML
  Model Fit Test Statistic                       0.000
  Degrees of freedom                                 0

Parameter Estimates:

  Information                                 Expected
  Information saturated (h1) model          Structured
  Standard Errors                             Standard

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  x1 ~                                                                  
    y3                0.073    0.024    2.999    0.003    0.073    0.327

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .x1                0.473    0.077    6.124    0.000    0.473    0.893