Sensitivity of mortality rates to the imputation of missing socioeconomic data: cohort study
Document version
Version | Alterations |
---|---|
01 | Initial version |
The data procedures, design and analysis methods used in this report are fully described in the annex document SAP-2023-017-BH-v01.
This analysis was performed using the statistical software R, version 4.3.0.
Table 1 shows how the number of outcome events changes when missing zip codes are imputed under each approach: complete cases (CC), last observation carried forward (LOCF), and LOCF combined with next observation carried backward (LOCF+NOCB). In all datasets the ID pool remains unchanged, meaning that no dropped participant was recovered by any of the imputation approaches evaluated. Only a small number of new outcome events are gained when multiple observations per individual are considered (1,006 versus 1,003). The frequencies of the SES categories vary slightly between approaches, but their proportions appear to be robust to changes in the underlying zip code imputations.
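Table: Table 1 Distribution of variables in the data under various imputation approaches. The first three columns use a single observation per individual (N = 7,978); the last three use multiple observations per individual (N = 24,282).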
Characteristic | CC, N = 7,978 | LOCF, N = 7,978 | LOCF+NOCB, N = 7,978 | CC, N = 24,282 | LOCF, N = 24,282 | LOCF+NOCB, N = 24,282 |
---|---|---|---|---|---|---|
outcome, n | 1,003 | 1,003 | 1,003 | 1,006 | 1,006 | 1,006 |
exposure, n (%) | ||||||
Prosperous | 1,421 (22%) | 1,428 (22%) | 1,430 (22%) | 4,363 (23%) | 4,573 (23%) | 4,577 (23%) |
Comfortable | 1,327 (20%) | 1,339 (20%) | 1,341 (20%) | 3,862 (20%) | 4,070 (20%) | 4,072 (20%) |
Mid-Tier | 1,221 (19%) | 1,237 (19%) | 1,238 (19%) | 3,573 (19%) | 3,764 (19%) | 3,765 (19%) |
At-Risk | 1,287 (20%) | 1,296 (20%) | 1,299 (20%) | 3,782 (20%) | 3,993 (20%) | 3,997 (20%) |
Distressed | 1,285 (20%) | 1,291 (20%) | 1,294 (20%) | 3,696 (19%) | 3,905 (19%) | 3,910 (19%) |
Missing | 1,437 | 1,387 | 1,376 | 5,006 | 3,977 | 3,961 |
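The two simple imputation rules compared in Table 1 can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the R code used for this report; the function names, the example zip codes, and the order in which LOCF and NOCB are combined are all assumptions. Each list stands for the zip codes of one participant, ordered by measurement time.

```python
from typing import List, Optional


def locf(values: List[Optional[str]]) -> List[Optional[str]]:
    """Last observation carried forward: fill each gap with the most recent earlier value."""
    filled = []
    last = None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled


def nocb(values: List[Optional[str]]) -> List[Optional[str]]:
    """Next observation carried backward: LOCF applied to the reversed sequence."""
    return locf(values[::-1])[::-1]


def locf_then_nocb(values: List[Optional[str]]) -> List[Optional[str]]:
    """Assumed combination: LOCF first, then NOCB for any leading gaps left over."""
    return nocb(locf(values))


# Hypothetical zip codes for one participant (None marks a missing record).
zips = [None, "10001", None, "10002", None]

locf_only = locf(zips)          # the leading gap stays missing
combined = locf_then_nocb(zips)  # the leading gap is filled from the next record

# A participant with no zip code at any time cannot be imputed by either rule.
never_recorded = locf_then_nocb([None, None])
```

Under these rules a participant whose zip code is never recorded remains missing, which is consistent with the ID pool being unchanged across approaches.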
After dropping incomplete cases to inspect the data available to the model, we can anticipate how the model might be impacted by the imputations (Table 2).
Surprisingly, all three datasets under the “single observation per individual” approach are the same; this was validated with the all.equal() function, which tests two data frames for (near) equality.
No changes to the data available for modelling can be detected using any of the imputation approaches for this dataset.
Table: Table 2 Number of death events available to models in each dataset
dataset | n |
---|---|
sing_cc | 693 |
sing_locf | 693 |
sing_locf+nocb | 693 |
mult_cc | 2 |
mult_locf | 694 |
mult_locf+nocb | 694 |
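The data frame comparison used to validate that the datasets are equal can be sketched as follows. This is an illustrative Python analogue, not the all.equal() call used in the analysis; the helper names and example rows are hypothetical, and the numeric tolerance of 1.5e-8 is chosen to mirror the default tolerance of all.equal() in R.

```python
import math
from typing import Any, Dict, List

Row = Dict[str, Any]


def rows_equal(a: Row, b: Row, tol: float = 1.5e-8) -> bool:
    """Compare two rows; floats are compared with a relative tolerance."""
    if a.keys() != b.keys():
        return False
    for key in a:
        x, y = a[key], b[key]
        if isinstance(x, float) and isinstance(y, float):
            if not math.isclose(x, y, rel_tol=tol):
                return False
        elif x != y:
            return False
    return True


def datasets_equal(left: List[Row], right: List[Row], tol: float = 1.5e-8) -> bool:
    """Datasets are equal when they have the same rows, in the same order."""
    if len(left) != len(right):
        return False
    return all(rows_equal(a, b, tol) for a, b in zip(left, right))


# Hypothetical rows: an imputation that changes nothing leaves the datasets equal.
cc_rows = [
    {"id": 1, "exposure": "Prosperous", "Time": 2.5},
    {"id": 2, "exposure": "Distressed", "Time": 1.0},
]
locf_rows = [dict(row) for row in cc_rows]
changed_rows = [dict(cc_rows[0]), {"id": 2, "exposure": "At-Risk", "Time": 1.0}]

same = datasets_equal(cc_rows, locf_rows)
different = datasets_equal(cc_rows, changed_rows)
```

A comparison of this kind only reports whether the data available to the model differ; it does not say which imputed values were filled in and later discarded with the incomplete cases.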
Under the “multiple observations per individual” approach, most outcome events are dropped from the complete case dataset. This happens because that dataset uses the exposure at discharge, but no zip codes are recorded for that measurement time. As a result, no DCI scores were available for most individuals, leaving only 2 outcome events available to the model (Table 2). Applying the same comparison to the two imputed datasets, we found that the LOCF and LOCF+NOCB data frames are equal. A single outcome event was added to those datasets after the imputation was applied to the underlying SES data (694 versus 693).
This leaves only two datasets on which to perform the sensitivity analysis: one under the “single observation per individual” approach (regardless of whether an imputation was applied) and one under the “multiple observations per individual” approach (using either imputation). For simplicity, we will consider the complete case dataset for the first case and the LOCF dataset for the second.
Table 3 shows the results of the model specification from SAP-2023-016-BH-v02, fitted to both datasets retained in the previous section.
Table: Table 3 Model coefficients for both datasets.
Characteristic | HR (CC) | 95% CI (CC) | p-value (CC) | HR (LOCF) | 95% CI (LOCF) | p-value (LOCF) |
---|---|---|---|---|---|---|
SES quintiles | ||||||
Prosperous | — | — | | — | — | |
Comfortable | 0.98 | 0.78 to 1.25 | 0.893 | 1.06 | 0.83 to 1.35 | 0.623 |
Mid-Tier | 1.09 | 0.84 to 1.41 | 0.515 | 1.18 | 0.91 to 1.52 | 0.207 |
At-Risk | 1.12 | 0.87 to 1.43 | 0.386 | 1.11 | 0.87 to 1.43 | 0.400 |
Distressed | 1.21 | 0.95 to 1.56 | 0.129 | 1.33 | 1.03 to 1.72 | 0.027 |
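As a rough consistency check on Table 3, a p-value can be approximately recovered from a hazard ratio and its 95% confidence interval. The sketch below assumes the interval is a Wald interval, symmetric on the log scale; this is an assumption about how the intervals were computed, and the function name is hypothetical. Because the table values are rounded to two decimals, the recovered p-value will only approximate the reported one.

```python
import math


def wald_p_from_ci(hr: float, lower: float, upper: float, z_crit: float = 1.959964) -> float:
    """Approximate two-sided Wald p-value from a hazard ratio and its 95% CI."""
    log_hr = math.log(hr)
    # On the log scale the CI spans 2 * z_crit standard errors.
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = log_hr / se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)).
    return math.erfc(abs(z) / math.sqrt(2))


# Distressed versus Prosperous in the LOCF dataset (values copied from Table 3).
p_distressed_locf = wald_p_from_ci(1.33, 1.03, 1.72)
```

For this row the recovered value is close to 0.03, in line with the reported p-value of 0.027.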
Using the dataset that provides a single observation per individual, the residual analysis of SAR-2023-016-BH is reproduced: the FIM motor score is dropped due to a violation of the proportional hazards assumption. The dataset that provides multiple observations per individual, imputed with the LOCF approach, does not violate that assumption, so the term can be safely kept for the analysis. Additionally, when the SES exposure is time-varying, it is associated with mortality under the final model specification, whereas in the smaller dataset this was only true without including any of the FIM scores. Table 4 shows the p-values of the Schoenfeld test for the model tested on both datasets.
Table: Table 4 Schoenfeld test for both datasets.
term | cc | locf |
---|---|---|
exposure | 0.5 | 0.5 |
SexF | 0.2 | 0.2 |
Race | 0.3 | 0.4 |
AGE | 0.8 | 0.8 |
EDUCATION | >0.9 | 0.7 |
EMPLOYMENT | 0.3 | 0.2 |
RehabPay1 | 0.6 | 0.7 |
SCI | 0.2 | 0.9 |
DAYStoREHABdc | 0.055 | 0.2 |
PROBLEMUse | 0.4 | 0.8 |
ResDis | 0.4 | 0.6 |
RURALdc | 0.4 | 0.3 |
FIMMOTD4 | 0.047 | 0.055 |
FIMCOGD4 | 0.2 | 0.13 |
GLOBAL | 0.3 | 0.2 |
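The reading of Table 4 can be sketched as a simple screening rule: a term is flagged when its Schoenfeld test p-value falls below a significance level. The 0.05 level used here is an assumption (it is consistent with FIMMOTD4 being dropped at 0.047 and kept at 0.055, but the threshold is not stated in this report), the function name is hypothetical, and only a subset of the rows of Table 4 is copied.

```python
from typing import Dict, List

# Subset of Table 4: Schoenfeld test p-values per term and dataset.
schoenfeld_p: Dict[str, Dict[str, float]] = {
    "DAYStoREHABdc": {"cc": 0.055, "locf": 0.2},
    "FIMMOTD4": {"cc": 0.047, "locf": 0.055},
    "FIMCOGD4": {"cc": 0.2, "locf": 0.13},
    "GLOBAL": {"cc": 0.3, "locf": 0.2},
}


def flag_violations(p_values: Dict[str, Dict[str, float]], dataset: str, alpha: float = 0.05) -> List[str]:
    """Return the terms whose p-value is below alpha (assumed threshold) in one dataset."""
    return [term for term, by_dataset in p_values.items() if by_dataset[dataset] < alpha]


flagged_cc = flag_violations(schoenfeld_p, "cc")      # single observation, complete case
flagged_locf = flag_violations(schoenfeld_p, "locf")  # multiple observations, LOCF
```

With these values only FIMMOTD4 is flagged in the complete case dataset and no term is flagged in the LOCF dataset, although the two FIMMOTD4 p-values sit very close to the assumed threshold.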
Recommended reporting guideline
The EQUATOR network (http://www.equator-network.org/) reporting guidelines have seen increasing adoption by scientific journals. All observational studies are recommended to be reported following the STROBE guideline (von Elm et al., 2014).
Simple imputation of zip codes does not affect the set of observations available for modeling in this dataset when using a single observation per individual. The model specification tested is robust to imputation approaches on this dataset and the resulting exposure variable is unchanged.
When using multiple observations per individual, there is a minute increment in the number of events, but the different imputation approaches do not yield different datasets. The model specification tested is robust to imputation approaches on this dataset and the resulting exposure variable is unchanged.
When using multiple observations per individual in the model specification evaluated, the time-varying exposure allows for the inclusion of terms that violated the proportional hazards assumption under the constant exposure. The model specification tested is sensitive to using a time-varying exposure, and all terms can be used for analysis.
N/A
All documents from this consultation were included in the consultant’s Portfolio.
The portfolio is available at:
https://philsf-biostat.github.io/SAR-2023-017-BH/
This analysis is part of a larger project and is supported by other analyses, linked below.
Effect of socioeconomic status in mortality rates after brain injury: cohort study
https://philsf-biostat.github.io/SAR-2023-004-BH/
Time-adjusted effect of socioeconomic status in mortality rates after brain injury: cohort study
https://philsf-biostat.github.io/SAR-2023-016-BH/
Table A1 shows the structure of the analytical dataset.
id | exposure | outcome | Time | SexF | Race | Mar | AGE | PROBLEMUse | EDUCATION | EMPLOYMENT | RURALdc | PriorSeiz | SCI | Cause | RehabPay1 | ResDis | DAYStoREHABdc | FIMMOTD | FIMCOGD | FollowUpPeriod | FIMMOTD4 | FIMCOGD4 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | ||||||||||||||||||||||
2 | ||||||||||||||||||||||
3 | ||||||||||||||||||||||
… | ||||||||||||||||||||||
N | ||||||||||||||||||||||
Table: Table A1 Analytical dataset structure
Due to confidentiality, the dataset used in this analysis cannot be shared online in the public version of this report.