Handling Missing Data Case Study
Last modified 2002-01-28 10:00
Please check this page regularly for updates, corrections, and answers to frequently asked questions!
Table of Contents
- Overview
- Introduction
- Data Description
- Objectives
- Exercise 1 (PDF file)
- Exercise 2 (PDF file)
- Exercise 3 (PDF file)
- Exercise 4 (PDF file)
- GENESIS (PDF file)
- FAQ
- References
Overview
The data set to be studied, drawn from the 1994 National Population Health Survey, contains deliberately introduced missing values that simulate non-response. In addition to studying the relationship between health status and health determinants, the student will learn about response mechanisms, non-response bias, and different methods to treat and analyze data with missing values.
Julie Bernier - julie.bernier@statcan.ca, David Haziza - david.haziza@statcan.ca,
Karla Nobrega - karla.nobrega@statcan.ca, Patricia Whitridge - patricia.whitridge@statcan.ca
Introduction
In surveys, some level of nonresponse is virtually certain. There are two types of nonresponse: total (or unit) nonresponse, when no information is collected on a sampled unit, and partial (or item) nonresponse, when information is missing for only some variables. Weighting adjustment methods are commonly used to compensate for total nonresponse, while imputation is used to compensate for item nonresponse.
Weighting adjustments are used primarily to increase the survey weight of respondents in order to compensate for the nonrespondents. Imputation, on the other hand, produces an "artificial value" to replace a missing value. The goal in both cases is to obtain approximately unbiased estimates.
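As an illustration of the weighting idea, here is a minimal Python sketch of a weighting-class adjustment. It is not taken from the case study materials: the record layout (`cls`, `weight`, `responded`), the age classes, and the weights are all hypothetical. Within each class, respondent weights are inflated by the inverse of the weighted response rate, so that respondents also represent the nonrespondents in their class.

```python
from collections import defaultdict

def adjust_weights(units):
    """Weighting-class adjustment for unit nonresponse.

    `units` is a list of dicts with hypothetical keys:
    'cls' (adjustment class), 'weight' (design weight), 'responded' (bool).
    Returns the adjusted weights of the respondents, keyed by list index.
    """
    total = defaultdict(float)  # sum of design weights, all sampled units
    resp = defaultdict(float)   # sum of design weights, respondents only
    for u in units:
        total[u["cls"]] += u["weight"]
        if u["responded"]:
            resp[u["cls"]] += u["weight"]

    adjusted = {}
    for i, u in enumerate(units):
        if u["responded"]:
            # Multiply by the inverse of the weighted response rate
            # observed in the unit's class.
            adjusted[i] = u["weight"] * total[u["cls"]] / resp[u["cls"]]
    return adjusted

# Made-up sample of five units in two age classes.
sample = [
    {"cls": "20-39", "weight": 100.0, "responded": True},
    {"cls": "20-39", "weight": 100.0, "responded": False},
    {"cls": "40-65", "weight": 150.0, "responded": True},
    {"cls": "40-65", "weight": 150.0, "responded": True},
    {"cls": "40-65", "weight": 150.0, "responded": False},
]
new_weights = adjust_weights(sample)
print(new_weights)
```

After the adjustment, the respondent weights sum to the total design weight of the full sample (650 in this made-up example), which is the property a weighting adjustment is meant to restore.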
There are four sections to this case study. The student may do any or all of the four components.
Section 1: Assessing the response mechanism
The student will assess the nature of the response mechanism. There are three common classifications for response mechanisms:
a. Missing Completely at Random (MCAR), i.e. the probability of response for a variable of interest y is the same for all units in the population. This means that the probability of response depends on neither the auxiliary variables x nor the variable of interest y;
b. Missing at Random (MAR), i.e. the probability of response for a variable of interest y depends on auxiliary variable(s) x but, given x, not on y itself;
c. Not Missing at Random (NMAR), i.e. the probability of response for a variable of interest y depends on y itself or on other variables that were not studied.
Note that, of these three mechanisms, only MCAR can be tested using the observed data.
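The three mechanisms can be made concrete with a small simulation. The sketch below is illustrative only: the population, the response probabilities, and the cut-offs are assumptions chosen for the example, not values from the NPHS. It reports the bias of the respondent mean of y and the gap in response rates between low and high values of x.

```python
import random

random.seed(1994)

# Hypothetical population: x is an auxiliary variable (for example age)
# and y is the variable of interest, which increases with x.
N = 20000
x = [random.uniform(20, 65) for _ in range(N)]
y = [0.5 * xi + random.gauss(0, 5) for xi in x]
true_mean = sum(y) / N

# Assumed response probabilities for each mechanism.
mechanisms = {
    "MCAR": lambda xi, yi: 0.7,                         # same for every unit
    "MAR": lambda xi, yi: 0.9 if xi < 42.5 else 0.5,    # depends on x only
    "NMAR": lambda xi, yi: 0.9 if yi < 21.25 else 0.5,  # depends on y itself
}

def simulate(prob):
    """Draw response indicators and summarise the respondents."""
    responded = [random.random() < prob(xi, yi) for xi, yi in zip(x, y)]
    resp_y = [yi for yi, r in zip(y, responded) if r]
    low = [r for xi, r in zip(x, responded) if xi < 42.5]
    high = [r for xi, r in zip(x, responded) if xi >= 42.5]
    return {
        # Bias of the naive respondent mean of y.
        "bias": sum(resp_y) / len(resp_y) - true_mean,
        # Difference in response rates between low-x and high-x units.
        "rate_gap": sum(low) / len(low) - sum(high) / len(high),
    }

results = {name: simulate(prob) for name, prob in mechanisms.items()}
for name, r in results.items():
    print(f"{name}: bias = {r['bias']:+.2f}, response-rate gap = {r['rate_gap']:+.2f}")
```

Under MCAR both quantities should be close to zero. A clear response-rate gap across x is evidence against MCAR, but in this setup a gap appears under both MAR and NMAR (because y is correlated with x), so the observed data alone cannot separate those two.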
Section 2: Deciding on a method to deal with the missing data
The student will consider alternatives for addressing missing data, including:
a. Do nothing;
b. Use only respondents with complete data;
c. Use a weighting adjustment method;
d. Impute values using:
- Mean
- Ratio
- Regression
- Random Hot Deck
- Nearest Neighbour
- Other methods
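The imputation methods listed above can be sketched in a few lines of Python. This is a minimal illustration on made-up records with a single auxiliary variable x. The function names and the data are hypothetical, and production systems use imputation classes, several auxiliary variables, and donor-selection rules that are omitted here.

```python
import random

# Hypothetical records (x, y): x is always observed, y is None when missing.
data = [(20, 4.0), (30, 6.5), (40, None), (50, 10.0), (60, 11.5), (35, None)]

donors = [(x, y) for x, y in data if y is not None]
n = len(donors)
mean_x = sum(x for x, _ in donors) / n
mean_y = sum(y for _, y in donors) / n

# Least-squares line fitted to the respondents (used by regression imputation).
sxx = sum((x - mean_x) ** 2 for x, _ in donors)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in donors)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

def impute_mean(x):
    """Mean imputation: every missing y receives the respondent mean."""
    return mean_y

def impute_ratio(x):
    """Ratio imputation: y* = x * (respondent mean of y / respondent mean of x)."""
    return x * mean_y / mean_x

def impute_regression(x):
    """Regression imputation: predicted value from the fitted line."""
    return intercept + slope * x

def impute_hot_deck(x, rng):
    """Random hot deck: y value of a donor drawn at random."""
    return rng.choice(donors)[1]

def impute_nearest(x):
    """Nearest neighbour: y value of the donor closest in x
    (ties go to the donor listed first)."""
    return min(donors, key=lambda d: abs(d[0] - x))[1]

rng = random.Random(0)
for x, y in data:
    if y is None:
        print(x, impute_mean(x), impute_ratio(x),
              round(impute_regression(x), 3),
              impute_hot_deck(x, rng), impute_nearest(x))
```

Mean, ratio, and regression imputation are deterministic given the respondents, while the random hot deck introduces randomness through the donor draw. The nearest-neighbour method always returns a value actually observed for some respondent.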
Section 3: Analysing the data
The student will study the relationship between either the Health Utilities Index (HUI) or self-perceived general health and the following variables:
- age
- income
- probability of depression
- number of chronic conditions
- number of doctor visits
- Body Mass Index (BMI)
- sex
- smoking status
Section 4: Examining bias from imputation
Using the Generalized System for Imputation Simulations (GENESIS) v.1.0 with SAS 8.2, the student will assess the extent of bias resulting from different imputation methods.
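The idea behind such a simulation study can be sketched in plain Python. The code below is not GENESIS and does not reproduce its interface. It is a hypothetical Monte Carlo experiment on an artificial population with an assumed MAR response mechanism. In each replicate, nonresponse is generated, missing values are imputed by two methods, and the resulting estimates of the mean are compared with the known true mean.

```python
import random

rng = random.Random(82)

# Hypothetical complete population: y depends linearly on x.
N = 2000
x = [rng.uniform(20, 65) for _ in range(N)]
y = [0.5 * xi + rng.gauss(0, 5) for xi in x]
true_mean = sum(y) / N

def fit_line(pairs):
    """Ordinary least-squares intercept and slope for (x, y) pairs."""
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    slope = sxy / sxx
    return my - slope * mx, slope

def one_replicate():
    """Generate nonresponse, impute by two methods, return both estimates."""
    # Assumed MAR mechanism: units with larger x respond less often.
    responds = [rng.random() < (0.9 if xi < 42.5 else 0.5) for xi in x]
    donors = [(xi, yi) for xi, yi, r in zip(x, y, responds) if r]
    donor_mean = sum(yi for _, yi in donors) / len(donors)
    intercept, slope = fit_line(donors)
    mean_imp = [yi if r else donor_mean for yi, r in zip(y, responds)]
    reg_imp = [yi if r else intercept + slope * xi
               for xi, yi, r in zip(x, y, responds)]
    return sum(mean_imp) / N, sum(reg_imp) / N

R = 200  # number of simulated nonresponse patterns
estimates = [one_replicate() for _ in range(R)]

# Monte Carlo relative bias (in percent) of each imputed estimator.
rel_bias = {
    "mean": 100 * (sum(e[0] for e in estimates) / R - true_mean) / true_mean,
    "regression": 100 * (sum(e[1] for e in estimates) / R - true_mean) / true_mean,
}
for method, rb in rel_bias.items():
    print(f"{method} imputation: relative bias = {rb:+.2f}%")
```

In this setup, mean imputation inherits the bias of the respondent mean because response depends on x, whereas regression imputation uses x and should show a much smaller relative bias.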
Data Description
Survey
This case study on missing data uses a sub-sample of the 1994 National Population Health Survey. The context of the exercise is the relationship between health status and health predictors. Health status is measured with either the general health question or the Health Utilities Index (HUI). The data represent persons aged 20-65 living in a private household in the Prairie provinces. (Pregnant women were excluded from this analysis.) Note that the "missing" data values in the data sample were removed for this case study, although they are, in reality, present in the public use micro-data files.
The National Population Health Survey (NPHS) used the Labour Force Survey sampling frame to draw the initial sample of approximately 20,000 households. The survey is designed to collect information on the health of the Canadian population and related socio-demographic information. The first cycle of data collection began in 1994, and collection continues every second year thereafter. Sample collection is distributed over four quarterly periods, followed by a follow-up period; the whole process takes a year. The survey is designed to produce both cross-sectional and longitudinal estimates. In each household, some limited health information is collected from all household members, and one person in each household is randomly selected for a more in-depth interview.
The questionnaires include content related to health status, use of health services, determinants of health, a health index, chronic conditions and activity restrictions. The use of health services is probed through visits to health care providers, both traditional and non-traditional, and the use of drugs and other medications. Health determinants include smoking, alcohol use and physical activity. A section on self-care has also been included in this cycle. The socio-demographic information includes age, sex, education, ethnicity, household income and labour force status.
Data: Excel file, SAS 8.2 file;
SAS variable definitions.
Data are available at http://www.statcan.ca/english/IPS/Data/82M0009XCB.htm or to students free of charge through the Data Liberation Initiative.
Variables
Health index:
| GH_Q1 | In general, how would you describe your health? |
| DVHST94 | Derived Health Status Index (3 decimal places)-HUI provisional score |
Covariates:
| AGEGRP | Grouped age cohorts |
| SEX | Respondent's sex |
| DVHHIN94 | Derived total household income from all sources in the past 12 months |
| DVBMI94 | Derived Body Mass Index (1 decimal place) |
| DVSMKT94 | Derived type of smoker |
| DVPP94 | Derived depression variable - predicted probability (2 decimal places) |
| NUMCHRON | Sum of the following conditions: |
| CHRQ1_A | Do you have any food allergies diagnosed by a health professional? |
| CHRQ1_B | Do you have other allergies diagnosed by a health professional? |
| CHRQ1_C | Do you have asthma diagnosed by a health professional? |
| CHRQ1_D | Do you have arthritis or rheumatism diagnosed by a health professional? |
| CHRQ1_E | Do you have back problems (excluding arthritis) diagnosed by a health professional? |
| CHRQ1_F | Do you have high blood pressure diagnosed by a health professional? |
| CHRQ1_G | Do you have migraine headaches diagnosed by a health professional? |
| CHRQ1_H | Do you have chronic bronchitis or emphysema diagnosed by a health professional? |
| CHRQ1_I | Do you have sinusitis diagnosed by a health professional? |
| CHRQ1_J | Do you have diabetes diagnosed by a health professional? |
| CHRQ1_K | Do you have epilepsy diagnosed by a health professional? |
| CHRQ1_L | Do you have heart disease diagnosed by a health professional? |
| CHRQ1_M | Do you have cancer diagnosed by a health professional? |
| CHRQ1_N | Do you have stomach or intestinal ulcers diagnosed by a health professional? |
| CHRQ1_O | Do you have the effects of a stroke diagnosed by a health professional? |
| CHRQ1_P | Do you have urinary incontinence diagnosed by a health professional? |
| CHRQ1_R | Do you have Alzheimer's disease diagnosed by a health professional? |
| CHRQ1_S | Do you have cataracts diagnosed by a health professional? |
| CHRQ1_T | Do you have glaucoma diagnosed by a health professional? |
| CHRQ1_U | Do you have any other long-term condition diagnosed by a health professional? |
| VISITS | Sum of the following questions: |
| UTIL-Q2 | (Not counting when ... were/was an overnight patient); in the past 12 months, how many times have/has ... seen or talked on the telephone with [fill category] about your/his/her physical, emotional or mental health: a) General practitioner or family physician |
| WT6 | Survey weights |
See attached NPHS documentation for classes and definitions.
Objectives
For this case study, a survey example will be used to:
- Distinguish non-response mechanisms.
- Examine methods used to deal with non-response.
- Estimate bias in the presence of non-response.
Frequently Asked Questions
Please check this section regularly for updates.
References
Fellegi,
*Kalton, G. and D. Kasprzyk (1982), "Imputing for Missing Survey Responses", Proceedings of the Survey Research Methods Section, American Statistical Association, pp. 22-31.
*Kalton, G., and Kasprzyk, D. (1986), "The treatment of missing survey data", Survey Methodology, 12, pp. 1-16.
*Kovar, J. G. and P. Whitridge (1995), "Imputation of Business Survey Data", in B. Cox, D. Binder, A. Christianson, M. Colledge, and P. Kott (eds), Business Survey Methods, New York: Wiley, pp. 403-420.
*Little, R. J. A. and D.
Lohr,
Mathers CD (1992) Estimating gains in health expectancy due to elimination of specified diseases. Fifth meeting of the International Network on Health Expectancy (REVES-5), Statistics Canada,
Monier A. La conjoncture démographique: l'Europe et les pays développés d'outre-mer. Population 1998;53:995-1023.
*Nordholt,
Oh, H. L. and F. J. Scheuren (1983), "Weighting Adjustment for Unit non-response", in W. G. Madow, I. Olkin, and D. B. Rubin (eds), Incomplete data in Sample Surveys, Vol. 2: Theory and Bibliographies, New York: Academic Press, pp. 143-184.
Sande,
Smith, P. J., Hoaglin, D. C., Battaglia, M. P., Rao, J. N. K., and Daniels, D. (2001), "Evaluation of Adjustment for Partial Nonresponse Bias, Applied to Provider nonresponse in the National Immunization Survey", paper presented at the Annual Meeting of the Statistical Society of Canada, Ottawa, Canada.
© 2001-2003 Statistical Society of Canada