Health Insurance Marketplace Consumer Experience Surveys: Enrollee Satisfaction Survey and Marketplace Survey
Supporting Statement: Part B
Collections of Information Employing Statistical Methods
Centers for Medicare & Medicaid Services
TABLE OF CONTENTS
1. Potential Respondent Universe and Sampling Methods
1.1.2 Sample Size Calculations
1.1.3 Marketplace Survey Sampling Methods
1.2.2 Sample Size Calculations
1.2.3 QHP Enrollee Survey Sampling Methods
2. Information Collection Procedures
3. Methods to Maximize Response Rates and Address Non-Response Bias
4. Tests of Procedures
5. Statistical Consultants
This supporting statement includes information in support of:
A Health Insurance Marketplace (HIM) Survey (Marketplace Survey)
A Qualified Health Plan Enrollee Experience Survey (QHP Enrollee Survey)
A description of the surveys and the testing goals related to each survey are provided in Part A of this submission. Testing goals that directly impact the sample size estimates are summarized in Sections 1.1.2 (for the Marketplace Survey) and 1.2.2 (for the QHP Enrollee Survey) of this document.
Both surveys will be administered in four annual rounds, one each year in 2014, 2015, 2016, and 2017. The first two of these rounds (2014 and 2015) are described in this submission:
A psychometric test of each survey in 2014 using a single survey vendor. The goal of each psychometric test is to evaluate the reliability and validity of each survey. This goal includes assessing the measurement properties of the instrument, individual survey items, and reporting composites. It also includes testing the equivalence of these measurement properties across language and mode of administration. Because the QHP Enrollee Survey includes CAHPS 5.0 Health Plan Survey core items and some CAHPS supplemental items sets, another goal for the QHP survey will be to determine the extent to which the measurement properties of the existing CAHPS items hold for the QHP population, which will include persons who have been previously uninsured. Results of psychometric testing will inform revisions to both surveys, including shortening the survey instruments and reducing respondent burden.
A full national fielding of the revised survey in 2015 that will provide early feedback and include some beta-test components. The main goals for the Marketplace Survey beta test are 1) to produce assessment scores in each Marketplace for composites, global ratings, and individual report items from the tested and revised Marketplace survey and use these scores to provide early feedback to states, and 2) to the extent feasible (given sample size limitations), to conduct subgroup analyses to determine whether disparities in consumer experiences by race, ethnicity, income, and disability exist within each Marketplace, to help CMS meet its regulatory oversight requirements. A secondary goal is to rerun the psychometric analyses to confirm the psychometric properties of the revised instrument on a larger, state-based sample. The goals for the QHP Enrollee Survey beta test, which will involve data collection by multiple survey vendors hired by the issuers in each State, are to test the vendor system (this is explained in more detail in Part A), verify the psychometric properties of the revised QHP instrument on a larger, national sample, and provide early feedback to QHPs before public reporting begins in 2016.
CMS is seeking immediate clearance for the psychometric test round, so that Marketplace Survey psychometric test operations can begin on March 1, 2014, and QHP Enrollee Survey psychometric test operations can begin on June 1, 2014. Following the psychometric tests, CMS will submit updated materials and seek clearance for the beta test round, though it is anticipated that there will be no substantive changes in any methods or materials relative to what is described in this supporting statement.
Robust results from the psychometric analysis are obtained by capturing, to the greatest extent possible, the full range of experiences and response patterns in the population. For this reason, the construction of the sampling frame and the sampling methods are designed to capture this full range of experiences of the populations within each State or QHP. In addition, there are other considerations related to reliability and validity that guide the sample size estimates and sampling methods, and these considerations are described in the sections that follow. The psychometric test component of this information collection is not designed to provide state-level or QHP-level estimates; as such only the aggregate results of this analysis will be discussed or disseminated. The second component of this information collection (in 2015) includes a provision for early feedback to States and QHPs, subject to the limitations associated with response rates (including non-response bias analyses) and data quality findings.
The Marketplace Survey is addressed first, then the QHP Enrollee Survey. Within each of these separate survey sections we discuss the respondent universe, sample size calculations, and sampling methods separately for the psychometric test and the beta test.
The respondent universe is the theoretical population to which we want our findings to apply. The study population is the population to which we can gain access, and the sampling frame is the means by which we can access this study population. The sampling frames will be list-based and constructed from records contained in CMS databases.
The respondent universe for the Marketplace survey includes any adult (age 18+) eligible for health insurance coverage being offered through the new Health Insurance Marketplaces. This definition includes those eligible for Medicaid coverage.
The study population comprises the subset of consumers who have actually interacted with the Marketplaces within a specified time frame. This definition includes any adult who has at a minimum provided their contact information, regardless of how far they have gotten in the application and enrollment process. This definition includes consumers who enter their information themselves through the website, submit a paper application, or have the information entered for them by a telephone or in-person assistor. 1
These individuals are classified into four types of Marketplace consumers: (1) effectuated QHP enrollees (those who have enrolled in a QHP and paid their first premium), (2) QHP enrollees who have not yet paid their premium (enrolled but not effectuated), (3) those who have accessed a Marketplace, completed and submitted an application, but have not enrolled in a QHP, and (4) those who have accessed a Marketplace and entered contact information, but who have not yet completed the application and thus have not yet selected and enrolled in a QHP.
Due to limited resources and the time required for translation of the survey materials, the study population is also limited to those who prefer to speak and read in one or more of three languages: English, Chinese, and Spanish.
All 50 States and the District of Columbia (D.C.)2 are classified in one of two groups based on how their Marketplaces are organized:
State-based Marketplaces (SBMs)—include 15 States (including D.C.), all of which are currently running their own State-specific exchange web sites for enrollment. Eventually, the SBMs will transmit their individual-level application and enrollment data to CMS.
Federally-facilitated Marketplace (FFM)—includes the remaining 36 States where enrollment operates through the federal government’s web site, Healthcare.gov. Even though enrollment in these 36 States operates through a single website, the FFM comprises 36 distinct markets, since the issuers and their associated sets of plan offerings are unique in each State, and in-person assistance will vary by state.
The sampling frame will include all available records for the four distinct groups of eligible consumers. As described above, these four groups are defined by their applicant status:
Potential applicant (PA) – consumers who have completed any step prior to submitting an application, after providing contact information,
Potential enrollee (PE) – consumers who have successfully completed and submitted an application that includes their family size and income information,
Enrollee (E) – consumers who have selected a QHP from their Marketplace, and
Effectuated enrollee (EE) – QHP enrollees who have made their first premium payment to the selected QHP issuer.
The psychometric test component will only include participants in the 36 FFM states. CMS will construct a sampling frame for the 2014 psychometric test using the administrative data from the 36 FFM states contained in the databases in which application and enrollment information are stored. The frame will include individuals who provided contact information at any point from October 1, 2013 to February 28, 2014.
The study population is the same as for the psychometric test component, with the exception of the time frame: any adult (age 18+) who, at any point from March 1, 2014 to December 31, 2014, has at a minimum provided their contact information, regardless of how far they have gotten in the application and enrollment process. In the time between sampling activities for the psychometric test and the beta test, CMS will work with its contractors to try to resolve the sampling frame limitations associated with the 15 SBMs. If the SBMs are able to submit all four populations of interest to CMS or CMS’ contractor in time for the beta test, we will include them in the beta test sample. For now, the working assumption is that CMS will be able to construct a 2015 beta test sampling frame that includes all four applicant types for all 51 States, including the State-Based Marketplaces, and thus will be able to generalize to a single study population that very closely reflects the respondent universe.
Sample size is calculated first by determining the minimum number of completed responses needed to meet the goals of the data collection, and second by inflating that number by a large enough factor to account for the estimated rate of survey non-response.
We will follow American Association for Public Opinion Research (AAPOR) guidelines in calculating response rates. The response rate is the result of dividing the number of completed interviews/questionnaires by the number of eligible respondents who were selected to participate. Potential respondents fall into the following categories:
Eligible and interview completed (c).
Eligible and not interviewed (e).
Ineligible (e.g., out of scope; only potential respondents who have explicitly indicated ineligibility are included here) (i).
Unable to determine eligibility (u).
According to AAPOR guidelines, the total number of participants selected to be surveyed (n) is the sum of eligible and completed (c), eligible and not interviewed (e), ineligible (i), and unable to determine eligibility (u). That is, n = c + e + i + u. By design, our survey sampling frames will only include eligible individuals, with eligibility determined using administrative data from CMS databases. However, among those with unknown eligibility (u), there is likely to be a small proportion (x) who may in fact be ineligible. This proportion (x) will be estimated using the following formula:
$$x = \frac{i}{c + e + i}$$
The response rate will then be calculated as:
$$\text{Response rate} = \frac{c}{c + e + (1 - x)\,u}$$
In the above formula, the denominator includes all original survey units that were identified as being eligible, including units with pending responses with no data received, post office returns because of “undeliverable as addressed,” and any new eligible units added to the survey. The denominator will not include units deemed out-of-scope, or duplicates.
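The response rate calculation can be sketched in a few lines of Python. This is a minimal illustration, not CMS production code: it assumes that x is estimated as the ineligible share of cases with known eligibility, i / (c + e + i), and that the denominator counts only the share of unknown-eligibility cases presumed eligible. The function names and the disposition counts are hypothetical.

```python
def estimated_ineligible_share(c: int, e: int, i: int) -> float:
    """Assumed estimator of x: ineligible cases as a share of known-eligibility cases."""
    known = c + e + i
    return i / known if known else 0.0


def response_rate(c: int, e: int, i: int, u: int) -> float:
    """Completes divided by all cases that are eligible or presumed eligible."""
    x = estimated_ineligible_share(c, e, i)
    presumed_eligible_unknowns = (1 - x) * u
    return c / (c + e + presumed_eligible_unknowns)


# Hypothetical disposition counts, for illustration only.
rate = response_rate(c=450, e=900, i=50, u=100)
print(f"Estimated response rate: {rate:.1%}")
```

Explicitly ineligible cases (i) drop out of the denominator entirely, while unknown-eligibility cases are discounted by x rather than removed.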
Sometimes only partial interviews will be obtained due to a respondent’s breaking off an interview or completing only part of a mailed questionnaire. For the proposed data collections, CMS will follow the CAHPS standard: a questionnaire will be considered complete if responses are available for 50% or more of a selected list of key survey items – the items that all respondents are eligible to answer.
Total required sample size is a function of the purpose of a given component of this study and the desired number of completes divided by the estimated overall response rate calculated as described above. Historically, response rates for CAHPS surveys span a fairly wide range. The 2012 Commercial Health Plan CAHPS response rates are approximately 30% and Medicaid CAHPS response rates are approximately 27%. In the recent psychometric test of the CAHPS survey for Cancer Care, the response rate was 48%; for the Dental CAHPS psychometric test and early implementation, response rates ranged from as low as 40% to as high as 70% in some population segments. Based on experience with psychometric tests of several different CAHPS instruments, and in light of the relatively low response rates obtained with the Medicaid population (which is similar to the Marketplace population) and with the Commercial Health Plan CAHPS survey (which is similar to our QHP Enrollee survey), CMS will assume the overall response rate will be 30%. Once the Marketplace survey data collection is completed and CMS has empirical data on the actual response rates obtained for that survey, this response rate assumption may be revised for the QHP Enrollee data collection in 2014. The results of both survey data collections will also inform the response rate assumptions for the data collections planned for 2015.
The Spanish and Chinese versions of the survey will only be made available via mail. This decision was made because the small Chinese and Spanish samples proposed do not justify the expense of developing Chinese and Spanish versions of the CATI and Internet programs. The lack of a phone option for individuals who prefer Spanish and Chinese may reduce the likelihood that those individuals will respond to or complete the survey, particularly in situations where non-English speakers have lower literacy, since such individuals are more likely to be able to complete a phone survey than a mail or web survey. Response rates for non-English consumers may thus be lower than 30%. This limitation may also have the effect of excluding some of the most vulnerable populations from the psychometric test data collection: non-English speakers who have trouble reading or writing in English or their native language. To address this, we plan to test whether response rates vary significantly by race, ethnicity, and language among consumers for whom these variables are available in the sampling frame.
Our sample size estimates for the Marketplace Survey psychometric test reflect the sample sizes necessary for fully evaluating reliability and validity of the instrument.
Reliability testing will include the evaluation of:
Internal consistency reliability (ICR) of proposed composites (as indicated by Cronbach’s alpha)
Equivalence reliability, which tests the consistency of measures across mode and language
Unit-level reliability, which tests the extent to which a measure score differentiates signal (i.e., differences in scores across reporting entities, such as Marketplaces or QHPs) from noise (i.e., random measurement error); also referred to as inter-unit reliability (IUR)
Face validity (the survey questions are representative of the concepts they are supposed to reflect) has been established via the formative research – the review of existing instruments, focus groups, input from a technical expert panel and other stakeholders, and the cognitive testing (described in Section 4 below).
Construct validity will be assessed using confirmatory factor analysis (CFA) and multi-trait analysis. The CFA tests the fit of the data to the factor structure, generates factor loadings, and performs statistical tests of those loadings. The multi-trait analysis compares the correlations of items with their composite total (correcting for overlap3) to the correlations of those items with competing composites, and is an indicator of discriminant validity.
In CAHPS, there are two statistics used to assess unit-level reliability.4 One is a measure of IUR based on the F-statistic from an analysis of variance (ANOVA). The IUR is equal to (F-1)/F, which is a summary measure of the between-unit variance minus the within-unit variance over the between-unit variance.5 The other measure is the intra-class correlation (ICC), which is also calculated using statistics produced by an ANOVA. The ICC in this context is the between-unit variance minus the within-unit variance over the total variance adjusted for the average number of respondents per reporting unit.6 The IUR provides the reliability based on the sample size associated with the data, while the ICC indicates the reliability of a measure for a single respondent. The reliability coefficient can take any value from 0.0 to 1.0, where 1.0 signifies a measure for which every respondent reports an experience identical to every other respondent evaluating the same unit. Scales with reliability coefficients above 0.70 provide adequate precision for use in statistical analysis of unit-level comparisons,7 though it has been argued that measures with reliability coefficients of at least 0.90 are optimal.8
Since unit-level reliability is partly a function of sample size, the IUR allows for the calculation of the number of respondents needed per reporting unit to obtain a particular level of reliability (similar to a power analysis) in future data collections, and thus it is especially important with respect to future respondent burden.9 For the psychometric test, it is not necessary to obtain an IUR of at least 0.70 for the final recommended measures. However, to be useful for making sample size recommendations for future rounds of data collection, past experience demonstrates that it is best to have data from all accountable units when the universe of accountable units is finite (as with the FFM states); where the universe of accountable units is theoretically not finite (as with QHPs), it is best to have data from at least 30 accountable units selected across the full range of unit performance (i.e., from the poorest performing units to the best performing units).
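The relationship between the two reliability statistics, and the way they can be turned into a per-unit sample size recommendation, can be sketched as follows. This is an illustration under stated assumptions: the mean squares are hypothetical, the ICC is written in its one-way ANOVA form, and the projection step uses the Spearman-Brown relation, which is a standard choice but is not named in this statement.

```python
import math


def iur_from_f(f_stat: float) -> float:
    """Inter-unit reliability at the observed sample size: (F - 1) / F."""
    return (f_stat - 1) / f_stat


def icc_from_mean_squares(msb: float, msw: float, n_avg: float) -> float:
    """Single-respondent reliability from between (msb) and within (msw) mean squares."""
    return (msb - msw) / (msb + (n_avg - 1) * msw)


def respondents_needed(icc: float, target: float) -> int:
    """Respondents per unit needed to reach a target reliability (Spearman-Brown, assumed)."""
    return math.ceil(target * (1 - icc) / (icc * (1 - target)))


# Hypothetical ANOVA output: 50 respondents per unit on average.
msb, msw, n_avg = 4.0, 1.0, 50
iur = iur_from_f(msb / msw)
icc = icc_from_mean_squares(msb, msw, n_avg)
n_for_070 = respondents_needed(icc, target=0.70)
print(f"IUR={iur:.2f}, ICC={icc:.4f}, respondents needed for 0.70: {n_for_070}")
```

In this hypothetical case the measure already exceeds 0.70 with 50 respondents per unit, so future rounds could sample fewer respondents per unit and still meet that threshold.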
Our sample size recommendations are based on our estimate of the minimum number of responses per equivalence group (i.e., mode and language groups) needed at the national level to conduct the psychometric analyses described above. This estimate is described in more detail immediately below. In order to evaluate unit-level reliability, which requires that we have a consistent number of completed surveys from each state, we propose distributing the total national sample evenly across the 36 states in the FFM. This strategy is described in more detail in Section 1.1.3.1, which describes our proposed sampling methods for the Marketplace survey psychometric test.
Sample Size
Factor analyses, multi-trait analyses, and the estimates of equivalence and internal consistency reliability will all be conducted separately for each survey administration mode using all complete responses from eligible sample members across the whole nation. The generalizability of the results from this psychometric analysis is obtained by attempting to capture the full range of experiences, and thus potential response patterns, in the population. Standard psychometric practice is to obtain a minimum of 10 complete responses for each assessment item that will be used in the psychometric analysis (this includes substantive questions that will be combined into composites, but not screeners, ‘About You’ items, or questions designed to determine survey eligibility). This recommendation is grounded in sound measurement theory10 and practice in the statistical analysis of multivariate data (including factor analyses).11
At this time, the Marketplace survey includes 30 assessment items, which translates into a minimum of 300 completed surveys nationwide, assuming that each completed survey contains a non-missing response for each substantive item. However, given that some substantive items will be legitimately skipped by respondents to whom the subject matter of the item does not apply, this number will need to be larger. In addition, some completed surveys may still have some degree of item non-response (when a respondent skips an item that he/she should have answered). Thus, we propose to obtain a minimum of 15 complete responses for each assessment item. This translates into a minimum of 450 completes (15*30) per group, since psychometric analyses will be conducted separately for each group. For surveys conducted in English, there are five mode experiment groups (telephone-only, mail with telephone follow-up, mail-only with 1st class mail follow-up, mail-only with FedEx follow-up mailing, and web-only), and thus we need a minimum of 2,250 completed surveys to conduct psychometric testing for each of the five modes. In addition, we would want 450 completed surveys each for the Spanish and Chinese surveys to conduct psychometric analyses separately for each language (only one administration mode is planned for Spanish and Chinese). This approach would result in an overall total of 3,150 completed surveys (2,250 in English, 450 in Spanish, and 450 in Chinese).
Exhibit B1 shows the distribution of the English language completes across the five experimental mode design groups, plus the required number in Chinese and Spanish.
Equal numbers of consumers who indicate that English is their preferred language will be randomly sampled within each State and then will be randomly assigned across the five mode groups so as to obtain 450 completed surveys in each of the five experimental groups. For IUR analyses that include more than one mode, we would control for mode in the models so as not to confound mode differences with differences among States. Although state-level analysis will be conducted to assess inter-unit reliability, only the aggregate results of this analysis will be discussed or disseminated.
Exhibit B1. Sample Sizes and Completed Survey Counts for the Marketplace Psychometric Test
| Mode† | Target Number of Completed Surveys | Total Number to Sample | 
| Exp 1. Phone only | 450 | 1,500 | 
| Exp 2. Mail with phone | 450 | 1,500 | 
| Exp 3. Mail only with third survey mailed Fed Ex | 450 | 1,500 | 
| Exp 4a. Web only – email and pre-notification letter | 225 | 750 | 
| Exp 4b. Web only – email only | 225 | 750 | 
| Exp 5. Mail only | 450 | 1,500 | 
| Total English | 2,250 | 7,500 | 
| Non-English | | | 
| Spanish (mail only) | 450 | 1,500 | 
| Chinese (mail only) | 450 | 1,500 | 
| Overall Total | 3,150 | 10,500 | 
| † Mode experiments will be conducted in English only. All modes other than the mail-only mode (Exp. 5) will be available only to respondents whose language preference is English. | ||
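The counts in Exhibit B1 follow from three inputs given in the text: 30 assessment items, 15 complete responses per item, and an assumed 30% response rate. The short sketch below reproduces that arithmetic; the constant and function names are illustrative only.

```python
# Inputs taken from the text; names are illustrative.
ASSESSMENT_ITEMS = 30
RESPONSES_PER_ITEM = 15
RESPONSE_RATE_PCT = 30      # assumed overall response rate
ENGLISH_MODE_GROUPS = 5
NON_ENGLISH_GROUPS = 2      # Spanish and Chinese, mail only


def sample_needed(completes: int, response_rate_pct: int) -> int:
    """Inflate a target number of completes for non-response, rounding up."""
    return -(-completes * 100 // response_rate_pct)


completes_per_group = ASSESSMENT_ITEMS * RESPONSES_PER_ITEM
sample_per_group = sample_needed(completes_per_group, RESPONSE_RATE_PCT)

english_completes = completes_per_group * ENGLISH_MODE_GROUPS
total_completes = english_completes + completes_per_group * NON_ENGLISH_GROUPS
total_sample = sample_per_group * (ENGLISH_MODE_GROUPS + NON_ENGLISH_GROUPS)

print(completes_per_group, sample_per_group)   # per-group completes and sample
print(english_completes, total_completes, total_sample)
```

Integer arithmetic is used for the inflation step so that rounding up is not affected by floating-point error in the response rate.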
Limitations
The psychometric test component of this information collection is not designed to provide state-level estimates.
CMS will not be able to evaluate psychometric properties of the instrument among the 15 SBMs. While this is a serious weakness, it is unavoidable at this time.
This component of the implementation involves the initial fielding of the full national sample that is available to CMS in 2015. The estimation of sample size for this phase of the Marketplace survey will be driven by sample size estimates that result from the IUR analysis described above, as well as the analysis and reporting goals associated with this round of data collection (see Exhibit A1 in Part A). We will not know the former until after the psychometric test analyses have been completed.
As described above, to have a sufficient number of responses for analysis and reporting based on surveys where respondents may interact with a number of different individuals or systems, such as with a health plan or a clinician group, CAHPS generally recommends obtaining completed questionnaires from 300 respondents per reporting entity.12 These estimates are based on analyses conducted on existing CAHPS data to determine the number of completed responses needed to provide power sufficient to detect differences between one reporting entity (e.g., a health plan) and the mean of all other reporting entities in a given sample. These differences are the basis of the standard CAHPS “star rating,” which identifies reporting entities as being below average, average, or above average.
We have assumed that interactions with a Marketplace or a QHP will be analogous to this heterogeneous experience, which implies that 300 completed responses per Marketplace would be sufficient for standard CAHPS analysis and reporting activities. However, regulatory oversight requires CMS to determine if disparities in consumer experiences by race, ethnicity, income, and disability exist within each State. Subgroup analyses would involve, for example, comparing the experiences of a small group in a given State, such as Hispanics, to a large group in that State, such as non-Hispanics. To meet this oversight requirement, a greater number of complete responses will be needed from each Marketplace.
To accommodate this objective in the beta test, CMS proposes sampling to obtain 1,200 English-language completes in each State. Assuming a 30% response rate, the initial sample size in each State will be 4,000.13 With a stratified random sample yielding 1,200 completes in each State, any subgroup comprising at least 5% of the sampling frame for a given State will contribute about 60 completed surveys to their State’s total number of completions (5% of 1,200 = 60). With respect to the power to detect subgroup differences within a State, moderate effect sizes (~0.40) can be detected with as few as 60 completed surveys per group. Thus, the proposed sample size is based on the goal of being able to detect subgroup differences in experiences associated with moderate effect sizes for any group comprising at least 5% of a State’s population.
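The per-State figures above can be checked with a short calculation. The sketch below inflates the 1,200 target completes for a 30% response rate, derives the expected completes for a subgroup making up 5% of the frame, and approximates the power to detect a standardized difference of 0.40 between that subgroup and the rest of the State. The power formula is a simple two-sample normal approximation chosen for illustration; the statement does not say which method CMS used.

```python
from statistics import NormalDist

COMPLETES_PER_STATE = 1200
RESPONSE_RATE_PCT = 30
SUBGROUP_SHARE_PCT = 5

# Initial sample per State, rounded up (integer arithmetic avoids float error).
initial_sample = -(-COMPLETES_PER_STATE * 100 // RESPONSE_RATE_PCT)
subgroup_completes = COMPLETES_PER_STATE * SUBGROUP_SHARE_PCT // 100
other_completes = COMPLETES_PER_STATE - subgroup_completes


def approx_power(effect_size: float, n_small: int, n_large: int, alpha: float = 0.05) -> float:
    """Approximate two-sided power for a standardized mean difference (normal approximation)."""
    z = NormalDist()
    standard_error = (1 / n_small + 1 / n_large) ** 0.5
    z_crit = z.inv_cdf(1 - alpha / 2)
    # The negligible probability of rejecting in the wrong direction is ignored.
    return z.cdf(effect_size / standard_error - z_crit)


power = approx_power(0.40, subgroup_completes, other_completes)
print(initial_sample, subgroup_completes, round(power, 3))
```

Under these assumptions, a 60-person subgroup compared with the remaining 1,140 respondents yields power of roughly 0.85 for an effect size of 0.40.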
This approach will yield a total national sample of 204,000 consumers resulting in a total of 61,200 completed surveys nationwide. For this component, CMS will not distribute the survey in Spanish and Chinese by default, but will administer Spanish and Chinese versions to respondents who request surveys in these languages. It is estimated that this approach will result in just over 6,000 (10%) surveys completed in Spanish and around 1,200 (2%) surveys completed in Chinese (see Exhibit B2). Because the anticipated number of survey completes in English is so large, no adjustment needs to be made for the reduction in the number of completes due to language preferences.
Exhibit B2. Distribution of Marketplace Surveys by Language for Early Feedback/Beta Test Component
| Group | Total Completed Surveys | 
| Total Nationwide (sample = 204,000) | 61,200 | 
| Spanish (assume 10% of Total) | 6,120 | 
| Chinese (assume 2% of Total) | 1,224 | 
| English (remainder) | 53,856 | 
CMS has determined the sample size based on CAHPS recommendations related to the ranking of entities and incorporating the specific demands of oversight and QI outlined above. Thus, the sections below:
Describe the precision of point estimates associated with various sample sizes, and
Describe, in the context of detecting differences between a single State and the mean of all 51 States (i.e., assigning star ratings), the effect sizes associated with various sample sizes.
State-level and national-level estimates both rely on the precision of point estimates for the survey measures (composites, overall ratings, and single item measures). Precision is defined in terms of the margin of error, which is also known as the “half-width” of the confidence interval (typically a 95% confidence interval). The margin of error for a 95% confidence interval (CI) is equal to the standard error of the point estimate multiplied by 1.96 (the margin of error for a 68% CI would be equal to one standard error; the margin of error for a 99% CI would be equal to 2.58 standard errors). Thus, the margin of error is used to construct the CI around the point estimate and describes the range within which we can be confident the true score lies.
We estimated confidence interval precision using PROC POWER in SAS. This approach is analogous to a traditional power analysis, with the margin of error (“CI Half-Width” in SAS) taking the place of effect size and the half-width probability (“Prob (Width)” in SAS) taking the place of power. Using estimates of a range of variances and standard errors observed from some existing CAHPS surveys (e.g., the psychometric test of the draft CAHPS survey for Cancer Care, the NCQA National Distribution of 2009 Adult Medicaid CAHPS Plan-Level Results, and the 2013 Medicare Part C Report Card results) as inputs, we estimated the sample sizes associated with different levels of precision. Note that CMS has decided on a target number of completes based on standard CAHPS recommendations in combination with the oversight requirements for scoring small subgroups in each State. Thus, this analysis is designed to illustrate the level of precision that can be obtained with those samples under several scenarios.
We used a conditional probability approach (that is, the probability of achieving the desired precision is calculated conditionally given that the true mean is captured by the interval), which is a more conservative approach than the unconditional probability approach. To anchor the margins of error and variance estimates (expressed as standard deviations) to a meaningful CAHPS scale, we have transformed observed scores for the three different types of measures from the existing CAHPS results mentioned above into a 100-pt scale. This transformation expresses the inputs to the power analysis in a scale that is comparable across different types of measures.
To express measures on a 100-pt scale, composites and single item measures are transformed from their original 3-pt or 4-pt scales using a simple linear transformation based on expressing the observed score as a percentage of the distance from the floor to the ceiling of a scale:
$$\text{Transformed score} = \frac{\text{observed score} - \text{scale minimum}}{\text{scale maximum} - \text{scale minimum}} \times 100$$
For a 4-pt CAHPS scale (1=never, 2=sometimes, 3=usually, 4=always) with a mean of 3.5, the transformation would look like this, for example:
$$\frac{3.5 - 1}{4 - 1} \times 100 = 83.3$$
Dichotomous scales where 0=no and 1=yes are simply multiplied by 100 (e.g., if 72% of respondents answer ‘yes’ to the item, the transformed score is 72). Overall ratings, which range from 0 to 10, are simply multiplied by 10 (e.g., a mean of 9.3 becomes 93).
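The three transformations described above are all instances of the same floor-to-ceiling rescaling, which the following sketch makes explicit. The function name is illustrative; the example values are the ones used in the text.

```python
def to_100_point(score: float, floor: float, ceiling: float) -> float:
    """Express a score as a percentage of the distance from scale floor to ceiling."""
    return (score - floor) / (ceiling - floor) * 100


four_point_mean = to_100_point(3.5, floor=1, ceiling=4)    # 4-pt never/always scale
yes_proportion = to_100_point(0.72, floor=0, ceiling=1)    # dichotomous no/yes item
overall_rating = to_100_point(9.3, floor=0, ceiling=10)    # 0 to 10 overall rating

print(round(four_point_mean, 1), round(yes_proportion, 1), round(overall_rating, 1))
```

Because dichotomous items have a floor of 0 and a ceiling of 1, and overall ratings have a floor of 0 and a ceiling of 10, the general formula reduces to multiplying by 100 and by 10 respectively.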
As an example of the proposed approach, consider a sample size estimation assuming a goal of having a half-width probability (power) of 0.80, an alpha of 0.05, and a half-width (margin of error) no greater than 3 points. With these parameters, the power analysis is estimating the number of completes needed to have an 80% chance of obtaining a 95% CI with +/- 3 point margin of error. To put this example in more concrete terms, with an observed score of 83.3 from a sample size calculated using the above inputs, there would be a 95% chance that the true score in the population would be between 80.3 and 86.3, and only a 5% chance that it would be outside of that range.
Exhibit B3 displays the number of completed surveys associated with some different combinations of half-widths (margins of error) and population variances (expressed as standard deviations). This exhibit illustrates the impact of sample size on precision and, thus, indicates the level of precision that might be obtained with the sample sizes proposed for the Marketplace beta test. Observed standard deviations from several of the CAHPS sources consulted ranged from approximately 2 to 28 points for measures on a 100-point scale. Observed standard errors ranged from around 0.30 to 3.2, which represent margins of error of approximately 0.60 to 6.3 points (on a 100-pt scale) for a 95% CI.
Exhibit B3. Precision Associated with Different Sample Sizes and Variances
Sample Size Estimates Needed per State for 80% Half-Width Probability
| Margin of Error (+/-) | SD = 5 | SD = 10 | SD = 15 | SD = 20 | SD = 25 | SD = 30 |
| 1 | 110 | 410 | 902 | 1,585 | 2,461 | 3,530 | 
| 2 | 32 | 110 | 236 | 410 | 632 | 902 | 
| 3 | 17 | 53 | 110 | 189 | 288 | 410 | 
| 4 | 11 | 32 | 65 | 110 | 167 | 236 | 
| 5 | 8 | 22 | 44 | 73 | 110 | 155 | 
| 6 | 7 | 17 | 32 | 53 | 73 | 110 | 
As an illustration, assuming a standard deviation of 25 for an observed mean of 82, we would expect that, in a series of 100 independent random samples of at least 288 individuals (see the cell for a margin of error of 3 and a standard deviation of 25 in Exhibit B3) drawn from the same population, the 95% CI around the sample mean (e.g., 79 to 85 for an observed mean of 82) would contain the true population score in about 95 of those samples. For smaller variances, a given sample size yields better precision (e.g., with a sample size of 300 and a standard deviation of around 8 points, the margin of error would be about +/- 1 point). For a sample size of at least 1,000, the margin of error would be no more than 2 points, assuming the standard deviation were no greater than 30.
Given the proposed 1,200 completed surveys per State, even if the population standard deviation were as high as 30, the margin of error for State-level estimates would be around +/- 2 (in Exhibit B3, a margin of error of 2 with a standard deviation of 30 requires 902 completes).
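The relationship between sample size, standard deviation, and margin of error can be illustrated with the usual normal-approximation formula for the half-width of a confidence interval. The sketch below is an assumption-laden simplification: it does not include the 80% half-width probability requirement used for Exhibit B3, so the sample sizes it returns are somewhat smaller than those in the exhibit. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def margin_of_error(sd, n, confidence=0.95):
    """Expected CI half-width for a mean, using the normal approximation."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sd / math.sqrt(n)


def completes_needed(sd, moe, confidence=0.95):
    """Smallest n whose expected half-width is at most `moe`.

    Simplified: ignores the 80% half-width probability used in Exhibit B3.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil((z * sd / moe) ** 2)


# 1,200 completes per State with a standard deviation as high as 30
print(round(margin_of_error(sd=30, n=1200), 2))

# 300 completes with a standard deviation of about 8 points
print(round(margin_of_error(sd=8, n=300), 2))

# Simplified counterpart of the SD = 25, margin of error = 3 cell
print(completes_needed(sd=25, moe=3))
```

The same function converts a standard error into a margin of error when `n` is set to 1, which is how standard errors of 0.30 to 3.2 correspond to margins of roughly 0.60 to 6.3 points.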
As described above, one of the objectives of the full national implementation of the survey is to assign star ratings to Marketplaces and States based on their performance scores (on items, composites, and global ratings) relative to the average performance across all Marketplaces and States. If a global F-test indicates that scores vary across Marketplaces and/or States within the Federal Marketplace, the star rating is then done using a t-test of the difference between each Marketplace or State and the overall mean of all Marketplaces or States. The discussion below shows that the utility of the scoring system depends on the number of completes. In Section 3.2, we discuss methods to evaluate the possible impact of the potential non-response bias.
Using variances observed from previous CAHPS psychometric tests, CMS conducted a power analysis based on a two-sample t-test comparing the mean score on a composite (on a 100-pt scale) from one entity to the pooled mean on that composite from all entities, using a range of variances. The power analysis assumes a balanced design (same number sampled from every entity) and equal variances (single entity variance = pooled variance).14
Exhibit B4. Relationship between Sample Size, Variance, and Effect Sizes for Star Rating of Marketplaces†
| Number of Completes per State | Variance of 15: Mean Diff | Variance of 15: ES | Variance of 25: Mean Diff | Variance of 25: ES |
| 20 | 9.5 | 0.63 | 15.8 | 0.63 | 
| 50 | 6.0 | 0.40 | 10.0 | 0.40 | 
| 100 | 4.2 | 0.28 | 7.1 | 0.28 | 
| 150 | 3.5 | 0.23 | 5.8 | 0.23 | 
| 200 | 3.0 | 0.20 | 5.0 | 0.20 | 
| 300 | 2.5 | 0.16 | 4.1 | 0.16 | 
| 500 | 1.9 | 0.13 | 3.2 | 0.13 | 
| 1,200 | 1.2 | 0.08 | 2.1 | 0.08 | 
† Assumes a balanced design (same number sampled from every entity) and equal variances (single entity variance = pooled variance). ES = effect size; Mean Diff = difference in means between a single State and the mean of all States
Exhibit B4 shows mean differences between a single State and the mean of all States that could be detected with a range of completed survey counts per State, given variances (the Root Mean Square Error) of 15 and 25.15 Note that when the variance is larger, the mean differences have to be bigger to yield effect sizes of the same magnitude.
As shown, with 300 completes per State-specific subgroup and a variance of 15 points, we would have 80% power (with an alpha of 0.05) to detect a difference of 2.5 points between a single exchange and the overall mean of exchange scores (e.g., 87.5 versus 90). With a wider variance of 25 points, we could detect a difference of just over 4 points (e.g., 68 versus 72). The effect sizes associated with these differences (0.16) are relatively small, and thus a sample size of 300 per State-specific subgroup should be more than sufficient to detect any differences in performance large enough to be relevant. In fact, small effect sizes (0.28) could still be detected with as few as 100 completes per unit.
Moderate effect sizes could be detected with 50 completes per unit (a bit less than the approximate minimum number of completes we could expect in each State for small race, ethnicity, income, or issuer subgroups comprising at least 5% of a Marketplace’s population). With 1,200 completes per State, mean differences as small as 1.2 to 2.1 points could be detected, assuming variances of 15 or 25 respectively (effect sizes of 0.08, which are very small).
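As a rough cross-check on Exhibit B4, the effect size column is closely approximated by the normal-approximation formula (z for alpha + z for power) / sqrt(n), which treats the overall mean as a fixed comparison value. This is an observation about the published numbers and an assumption of this sketch, not a statement of how CMS actually ran the power analysis. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def detectable_effect_size(n, alpha=0.05, power=0.80):
    """Approximate smallest standardized difference detectable with n completes.

    Normal approximation, two-sided test, comparison mean treated as fixed
    (an assumption of this sketch).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return (z_alpha + z_power) / math.sqrt(n)


for n in (20, 50, 100, 150, 200, 300, 500, 1200):
    es = detectable_effect_size(n)
    # Mean difference = effect size x RMSE (15 or 25 in Exhibit B4)
    print(n, round(es, 2), round(es * 15, 1), round(es * 25, 1))
```

The mean differences produced this way are close to, but not identical with, the exhibit values, which is consistent with the exhibit having been generated by an exact t-based procedure rather than this approximation.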
For the English surveys, CMS will draw a stratified random sample from the sampling frame described above in Section 1.1.1.1; this will be a national sample of the FFM, with each of the 36 FFM States comprising a stratum. A total of 208 English-language consumers will be drawn from each FFM State, for a total sample of approximately 7,500 (208 x 36 = 7,488). From this sample, equal numbers of individuals will be randomly assigned to each of the five mode groups (approximately 1,500 each). We expect this strategy will produce 450 completed surveys in each of the five modes, yielding a total of 2,250 completed English-language surveys. The web-only group will be further randomly distributed such that half of the sample of 1,500 (n=750) receive both an email and a pre-notification letter while the other half (n=750) receive only an email; this strategy should produce 225 completed surveys in each of the web-only groups. See Exhibit B1 for details of the sample distributions. We expect this sampling approach to yield approximately 62 completed surveys in each of the 36 FFM States.
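The random assignment of sampled consumers to mode groups might be implemented along the following lines. This is a minimal sketch under stated assumptions: the mode labels, the function name `assign_modes`, the integer IDs, and the seed are all hypothetical placeholders, and the nominal sample of 7,500 is used for round numbers.

```python
import random

# Hypothetical labels for the five mode groups
MODES = ["mode_1", "mode_2", "mode_3", "mode_4", "web_only"]


def assign_modes(sample_ids, modes, seed=None):
    """Shuffle the sample, then deal it round-robin into equal-sized groups."""
    rng = random.Random(seed)
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    # Slice i::len(modes) gives every len(modes)-th element starting at i,
    # so group sizes differ by at most one.
    return {mode: shuffled[i::len(modes)] for i, mode in enumerate(modes)}


groups = assign_modes(range(7500), MODES, seed=2014)

# The web-only group is already in random order, so cutting it in half
# yields two random halves.
web = groups["web_only"]
web_email_and_letter = web[:750]
web_email_only = web[750:]

print({mode: len(ids) for mode, ids in groups.items()})
print(len(web_email_and_letter), len(web_email_only))
```

Shuffling once and slicing keeps the assignment reproducible from a single seed, which can help when the allocation needs to be audited later.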
For the Spanish and Chinese samples, CMS will use a systematic random sampling design to yield a sample proportional to the relative size of each group in the 36 States that are part of the FFM. In this design the sampling ratio (k) for each of two sample draws (one for Spanish and one for Chinese) will be equal to N/1,500, where N is the number of eligible individuals in the FFM portion of the sampling frame who have indicated their respective language preference in their Marketplace applications, summed across all 36 FFM States. We will then sort each sampling frame (one for each language) by State and a random number; then, using a random starting point, draw a systematic random sample (with implicit stratification by State) by selecting every kth unit from the frame, yielding a total sample size of 1,500 for each of the two language groups. As described in Section 1.1.2, the lack of a phone option for non-English speakers may negatively impact the response rates from these two populations. While the ideal is for these two samples to yield 450 completed surveys each in Chinese and Spanish, CMS is aware that the actual number of completes may be lower. To address this, CMS will test if response rates vary significantly by race, ethnicity, and language among consumers for whom these variables are available in the sampling frame.
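The systematic draw described above (sort by State and a random number, pick a random start, take every kth record) can be sketched as follows. The State counts, record IDs, seed, and function name are hypothetical and serve only to make the example runnable. A fractional interval k = N/n is used here so that the sample size comes out at exactly n, which is one common way to handle a non-integer sampling ratio.

```python
import random
from collections import Counter


def systematic_sample(frame, n, seed=None):
    """Systematic random sample of n records with implicit State stratification.

    `frame` is a list of (state, person_id) tuples.
    """
    if not 0 < n <= len(frame):
        raise ValueError("n must be between 1 and the frame size")
    rng = random.Random(seed)
    # Sort by State, then by a random number within State
    keyed = sorted((state, rng.random(), pid) for state, pid in frame)
    k = len(keyed) / n            # sampling interval (may be fractional)
    start = rng.random() * k      # random starting point in [0, k)
    last = len(keyed) - 1
    picks = [keyed[min(int(start + i * k), last)] for i in range(n)]
    return [(state, pid) for state, _, pid in picks]


# Hypothetical counts of eligible individuals with one language preference
hypothetical_counts = {"FL": 3000, "OH": 1000, "TX": 6000}
frame = [(s, f"{s}-{i}") for s, c in hypothetical_counts.items() for i in range(c)]

sample = systematic_sample(frame, 1500, seed=1)
per_state = Counter(state for state, _ in sample)
print(dict(per_state))
```

Because the frame is sorted by State before the draw, each State's share of the sample tracks its share of the frame, which is the proportional allocation the design is meant to produce.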
For the English surveys, we will draw a stratified random sample from the sampling frame described above in Section 1.1.1.1; each of the 51 States will comprise its own stratum, and a simple random sample will be drawn within each stratum. Samples of 4,000 will be drawn from each stratum to yield 1,200 completed surveys from each of the 51 States. This approach will yield a total sample of 204,000 individuals, resulting in 61,200 completed surveys. For the beta test, CMS will not sample based on language and thus will not distribute the survey in Spanish and Chinese; however, we will administer Spanish and Chinese versions to respondents who request surveys in these languages. It is estimated that this approach will result in just over 6,000 (10%) surveys completed in Spanish and around 1,200 (2%) surveys completed in Chinese.
The respondent universe is the theoretical population to which we want our findings to apply. The study population is the population to which we can gain access, and the sampling frame is the means by which we can access this study population.
The respondent universe for the psychometric test of the QHP survey is defined as any adult (age 18+) enrolled in a QHP through the FFM.16 The study population is defined as all individuals 18 years or older who have enrolled by February 1, 2014 and have been enrolled through the FFM in a QHP for 5 months or longer with no more than one 30-day break in enrollment during the 5 months. Anyone with coverage beginning later than February 1, 2014, will not have been enrolled long enough by the time sampling begins in June of 2014. The psychometric test sampling frame will be list-based and constructed from records contained in CMS databases.
There is some potential for bias in the QHP psychometric test due to website issues and enrollment problems in the first two months of open enrollment. This bias is partially mitigated by extending the eligibility period to include enrollees whose coverage begins as late as February 1, 2014. This approach will include those who enrolled anytime between October 1, 2013 and January 15, 2014. This limitation could only be mitigated further by relaxing the four-month enrollment requirement for eligibility. However, the consequence of relaxing that requirement is that fewer enrollees will have had any experiences with their plans and providers, which would make them screen out of many of the substantive survey questions. For this reason, CMS is not contemplating reducing the minimum enrollment period below four months. CMS and its contractor will, however, include the month of enrollment in analysis models to test if there are differences in patterns of responses and measurement properties over time.
The psychometric test component of this information collection is not designed to provide state-level or QHP-level estimates; as such only the aggregate results of this analysis will be discussed or disseminated.
The respondent universe for the beta test of the QHP survey is defined as any adult (age 18+) enrolled in a QHP through the Federal or State-based Marketplaces. The study population includes all individuals 18 years or older who have been enrolled in a QHP for 6 months or longer, with no more than one 30-day break in enrollment during the 6 months. The beta test sampling frames will be constructed by insurance issuers following instructions provided by CMS; the issuers will draw the samples. Sampling will be validated by a CMS contractor (Booz Allen Hamilton). This second component of the information collection (in 2015) includes a provision for early feedback to QHPs, subject to the limitations associated with response rates and data quality findings, including non-response bias analyses.
Sample size is calculated first by determining the minimum number of completed responses needed to meet the goals of the data collection, and second by inflating that number by a large enough factor to account for the estimated rate of survey non-response. Our assumptions for and approach to calculating response rates are described above in Section 1.1.1, and apply here. Response rate targets and the response rate calculation for the psychometric test of the QHP Enrollee Survey are the same as those for the psychometric test of the Marketplace survey. CMS assumes a 30% response rate.
Our sample size estimates for the QHP Enrollee Survey psychometric test reflect the sample sizes necessary for fully evaluating reliability and validity of the instrument. The reliability and validity testing for the QHP psychometric test will include the same analyses being conducted for the Marketplace Survey psychometric test (see Section 1.1.2.1 above).
As with the Marketplace survey, our sample size recommendations are based on our estimate of the minimum number of responses per equivalence group (i.e., mode and language groups) needed at the national level to conduct the psychometric analyses described in Section 1.1.2.1. This estimate is described in more detail immediately below. In order to evaluate unit-level reliability, which requires that we have a consistent number of completed surveys from each QHP, we propose distributing the total national sample evenly across a purposively selected group of 30 QHPs. This strategy is described in more detail in Section 1.2.3.1, which describes our proposed sampling methods for the QHP Enrollee survey psychometric test.
Sample Size
Factor analyses, multi-trait analyses, and the estimates of equivalence and internal consistency reliability will all be conducted separately for each survey administration mode using all complete responses from eligible sample members across the whole nation. The generalizability of the results from this psychometric analysis is obtained by attempting to capture the full range of experiences, and thus potential response patterns, in the population. As discussed in Section 1.1.2.1, standard psychometric practice is to obtain a minimum of 10 complete responses for each item that will be used in the psychometric analysis (this includes substantive questions that will be combined into composites, but not screeners, ‘About You’ items, or questions designed to determine survey eligibility).
At this time, the QHP Enrollee survey includes 40 assessment items, which translates into a minimum of 400 completed surveys nationwide, assuming that each completed survey contains a non-missing response for each substantive item. However, given that some substantive items will be legitimately skipped by respondents to whom the subject matter of the item does not apply, this number will need to be larger. In addition, some completed surveys may still have some degree of item non-response (when a respondent skips an item that he/she should have answered). Thus, we will propose to obtain a minimum of 15 complete responses for each assessment item. This translates into a minimum number of completes of 600 (15*40) for any grouping on which psychometric analyses will be conducted. For surveys conducted in English, there are five mode experiment groups (telephone-only, mail with telephone follow-up, mail-only with 1st class mail follow-up, mail-only with FedEx follow-up mailing, and web-only), and thus we need a minimum of 3,000 completed surveys to conduct psychometric testing for each mode (5*600 = 3,000). In addition, we would want 600 completed surveys each for both the Spanish and Chinese surveys to conduct psychometric analyses separately for each language.
To be useful for making sample size recommendations for future rounds of data collection, past experience demonstrates that, where the universe of accountable units is theoretically not finite (as with QHPs), it is best to have data from at least 30 accountable units selected across the full range of unit performance (i.e., from the poorest performing units to the best performing units). The CAHPS consortium recommends a minimum of 100 completed surveys per plan for the various Health Plan surveys, which should be sufficient for producing stable IUR estimates. With 30 QHPs, this translates into the requirement for a total of 3,000 completed surveys.
Taking into consideration the analysis requirements, a sample size sufficient to adequately conduct the psychometric analyses (3,000 completed surveys) will also be sufficient to evaluate unit-level reliability. Thus, CMS will sample equally across all 30 QHPs with the goal of obtaining 100 completed surveys from each QHP, for a total of 3,000 completed surveys.
Sampled consumers from each QHP will be randomly assigned to each of the five mode groups, and we would control for mode in the IUR analysis to avoid confounding mode differences with differences across QHPs. CMS will distribute the survey in Spanish and Chinese following the methods described for the Marketplace Survey psychometric test. Surveys in those languages will only be administered in the mail-only mode.
Exhibit B5 summarizes the sample size requirements for the QHP Enrollee survey psychometric test.
Exhibit B5. Sample Sizes and Completed Survey Counts for the QHP Psychometric Test
| Language and Mode Group | Target Number of Completed Surveys | Total Number to Sample |
| English Language | | |
| Exp 1. Phone only | 600 | 2,000 | 
| Exp 2. Mail with phone | 600 | 2,000 | 
| Exp 3. Mail only with third survey mailed Fed Ex | 600 | 2,000 | 
| Exp 4. Web only | 600 | 2,000 | 
| Exp 5. Mail only | 600 | 2,000 | 
| Total English | 3,000 | 10,000 | 
| Non-English | | |
| Spanish (mail only) | 600 | 2,000 | 
| Chinese (mail only) | 600 | 2,000 | 
| Total non-English | 1,200 | 4,000 | 
| Overall Total | 4,200 | 14,000 | 
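The arithmetic behind Exhibit B5 can be retraced in a short sketch: 15 complete responses per assessment item across 40 items gives the target completes per group, and dividing by the assumed 30% response rate gives the number to sample. The variable names are illustrative assumptions; the input values are taken from the text above.

```python
import math

ASSESSMENT_ITEMS = 40        # substantive items in the QHP Enrollee survey
RESPONSES_PER_ITEM = 15      # proposed minimum complete responses per item
RESPONSE_RATE_PCT = 30       # assumed response rate, in percent
ENGLISH_MODE_GROUPS = 5      # mode experiment groups
NON_ENGLISH_GROUPS = 2       # Spanish and Chinese (mail only)

# Minimum completes for any grouping analyzed separately
completes_per_group = ASSESSMENT_ITEMS * RESPONSES_PER_ITEM

# Inflate for non-response; integer percent avoids floating-point surprises
sample_per_group = math.ceil(completes_per_group * 100 / RESPONSE_RATE_PCT)

english = (ENGLISH_MODE_GROUPS * completes_per_group,
           ENGLISH_MODE_GROUPS * sample_per_group)
non_english = (NON_ENGLISH_GROUPS * completes_per_group,
               NON_ENGLISH_GROUPS * sample_per_group)
overall = (english[0] + non_english[0], english[1] + non_english[1])

print("per group:", completes_per_group, sample_per_group)
print("English:", english)
print("non-English:", non_english)
print("overall:", overall)
```

If the observed response rate in the psychometric test differs from 30%, changing `RESPONSE_RATE_PCT` shows directly how the required sample would need to be adjusted.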
The estimation of sample size for the beta test of the QHP survey will be driven by sample size estimates that result from the IUR analysis described above, as well as the analysis and reporting goals associated with this round of data collection (see Exhibit A1 in Part A). Once the analysis of the psychometric test data is complete, CMS will make final recommendations for sample size requirements to issuers and survey vendors.
As described in Section 1.1.2.2, to have a sufficient number of responses for analysis and reporting based on surveys of enrollees in health plans, CAHPS generally recommends obtaining completed questionnaires from 300 respondents per reporting plan (i.e., per accountable unit).17 With a response rate of 30%, QHPs would have to draw samples of 1,000 enrollees each; however this number will have to be updated based on the observed response rates from the psychometric test of the QHP survey.
Sampling for the QHP psychometric test will take place in two stages. First, we will draw a sample of 30 QHPs from a sampling frame of all eligible QHPs across the 36 FFM States. In order to be eligible to be in this sampling frame, a QHP will have to have a minimum number of enrollees (n=500). CMS will purposively select 30 QHPs based on several criteria, such as maximizing geographic variation, including plans for specific States that we think span the full range of likely enrollee experience, including plans that vary in the racial and ethnic composition of their enrollee populations, or ensuring that specific States are represented. A random sample of QHPs may in fact produce a set of 30 QHPs that do represent a good mix along these dimensions. This decision will be finalized once CMS has more complete data on enrollees.
Next, we will draw a simple random sample of 334 enrollees from each of the 30 QHPs sampled at the first stage, producing a total sample of just over 10,000 enrollees. The enrollees will then be randomly assigned to each of the five experimental mode groups: 2,000 to each group (see Exhibit B5).
While we use the term QHP as a semantic convenience, the operational definition of QHP for use in sampling (and ultimately reporting as well) is not as straightforward as common usage would suggest. If QHP is defined in terms of the unique Standard Component ID (SCID) provided by the HIOS system at the request of insurance issuers, then early data indicate that just over 200 issuers offer over 4,400 separate QHPs in just the FFM. Comments received from Blue Cross Blue Shield Association (BCBSA) in response to the 60-day FRN posting explained that using the SCID to define the sampling, data collection, and reporting unit would expose issuers to excessive burden by possibly requiring them to conduct dozens of separate surveys in a given state from individuals enrolled in products that are virtually identical (at least in terms of actuarial value). For example, one issuer in Arizona has 84 separate HMO plans across all five metal levels (including 30 silver plans, 22 gold plans, and 22 platinum plans), each with its own SCID; another issuer in Indiana has 137 HMO plans across the five metal levels, including 58 silver plans. BCBSA also described possible scenarios where a given issuer with a large number of plans (as defined by SCID) might have enrollments in each product offering that are small enough (n < 500) to result in a situation where that issuer would not be required to conduct any surveys at all.
Given these issues, it is apparent to CMS that the sampling and data collection unit for the QHP survey will have to be defined in terms of some aggregation of individual product offerings as defined by the SCID. Aggregating SCIDs up to the product type (EPO, PPO, POS, HMO) within issuer within state is a strategy that produces 268 unique units (this excludes child-only and dental-only plans). If all of an issuer’s offerings in a given State are aggregated up to the metal level within a product type, there are approximately 965 such units in the 36 FFM States. It is essential that we conduct the psychometric test using a level of aggregation that aligns with the level at which the national implementation results will be reported. The final decision about how to define a QHP for the psychometric test will be driven in part by the enrollment numbers produced by different aggregating strategies. The current plan is to aggregate up to the metal level within a product type for each issuer in each state (yielding 965 units from which to sample our 30 “QHPs” for the psychometric test).
For the Spanish and Chinese samples, CMS will use a systematic random sampling design to yield a sample proportional to the relative size of each group in the 36 States that are part of the FFM. In this design the sampling ratio (k) for each of two sample draws (one for Spanish and one for Chinese) will be equal to N/2,000, where N is the number of eligible individuals in the FFM portion of the sampling frame who have indicated their respective language preference in their Marketplace applications, summed across all 36 FFM States. We will then sort each sampling frame (one for each language) by State and, using a random starting point, draw a systematic random sample (with implicit stratification by State) by selecting every kth unit from the frame. As described in Section 1.1.2, the lack of a phone option for non-English speakers may negatively impact the response rates from these two populations. While the ideal is for these two samples to yield 600 completed surveys each for those consumers whose preferred language is either Chinese or Spanish, CMS is aware that the actual number of completes may be lower. To address this, CMS will test if response rates vary significantly by race, ethnicity, and language among consumers for whom these variables are available in the sampling frame.
For the beta test, HHS-approved QHP Enrollee Survey vendors will draw samples from each reporting entity using instructions and guidelines provided by CMS.
Sampling frame construction and sampling during the psychometric test will help inform final decisions about how to define reporting and sampling units. We have estimated burden for the beta test on the assumption that there will be 2,000 sampling/reporting QHPs. We will refine beta test burden estimates if necessary based on the definition of sampling/reporting QHPs as the pool of enrollees grows and a workable definition of the sampling and reporting unit is determined.
Both surveys will follow standard CAHPS procedures with respect to defining the sampling frame and determining respondent eligibility, and survey operations.18
For the Marketplace surveys and the QHP psychometric test in 2014, data will be collected by a single survey vendor; for the QHP beta test (and full implementation surveys), data will be collected by multiple approved commercial vendors on behalf of QHP issuers. The mode of administration will be mail with phone follow-up. Survey operations for both surveys will follow standard CAHPS practice:
Mail an advance letter.
Mail the questionnaire package one week after the advance letter. Include a postage-paid envelope to encourage participation.
Send a postcard reminder to nonrespondents 10 days after sending the questionnaire.
Send a second questionnaire with a reminder letter to those still not responding thirty days after the first mailing.
Begin follow-up by telephone, or send a final mail survey to nonrespondents, three weeks after sending the second questionnaire. Interviewers will attempt to locate respondents who have not responded to the mailed survey.
Telephone numbers for sample respondents will be verified prior to calling.
A maximum of 9 attempts will be made by phone.
Every effort will be made to maximize the response rate, while retaining the voluntary nature of the effort. Below are several options recommended by CAHPS for maximizing response rates that may be employed:
We will set up a toll-free number and publish it in all correspondence with respondents, assign a trained project staff member to respond to questions on that line, and maintain a log of these calls and review it periodically.
For the psychometric tests of both the Marketplace and QHP Enrollee surveys, a persuasive advance letter will be sent to the respondent. Cover letters describing the survey and encouraging participation will also be included in the survey packets. Reminder postcards will also be sent to encourage participation. The letters will be printed on CMS letterhead with an official logo and include an official signature of a representative from CMS; they will be personalized with the name and address of the intended recipient. Postcards will include an official signature of a representative from CMS.
In subsequent data collections using the Marketplace survey (beta test and national implementations in 2016 and 2017), where samples will be pulled from CMS administrative files, advance letters and cover letters will be sent on CMS letterhead and signed by the CMS privacy officer, as in the psychometric test.
For subsequent data collections for the QHP survey, both advance letters and cover letters will use the letterhead and logo of the survey vendor or, alternatively, the letterhead and logo from the QHP issuer.
The envelope will also include the appropriate official logo and include a return address; envelopes will be marked “forwarding and address correction” in order to update records for respondents who have moved and to increase the likelihood that the survey packet will reach the intended respondent.
For the telephone interviews:
Interviewers will be trained and monitored.
Interviewers will read questions exactly as worded so that all respondents are answering the same question.
When a respondent fails to give a complete or adequate answer, interviewer probes will be nondirective.
Interviewers will maintain a neutral and professional relationship with respondents. The primary goal of the interaction from the respondent’s point of view should be to provide accurate information. The less interviewers communicate about their personal characteristics and, in particular, their personal preferences, the more standardized the interview experience becomes across all interviewers.
Interviewers will record only answers that the respondents themselves choose. The instrument is designed to minimize decisions that interviewers might need to make about how to categorize answers.
The single vendor for the Marketplace surveys and the multiple vendors for the QHP Enrollee Surveys will be required to use CATI.
The mode-of-administration experiment is being conducted in the psychometric test to determine the most efficient and least burdensome modes that should be used in the subsequent surveys.
Unduplicating the samples for the Marketplace and QHP surveys is another way to improve response by minimizing burden on specific sample members who might be selected for both surveys. The psychometric test samples for both the Marketplace and QHP surveys are being drawn by CMS and its contractor, so the two samples will be unduplicated. For the beta test, the sample for the Marketplace Survey will be drawn by CMS and its contractor, but the samples for the QHP Enrollee Survey will be drawn by commercial survey vendors hired by the QHP issuers. The data will be supplied to CMS and its contractor for analysis without identifiers, so it will be impossible to unduplicate the Marketplace Survey and QHP Enrollee Survey samples beginning with the beta test or to know the extent to which duplication occurred. CMS believes that the population for the QHP sample will eventually be so large that the chances of the same individual being selected for both the QHP and Marketplace Surveys will be small, but we will not be able to estimate the likelihood of duplicate selection until the sampling frames for the beta test and subsequent annual rounds of the surveys are constructed.
As part of testing the performance of the surveys in the psychometric test, CMS will determine if the goal of 30 percent response can be achieved. The actual response rates obtained in the psychometric test will be used to adjust the response rate goals for the beta test and subsequent rounds. If 30 percent is not achieved in the psychometric test, the reliability of the surveys as determined at the national level and the ability to conduct subgroup analyses will depend on the presence of non-response bias.
3.2 Evaluating Non-Response Bias
If response rates are less than 80 percent, which we expect to be the case based on the results from other CAHPS surveys (we are targeting 30 percent), CMS will conduct nonresponse bias analyses to determine if there are systematic differences between respondents and nonrespondents in terms of demographic, Marketplace, or QHP related characteristics that could have an impact on the study outcomes. Some of the potentially related characteristics that will be available on the sampling frame for respondents and nonrespondents of both the Marketplace and QHP Enrollee surveys include: the mode of application (phone, web, in-person, or a combination), applicant status (PA, PE, EE, E), Medicaid eligibility, language preference, race, ethnicity, gender, income, disability status, and state. Additionally, CMS will know the QHP issuer, product type, and metal level for respondents and nonrespondents of the QHP Enrollee survey. Of particular interest is the extent to which response rates vary by language, state, or mode and the extent to which response rates within these groups differ by sociodemographic characteristics. For example, a nonresponse bias analysis could investigate whether the sociodemographic characteristics of the mail mode respondents and nonrespondents are systematically different. If bias is found, CMS will employ post-stratification to lessen the effects of non-response bias. CMS will also consider the possibility of conducting non-English surveys by telephone if the results of these analyses suggest that there is a significant bias associated with limiting non-English surveys to mail only.
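As an illustration of the post-stratification step mentioned above, the sketch below computes one weight per stratum so that the weighted respondent distribution matches the sampling-frame distribution. The text does not specify which variables or which weighting procedure CMS will use, so the strata (language preference), the counts, and the function name are all hypothetical.

```python
def poststratification_weights(frame_counts, respondent_counts):
    """Weight per stratum = frame share / respondent share."""
    total_frame = sum(frame_counts.values())
    total_resp = sum(respondent_counts.values())
    weights = {}
    for stratum, n_frame in frame_counts.items():
        n_resp = respondent_counts.get(stratum, 0)
        if n_resp == 0:
            raise ValueError(f"no respondents in stratum {stratum!r}")
        weights[stratum] = (n_frame / total_frame) / (n_resp / total_resp)
    return weights


# Hypothetical frame and respondent counts by language preference
frame_counts = {"English": 8000, "Spanish": 1500, "Chinese": 500}
respondent_counts = {"English": 2600, "Spanish": 300, "Chinese": 100}

weights = poststratification_weights(frame_counts, respondent_counts)
for stratum, w in weights.items():
    print(stratum, round(w, 3))
```

In this made-up example, the Spanish and Chinese strata respond at lower rates than English, so their respondents receive weights above 1 and English respondents receive a weight below 1; the weights average to 1 across respondents.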
If response rates vary by mode in the psychometric test, CMS will compute a cost per complete for each mode and relate the response rate for that mode to its unit cost to determine if the benefit in terms of better response is worth any additional cost that might be required. This assessment will be made qualitatively once we see the variation in costs and response rates among the modes. There is no a priori assumption about an acceptable benefit-cost tradeoff; however, CMS also wants to remain consistent with standard CAHPS survey administration procedures to the extent possible.
Thus far, the response rates discussed have been at the unit level, where respondents either completed or did not complete the entire survey. There is also item-level nonresponse, where a respondent answers some, but not all, of the questions they are eligible for in the survey. Although highly unlikely, if the item response rate is less than 70% for any survey questions, CMS will conduct an item nonresponse analysis similar to that discussed above for unit nonresponse, as required by Guideline 3.2.10 of the Office of Management and Budget’s Standards and Guidelines for Statistical Surveys.
4. Tests of Procedures
The survey development team conducted nine interviews with key stakeholders to help identify aspects of the Marketplaces that would be important to capture in the surveys; four focus groups with 33 individuals about their perspectives on health insurance, health care, and the new Health Insurance Marketplaces; and two rounds of cognitive testing in all three languages (English, Spanish, and Chinese) for both surveys. To avoid duplicating efforts, we relied heavily on cognitive testing that had already been done on the CAHPS questions used in the QHP Enrollee Survey and tested only new or modified questions. Thus, cognitive testing focused mainly on the Marketplace Survey. The first round of testing was conducted with proxy Marketplace users from the Massachusetts Health Connector because it had to be done before Marketplace open enrollment began. The nine interviews in each language were sufficient to understand respondents’ experiences with the Massachusetts Health Connector. The second round of testing was conducted in the first weeks of Marketplace open enrollment, when people had varying experiences with the Marketplaces. The nine respondents in each language provided a balanced perspective of positive and negative experiences interacting with the Marketplace in a variety of ways, such as on the website, over the phone, and in person. The final cognitive testing report was provided earlier as part of this submission. The CCSQ survey team worked closely with CCIIO’s state-based marketplace team, who collected state-level information about enrollees. CMS intends that the psychometric tests will verify and validate the cognitive testing and identify any additional testing needs.
The Marketplace and QHP psychometric and beta test surveys are intended to test and refine the questionnaires and survey procedures prior to the full national implementation of both surveys, with public reporting, which will take place annually beginning in 2016.
5. Statistical Consultants
This sampling and statistical plan was prepared and reviewed by staff of CMS and by the American Institutes for Research. The primary statistical design was provided by Chris Evensen, MS, of the American Institutes for Research at (919) 918-2310; Michael P. Cohen, PhD, of the American Institutes for Research at (202) 403-6453; Steven Garfinkel, PhD, of the American Institutes for Research at (919) 918-2306; and HarmoniJoie Noel, PhD, of the American Institutes for Research at (202) 403-5779.
1 For consumers applying by phone or in-person, representatives still enter their data in the web site (either Healthcare.gov or an SBM’s dedicated state-based web site), and thus we assume that a phone or in-person assisted application can be partially completed, and that a consumer applying by phone or in-person may not yet have enrolled in a QHP. Paper applications are also entered using the web site but could also be incomplete, and some applicants submitting paper applications may not yet have enrolled in a QHP at the time of sampling.
2 For the proposed data collections we classify D.C. as a “State”; hence there are references to “51 States” in this document.
3 Howard KI, Forehand GG. A method for correcting item-total correlations for the effect of relevant item inclusion. Educ Psychol Meas. 1962; 22 (4), 731-735.
4 For a discussion of the methods used to calculate the reliability of CAHPS measures, see pp. 62-63 in the document “Instructions for Analyzing Data from CAHPS® Surveys: Using the CAHPS Analysis Program Version 4.1,” Document No. 2015, updated on 04/02/2012; available here: https://cahps.ahrq.gov/surveys-guidance/docs/2015_instructions_for_analyzing_data.pdf . Much of the text in this section is based on information provided in that document.
5 Winer BJ. Statistical principles in experimental design. New York: McGraw-Hill, 1970; also Zaslavsky AM, Buntin MJB. Using survey measures to assess risk selection among Medicare managed care plans. Inquiry. 2002; 39(2), 138-151.
6 Hays RD, Revicki D. Reliability and validity (including responsiveness). In P. Fayers & R. Hays (eds.). Assessing quality of life in clinical trials: Methods and practices, 2nd ed. Oxford: Oxford University Press, 2005, 41-53.
7 Nunnally JC (1978). Psychometric theory (2nd Edition). New York: McGraw-Hill Book Company.
8 Zaslavsky AM. Statistical issues in reporting quality data: small samples and casemix variation. Int J Qual Health Care. 2001;13(6):481-488.
9 For a discussion of reliability and its relationship to sample size, see the document, “Fielding the CAHPS Clinician & Group Surveys: Sampling Guidelines and Protocols (Document No. 1033),” available here: https://cahps.ahrq.gov/surveys-guidance/docs/1033_CG_Fielding_the_Survey.pdf.
10 Nunnally JC & Bernstein IH (1994). Psychometric theory (3rd Edition). New York: McGraw-Hill, Inc.
11 Stevens J (1992). Applied multivariate statistics for the social sciences (2nd Edition). Hillsdale, NJ: Lawrence Erlbaum Associates, Publishers.
12 For health plans, CAHPS recommends a target of 300 completed surveys per plan with a minimum of 100 for reporting. See p. 65 in the document “Instructions for Analyzing Data from CAHPS® Surveys: Using the CAHPS Analysis Program Version 4.1,” Document No. 2015, updated on 04/02/2012; available here: https://cahps.ahrq.gov/surveys-guidance/docs/2015_instructions_for_analyzing_data.pdf . For clinician groups, CAHPS recommends 300 completed surveys per group. See p. 7 in the document, “Fielding the CAHPS Clinician & Group Surveys: Sampling Guidelines and Protocols,” Document No. 1033, updated on 09/01/2011; available here: https://cahps.ahrq.gov/surveys-guidance/docs/1033_CG_Fielding_the_Survey.pdf.
13 Note that CMS will revise the response rate assumptions based on the results of the psychometric test.
14 In practice, this test is conducted using a Satterthwaite unpooled t-test on the mean difference, which accounts for unequal variances. We reproduced the analyses presented in Exhibit B3 using this test and specifying different values for the single entity variance and the pooled variance. When the single entity variance is smaller than the pooled variance, the sample size required to detect mean differences of a particular magnitude tends to decrease. When the single entity variance is larger than the pooled variance, the sample size required tends to increase. However, the sample size requirements are still overwhelmingly determined by the upper limit of either variance, regardless of how unequal they are. The impact on the estimated number of completes associated with the mean differences and variances presented in the exhibit was negligible.
15 Results used for input to this power analysis were derived from a series of one-way analyses of variance (ANOVA) of CAHPS data using the entity as a single predictor and composite scores as outcomes. The mean square error (MSE) estimates the unexplained, or residual (within-entity), variance that remains after removing the portion of the total variance accounted for by the entities (the explained, or between-entity, variance); its square root (Root MSE) is the corresponding within-entity standard deviation. See pp. 63-65 of the document “Instructions for Analyzing Data from CAHPS Surveys (Document No. 2015)” available here: https://cahps.ahrq.gov/surveys-guidance/docs/2015_instructions_for_analyzing_data.pdf , for a discussion of star ratings and examples of different effect sizes obtained with different sample sizes.
16 Note: the definition of a Qualified Health Plan includes any health plan offered outside the Exchange by an issuer that is the same as a plan offered inside the Exchange. To be the “same plan” means that the health plan offered outside the Exchange has identical benefits, premium, cost-sharing structure, provider network, and service area as the QHP offered inside the Exchange. This reflects the fact that some issuers are enrolling persons in the same plan outside the Marketplace infrastructure as well as through the Marketplace. These will mainly be persons who know, without going through the Marketplace, that their income exceeds the maximum that would qualify for the Advance Payment Tax Credit and who thus enroll directly with the issuer. They constitute part of the population enrolled in the QHP, because the plan is identical. In order to represent the entire population of the QHP, they will be eligible to be sampled.
17 See p. 65 in the document “Instructions for Analyzing Data from CAHPS® Surveys: Using the CAHPS Analysis Program Version 4.1,” Document No. 2015, updated on 04/02/2012; available here: https://cahps.ahrq.gov/surveys-guidance/docs/2015_instructions_for_analyzing_data.pdf .
18 As described in Document No. 13b in the CAHPS Health Plan Reporting Kit, which is titled “Fielding the CAHPS Health Plan Survey: Commercial Version.”