A Summary of DeVellis R. (1991). Scale Development: Theory and Applications


Construct Validity: “relationship of one variable to a second variable.

Is the measure “behaving” the way that the construct it purports to measure should behave with regard to established measures of other constructs?” In sum, construct validity is a measure of how well a particular measure actually measures what it claims to measure.

Different from criterion-related validity, which is really a measure of how well one or more variables predict some sort of outcome.

Both reach the same end, though I’m not convinced.

In sum, criterion-related validity is the extent to which the measure is associated with a criterion that exists in the real world.

Known-Groups validation: Can be either construct or criterion validity, “depending on investigator’s intent”. How well does a scale differentiate members of Group A from members of Group B, judging by their scale scores?
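A known-groups comparison can be sketched roughly as follows. The scores and group labels are invented for illustration, and the choice of Welch's t statistic and Cohen's d is an assumption of this sketch, not something the book prescribes.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two independent groups (unequal variances allowed)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)) / (len(a) + len(b) - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)

# Hypothetical total scale scores for two groups expected to differ on the construct
group_a = [31, 28, 35, 30, 33, 29, 34, 32]
group_b = [22, 25, 19, 24, 21, 26, 20, 23]

t = welch_t(group_a, group_b)
d = cohens_d(group_a, group_b)
print(f"Welch t = {t:.2f}, Cohen's d = {d:.2f}")
```

A large standardized difference in the expected direction is the kind of evidence a known-groups argument relies on; how large is large enough remains a judgment call.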

Q: How strong should correlations be in order to demonstrate construct validity? There is no cutoff, but they should essentially exceed what would otherwise be found due to shared method variance alone (covariation that arises because the two variables were measured in the same way, rather than because the constructs are related).

Multitrait-multimethod Matrix: 

Used to examine construct validity (how well a measure actually measures what it claims to measure) by measuring more than one construct with more than one method, creating a matrix of correlations between the measurements. Be careful that a correlation is not strong because of “method correlation” (how it is measured) rather than construct covariation (what is being measured). If measures of theoretically related constructs correlate strongly, they show convergent validity (how strongly two measures are related, given that theory says they should be). The opposite, discriminant/divergent validity, holds when measures of unrelated constructs show no correlation. So, if two constructs are in fact unrelated, how unrelated are their measures?
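To make the matrix concrete, here is a minimal simulation sketch. The trait names, method names, and all coefficients are invented assumptions: two unrelated traits are each measured by two methods, and each method adds its own shared variance to whatever it measures.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

random.seed(0)
n = 2000  # simulated respondents

# Hypothetical latent traits (independent of each other) and method factors
traits = {t: [random.gauss(0, 1) for _ in range(n)] for t in ("anxiety", "extraversion")}
methods = {m: [random.gauss(0, 1) for _ in range(n)] for m in ("self_report", "observer")}

# Each observed measure = trait + method effect + random error
measures = {}
for t, tvals in traits.items():
    for m, mvals in methods.items():
        measures[(t, m)] = [tv + 0.5 * mv + random.gauss(0, 0.5)
                            for tv, mv in zip(tvals, mvals)]

labels = list(measures)
mtmm = {(a, b): pearson(measures[a], measures[b]) for a in labels for b in labels}

# Same trait, different methods: convergent validity
convergent = mtmm[("anxiety", "self_report"), ("anxiety", "observer")]
# Different traits, same method: correlation produced by the method alone
same_method = mtmm[("anxiety", "self_report"), ("extraversion", "self_report")]
# Different traits, different methods: discriminant validity
discriminant = mtmm[("anxiety", "self_report"), ("extraversion", "observer")]

print(f"convergent={convergent:.2f} same_method={same_method:.2f} discriminant={discriminant:.2f}")
```

In this setup the same-method correlation between unrelated traits is nonzero purely because of shared method variance, which is the baseline a convergent correlation has to exceed.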

Guidelines in Scale Development: How to create Measurement Scales

Step 1: Determine Clearly What it is You Want to Measure

-Theory as an aid to clarity: Be well versed in the theory related to the construct you want to measure, so that the content of your scale does not “drift into unintended domains”. Always think about how theory relates to the construct you want to measure. If no theory exists, think about the construct conceptually and about how it relates to other phenomena.

-Specificity as an aid to clarity: Specificity can vary along content domain, setting, or population. How specific is your scale, and how does it relate to other scales similar to yours?

-Being clear about what to include in a measure: Best to be super clear. How distinct is the creator’s construct from others’ constructs? Avoid including items that cross over into constructs other than the one you intend to measure.

Step 2: Generate an Item Pool

-Choose items that reflect the scale’s purpose; i.e. “the thing”: All items in the scale should reflect the latent variable underlying them (a variable that is not directly observable). How? The content of each item should reflect the construct you intend to measure. E.g.: In what other ways can an item be worded to tap the construct?

Items are: overt manifestations of a common latent variable that is their cause.

-Redundancy: not a bad thing. We want to find as many items as possible that capture the phenomenon. Redundancy is expected.

-Number of items: More than you plan to include in your final scale. You want good internal consistency: how strongly the items correlate with one another. ~ 3-4X more items than in your final scale. The larger the pool, the better. Can start to eliminate items if they are unclear, irrelevant, or unnecessarily similar.

-Characteristics of good and bad items: Clarity. Unambiguous. Avoid lengthy items (length leads to complexity). Consider the reading level of the audience. Avoid multiple negatives. Avoid double-barrelled items (items that convey two or more ideas). Avoid ambiguous pronoun references: who is the “their” referring to?

-Positively and negatively worded items: Used to avoid acquiescence, affirmation, or agreement bias, i.e. the tendency to agree with items regardless of their content. The problem is that negatively worded items may confuse the reader.

-Conclusion: Read above for summary.

Step 3: Determine the Format for Measurement

Decide on the format while choosing items, since the two need to be compatible. E.g.: a checklist format does not fit items written as declarative statements.

-Thurstone Scaling: Create items that are differentially responsive to specific levels of the attribute in question. The tuning (determining what level of the construct each item responds to) is done by a group of judges. It is difficult to find items that reliably respond to one specific level of the phenomenon.

-Guttman Scaling: Items tap progressively higher levels of an attribute. An item set at too high a level should be left out (smoke 1 pack, smoke 10 packs, smoke 1000 packs; delete this last one). Applicability is rather limited.

-Scales with equally weighted items: Unlike Thurstone and Guttman scales, these are built from items that are more or less equivalent detectors of a particular phenomenon.

-How many response categories? Can the respondent discriminate meaningfully between them? Avoid “somewhat” and “not very”. Odd or even number? An odd number implies a central/neutral option; an even number requires the respondent to make a commitment, even if a weak one. Neither is better.

-Specific types of response formats: many formats.

-Likert Scale: The item is a declarative sentence, followed by response options indicating degrees of agreement. Response options should be equally spaced in agreement. Statements should be worded fairly strongly, since statements that are too mild draw agreement from almost everyone, but not so strongly that they offend participants. Best used to learn opinion, attitude, belief, or other clearly studied construct.

-Semantic Differential: A response on a continuum anchored by an opposing descriptor at each end, with 7-8 lines between them; the respondent places a mark on the line that best matches their view.

-Visual Analog: Same as above, but the scale is a continuous line. Problem: different people give different meanings to the same position on the line. Strength: the format is potentially very sensitive, which helps detect small differences in a weak phenomenon. It is also difficult for respondents to recall their previous responses, which encourages true responses and reduces bias in post-manipulation studies.

-Binary Options: e.g.: agree/disagree. The shortcoming is minimal variability in responses to each item, so more items are needed to ensure high scale variability. The advantage is that they are easy to answer.

-Item Time Frames: Choose time scale actively rather than passively.

Step 4: Have Initial Item Pool Reviewed by Experts

Get a large group of people who understand your content well to review your questions.

-Confirms or invalidates your definition of the phenomenon. “How relevant is each item to what you intend to measure?”

-Evaluate clarity and conciseness of scale items.

-Helps you tap phenomenon you have failed to include.

-Final decision to include or not is yours.

Step 5: Consider Inclusion of Validation Items

-Add a couple of types of items:

1) Items that detect flaws/problems, e.g. a social desirability scale.

2) Items that pertain to the construct validity of the scale.

Step 6: Administer Items to a Development Sample

-Administer to a large number of participants, e.g. roughly 300 participants for a pool of 20 items.

-Sufficiently large to represent the population. Can run a G study to determine generalizability across different populations, so that items are generalized to everyone.

Step 7: Evaluate the Items: The Heart of Scale Development

-Initial examination of items’ performance: A good item has a high correlation with the true score of the latent variable. We cannot directly assess the true score and thus cannot compute these correlations, but we can make inferences from how the items relate to one another.

1) Highly intercorrelated items: inspect the correlation matrix.

-Reverse Scoring: Negative correlations between items? Try reverse scoring them.

-Item-scale correlations: Each item should correlate strongly with the set of remaining items. The item itself is excluded from the set it is correlated with.

-Item Variances: Relatively high item variance is good for the scale.

-Item means: A mean close to the middle of the scale (e.g. 1-7) is good. If the mean is near either extreme, the item may fail to detect certain values of the construct, and it will also tend to have low variance.

-Coefficient alpha/Covariance Matrix: MOST IMPORTANT INDICATOR OF A SCALE’S QUALITY. Indicates the proportion of variance in the scale that is in fact attributable to the true score. Can be computed from the covariance matrix, or from the inter-item correlations via the Spearman-Brown formula.

Cronbach’s alpha is a measure of internal consistency, or correlations between items, when all items are measuring the same construct. The Alpha value is also influenced by the number of items in the scale.

Alpha: should fall between 0 and 1. If it is negative, something is wrong; check whether items need to be reverse scored. Anything higher than .70 is a good alpha. If above .90, can consider shortening the scale.
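The item checks and the alpha calculation described in this step can be sketched as follows. The response data are invented, the function names are hypothetical helpers, and alpha is computed with the usual variance form, k/(k-1) * (1 - sum of item variances / variance of total scores).

```python
from math import sqrt
from statistics import mean, variance

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def reverse_score(values, low=1, high=7):
    """Flip responses on a low..high scale (1 becomes 7, 2 becomes 6, ...)."""
    return [low + high - v for v in values]

def cronbach_alpha(items):
    """items: list of columns, one list of responses per item."""
    k = len(items)
    totals = [sum(row) for row in zip(*items)]
    item_var_sum = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_var_sum / variance(totals))

def corrected_item_total(items, i):
    """Correlate item i with the sum of all other items (item i excluded)."""
    others = [col for j, col in enumerate(items) if j != i]
    rest = [sum(row) for row in zip(*others)]
    return pearson(items[i], rest)

# Hypothetical responses from 8 people on a 1-7 scale; item4 is negatively worded
item1 = [6, 5, 7, 2, 3, 4, 6, 2]
item2 = [5, 5, 6, 1, 3, 4, 7, 2]
item3 = [6, 4, 7, 2, 2, 5, 6, 3]
item4_neg = [2, 4, 1, 6, 5, 3, 1, 6]

# Alpha before reverse scoring the negatively worded item
alpha_raw = cronbach_alpha([item1, item2, item3, item4_neg])

# Alpha after reverse scoring it
fixed = [item1, item2, item3, reverse_score(item4_neg)]
alpha_fixed = cronbach_alpha(fixed)

item_total = [corrected_item_total(fixed, i) for i in range(len(fixed))]

print(f"alpha before reverse scoring: {alpha_raw:.3f}")
print(f"alpha after reverse scoring:  {alpha_fixed:.3f}")
for i, col in enumerate(fixed):
    print(f"item {i + 1}: mean={mean(col):.2f} variance={variance(col):.2f} "
          f"item-scale r={item_total[i]:.2f}")
```

With these made-up data, alpha is negative until the negatively worded item is reverse scored, which is the warning sign described above.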

Step 8: Optimize Scale Length

-Effect of scale length on reliability: Since alpha is influenced by both the correlations between items and the number of items in a scale, a scale with reliability to spare can afford to lose some items. Longer scales are more reliable; shorter ones are easier for participants.

-Effects of dropping “bad” items: If an item has a sufficiently lower than average correlation with the other items, removing that item will increase alpha.

-Tinkering with scale length: The items that have the lowest item-scale correlations should be eliminated first. But, all else being equal, the more items in the scale, the higher the alpha value, so each cut trades reliability for brevity.

-Split samples: Split the development sample of respondents into two subsamples, tinker with the scale using one subsample, and cross-validate the result on the untouched subsample. Split in half, or unevenly if you have a small sample.
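The trade-offs in this step can be illustrated with a small sketch. The average inter-item correlation of .30 and the response data are invented assumptions. The first part uses the standardized alpha and Spearman-Brown formulas, and the second recomputes alpha with each item dropped in turn.

```python
from statistics import variance

def standardized_alpha(k, mean_r):
    """Alpha for k items with average inter-item correlation mean_r."""
    return k * mean_r / (1 + (k - 1) * mean_r)

def spearman_brown(reliability, length_factor):
    """Predicted reliability when scale length is multiplied by length_factor."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)

def cronbach_alpha(items):
    """items: list of columns, one list of responses per item."""
    k = len(items)
    totals = [sum(row) for row in zip(*items)]
    return k / (k - 1) * (1 - sum(variance(c) for c in items) / variance(totals))

# Part 1: longer scales are more reliable at the same average correlation
for k in (5, 10, 20):
    print(f"{k} items, mean r = .30: alpha = {standardized_alpha(k, 0.3):.3f}")

# Halving a 10-item scale gives the same value as a 5-item scale
halved = spearman_brown(standardized_alpha(10, 0.3), 0.5)
print(f"10-item scale cut in half: {halved:.3f}")

# Part 2: hypothetical data with three consistent items and one weak item
items = [
    [6, 5, 7, 2, 3, 4, 6, 2],
    [5, 5, 6, 1, 3, 4, 7, 2],
    [6, 4, 7, 2, 2, 5, 6, 3],
    [4, 2, 3, 5, 4, 6, 3, 4],  # weak item: barely related to the others
]
alpha_all = cronbach_alpha(items)
alpha_if_deleted = [
    cronbach_alpha([col for j, col in enumerate(items) if j != i])
    for i in range(len(items))
]
best_to_drop = max(range(len(items)), key=lambda i: alpha_if_deleted[i])

print(f"alpha with all items: {alpha_all:.3f}")
for i, a in enumerate(alpha_if_deleted):
    print(f"alpha without item {i + 1}: {a:.3f}")
print(f"dropping item {best_to_drop + 1} helps most")
```

In this example dropping the weak item raises alpha, while dropping any of the consistent items lowers it, so the decision depends on both the item's correlations and the cost in length.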
