In brief
Good measurement is the foundation of good science, but making measurements can be difficult. The difficulties may be practical or technical, but it can also be hard to know whether we are measuring the right things, in the right way. Good measurements are underpinned by strong theories, are valid, minimise error, and avoid bias.
A measurement is not the same as an observation. If we observe someone weeping we might infer that they are sad; but an observation of weeping cannot itself be a measure of how sad a person is. This is because emotions (like other psychological notions, such as motivation or personality) are ‘hypothetical constructs’: we don’t see them directly, only the many effects they have on the world. To convert observations of weeping into measurements of sadness we need a theory about how sadness relates to observable events.
This theory isn’t tested by the measurement itself, and reasonable people may have different theories about how constructs relate to observable events. I might claim, for example, that sadness primarily causes ice cream consumption rather than weeping. This can lead to arguments about the construct validity of a measure: that is, do our observations really measure the things we wish them to? A common strategy for improving construct validity is to introduce diversity: if we make many different but related measurements, we can capture different aspects of the same construct.
Because any single observation can have multiple causes (weeping can be caused by onions as well as sadness), using observations to make measurements creates measurement error. The main strategy for reducing this kind of error is repetition. Repeated observations help by ‘averaging out’ other causes, giving us a clearer picture of the construct of interest. For example, if we observe someone only once, perhaps while they are chopping onions, we might mistakenly conclude that they are sad. If we observe them many times over the day, we are much less likely to be misled in this way. Diversity can help here too, because different observations may not share the same ‘other’ causes, and so each can provide an independent signal.
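To make the ‘averaging out’ idea concrete, here is a minimal simulation sketch in Python. Everything in it is hypothetical: we pretend each observation of weeping reflects a true level of sadness plus random noise from other causes, and compare a single observation with the mean of many.

```python
import random

random.seed(1)  # make the example reproducible

TRUE_SADNESS = 2.0  # hypothetical 'true' level of the construct


def observe():
    """One observation of weeping: the true signal plus noise
    contributed by other causes (onions, tiredness, ...)."""
    other_causes = random.gauss(0, 3)  # noise can swamp the signal
    return TRUE_SADNESS + other_causes


single = observe()
mean_of_100 = sum(observe() for _ in range(100)) / 100

print(f"True level:               {TRUE_SADNESS:.2f}")
print(f"Single observation:       {single:.2f}")       # can miss badly
print(f"Mean of 100 observations: {mean_of_100:.2f}")  # close to 2.0
```

With noise of this size, a single observation can miss the true value by several units, while the mean of 100 observations will typically land within a few tenths of it.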
Biases are a special kind of measurement error that can reduce validity, and one where repetition can actually make things worse. For example, when answering questionnaires people can give misleading answers, perhaps to please the experimenter or to make themselves look better. It’s not possible to avoid all bias, but it’s important to consider possible biases when interpreting our measurements.
How relaxed are you?
When we measure things like height, weight, or speed, the measurement itself seems easy: we use a tape measure or scales.
But (perhaps especially) in psychology, just making measurements can be a big part of the challenge. The difficulties can be both practical and conceptual, and we need to take care to maintain validity and minimise error and bias.
Discuss in your groups: How would you measure how relaxed someone is?
Specifically:
Make a list: Identify at least 10 observations you could make which would tell you if someone is relaxed. These must be things you could actually see or record if you were watching the person, or things you could learn by speaking with them and asking questions.
Discuss: Could some of the observations also relate to other constructs, or be caused by something else? Could our observations be subject to any biases?
To extend the exercise:
Individually: Rank-order the list of observations. Put the most relevant observations first, and the least-relevant last. If you don’t think an observation is relevant at all, leave it out of your ranking.
As a group: Compare your rankings. Do you sometimes disagree?
Think of other psychological measurements where disagreements would be very likely to occur, and why.
Self-report scales
Self-report scales have many flaws but are often the least-worst option! They are cheap, simple to implement, and may be the only way to access some psychological constructs.
Good questionnaire items take time to construct, but should be simple, use familiar language, have face validity for the construct being measured, be ‘unbiased’, and not be ‘leading’. Using a common structure is often advisable.
Taken at face value, self-report scales are deceptively simple: we ask participants questions and assume that their responses are ‘caused’ by the variable of interest.
There are many possible pitfalls with self-report measures, not all of which can be avoided.
Research has shown that participants can give biased answers for many different reasons, and may lack any true insight on which to base their answers. For a quick overview see this helpful article, and Nisbett and Wilson (1977) for a classic illustration.
But psychologists persist with self-report because it is cheap and simple, and because it may be hard to measure the constructs we are interested in any other way. Pain is a good example of this; see Chapman et al. (1985) and Katz and Melzack (1999) for a discussion.
Your task is to create a new self-report measure of one of the constructs or behaviours from your causal diagram.
The goal is to construct an instrument which we can use to predict student achievement in higher education.
As a group:
Brainstorm and make notes on all the different aspects of the chosen construct that you think should be captured by your measure (i.e. the measurable effects the construct has on the world).
For each of these aspects, write 3 or 4 questions. Write them so that, if a participant answered them honestly, their responses would reflect your underlying construct. (We will only use 6 questions in the end, but you can brainstorm more at this point.)
To keep things simple for now, your questions should ALL be answerable using the following response format:
- A number from 1 to 7 (that is, a 7-point Likert scale), where the two ends of the scale are labelled “Completely Disagree” (= 1) and “Completely Agree” (= 7).
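Once you have responses, a common way to combine them into a single score is to average the item responses, reverse-scoring any negatively worded items. The sketch below is purely illustrative: the response values, and the assumption that the fourth item is negatively worded, are hypothetical.

```python
# Hypothetical responses from one participant to six Likert items (1-7).
responses = [6, 5, 7, 2, 6, 5]

# Items where agreeing indicates LESS of the construct are
# reverse-scored: on a 1-7 scale, reversed = 8 - original.
negatively_worded = {3}  # assume the fourth item (index 3) is reversed

scored = [8 - r if i in negatively_worded else r
          for i, r in enumerate(responses)]

scale_score = sum(scored) / len(scored)  # mean of the six items
print(f"Scale score: {scale_score:.2f}")  # here, 5.83 on the 1-7 scale
```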
‘Think aloud’
Before using new self-report measures it’s important to pilot your questions with volunteers to check that they can be understood. Some questions can be confusing to participants. Examples of bad questions include double negatives and double-barrelled questions (“How enthusiastic and intelligent is your lecturer?”, for example; see Hinkin 1998 for other examples).
In your groups, practise the ‘think aloud’ procedure described and practised in class.
- One person should act as the participant, another the researcher (you can take turns and repeat).
- Note any ambiguities or confusion
- Reword your questions using this new information
TARA-Led Workshop Tasks
Piloting your scale with ‘think aloud’
In the previous workshop we identified ‘think aloud’ as a useful technique for piloting questionnaire items. If you want to read more about this, see Willis (2015, chapter 3); for a specific example of the technique in use, see Jobe and Mingay (1990).
In your original groups: Write out all of your questions in full (preferably on a computer so that they can be re-used for later exercises). Using a shared Office 365 document would be a good choice here.
All groups should split into two halves.
Find another half-group to work with (perhaps people you already know?) and join up with them.
Within each new pairing, make one half-group ‘researchers’ looking to pilot their scale and the other half volunteer ‘participants’:
Ask your new ‘participants’ to read each question aloud, and describe in their own words what they think it means, how they would answer it, and why.
Researchers: Make notes on participants’ answers. Pay particular attention to interpretations that don’t match what you expected.
Now swap roles: researchers become participants and vice versa.
Repeat the exercise above in your new roles, again making written notes.
If you find that participants are confused by any of the questions, try to find alternative wordings for them.
Put your questionnaire online
In your groups:
Finalise the questions for your questionnaire (if you haven’t already). There should be no more than 6 questions in total.
Create a new ‘quiz’ and add your questions to the form.
- Test your form and check that the data are collected in the format you expected. You must open the exported data in Excel to have a look and check it; if you like, you can also inspect the export programmatically, as in the sketch below.
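As an optional extra, this is a minimal sketch of checking an export programmatically with pandas. The filename is hypothetical (adjust it to match your actual export), and reading .xlsx files also requires the openpyxl package.

```python
import pandas as pd  # reading .xlsx also requires openpyxl

# Hypothetical filename: replace with your actual exported file.
df = pd.read_excel("responses.xlsx")

print(df.shape)    # rows = respondents, columns = questions + metadata
print(df.columns)  # check each question appears as its own column
print(df.dtypes)   # Likert answers should be numeric (1-7)
print(df.head())   # eyeball the first few responses
```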
Other important tasks to complete
See the Discourse post here and follow the instructions closely.
How does this help with the assessment? Learning how to collect data is an important component of the module; without some data you will find it hard to complete the next few workshop tasks. We will select examples of different groups’ questionnaires and use them in the subsequent teaching sessions.
References
Chapman, C. Richard, K. L. Casey, R. Dubner, K. M. Foley, R. H. Gracely, and A. E. Reading. 1985. “Pain Measurement: An Overview.” Pain 22 (1): 1–31.
Hinkin, Timothy R. 1998. “A Brief Tutorial on the Development of Measures for Use in Survey Questionnaires.” Organizational Research Methods 1 (1): 104–21.
Jobe, J. B., and D. J. Mingay. 1990. “Cognitive Laboratory Approach to Designing Questionnaires for Surveys of the Elderly.” Public Health Reports 105 (5): 518–24. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1580104/.
Katz, Joel, and Ronald Melzack. 1999. “Measurement of Pain.” Surgical Clinics of North America 79 (2): 231–52.
Nisbett, Richard E., and Timothy D. Wilson. 1977. “Telling More Than We Can Know: Verbal Reports on Mental Processes.” Psychological Review 84 (3): 231–59.
Willis, Gordon B. 2015. Analysis of the Cognitive Interview in Questionnaire Design. Oxford University Press.
It’s a convention in causal diagrams to draw things we can observe directly with square-edged boxes. Hidden or ‘latent’ variables which we can’t observe directly are drawn with rounded edges.