In brief
Good measurement is the foundation of good science, but making measurements can be difficult. The difficulties may be practical or technical, but it can also be hard to know whether we are measuring the right things, in the right way. Good measurements are underpinned by strong theories, are valid, minimise error, and avoid bias.
A measurement is not the same as an observation. If we observe someone weeping we might infer that they are sad; but an observation of weeping cannot itself be a measure of how sad a person is. This is because emotions (like other psychological notions, such as motivation or personality) are ‘hypothetical constructs’: we don’t see them directly, only the many effects they have on the world. To convert observations of weeping into measurements of sadness we need a theory about how sadness relates to observable events.
This theory isn’t tested by the measurement itself, and reasonable people may have different theories about how constructs relate to observable events. I might claim, for example, that sadness primarily causes ice cream consumption rather than weeping. This can lead to arguments about the construct validity of a measure: that is, do our observations really measure the things we wish them to? A common strategy for improving construct validity is to introduce diversity: if we make many different but related measurements, we can capture different aspects of the same construct.
Because any single observation can have multiple causes (weeping can be caused by onions as well as sadness), using observations to make measurements creates measurement error. The main strategy for reducing this kind of error is repetition. Repeated observations help by ‘averaging out’ other causes, giving us a clearer picture of the construct of interest. For example, if we observe someone only once, perhaps while they are chopping onions, we might mistakenly conclude that they are sad. If we observe them many times over the day, we are much less likely to be misled in this way. Diversity can help here too, because different observations may not share the same ‘other’ causes, and so each can provide an independent signal.
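To make the ‘averaging out’ idea concrete, here is a minimal simulation sketch in Python. Everything in it is hypothetical: we pretend each observation of weeping reflects a true level of sadness plus random noise from other causes, and compare a single observation with the mean of many.

```python
import random

random.seed(1)  # make the example reproducible

TRUE_SADNESS = 2.0  # hypothetical 'true' level of the construct


def observe():
    """One observation of weeping: the true signal plus noise
    contributed by other causes (onions, tiredness, ...)."""
    other_causes = random.gauss(0, 3)  # noise can swamp the signal
    return TRUE_SADNESS + other_causes


single = observe()
mean_of_100 = sum(observe() for _ in range(100)) / 100

print(f"True level:               {TRUE_SADNESS:.2f}")
print(f"Single observation:       {single:.2f}")       # can miss badly
print(f"Mean of 100 observations: {mean_of_100:.2f}")  # close to 2.0
```

With noise of this size, a single observation can miss the true value by several units, while the mean of 100 observations will typically land within a few tenths of it.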
Biases are a special kind of measurement error that can reduce validity, and one where repetition can actually make things worse. For example, when answering questionnaires people can give misleading answers, perhaps to please the experimenter or to make themselves look better. It’s not possible to avoid all bias, but it’s important to consider possible biases when interpreting our measurements.
How relaxed are you?
When we measure things like height, weight, or speed, the measurement itself seems easy: we use a tape measure or scales.
But (perhaps especially) in psychology, just making measurements can be a big part of the challenge. The difficulties can be both practical and conceptual, and we need to take care to maintain validity and minimise error and bias.
Discuss in your groups: How would you measure how relaxed someone is?
Specifically:
Make a list: Identify at least 10 observations you could make which would tell you if someone is relaxed. These must be things you could actually see or record if you were watching the person, or things you could learn by speaking with them and asking questions.
Discuss: Could some of the observations also relate to other constructs, or be caused by something else? Could our observations be subject to any biases?
To extend the exercise:
Individually: Rank-order the list of observations. Put the most relevant observations first, and the least-relevant last. If you don’t think an observation is relevant at all, leave it out of your ranking.
As a group: Compare your rankings. Do you sometimes disagree?
Think of other psychological measurements where disagreements would be very likely to occur, and why.
Self-report scales
Self-report scales have many flaws but are often the least-worst option! They are cheap, simple to implement, and may be the only way to access some psychological constructs.
Good questionnaire items take time to construct, but should be simple, use familiar language, have face validity for the construct being measured, be ‘unbiased’, and not be ‘leading’. Using a common structure is often advisable.
Taken at face value, self-report scales are deceptively simple: we ask participants questions and assume that their responses are ‘caused’ by the variable of interest.
There are many possible pitfalls with self-report measures, not all of which can be avoided.
Research has shown that participants can give biased answers for many different reasons, and may lack any true insight on which to base their answers. For a quick overview see this helpful article, and Nisbett and Wilson (1977) for a classic illustration.
But psychologists persist with self-report because it is cheap and simple, and because it may be hard to measure the constructs we are interested in any other way. Pain is a good example of this; see Chapman et al. (1985) and Katz and Melzack (1999) for a discussion.
Your task is to create a new self-report measure of one of the constructs or behaviours from your causal diagram.
The goal is to construct an instrument which we can use to predict student achievement in higher education.
As a group:
Brainstorm and make notes on all the different aspects of the chosen construct that you think should be captured by your measure (i.e. the measurable effects the construct has on the world).
For each of these aspects, write 3 or 4 questions. Write them so that, if a participant answered them honestly, their responses would reflect your underlying construct. (We will only use 6 questions in the end, but you can brainstorm more at this point.)
To keep things simple for now, your questions should ALL be answerable using the following response format:
- A number from 1 to 7 (that is, a 7-point Likert scale), where the two ends of the scale are labelled “Completely Disagree” (= 1) and “Completely Agree” (= 7).
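Once you have responses, a common way to combine them into a single score is to average the item responses, reverse-scoring any negatively worded items. The sketch below is purely illustrative: the response values, and the assumption that the fourth item is negatively worded, are hypothetical.

```python
# Hypothetical responses from one participant to six Likert items (1-7).
responses = [6, 5, 7, 2, 6, 5]

# Items where agreeing indicates LESS of the construct are
# reverse-scored: on a 1-7 scale, reversed = 8 - original.
negatively_worded = {3}  # assume the fourth item (index 3) is reversed

scored = [8 - r if i in negatively_worded else r
          for i, r in enumerate(responses)]

scale_score = sum(scored) / len(scored)  # mean of the six items
print(f"Scale score: {scale_score:.2f}")  # here, 5.83 on the 1-7 scale
```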
‘Think aloud’
Before using new self-report measures it’s important to pilot your questions with volunteers to check that they can be understood. Some questions can be confusing to participants. Examples of bad questions include double negatives and double-barrelled questions (“How enthusiastic and intelligent is your lecturer?”, for example; see Hinkin 1998 for other examples).
In your groups, practise the ‘think aloud’ procedure described and practised in class.
- One person should act as the participant, another the researcher (you can take turns and repeat).
- Note any ambiguities or confusion
- Reword your questions using this new information
TARA-Led Workshop Tasks
Piloting your scale with ‘think aloud’
In the previous workshop we identified ‘think aloud’ as a useful technique for piloting questionnaire items. If you want to read more about this, see Willis (2015, chapter 3); for a specific example of the technique in use, see Jobe and Mingay (1990).
In your original groups: Write out all of your questions in full (preferably on a computer so that they can be re-used for later exercises). Using a shared Office 365 document would be a good choice here.
All groups should split into two halves.
Find another half-group to work with (perhaps people you already know?) and join up with them.
Within each new pairing, make one half-group ‘researchers’ looking to pilot their scale and the other half volunteer ‘participants’:
Ask your new ‘participants’ to read each question aloud, and describe in their own words what they think it means, how they would answer it, and why.
Researchers: Make notes on participants’ answers. Pay particular attention to interpretations that don’t match what you expected.
Now swap roles: researchers become participants and vice versa.
Repeat the exercise above in your new roles, again making written notes.
If you find that participants are confused by any of the questions, try to find alternative wordings for them.
Put your questionnaire online
In your groups:
Finalise the questions for your questionnaire (if you haven’t already). There should be no more than 6 questions in total.
Create a new ‘quiz’ and add your questions to the form.
- Test your form and check that the data are collected in the format you expected. You must open the exported data in Excel to have a look and check it; if you like, you can also inspect the export programmatically, as in the sketch below.
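As an optional extra, this is a minimal sketch of checking an export programmatically with pandas. The filename is hypothetical (adjust it to match your actual export), and reading .xlsx files also requires the openpyxl package.

```python
import pandas as pd  # reading .xlsx also requires openpyxl

# Hypothetical filename: replace with your actual exported file.
df = pd.read_excel("responses.xlsx")

print(df.shape)    # rows = respondents, columns = questions + metadata
print(df.columns)  # check each question appears as its own column
print(df.dtypes)   # Likert answers should be numeric (1-7)
print(df.head())   # eyeball the first few responses
```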
Other important tasks to complete
See the Discourse post here and follow the instructions closely.
How does this help with the assessment? Learning how to collect data is an important component of the module; without some data you will find it hard to complete the next few workshop tasks. We will select examples of different groups’ questionnaires and use them in the subsequent teaching sessions.
References
Chapman, C. Richard, K. L. Casey, R. Dubner, K. M. Foley, R. H. Gracely, and A. E. Reading. 1985. “Pain Measurement: An Overview.” Pain 22 (1): 1–31.
Hinkin, Timothy R. 1998. “A Brief Tutorial on the Development of Measures for Use in Survey Questionnaires.” Organizational Research Methods 1 (1): 104–21.
Jobe, J. B., and D. J. Mingay. 1990. “Cognitive Laboratory Approach to Designing Questionnaires for Surveys of the Elderly.” Public Health Reports 105 (5): 518–24. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1580104/.
Katz, Joel, and Ronald Melzack. 1999. “Measurement of Pain.” Surgical Clinics of North America 79 (2): 231–52.
Nisbett, Richard E., and Timothy D. Wilson. 1977. “Telling More Than We Can Know: Verbal Reports on Mental Processes.” Psychological Review 84 (3): 231–59.
Willis, Gordon B. 2015. Analysis of the Cognitive Interview in Questionnaire Design. Oxford University Press.
It’s a convention in causal diagrams to draw things we can observe directly with square-edged boxes. Hidden or ‘latent’ variables which we can’t observe directly are drawn with rounded edges.