Item analysis for Science teachers — a practical guide for after the exam
How to turn a marked stack of papers into a short, honest list of what to reteach — without a spreadsheet course or an extra weekend. A step-by-step item analysis you can run after any weighted assessment.
The marking is done. You have a class set of weighted assessment papers, a final mark for each student, and a quiet feeling that there is more in this stack than the marks column is telling you. You are right — there is. The question is how to get it out without losing a weekend to a spreadsheet.
Item analysis has a slightly intimidating, exam-board sound to it. In practice it is just a structured way of asking three questions about a paper you have already marked: which questions did the class struggle with, what do the wrong answers have in common, and what should I do about it? You can do a useful version in about twenty minutes, with nothing more than the scripts, the paper, and a pen.
This guide walks through that practical version — the one that fits into a free period, not a professional-development course.
Start with the questions that surprised you
You do not need to analyse every item. The fastest way to make this worthwhile — and the fastest way to make it sustainable — is to be ruthless about where you look.
Lay out the paper and, for each question, get a rough sense of how many students got it right. You are not after decimal places. "About a third of the class" is enough. Then pull out two short lists: the five or six questions the class did worst on, and any question where the result surprised you — one you genuinely expected them to handle that they did not. That surprise is often the most useful thing on the whole paper, because it points at a gap you did not know was there.
Everything that follows, you only do for those questions.
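If your per-question tallies already sit in a mark book, the shortlisting step above can be sketched in a few lines of Python. This is only an illustration, not a required tool: the function name `shortlist`, the question labels, the counts, and the 0.5 cut-off for what counts as a surprise are all assumptions made up for the example.

```python
def shortlist(correct_counts, class_size, expected_easy=(), worst_n=5):
    """Return (worst, surprises) as lists of question labels.

    correct_counts: dict of question label -> number of students correct.
    expected_easy: questions you expected the class to handle (your judgement).
    """
    # Rough proportion correct per question; two decimal places is plenty.
    proportion = {q: round(n / class_size, 2) for q, n in correct_counts.items()}

    # The worst_n questions, lowest proportion correct first.
    worst = sorted(proportion, key=proportion.get)[:worst_n]

    # A "surprise" here means: expected to be fine, but under half the class
    # got it right. The 0.5 cut-off is an assumption, not a standard.
    surprises = [q for q in expected_easy if proportion[q] < 0.5]
    return worst, surprises


# Hypothetical tallies for a class of 32.
counts = {"Q1": 28, "Q2": 9, "Q3": 22, "Q4": 12, "Q5": 30, "Q6": 14}
worst, surprises = shortlist(counts, class_size=32,
                             expected_easy=["Q1", "Q6"], worst_n=3)
print(worst)      # ['Q2', 'Q4', 'Q6']
print(surprises)  # ['Q6']
```

The pen-and-paper version does the same job. The point is only that rough proportions and your own expectations are all the input this step needs.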
The two signals worth knowing
Item analysis has its own vocabulary, but underneath it sit two simple ideas.
Difficulty — how many got it right. Often written as a P-value (the proportion correct, from 0 to 1). A question almost everyone got right has a high P-value; one almost everyone got wrong has a low one. Neither is automatically good or bad. A very easy question might be a fair warm-up; a very hard one might be doing the job of stretching your strongest students. The number only becomes interesting when it does not match what you expected.
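As a small worked example, the difficulty of one question is just the number correct divided by the number of students. The sketch below assumes each student's result on the item is recorded as 1 (correct) or 0 (incorrect); the name `difficulty`, the sample marks, and the expected value are hypothetical.

```python
def difficulty(marks_for_item):
    """Proportion of students who got one item right (the P-value).

    marks_for_item: list of 1 (correct) or 0 (incorrect), one per student.
    """
    return sum(marks_for_item) / len(marks_for_item)


# Hypothetical class of 10 on one question.
q7 = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
p = difficulty(q7)
print(p)  # 0.3

# The number matters most when it misses what you expected.
expected = 0.8  # your own rough guess before marking (an assumption)
gap = round(expected - p, 2)
print(gap)  # 0.5
```

A gap that large on a question you thought was secure is the kind of result worth a closer look.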
Discrimination — did it sort understanding. This asks whether the students who did well on the paper overall also tended to get this particular question right. A question with strong discrimination separates secure understanding from shaky understanding cleanly. A question where your strongest and weakest students did about equally well is flagging something — usually that the question was confusing, or that the topic genuinely was not taught clearly to anyone.
You do not need to calculate a formal discrimination index to use this. A quick version: look at the scripts of your few strongest students and your few weakest on that one question. If the strong students mostly got it and the weak ones mostly did not, the question is sorting well. If both groups struggled, the problem is more likely the question or the teaching than the students.
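For anyone who does want a number, the quick strong-versus-weak check above corresponds to a simple upper-lower comparison: the proportion correct among the strongest few scripts minus the proportion correct among the weakest few. The sketch below is one way to write that down, not the only definition of discrimination. The function name, the group size `k=3`, and the sample data are assumptions for illustration, and it expects at least `2 * k` students.

```python
def discrimination(students, k=3):
    """Upper-lower discrimination for one item.

    students: list of (total_mark, got_item_right) pairs, got_item_right 1 or 0.
    k: how many strongest and weakest scripts to compare (an assumption).
    """
    # Rank scripts by total mark on the paper, strongest first.
    ranked = sorted(students, key=lambda s: s[0], reverse=True)
    top, bottom = ranked[:k], ranked[-k:]
    p_top = sum(right for _, right in top) / k
    p_bottom = sum(right for _, right in bottom) / k
    return p_top - p_bottom


# Hypothetical data: (total mark on the paper, 1 if this question was correct).
sorting_well = [(48, 1), (45, 1), (44, 1), (30, 1),
                (28, 0), (20, 0), (15, 0), (12, 0)]
both_struggle = [(48, 0), (45, 1), (44, 0), (30, 0),
                 (28, 0), (20, 0), (15, 1), (12, 0)]

print(discrimination(sorting_well))   # 1.0
print(discrimination(both_struggle))  # 0.0
```

A value near 1 means the question is sorting well. A value near 0 means both groups did about the same, which points back at the question or the teaching.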
The read that actually matters: what kind of wrong is this?
Here is where item analysis earns its keep. A low score on a question can mean three quite different things, and they need three different responses. Sorting which is which is the real skill.
It might be a misconception. If the same wrong answer turns up again and again — fifteen students all making the same error, not fifteen different ones — you are almost certainly looking at a shared misconception, not bad luck. This is the most valuable thing you can find, because it is fixable with teaching. (We keep a running list of the wrong answers that show up most reliably in primary marking in the primary science misconceptions catalogue, and the method for diagnosing and correcting them in the Science misconceptions hub.)
It might be the question. If the wrong answers are scattered — no single error dominates — or if your strongest students struggled alongside everyone else, the question itself may be the problem. Ambiguous wording, two defensible answers, a diagram that did not print clearly. This is not a teaching gap; it is a question to rewrite before you use it again.
It might be the marking. If you marked the paper over several sittings, or if two teachers split the marking, a strange pattern on one question can simply be inconsistent marking. Worth a thirty-second check before you plan a reteach you do not need.
The whole point of looking at the wrong answers — not just the scores — is that the scores alone cannot tell these three apart. The wrong answers can.
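The sorting described above can be sketched as a first-pass rule of thumb. This is a hypothetical helper, not a diagnostic: the name `triage`, the 0.5 thresholds, and the answer labels are assumptions, and the marking check cannot be automated from the answers alone, so it stays a manual step.

```python
from collections import Counter


def triage(wrong_answers, strong_wrong, strong_total, cluster_share=0.5):
    """Suggest a first bucket for one weak question.

    wrong_answers: the wrong answers students gave, as short labels.
    strong_wrong / strong_total: how many of your strongest students got it wrong.
    cluster_share: share of wrong answers that counts as "the same error again
    and again". 0.5 is an assumption, not a standard.
    """
    if not wrong_answers:
        return "nothing to triage"
    # If the strongest students struggled too, suspect the question first.
    if strong_wrong / strong_total >= 0.5:
        return "check the question"
    # One dominant wrong answer suggests a shared misconception.
    answer, count = Counter(wrong_answers).most_common(1)[0]
    if count / len(wrong_answers) >= cluster_share:
        return f"likely misconception: {answer}"
    # Scattered wrong answers: no single error dominates.
    return "check the question"


clustered = triage(["B"] * 15 + ["C", "D", "D"], strong_wrong=0, strong_total=5)
scattered = triage(["B", "C", "D", "B", "C", "D", "E"], strong_wrong=1, strong_total=5)
strong_too = triage(["B"] * 10, strong_wrong=4, strong_total=5)
print(clustered)   # likely misconception: B
print(scattered)   # check the question
print(strong_too)  # check the question
```

Whatever the helper suggests, check for inconsistent marking before you act on it, and let your own reading of the scripts have the final say.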
Turning the read into a plan
Once you have sorted your weak questions into those three buckets, the next step almost writes itself.
- Misconception questions → plan a short, focused reteach built around the contrast between the wrong idea and the right one. Keep it proportionate: if only a handful of students hold it, a small-group session beats a whole-class lesson.
- Question problems → mark the item for rewriting before reuse, and do not read too much into the class's performance on it.
- Marking problems → fix the marking, and re-read the question's results once it is consistent.
This is the hinge between assessment and teaching: the move from what happened to what I will do. It is also the middle of a larger loop that runs from marking all the way through to remedial planning, which we lay out in full in "From marking to remedial: the Science assessment workflow". For the department-level version, which reads these patterns across several classes and decides what belongs in next year's scheme of work, see "Post-marking intelligence and item analysis for science departments".
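The three-bucket plan in this section can be written as a small lookup, with one extra decision for misconceptions: small group or whole class. The sketch below is illustrative only. The name `plan`, the bucket labels, and the 0.25 cut-off for "a handful" are assumptions; where that line sits is your call.

```python
def plan(bucket, students_affected=0, class_size=1, small_group_share=0.25):
    """Map a triage bucket to the next step.

    small_group_share: below this share of the class, prefer a small-group
    session. The 0.25 cut-off is an assumption, not a rule.
    """
    if bucket == "misconception":
        if students_affected / class_size < small_group_share:
            return "small-group reteach: wrong idea vs right idea"
        return "whole-class reteach: wrong idea vs right idea"
    if bucket == "question":
        return "rewrite the item before reuse; discount this result"
    if bucket == "marking":
        return "fix the marking, then re-read this question's results"
    raise ValueError(f"unknown bucket: {bucket}")


print(plan("misconception", students_affected=4, class_size=32))
# small-group reteach: wrong idea vs right idea
print(plan("misconception", students_affected=15, class_size=32))
# whole-class reteach: wrong idea vs right idea
print(plan("question"))
# rewrite the item before reuse; discount this result
```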
The common mistakes in doing item analysis
Even teachers who value this work tend to trip on the same things:
- Trying to analyse everything. You will not keep it up. Five questions, chosen well, is plenty.
- Reading the number as a verdict on the students. A hard question is not a failure; it may be a good question. Ask what the result means before you judge it.
- Treating every wrong answer as a misconception. Some are slips; some are question problems. The wrong-answer pattern is what tells them apart.
- Stopping at the analysis. The numbers are not the output. The decision about what to reteach is. An item analysis that does not change a single lesson was not worth doing.
A grounded note on what this does and does not do
Item analysis is a way of listening to a paper you have already marked. It is genuinely useful, and it is also not a verdict. The numbers are a prompt for your judgement, not a replacement for it — you know your class and the day they sat the paper in a way no analysis does. Used well, it makes your marking work harder and your reteaching land where it matters. That is the honest claim, and it is enough.
This kind of "use the evidence in front of you to decide what to teach next" is one of the most consistently supported ideas in education research — the formative-assessment work of Paul Black and Dylan Wiliam, and later syntheses by the Education Endowment Foundation, all point the same way. The detail is less important than the simple version: feedback only helps when it changes what happens next.
A printable companion, and a shortcut
If you would like a ready-made working surface for all of this, the Post-Marking Intelligence Review Pack turns the steps above into a printable set of sheets for the week after marking. It is free, and yours to keep whether or not MyScienceHOD is a fit for your school. The pack contains:
- Item-level error pattern review sheet
- Wrong-answer pattern tracker across up to four classes
- Misconception versus question-quality triage table
- Reteaching decision template scaled to the size of the gap
Done by hand, the slow part is sorting wrong answers across a stack of scripts after you have already spent hours marking. That is the part MyScienceHOD is built to compress: after you record how the class did, it drafts which questions the class struggled with and where the same wrong answer is clustering, so you can spend your time on the decision rather than the sorting. You keep final say on everything it surfaces. If you want to try it on your own class, the free Beta is open to Singapore Science teachers and departments.
Frequently asked questions
- What is a good P-value or difficulty for a question?
  There is no single good number. A very easy question (most of the class correct) and a very hard one (almost nobody correct) are both fine if they are doing a job: checking a basic idea, or stretching the strongest students. What you are really looking for is the question that surprised you, one you expected the class to handle that they did not. That gap between what you expected and what happened is the useful signal, not the raw percentage.
- Do I need software or a spreadsheet to do item analysis?
  No. For a quick, useful version you need the marked scripts, the question paper, and about twenty minutes. A spreadsheet helps if you want to track patterns across several classes or across the year, but the thinking (which questions went wrong, what the wrong answers have in common, and what to do about it) is the same with or without one.
- How is item analysis different from just looking at the class average?
  The average tells you how the class did overall. Item analysis tells you why, question by question. Two classes can have the same average while struggling with completely different things. The average cannot tell you what to reteach on Monday; the item-level read can.
- How many questions should I actually analyse?
  Far fewer than you might think. Start with the five or six questions the class did worst on, plus any one you are surprised by. Trying to analyse every item is what makes teachers give up on this entirely. A focused look at the questions that matter is more useful than a shallow pass over all of them.
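To make the point about class averages concrete, here is a tiny made-up example of two classes with the same average and different weak spots. The per-question proportions are invented, and the sketch assumes every question carries the same weight, so that the mean of the proportions stands in for the class average.

```python
# Hypothetical proportion correct per question for two classes on one paper.
class_a = {"Q1": 0.9, "Q2": 0.3, "Q3": 0.9, "Q4": 0.7}
class_b = {"Q1": 0.9, "Q2": 0.9, "Q3": 0.3, "Q4": 0.7}


def average(profile):
    """Mean proportion correct, assuming equally weighted questions."""
    return round(sum(profile.values()) / len(profile), 2)


def weakest(profile):
    """Label of the question with the lowest proportion correct."""
    return min(profile, key=profile.get)


print(average(class_a), average(class_b))  # 0.7 0.7
print(weakest(class_a), weakest(class_b))  # Q2 Q3
```

Same average, different reteach: the first class needs Q2 revisited, the second Q3.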
Sources and further reading
- Curriculum: Ministry of Education, Singapore (2023). Primary Science Syllabus.
- Research: Black, P. & Wiliam, D. (1998). Inside the Black Box: Raising Standards Through Classroom Assessment. Phi Delta Kappan.
- Research: Education Endowment Foundation (2021). Teacher Feedback to Improve Pupil Learning (guidance report).
Last reviewed for accuracy: 2026-06-24