Exam Quality Through the Use of Psychometric Analysis

February 6, 2017 Christy Terry

Exam Quality Through Use of Psychometric Analysis—A Primer

When faculty review the performance of an exam, especially a paper-based exam, they typically rely on fundamental data points to judge exam item and exam form performance: distribution of grades; mean, median, and mode scores; and frequency of wrong answers. While these data points offer a rudimentary snapshot of exam performance, they do not portray the performance of an exam item or exam form as fully as a more in-depth analysis would. Psychometrics, literally meaning mental measurement, refers to the essential statistical measures that provide exam writers and administrators with an industry-standard set of data to validate exam reliability, consistency, and quality.

This document provides a fundamental overview of these psychometric statistics, their use, and the insight they provide into exam performance when combined for a data-rich evaluation of an individual exam item or exam form. No single data point can provide a clear picture of quality; rather, multiple data points used together provide the evidence necessary to demonstrate exam quality, consistency, and reliability.

 

Industry-Standard Data

The framework for quantifying the performance of an exam and the ability of the exam-taker is often referred to as classical test theory, which holds that the reliability of an exam may be improved through the analysis of item difficulty and the quantification of exam-taker ability using statistical measures.

Over the last century, the psychometric community has validated the following data points for quantifying exam item and exam form quality:

  • Item Difficulty Index (p-value)
  • Upper Difficulty Index (Upper 27%)
  • Lower Difficulty Index (Lower 27%)
  • Discrimination Index
  • Point Bi-serial Correlation Coefficient
  • Kuder-Richardson Formula 20 (KR-20)

 

Exam Item Quality Indicators

Item Difficulty Index (p-value)

The item difficulty index is the proportion of exam-takers who answered the item correctly, expressed on a scale of 0.00 to 1.00. Because the index is a proportional value, it is commonly referred to as the p-value.

 

Interpretation

Interpretation is tied to where the value falls on the 0.00 to 1.00 scale: a higher number indicates the exam item was a mastery item or was less difficult and less discriminating; conversely, a lower number indicates the exam item was more difficult or more discriminating. Values at the extremes of the scale may indicate an item that was excessively difficult, or one that performed at mastery level and was not sufficiently discriminating or difficult.

For example, with 75 exam-takers and 55 correct answers, the p-value is 0.73 (73% of exam-takers answered the item correctly). This value is on the high end of the scale, but not at an extreme; when viewed in the context of other exam items, this item may be considered less difficult or less discriminating than other questions. Comparatively, with 75 exam-takers and 12 correct answers, the p-value is 0.16 (16% of exam-takers answered the item correctly); this indicates the item is more difficult than the item with a value of 0.73 and may not be an accurate measure of student learning, or may be of inappropriately high difficulty for the exam-takers.
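As a concrete illustration, here is a minimal Python sketch of the calculation, assuming item responses have already been scored as 1 (correct) or 0 (incorrect); the function name item_p_value is purely illustrative and not part of any particular platform.

```python
def item_p_value(item_scores):
    """Item difficulty index: the proportion of exam-takers answering correctly.

    item_scores is a sequence of 0/1 values, one entry per exam-taker.
    """
    return sum(item_scores) / len(item_scores)


# Worked example from the text: 55 correct answers out of 75 exam-takers.
scores = [1] * 55 + [0] * 20
print(round(item_p_value(scores), 2))  # 0.73
```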

As an independent data point, this value is not a clear indicator of exam item reliability, consistency, or quality; rather, when reviewed in conjunction with multiple data points, exam item reliability may be determined.

 

Upper 27% Difficulty Index

Related to the p-value described above, the upper 27% value provides the difficulty index for high performers on the exam: exam-takers scoring in the upper 27%. This statistic is calculated by first identifying the top 27% of performers on the exam, the 27% of exam-takers who earned the highest grades. Once this group is identified, a p-value is calculated using only this group's responses to the item.

 

Interpretation

The data in this difficulty index determines exam item difficulty for the top performers sitting for the exam. As the value approaches 1.00, it indicates that the highest scorers on the exam performed well on the item. If the value approaches 0.5 or below, it could indicate an issue with the item, as a large portion of the top-performing students failed to answer the question correctly. Incorrect distractors chosen with some frequency by the upper 27% can point to a potential second correct answer or to missing or conflicting information within the stem.

For example, with 75 exam-takers, 20 exam-takers would be identified as the top 27%. If 3 of them answer the item correctly, the upper 27% value is 0.15, which would indicate the item might have been too difficult even for top performers, and an item review may be necessary. If 16 exam-takers answer correctly, the upper 27% p-value is 0.80.
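A hedged sketch of how this group index might be computed, assuming each exam-taker's total exam score and per-item score are available; group_p_value is a hypothetical helper, and real platforms may break ties or round the group size differently.

```python
def group_p_value(total_scores, item_scores, fraction=0.27, group="upper"):
    """p-value of one item within the top or bottom `fraction` of exam-takers.

    total_scores[i] is exam-taker i's whole-exam score;
    item_scores[i] is 1 if exam-taker i answered this item correctly, else 0.
    """
    n = max(1, round(len(total_scores) * fraction))
    # Rank exam-takers by total exam score, highest first.
    ranked = sorted(range(len(total_scores)),
                    key=lambda i: total_scores[i], reverse=True)
    members = ranked[:n] if group == "upper" else ranked[-n:]
    return sum(item_scores[i] for i in members) / n
```

With 75 exam-takers the group size works out to 20, so 16 correct answers in the upper group gives 0.80, matching the example above; calling the same helper with group="lower" produces the lower 27% index described in the next section.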

As an independent data point, this value is not a clear indicator of exam item reliability, consistency, or quality; rather, when reviewed in conjunction with multiple data points, exam item reliability may be determined.

 

Lower 27% Difficulty Index

As with the upper 27%, the lower 27% value demonstrates exam item difficulty for low performers sitting for the exam. As the value approaches 1.00, the item is considered less discriminating or difficult.

This statistic is calculated by first identifying the lower 27% of performers on the exam, the 27% of exam-takers who earned the lowest grades. Once this group is identified, a p-value is calculated using only this group's responses to the item.

 

Interpretation

The data in this difficulty index determines exam item difficulty for the low performers sitting for the exam. As the value approaches 1.00, the item may be considered less discriminating or a mastery item.

For example, with 75 exam-takers, 20 exam-takers would also be identified as the lower 27%. If 17 of them answer the item correctly, the lower 27% value is 0.85, which would indicate the item may be less difficult for the low performers within the exam data pool. If only 4 exam-takers answer correctly, the p-value is 0.20; this may indicate the exam item was very difficult for the lower 27% of exam-takers, which, when the data is viewed in context, may be acceptable or may require further item review.

As an independent data point, this value is also not a clear indicator of exam item reliability. Contextualizing and comparing data points provides the necessary insight into exam item reliability.

 

Discrimination Index

The discrimination index provides a comparative analysis of the upper 27% and the lower 27% on a scale of -1.00 to 1.00.

The discrimination index is calculated by subtracting the lower 27% difficulty index from the upper 27% difficulty index.

 

Interpretation

Assuming exam-takers perform as expected, exam-takers in the upper 27% should outperform exam-takers in the lower 27%. The discrimination index indicates how well this assumption holds, as reflected in the following suggested guidelines:

  • 0.30 and above - Good discrimination
  • 0.10 to 0.29 - Fair discrimination; review may be necessary
  • Equal to 0 - No discrimination; the upper and lower groups performed identically (for example, all exam-takers answered the item correctly)
  • Negative value - Flawed item; remove or completely revise item

Values approaching zero indicate the upper performers and the lower performers performed similarly on this item, and the item is not an accurate discriminator. The higher the value, the better the item discriminates between high and low performers. For example, if the lower 27% value is 0.55 and the upper 27% value is 0.97, the discrimination index is 0.42 (0.97 – 0.55 = 0.42), indicating good discrimination. If the lower 27% value is 0.78 and the upper 27% value is 0.97, the discrimination index is 0.19 (0.97 – 0.78 = 0.19); this exam item is a fair discriminator but may require item review for improved discrimination.
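A minimal sketch of the arithmetic, using the two worked examples above; the function name is illustrative only.

```python
def discrimination_index(upper_p, lower_p):
    """Discrimination index: upper 27% p-value minus lower 27% p-value."""
    return upper_p - lower_p


# Worked examples from the text (rounded to avoid floating-point noise).
print(round(discrimination_index(0.97, 0.55), 2))  # 0.42 -> good discrimination
print(round(discrimination_index(0.97, 0.78), 2))  # 0.19 -> fair; review may be needed
```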

 

Point Bi-Serial Correlation Coefficient

The point bi-serial correlation measures the correlation between an exam-taker's response on a given item and how the exam-taker performed on the overall exam form. It is calculated as:

rpbs = ((Mp − Mq) / S) × √(p × q)

where:

Mp = whole-exam mean for exam-takers answering the question correctly

Mq = whole-exam mean for exam-takers answering the question incorrectly

S = standard deviation of the total exam scores

p = proportion of exam-takers answering correctly

q = proportion of exam-takers answering incorrectly
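A minimal Python sketch of the formula above, assuming dichotomous (0/1) item scoring and using the population standard deviation; some tools use the sample standard deviation instead, so reported values may differ slightly.

```python
from statistics import mean, pstdev

def point_biserial(total_scores, item_scores):
    """Point bi-serial correlation between one item and the whole-exam score.

    total_scores[i] is exam-taker i's total exam score;
    item_scores[i] is 1 if exam-taker i answered the item correctly, else 0.
    Assumes at least one correct and one incorrect response.
    """
    correct = [t for t, s in zip(total_scores, item_scores) if s == 1]
    incorrect = [t for t, s in zip(total_scores, item_scores) if s == 0]
    p = len(correct) / len(total_scores)     # proportion answering correctly
    q = 1 - p                                # proportion answering incorrectly
    mp, mq = mean(correct), mean(incorrect)  # whole-exam means (Mp and Mq)
    s = pstdev(total_scores)                 # standard deviation of total scores
    return (mp - mq) / s * (p * q) ** 0.5
```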

 

Interpretation

The point bi-serial correlation coefficient provides a scaled index ranging from -1.00 to 1.00. A point bi-serial close to 1.00 indicates a positive correlation between exam-taker performance on the item and performance on the exam, meaning: exam-takers who performed well on the exam form also performed well on this question; exam-takers who performed poorly on the item also performed poorly on the exam form.

A negative point bi-serial indicates a negative correlation between the two: exam-takers who did not perform well on the exam item performed well on the exam as a whole, and exam-takers who did perform well on the exam item did not perform well on the exam. A negative correlation indicates the exam item should undergo review, as there may be an error in the question.

When the value approaches zero, there is little correlation between performance on this exam item and performance on the exam as a whole. This may indicate the exam item assesses material outside the other learning outcomes covered by the exam, or that it was a mastery item that all, or most, of the exam-takers answered correctly.

 

Exam Form Quality Indicator

Kuder-Richardson Formula 20 (KR20)

Rather than focusing on the performance of exam items in relation to the exam-taker, the Kuder-Richardson Formula 20 score (KR20) provides scaled data—ranging from 0.00 to 1.00—on the performance and quality of the exam form as a whole; specifically, on the consistency of performance and difficulty of exam items throughout the full exam form.

 

It is calculated as:

KR20 = (k / (k − 1)) × (1 − Σ pj qj / σ²)

where:

k = number of questions on the exam

pj = proportion of exam-takers who answered question j correctly

qj = proportion of exam-takers who answered question j incorrectly (1 − pj)

σ² = variance of the total scores of all exam-takers
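A minimal sketch assuming a dichotomously scored exam stored as a matrix of 0/1 values and using the population variance of total scores; conventions differ between tools, so treat this as an illustration of the formula rather than a reference implementation.

```python
from statistics import pvariance

def kr20(score_matrix):
    """Kuder-Richardson Formula 20 for a 0/1-scored exam.

    score_matrix[i][j] is 1 if exam-taker i answered question j correctly.
    """
    n_takers = len(score_matrix)
    k = len(score_matrix[0])                     # number of questions
    totals = [sum(row) for row in score_matrix]  # each exam-taker's total score
    variance = pvariance(totals)                 # variance of the total scores
    pq_sum = 0.0
    for j in range(k):
        p_j = sum(row[j] for row in score_matrix) / n_takers  # proportion correct
        pq_sum += p_j * (1 - p_j)
    return (k / (k - 1)) * (1 - pq_sum / variance)
```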

 

Interpretation

The KR20 value is presented as a scaled value ranging from 0.00 to 1.00; as the scaled value increases, the exam form is considered more reliable and consistent.

As this statistic is dependent on the number of exam items and exam-takers, data produced by shorter exams and smaller numbers of exam-takers may not be fully reliable; it is important to keep this in mind when reviewing the characteristics of the exam in conjunction with the discrete exam items.

Interpretation of an instructor-made exam should not be held to the same standards as high-stakes licensure or certification exams, such as the Bar, GRE, NAPLEX, NCLEX, NPTE, PANCE, and USMLE. These high-stakes exams are expected to maintain consistent KR-20 scores higher than 0.80. With smaller sample sizes, course exams with a KR20 score above 0.60 to 0.65 may be considered consistent and reliable, though it is recommended to maintain scores higher than 0.70.

 

Use of Data

Context

The use of psychometric data points in context with one another is absolutely critical. No single data point should stand as the sole indicator of exam item or exam form quality, reliability, or consistency. Rather, the data provided by these individual statistics should be used in context with one another for further insight during exam item and exam form analysis.

Contextualization of data is essential to gathering appropriate evidence in the review of exam data. A poor value on one statistic, when combined with a stronger value on another, may point to an item error, a scoring error, or another issue with an exam item or exam form. For example:

  • When the upper 27% of exam-takers is divided between two different answer options on a question, a distractor may reasonably be considered correct.
  • When all exam-takers in the lower 27% mark an exam item correctly, the distractors may not be functioning as intended. In this instance, the item should be reviewed to determine whether it is truly a "mastery" item or whether item flaws are influencing the statistics.
  • When all exam-takers in the upper 27% mark an exam item incorrectly, the item may be flawed or miskeyed.
  • Should the discrimination index be low while the p-value is in an acceptable range, there may be an issue with the item distractors or the stem.
  • If the point bi-serial correlation coefficient value is low, the question may not have been properly framed to capture the desired knowledge.
As with all points of data evaluation, the context and interpretation of data are critical to making informed decisions about the success of exam-takers and the consistency and reliability of the exam form. Exam administrators should remain mindful of these points during evaluation and analysis.
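To illustrate how these statistics might be read together rather than in isolation, here is a hypothetical flagging routine; the thresholds follow the guidelines discussed above but are illustrative, not prescriptive, and any real review should still involve reading the item itself.

```python
def review_flags(p_value, upper_p, lower_p, disc_index, point_biserial):
    """Return review notes for one item by combining its statistics in context."""
    flags = []
    if disc_index < 0 or point_biserial < 0:
        flags.append("Reverse discrimination: check the key and the distractors.")
    elif disc_index < 0.10 and 0.30 <= p_value <= 0.80:
        flags.append("Acceptable difficulty but weak discrimination: review stem and distractors.")
    if upper_p <= 0.5:
        flags.append("Top performers split across options: possible second correct answer.")
    if lower_p >= 0.95 and p_value >= 0.95:
        flags.append("Near-universal success: confirm the item is intended as a mastery item.")
    return flags


# Example with hypothetical values showing a reverse-discrimination pattern.
print(review_flags(p_value=0.60, upper_p=0.45, lower_p=0.75,
                   disc_index=-0.30, point_biserial=-0.20))
```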

 

Mastery vs. Discrimination

Exam item intention is another consideration in the interpretation of the psychometric data gathered from the exam. Difficulty index, upper 27%, or lower 27% values of 1.00 may be acceptable if the item is meant to measure mastery. For example, the instructor may have spent a significant amount of time on a specific topic essential to course progress; therefore, a score of 1.00 is entirely appropriate.

Should the item be meant to discriminate knowledge (for example, the synthesis and application of multiple skills), a 1.00 value may not be appropriate, because discrimination, not universal recall, is the desired result. These same scenarios should also be considered when a point bi-serial correlation coefficient or discrimination index value is at zero.

 

Contextual Analysis

Figure 1. Item Analysis Report: Mastery or Discrimination

The item analysis report in figure 1 demonstrates an item on which nearly all exam-takers demonstrated mastery. The p-value is 0.98, with an upper 27% of 1.00 and a lower 27% of 0.98; the discrimination index is 0.04 and the point bi-serial correlation coefficient is 0.10. If the item was meant to discriminate knowledge, it may require review; however, if it was meant to measure mastery, the data indicates the item was successful.

Figure 2. Item Analysis Report: Contextual Review of the Discrimination Index and Point Bi-Serial Correlation Coefficient

With a discrimination index of 0.25 and a point bi-serial correlation coefficient of 0.22, the exam item appears to demonstrate acceptable levels of discrimination. Additional data reveals, however, a potential issue with the item: the upper 27% value is 0.52, meaning only about half of the upper 27% of the class marked the item correctly. The remaining top performers are spread across the other response options, notably option B. While the initial review of the item indicates fair discrimination, the contextual data (difficulty index and response frequency) indicates a possible issue with the exam item. Item review may be necessary.

Figure 3. Item Analysis Report: Reverse Discrimination

This item analysis report demonstrates reverse discrimination, with the lower 27% outperforming the upper 27%. As previously noted, when reverse discrimination occurs, students who answered this question correctly did worse on the exam overall, and students who answered this question incorrectly did better. An item review for this question is necessary: the distractors may not be accurately written, more than one answer may be defensible on analysis, or the item may be keyed incorrectly.

Figure 4. Item Analysis Report: Nominal Discrimination

With a p-value of 0.66, this item shows an acceptable level of difficulty; however, one-third of the class still marked the answer incorrectly. Upon further review, the data indicates an upper 27% of 0.82 and a lower 27% of 0.46. Considering these data points in context with a discrimination index of 0.36 and a point bi-serial correlation coefficient of 0.28, the data indicates this item demonstrated appropriate discrimination. Should this item have been intended as a mastery item, however, item review may be necessary.

While item analysis data provides clearer indicators of student knowledge recall, multiple considerations must be made to accurately interpret item success and exam quality, consistency, and reliability. Mitigating factors beyond the exam item and exam form also contribute to student success: perhaps supplemental readings unintentionally contradict lecture materials, perhaps students are over-analyzing item discriminators and drawing tangential conclusions as a result of a poorly written item, or perhaps the item was simply keyed incorrectly. All of these factors are beyond psychometric data; hence the need for item analysis accompanied by item review.

 

ExamSoft

The ExamSoft platform provides these statistical data points for each exam item and exam form. Collection and review of data for the improvement of teaching and learning is the core function of the data provided in exam reports.

Reports provided by ExamSoft present the full picture of exam form analysis, which empowers exam administrators to fully understand the performance of exam items and how exam-takers interpret and respond to each item within the exam form. This enables exam administrators to capture and use truly meaningful data to affect the teaching and learning process.

Capturing meaningful data is at the core of the reporting tools in the ExamSoft platform. Not only do exam administrators gather data on exams, students also have access to powerful reporting on their individual performance against learning outcomes and in comparison to their classmates. This helps students make informed and autonomous decisions about their learning processes for self-direction and improvement upon deficient skills—and increases student satisfaction and retention in turn.

ExamSoft stands ready to provide you with the data to improve teaching and learning at your institution. Learn more at learn.examsoft.com.

 

 
