Project


Detecting Heterogeneity in Complex IRT Models for Measuring Latent Traits

English title Detecting Heterogeneity in Complex IRT Models for Measuring Latent Traits
Applicant Strobl Carolin
Number 152548
Funding scheme Project funding
Research institution Psychologisches Institut Universität Zürich
Institution of higher education University of Zurich - ZH
Main discipline Psychology
Start/End 01.01.2016 - 30.06.2019
Approved amount CHF 265'921.00

Keywords (5)

Psychometrics; Item Response Theory; Recursive Partitioning; Differential Item Functioning; Test Fairness

Lay Summary (translated from German)

Lead
Unlike a person's physical characteristics, psychological traits cannot be measured directly. While height, for example, can simply be determined with a tape measure, measuring abilities or personality traits requires the construction of psychological tests or questionnaires. Such traits are therefore called latent (i.e., not directly observable). A person's responses permit reliable conclusions about the latent trait, but only if the test or questionnaire meets certain quality standards. Psychometrics, a scientific discipline at the intersection of psychology and statistics, is concerned with the mathematical description and verification of these quality standards.
Lay summary
The aim of the project is to develop new quality-assurance methods for a class of particularly flexible statistical models used to validate psychological tests and questionnaires. Models of so-called Item Response Theory permit fair comparisons between persons as long as the assumptions underlying the models are met, which is not always the case in practice. Consider, for example, a test constructed to measure mathematical competence: a verbally phrased item may be harder to solve for students with German as a second language than for students with German as their first language, even if both students have the same mathematical competence. Such an item, which exhibits so-called Differential Item Functioning, leads to biased test results and does not permit a fair comparison between students. This project therefore aims to develop statistical procedures for identifying items with Differential Item Functioning, as well as the affected groups of persons, in flexible Item Response Theory models. These procedures build on modern approaches from parametric statistics and machine learning. With their help, problematic items can be excluded during test construction or modified using the additional information about the affected groups.

Thanks to its freely available software implementation, the statistical toolbox developed in this project can be used directly to validate existing and new psychological tests and questionnaires and to uncover unfair items. It thereby enables more reliable statements about the traits of individual persons and fair comparisons between groups of persons, for example in empirical educational research.
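The freely available software implementation mentioned above has a concrete precursor from the preceding project: the R package psychotree by the applicants and collaborators. As a minimal sketch, using the example data and covariates bundled with the package rather than results of this project, a Rasch tree for DIF detection can be grown as follows:

library("psychotree")

## Example data shipped with the package: dichotomous
## general-knowledge quiz items plus person covariates.
data("SPISA", package = "psychotree")

## Grow a Rasch tree: the sample is split recursively wherever
## the item parameters are unstable with respect to the covariates,
## so each terminal node corresponds to a group with its own
## item parameter profile (i.e., a potential DIF group).
rt <- raschtree(spisa ~ age + gender, data = SPISA)

## Visualize the splits and the per-node item difficulty profiles.
plot(rt)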
Last update: 10.12.2015

Responsible applicant and co-applicants

Employees

Publications

Debelak Rudolf (2019), An Evaluation of Overall Goodness-of-Fit Tests for the Rasch Model, in Frontiers in Psychology, 9.
Debelak Rudolf, Strobl Carolin (2018), Investigating Measurement Invariance by Means of Parameter Instability Tests for 2PL and 3PL Models, in Educational and Psychological Measurement, 79(2), 385-398.
Huelmann Thorben, Debelak Rudolf, Strobl Carolin, A Comparison of Aggregation Rules for Selecting Anchor Items in Multi Group DIF Analysis, in Journal of Educational Measurement.

Collaboration

Workgroup of Prof. Dr. Achim Zeileis, Department of Statistics, University of Innsbruck, Austria (Europe)
- in-depth/constructive exchanges on approaches, methods or results
- Publication

Workgroup of Prof. Dr. Kurt Hornik, Institute for Statistics and Mathematics, WU Vienna, Austria (Europe)
- in-depth/constructive exchanges on approaches, methods or results
- Publication

Scientific events

Active participation

International Meeting of the Psychometric Society 2019 (Santiago, Chile, 15.07.2019): conference talk, "The effect of different ratios of group sizes in multi-group DIF detection", Huelmann Thorben
International Meeting of the Psychometric Society 2019 (Santiago, Chile, 15.07.2019): conference talk, "Two new nonparametric local independence tests for the Rasch model", Debelak Rudolf
Research visit at Educational Testing Service (Princeton, United States of America, 16.07.2018): individual talk, "A comparison of anchor methods for detecting DIF in multiple groups", Huelmann Thorben
International Meeting of the Psychometric Society 2018 (New York, United States of America, 09.07.2018): conference talk, "Vuong tests for model selection of mirt models", Schneider Lennart
International Meeting of the Psychometric Society 2018 (New York, United States of America, 09.07.2018): conference talk, "Anchor Point Selection", Strobl Carolin


Use-inspired outputs

Abstract

The aim of this research proposal is to develop a methodological toolbox for the statistical evaluation of group differences in complex Item Response Theory (IRT) models. IRT comprises a variety of parametric models for measuring latent (i.e., not directly observable) traits, such as abilities or attitudes. These models provide a statistical framework for testing the measurement properties of psychological tests and questionnaires and are now widely applied in psychological, medical and educational research (including the validation of psychological and psychiatric instruments as well as large-scale educational assessments such as the PISA study).

One key assumption of any test or questionnaire is that its measurement properties are the same for all subjects. If this assumption is not met, comparisons between groups of subjects (such as between male and female patients, or between countries in educational testing) may be unreliable. For example, if a test contains items that disadvantage women or an ethnic minority, this may induce apparent ability differences between the groups and lead to wrong conclusions, with severe practical consequences.

Statistical methods are therefore necessary to validate the measurement properties of a test and to identify items that need to be modified or excluded because they disadvantage certain groups of test takers. While standard approaches are limited to evaluating differences between predefined groups (such as males and females), the methods to be developed in this project are more flexible: they are not limited to groups defined by a single covariate, but can assess several potentially relevant covariates at the same time and can detect even complex non-monotone and interaction patterns (such as items that disadvantage only females of certain ages). This flexible, exploratory approach provides a more efficient, and much more realistic, means of test validation.

The proposed project builds on a previous one funded by the German Research Foundation (DFG), in which we successfully developed a model-based recursive partitioning framework for detecting parameter differences in simple IRT models. The aim of the new proposal is to extend this framework to a class of more general IRT models that incorporate differential slope and guessing parameters. These models are of high practical relevance because they address the common situations that not all items have the same discriminatory power and that random guessing occurs (most importantly in multiple-choice tests, where even test takers with very little knowledge have a certain probability of picking the correct alternative by chance). We will also investigate latent class and mixture distribution approaches for capturing heterogeneity related to unobserved or unobservable grouping variables, while retaining the valuable option of including observed concomitant variables. In addition to likelihood-based approaches, we will investigate the potential of Bayesian models, which seem particularly well suited to this task given their flexibility in combining observed and unobserved variables.

In summary, the methodological toolbox developed in this project will allow for a flexible validation of a variety of widely applied IRT models and thereby make a substantial contribution to the construction of objective and fair tests in many areas of psychological, medical and educational research.
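For context, a standard given-groups DIF analysis for the 2PL model, the baseline that this project aims to generalize beyond predefined groups, can be sketched in R. The following is a minimal illustration using the third-party mirt package and simulated data, not the project's own method; the simulation setup and parameter choices are assumptions of this sketch.

library("mirt")

## Simulate dichotomous responses from a 2PL model for two groups
## with identical item parameters (i.e., no true DIF).
set.seed(1)
a <- matrix(rlnorm(20, meanlog = 0.2, sdlog = 0.3))  # item slopes
d <- matrix(rnorm(20))                               # item intercepts
dat <- rbind(simdata(a, d, 500, itemtype = "dich"),
             simdata(a, d, 500, itemtype = "dich"))
group <- rep(c("reference", "focal"), each = 500)

## Fit a multiple-group 2PL with all item parameters constrained
## equal across groups, freeing only the focal group's latent
## mean and variance.
mod <- multipleGroup(dat, model = 1, group = group,
                     invariance = c("free_means", "free_var", colnames(dat)))

## Per-item likelihood-ratio DIF tests: drop the equality constraints
## on slope (a1) and intercept (d) and test the change in fit.
DIF(mod, which.par = c("a1", "d"), scheme = "drop")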
All methods will be implemented in the free statistical software R, so that they will be promptly available to other researchers from all over the world.
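A freely available precursor for the latent class / mixture direction already exists from the applicants' earlier collaboration: the R package psychomix fits Rasch mixture models, optionally with concomitant variables. A minimal sketch, using the simulation design and functions documented in psychomix (an illustration, not a project result):

library("psychomix")

## Simulate item responses from Rost's (1990) classic two-class
## scenario: two latent classes with mirrored difficulty profiles.
set.seed(2)
r2 <- simRaschmix(design = "rost2")

## Fit Rasch mixture models with 1, 2 and 3 latent classes;
## information criteria should favor the two-class solution.
m <- raschmix(data = r2, k = 1:3)
BIC(m)

## Inspect the item parameter profiles of the BIC-selected model.
parameters(getModel(m, which = "BIC"), which = "item")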