Research

The overarching goal of my research is to advance statistical methods in educational and psychological measurement and to improve validity and fairness in testing and assessment. My general interests are in latent variable models, item response theory, psychometrics, multilevel models, and test security.

List of publications

Research at the University of Wisconsin-Madison

1 Preknowledge detection

My research focuses on (1) statistical modeling of aberrant test-taking behavior in educational and psychological measurement and (2) psychometric methodologies for item- and person-level anomaly detection, specifically test collusion and item preknowledge.

2 University of Wisconsin-Madison Testing and Evaluation Services

I conduct psychometric research projects on educational assessment data using contemporary measurement methods. I write psychometric data analysis reports at the Office of Testing and Evaluation Services to inform and promote advanced evaluation, test development, scoring, reporting, and test administration practices in line with contemporary advances in theoretical and applied psychometrics.

3 Institute of Education Sciences (IES)

I served as a research assistant for the Institute of Education Sciences (IES; #R305D190053) Project “Bayesian dynamic borrowing: A method for utilizing historical data in education research” PDF

Principal Investigators: David Kaplan, UW-Madison, & Jianshen Chen, College Board

4 University of Wisconsin-Madison Language Institute

I served as a statistical consultant for analyzing a survey dataset at the Language Institute. The research study focuses on the reasons U.S. undergraduate students decide to enroll (or not) in courses in languages other than English (LOTEs). PDF

Principal Investigator: Dianna L. Murphy, UW-Madison

Research at the National Board of Medical Examiners (NBME)

1 Item compromise detection

Testing organizations often investigate whether secure test material has been exposed and is consequently invalid for scoring and inclusion on future assessments. In the current study, we present an approach for longitudinally modeling both response accuracy and time-intensity, and we compare the results against items previously flagged as exposed through subject matter expert review. Preliminary results highlighted normatively extreme items that had drifted substantially since their initial deployment. Further, there did not appear to be a strong correspondence between the statistically unusual items and those flagged for exposure. Thus, decisions on the extent to which items are compromised may benefit from a more comprehensive approach that uses both qualitative and quantitative methods to ensure that concerning items are discovered.

2 Nonfunctional distractors

Functional distractors (incorrect options in a multiple-choice question) should draw attention from test-takers who lack sufficient ability or knowledge to respond correctly. Unfortunately, previous research on distractors has demonstrated the unsettling reality that this rarely occurs in practice, leading to recommendations for creating items with fewer incorrect alternatives. The purpose of the present study was to explore whether non-functional distractors (NFDs) may still yield value in detecting unusual examinee behavior. Using empirical data from a high-stakes licensure examination, examinees who selected an excessive number of NFDs were flagged and analyzed with respect to their response times and overall performance. Results indicated that these flagged examinees were also of extremely low ability, selected NFDs consistently across item sequence, and were homogeneous in their pacing strategies, spending a similar amount of time whether they chose a non-functional or a functional distractor. Implications for relevant policy decisions, mitigation strategies, operational applications, and test security considerations are discussed. PDF

3 Interactive score reporting dashboard

This research study investigated focus group feedback on an interactive score reporting dashboard presenting results for multiple related assessments. Data were analyzed using two natural language processing (NLP) methodologies: topic modeling and sentiment analysis. Results validated qualitative findings, provided additional insight on key themes, and clarified next steps for expanding the dashboard’s utility.

Research at the American Institute of Certified Public Accountants (AICPA)

Preknowledge detection at the individual level

We borrowed information from one format to detect preknowledge on another format within the same test. A differential person functioning approach yielded higher power than a regression method. Further investigation revealed that power decreased as the percentage of examinees with preknowledge increased and as the number of contaminated items decreased.

Research at the University of Connecticut

1 Omitted response patterns in a large-scale language assessment

This study is an exploratory analysis of examinee behavior in a large-scale language proficiency test. Despite a number-right scoring system with no penalty for guessing, we found that 16% of examinees omitted at least one answer and that women were more likely than men to omit answers. Item response theory analyses treating the omitted responses as missing rather than wrong showed that examinees had underperformed by skipping the answers, with greater underperformance among more able participants. An analysis of omitted answer patterns showed that reading passage items were most likely to be omitted and that native language-translation items were least likely to be omitted. We hypothesized that, since reading passage items were the most tempting to skip, examinees who did answer every question might tend to guess at these items. Using cluster analyses, we found that underperformance on the reading passage items was more likely than underperformance on the non-reading passage items. In large-scale operational tests, examinees must know the optimal strategy for taking the test. Test developers must also understand how examinee behavior might impact the validity of score interpretations. PDF

2 English as a second language proficiency: An IRT approach to scoring

This thesis first discusses the operational definition of proficiency in English as a second language through several theoretical frameworks. Next, it reviews various English language proficiency tests and the statistical methods used in their measurement approaches. Then, it compares two measurement approaches to scoring a summative English as a second language proficiency test: IRT vs. number-right scoring. Finally, it addresses some complications of the number-right scoring method in light of the suggested measurement approach and discusses the advantages of the suggested IRT approach. Full-text
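As a rough illustration of how the two scoring approaches can diverge, the sketch below scores two response patterns with the same number-right score under a two-parameter logistic (2PL) IRT model. The 2PL model, the item parameters, and all function names are assumptions made for illustration only; they are not taken from the thesis.

```python
import math

# Hypothetical 2PL item parameters (discrimination a, difficulty b).
# These values are illustrative and do not come from any real test.
ITEMS = [(0.5, -1.0), (1.0, 0.0), (2.0, 1.0)]


def p_correct(theta, a, b):
    """2PL probability of a correct response at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def log_likelihood(theta, responses):
    """Log-likelihood of a 0/1 response pattern at ability theta."""
    total = 0.0
    for (a, b), u in zip(ITEMS, responses):
        p = p_correct(theta, a, b)
        total += math.log(p) if u == 1 else math.log(1.0 - p)
    return total


def number_right(responses):
    """Number-right score: the count of correct responses."""
    return sum(responses)


def ml_theta(responses, lo=-4.0, hi=4.0, steps=801):
    """Grid-search maximum-likelihood ability estimate on [lo, hi]."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda t: log_likelihood(t, responses))


# Two examinees with the same number-right score but different patterns.
pattern_a = [1, 1, 0]  # misses the most discriminating, hardest item
pattern_b = [0, 1, 1]  # misses the least discriminating, easiest item

print(number_right(pattern_a), number_right(pattern_b))
print(ml_theta(pattern_a), ml_theta(pattern_b))
```

Under these assumed parameters, number-right scoring cannot separate the two examinees, whereas the 2PL estimate weights each response by item discrimination and therefore assigns them different abilities.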

Research at Bogazici University

Preservice mathematics teachers’ understanding of complex numbers

This study investigated a prospective secondary mathematics teacher’s development of the meaning of the Cartesian form of complex numbers during a teaching experiment. We illustrate that, by shrinking or stretching the distance(s) between the roots and the x-coordinate of the vertex of any quadratic function, one might conceptualize complex numbers as a single entity, an element of a well-defined set, rather than a prescription of certain operations. Such awareness also helps explain why quadratic functions must have conjugate roots once they have a complex root. Full-text
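The relationship between the roots and the vertex can be made explicit with a short derivation. This is a standard algebraic sketch, not the study's own presentation; it assumes a quadratic with real coefficients written in vertex form, and the symbols a, h, and k are introduced here for illustration.

```latex
\begin{align*}
f(x) &= a(x - h)^2 + k, \qquad a, h, k \in \mathbb{R},\ a \neq 0, \\
f(x) = 0 \;&\Longleftrightarrow\; (x - h)^2 = -\frac{k}{a}
  \;\Longleftrightarrow\; x = h \pm \sqrt{-\frac{k}{a}}, \\
\text{if } \frac{k}{a} > 0: \quad x &= h \pm i\sqrt{\frac{k}{a}}.
\end{align*}
```

In both cases the roots sit symmetrically about the x-coordinate of the vertex, h. When k/a > 0, the distance term becomes imaginary, so the two roots are the conjugate pair with real part h.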