Alexandru Marcoci

About

I am Assistant Professor of Global Risk and Resilience in the Centre for the Study of Existential Risk (CSER) at the University of Cambridge and Co-Director of the new MPhil in Global Risk and Resilience. Most of my work focuses on collective decision-making and reasoning about long-run risks and AI policy.

Before coming to Cambridge I was Assistant Professor of Politics and International Relations at the University of Nottingham. I was previously affiliated with CSER and Clare Hall, the Centre for Argument Technology, University of Dundee, the Philosophy, Politics and Economics Program at the University of North Carolina, Chapel Hill, and the Department of Government at the London School of Economics and Political Science.

I have a PhD in Philosophy from the London School of Economics and Political Science (2018). Contact me

Activities

I am a Co-Director of the Institute for Replication (I4R). I4R works to improve the credibility of science by systematically reproducing and replicating research findings in leading academic journals. Our team collaborates with researchers to: promote and generate reproductions and replications through one-day hackatons, establish an open access website, prepare standardized file structure, code and documentation, and develop educational materials on replication. From 2024 we have exciting new collaborations with Nature Human Behaviour and Psychological Science.

I am a member of the History, Philosophy, and Culture Working Group of the next generation Event Horizon Telescope Collaboration. We contribute social science and humanities perspectives on responsible telescope siting, outreach, education, foundations, algorithms, inferences, visualizations, governance structures and knowledge formation in scientific collaborations. I co-lead the Collaborations Focus Group, the task force on managing dissent in large scientific collaborations and lead the expert forecasting task force.

I collaborated with the DARPA-funded repliCATS project on using structured expert elicitation techniques to predict the reliability of research in the social & behavioural sciences. The project received the University of Melbourne Research Excellence Award for Interdisciplinary Research in 2022 and you can watch a video that summarises the aims and achievements of repliCATS here:

Research Funding

I was Principal Investigator of the Measuring the quality of collective reasoning project, funded by the British Academy/Leverhulme Small Research Grants (2023-2025). The Co-Investigator was Ans Vercammen (University of Queensland).

I was a UKRI Policy Fellow working on the Future of Online Regulation on part-time secondment to the Department for Science, Innovation and Technology (2023-2025).

I was Co-I of the Benchmarking LLM agents on real-world tasks: Reproducibility project, funded by Open Philanthropy (2024-2025). The other investigators were Abel Brodeur (University of Ottawa/Institute for Replication) and Rohan Alexander (University of Toronto).

I am Co-PI of the LLM Code Reviewer for Scientific Papers project, funded by the Alfred P. Sloan Foundation (2024-2025). The other co-PI is Abel Brodeur (University of Ottawa/Institute for Replication).

I am Co-I of the Opportunities and Potential Risks of AI in Supporting Evaluation project, funded by ai@cam (2025-2026). Our consortium is led by Deborah Talmi (University of Cambridge) and includes researchers from the Universities of Cambridge, Nottingham, Central Lancashire and from Manchester Metropolitan University.

I am a Co-I of the Tracking Stars and Unicorns project, funded by UKRI (2025-2026). Our consortium includes researchers from the Universities of Edinburgh, Bristol, Oxford and the Open University.

I am a Co-I of the Evidence Exchange project, funded by UKRI (2026-2029). Our consortium includes researchers from King’s College London, the University of Edinburgh, Swansea University, Queen’s University Belfast, the University of Manchester and University College London. Project partners include the Universities Policy Engagement Network (UPEN), the Scottish Policy Research Exchange (SPRE), the Resolution Foundation and Apolitical

Publications

Abel Brodeur, David Valenta, Alexandru Marcoci, Juan P. Aparicio, Derek Mikola, Bruno Barbarioli, Rohan Alexander, Lachlan Deer, Tom Stafford, Lars Vilhuber, Gunther Bensch, et al. (2025). Comparing Human-Only, AI-Assisted, and AI-Led Teams on Assessing Research Reproducibility in Quantitative Social Science. I4R Discussion Paper Series, No. 195 Abstract This study evaluates the effectiveness of varying levels of human and artificial intelligence (AI) integration in reproducibility assessments of quantitative social science research. We computationally reproduced quantitative results from published articles in the social sciences with 288 researchers, randomly assigned to 103 teams across three groups - human-only teams, AI-assisted teams and teams whose task was to minimally guide an AI to conduct reproducibility checks (the "AI-led" approach). Findings reveal that when working independently, human teams matched the reproducibility success rates of teams using AI assistance, while both groups substantially outperformed AI-led approaches (with human teams achieving 57 percentage points higher success rates than AI-led teams, p < 0.001). Human teams were particularly effective at identifying serious problems in the analysis: they found significantly more major errors compared to both AI-assisted teams (0.7 more errors per team, p = 0.017) and AI-led teams (1.1 more errors per team, p < 0.001). AI-assisted teams demonstrated an advantage over more automated approaches, detecting 0.4 more major errors per team than AI-led teams (p = 0.029), though still significantly fewer than human-only teams. Finally, both human and AI-assisted teams significantly outperformed AI-led approaches in both proposing (25 percentage points difference, p = 0.017) and implementing (33 percentage points difference, p = 0.005) comprehensive robustness checks. These results underscore both the strengths and limitations of AI assistance in research reproduction and suggest that despite impressive advancements in AI capability, key aspects of the research publication process still require human substantial human involvement.
Lexin Zhou, Pablo A. Moreno-Casares, Fernando Martínez-Plumed, John Burden, Ryan Burnell, Lucy Cheke, Cèsar Ferri, Alexandru Marcoci, Behzad Mehrbakhsh, Yael Moros-Daval, Seán Ó hÉigeartaigh, Danaja Rutar, Wout Schellaert, Konstantinos Voudouris, José Hernández-Orallo. (2023). Predictable Artificial Intelligence. Forthcoming in Artificial IntelligenceAbstract We introduce the fundamental ideas and challenges of "Predictable AI", a nascent research area that explores the ways in which we can anticipate key indicators of present and future AI ecosystems. We argue that achieving predictability is crucial for fostering trust, liability, control, alignment and safety of AI ecosystems, and thus should be prioritised over performance. While distinctive from other areas of technical and non-technical AI research, the questions, hypotheses and challenges relevant to "Predictable AI" were yet to be clearly described. This paper aims to elucidate them, calls for identifying paths towards AI predictability and outlines the potential impact of this emergent field.
Constantin W. Arnscheidt, SJ Beard, Tom Hobson, Paul Ingram, Luke Kemp, Lara Mani, Alexandru Marcoci, Kennedy Mbeva, Seán S. Ó hÉigeartaigh, Anders Sandberg, Lalitha S. Sundaram, Nico Wunderling. (2025). Systemic contributions to global catastrophic risk. Global Sustainability 8:e19 Abstract Humanity faces a complex and dangerous global risk landscape, and many different terms and concepts have been used to make sense of it. One broad strand of research characterises how risk emerges from the complex global system, using concepts like systemic risk, Anthropocene risk, synchronous failure, negative social tipping points, and polycrisis. Another strand focuses on possible worst-case outcomes, using concepts like global catastrophic risk (GCR), existential risk, and extinction risk. Despite their clear relevance to each other, only limited connections have been made between these two strands. Here we provide a framework which synthesises the two and shows how emergent properties of the global system contribute to the risk of global catastrophic outcomes. Specifically, the global system generates hazards, amplification, vulnerability, and latent risk, as well as challenges for GCR assessment and mitigation. This systemic lens helps us understand the origins of GCR, provides a useful interface between two deeply related but infrequently connected bodies of work, and provides important insights for risk reduction.
José Hernández-Orallo, Seán Ó hÉigeartaigh, Alexandru Marcoci, Haydn Belfield, Giulio Corsi, Maurice Chiodo, Fernando Martínez-Plumed. (2025). Feedback on the Second Draft of the General-Purpose AI Code of Practice: Comments and Recommendations. Leverhulme Centre for the Future of Intelligence
Alexandru Marcoci, David P. Wilkinson, Ans Vercammen, Bonnie C. Wintle, Anna Lou Abatayo, Ernest Baskin, Henk Berkman, Erin M. Buchanan, Sara Capitán, Tabaré Capitán, Ginny Chan, Kent Jason G. Cheng, Tom Coupé, Sarah Dryhurst, Jianhua Duan, John E. Edlund, Timothy M. Errington, Anna Fedor, Fiona Fidler, James G. Field, Nicholas Fox, Hannah Fraser, Alexandra LJ Freeman, Anca Hanea, Felix Holzmeister, Sanghyun Hong, Raquel Huggins, Nick Huntington-Klein, Magnus Johannesson, Angela M. Jones, Hansika Kapoor, John Kerr, Melissa Kline Struhl, Marta Kołczyńska, Yang Liu, Zachary Loomas, Brianna Luis, Esteban Méndez, Olivia Miske, Fallon Mody, Carolin Nast, Brian A. Nosek, E. Simon Parsons, Thomas Pfeiffer, W. Robert Reed, Jon Roozenbeek, Alexa R. Schlyfestone, Claudia R. Schneider, Andrew Soh, Anirudh Tagat, Melba Tutor, Andrew Tyner, Karolina Urbanska, Sander van der Linden. (2024). Predicting the replicability of social and behavioural science claims in a crisis: The COVID-19 Preprint Replication Project. Nature Human Behaviour 9: 287–304 Abstract The journal produced a Research Briefing on this paper. Replications are important for assessing the reliability of published findings. However, they are costly, and it is infeasible to replicate everything. Accurate, fast, lower-cost alternatives such as eliciting predictions could accelerate assessment for rapid policy implementation in a crisis. We elicited judgments from participants on 100 claims from preprints about an emerging area of research (COVID-19 pandemic) using an interactive structured elicitation protocol, and we conducted 29 new high-powered replications. After interacting with their peers, participant groups with lower task expertise (‘beginners’) updated their estimates and confidence in their judgements significantly more than groups with greater task expertise (‘experienced’). For experienced individuals, the average accuracy was 0.57 (95% CI: [0.53, 0.61]) after interaction, and they correctly classified 61% of claims; beginners’ average accuracy was 0.58 (95% CI: [0.54, 0.62]), correctly classifying 69% of claims. The difference in accuracy between groups was not statistically significant, and their judgments on the full set of claims were correlated (r=.48). These results suggest that both beginners and more experienced participants using a structured process have some ability to make better-than-chance predictions about the reliability of ‘fast science’ under conditions of high uncertainty. However, given the importance of such assessments for making evidence-based critical decisions in a crisis, more research is required to understand who the right experts in forecasting replicability are and how their judgements ought to be elicited.
Giulio Corsi, Alexandru Marcoci, and Bill Marino. (2024). How Foundation Models Could Transform Synthetic Media Detection. Leverhulme Centre for the Future of Intelligence
Alexandru Marcoci and Alexandra Oprea. (2024). Freedom of Speech on Campus. Philosophical Quarterly 74(4): 1251-1273 Abstract What should be the rules governing campus speech in a liberal democratic society? On one side are those arguing for maximal protections for campus speech analogous to the First Amendment in the United States. On the other are those promoting stricter regulation of speech through formal and informal speech codes. This paper aims to carve a new path in the conversation. Both sides agree that the mission of the university is the discovery and dissemination of knowledge and that achieving this mission requires tolerant and open-minded students, faculty, and administrators. However, neither side has explicitly connected its advocacy for specific speech policies with these shared goals. Our paper advances the conversation by proposing a series of empirically testable mechanisms for connecting speech policies to the desired outcomes. We also argue that focusing on these mechanisms opens the way towards new and more tractable conceptual and normative debates.
Alexandru Marcoci, Margaret E. Webb, Luke Rowe, Ashley Barnett, Tamar Primoratz, Ariel Kruger, Christopher W. Karvetski, Benjamin Stone, Michael L. Diamond, Morgan Saletta, Tim van Gelder, Philip E. Tetlock, Simon Dennis. (2024). Validating a forced choice method for eliciting quality of reasoning judgments. Behavior Research Methods 56: 4958–4973 Abstract In this paper we investigate the criterion validity of forced choice comparisons of the quality of written arguments with normative solutions. Across two studies, assessing quality of reasoning through a forced choice design enabled both novices and experts to choose arguments supporting more accurate solutions – 62.2% (SE=1%) of the time for novices and 74.4% (SE=1%) for experts – and arguments produced by larger teams - up to 82% of the time for novices and 85% for experts – with high inter-rater reliability - 70.58% (95% CI = 1.18) percent agreement for novices and 80.98% (95% CI = 2.26) for experts. We also explored two methods for increasing efficiency. We found that the number of comparative judgments needed can be substantially reduced with little accuracy loss by leveraging transitivity and producing quality of reasoning assessments using an AVL tree method. Moreover, a regression model trained to predict scores based on automatically derived linguistic features of participants’ judgments achieved a high correlation with the objective accuracy scores of the arguments in our dataset. Despite the inherent subjectivity involved in evaluating differing quality of reasoning, the forced choice paradigm allows even novice raters to perform beyond chance and can provide a valid, reliable and efficient method for producing quality of reasoning assessments at scale.
Alexandra Oprea and Alexandru Marcoci. (2024). How Should Colleges Select Students? Justice, Toleration, and University Admissions. Georgetown Journal of Law & Public Policy 22: 977-1000 Abstract As undergraduate education becomes a key formative experience for a larger percentage of the population, it is imperative that political philosophers consider the role of universities in bringing about a more just society. In this paper, we contribute to this task by assessing which university admissions policies are compatible with justice and conducive to the epistemic and civic missions of the university. Scholars agree that universities require a tolerant campus culture, but concrete proposals have focused on interventions at the level of faculty and administrators. The empirical literature, however, shows that students are more influenced by reputational consequences among their peers. We therefore argue that universities should also attend to the selection of the student body. We consider and reject a popular proposal that colleges should select students with underrepresented moral and political beliefs to increase viewpoint diversity. Instead, we propose directly weighing students’ tolerance and open-mindedness in the admission process.
Alexandru Marcoci, Ann C. Thresher, Niels C. M. Martens, Peter Galison, Sheperd S. Doeleman, Michael D. Johnson. (2023). Big STEM collaborations should include humanities and social science. Nature Human Behaviour 7: 1229-1230
Seán Ó hÉigeartaigh, Yolanda Lannquist, Alexandru Marcoci, Jaime Sevilla, Mónica Alejandra Ulloa Ruiz, Yaqub Chaudhary, Tim Schreier, Zach Stein-Perlman and Jeffrey Ladish. (2023). Do companies’ AI Safety Policies meet government best practice?. Leverhulme Centre for the Future of Intelligence Lead Rapid review finds leading AI companies are not meeting UK Government best practice for frontier AI safety.
Bonnie C. Wintle, Eden T. Smith, Martin Bush, Fallon Mody, David P. Wilkinson, Anca M. Hanea, Alexandru Marcoci, Hannah Fraser, Victoria Hemming, Felix Singleton Thorn, Marissa F. McBride, Elliot Gould, Andrew Head, Daniel G. Hamilton, Steven Kambouris, Libby Rumpff, Rink Hoekstra, Mark A. Burgman, Fiona Fidler. (2023). Predicting and reasoning about replicability using structured groups. Royal Society Open Science 10(6): 221553 Abstract This paper explores judgements about the replicability of social and behavioural sciences research and what drives those judgements. Using a mixed methods approach, it draws on qualitative and quantitative data elicited from groups using a structured approach called the IDEA protocol (‘Investigate’, ‘Discuss’, ‘Estimate’ and ‘Aggregate’). Five groups of five people with relevant domain expertise evaluated 25 research claims that were subject to at least one replication study. Participants assessed the probability that each of the 25 research claims would replicate (i.e., that a replication study would find a statistically significant result in the same direction as the original study) and described the reasoning behind those judgements. We quantitatively analysed possible correlates of predictive accuracy, including self-rated expertise and updating of judgements after feedback and discussion. We qualitatively analysed the reasoning data to explore the cues, heuristics and patterns of reasoning used by participants. Participants achieved 84% classification accuracy in predicting replicability. Those who engaged in a greater breadth of reasoning provided more accurate replicability judgements. Some reasons were more commonly invoked by more accurate participants, such as ‘effect size’ and ‘reputation’ (e.g. of the field of research). There was also some evidence of a relationship between statistical literacy and accuracy.
Michael D. Johnson, Kazunori Akiyama, Lindy Blackburn, Katherine L. Bouman, Avery E. Broderick, Vitor Cardoso, Rob Fender, Christian Fromm, Peter Galison, Jose L. Gómez, Daryl Haggard, Matthew L. Lister, Andrei Lobanov, Sera Markoff, Ramesh Narayan, Priyamvada Natarajan, Tiffany Nichols, Dominic W. Pesce, Ziri Younsi, Andrew Chael, Koushik Chatterjee, Ryan Chaves, Juliusz Doboszewski, Richard Dodson, Sheperd S. Doeleman, Jamee Elder, Garret Fitzpatrick, Kari Haworth, Janice Houston, Sara Issaoun, Yuri Kovalev, Aviad Levis, Rocco Lico, Alexandru Marcoci, Niels C.M. Martens, Neil Nagar, Aaron Oppenheimer, Daniel C. M. Palumbo, Angelo Ricarte, María J. Rioja, Freek Roelofs, Ann C. Thresher, Paul Tiede, Jonathan Weintroub, Maciek Wielgus. (2023). Key Science Goals for the Next-Generation Event Horizon Telescope. Galaxies 11(3): 61 (SI: From Vision to Instrument - Creating a Next-Generation Event Horizon Telescope for a New Era of Black Hole Science). Abstract The Event Horizon Telescope (EHT) has led to the first images of a supermassive black hole, revealing the central compact objects in the elliptical galaxy M87 and the Milky Way. Proposed upgrades to this array through the next-generation EHT (ngEHT) program would sharply improve the angular resolution, dynamic range, and temporal coverage of the existing EHT observations. These improvements will uniquely enable a wealth of transformative new discoveries related to black hole science, extending from event-horizon-scale studies of strong gravity to studies of explosive transients to the cosmological growth and influence of supermassive black holes. Here, we present the key science goals for the ngEHT and their associated instrument requirements, both of which have been formulated through a multi-year international effort involving hundreds of scientists worldwide
Mark Burgman, Rafael Chiaravalloti, Fiona Fidler, Yizhong Huan, Marissa McBride, Alexandru Marcoci, Juliet Norman, Ans Vercammen, Bonnie C. Wintle and Yurong Yu. (2023). A toolkit for open and pluralistic conservation science. Conservation Letters 16(1): e12919 Abstract Conservation science practitioners seek to pre-empt irreversible impacts on species, ecosystems, and social-ecological systems, requiring efficient and timely action even when data and understanding are unavailable, incomplete, dated, or biased. These challenges are exacerbated by the scientific community's capacity to consistently distinguish between reliable and unreliable evidence, including the recognition of questionable research practices (QRPs, or ‘questionable practices’), which may threaten the credibility of research, including harming trust in well-designed and reliable scientific research. In this paper, we propose a ‘toolkit’ for open and pluralistic conservation science, highlighting common questionable practices and sources of bias and indicating where remedies for these problems may be found. The toolkit provides an accessible resource for anyone conducting, reviewing, or using conservation research, to identify sources of false claims or misleading evidence that arise unintentionally, or through misunderstandings or carelessness in the application of scientific methods and analyses. We aim to influence editorial and review practices and hopefully to remedy problems before they are published or deployed in policy or conservation practice.
Peter Galison, Juliusz Doboszewski, Jamee Elder, Niels C.M. Martens, Abhay Ashtekar, Jonas Enander, Marie Gueguen, Elizabeth A. Kessler, Roberto Lalli, Martin Lesourd, Alexandru Marcoci, Sebastián Murgueitio Ramírez, Priyamvada Natarajan, James Nguyen, Luis Reyes-Galindo, Sophie Ritson, Mike D. Schneider, Emilie Skulberg, Helene Sorgner, Matthew Stanley, Ann C. Thresher, Jeroen van Dongen, James Owen Weatherall, Jingyi Wu, Adrian Wüthrich. (2023). The Next Generation Event Horizon Telescope Collaboration: History, Philosophy, and Culture. Galaxies 11(1): 32 (SI: From Vision to Instrument - Creating a Next-Generation Event Horizon Telescope for a New Era of Black Hole Science)Abstract This white paper outlines the plans of the History Philosophy Culture Working Group of the Next Generation Event Horizon Telescope Collaboration.
Hannah Fraser, Martin Bush, Bonnie Wintle, Fallon Mody, Eden Smith, Anca Hanea, Elliot Gould, Victoria Hemming, Dan Hamilton, Libby Rumpff, David Peter Wilkinson, Ross Pearson, Felix Singleton Thorn, Raquel Ashton, Aaron Willcox, Charles T Gray, Andrew Head, Melissa Ross, Rebecca Groenewegen, Alexandru Marcoci, Ans Vercammen, Timothy H Parker, Rink Hoekstra, Shinichi Nakagawa, David R Mandel, Don van Ravenzwaaij, Marissa McBride, Richard O Sinnott, Peter Vesk, Mark Burgman and Fiona Fidler. (2023). Predicting reliability through structured expert elicitation with the repliCATS (Collaborative Assessments for Trustworthy Science) process. PLoS ONE 18(1): e0274429Abstract As replications of individual studies are resource intensive, techniques for predicting the replicability are required. We introduce the repliCATS (Collaborative Assessments for Trustworthy Science) process, a new method for eliciting expert predictions about the replicability of research. This process is a structured expert elicitation approach based on a modified Delphi technique applied to the evaluation of research claims in social and behavioural sciences. The utility of processes to predict replicability is their capacity to test scientific claims without the costs of full replication. Experimental data supports the validity of this process, with accuracy that meets or exceeds that of other techniques used to predict replicability while providing additional benefits. The repliCATS process is highly scalable, able to be deployed for both rapid assessment of small numbers of claims, and assessment of high volumes of claims over an extended period through an online elicitation platform. It is available to be implemented in a range of ways and we describe one such implementation. An important advantage of the repliCATS process is that it collects qualitative data that has the potential to assist with problems like understanding the limits of generalizability of scientific claims. The primary limitation of the repliCATS process is its reliance on human-derived predictions with consequent costs in terms of participant fatigue although careful design can minimise these costs. The repliCATS process has potential applications in alternative peer review and in the allocation of effort for replication studies.
Luc Bovens and Alexandru Marcoci. (2023). The Gender-Neutral Bathroom: A New Frame and Some Nudges. Behavioural Public Policy 7(1), 1-24 Abstract Gender-neutral bathrooms are usually framed as an accommodation for trans and other gender non-conforming individuals. In this paper we show that the benefits of gender-neutral bathrooms are much broader. First, our simulations show that gender-neutral bathrooms reduce average waiting times: while waiting times for women go down invariably, waiting times for men either go down or slightly increase depending on usage intensity, occupancy time differentials, and the presence of urinals. Second, our result can be turned on its head: firms have an opportunity to reduce the number of facilities and cut costs by making them all gender-neutral without increasing waiting times. These observations can be used to reframe the gender-neutral bathrooms debate so that they appeal to a larger constituency, cutting across the usual dividing lines in the “bathroom wars”. Finally, there are improved designs and behavioural strategies that can help overcome resistance. We explore what strategies can be invoked to mitigate the objections that gender-neutral bathrooms (1) are unsafe; (2) elicit discomfort; and (3) are unhygienic.
Alexandru Marcoci, Margaret E. Webb, Luke Rowe, Ashley Barnett, Tamar Primoratz, Ariel Kruger, Benjamin Stone, Morgan Saletta, Tim van Gelder, Simon Dennis. (2022). Measuring Quality of General Reasoning. In J. Culbertson, A. Perfors, H. Rabagliati & V. Ramenzoni (Eds.), Proceedings of the 44th Annual Conference of the Cognitive Science Society (CogSci 2022), 3229-3235 Abstract Machine learning models that automatically assess reasoning quality are trained on human-annotated written products. These “gold-standard” corpora are typically created by prompting annotators to choose, using a forced choice design, which of two products presented side by side is the most convincing, contains the strongest evidence or would be adopted by more people. Despite the increase in popularity of using a forced choice design for assessing quality of reasoning (QoR), no study to date has established the validity and reliability of such a method. In two studies, we simultaneously presented two products of reasoning to participants and asked them to identify which product was ‘better justified’ through a forced choice design. We investigated the criterion validity and inter-rater reliability of the forced choice protocol by assessing the relationship between QoR, measured using the forced choice protocol, and accuracy in objectively answerable problems using naive raters sampled from MTurk (Study 1) and experts (Study 2), respectively. In both studies products that were closer to the correct answer and products generated by larger teams were consistently preferred. Experts were substantially better at picking the reasoning products that corresponded to accurate answers. Perhaps the most surprising finding was just how rapidly raters made judgements regarding reasoning: On average, both novices and experts made reliable decisions in under 15 seconds. We conclude that forced choice is a valid and reliable method of assessing QoR.
Alexandru Marcoci, Ans Vercammen, Martin Bush, Daniel Hamilton, Anca Hanea, Victoria Hemming, Bonnie C. Wintle, Mark Burgman and Fiona Fidler. (2022). Reimagining peer review as an expert elicitation process. BMC Research Notes 15, 127 (SI: Reproducibility and Research Integrity) Abstract Journal peer review regulates the flow of ideas through an academic discipline and thus has the power to shape what a research community knows, actively investigates, and recommends to policymakers and the wider public. We might assume that editors can identify the ‘best’ experts and rely on them for peer review. But decades of research on both expert decision-making and peer review suggest they cannot. In the absence of a clear criterion for demarcating reliable, insightful, and accurate expert assessors of research quality, the best safeguard against unwanted biases, uneven power distributions and general inefficiencies is to introduce greater transparency and structure into the process. This paper argues that peer review would therefore benefit from applying a series of evidence-based recommendations from the empirical literature on structured expert elicitation. We highlight individual and group characteristics that contribute to higher quality judgements, and elements of elicitation protocols that reduce bias, promote constructive discussion, and enable opinions to be objectively and transparently aggregated.
Ans Vercammen, Alexandru Marcoci and Mark Burgman. (2021). Pre-screening workers to overcome bias amplification in online labour markets. PLoS ONE 16(3), e0249051. Abstract Groups have access to more diverse information and typically outperform individuals on problem solving tasks. Crowdsolving utilises this principle to generate novel and/or superior solutions to intellective tasks by pooling the inputs from a distributed online crowd. However, it is unclear whether this particular instance of “wisdom of the crowd” can overcome the influence of potent cognitive biases that habitually lead individuals to commit reasoning errors. We empirically test the prevalence of cognitive bias on a popular crowdsourcing platform, examining susceptibility to bias of online panels at the individual and aggregate levels. We then investigate the use of the Cognitive Reflection Test, notable for its predictive validity for real-life reasoning, as a screening tool to improve collective performance. We find that systematic biases in crowdsourced answers are not as prevalent as anticipated, but when they occur, biases are amplified with increasing group size, as predicted by the Condorcet Jury Theorem. The results further suggest that pre-screening individuals with the Cognitive Reflection Test can substantially enhance collective judgement and improve crowdsolving performance.
Diana Popescu and Alexandru Marcoci. (2020). Coronavirus: allocating ICU beds and ventilators based on age is discriminatory. The Conversation, April 22 Lead Being a member of a certain age group shouldn't be a liability.
Alexandru Marcoci and James Nguyen. (2020). Judgement aggregation in scientific collaborations: The case for waiving expertise. Studies in History and Philosophy of Science Part A 84, 66-74 Abstract The fragmentation of academic disciplines forces individuals to specialise. In doing so, they become experts over their narrow area of research. However, ambitious scientific projects, such as the search for gravitational waves, require them to come together and collaborate across disciplinary borders. How should scientists with expertise in different disciplines treat each others' expert claims? An intuitive answer is that the collaboration should defer to the opinions of experts. In this paper we show that under certain seemingly innocuous assumptions, this intuitive answer gives rise to an impossibility result when it comes to aggregating the beliefs of experts to deliver the beliefs of a collaboration as a whole. We then argue that when experts' beliefs come into conflict, they should waive their expert status.
Alexandru Marcoci. (2020). Monty Hall saves Dr. Evil: On Elga's restricted principle of indifference. Erkenntnis 85(1), 65-76 Abstract In this paper I show that Elga's argument for a restricted principle of indifference for self-locating belief relies on the kind of mistaken reasoning that recommends the 'staying' strategy in the Monty Hall problem.
Gregg Willcox, Louis Rosenberg, Mark Burgman and Alexandru Marcoci. (2020). Prioritizing Policy Objectives in Polarized Societies using Artificial Swarm Intelligence. In the Proceedings of the IEEE Conference on Cognitive and Computational Aspects of Situation Management (CogSIMA 2020), 1-9 Abstract Groups often struggle to reach decisions, especially when populations are strongly divided by conflicting views. Traditional methods for collective decision-making involve polling individuals and aggregating results. In recent years, a new method called Artificial Swarm Intelligence (ASI) has been developed that enables networked human groups to deliberate in real-time systems, moderated by artificial intelligence algorithms. While traditional voting methods aggregate input provided by isolated participants, Swarm-based methods enable participants to influence each other and converge on solutions together. In this study we compare the output of traditional methods such as Majority vote and Borda count to the Swarm method on a set of divisive policy issues. We find that the rankings generated using ASI and the Borda Count methods are often rated as significantly more satisfactory than those generated by the Majority vote system (p<0.05). This result held for both the population that generated the rankings (the “in-group”) and the population that did not (the “out-group”): the in-group ranked the Swarm prioritizations as 9.6% more satisfactory than the Majority prioritizations, while the out-group ranked the Swarm prioritizations as 6.5% more satisfactory than the Majority prioritizations. This effect also held even when the out-group was subject to a demographic sampling bias of 10% (i.e. the out-group was composed of 10% more Labour voters than the in-group). The Swarm method was the only method to be perceived as more satisfactory to the “out-group” than the voting group.
Alexandru Marcoci and James Nguyen. (2019).Objectivity, ambiguity and theory choice. Erkenntnis 84(2), 343–357 Abstract Kuhn argued that scientific theory choice is, in some sense, a rational matter, but one that is not fully determined by shared objective scientific virtues like accuracy, simplicity, and scope. Okasha imports Arrow's impossibility theorem into the context of theory choice to show that rather than not fully determining theory choice, these virtues cannot determine it at all. If Okasha is right, then there is no function (satisfying certain desirable conditions) from 'preference' rankings supplied by scientific virtues over competing theories (or models, or hypotheses) to a single all-things-considered ranking. This threatens the rationality of science. In this paper we show that if Kuhn's claims about the role that subjective elements play in theory choice are taken seriously, then the threat dissolves.
Alexandru Marcoci, Ans Vercammen and Mark Burgman. (2019). ODNI as an analytic ombudsman: Is Intelligence Community Directive 203 up to the task? Intelligence and National Security 34(2), 205-224 Abstract In the wake of 9/11 and the war in Iraq, the Office of the Director of National Intelligence adopted Intelligence Community Directive (ICD) 203 – a list of analytic tradecraft standards – and appointed an ombudsman charged with monitoring their implementation. In this paper, we identify three assumptions behind ICD203: (1) tradecraft standards can be employed consistently; (2) tradecraft standards sufficiently capture the key elements of good reasoning; and (3) good reasoning leads to more accurate judgments. We then report on two controlled experiments that uncover operational constraints in the reliable application of the ICD203 criteria for the assessment of intelligence products.
Alexandru Marcoci, Mark Burgman, Ariel Kruger, Elizabeth Silver, Marissa McBride, Felix Singleton Thorn, Hannah Fraser, Bonnie Wintle, Fiona Fidler and Ans Vercammen. (2019). Better together: Reliable application of the post-9/11 and post-Iraq US intelligence tradecraft standards requires collective analysis. Frontiers in Psychology 9, 2634 (SI: Judgment and Decision Making Under Uncertainty) Abstract Background. The events of 9/11 and the October 2002 National Intelligence Estimate on Iraq's Continuing Programs for Weapons of Mass Destruction precipitated fundamental changes within the US Intelligence Community. As part of the reform, analytic tradecraft standards were revised and codified into a policy document – Intelligence Community Directive (ICD) 203 – and an analytic ombudsman was appointed in the newly created Office for the Director of National Intelligence to ensure compliance across the intelligence community. In this paper we investigate the untested assumption that the ICD203 criteria can facilitate reliable evaluations of analytic products. Method. Fifteen independent raters used a rubric based on the ICD203 criteria to assess the quality of reasoning of 64 analytical reports generated in response to hypothetical intelligence problems. We calculated the intra-class correlation coefficients for single and group-aggregated assessments. Results. Despite general training and rater calibration, the reliability of individual assessments was poor. However, aggregate ratings showed good to excellent reliability. Conclusions. Given that real problems will be more difficult and complex than our hypothetical case studies, we advise that groups of at least three raters are required to obtain reliable quality control procedures for intelligence products. Our study sets limits on assessment reliability and provides a basis for further evaluation of the predictive validity of intelligence reports generated in compliance with the tradecraft standards.
Alexandru Marcoci. (2018). On a dilemma of redistribution. Dialectica 72(3), 453-460 Abstract McKenzie Alexander presents a dilemma for a social planner who wants to correct the unfair distribution of an indivisible good between two equally worthy individuals or groups: either she guarantees a fair outcome, or she follows a fair procedure (but not both). In this paper I show that this dilemma only holds if the social planner can redistribute the good in question at most once. To wit, the bias of the initial distribution always washes out when we allow for sufficiently many redistributions.
Luc Bovens and Alexandru Marcoci. (2018). Gender-neutral restrooms require new (choice) architecture. Behavioural Public Policy Blog, April 17 Lead "What’s not to love about gender-neutral restrooms?" ask Bovens and Marcoci. Their spread could only come about trough a sensitive mix of good design and nudges; working on social norms and behaviours. Some discomforts may, however, prove to be beyond nudging, and an incremental, learning approach is probably required.
Luc Bovens and Alexandru Marcoci. (2017). To those who oppose gender-neutral toilets: they’re better for everybody. The Guardian, December 1 Lead Bovens and Marcoci's research into the economics of these facilities shows they cut waiting for women, and address the concerns of trans and disabled people.
Alexandru Marcoci and James Nguyen. (2017). Scientific rationality by degrees. In M. Massimi, J.W. Romeijn, and G. Schurz (Eds.), EPSA15 Selected Papers. European Studies in Philosophy of Science, Vol. 5 (Cham: Springer), 321-333 Abstract In a recent paper, Samir Okasha imports Arrow's impossibility theorem into the context of theory choice. He shows that there is no function (satisfying certain desirable conditions) from profiles of preference rankings over competing theories, models or hypotheses provided by scientific virtues to a single all-things-considered ranking. This is a prima facie threat to the rationality of theory choice. In this paper we show this threat relies on an all-or-nothing understanding of scientific rationality and articulate instead a notion of rationality by degrees. The move from all-or-nothing rationality to rationality by degrees will allow us to argue that theory choice can be rational enough.
Alexandru Marcoci. (2015). Review of Quitting Certainties: A Bayesian Framework Modeling Degrees of Belief, by Michael G. Titelbaum. Economics and Philosophy 31(1), 194–200
Zoé Christoff, Paolo Galeazzi, Nina Gierasimczuk, Alexandru Marcoci and Sonja Smets (Eds.). Logic and Interactive RAtionality Yearbook 2012: Volumes 1 & 2. Institute for Logic Language and Computation, University of Amsterdam
Alexandru Baltag, Davide Grossi, Alexandru Marcoci, Ben Rodenhäuser and Sonja Smets (Eds.). Logic and Interactive RAtionality Yearbook 2011. Institute for Logic Language and Computation, University of Amsterdam