How have AlphaFold 3’s predictions been validated?
The announcement of AlphaFold 3 (AF3) in May 2024 triggered a rapid response from the scientific community. Within months, a wave of independent validation studies and high-performing alternatives began to emerge.
The rise of alternatives
Technical reports and releases for multiple high-performing alternative models began to emerge remarkably quickly, through late 2024 and early 2025. These alternatives are not mere clones of AF3; they often introduce their own architectural innovations and features, contributing to the broader scientific understanding of multimodal structure prediction.
HelixFold-3, developed by the PaddleHelix team (Liu et al., 2024), builds upon prior HelixFold models (HelixFold, HelixFold-Single, HelixFold-Multimer), with insights from AlphaFold 3, and claims accuracy comparable to AF3 across the molecular types AF3 handles (proteins, nucleic acids and small-molecule ligands). HelixFold-3 was trained on PDB structures released before September 30, 2021, augmented with self-distillation datasets. In an evaluation focusing on its utility for Free Energy Perturbation (FEP) calculations, HelixFold-3 outperformed AF2 in predicting binding site conformations. FEP calculations using HelixFold-3 predicted structures (both holo and apo) achieved accuracy comparable to those using experimental crystal structures, even for novel ligand derivatives not present in its training data (Furui and Ohue, 2025).
Chai-1, from the Chai Discovery team, is a multi-modal foundation model for molecular structure prediction (Boitreaud et al., 2024). Chai-1’s model architecture and training strategy largely follow those of AlphaFold 3, but involve training a single model with a data cutoff of January 12, 2021. Key architectural additions include the incorporation of residue-level embeddings from a large protein language model to enhance single-sequence prediction capabilities, and new trainable constraint features (pocket, contact and docking constraints) that allow the model to be prompted with experimental restraints. Performance claims include a 77% ligand RMSD success rate on the PoseBusters benchmark (comparable to AF3’s 76%), which increases to 81% when prompted with the apo protein structure. Chai-1 is available as a Python package.
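For readers who want to try Chai-1 locally, the minimal sketch below shows roughly how the chai_lab Python package can be invoked. The entry point and argument names follow the package’s published example at the time of writing and may change between releases, so treat this as an illustration rather than a definitive recipe; the FASTA path is a placeholder.

```python
# Minimal sketch of running Chai-1 via its Python package (chai_lab).
# The run_inference entry point and argument names follow the package's
# published example; check the installed release for the exact signature.
from pathlib import Path

from chai_lab.chai1 import run_inference

# A FASTA file describing the complex; entity types (protein, ligand, ...)
# are encoded in the record headers, per the Chai-1 documentation.
fasta_path = Path("example_complex.fasta")  # placeholder input

candidates = run_inference(
    fasta_file=fasta_path,
    output_dir=Path("chai1_outputs"),
    num_trunk_recycles=3,      # trunk recycling iterations
    num_diffn_timesteps=200,   # diffusion denoising steps
    seed=42,
    device="cuda:0",           # older releases expect torch.device("cuda:0")
    use_esm_embeddings=True,   # protein language-model embeddings
)
# `candidates` holds the ranked predicted structures and their confidence scores.
```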
Boltz-1 and Boltz-2 each introduce novel capabilities in biomolecular modelling. Boltz-1 (Wohlwend et al., 2025) adheres to the general framework and architecture of AlphaFold 3 but introduces key changes, including conditioning predictions on user-defined binding pockets and a “Boltz-steering” feature designed to improve physical plausibility by addressing issues like steric clashes and incorrect chirality (in the Boltz-1x variant). Building on Boltz-1, the more recent Boltz-2 (Passaro et al., 2025) uniquely offers the capability to predict binding affinity. Boltz-2 expands training data beyond static structures to include experimental and molecular dynamics ensembles, and enhances user control through conditioning on experimental methods, user-defined distance constraints, and multi-chain template integration. When benchmarked against a diverse set of unseen complexes, Boltz-2 matches or moderately improves on Boltz-1’s performance across modalities. Compared with other models such as Chai-1 and Protenix, Boltz-2 performs competitively, though it currently lags slightly behind AlphaFold 3, particularly in antibody-antigen structure prediction, where AlphaFold 3 retains a performance advantage.
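As an illustration of the user control described above, the hedged sketch below writes a Boltz-style input pairing a protein with a small-molecule ligand and requesting a binding-affinity prediction for that ligand. The field names (sequences, properties, affinity, binder) reflect our reading of the Boltz documentation and the sequences are placeholders; check the schema and command-line flags against the release you install.

```python
# Illustrative sketch of a Boltz-2 input requesting structure plus binding
# affinity. The YAML schema shown (sequences, properties -> affinity -> binder)
# is an assumption based on the Boltz documentation; verify against your version.
import yaml

boltz_input = {
    "version": 1,
    "sequences": [
        {"protein": {"id": "A", "sequence": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"}},  # placeholder
        {"ligand": {"id": "B", "smiles": "CC(=O)Oc1ccccc1C(=O)O"}},  # aspirin as a stand-in
    ],
    # Ask for the binding affinity of entity B against the protein.
    "properties": [{"affinity": {"binder": "B"}}],
}

with open("complex.yaml", "w") as fh:
    yaml.safe_dump(boltz_input, fh, sort_keys=False)

# The job would then typically be launched from the command line, e.g.:
#   boltz predict complex.yaml --use_msa_server
```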
Protenix, from the ByteDance AML AI4Science Team, is described as a “comprehensive reproduction of AlphaFold3” implemented in PyTorch (Chen et al., 2025). The Protenix architecture is based on AlphaFold 3, with refinements where the published description is ambiguous, corrections of typographical errors, and targeted adjustments such as slight modifications to the confidence head and zero-initialization strategies for several modules. Notably, Protenix’s data pipeline does not use structural templates or MSAs for nucleic acid chains. Protenix was trained using PDB structures (cutoff September 30, 2021) and protein monomer distillation data from AlphaFold2/OpenFold.
Independent benchmarking of AlphaFold 3
Following the release of AlphaFold 3, numerous independent research groups have undertaken efforts to benchmark its performance across a variety of biomolecular prediction tasks.
One preprint focused on AlphaFold 3’s structure predictions of metal-protein interactions, comparing it to another system called RoseTTAFold All-Atom. The researchers concluded that “AF3 provides realistic predictions for metal ions” (Dürr and Rothlisberger, 2024).
A second group studied human T cell receptors, specifically their ability to recognise and bind to NRAS cancer neoantigens. They compared multiple implementations of AlphaFold and found that AlphaFold 3 “showed strong performance” – albeit not quite as good as a specialised implementation of AlphaFold 2 called TCRmodel2 (Wu et al., 2024). However, the authors did not optimise AlphaFold 3’s performance. They generated 25 or 200 ranked predictions for AlphaFold 2.3, 1000 predictions for TCRmodel2, and only 5 diffusion samples from 1 seed for AlphaFold 3. It is necessary to sample many seeds to obtain AlphaFold 3’s best predictions for immune system proteins (Abramson et al., 2024).
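To make the seed-sampling point concrete, the sketch below assembles an input in the open AlphaFold 3 JSON format that requests predictions from many model seeds, so that the top-ranked diffusion sample is drawn from a larger pool. The field names follow the published AlphaFold 3 input specification; the sequences are placeholders and the seed count is arbitrary.

```python
# Sketch: building an AlphaFold 3-style input JSON that requests many model
# seeds, so the best-ranked sample comes from a larger pool of predictions.
# Field names follow the published AlphaFold 3 input format; sequences are
# placeholders, not a real receptor-antigen pair.
import json

num_seeds = 20  # arbitrary; immune-receptor targets benefit from many seeds

af3_input = {
    "name": "tcr_antigen_example",
    "modelSeeds": list(range(1, num_seeds + 1)),
    "sequences": [
        {"protein": {"id": "A", "sequence": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"}},
        {"protein": {"id": "B", "sequence": "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERM"}},
    ],
    "dialect": "alphafold3",
    "version": 1,
}

with open("tcr_antigen_example.json", "w") as fh:
    json.dump(af3_input, fh, indent=2)
```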
A third preprint explored whether deep learning systems can learn the physics of protein-ligand interactions. They found “a significant divergence from expected physical behaviours”. In particular, AlphaFold 3 placed small molecules like ATP and heme into their natural binding sites, even if the protein residues that formed the binding site had been mutated to prevent any such interaction. The authors interpreted this to mean that AlphaFold 3 predicts the binding positions of small molecules “not based on molecular interactions”, but using patterns observed in regions distant from the binding site, or in the overall fold of the proteins (Masters et al., 2024).
The function of a complex is determined by its binding energy landscape as well as its molecular structure. A fourth preprint benchmarked AlphaFold 3 and other prediction methods against SKEMPI, a dataset used for assessing binding energy. The authors suggested that “AlphaFold 3 learns unique features beneficial for estimating binding free energy”. The preprint also demonstrated that AlphaFold 3 can improve initial predictions made by other methods (Lu et al., 2024).
RNA structure prediction presents unique challenges due to fundamental differences between RNA and proteins. Bernard et al. (2025) conducted an extensive benchmark of AF3 across diverse RNA test sets, including CASP-RNA. Notably, AlphaFold 3 demonstrated robust generalisation for ribosomal structures, although it is important to acknowledge that the majority of long RNA structures deposited in the PDB are ribosome-related. AlphaFold 3 generally reproduces key RNA interactions accurately and excels in predicting RNA torsion angles. However, the study noted that predicting the 3D structure of long RNAs becomes increasingly difficult as sequence length grows, and that AlphaFold 3 struggles to consistently reproduce all non-Watson-Crick interactions, which are crucial for structural stability, and to predict structures from orphan RNA families without supplementary contextual information (Bernard et al., 2025).
A drug discovery-focused assessment by Zheng et al. (2025) provided further nuanced insights. This study found that AF3 excels at predicting static protein-ligand interactions where minimal conformational changes occur upon binding (protein RMSD < 0.5 Å compared to the apo state). In such cases, AF3 significantly outperformed traditional docking methods, particularly in the accuracy of side-chain orientations. AF3 demonstrated value as a “true-hit binary interaction modeler,” capable of generating reliable structural models for experimentally validated binding pairs. However, a persistent bias towards predicting active G protein-coupled receptor (GPCR) conformations was observed, irrespective of whether the bound ligand was an agonist or an antagonist.
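The “< 0.5 Å” criterion above is simply a measure of how little the protein moves between its apo and holo forms. The sketch below shows one crude way to compute such a value with Biopython, superposing Cα atoms of the two structures; a real benchmark would pair residues via sequence alignment rather than by order, so this is illustrative only, and the file names are placeholders.

```python
# Sketch: classify a target as a "rigid" binder by superposing C-alpha atoms
# of apo and holo structures and checking the RMSD against a 0.5 Å threshold.
# Residues are paired naively by order; real pipelines align sequences first.
from Bio.PDB import PDBParser, Superimposer

parser = PDBParser(QUIET=True)
apo = parser.get_structure("apo", "apo.pdb")    # placeholder file names
holo = parser.get_structure("holo", "holo.pdb")

def ca_atoms(structure, chain_id="A"):
    """Collect C-alpha atoms from one chain of the first model."""
    return [res["CA"] for res in structure[0][chain_id] if "CA" in res]

apo_ca, holo_ca = ca_atoms(apo), ca_atoms(holo)
n = min(len(apo_ca), len(holo_ca))  # naive pairing by residue order

sup = Superimposer()
sup.set_atoms(apo_ca[:n], holo_ca[:n])  # fixed, moving
print(f"C-alpha RMSD (apo vs holo): {sup.rms:.2f} Å")
print("rigid binding site" if sup.rms < 0.5 else "conformational change on binding")
```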
For covalent ligand prediction, an assessment using the new COValid benchmark showed that AF3 achieved near-perfect classification (average AUC = 98.3%) of covalent active binders against property-matched decoys, dramatically outperforming classical covalent docking tools (Shamir and London, 2025). Applying physics-based scoring to AF3-generated models further improved the ranking of these covalent complexes. Notably, AF3 accurately predicted the structure of one covalent complex (PDB: 7O70) that was determined experimentally after the model’s training cutoff date, with a 0.45 Å pocket-aligned RMSD, suggesting some generalisation capability in this specific context.
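For readers less familiar with AUC-based evaluation, the snippet below shows how a binder-versus-decoy AUC could be computed from per-complex confidence scores (for example, a model’s ranking score or ipTM). The labels and scores are invented placeholders purely to demonstrate the calculation, not values from the COValid study.

```python
# Sketch: scoring covalent binders against property-matched decoys by AUC,
# using a per-complex confidence score. All numbers are invented placeholders.
from sklearn.metrics import roc_auc_score

# 1 = experimentally validated covalent binder, 0 = property-matched decoy
labels = [1, 1, 1, 0, 0, 0, 0]
confidence = [0.92, 0.88, 0.79, 0.41, 0.55, 0.30, 0.62]  # placeholder model scores

auc = roc_auc_score(labels, confidence)
print(f"binder-vs-decoy AUC: {auc:.3f}")  # 1.0 = perfect separation, 0.5 = random
```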
The FoldBench assessment (Xu et al., 2025), a comprehensive benchmark for all-atom predictors, rigorously evaluated AF3 against models like Boltz-1, Chai-1, and HelixFold-3. To ensure a low-homology benchmark, targets with high sequence or structural similarity to training set entries were removed.
For protein-ligand interactions, AF3 achieved a 64.9% success rate on the overall dataset, outperforming the runner-up, Boltz-1, by a significant margin of nearly 10%. Interestingly, on a subset of “unseen proteins” (less than 40% sequence identity to any protein sequence in the training set), AF3’s success rate increased to 69.0%. However, for “unseen ligands” (ligands with less than 0.5 Tanimoto similarity to those in the training set complexed with homologous proteins), the performance (64.3% success rate) was comparable to its overall performance.
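The “unseen ligand” criterion can be made concrete with a similarity filter like the one sketched below, which uses RDKit Morgan fingerprints and a 0.5 Tanimoto cutoff. The fingerprint parameters and example SMILES are illustrative assumptions, not FoldBench’s exact settings.

```python
# Sketch: flag a candidate ligand as "unseen" if its maximum Tanimoto
# similarity to training-set ligands falls below 0.5. Fingerprint settings
# and SMILES strings are illustrative, not FoldBench's actual protocol.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fingerprint(smiles):
    """Morgan (ECFP4-like) bit-vector fingerprint for a SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

training_ligands = [
    "CC(=O)Oc1ccccc1C(=O)O",          # aspirin (placeholder training ligand)
    "CN1C=NC2=C1C(=O)N(C)C(=O)N2C",   # caffeine (placeholder training ligand)
]
query = "CC(C)Cc1ccc(cc1)C(C)C(=O)O"  # ibuprofen (placeholder candidate)

query_fp = fingerprint(query)
max_sim = max(
    DataStructs.TanimotoSimilarity(query_fp, fingerprint(s)) for s in training_ligands
)
print(f"max Tanimoto similarity to training ligands: {max_sim:.2f}")
print("unseen ligand" if max_sim < 0.5 else "similar to a training-set ligand")
```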
A critical finding from FoldBench is that AF3’s ligand docking accuracy notably diminishes as the ligand’s similarity to the training set decreases. Among evaluated models, AlphaFold 3 consistently demonstrates superior accuracy across the majority of tasks. However, a significant challenge persists in the prediction of antibody-antigen complexes. While AF3 performed best among the tested models, its failure rate still exceeded 50%. Despite this, the researchers concluded that AF3’s “superior abilities in monomer and interaction prediction, conformational change modeling, and ranking underscore its remarkable generalization and robustness, positioning it as the leading model.”
Tools for a new generation of predictors
The rapid proliferation of deep learning-based structure prediction methods necessitates new tools for efficient operation and evaluation.
ABCFold was designed to simplify the execution of AlphaFold 3, Boltz-1, and Chai-1 (Elliot et al., 2025). It streamlines operation by taking a single input in the AlphaFold 3 JSON format and automatically converting it for use with Boltz-1 and Chai-1. It also facilitates running these methods with custom multiple sequence alignments (MSAs) and, for AlphaFold 3 only, with custom templates. Finally, it remaps and reorders output chains, which the three programs otherwise treat differently, making their results easier to compare.
AlphaBridge is a collection of tools for post-processing and analysing the interaction interfaces between the components of predicted macromolecular complexes, and for visualising the most relevant information in an accessible and intuitive way for scientists interested in such complexes (Alvarez-Salmoral et al., 2024). Currently, AlphaBridge works with results from the AlphaFold 3 server.