AFragmenter is a schema-free, tunable protein domain segmentation tool for AlphaFold structures based on network analysis.
- Schema free: AFragmenter only uses the PAE values from AlphaFold structures. No domain-segmentation scheme is learned or used for evaluation.
- Tunable segmentation: The 'resolution' parameter controls the coarseness of clustering, and thus the number of clusters / domains.
  - Higher resolution: yields more, smaller clusters
  - Lower resolution: yields fewer, larger clusters
Resolution = 0.8 | Resolution = 1.1 | Resolution = 0.3 |
---|---|---|
![]() | ![]() | ![]() |

Protein: P15807
- Network representation: Each protein residue is treated as a node within a fully connected network.
- Edge weighting: The edges between the nodes are weighted using transformed Predicted Aligned Error (PAE) values from AlphaFold, reflecting relative positional confidence between residues.

  Details on the use of PAE values
  - PAE values differ between inter-domain and intra-domain residue pairs.
  - Intra-domain residue pairs are expected to have lower PAE values than inter-domain residue pairs.
  - This difference is used to distinguish well-structured regions within a protein structure from other well-structured regions and from poorly structured regions.
  - This enables us to cluster the residues of well-structured regions together.
- Clustering with Leiden algorithm: Utilizes the Leiden clustering algorithm to group residues into domains, with adjustable resolution parameters to control cluster granularity.
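The intra- versus inter-domain PAE contrast described above can be illustrated with a toy matrix. This is a minimal sketch on made-up numbers: the 6-residue matrix, the `domains` labels and the `mean_pae` helper are all hypothetical and are not part of AFragmenter.

```python
from statistics import mean

# Hypothetical 6-residue PAE matrix (in Angstrom): residues 0-2 and 3-5 form
# two confident blocks, with a high predicted error between the blocks.
LOW, HIGH = 2.0, 18.0
domains = [0, 0, 0, 1, 1, 1]  # assumed domain label per residue
pae = [[0.0 if i == j else (LOW if domains[i] == domains[j] else HIGH)
        for j in range(6)] for i in range(6)]


def mean_pae(pae, labels, same_domain):
    """Mean PAE over off-diagonal residue pairs within (or between) domains."""
    values = [pae[i][j]
              for i in range(len(labels))
              for j in range(len(labels))
              if i != j and (labels[i] == labels[j]) == same_domain]
    return mean(values)


intra = mean_pae(pae, domains, same_domain=True)
inter = mean_pae(pae, domains, same_domain=False)
print(f"mean intra-domain PAE: {intra:.1f}")
print(f"mean inter-domain PAE: {inter:.1f}")
```

In a real PAE matrix the two populations overlap, which is why the edge weights are computed from transformed rather than raw PAE values.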
The recommended way to use AFragmenter is through Jupyter notebooks, where visualization and fine-tuning of parameters are most easily done. The easiest way to begin is by using our Google Colab notebook.
- Note: While Colab notebooks offer convenience, they can be slower due to shared resources.
An alternative way to get started is by using the [webtool] (coming soon)
- Python Version: Ensure you have Python 3.9 or higher installed on your system.
- Operating Systems: The tool is compatible with Linux, macOS, and Windows.
- Set Up a Virtual Environment (Recommended): Creating a virtual environment helps manage dependencies effectively. Here's how to set it up:
# Install virtualenv if not already installed
pip install virtualenv
# Create a new virtual environment
virtualenv myenv
# Activate the virtual environment
# On Windows:
myenv\Scripts\activate
# On macOS/Linux:
source myenv/bin/activate
or alternatively, create and use a conda environment:
conda create --name myenv pip 'python>=3.9'
conda activate myenv
- Install AFragmenter: Install the package using pip within your activated virtual environment.
pip install AFragmenter
- Optional Dependencies:
  - py3Dmol: Required for protein structure visualization.
pip install py3Dmol
- After installation, verify that AFragmenter is correctly installed by running:
afragmenter --version
This command should display the installed version of AFragmenter.
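The same check can be done from inside Python, for example in a notebook cell. This is a small sketch using only the standard library; it assumes the distribution is registered under the name used in `pip install AFragmenter`, and `installed_version` is a hypothetical helper, not part of the package.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution="AFragmenter"):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


found = installed_version()
print(found if found else "AFragmenter is not installed in this environment")
```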
In this short tutorial, we will walk through the process of using AFragmenter to segment protein domains based on AlphaFold structures. We will use the example protein P15807 (PDB: 1KYQ) to demonstrate the steps involved. This protein is classified differently by various protein domain databases, making it an interesting case for domain segmentation.
P15807 is classified as a three-domain protein in both CATH and ECOD, a two-domain protein in SCOPe and InterPro, and a single-domain protein in SCOP.
Since AFragmenter is dependent on the PAE values of AlphaFold, it is a good idea to first have a look at the PAE plot.
from afragmenter import AFragmenter, fetch_afdb_data
pae, structure = fetch_afdb_data('P15807')
p15807 = AFragmenter(pae) # Or bring your own files: a = AFragmenter('filename.json')
p15807.plot_pae()
Here we see some regions of very low PAE values (dark green) on the PAE matrix, which could indicate different domains. However, there are still many green (low PAE) datapoints visible around these potential domains. Therefore, it is important to consider the PAE contrast threshold used.
These PAE values are transformed into edge weights to increase the contrast between high and low PAE values. The PAE contrast threshold can be adjusted to control this contrast. Below, we can see the effect of different thresholds on the weights of the graph.
import matplotlib.pyplot as plt
from afragmenter.plotting import plot_matrix
# `pae` is the PAE data fetched earlier with fetch_afdb_data('P15807')
fig, ax = plt.subplots(2, 2, figsize=(10, 10))
# Top row: raw PAE matrix and edge weights at the default threshold
p15807 = AFragmenter(pae)
p15807.plot_pae(ax=ax[0, 0])
plot_matrix(p15807.edge_weights_matrix, ax=ax[0, 1])
# Bottom row: edge weights at lower thresholds
p15807 = AFragmenter(pae, threshold=3)
plot_matrix(p15807.edge_weights_matrix, ax=ax[1, 0])
p15807 = AFragmenter(pae, threshold=1)
plot_matrix(p15807.edge_weights_matrix, ax=ax[1, 1])
ax[0, 0].set_title('PAE matrix')
ax[0, 1].set_title('Edge weights matrix (threshold=5)\n[default]')
ax[1, 0].set_title('Edge weights matrix (threshold=3)')
ax[1, 1].set_title('Edge weights matrix (threshold=1)')
plt.tight_layout()
plt.show()
A threshold of 3 seems to give a good contrast between the higher and lower PAE values.
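To make the idea of a soft cut-off concrete, here is a minimal sketch of how a threshold can reshape PAE values into edge weights. The logistic form and the `edge_weight` function are assumptions chosen for illustration; they are not taken from AFragmenter's source, and the actual transformation may differ.

```python
import math


def edge_weight(pae_value, threshold=5.0):
    """Hypothetical logistic transform: close to 1 for low PAE, close to 0 for high PAE.

    NOT AFragmenter's actual formula; it only illustrates a soft cut-off.
    """
    return 1.0 / (1.0 + math.exp(pae_value - threshold))


# Same three PAE values, evaluated under three thresholds
for threshold in (5.0, 3.0, 1.0):
    weights = [round(edge_weight(p, threshold), 3) for p in (1.0, 4.0, 10.0)]
    print(f"threshold={threshold}: weights for PAE 1, 4, 10 -> {weights}")
```

Under this assumed transform, a PAE of 4 keeps a fairly high weight at the default threshold of 5 but drops sharply at a threshold of 3, which is the kind of contrast gain seen in the plots above.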
Next, we cluster the residues into domains using the Leiden clustering algorithm. The default settings give one solution, and the resolution parameter can be changed to explore other potential solutions.
p15807 = AFragmenter(pae, threshold=3)
p15807.cluster() # default resolution = 0.8
p15807.plot_result()
p15807.py3Dmol(structure)
p15807.cluster(resolution=1.1)
p15807.cluster(resolution=0.3)
Once a satisfactory solution has been found, we can print the result and the FASTA file for each domain, or save them to files for further analysis.
p15807 = AFragmenter(pae, threshold=3)
p15807.cluster(resolution=1.1)
p15807.print_result()
┏━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ Domain ┃ Number of Residues ┃ Chopping ┃
┡━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━┩
│ 1 │ 137 │ 10-146 │
│ 2 │ 47 │ 147-193 │
│ 3 │ 81 │ 194-274 │
└────────┴────────────────────┴──────────┘
p15807.print_fasta(structure)
>P15807_1 10-146
QLKDKKILLIGGGEVGLTRLYKLIPTGCKLTLVSPDLHKSIIPKFGKFIQNEDQPDYRED
AKRFINPNWDPTKNEIYEYIRSDFKDEYLDLEDENDAWYIIMTCIPDHPESARIYHLCKE
RFGKQQLVNVADKPDLC
>P15807_2 147-193
DFYFGANLEIGDRLQILISTNGLSPRFGALVRDEIRNLFTQMGDLAL
>P15807_3 194-274
EDAVVKLGELRRGIRLLAPDDKDVKYRMDWARRCTDLFGIQHCHNIDVKRLLDLFKVMFQ
EQNCSLQFPPRERLLSEYCSS
# Or save it
p15807.save_result('result.csv')
p15807.save_fasta(structure, 'result.fasta')
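A saved FASTA file can be sanity-checked by comparing each sequence length with the residue range in its header. The sketch below parses the header layout shown in the printed output above (`>NAME START-END`). `parse_domain_fasta` is a hypothetical helper, and it assumes every domain is a single contiguous range, which may not hold for every result.

```python
def parse_domain_fasta(text):
    """Parse '>NAME START-END' FASTA records into (name, start, end, sequence)."""
    records = []
    for line in text.strip().splitlines():
        line = line.strip()
        if line.startswith(">"):
            name, chopping = line[1:].split()
            start, end = (int(x) for x in chopping.split("-"))
            records.append([name, start, end, ""])
        elif line:
            # Sequence lines are appended to the most recent record
            records[-1][3] += line
    return [tuple(r) for r in records]


# Two of the records printed above, copied verbatim
fasta = """
>P15807_2 147-193
DFYFGANLEIGDRLQILISTNGLSPRFGALVRDEIRNLFTQMGDLAL
>P15807_3 194-274
EDAVVKLGELRRGIRLLAPDDKDVKYRMDWARRCTDLFGIQHCHNIDVKRLLDLFKVMFQ
EQNCSLQFPPRERLLSEYCSS
"""

records = parse_domain_fasta(fasta)
for name, start, end, seq in records:
    expected = end - start + 1
    status = "OK" if len(seq) == expected else "MISMATCH"
    print(name, f"{start}-{end}", len(seq), status)
```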
Docs coming soon...
The 'contrast threshold' serves as a soft cut-off to increase the distinction between low and high PAE values. It is used in calculating the edge weights of the network and will thus have a large impact on the clustering and segmentation results. It is important to consider this threshold in the context of the AlphaFold results for the protein of interest.
Examples:
Overall good structure with high pLDDT and low PAE scores for the majority of the protein, and lower pLDDT and high PAE scores for the disordered regions / loops, as expected. The default threshold should work well (default PAE threshold = 5).
AlphaFold structure | PAE plot | Edge weights |
---|---|---|
![]() | ![]() | ![]() |
The AlphaFold structure has very high pLDDT scores and low PAE scores, indicating strong confidence, with one loop as the exception. Several linkers, including the disordered N-terminal region, also show unexpectedly high pLDDT scores and low PAE scores, contrary to what would be expected for such regions. This apparent overconfidence is likely due to the inclusion of the crystal structure (1KYQ) in the AlphaFold training dataset.
Lowering the threshold can help reduce this apparent confidence, making it easier to differentiate between genuinely well-structured regions and those that are more likely to be flexible or disordered.
AlphaFold structure | PAE plot |
---|---|
![]() | ![]() |

Edge weights (default threshold = 5) | Edge weights (threshold = 3) |
---|---|
![]() | ![]() |
Q9YFU8 is a great example to remind us again that the PAE scores are not primarily intended to be used for domain segmentation, but instead are a measure of how confident AlphaFold is in the relative position of two residues.
The PAE plot for Q9YFU8 shows two distinct parts of the protein separated by high PAE values, indicating uncertainty in their relative positions. Going off the previous examples, it would be natural to assume that there are two distinct domains in this protein, but this isn't necessarily the case. Q9YFU8 has two crystal structures in the PDB: 1W5S and 1W5T. Superposition of these crystal structures reveals that a significant portion of the protein overlays well, while another part shows a large deviation in orientation. AlphaFold likely learned this similarity and difference, resulting in low PAE scores for the overlapping regions and high PAE scores between the differently oriented parts. These structures might explain the resulting PAE scores, but we still need to choose the threshold carefully to properly segment the remaining parts of the protein structure.
AlphaFold structure | PAE plot | Crystal structures: 1W5S (green) and 1W5T (red) |
---|---|---|
![]() | ![]() | ![]() |
Lowering the threshold, even when initial inspection suggests it is not necessary, can still change the results. With the default threshold we see two domains, consistent with the results from SCOP and SCOPe, while lowering the threshold results in three domains, consistent with ECOD, CATH, InterPro and SCOP (SCOP can contain multiple solutions).
Threshold = 5 | Threshold = 3 |
---|---|
![]() | ![]() |

(Other settings kept as default values)
The resolution can be thought of as the coarseness of clustering. Increasing the resolution will result in more, smaller clusters (domains). Decreasing the resolution will result in fewer but larger clusters.
Examples:
Resolution = 0.8 | Resolution = 1.1 | Resolution = 0.3 |
---|---|---|
![]() | ![]() | ![]() |

Resolution = 0.8 | Resolution = 1.4 |
---|---|
![]() | ![]() |
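The effect of the resolution can be reproduced on a tiny weighted graph. The sketch below scores two candidate partitions with a CPM-style quality function (the sum of `weight - resolution` over all same-cluster pairs). The four-node graph, its weights and the `cpm_quality` helper are made up for illustration; the sketch does not call AFragmenter or a Leiden implementation, and the exact objective AFragmenter optimizes for a given setting is an assumption here.

```python
from itertools import combinations

# Hypothetical edge weights for 4 residues: {0, 1} and {2, 3} are tightly
# linked (0.9), while pairs across the two groups are moderately linked (0.5).
groups = [0, 0, 1, 1]
weights = {(i, j): 0.9 if groups[i] == groups[j] else 0.5
           for i, j in combinations(range(4), 2)}


def cpm_quality(weights, membership, resolution):
    """CPM-style quality: sum of (weight - resolution) over same-cluster pairs."""
    return sum(w - resolution
               for (i, j), w in weights.items()
               if membership[i] == membership[j])


one_cluster = [0, 0, 0, 0]
two_clusters = [0, 0, 1, 1]

for resolution in (0.3, 0.8):
    merged = cpm_quality(weights, one_cluster, resolution)
    split = cpm_quality(weights, two_clusters, resolution)
    best = "one cluster" if merged > split else "two clusters"
    print(f"resolution={resolution}: merged={merged:.2f}, split={split:.2f} -> {best}")
```

With these weights, the low resolution favors keeping everything in one cluster, while the higher resolution favors the split, because only pairs whose weight exceeds the resolution are worth keeping together.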
The objective function is the function that is optimized during clustering; the choices are either CPM (Constant Potts Model) or Modularity. The Constant Potts Model does not suffer from the resolution limit problem as modularity does, leading to more, smaller, well-defined clusters. This means that 'CPM' will result in more small, tightly connected clusters that represent specific subgroups or communities within the data. On the other hand, 'Modularity' will tend to produce fewer, larger clusters that encompass broader groups within the data.
For AFragmenter, 'CPM' translates to a more sensitive approach where we see many more smaller clusters, especially for disordered regions. 'Modularity' is less sensitive to small shifts in PAE values, and will be better at clustering residues from disordered regions together.
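The resolution limit mentioned above can be seen on a classic textbook graph: a ring of 30 five-node cliques, each joined to the next by a single edge. The sketch below uses the standard unweighted definitions of modularity and CPM, written out by hand. It is not AFragmenter's or any Leiden library's code, and the graph, the helper functions and the CPM resolution of 0.5 are chosen purely for illustration.

```python
from itertools import combinations

N_CLIQUES, CLIQUE_SIZE = 30, 5

# Ring of cliques: each clique is fully connected, and one edge links the
# first node of each clique to the first node of the next clique.
edges = []
for c in range(N_CLIQUES):
    first = c * CLIQUE_SIZE
    edges.extend(combinations(range(first, first + CLIQUE_SIZE), 2))
    edges.append((first, ((c + 1) % N_CLIQUES) * CLIQUE_SIZE))

n_nodes = N_CLIQUES * CLIQUE_SIZE
per_clique = [node // CLIQUE_SIZE for node in range(n_nodes)]      # 30 clusters
per_pair = [node // (2 * CLIQUE_SIZE) for node in range(n_nodes)]  # 15 clusters


def modularity(edges, membership):
    """Standard modularity: sum over clusters of l_c/m - (d_c / 2m)^2."""
    m = len(edges)
    internal, degree = {}, {}
    for i, j in edges:
        ci, cj = membership[i], membership[j]
        degree[ci] = degree.get(ci, 0) + 1
        degree[cj] = degree.get(cj, 0) + 1
        if ci == cj:
            internal[ci] = internal.get(ci, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree.items())


def cpm(edges, membership, resolution):
    """Unweighted CPM: internal edges minus resolution times internal node pairs."""
    internal_edges = sum(1 for i, j in edges if membership[i] == membership[j])
    sizes = {}
    for c in membership:
        sizes[c] = sizes.get(c, 0) + 1
    internal_pairs = sum(n * (n - 1) // 2 for n in sizes.values())
    return internal_edges - resolution * internal_pairs


for name, partition in (("one cluster per clique", per_clique),
                        ("adjacent cliques merged", per_pair)):
    print(f"{name}: modularity={modularity(edges, partition):.4f}, "
          f"CPM(0.5)={cpm(edges, partition, 0.5):.1f}")
```

Modularity scores the merged partition higher even though each clique is an obvious cluster, whereas CPM at this resolution prefers one cluster per clique.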
Examples: