A step-by-step tutorial to perform PCA with Gromacs MD trajectory

It is a common practice to perform principal component analysis to explore the transitions and dynamics of macromolecules simulations.

There may exist multiple states in the free energy landscape, so it is important to extract the representative structures from the energy minima in the free energy surface. To generate such a free energy surface, we could define some collective variables (CVs). However, due to the high dimensionality of the simulation trajectory, it is not always straightforward to select several more important CVs.

So the problem is how could we find one or two CVs which could describe the slow motions of the system? To this end, PCA-based dimension reduction and projections could partially solve the problem by transforming the original dataset and grasp the largest variance of the system. Although there has been a tool in Gromacs, which perform the PCA using g_covar and g_anaeig. The two utilities, however, are not flexible enough to do customized PCA with unconventional dataset.

The detail theory of PCA is out of the scope of this tutorial, but an explanation could be found in the following links:

https://en.wikipedia.org/wiki/Principal_component_analysis

https://www.youtube.com/watch?v=HMOI_lkzW08

The tutorial is organized as the following:

1. What do we need before we calculate PCA using xtc trajectory?

2. How to perform the analysis?

3. How to visualize the result?

4. What's more?

Lets begin!

1. What we need?

Working environment:

So I just presume that you are comfortable with Linux, since the whole tutorial and codes are deployed in Linux.

Software and libraries:

Gromacs

Anaconda or Miniconda

python 3.6 (or 2.7), pandas, scikit-learn, numpy

Files:

In your working directory, you need a Gromacs trajectory file (eg. mytraj.xtc), a reference pdb file. The reference pdb file works as the topology file, as we need in all other MD packages.

The atom numbers in the xtc file should be extactly the same as that in the pdb file.

Read Gromacs xtc with mdtraj

>>> import mdtraj as mt
>>> traj = mt.load_xtc(traj.xtc, reference.pdb)

mdtraj manual could be found here:
http://mdtraj.org/latest/index.html

2. Howto perform PCA analysis?

If you'd like to use the all-atom xyz coordination as the input for the PCA:

>>> xyz = traj.xyz.reshape((traj.xyz.shape[0], -1))

Or you only need the C-alpha xyz
>>> calpha_index = traj.topology.select("name CA")
>>> traj = traj.atom_slice(calpha_index)
>>> xyz = traj.xyz.reshape((traj.xyz.shape[0], -1))

Do PCA calculation with scikit-learn

>>> from sklearn import decomposition
>>> pca = decomposition.PCA()
>>> pca.fit(xyz)
>>> xyz_transformed = pca.transform(xyz)

Now you have need projections stored in xyz_transformed.
The eigenvector and eigenvalue are stored in pca instance.

3. Visualise the first two projections

We use matplotlib to visualize the first 2 PCs.

>>> import matplotlib.pyplot as plt
>>> plt.scatter(xyz_transformed[:, 0], xyz_transformed[:, 1],)
>>> plt.xlabel("PC1")
>>> plt.ylabel("PC2")
>>> plt.show()

Actually, as long as we could prepare our trajectory data into a certain format, we could
do the PCA calculation. We could use the C-C contact map, or all the dihedral angles (but have to transform them into sin and cos values first), or hydrogen bonding contact map, and feed the MxN matrix into scikit-learn.

4. What's more?

Acutally, there is a package for easy gromacs trajectory PCA analysis:

https://www.github.com/zhenglz/dockingML

Download that package, install it, and then put the package bin path in your bashrc.

$ export PATH=/home/john/dockingML/bin:$PATH
$ # show help information
$ gmx_pca -h
$ # do diheral pca calculation, but have to prepare an index file contain the dihedral atom indices
$ gmx_pca.py -mode dihedral -f mytrajecotry.xtc -s reference.pdb -n dihedral.ndx -proj 10

Any issues or bugs, please leave me a comment. Thanks!

Search This Blog

Memory Down