/**
\page Performances Performances

In this page we collect hints on how to use the features available in PLUMED to speed up your calculations.
Please note that PLUMED performs many different tasks: it can calculate a number of different collective
variables, functions of collective variables, biases, on-the-fly analyses, and so on, in a way that is
compatible with a number of different molecular dynamics codes. This means that there cannot be a single
strategy to speed up all possible calculations.

PLUMED makes use of MPI and OpenMP to parallelise some of its functions, so try to always compile it with
these features enabled. Furthermore, newer compilers with proper optimisation flags can provide a dramatic
boost to performance.

PLUMED collects atoms from an external code and sends back forces, so to minimise the effect of PLUMED on
highly parallel calculations it is key to keep the number of atoms used by PLUMED at every calculation step
to a minimum. The fewer atoms you need to send to PLUMED, the smaller the overhead in the communication
between PLUMED and the code.

In the following you can find specific strategies for specific calculations; these could help you get the
most out of PLUMED in your simulations.

- \subpage GMXGPU
- \subpage Metadyn
- \subpage MTS
- \subpage Multicolvar
- \subpage Neighbour
- \subpage Openmp
- \subpage Secondary
- \subpage Time

\page GMXGPU GROMACS and PLUMED with GPU

Since version 4.6.x GROMACS can run in a hybrid mode making use of both your CPU and your GPU (either using
CUDA or OpenCL for newer versions of GROMACS). The calculation of the short-range non-bonded interactions is
performed on the GPU, while bonded and long-range interactions are calculated at the same time on the CPU.
By varying the cut-off for short-range interactions GROMACS can optimise the balance between CPU and GPU
load and obtain excellent performance.

GROMACS patched with PLUMED takes PLUMED into account in its load balancing, adding the PLUMED timings to
those resulting from bonded and long-range interactions. This means that the CPU/GPU balance will be
optimised automatically to take PLUMED into account!

It is important to notice that the optimal setup for GROMACS alone on the GPU and for GROMACS + PLUMED can
be different; try to change the number of MPI/OpenMP processes (\ref Openmp) used by GROMACS and PLUMED to
find the optimal performance. Remember that in GROMACS multiple MPI ranks can share the same GPU, i.e. if
you have 4 cores and 2 GPUs you can:

- use 2 MPI / 2 GPU / 2 OpenMP:

\verbatim
export PLUMED_NUM_THREADS=2
mpiexec -np 2 gmx_mpi mdrun -nb gpu -ntomp 2 -pin on -gpu_id 01
\endverbatim

- use 4 MPI / 2 GPU:

\verbatim
export PLUMED_NUM_THREADS=1
mpiexec -np 4 gmx_mpi mdrun -nb gpu -ntomp 1 -pin on -gpu_id 0011
\endverbatim

\page Metadyn Metadynamics

Metadynamics can be sped up significantly using grids, which are activated by setting the GRID_MIN and
GRID_MAX keywords of \ref METAD. This makes the addition of a hill to the list a bit slower (since the
Gaussian has to be evaluated for many grid points) but the evaluation of the potential very fast. Since the
former is usually done only every few hundred steps, whereas the latter is typically done at every step,
using grids will make the simulation much faster.
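As a minimal sketch (the atom indices, hill parameters, and grid boundaries here are purely illustrative;
the boundaries should cover the range actually explored by the variable), a grid-enabled \ref METAD input
could look like this:

\verbatim
d: DISTANCE ATOMS=1,10
METAD ARG=d HEIGHT=1.0 SIGMA=0.1 PACE=500 GRID_MIN=0.0 GRID_MAX=5.0
\endverbatim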
Notice that when restarting a simulation the history is read by default from the hills file and the hills
are added again to the grid. This allows one to change the grid boundaries upon restart. However, the first
step after a restart is usually very slow. Since PLUMED 2.3 you can also store the grid in a file and read
it upon restart. This can be particularly useful if you perform many restarts and if your hills file has
become very large. For the precise syntax, see \ref METAD.

\page MTS Multiple time stepping

By setting a STRIDE different from 1, you change how frequently an action is calculated. In the case of
actions such as \ref PRINT, this just means how frequently you dump some quantity to disk. Notice that
variables are only computed when necessary. Thus, if a variable only appears as the argument of a \ref PRINT
statement with STRIDE=10, it will be computed every 10 steps.

In a similar fashion, the STRIDE keyword can be used in a bias potential so as to apply the bias only every
few steps. In this case, the forces from the bias potential are scaled up by a factor equal to STRIDE. This
technique can allow your simulation to run faster if you need to apply a bias potential on some very
expensive collective variable. Consider the following input:

\verbatim
c1: COM ATOMS=1-1000
c2: COM ATOMS=1001-2000
d: DISTANCE ATOMS=c1,c2
METAD ARG=d HEIGHT=1 SIGMA=0.1 BIASFACTOR=5 PACE=500
\endverbatim

This performs a \ref METAD simulation biasing the distance between two centers of mass. Since computing
these centers requires a lot of atoms to be imported from the MD engine, it could slow down the simulation
significantly. Notice that whereas the bias is changed every PACE=500 steps, it is applied every STRIDE
steps, where STRIDE=1 by default. The following input could lead to a significantly faster simulation at the
price of a negligible systematic error:

\verbatim
c1: COM ATOMS=1-1000
c2: COM ATOMS=1001-2000
d: DISTANCE ATOMS=c1,c2
METAD ARG=d HEIGHT=1 SIGMA=0.1 BIASFACTOR=5 PACE=500 STRIDE=2
\endverbatim

Similarly, the STRIDE keyword can be used with other biases (e.g. \ref RESTRAINT). The technique is
discussed in detail in \cite Ferrarotti2015. See also \subpage EFFECTIVE_ENERGY_DRIFT.

\page Multicolvar Multicolvar

Whenever you have a multicolvar action such as:

\verbatim
COORDINATIONNUMBER SPECIES=1-100 SWITCH={RATIONAL R_0=1. D_MAX=3.0} MORE_THAN={RATIONAL R_0=6.0 NN=6 MM=12 D_0=0}
\endverbatim

you will get a colossal speedup by specifying the D_MAX keyword in all the switching functions that act on
distances. D_MAX tells PLUMED that the switching function is strictly zero if the distance is greater than
this value. As a result PLUMED knows that it does not need to calculate these zero terms in what are
essentially sums with a very large number of terms. In fact, when D_MAX is set, PLUMED uses linked lists
when calculating these coordination numbers, which is what gives you such a dramatic increase in
performance.

\page Neighbour Neighbour Lists

The following collective variables can be sped up by making use of neighbour lists:

- \ref COORDINATION
- \ref DHENERGY
- \ref PATHMSD

By tuning the cut-off of the neighbour list and the frequency at which the list is recalculated it is
possible to balance accuracy against performance.

Notice that for \ref COORDINATION and \ref DHENERGY using a neighbour list could imply that a smaller number
of atoms are requested from the host MD engine. This is typically true when considering the
\ref COORDINATION of a small number of atoms (e.g. a ligand) against many atoms (e.g. water). When the
neighbour list is used, only the water atoms close to the ligand will be requested at each step.
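As a sketch of such a setup, using the neighbour-list keywords of \ref COORDINATION (the atom ranges,
cut-off, and stride below are only illustrative and should be adapted to your system):

\verbatim
# coordination of a small ligand (atoms 1-20) with water oxygens (atoms 100-10000)
# NL_CUTOFF should be safely larger than R_0; NL_STRIDE sets how often the list is rebuilt
c: COORDINATION GROUPA=1-20 GROUPB=100-10000 R_0=0.3 NLIST NL_CUTOFF=0.5 NL_STRIDE=20
PRINT ARG=c FILE=colvar STRIDE=100
\endverbatim

The larger NL_STRIDE is, the less often the list is rebuilt, but the cut-off then needs a correspondingly
larger buffer above R_0 to keep the result accurate.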
\warning
Notice that the calculation of the neighbour list is not parallelised for \ref COORDINATION and
\ref DHENERGY. As a consequence, if you run with many processors and/or OpenMP threads, the neighbour list
might even make the calculation slower.

\page Openmp OpenMP

PLUMED is partly parallelised using OpenMP. This should be enabled by default if your compiler supports it,
and can be disabled with `--disable-openmp`. At runtime, you should set the environment variable
PLUMED_NUM_THREADS to the number of threads you wish to use with PLUMED. By default (if PLUMED_NUM_THREADS
is unset) OpenMP will be disabled at runtime. E.g., to run with GROMACS you should do:

\verbatim
export PLUMED_NUM_THREADS=8
mdrun -plumed
\endverbatim

Notice that:
- This option is likely to improve performance, but could also slow down the code in some cases.
- Results could be slightly different because of numerical round-off and a different order of summation.
  This should be harmless.
- The optimum number of threads is not necessarily "all of them", nor does it have to be equal to the number
  of threads used to parallelise the MD code.
- Only a few CVs are parallelised with OpenMP (currently, \ref COORDINATION and \ref DHENERGY).
- You might also want to tune the environment variable PLUMED_CACHELINE_SIZE, by default 512, to match the
  size of the cache lines on your machine. This is used by PLUMED to decrease the number of threads used in
  each loop so as to avoid clashes in memory access. This variable is expected to affect performance only,
  not results.

\page Secondary Secondary Structure

Secondary structure collective variables (\ref ALPHARMSD, \ref PARABETARMSD and \ref ANTIBETARMSD) can be
particularly demanding if you want to calculate them for all the residues of a protein. This is particularly
true for the calculation of beta structures.

The first thing to do to speed up \ref PARABETARMSD and \ref ANTIBETARMSD is to use the keyword
STRANDS_CUTOFF (e.g. STRANDS_CUTOFF=1); in this way only a subset of the possible fragments, namely those
less than 1.0 nm apart, is used in the calculation.

The metric used to calculate the distance from the ideal secondary structure elements can also influence the
performance: try to use TYPE=OPTIMAL or TYPE=OPTIMAL-FAST instead of TYPE=DRMSD.

Finally, try to reduce the number of residues included in the calculation.

\page Time Time your Input

Once you have prepared your PLUMED input file you can run a test simulation, or use the driver, to see which
collective variable, function, bias or analysis is consuming the most time and can thus be the target for a
different definition (use fewer atoms, change the relevant parameters, or just use something else).

To obtain an accurate timing of your input you can use \ref DEBUG with the DETAILED_TIMERS flag.
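As a minimal illustration, it is enough to add the following line to your input; the detailed, per-action
timers are then reported in the PLUMED log at the end of the run:

\verbatim
# switch on detailed, per-action timers
DEBUG DETAILED_TIMERS
\endverbatim

*/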