Cluster Expansion Using the CELL Tool

by Maria Troppenz & Santiago Rigamonti for exciting carbon

Purpose: In this tutorial, you will perform a cluster expansion of the total energy for the alloy Si1-xGex using the package CELL in combination with exciting. The iterative procedure of CELL is presented, which allows for building a Cluster Expansion for large parent ceLLs. This is possible thanks to efficient samplings of the configurational space.

0. Preparations

Read the following paragraphs before starting with the rest of this tutorial!

The first thing you should do is to create a directory where you should copy the source code of the CELL tool. From now on the symbol $ will indicate the shell prompt.

$ mkdir /home/tutorials/CELL

Then download the source code of CELL using the next link.

Copy the CELL.source.tar.gz file in the /home/tutorials/CELL directory and move inside it. Here, unpack the source code.

$ cd /home/tutorials/CELL
$ tar xvf CELL.source.tar.gz

General requirements: Bash shell. Python numpy, lxml, and sys libraries.

1. Theoretical background

Substitutional alloys, such as, e.g., Si1-xGex , have a large configurational space at every concentration $0 < x < 1$. This makes almost impossible to perform ab-initio calculations for every configuration. The cluster-expansion (CE) technique allows for building numerically efficient models to predict the energy $E$ of the alloy by exploiting the unique dependence of $E$ on the configuration (see Ref.[1]). It relies on highly accurate ab-initio calculations obtained by density-functional theory and has been applied to describe stable phases of bulk and surface alloys (see, for instance, Refs.[2] and [3]).

For an arbitrary configuration $s$ of the alloy, the energy can be parametrized in terms of clusters $\alpha=\{i,j,...\}$ (a set of crystal sites $i$):

\begin{align} \hat{E}_s = \sum_{\alpha}^{N_c} \,m_{\alpha} \,J_{\alpha}\, X_{s \alpha}. \end{align}

Here, $\hat{E}_s$ represents the prediction of the model for the energy $E_s$ of structure $s$. The coefficients $J_{\alpha}$ are the so-called effective-cluster interactions (ECIs). The sum runs over a set of $N_c$ clusters $\alpha$, each cluster built from up to $n$ sites (empty cluster, 1-point clusters, …, $n$-point clusters). The multiplicity $m_{\alpha}$ represents the number of distinct clusters obtained by applying all symmetry operations of the pristine (i.e. non-substituted) crystal to $\alpha$. The value $X_{s \alpha}$ represents the correlation of the cluster $\alpha$ to the configuration $s$ ($X_{s \alpha} \in [-1,+1]$), and is calculated by

\begin{align} X_{s \alpha} = \frac{1}{m_{\alpha}} ~ \sum_{\beta \equiv \alpha} \, \prod_{i \in \beta } \sigma_{si}\:. \end{align}

The correlation averages the product of the occupation variables $\sigma_{si}$'s over the clusters $\beta$ equivalent to $\alpha$. The occupation variable $\sigma_{si}$ are assigned to each site $i$ in structure $s$. For a binary alloy consisting of the two elements A and B, the variable $\sigma_{si}$ takes the value -1 if the site $i$ is occupied by atom of element A, or +1 if it is occupied by atom of element B (in case of Si1-xGex , A=Si and B=Ge).

The coefficients $J_{\alpha}$ of the CE are obtained by minimizing the objective function.

\begin{align} S^2=\frac{1}{N_t} \sum_{s=1}^{N_t} \, (E_s- \hat{E}_{s})^2 + A\sum_{\alpha } \, | J_{\alpha} |^2\:. \end{align}

The first term on the right hand side is the mean squared error (MSE) of the predicted values $\hat{E}_{s}$. The sum runs over all $N_t$ structures whose energies $E_s$ are known from ab-initio calculations and used to build the CE model. The $N_t$ structures are the so-called training set. The second term favors shrinkage of the ECIs and, additionally, this term regularizes the problem of finding the best fit, i.e., to find an optimal set of ECIs even if the number of clusters $N_c$ is larger than $N_t$. Both the optimal penalization strength $A$ and the optimal set of clusters are obtained by minimizing the the cross-validation score $CV$ (see Refs.[4] and [5]),

\begin{align} {{CV}}^2=\frac{1}{N_t} \sum_{s=1}^{N_t} \, (E_s - \hat{E}_{(s)})^2\:. \end{align}

Here, $\hat{E}_{(s)}$ is the predicted value for $E_s$, which is obtained from the ECIs that optimize the equation, excluding the data point $s$.

2. Set up the cluster expansion

If the file CELL.source.tar.gz has been correctly unpacked, in the current directory you will find the SiGe folder. Move inside this folder which will be your working directory.

$ cd SiGe

Inside the directory SiGe, you find the following files:

filename description Configuration file containing global variables used by CELL. If an parameter is not given inside this file, it uses the default value. Description of the parent lattice. Lattice of the supercell.
exciting.wrap.xml File containing the parameters needed by exciting (e.g., rgkmax, ngridk, etc.).
Si.xml & Ge.xml Species files of the two elements Si and Ge.

First, we have a look at the file:

# ab-initio
CALCULATOR = "exciting"
EXCITINGBIN = "/home/tutorials/exciting/bin"

# crystal

CELL_PAR_0 = [[0.0000, 5.4310, 5.4310],
              [5.4310, 0.0000, 5.4310],
              [5.4310, 5.4310, 0.0000]]

CELL_PAR_1 = [[0.0000, 5.6580, 5.6580],
              [5.6580, 0.0000, 5.6580],
              [5.6580, 5.6580, 0.0000]]

Here, you can find the definitions of both variables used for describing the ab-initio calculations (i.e., CALCULATOR, EXCITINGBIN, and ENERGY_UNITS) and for characterizing the crystal lattice (i.e., PAR_LAT_FILE, MET_LAT_FILE, VEGARDS_LAW, etc.).

Furthermore, the parent lattice is decribed in

0.00 5.45 5.45
5.45 0.00 5.45
5.45 5.45 0.00
0.5 0.0 0.0
0.0 0.5 0.0
0.0 0.0 0.5
0.00000000 0.00000000 0.00000000 Si,Ge
0.12500000 0.12500000 0.12500000 Si,Ge

The file above consists of three parts:

  1. A matrix V formed by the first three lines,
  2. a matrix D formed by the next three lines and
  3. the atomic coordinates P.

The cartesian vectors of the primitive cell are the rows of the matrix L = DV, while the real space coordinates of the atoms are the rows of the matrix R = PV. This is convenient because lattice files for supercells can be easily written by defining the supercell with the matrix D, while the reference system for the atomic positions, (V), remains the same.

In this tutorial, we perform a cluster expansion of a 2$\times$2$\times$2 supercell with, in total, 16 substitutional sites (see the file below).

0.00 5.45 5.45
5.45 0.00 5.45
5.45 5.45 0.00
1.0 0.0 0.0
0.0 1.0 0.0
0.0 0.0 1.0
0.00000000 0.00000000 0.00000000 Si,Ge
0.50000000 0.00000000 0.00000000 Si,Ge
0.00000000 0.50000000 0.00000000 Si,Ge
0.00000000 0.00000000 0.50000000 Si,Ge
0.12500000 0.12500000 0.12500000 Si,Ge
0.12500000 0.12500000 0.62500000 Si,Ge
0.12500000 0.62500000 0.12500000 Si,Ge
0.62500000 0.12500000 0.12500000 Si,Ge
0.12500000 0.62500000 0.62500000 Si,Ge
0.62500000 0.12500000 0.62500000 Si,Ge
0.62500000 0.62500000 0.12500000 Si,Ge
0.62500000 0.62500000 0.62500000 Si,Ge
0.00000000 0.50000000 0.50000000 Si,Ge
0.50000000 0.00000000 0.50000000 Si,Ge
0.50000000 0.50000000 0.00000000 Si,Ge
0.50000000 0.50000000 0.50000000 Si,Ge

In addition to the previous files, there are folders labeled with numbers (1, 2, …, 14). Each folder contains a structure randomly generated for each composition $x=y/16$ in the range $0 \leq y \leq 16$, except for the values $y$ = 4, 8, and 12. You can find in each folder the following files:

filename description
str.out Structure with a specific configuration of Si and Ge atoms in the lattice.
input.xml Input file for exciting.
input.xsf Structure file to visualize the structure in xcrysden.
energy File containing the energy obtained by exciting calculations.

For the values $y$ = 4, 8, and 12, you can add a random structure by yourself. To do so, you create the structures and perform their ab-initio calculation with the following commands:

$ cell gen_rnd_str 4,8,12

This creates the folders 15, 16, and 17 containing the random structures of the values 4, 8, and 12, respectively, in the files str.out. In every folder the files input.xml and input.xsf are created, too. Now, we will calculate the energies with exciting by executing:

$ cell calc_ab_initio

This takes around 5min. In the meanwhile, you can open another terminal and go inside the working directory again. You can take a look at certain random structures with xcrsyden. You can visualize then any of the input.xsf files in the enumerated folders.

When the above calculation is completed, we will have a structure with an ab-initio energy at every composition. Thus, you can start to build a CE as explained in thext section.

3. Cluster expansion: First iteration

i) Finding the optimal cluster expansion

Firstly, collect the ab-initio energies of the structures stored in each folder into a file called allenergy.out using the following command.

$ cell write_ab_initio

After this, we extend the with the following lines:

# cluster expansion
ENERGY_CE = "mixingPerSite"

Here, the variable ENERGY_CE defines the energy in which the CE is built. In this case, the energy of mixing is chosen. The energy of mixing is defined as the energy difference between the energy $E(x)$ and an linear interpolation between the pristine structures $E(y=0)$ (only Si atoms) and $E(y=16)$ (only Ge atoms). Thus, the formula is as follows:

\begin{align} e_{\rm mix} (y) = \frac{1}{16} \left\lbrace E(y) - \left[ \hat{E}(y=0) + \left( \hat{E}(y=16) - \hat{E}(y=0) \right) \frac{y}{16} \right] \right\rbrace \end{align}

The two remaining variables MAX_RADIUS and MAX_POINTS define the set of clusters considered for the cluster optimization discussed below. MAX_RADIUS specifies the maximum radius (i.e., the maximum distance between two points in the clusters), and MAX_POINTS the largest number of points of the cluster. The next command will create a pool of clusters up to this radius and number of points:

$ cell gen_clusters

Now, we can calculate the correlation matrix X:

$ cell gen_correlations

After running these two commands, the files clusters.out, containing the clusters, and allcorr.out ,containing the correlations, are created in the working directory. Now, we perform a cluster optimization. The following command creates different sets of clusters taken from the pool.

$ cell optimize_clusters
index   CV-LOO     CV-5%      CV-10%     CV-15%     MSE        Npts_max  Rcl_max    Nr of clusters
1       0.0003887  0.0003817  0.0003806  0.0003741  0.0002773  2         2.3599192  3
2       0.0001349  0.0001227  0.0001336  0.0001365  0.0000821  2         3.8537319  4
3       0.0001650  0.0001469  0.0001310  0.0001727  0.0000594  2         4.5189012  5
4       0.0001295  0.0001339  0.0001270  0.0001345  0.0000597  3         3.8537319  7
5       0.0000760  0.0000834  0.0001099  0.0001459  0.0000354  3         4.5189012  11
6       0.0002019  0.0001607  0.0002017  0.0002759  0.0000594  4         3.8537319  10
7       0.0045847  0.0044954  0.0033676  0.0016914  0.0000001  4         4.5189012  22


Optimal CV found for set 5
CV = 0.000076097860,   MSE = 0.000035480505,   #of clusters: 11

Optimal ECIs written to file energy.eci


This script chooses automatically the optimal CE by selecting the set of clusters with the smallest CV and writing the corresponding ECIs to the file energy.eci. In addition, we can plot the MSE and CV versus the number of clusters used for the CE.

$ cell plot_optimize_clusters

The following plot is generated:
ii) Ground-state search

Once the optimal CE is determined, we can search for ground-state configurations at arbitrary concentrations (it could be also possible to use another supercell by redifining the file). The ground-state search performs a Metropolis sampling of the configuration space at a given concentration and returns the structure with the lowest energy. If the ground-state configuration with its ab-initio calculation exists already, the ground-state search returns the next lowest-non-degenerate (LND) structure.

Before starting the search, we have to add the following lines to the

# Ground-state search
TEMP = 1000 # temperature [K] for the Metropolis sampling
MAX_N_CONF = 300 # number of configuratons to sample after the last LND structure was found

Here, we define the temperature TEMP used for the Metropolis sampling and the number of configurations MAX_N_CONF explored after the last LND is found.

Now, we can submit the ground-state search with the command:

$ cell gs_search 4,8,12 t

Sampling configurational space for 4 substitutional atoms.
Found lowest non-degenerate structure for nsub = 4 substitutions.
Acceptance ratio of Metropolis search:  0.853862212944 

FOLDER | Predicted    |  Ab-initio
New LND  |  -0.001939122051
15  |  -0.001672768784  |  -0.00166373171885
Created new str.out file in folder  ./18

Sampling configurational space for 8 substitutional atoms.
Found lowest non-degenerate structure for nsub = 8 substitutions.
Acceptance ratio of Metropolis search:  0.751851851852 

FOLDER | Predicted    |  Ab-initio
New LND  |  -0.002583396044
16  |  -0.00240810368  |  -0.00239358781255
Created new str.out file in folder  ./19

Sampling configurational space for 12 substitutional atoms.
Found lowest non-degenerate structure for nsub = 12 substitutions.
Acceptance ratio of Metropolis search:  0.715053763441 

FOLDER | Predicted    |  Ab-initio
New LND  |  -0.001940169736
17  |  -0.001696331639  |  -0.00166380015639
Created new str.out file in folder  ./20

The parameter t at the end is optional. If t is used, the LND structures found by the search are written in folders labeled again with numbers. After the ground-state search has finished, three new folders are created in your working directory, each containing a new LND structure. At this point, we can plot the ground-state search (sampling stored in metropolis.log) together with the initial set of structures.

$ cell plot_gs_search
You should obtain a result similar to what is shown below:

The blue circles indicate the ab-initio energy, whereas the blue dots indicate their predictions from the CE. The predicted energies of structures visited by the Metropolis sampling are indicated with red diamonds.

4. Cluster expansion: Next iterations

First, store the first iteration with the command:

$ cell store

You can improve your cluster expansion by adding the ab-initio calculations of the structures in folders 18 to 20 found by the ground-state search.

$ cell calc_ab_initio

Now, there are three additional ab-initio calculations, which can be added to build another CE with a higher predictive power (lower CV). The more ab-initio calculations are included to build the CE, the better the predictions of the CE are expected to be. In order to build the CE based on the extended ab-initio data, you can follow the optimization procedure explained in Section 3, with the following commands:

$ cell write_ab_initio
$ cell gen_clusters
$ cell gen_correlations
$ cell optimize_clusters

Once this improved CE is performed, we can proceed to do a new ground-state search:

$ cell gs_search 3,6,10,13 t

This new search gives you new LND structures (iteration 2). Following the iterative procedure, the ab-initio calculations of the structures from iteration 2 can be added and after a new optimization of the cluster expansion, the ground-state search can be started as follows:

$ cell gs_search t
This could lead to a plot similar to this:

The iterative procedure of adding new ab-initio data, optimizing the cluster expansion and performing a ground-state search can be continued until the CV is low enough to predict the ground-state configuration with a high accuracy.


  • Perform a new cluster expansion with finer parameters for the ab-initio calculation.
  • To do so, create a new folder and copy the following files from the working directory SiGe inside it:,,, exciting.wrap.xml, Si.xml, and Ge.xml
  • To change the parameters of the ab-initio calculations, go inside exciting.wrap.xml. For instance, you can increase the value of rgkmax or use finer k-grid. Be aware that more accurate calculations are also more computationally demanding.
  • Start with the generation of random structures and follow the procedure explained above.
  • Another suggestion is to change the size of the supercell, e.g., you could study a 4$\times$4$\times$4 supercell.

1. Generalized cluster description of multicomponent systems. J. M. Sanchez, F. Ducastelle, and D. Gratias, Physica A 128, 334 (1984)
2. G. D. Garbulksy and G. Ceder, Phys. Rev. B 51, 67 (1995).
3. B. Hülsen, M. Scheffler, and P. Kratzer, Phys. Rev. B 79, 09447 (2009).
4. Cross-validation Choice and Assessment of Statistical Predictions. M. Stone, J. Royal Statistical Society B (Methodological) 36, 111 (1974).
5. Automating First-Principles Phase Diagram Calculations. A . van de Walle and G. Ceder, J. Phase Equilibria 23, 348 (2002).
Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License