by Maria Troppenz & Santiago Rigamonti for exciting carbon
Purpose: In this tutorial, you will perform a cluster expansion of the total energy for the alloy Si_{1x}Ge_{x} using the package CELL in combination with exciting. The iterative procedure of CELL is presented, which allows for building a Cluster Expansion for large parent ceLLs. This is possible thanks to efficient samplings of the configurational space.
0. Preparations
Read the following paragraphs before starting with the rest of this tutorial!
The first thing you should do is to create a directory where you should copy the source code of the CELL tool. From now on the symbol $ will indicate the shell prompt.
$ mkdir /home/tutorials/CELL
Then download the source code of CELL using the next link.
Copy the CELL.source.tar.gz file in the /home/tutorials/CELL directory and move inside it. Here, unpack the source code.
$ cd /home/tutorials/CELL
$ tar xvf CELL.source.tar.gz
General requirements: Bash shell. Python numpy, lxml, and sys libraries.
1. Theoretical background
Substitutional alloys, such as, e.g., Si_{1x}Ge_{x} , have a large configurational space at every concentration $0 < x < 1$. This makes almost impossible to perform abinitio calculations for every configuration. The clusterexpansion (CE) technique allows for building numerically efficient models to predict the energy $E$ of the alloy by exploiting the unique dependence of $E$ on the configuration (see Ref.[1]). It relies on highly accurate abinitio calculations obtained by densityfunctional theory and has been applied to describe stable phases of bulk and surface alloys (see, for instance, Refs.[2] and [3]).
For an arbitrary configuration $s$ of the alloy, the energy can be parametrized in terms of clusters $\alpha=\{i,j,...\}$ (a set of crystal sites $i$):
(1)Here, $\hat{E}_s$ represents the prediction of the model for the energy $E_s$ of structure $s$. The coefficients $J_{\alpha}$ are the socalled effectivecluster interactions (ECIs). The sum runs over a set of $N_c$ clusters $\alpha$, each cluster built from up to $n$ sites (empty cluster, 1point clusters, …, $n$point clusters). The multiplicity $m_{\alpha}$ represents the number of distinct clusters obtained by applying all symmetry operations of the pristine (i.e. nonsubstituted) crystal to $\alpha$. The value $X_{s \alpha}$ represents the correlation of the cluster $\alpha$ to the configuration $s$ ($X_{s \alpha} \in [1,+1]$), and is calculated by
(2)The correlation averages the product of the occupation variables $\sigma_{si}$'s over the clusters $\beta$ equivalent to $\alpha$. The occupation variable $\sigma_{si}$ are assigned to each site $i$ in structure $s$. For a binary alloy consisting of the two elements A and B, the variable $\sigma_{si}$ takes the value 1 if the site $i$ is occupied by atom of element A, or +1 if it is occupied by atom of element B (in case of Si_{1x}Ge_{x} , A=Si and B=Ge).
The coefficients $J_{\alpha}$ of the CE are obtained by minimizing the objective function.
(3)The first term on the right hand side is the mean squared error (MSE) of the predicted values $\hat{E}_{s}$. The sum runs over all $N_t$ structures whose energies $E_s$ are known from abinitio calculations and used to build the CE model. The $N_t$ structures are the socalled training set. The second term favors shrinkage of the ECIs and, additionally, this term regularizes the problem of finding the best fit, i.e., to find an optimal set of ECIs even if the number of clusters $N_c$ is larger than $N_t$. Both the optimal penalization strength $A$ and the optimal set of clusters are obtained by minimizing the the crossvalidation score $CV$ (see Refs.[4] and [5]),
(4)Here, $\hat{E}_{(s)}$ is the predicted value for $E_s$, which is obtained from the ECIs that optimize the equation, excluding the data point $s$.
2. Set up the cluster expansion
If the file CELL.source.tar.gz has been correctly unpacked, in the current directory you will find the SiGe folder. Move inside this folder which will be your working directory.
$ cd SiGe
Inside the directory SiGe, you find the following files:

First, we have a look at the config.py file:
# abinitio
CALCULATOR = "exciting"
EXCITINGBIN = "/home/tutorials/exciting/bin"
ENERGY_UNITS = "Ha"
# crystal
PAR_LAT_FILE = "lat.in"
MET_LAT_FILE = "lat.in.met"
VEGARDS_LAW = True
CELL_PAR_0 = [[0.0000, 5.4310, 5.4310],
[5.4310, 0.0000, 5.4310],
[5.4310, 5.4310, 0.0000]]
CELL_PAR_1 = [[0.0000, 5.6580, 5.6580],
[5.6580, 0.0000, 5.6580],
[5.6580, 5.6580, 0.0000]]
Here, you can find the definitions of both variables used for describing the abinitio calculations (i.e., CALCULATOR, EXCITINGBIN, and ENERGY_UNITS) and for characterizing the crystal lattice (i.e., PAR_LAT_FILE, MET_LAT_FILE, VEGARDS_LAW, etc.).
Furthermore, the parent lattice is decribed in lat.in:
0.00 5.45 5.45
5.45 0.00 5.45
5.45 5.45 0.00
0.5 0.0 0.0
0.0 0.5 0.0
0.0 0.0 0.5
0.00000000 0.00000000 0.00000000 Si,Ge
0.12500000 0.12500000 0.12500000 Si,Ge
The file above consists of three parts:
 A matrix V formed by the first three lines,
 a matrix D formed by the next three lines and
 the atomic coordinates P.
The cartesian vectors of the primitive cell are the rows of the matrix L = DV, while the real space coordinates of the atoms are the rows of the matrix R = PV. This is convenient because lattice files for supercells can be easily written by defining the supercell with the matrix D, while the reference system for the atomic positions, (V), remains the same.
In this tutorial, we perform a cluster expansion of a 2$\times$2$\times$2 supercell with, in total, 16 substitutional sites (see the lat.in.met file below).
0.00 5.45 5.45
5.45 0.00 5.45
5.45 5.45 0.00
1.0 0.0 0.0
0.0 1.0 0.0
0.0 0.0 1.0
0.00000000 0.00000000 0.00000000 Si,Ge
0.50000000 0.00000000 0.00000000 Si,Ge
0.00000000 0.50000000 0.00000000 Si,Ge
0.00000000 0.00000000 0.50000000 Si,Ge
0.12500000 0.12500000 0.12500000 Si,Ge
0.12500000 0.12500000 0.62500000 Si,Ge
0.12500000 0.62500000 0.12500000 Si,Ge
0.62500000 0.12500000 0.12500000 Si,Ge
0.12500000 0.62500000 0.62500000 Si,Ge
0.62500000 0.12500000 0.62500000 Si,Ge
0.62500000 0.62500000 0.12500000 Si,Ge
0.62500000 0.62500000 0.62500000 Si,Ge
0.00000000 0.50000000 0.50000000 Si,Ge
0.50000000 0.00000000 0.50000000 Si,Ge
0.50000000 0.50000000 0.00000000 Si,Ge
0.50000000 0.50000000 0.50000000 Si,Ge
In addition to the previous files, there are folders labeled with numbers (1, 2, …, 14). Each folder contains a structure randomly generated for each composition $x=y/16$ in the range $0 \leq y \leq 16$, except for the values $y$ = 4, 8, and 12. You can find in each folder the following files:

For the values $y$ = 4, 8, and 12, you can add a random structure by yourself. To do so, you create the structures and perform their abinitio calculation with the following commands:
$ cell gen_rnd_str 4,8,12
This creates the folders 15, 16, and 17 containing the random structures of the values 4, 8, and 12, respectively, in the files str.out. In every folder the files input.xml and input.xsf are created, too. Now, we will calculate the energies with exciting by executing:
$ cell calc_ab_initio
This takes around 5min. In the meanwhile, you can open another terminal and go inside the working directory again. You can take a look at certain random structures with xcrsyden. You can visualize then any of the input.xsf files in the enumerated folders.
When the above calculation is completed, we will have a structure with an abinitio energy at every composition. Thus, you can start to build a CE as explained in thext section.
3. Cluster expansion: First iteration
i) Finding the optimal cluster expansion
Firstly, collect the abinitio energies of the structures stored in each folder into a file called allenergy.out using the following command.
$ cell write_ab_initio
After this, we extend the config.py with the following lines:
# cluster expansion
ENERGY_CE = "mixingPerSite"
MAX_RADIUS = 5.0
MAX_POINTS = 4
Here, the variable ENERGY_CE defines the energy in which the CE is built. In this case, the energy of mixing is chosen. The energy of mixing is defined as the energy difference between the energy $E(x)$ and an linear interpolation between the pristine structures $E(y=0)$ (only Si atoms) and $E(y=16)$ (only Ge atoms). Thus, the formula is as follows:
(5)The two remaining variables MAX_RADIUS and MAX_POINTS define the set of clusters considered for the cluster optimization discussed below. MAX_RADIUS specifies the maximum radius (i.e., the maximum distance between two points in the clusters), and MAX_POINTS the largest number of points of the cluster. The next command will create a pool of clusters up to this radius and number of points:
$ cell gen_clusters
Now, we can calculate the correlation matrix X:
$ cell gen_correlations
After running these two commands, the files clusters.out, containing the clusters, and allcorr.out ,containing the correlations, are created in the working directory. Now, we perform a cluster optimization. The following command creates different sets of clusters taken from the pool.
$ cell optimize_clusters
index CVLOO CV5% CV10% CV15% MSE Npts_max Rcl_max Nr of clusters
1 0.0003887 0.0003817 0.0003806 0.0003741 0.0002773 2 2.3599192 3
2 0.0001349 0.0001227 0.0001336 0.0001365 0.0000821 2 3.8537319 4
3 0.0001650 0.0001469 0.0001310 0.0001727 0.0000594 2 4.5189012 5
4 0.0001295 0.0001339 0.0001270 0.0001345 0.0000597 3 3.8537319 7
5 0.0000760 0.0000834 0.0001099 0.0001459 0.0000354 3 4.5189012 11
6 0.0002019 0.0001607 0.0002017 0.0002759 0.0000594 4 3.8537319 10
7 0.0045847 0.0044954 0.0033676 0.0016914 0.0000001 4 4.5189012 22
#######################################################
Optimal CV found for set 5
CV = 0.000076097860, MSE = 0.000035480505, #of clusters: 11
Optimal ECIs written to file energy.eci
#######################################################
This script chooses automatically the optimal CE by selecting the set of clusters with the smallest CV and writing the corresponding ECIs to the file energy.eci. In addition, we can plot the MSE and CV versus the number of clusters used for the CE.
$ cell plot_optimize_clusters
The following plot is generated:
ii) Groundstate search
Once the optimal CE is determined, we can search for groundstate configurations at arbitrary concentrations (it could be also possible to use another supercell by redifining the lat.in.met file). The groundstate search performs a Metropolis sampling of the configuration space at a given concentration and returns the structure with the lowest energy. If the groundstate configuration with its abinitio calculation exists already, the groundstate search returns the next lowestnondegenerate (LND) structure.
Before starting the search, we have to add the following lines to the config.py:
# Groundstate search
TEMP = 1000 # temperature [K] for the Metropolis sampling
MAX_N_CONF = 300 # number of configuratons to sample after the last LND structure was found
Here, we define the temperature TEMP used for the Metropolis sampling and the number of configurations MAX_N_CONF explored after the last LND is found.
Now, we can submit the groundstate search with the command:
$ cell gs_search 4,8,12 t
=================================================================================
Sampling configurational space for 4 substitutional atoms.
=================================================================================
Found lowest nondegenerate structure for nsub = 4 substitutions.
Acceptance ratio of Metropolis search: 0.853862212944
FOLDER  Predicted  Abinitio
New LND  0.001939122051
15  0.001672768784  0.00166373171885
Created new str.out file in folder ./18
=================================================================================
=================================================================================
Sampling configurational space for 8 substitutional atoms.
=================================================================================
Found lowest nondegenerate structure for nsub = 8 substitutions.
Acceptance ratio of Metropolis search: 0.751851851852
FOLDER  Predicted  Abinitio
New LND  0.002583396044
16  0.00240810368  0.00239358781255
Created new str.out file in folder ./19
=================================================================================
=================================================================================
Sampling configurational space for 12 substitutional atoms.
=================================================================================
Found lowest nondegenerate structure for nsub = 12 substitutions.
Acceptance ratio of Metropolis search: 0.715053763441
FOLDER  Predicted  Abinitio
New LND  0.001940169736
17  0.001696331639  0.00166380015639
Created new str.out file in folder ./20
=================================================================================
The parameter t at the end is optional. If t is used, the LND structures found by the search are written in folders labeled again with numbers. After the groundstate search has finished, three new folders are created in your working directory, each containing a new LND structure. At this point, we can plot the groundstate search (sampling stored in metropolis.log) together with the initial set of structures.
$ cell plot_gs_search
The blue circles indicate the abinitio energy, whereas the blue dots indicate their predictions from the CE. The predicted energies of structures visited by the Metropolis sampling are indicated with red diamonds.
4. Cluster expansion: Next iterations
First, store the first iteration with the command:
$ cell store
You can improve your cluster expansion by adding the abinitio calculations of the structures in folders 18 to 20 found by the groundstate search.
$ cell calc_ab_initio
Now, there are three additional abinitio calculations, which can be added to build another CE with a higher predictive power (lower CV). The more abinitio calculations are included to build the CE, the better the predictions of the CE are expected to be. In order to build the CE based on the extended abinitio data, you can follow the optimization procedure explained in Section 3, with the following commands:
$ cell write_ab_initio
$ cell gen_clusters
$ cell gen_correlations
$ cell optimize_clusters
Once this improved CE is performed, we can proceed to do a new groundstate search:
$ cell gs_search 3,6,10,13 t
This new search gives you new LND structures (iteration 2). Following the iterative procedure, the abinitio calculations of the structures from iteration 2 can be added and after a new optimization of the cluster expansion, the groundstate search can be started as follows:
$ cell gs_search t
The iterative procedure of adding new abinitio data, optimizing the cluster expansion and performing a groundstate search can be continued until the CV is low enough to predict the groundstate configuration with a high accuracy.
Exercise
 Perform a new cluster expansion with finer parameters for the abinitio calculation.
 To do so, create a new folder and copy the following files from the working directory SiGe inside it: config.py, lat.in, lat.in.met, exciting.wrap.xml, Si.xml, and Ge.xml
 To change the parameters of the abinitio calculations, go inside exciting.wrap.xml. For instance, you can increase the value of rgkmax or use finer kgrid. Be aware that more accurate calculations are also more computationally demanding.
 Start with the generation of random structures and follow the procedure explained above.
 Another suggestion is to change the size of the supercell, e.g., you could study a 4$\times$4$\times$4 supercell.