CELL - A python package for cluster expansion with a focus on complex alloys

by Maria Troppenz, Martin Kuban, & Santiago Rigamonti for exciting nitrogen

Purpose: In this tutorial, you will learn the basics of the cluster expansion package CELL. You will start by creating parent lattices of bulk and surface systems, next you will learn how to set up supercells and structures sets used for training cluster expansion models. For these systems, you will create pools of clusters and study the cluster correlations for particular cases. Finally you will learn different methods to find the optimal model out of a large pool of clusters.

0. Preparations

Read the following paragraphs before starting with the rest of this tutorial!

First, you need to download the CELL source code to your local machine. Open a terminal and move to the directory you used for this workshop:

$cd /home/tutorials  You can get the zip file with the source code using: $ cp /users/sol/exciting64/CELL_tutorial/CELL_tutorial.zip .


Please notice: This is a development version of CELL. You are not allowed to copy the code from the local computers. If you want to use the code for a different purpose, please contact us.

To access the code, you have to extract it:

$unzip CELL_tutorial.zip  Now, move to the new folder: $ cd CELL_tutorial


To install CELL on your computer, first move to the clusterx folder:

$cd clusterx  To install it locally, please type: $ pip3 install -e .


After the installation is finished, you can leave the folder.
$cd ..  Now, all Python packages that are required to run CELL are installed. Before starting, we need to make a small modification to the .bashrc file. You already edited this file in this tutorial. Add the following line to your ~/.bashrc: export PATH=~/.local/bin:$PATH


After editing, do not forget to source the ~/.bashrc:
$source ~/.bashrc  To view the jupyter notebooks, which will allow you to interactively visualize the usage of the CELL code, you must type in the terminal: $ jupyter-nbextension enable nglview --py --user


Now you should be able to continue with the tutorial. You can read the full documentation by typing:
$chromium /users/sol/exciting64/CELL_tutorial/html/index.html &  Before we start: You can perform all the tutorials in a Python 3 shell (e.g. IPython3), or you use the corresponding Jupyter-notebook files, which you find in the tutorial folder. We strongly recommend using Jupyter, you can find a short introduction on how to use Jupyter below. All contents of the web tutorial are also in the Jupyter notebooks. Before starting, make sure you are in the correct folder: /home/tutorials/CELL_tutorial. 1. Tutorial 1 1.1 Building parent lattices Here you will learn how to set up and visualize a parent lattice, which is the most basic object in CELL. We will consider two examples: a bulk fcc crystal, and a surface system with adsorbed atoms and surface alloying. We start with the example of a bulk binary fcc metal alloy: from ase.build import bulk from clusterx.parent_lattice import ParentLattice pri1 = bulk('Cu', 'fcc') sub1 = bulk('Al', 'fcc') platt1 = ParentLattice(pri1, substitutions=[sub1])  In the first line above, we import the bulk module of the Atomic Simulation Environment (ASE). In the second line, the ParentLattice class of CELL is loaded. In the next two lines, using the bulk function, we define the structures for the pristine Cu non-substituted lattice pri1 and the fully substituted Al (fcc) lattice sub1. These two structures are then used to initialize the ParentLattice object (which we call platt1) in the last line. Next, we would like to visualize the just created parent lattice. To this end, we use the juview function of the visualization module of CELL: from clusterx.visualization import juview juview(platt1)  The figure on the left, corresponds to the pristine non-substituted Cu fcc crystal, while the figure on the right represents the fully Al-substituted crystal. In general, the line of code juview(parent_lattice), will generate as many additional figures as possible substituents are present in the parent lattice, as you will see for the next example of a surface system. Now, we will setup the parent lattice for a surface system. It consists on a fcc(111) Al surface, with possible Na substitution on the uppermost Al layer and adsorption of oxygen atoms in "on-top" configuration. In order to build the parent lattice for such a system, we execute the following code: from ase.build import fcc111, add_adsorbate pri2 = fcc111('Al', size=(1,1,3)) add_adsorbate(pri2,'X',1.7,'ontop') pri2.center(vacuum=10.0, axis=2)  In the code above, first we load some builder utilities from ASE (fcc111 and add_adsorbate). In the next three lines, we i) create an (111)-terminated fcc Al slab with three atomic layers ii) add a vacancy (symbol X) site with "on top" configuration, and iii) add vaccuum on the sides of the slab along the$z$-direction. In this way we have defined the pristine structure pri2. Now we would like to set up the substitutions: Na on the top-most Al layer and oxygen on the "on-top" vacancy sites. To proceed, we first need some information from the pristine structure, as shown below: for i, (symbol, z_coord) in enumerate(zip(pri2.get_chemical_symbols(),pri2.get_positions()[:,2])): print("atom index: ",i,"| Chemical symbol: ",symbol,"| z coordinate: ",z_coord)  From this output, we see that the uppermost "Al" layer has atom index 2, and that the adsorbate layer has atom index 3. With this information, we initialize the parent lattice object in an alternative way, by telling which species can occupy every atom index: This is done with the site_symbols argument, which allows us to tell CELL which atomic species can occupy every atomic site: platt2 = ParentLattice(pri2, site_symbols=[['Al'],['Al'],['Al','Na'],['X','O']]) juview(platt2)  In this way, we see that for atom indices 1 and 2 only ['Al'] is allowed, while atom index 2 admits the species in the array ['Al','Na'] and atom index 3 admits species ['X','O'], where 'X' denotes a vacancy. From left to right, the figures above denote: pristine non-substituted lattice (vacancy sites indicated with white color), on-top vacancy site substituted by oxygen (red), and top-most Al layer substituted by Na (purple). 1.2 Building sturcture sets In order to generate ab initio data to be used as input to train a cluster expansion model, we need to perform calculations on super cells of the parent lattice. In CELL, super cells are represented by objects of the class SuperCell. We will take the example of the surface system above, and create a$4\times4\times1$super cell object: from clusterx.super_cell import SuperCell import numpy as np scell2 = SuperCell(platt2,np.diag([4,4,1])) juview(scell2)  As you can see, a super cell looks very much like an enlarged parent lattice, indeed, objects of the SuperCell class inherit from the ParentLattice class and share many properties. Now, using the created super cell, we will generate a few random decorations of it at different concentrations. The generated structures will be collected in a StructuresSet object, that will be later used as training set for a cluster expansion. Before doing so, however, we need some information from the just created SuperCell object that we will need to define the concentrations of Na substituents and vacancies in the generation of random structures. This information is contained in the sites dictionary of the super cell, which we can access with: print(scell2.get_idx_subs())  This tells us that the supercell has three types of sites, or sublattices, with indexes$0$,$1$and$2$. Sites of type$0$, contain species number 0 (vacancy) and can be substituted by species number 8 (oxygen) , sites of type$1$contain species number 13 (Al) and can be substituted with species number 11 (Na); while sites of type$2$contain species number 13 (Al) and can not be substituted. In the code shown below, we first load the StructuresSet class and then create a structures-set object that we call sset2. Next, in three different for loops, by using the scell2's gen_random() method, we create i) two random structures with 4 on-top oxygen atoms, ii) two random structures with 4 Al$\rightarrow$Na substitutions, and iii) 2 structures with 4 oxygen atoms and 4 Al$\rightarrow$Na substitutions: from clusterx.structures_set import StructuresSet sset2 = StructuresSet(platt2) nstruc = 2 # i) Random structures with 4 on-top oxygen atoms for i in range(nstruc): sset2.add_structure(scell2.gen_random({0:[4]})) # ii) Random structures with 4 substituent Na atoms for i in range(nstruc): sset2.add_structure(scell2.gen_random({1:[4]})) # iii) Random structures with 4 on-top oxygen and 4 substituent Na atoms for i in range(nstruc): sset2.add_structure(scell2.gen_random({0:[2],1:[4]})) juview(sset2)  In the figure above, red spheres represent oxygen atoms and purple spheres represent Na substitutions. Exercise 1 Build the parent lattice for a two-dimensional square lattice of a binary (e.g. SiGe) material and create (and visualize) 6 random structures on a$5\times5$super cell. As help, you can use the following Atoms object to initialze the ParentLattice object: from ase import Atoms a=3.0 pri4 = Atoms(positions=[[0,0,0]],symbols=['Si'],cell=[[a,0,0],[0,a,0],[0,0,2*a]],pbc=(1,1,0))  Excercise 2 Generate and visualize a few random structures for the fcc CuAl alloy of the first example on this tutorial. Do it so in a$3\times3\times3$super cell. 2. Tutorial 2 2.1 Building a pool of clusters Now that you know how to build parent lattices and structures sets, the next task is to create a pool of clusters. The clusters pool defines the basis set for the expansion of configuration-dependent properties in terms of cluster functions. Although this basis is infinite (consisting of all symetrically distinct atomic configurations of the substituent species), in practical applications one must cut off the basis. In order to fix ideas, we will start with a very simple model (corresponding to the solution of Excercise 1 of Tutorial 1) of a binary two-dimensional lattice. Afterwards, you will tackle the more complicated surface system shown in Tutorial 1 by solving Excercise 4 of this tutorial. To start, we must define the parent lattice: from ase import Atoms from clusterx.parent_lattice import ParentLattice from clusterx.visualization import juview a=4.0 pri = Atoms(positions=[[0,0,0]],symbols=['Si'],cell=[[a,0,0],[0,a,0],[0,0,2*a]],pbc=(1,1,0)) plat = ParentLattice(pri,site_symbols=[['Si','Ge']]) juview(plat)  Now, we will create all possible clusters of up to four points and radius up to$\sqrt{2}\,a$. This is done with the following piece of code: from clusterx.clusters.clusters_pool import ClustersPool r = 1.5 cpool = ClustersPool(plat, npoints=[0,1,2,3,4], radii=[0,0,a*r,a*r,a*r])  In the first line, we load the ClustersPool class. Then, we create an object of this class (cpool). To initialize it, we use the parameters npoints and radii. npoints indicates the number of points of the clusters in the pool, and the parameter radii indicates the maximum radius corresponding to each of the number of points indicated in npoints. In this way, we have created a clusters pool containing the empty cluster, all the 1-point clusters (in this case there will be only one of this kind), and all the 2-, 3- and 4-point clusters up to radius$1.5\times a$. The number of clusters just created is: print("Number of clusters: ", len(cpool))  These can be visualized by first creating a clusters database: cpool.write_clusters_db()  Now, we have at least two ways to visualize the pool of clusters: i) we can plot a few of them (e.g. n=6) on this notebook with juview by invoking the get_cpool_atoms() method of cpool: juview(cpool.get_cpool_atoms(),n=6)  ii) by using the graphical user interface (gui) of ASE on a terminal. To use this second option, which is the recommended way to proceed when the clusters pool is large, first note that when creating the clusters database above with cpool.write_clusters_db(), a file called cpool.json was created in the same folder (let's call it$CWD) where this tutorial is located. This file has the proper format to be visualized with ASE's gui. Now, open a terminal and move to this folder with $>cd$CWD (the $> denotes the command prompt) and execute$>ase gui cpool.json. A number of windows will open. The relevant ones for the visualization of the clusters are shown in the screen capture below:

You can visualize all the clusters by clicking on the Back and Forward buttons of the Movie window. Let's see how this works for a larger pool:

r = 2.5

cpool.write_clusters_db()
print(len(cpool), " clusters were created.")


Now, visualize these 34 clusters with ASE's gui as explained above.

2.2 Cluster orbits

The clusters obtained above are all symmetrically inequivalent. Moreover, each of them is a representative of an infinite set of symmetrically equivalent clusters. Every infinite set is called a cluster orbit. We can calculate a cluster orbit for a supercell and visualize it with ASE's gui. Let's do so for one cluster of the clusters pool cpool:

cluster_index = 6 # Select a cluster from the pool

# Obtain the orbit for this cluster
cluster_orbit, cluster_multiplicity = cpool.get_cluster_orbit(cluster_index = cluster_index)

print("There are ",len(cluster_orbit),"symmetrically equivalent representations of cluster ", cluster_index," in the supercell.")
print("The cluster multiplicity is ",cluster_multiplicity,".")


The cluster multiplicity is the number of symetrically equivalent realizations of the cluster, under the symmetry operations of the parent cell, without counting the internal translations of the parent cell inside the supercell.

cpool.write_clusters_db(orbit=cluster_orbit,db_name="cluster_orbit.json")


Now, you may visualize the cluster orbit by executing the command $>ase gui cluster_orbit.json in a terminal. 2.3 Cluster correlations In this section we will illustrate the calculation of the cluster correlations for the simple binary case studied here. To simplify the analysis, we will use a basis in which the site occupations are represented by a variable$\sigma_{s,i}$, which is equal to$1$if the crystal site$i$of structure$s$is ocupied with the substitutional species (Ge in this example), or zero otherwise (Si in this example). The cluster correlations$X_{s\alpha}$for a binary compound are given by the equation:$X_{s\alpha}=\frac{1}{M_\alpha}\sum_{\beta\equiv\alpha}{\prod_{i\in\beta}\sigma_{s,i}}$Here, the index$\alpha$denotes a cluster and$m_\alpha$the multiplicity of cluster$\alpha$. The sum goes through the clusters$\beta$in the orbit of cluster$\alpha$, and$i$denote the crystal sites (points) of cluster$\beta$. In short, the cluster correlations are the orbit-averaged cluster functions$\prod_{i\in\beta}\sigma_{s,i}\$.

Let's now calculate the cluster correlations for a the same small pool of clusters obtained at the start of this tutorial:

from clusterx.clusters.clusters_pool import ClustersPool

r = 1.5
cpool.write_clusters_db()

print(len(cpool), " clusters were created.")
juview(cpool.get_cpool_atoms(),n=6)


To accomplish this task, we must first create a CorrelationsCalculator object:

from clusterx.correlations import CorrelationsCalculator

corrcal = CorrelationsCalculator("binary-linear", plat, cpool)


Let's now create a random structure and then obtain the cluster correlations for this structure:

structure = cpool.get_cpool_scell().gen_random({0:[3]})
juview(structure.get_atoms())

mult = cpool.get_multiplicities()

corrs = corrcal.get_cluster_correlations(structure, multiplicities=mult)

for i in range(len(cpool)):
print("Cluster: ",i,"|   Correlation: ", corrs[i], "|   Multiplicity: ", mult[i])


Excercise 3

With the figures of the clusters, the figure of the structure and the obtained multiplicities, calculate *by hand* the correlations for the random structure and verify that they are equal to what is returned by the get_cluster_correlations method.

Excercise 4

Create a pool of clusters for the surface system of Tutorial 1 and visulaize the generated clusters with ASE's gui

3. Tutorial 3

3.1 The optimal model

Once a training data set and a pool of clusters is available, the next question is how to find the optimal cluster expansion model to make predictions. This task requires finding the set of clusters that are most relevant to describe the interactions present in the system. A number of optimality criteria can be used. Here we will focus on obtaining models which are optimal in the sense of providing the best possible predictions for new data, i.e., data not included in the training set. For this purpose, the quality of the predictions can be quantified by the cross validation score (CVS), which you will learn to calculate and interpret in this section.

Here, we will use the surface system with oxygen adsorption and Na substitution, which was used in Tutorial 1 and Tutorial 2, to find the optimal cluster expansion model. We start by generating again the needed elements for the task, namely a training data set and a pool of clusters.

from ase.build import fcc111, add_adsorbate
from clusterx.parent_lattice import ParentLattice
from clusterx.structures_set import StructuresSet
from clusterx.visualization import juview
from clusterx.super_cell import SuperCell
import numpy as np
from random import randint

nsc = 4

pri2 = fcc111('Al', size=(1,1,3))
pri2.center(vacuum=10.0, axis=2)

platt2 = ParentLattice(pri2, site_symbols=[['Al'],['Al'],['Al','Na'],['X','O']])
scell2 = SuperCell(platt2,np.diag([nsc,nsc,1]))

sset2 = StructuresSet(platt2)

nstruc = 60

for i in range(nstruc):
concentration = {0:[randint(0,nsc*nsc)],1:[randint(0,nsc*nsc)]}

juview(sset2,n=3) # Plot the first 3 created random structrues

from clusterx.calculators.emt import EMT2
sset2.set_calculator(EMT2())
energies = sset2.calculate_property()
print(energies)

r=3.5
from clusterx.clusters.clusters_pool import ClustersPool
cpool.write_clusters_db()
print(len(cpool)," clusters were generated.")

juview(cpool.get_cpool_atoms(),n=6)


Now that we have the basic elements (i.e. training set and pool of clusters), we can proceed to calculate the matrix of cluster correlations. We do so with a CorrelationsCalculator object:

from clusterx.correlations import CorrelationsCalculator

corrcal = CorrelationsCalculator("trigonometric", platt2, cpool)
comat = corrcal.get_correlation_matrix(sset2)
print(comat)


Now we are ready to perform the selection of the optimal cluster expansion model. To this end, we load the ClustersSelector class of clusterx and create and instance of it (that we call clsel):

from clusterx.clusters_selector import ClustersSelector
from clusterx.visualization import plot_optimization_vs_number_of_clusters
from clusterx.visualization import plot_predictions

clsel = ClustersSelector('linreg', cpool, clusters_sets = "size")


the clsel object defined above, will select clusters using linear regression for the fitting of cluster interaction ('linreg' parameter), it will select the optimal model out of the pool of clusters contained in the cpool object, by building clusters pools of increasing cluster size (as indicated by the option clusters_sets = "size") .

Now, we are prepared to call the select_clusters method of the clsel object. This will perform the actual clusters selection with an optimallity criterium based on cross-validation. The optimal clusters pool will be contained on the optimal_clusters member of clsel, which we assign to the cp2 variable. Finally, the cluster optimisation process may be visualised with the function plot_optimization_vs_number_of_clusters:

clsel.select_clusters(comat,energies)

print("Optimal set of clusters:")
cp2 = clsel.optimal_clusters._cpool
npoints=[]
for i,c in enumerate(cp2):

print("CV-loo of optimal set: "+str(clsel.opt_mean_cv))
print("RMSE of optimal set: "+str(clsel.opt_rmse))

from clusterx.visualization import plot_optimization_vs_number_of_clusters
plot_optimization_vs_number_of_clusters(clsel)


It is always useful to inspect how the predictions correlate to the data, this can be done with the visualisation function plot_predictions:

from clusterx.visualization import plot_predictions
plot_predictions(clsel,energies)


Now, we will demonstrate a different strategy for cluster selection, which is based on making a (partial) exhaustive list of clusters subsets (i.e. all possible combinations of clusters for certain clusters subset from the pool) and then select from that list according to smallest CV score:

clsel = ClustersSelector('linreg', cpool, clusters_sets = "size+combinations", nclmax = 1, set0 = [1,0])

clsel.select_clusters(comat,energies)

print("Optimal set of clusters:")
cp2 = clsel.optimal_clusters._cpool
npoints=[]
for i,c in enumerate(cp2):

print("CV-loo of optimal set: "+str(clsel.opt_mean_cv))
print("RMSE of optimal set: "+str(clsel.opt_rmse))

plot_optimization_vs_number_of_clusters(clsel)
plot_predictions(clsel,energies)


Now, we try with a different nclmax and set0:

clsel = ClustersSelector('linreg', cpool, clusters_sets = "size+combinations", nclmax = 3, set0 = [2,3.5])

clsel.select_clusters(comat,energies)

print("Optimal set of clusters:")
cp2 = clsel.optimal_clusters._cpool
npoints=[]
for i,c in enumerate(cp2):

print("CV-loo of optimal set: "+str(clsel.opt_mean_cv))
print("RMSE of optimal set: "+str(clsel.opt_rmse))

plot_optimization_vs_number_of_clusters(clsel)
plot_predictions(clsel,energies)


Another cluster optimisation strategy is based on LASSO (least absolute shrinkage and selection operator), which approximates the full combinatorial optimisation in an efficient way, by solving a convex optimisation problem:

clsel = ClustersSelector('lasso', cpool, sparsity_max=0.10, sparsity_min=0.001)

clsel.select_clusters(comat,energies)

print("Optimal set of clusters:")
cp2 = clsel.optimal_clusters._cpool
npoints=[]
for i,c in enumerate(cp2):
plot_predictions(clsel,energies)