CELL - A python package for cluster expansion with a focus on complex alloys

by Maria Troppenz, Martin Kuban, & Santiago Rigamonti for exciting nitrogen

Purpose: In this tutorial, you will learn the basics of the cluster expansion package CELL. You will start by creating parent lattices of bulk and surface systems, next you will learn how to set up supercells and structures sets used for training cluster expansion models. For these systems, you will create pools of clusters and study the cluster correlations for particular cases. Finally you will learn different methods to find the optimal model out of a large pool of clusters.

Before starting, make sure you are in the correct folder: /home/tutorials/CELL_tutorial.

### 1. Tutorial 1

#### 1.1 Building parent lattices

Here you will learn how to set up and visualize a parent lattice, which is the most basic object in CELL. We will consider two examples: a bulk fcc crystal, and a surface system with adsorbed atoms and surface alloying.

We start with the example of a bulk binary fcc metal alloy:

from ase.build import bulk
from clusterx.parent_lattice import ParentLattice

pri1 = bulk('Cu', 'fcc')
sub1 = bulk('Al', 'fcc')

platt1 = ParentLattice(pri1, substitutions=[sub1])


In the first line above, we import the bulk module of the Atomic Simulation Environment (ASE). In the second line, the ParentLattice class of CELL is loaded. In the next two lines, using the bulk function, we define the structures for the pristine Cu non-substituted lattice pri1 and the fully substituted Al (fcc) lattice sub1. These two structures are then used to initialize the ParentLattice object (which we call platt1) in the last line.

Next, we would like to visualize the just created parent lattice. To this end, we use the juview function of the visualization module of CELL:

from clusterx.visualization import juview

juview(platt1)


The figure on the left, corresponds to the pristine non-substituted Cu fcc crystal, while the figure on the right represents the fully Al-substituted crystal. In general, the line of code juview(parent_lattice), will generate as many additional figures as possible substituents are present in the parent lattice, as you will see for the next example of a surface system.

Now, we will setup the parent lattice for a surface system. It consists on a fcc(111) Al surface, with possible Na substitution on the uppermost Al layer and adsorption of oxygen atoms in "on-top" configuration. In order to build the parent lattice for such a system, we execute the following code:

from ase.build import fcc111, add_adsorbate

pri2 = fcc111('Al', size=(1,1,3))
pri2.center(vacuum=10.0, axis=2)


In the code above, first we load some builder utilities from ASE (fcc111 and add_adsorbate). In the next three lines, we
i) create an (111)-terminated fcc Al slab with three atomic layers
ii) add a vacancy (symbol X) site with "on top" configuration, and
iii) add vaccuum on the sides of the slab along the $z$-direction.

In this way we have defined the pristine structure pri2. Now we would like to set up the substitutions: Na on the top-most Al layer and oxygen on the "on-top" vacancy sites. To proceed, we first need some information from the pristine structure, as shown below:

for i, (symbol, z_coord) in enumerate(zip(pri2.get_chemical_symbols(),pri2.get_positions()[:,2])):
print("atom index: ",i,"| Chemical symbol: ",symbol,"| z coordinate: ",z_coord)


From this output, we see that the uppermost "Al" layer has atom index 2, and that the adsorbate layer has atom index 3. With this information, we initialize the parent lattice object in an alternative way, by telling which species can occupy every atom index: This is done with the site_symbols argument, which allows us to tell CELL which atomic species can occupy every atomic site:

platt2 = ParentLattice(pri2, site_symbols=[['Al'],['Al'],['Al','Na'],['X','O']])
juview(platt2)


In this way, we see that for atom indices 1 and 2 only ['Al'] is allowed, while atom index 2 admits the species in the array ['Al','Na'] and atom index 3 admits species ['X','O'], where 'X' denotes a vacancy. From left to right, the figures above denote: pristine non-substituted lattice (vacancy sites indicated with white color), on-top vacancy site substituted by oxygen (red), and top-most Al layer substituted by Na (purple).

#### 1.2 Building sturcture sets

In order to generate ab initio data to be used as input to train a cluster expansion model, we need to perform calculations on super cells of the parent lattice. In CELL, super cells are represented by objects of the class SuperCell. We will take the example of the surface system above, and create a $4\times4\times1$ super cell object:

from clusterx.super_cell import SuperCell
import numpy as np

scell2 = SuperCell(platt2,np.diag([4,4,1]))
juview(scell2)


As you can see, a super cell looks very much like an enlarged parent lattice, indeed, objects of the SuperCell class inherit from the ParentLattice class and share many properties.
Now, using the created super cell, we will generate a few random decorations of it at different concentrations. The generated structures will be collected in a StructuresSet object, that will be later used as training set for a cluster expansion.

Before doing so, however, we need some information from the just created SuperCell object that we will need to define the concentrations of Na substituents and vacancies in the generation of random structures. This information is contained in the sites dictionary of the super cell, which we can access with:

print(scell2.get_idx_subs())


This tells us that the supercell has three types of sites, or sublattices, with indexes $0$, $1$ and $2$. Sites of type $0$, contain species number 0 (vacancy) and can be substituted by species number 8 (oxygen) , sites of type $1$ contain species number 13 (Al) and can be substituted with species number 11 (Na); while sites of type $2$ contain species number 13 (Al) and can not be substituted.

In the code shown below, we first load the StructuresSet class and then create a structures-set object that we call sset2. Next, in three different for loops, by using the scell2's gen_random() method, we create
i) two random structures with 4 on-top oxygen atoms,
ii) two random structures with 4 Al$\rightarrow$Na substitutions, and
iii) 2 structures with 4 oxygen atoms and 4 Al$\rightarrow$Na substitutions:

from clusterx.structures_set import StructuresSet

sset2 = StructuresSet(platt2)

nstruc = 2

# i) Random structures with 4 on-top oxygen atoms
for i in range(nstruc):

# ii) Random structures with 4 substituent Na atoms
for i in range(nstruc):

# iii) Random structures with 4 on-top oxygen and 4 substituent Na atoms
for i in range(nstruc):

juview(sset2)


In the figure above, red spheres represent oxygen atoms and purple spheres represent Na substitutions.

Exercise 1

Build the parent lattice for a two-dimensional square lattice of a binary (e.g. SiGe) material and create (and visualize) 6 random structures on a $5\times5$ super cell.

As help, you can use the following Atoms object to initialze the ParentLattice object:

from ase import Atoms

a=3.0
pri4 = Atoms(positions=[[0,0,0]],symbols=['Si'],cell=[[a,0,0],[0,a,0],[0,0,2*a]],pbc=(1,1,0))


Excercise 2

Generate and visualize a few random structures for the fcc CuAl alloy of the first example on this tutorial. Do it so in a $3\times3\times3$ super cell.

### 2. Tutorial 2

#### 2.1 Building a pool of clusters

Now that you know how to build parent lattices and structures sets, the next task is to create a pool of clusters. The clusters pool defines the basis set for the expansion of configuration-dependent properties in terms of cluster functions. Although this basis is infinite (consisting of all symetrically distinct atomic configurations of the substituent species), in practical applications one must cut off the basis. In order to fix ideas, we will start with a very simple model (corresponding to the solution of Excercise 1 of Tutorial 1) of a binary two-dimensional lattice. Afterwards, you will tackle the more complicated surface system shown in Tutorial 1 by solving Excercise 4 of this tutorial.

To start, we must define the parent lattice:

from ase import Atoms
from clusterx.parent_lattice import ParentLattice
from clusterx.visualization import juview

a=4.0
pri = Atoms(positions=[[0,0,0]],symbols=['Si'],cell=[[a,0,0],[0,a,0],[0,0,2*a]],pbc=(1,1,0))

plat = ParentLattice(pri,site_symbols=[['Si','Ge']])

juview(plat)


Now, we will create all possible clusters of up to four points and radius up to $\sqrt{2}\,a$. This is done with the following piece of code:

from clusterx.clusters.clusters_pool import ClustersPool

r = 1.5
cpool = ClustersPool(plat, npoints=[0,1,2,3,4], radii=[0,0,a*r,a*r,a*r])


In the first line, we load the ClustersPool class. Then, we create an object of this class (cpool). To initialize it, we use the parameters npoints and radii. npoints indicates the number of points of the clusters in the pool, and the parameter radii indicates the maximum radius corresponding to each of the number of points indicated in npoints. In this way, we have created a clusters pool containing the empty cluster, all the 1-point clusters (in this case there will be only one of this kind), and all the 2-, 3- and 4-point clusters up to radius $1.5\times a$.

The number of clusters just created is:

print("Number of clusters: ", len(cpool))


These can be visualized by first creating a clusters database:

cpool.write_clusters_db()


Now, we have at least two ways to visualize the pool of clusters:
i) we can plot a few of them (e.g. n=6) on this notebook with juview by invoking the get_cpool_atoms() method of cpool:

juview(cpool.get_cpool_atoms(),n=6)


#### 2.3 Cluster correlations

In this section we will illustrate the calculation of the cluster correlations for the simple binary case studied here. To simplify the analysis, we will use a basis in which the site occupations are represented by a variable $\sigma_{s,i}$, which is equal to $1$ if the crystal site $i$ of structure $s$ is ocupied with the substitutional species (Ge in this example), or zero otherwise (Si in this example).

The cluster correlations $X_{s\alpha}$ for a binary compound are given by the equation:

$X_{s\alpha}=\frac{1}{M_\alpha}\sum_{\beta\equiv\alpha}{\prod_{i\in\beta}\sigma_{s,i}}$

Here, the index $\alpha$ denotes a cluster and $m_\alpha$ the multiplicity of cluster $\alpha$. The sum goes through the clusters $\beta$ in the orbit of cluster $\alpha$, and $i$ denote the crystal sites (points) of cluster $\beta$. In short, the cluster correlations are the orbit-averaged cluster functions $\prod_{i\in\beta}\sigma_{s,i}$.

Let's now calculate the cluster correlations for a the same small pool of clusters obtained at the start of this tutorial:

from clusterx.clusters.clusters_pool import ClustersPool

r = 1.5
cpool.write_clusters_db()

print(len(cpool), " clusters were created.")
juview(cpool.get_cpool_atoms(),n=6)


To accomplish this task, we must first create a CorrelationsCalculator object:

from clusterx.correlations import CorrelationsCalculator

corrcal = CorrelationsCalculator("binary-linear", plat, cpool)


Let's now create a random structure and then obtain the cluster correlations for this structure:

structure = cpool.get_cpool_scell().gen_random({0:[3]})
juview(structure.get_atoms())

mult = cpool.get_multiplicities()

corrs = corrcal.get_cluster_correlations(structure, multiplicities=mult)

for i in range(len(cpool)):
print("Cluster: ",i,"|   Correlation: ", corrs[i], "|   Multiplicity: ", mult[i])


Excercise 3

With the figures of the clusters, the figure of the structure and the obtained multiplicities, calculate *by hand* the correlations for the random structure and verify that they are equal to what is returned by the get_cluster_correlations method.

Excercise 4

Create a pool of clusters for the surface system of Tutorial 1 and visulaize the generated clusters with ASE's gui

### 3. Tutorial 3

#### 3.1 The optimal model

Once a training data set and a pool of clusters is available, the next question is how to find the optimal cluster expansion model to make predictions. This task requires finding the set of clusters that are most relevant to describe the interactions present in the system. A number of optimality criteria can be used. Here we will focus on obtaining models which are optimal in the sense of providing the best possible predictions for new data, i.e., data not included in the training set. For this purpose, the quality of the predictions can be quantified by the cross validation score (CVS), which you will learn to calculate and interpret in this section.

Here, we will use the surface system with oxygen adsorption and Na substitution, which was used in Tutorial 1 and Tutorial 2, to find the optimal cluster expansion model. We start by generating again the needed elements for the task, namely a training data set and a pool of clusters.

from ase.build import fcc111, add_adsorbate
from clusterx.parent_lattice import ParentLattice
from clusterx.structures_set import StructuresSet
from clusterx.visualization import juview
from clusterx.super_cell import SuperCell
import numpy as np
from random import randint

nsc = 4

pri2 = fcc111('Al', size=(1,1,3))
pri2.center(vacuum=10.0, axis=2)

platt2 = ParentLattice(pri2, site_symbols=[['Al'],['Al'],['Al','Na'],['X','O']])
scell2 = SuperCell(platt2,np.diag([nsc,nsc,1]))

sset2 = StructuresSet(platt2)

nstruc = 60

for i in range(nstruc):
concentration = {0:[randint(0,nsc*nsc)],1:[randint(0,nsc*nsc)]}

juview(sset2,n=3) # Plot the first 3 created random structrues

from clusterx.calculators.emt import EMT2
sset2.set_calculator(EMT2())
energies = sset2.calculate_property()
print(energies)

r=3.5
from clusterx.clusters.clusters_pool import ClustersPool
cpool.write_clusters_db()
print(len(cpool)," clusters were generated.")

juview(cpool.get_cpool_atoms(),n=6)


Now that we have the basic elements (i.e. training set and pool of clusters), we can proceed to calculate the matrix of cluster correlations. We do so with a CorrelationsCalculator object:

from clusterx.correlations import CorrelationsCalculator

corrcal = CorrelationsCalculator("trigonometric", platt2, cpool)
comat = corrcal.get_correlation_matrix(sset2)
print(comat)


Now we are ready to perform the selection of the optimal cluster expansion model. To this end, we load the ClustersSelector class of clusterx and create and instance of it (that we call clsel):

from clusterx.clusters_selector import ClustersSelector
from clusterx.visualization import plot_optimization_vs_number_of_clusters
from clusterx.visualization import plot_predictions

clsel = ClustersSelector('linreg', cpool, clusters_sets = "size")


the clsel object defined above, will select clusters using linear regression for the fitting of cluster interaction ('linreg' parameter), it will select the optimal model out of the pool of clusters contained in the cpool object, by building clusters pools of increasing cluster size (as indicated by the option clusters_sets = "size") .

Now, we are prepared to call the select_clusters method of the clsel object. This will perform the actual clusters selection with an optimallity criterium based on cross-validation. The optimal clusters pool will be contained on the optimal_clusters member of clsel, which we assign to the cp2 variable. Finally, the cluster optimisation process may be visualised with the function plot_optimization_vs_number_of_clusters:

clsel.select_clusters(comat,energies)

print("Optimal set of clusters:")
cp2 = clsel.optimal_clusters._cpool
npoints=[]
for i,c in enumerate(cp2):

print("CV-loo of optimal set: "+str(clsel.opt_mean_cv))
print("RMSE of optimal set: "+str(clsel.opt_rmse))

from clusterx.visualization import plot_optimization_vs_number_of_clusters
plot_optimization_vs_number_of_clusters(clsel)


It is always useful to inspect how the predictions correlate to the data, this can be done with the visualisation function plot_predictions:

from clusterx.visualization import plot_predictions
plot_predictions(clsel,energies)


Now, we will demonstrate a different strategy for cluster selection, which is based on making a (partial) exhaustive list of clusters subsets (i.e. all possible combinations of clusters for certain clusters subset from the pool) and then select from that list according to smallest CV score:

clsel = ClustersSelector('linreg', cpool, clusters_sets = "size+combinations", nclmax = 1, set0 = [1,0])

clsel.select_clusters(comat,energies)

print("Optimal set of clusters:")
cp2 = clsel.optimal_clusters._cpool
npoints=[]
for i,c in enumerate(cp2):

print("CV-loo of optimal set: "+str(clsel.opt_mean_cv))
print("RMSE of optimal set: "+str(clsel.opt_rmse))

plot_optimization_vs_number_of_clusters(clsel)
plot_predictions(clsel,energies)


Now, we try with a different nclmax and set0:

clsel = ClustersSelector('linreg', cpool, clusters_sets = "size+combinations", nclmax = 3, set0 = [2,3.5])

clsel.select_clusters(comat,energies)

print("Optimal set of clusters:")
cp2 = clsel.optimal_clusters._cpool
npoints=[]
for i,c in enumerate(cp2):

print("CV-loo of optimal set: "+str(clsel.opt_mean_cv))
print("RMSE of optimal set: "+str(clsel.opt_rmse))

plot_optimization_vs_number_of_clusters(clsel)
plot_predictions(clsel,energies)


Another cluster optimisation strategy is based on LASSO (least absolute shrinkage and selection operator), which approximates the full combinatorial optimisation in an efficient way, by solving a convex optimisation problem:

clsel = ClustersSelector('lasso', cpool, sparsity_max=0.10, sparsity_min=0.001)

clsel.select_clusters(comat,energies)

print("Optimal set of clusters:")
cp2 = clsel.optimal_clusters._cpool
npoints=[]
for i,c in enumerate(cp2):

print("CV-loo of optimal set: "+str(clsel.opt_mean_cv))
print("RMSE of optimal set: "+str(clsel.opt_rmse))

plot_optimization_vs_number_of_clusters(clsel)

from clusterx.visualization import plot_optimization_vs_sparsity

plot_optimization_vs_sparsity(clsel)
plot_predictions(clsel,energies)


[[/collapsible]]