ALFF: Data generator¶
alff_gen is a command-line tool that generates labeled data for training ML forcefields. It performs the following tasks automatically, without any user intervention:
- Build atomic structures
- Generate DFT input scripts to optimize the atomic structures
- Submit DFT optimization jobs to the clusters, monitor the job status, and retrieve the DFT calculation results
- Generate new atomic structures by scaling and perturbing the optimized structures
- Generate DFT input scripts to run AIMD simulations
- Submit AIMD jobs to the clusters, monitor the job status, and retrieve the DFT calculation results
- Collect the data from the DFT calculations
- Convert the data to a format ready to be used for training ML forcefields
PARAM.yaml
: The parameters of the generator.

MACHINE.yaml
: The settings of the machines running the generator's subprocesses.
An example run using alff
to generate labeled data:
-------------------------------- ALFF --------------------------------
Version: 0.1.dev409+g177de1d
Path: C:/conda/envs/py13/Lib/site-packages/alff
---------------------------- Dependencies ----------------------------
numpy 1.26.4 C:/conda/envs/py13/Lib/site-packages/numpy
scipy 1.14.1 C:/conda/envs/py13/Lib/site-packages/scipy
ase 3.23.1b1 C:/conda/envs/py13/Lib/site-packages/ase
thutil 0.1.dev122 C:/conda/envs/py13/Lib/site-packages/thutil
phonopy 2.29.1 C:/conda/envs/py13/Lib/site-packages/phonopy
----------------------- Author: C.Thang Nguyen -----------------------
----------------- Contact: http://thang.eu.org/email -----------------
___ __ ____________
/ | / / / ____/ ____/
/ /| | / / / /_ / /_
/ ___ |/ /___/ __/ / __/
/_/ |_/_____/_/ /_/
alff-INFO: START GENERATING DATA
alff-INFO: -------------------- stage_00: build_structure -------------
The directory `Mo_bulk_bcc_02x02x02` already exists. Select an action: [yes/backup/no]?
Yes: overwrite the existing directory and continue or update uncompleted tasks.
Backup: back up the existing directory and perform fresh tasks.
No: skip and exit.
Your answer (y/b/n): y
Overwrite the existing directory
alff-INFO: Working on the path: Mo_bulk_bcc_02x01x01/00_build_structure
alff-INFO: Build structures from scratch
alff-INFO: -------------------- stage_01: optimize --------------------
alff-INFO: Optimize the structures
alff-INFO: No DFT task is found. Skip the DFT calculation.
alff-INFO: -------------------- stage_02: scale_perturb ---------------
alff-INFO: Scaling on the path: Mo_bulk_bcc_02x01x01/01_scale_perturb
alff-INFO: -------------------- stage_03: run_dft ---------------------
alff-INFO: Run AIMD calculations
alff-INFO: Running DFT jobs... be patient
Remote host: some_IP_address
Remote path: /uwork/user01/work/w24_alff_job
Log file: logs/20241020_220540_dispatcher.log
alff-INFO: Running chunk 1 of 9 chunks (20 of 431 tasks).
alff-INFO: Running chunk 2 of 9 chunks (20 of 411 tasks). Estimated time: 1 days, 4:39
alff-INFO: Running chunk 3 of 9 chunks (20 of 391 tasks). Estimated time: 1 days, 3:15
...
alff-INFO: -------------------- stage_04: collect_data ----------------
alff-INFO: Collect data on the path: Mo_bulk_bcc_02x01x01/02_gendata
alff-INFO: FINISHED !
Parameters¶
The parameters of the generator are stored in a YAML/JSON/JSONC file. Here is an example of the parameters:
```yaml
stages:
  - build_structure        # build the atomic structures
  - optimize               # optimize the structures
  - scale_perturb          # scale and perturb the structures
  - run_dft                # run the DFT singlepoint/AIMD simulation
  - collect_data           # collect the data

structure:                 # atomic structure information
  # from_extxyz: ["path/to/extxyz_file"]  # list of paths to EXTXYZ files used as the initial structures. If provided, the structures are read from the files and the other structure parameters are ignored.
  from_scratch:            # build the structure from scratch
    structure_type: "bulk" # bulk, molecule, surface
    chem_formula: "W"      # chemical formula/element, e.g., "H2O", "Mg2O2", "Mg"
    supercell: [2, 2, 2]   # size of the supercell
    pbc: [1, 1, 1]
    ase_arg:               # ASE kwargs for building the structure. Accepts all ASE arguments.
      crystalstructure: "fcc"  # choices: sc, fcc, bcc, tetragonal, bct, hcp, rhombohedral, orthorhombic, mcl, diamond, zincblende, rocksalt, cesiumchloride, fluorite, wurtzite
      a: 3.15              # lattice constant
      # cubic: True

scale_perturb:             # scale and perturb the structure
  scale_x: [0.9, 0.95, 1.0, 1.05, 1.1]  # scale the structure in the x-direction
  scale_y: [0.9, 0.95, 1.0, 1.05, 1.1]  # scale the structure in the y-direction
  scale_z: [0.9, 0.95, 1.0, 1.05, 1.1]  # scale the structure in the z-direction
  perturb_num: 1           # number of perturbations of each structure
  perturb_disp: 0.01       # standard deviation of the perturbation displacement

dft:
  calc_type: 'aimd'        # choices: 'singlepoint', 'aimd'
  job_per_dispatch: 18     # maximum number of jobs per submission to the cluster
  gpaw_calc:               # accepts GPAW parameters
    mode:
      name: 'pw'           # use the plane-wave mode
      ecut: 500            # energy cutoff in eV
    xc: "PBE"              # exchange-correlation functional
    kpts: {"density": 6, "gamma": False}  # if `kpts` is not set, only the Gamma point is used
    parallel:
      sl_auto: True        # enable ScaLAPACK parallelization
      # use_elpa: True     # enable the Elpa eigensolver
      augment_grids: True  # use all cores for the XC/Poisson solver

optimize:                  # run DFT to optimize the structure
  fmax: 0.05               # force convergence criterion

aimd:                      # run an AIMD simulation
  dt: 1.0                  # time step in fs
  temperature: 300         # temperature in K
  ensemble: "NVE"          # ensemble type; choices: "NVE", "NVT"
  collect_frames: 5        # number of frames to collect; nsteps = collect_frames * traj_freq
  traj_freq: 1             # dump a frame every `traj_freq` steps
```
Context options¶
When the output directory already exists, the generator asks the user how to proceed. The options are:

- Yes
: overwrite the existing directory and continue.
- Backup
: back up the existing directory and continue.
- No
: skip the building process and exit.

Yes is recommended. With this option, the generator will:

- Overwrite and continue in the existing directory.
- Automatically select the unlabeled configurations to run DFT calculations, avoiding DFT runs for already labeled configurations.

This is the best way to resume a previous unfinished run or to add more sampling options.
Machines¶
The settings of the machines running the generator's subprocesses are stored in a YAML/JSON/JSONC file. Here is an example of the machine settings:
```yaml
dft:
  command: "mpirun -np $NSLOTS gpaw"
  machine:
    batch_type: SGE          # supported batch systems: SGE, SLURM, PBS, TORQUE, BASH
    context_type: SSHContext # supported contexts: SSH, Local
    remote_root: /path/of/project/in/remote/machine
    remote_profile:
      hostname: some_IP_address      # address of the remote machine
      username: little_bird          # username to log in to the remote machine
      password: "123456"             # password to log in to the remote machine
      port: 2022                     # port to connect to the remote machine
      timeout: 20                    # timeout for the SSH connection
      execute_command: "ssh cluster" # command to execute the SSH connection
  resources:
    group_size: 1
    queue_name: "lion-normal.q"
    cpu_per_node: 12
    kwargs:
      pe_name: lion-normal
      job_name: zalff_dft
      custom_flags:
        - "#$ -l h_rt=168:00:00"
    module_list:
      - conda/py11gpaw
    source_list:
      - /etc/profile.d/modules.sh
    envs:
      OMP_NUM_THREADS: 1
      OMPI_MCA_btl_openib_allow_ib: 1
      # OMPI_MCA_btl: ^tcp
```
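A misconfigured MACHINE.yaml typically only fails once jobs are dispatched, so it can be worth parsing the file and checking it up front. The sketch below does this with PyYAML; the required-key list is an assumption based on the example above, not a complete schema.

```python
import yaml  # PyYAML

# keys the dispatcher plausibly needs (assumed, based on the example above)
REQUIRED_MACHINE_KEYS = {"batch_type", "context_type",
                         "remote_root", "remote_profile"}

def check_machine_config(text):
    """Parse a MACHINE.yaml string and verify the machine section."""
    cfg = yaml.safe_load(text)
    machine = cfg["dft"]["machine"]
    missing = REQUIRED_MACHINE_KEYS - machine.keys()
    if missing:
        raise ValueError(f"MACHINE.yaml is missing keys: {sorted(missing)}")
    return cfg

example = """
dft:
  command: "mpirun -np $NSLOTS gpaw"
  machine:
    batch_type: SGE
    context_type: SSHContext
    remote_root: /tmp/work
    remote_profile:
      hostname: some_IP_address
"""
cfg = check_machine_config(example)
print(cfg["dft"]["machine"]["batch_type"])  # SGE
```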