
ALFF: Data generator

To generate labeled data for training ML forcefields, alff_gen is a command-line tool designed to perform the following tasks automatically, without any user intervention:
  • Build atomic structures
  • Generate DFT inputs to optimize the atomic structures
  • Submit DFT optimization jobs to the clusters, monitor the job status, and retrieve the DFT calculation results
  • Generate atomic structures by scaling and perturbing the optimized structures
  • Generate DFT inputs to run AIMD simulations
  • Submit AIMD jobs to the clusters, monitor the job status, and retrieve the DFT calculation results
  • Collect the data from the DFT calculations
  • Convert the data into a format ready for training ML forcefields

alff_gen PARAM.yaml MACHINE.yaml
  • PARAM.yaml: The parameters of the generator.
  • MACHINE.yaml: The settings of the machines running the generator's subprocesses.

An example run using alff to generate labeled data:

-------------------------------- ALFF --------------------------------
    Version:  0.1.dev409+g177de1d
       Path:  C:/conda/envs/py13/Lib/site-packages/alff
---------------------------- Dependencies ----------------------------
       numpy  1.26.4       C:/conda/envs/py13/Lib/site-packages/numpy
       scipy  1.14.1       C:/conda/envs/py13/Lib/site-packages/scipy
         ase  3.23.1b1     C:/conda/envs/py13/Lib/site-packages/ase
      thutil  0.1.dev122   C:/conda/envs/py13/Lib/site-packages/thutil
     phonopy  2.29.1       C:/conda/envs/py13/Lib/site-packages/phonopy
----------------------- Author: C.Thang Nguyen -----------------------
----------------- Contact: http://thang.eu.org/email -----------------

                             ___    __    ____________
                            /   |  / /   / ____/ ____/
                           / /| | / /   / /_  / /_
                          / ___ |/ /___/ __/ / __/
                         /_/  |_/_____/_/   /_/

alff-INFO: START GENERATING DATA
alff-INFO: -------------------- stage_00: build_structure -------------
 The directory `Mo_bulk_bcc_02x02x02` already existed. Select an action: [yes/backup/no]?
 Yes: overwrite the existing directory to continue or update uncompleted tasks.
 Backup: backup the existing directory and perform fresh tasks.
 No: skip and exit.
        Your answer (y/b/n): y
        Overwrite the existing directory
alff-INFO: Working on the path: Mo_bulk_bcc_02x01x01/00_build_structure
alff-INFO: Build structures from scratch
alff-INFO: -------------------- stage_01: optimize --------------------
alff-INFO: Optimize the structures
alff-INFO: No DFT task is found. Skip the DFT calculation.
alff-INFO: -------------------- stage_02: scale_perturb ---------------
alff-INFO: Scaling on the path: Mo_bulk_bcc_02x01x01/01_scale_perturb
alff-INFO: -------------------- stage_03: run_dft ---------------------
alff-INFO: Run AIMD calculations
alff-INFO: Running DFT jobs... be patient
               Remote host: some_IP_address
               Remote path: /uwork/user01/work/w24_alff_job
               Log file: logs/20241020_220540_dispatcher.log
alff-INFO: Running chunk 1 of 9 chunks (20 of 431 tasks).
alff-INFO: Running chunk 2 of 9 chunks (20 of 411 tasks). Estimated time: 1 days, 4:39
alff-INFO: Running chunk 3 of 9 chunks (20 of 391 tasks). Estimated time: 1 days, 3:15
...

alff-INFO: -------------------- stage_04: collect_data ----------------
alff-INFO: Collect data on the path: Mo_bulk_bcc_02x01x01/02_gendata
alff-INFO: FINISHED !

Parameters

The parameters of the generator are stored in a YAML/JSON/JSONC file. Here is an example of the parameters:

stages:
  - build_structure         # build the atomic structures
  - optimize                # optimize the structure
  - scale_perturb           # scale and perturb the structure
  - run_dft                 # run the DFT single-point/AIMD simulation
  - collect_data            # collect the data

structure:  # atomic structure information
  # from_extxyz: ["path/to/extxyz_file"]  # list-of-paths to the EXTXYZ files to be used as the initial structure. If provided, the structure will be read from the file, and the other structure parameters will be ignored.

  from_scratch:               # build the structure from scratch
    structure_type: "bulk"    # bulk, molecule, surface,
    chem_formula: "W"         # chemical formula/element. e.g., "H2O", "Mg2O2", "Mg",
    supercell: [ 2, 2, 2 ]    # size of the supercell
    pbc: [1, 1, 1]
    ase_arg:                 # ASE kwargs for building the structure; accepts all ASE arguments.
      crystalstructure: "fcc" # choices: sc,fcc,bcc,tetragonal,bct,hcp,rhombohedral,orthorhombic,mcl,diamond,zincblende,rocksalt,cesiumchloride,fluorite,wurtzite.
      a: 3.15  # lattice constant
      # cubic: True


scale_perturb:                          # scale and perturb the structure
  scale_x: [0.9, 0.95, 1.0, 1.05, 1.1]  # scale the structure in x-direction
  scale_y: [0.9, 0.95, 1.0, 1.05, 1.1]  # scale the structure in y-direction
  scale_z: [0.9, 0.95, 1.0, 1.05, 1.1]  # scale the structure in z-direction
  perturb_num: 1                        # number of perturbations on each structure
  perturb_disp: 0.01                    # standard deviation of the perturb displacement
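
The scale grid and perturbation settings above combine multiplicatively: every (scale_x, scale_y, scale_z) triple is applied, and each scaled structure is perturbed `perturb_num` times with Gaussian displacements of standard deviation `perturb_disp`. A pure-Python sketch of this counting (the variable names and toy 2-atom cell are illustrative, not alff internals; in the real tool the scaling is applied to the cell vectors and the displacements to the Cartesian atomic positions):

```python
import itertools
import random

scale_x = scale_y = scale_z = [0.9, 0.95, 1.0, 1.05, 1.1]
perturb_num = 1
perturb_disp = 0.01  # std. dev. of the Gaussian displacement

# Toy 2-atom cell standing in for the optimized structure from stage_01.
positions = [(0.0, 0.0, 0.0), (0.5, 0.5, 0.5)]

structures = []
for sx, sy, sz in itertools.product(scale_x, scale_y, scale_z):
    for _ in range(perturb_num):
        # Scale each coordinate, then add a Gaussian perturbation.
        perturbed = [
            (x * sx + random.gauss(0, perturb_disp),
             y * sy + random.gauss(0, perturb_disp),
             z * sz + random.gauss(0, perturb_disp))
            for x, y, z in positions
        ]
        structures.append(perturbed)

print(len(structures))  # 5 * 5 * 5 scale combinations x 1 perturbation = 125
```

This is why even a small scale grid quickly produces hundreds of DFT tasks, as seen in the example run above.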


dft:
  calc_type: 'aimd'              # choices: 'singlepoint', 'aimd'
  job_per_dispatch: 18          # maximum number of jobs per submission to the cluster

  gpaw_calc:                    # accepts GPAW parameters
    mode:
      name: 'pw'                # plane-wave (PW) mode
      ecut: 500                 # energy cutoff in eV
    xc: "PBE"                   # exchange-correlation functional
    kpts: {"density": 6, "gamma": False }  # if not set `kpts`, then only Gamma-point is used
    parallel:
      sl_auto: True             # enable ScaLAPACK parallelization
      # use_elpa: True          # enable Elpa eigensolver
      augment_grids: True       # use all cores for XC/Poisson solver

  optimize:                     # run DFT to optimize the structure
    fmax: 0.05                  # force convergence criterion in eV/Å

  aimd:                         # run AIMD simulation
    dt: 1.0                     # time step in fs
    temperature: 300            # temperature in K
    ensemble: "NVE"             # ensemble type. choices: "NVE", "NVT"
    collect_frames: 5           # number of frames to be collected. Then nsteps = collect_frames * traj_freq
    traj_freq: 1                # dump the frames every `traj_freq` steps
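
The AIMD length follows directly from the last two keys: the run is `collect_frames * traj_freq` MD steps long, and one frame is kept every `traj_freq` steps. A quick check of that arithmetic:

```python
collect_frames = 5   # frames to keep per AIMD run
traj_freq = 1        # dump interval in MD steps
dt = 1.0             # time step in fs

nsteps = collect_frames * traj_freq   # total MD steps per run
simulated_time_fs = nsteps * dt       # total simulated time in fs

print(nsteps, simulated_time_fs)  # 5 steps, 5.0 fs
```

Increasing `traj_freq` decorrelates the collected frames at the cost of a longer AIMD run per structure.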

Context options

When the output directory already exists, the generator will ask for the user's choice to proceed. The options are:
  • Yes: overwrite the existing directory and continue.
  • Backup: back up the existing directory and continue.
  • No: skip the building process and exit.

Yes is recommended. With this option, the generator will:
  • Overwrite and continue in the existing directory.
  • Automatically select the unLABELED configurations to run DFT calculations, avoiding DFT runs for already LABELED configurations.

This is the best way to continue a previously unfinished run or to add more sampling options.

Machines

The settings of the machines running the generator's subprocesses are stored in a YAML/JSON/JSONC file. Here is an example of the machine settings:

dft:
  command: "mpirun -np $NSLOTS gpaw"

  machine:
    batch_type: SGE            # Supported systems: SGE, SLURM, PBS, TORQUE, BASH
    context_type: SSHContext   # Supported contexts: SSH, Local
    remote_root: /path/of/project/in/remote/machine
    remote_profile:
      hostname: some_IP_address         # address of the remote machine
      username: little_bird             # username to log in to the remote machine
      password: "123456"                # password to log in to the remote machine
      port: 2022                        # port to connect the remote machine
      timeout: 20                       # timeout for the SSH connection
      execute_command: "ssh cluster"    # command to execute the SSH connection

  resources:
    group_size: 1
    queue_name: "lion-normal.q"
    cpu_per_node: 12
    kwargs:
      pe_name: lion-normal
      job_name: zalff_dft
    custom_flags:
      - "#$ -l h_rt=168:00:00"
    module_list:
      - conda/py11gpaw
    source_list:
      - /etc/profile.d/modules.sh
    envs:
      OMP_NUM_THREADS: 1
      OMPI_MCA_btl_openib_allow_ib: 1
      # OMPI_MCA_btl: ^tcp
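
To make the `resources` keys concrete, here is a hypothetical sketch of how they could be rendered into an SGE job-script header. The `sge_header` function and the exact line order are illustrative; the real template used by the dispatcher may differ:

```python
# Resources mapping copied from the MACHINE.yaml example above.
resources = {
    "queue_name": "lion-normal.q",
    "cpu_per_node": 12,
    "kwargs": {"pe_name": "lion-normal", "job_name": "zalff_dft"},
    "custom_flags": ["#$ -l h_rt=168:00:00"],
    "module_list": ["conda/py11gpaw"],
    "source_list": ["/etc/profile.d/modules.sh"],
    "envs": {"OMP_NUM_THREADS": "1", "OMPI_MCA_btl_openib_allow_ib": "1"},
}

def sge_header(res):
    """Render a plausible SGE submission header from the resources mapping."""
    lines = [
        "#!/bin/bash",
        f"#$ -N {res['kwargs']['job_name']}",        # job name
        f"#$ -q {res['queue_name']}",                # target queue
        f"#$ -pe {res['kwargs']['pe_name']} {res['cpu_per_node']}",  # parallel env
    ]
    lines += res["custom_flags"]                     # raw scheduler flags
    lines += [f"source {s}" for s in res["source_list"]]
    lines += [f"module load {m}" for m in res["module_list"]]
    lines += [f"export {k}={v}" for k, v in res["envs"].items()]
    return "\n".join(lines)

print(sge_header(resources))
```

On a SLURM or PBS machine the same keys would map onto the corresponding `#SBATCH`/`#PBS` directives instead.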