Machine Learning#

Overview of Machine Learning

We will learn how machine learning models data, typically with predictive functions. Machine learning includes many techniques, but here we will focus on only those necessary to transition into deep learning. For example, random forests, support vector machines, and nearest neighbors are widely used machine learning techniques that are effective but not covered here.

What about the model?

We want a model capable of handling our inputs and producing something in the shape of our outputs.

Big Data#

Additional Dimensions

  • Complexity: multiple sources and data streams

  • Variability

    • Unpredictable data flows

    • Social media trending

Why Big Data is important

  • Data contains information

  • Information leads to insights

  • Insights help in making better decisions

How to derive insights from data?

→ Machine Learning

Conclusions:

  • Data is nothing without insights

  • Machine Learning is the key to deriving insights from data

  • Big Data and Machine Learning have huge potential

Algorithms in ML#

The picture below shows an overview of machine learning:

https://github.com/thangckt/note_ml/blob/main/notebook/0_basic_MLDL/image/1_1_machine-learning.png?raw=1

Supervised Learning#

Given features, we want our model to predict labels. See more. A minimal scikit-learn sketch follows the list below.

  • Classification

    • Decision Trees

    • Naive Bayes Classification

  • Regression

    • Ordinary Least Squares Regression

    • Logistic Regression

    • Support Vector Machines

    • Ensemble Methods
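
As a concrete illustration, here is a minimal scikit-learn sketch of supervised classification with a decision tree; the dataset and hyperparameter choices are illustrative, not taken from this lesson.

```python
# Minimal supervised-learning sketch with scikit-learn (illustrative only):
# fit a decision tree classifier on labeled data, then score on held-out data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)                  # features (N, D), labels (N,)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(max_depth=3)          # max_depth is a hyperparameter
clf.fit(X_train, y_train)                          # learn from labeled data
print(clf.score(X_test, y_test))                   # accuracy on unseen data
```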

Unsupervised Learning#

There are no labels in this type of learning. A minimal scikit-learn sketch follows the list below.

  • Clustering

    • Centroid-based algorithms

    • Connectivity-based algorithms

    • Density-based algorithms

    • Probabilistic algorithms

  • Dimensionality Reduction

    • Principal Component Analysis

    • Independent Component Analysis

    • Singular Value Decomposition

  • Neural networks / Deep Learning
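
A minimal scikit-learn sketch of the two main ideas above, clustering (centroid-based KMeans) and dimensionality reduction (PCA), on random data; the shapes and numbers are illustrative.

```python
# Minimal unsupervised-learning sketch (illustrative): no labels are used.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                    # unlabeled features (N=100, D=5)

labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)  # centroid-based clustering
X_2d = PCA(n_components=2).fit_transform(X)              # reduce D=5 -> D=2
print(labels[:10], X_2d.shape)
```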

Reinforcement Learning#

In reinforcement learning there are no fixed labels: an agent learns by interacting with an environment, taking actions and receiving rewards, with the goal of maximizing cumulative reward.

The Ingredients#

Machine learning is the fitting of models \(\hat{f}(\vec{x})\) to data \((\vec{x}, y)\) that we know came from some "data generation" process \(f(\vec{x})\). First, some definitions (a small numpy sketch follows them):

Features

    set of \(N\) vectors \(\{\vec{x}_i\}\) of dimension \(D\). Can be reals, integers, etc.

Labels

    set of \(N\) integers or reals \(\{y_i\}\). \(y_i\) is usually a scalar

Labeled Data

    set of \(N\) tuples \(\{\left(\vec{x}_i, y_i\right)\}\)

Unlabeled Data

    set of \(N\) features \(\{\vec{x}_i\}\) that may have unknown \(y\) labels

Data generation process

    The unseen process \(f(\vec{x})\) that takes a given feature vector in and returns a real label \(y\) (what we’re trying to model)

Model

    A function \(\hat{f}(\vec{x})\) that takes a given feature vector in and returns a predicted \(\hat{y}\)

Predictions

     \(\hat{y}\), our predicted output for a given input \(\vec{x}\).
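
To make these definitions concrete, here is a small numpy sketch; the linear data generation process and all names are assumptions for illustration only.

```python
# Mapping the definitions above to code (illustrative names and a toy f).
import numpy as np

N, D = 100, 3
rng = np.random.default_rng(42)

X = rng.normal(size=(N, D))          # features: N vectors of dimension D
w_true = np.array([1.0, -2.0, 0.5])  # hidden parameters of the toy process

def f(x):                            # data generation process (normally unseen)
    return x @ w_true

y = f(X)                             # labels: (X, y) together are labeled data

def f_hat(x, w):                     # model: returns a predicted y_hat
    return x @ w

y_hat = f_hat(X, np.zeros(D))        # predictions of an untrained model
```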

Note

The content in this part is primarily from:

See also

  1. Introductory Machine Learning

  2. Two reviews of machine learning in materials

  3. A review of machine learning in computational chemistry

  4. A review of machine learning in metals

Terminologies in ML#

  • The patterns: the learned parameters in the model, i.e., the parameters to be found in the relationship between inputs and outputs. For example, in the linear model \(y = ax + b\), the learned patterns are the weight \(a\) and the bias \(b\); see the sketch after this list.

  • Hidden units: neurons in hidden layers

  • Hyperparameters: all user-chosen parameters of the model (e.g., learning rate, number of layers, number of neurons per layer,…)

  • Epoch: one complete pass through the entire training dataset

  • Loss function: measures how wrong your model's predictions are. The higher the loss, the worse your model. It is sometimes called the "loss criterion", "criterion", or "cost function".
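
The sketch below ties these terms together for the linear model \(y = ax + b\): the patterns are \(a\) and \(b\), and the mean squared error (one common loss function, assumed here) measures how wrong the predictions are.

```python
# Patterns and loss for the linear model y = a*x + b (toy numbers).
import numpy as np

x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + 1.0                    # ground truth generated with a=2, b=1

a, b = 0.0, 0.0                      # the patterns (parameters) to be found
y_pred = a * x + b

loss = np.mean((y_pred - y) ** 2)    # MSE loss: the higher, the worse the model
print(f"MSE loss: {loss:.3f}")
```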

Workflow in ML#

This workflow works with PyTorch. See this lesson.

1. Prepare data#

  1. Prepare inputs and outputs in a format suitable for the ML framework to be used (e.g., PyTorch only works with data in the form of torch.Tensor)

  2. Split the data into train and test sets (sometimes: train, validation, and test); see the sketch after this list
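
A minimal sketch of this step, assuming the raw data are numpy arrays; the shapes and the 80/20 split are illustrative.

```python
# Step 1 sketch: convert numpy data to torch.Tensor and split train/test.
import numpy as np
import torch

X = np.random.rand(100, 3).astype(np.float32)   # toy inputs (N=100, D=3)
y = np.random.rand(100, 1).astype(np.float32)   # toy outputs

X_t = torch.from_numpy(X)                       # PyTorch works on torch.Tensor
y_t = torch.from_numpy(y)

n_train = int(0.8 * len(X_t))                   # simple 80/20 split
X_train, y_train = X_t[:n_train], y_t[:n_train]
X_test, y_test = X_t[n_train:], y_t[n_train:]
```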

2. Build model#

  1. Constructing a model by subclassing nn.Module

  2. Defining a loss function and optimizer.

One more step to consider: setting up device-agnostic code (so our model can run on the CPU, or on the GPU if one is available).
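
A minimal sketch of this step, including the device-agnostic setup; the model architecture is an assumed toy example.

```python
# Step 2 sketch: subclass nn.Module, pick a loss function and an optimizer.
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"   # device-agnostic

class LinearModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(in_features=3, out_features=1)

    def forward(self, x):                       # called during the forward pass
        return self.linear(x)

model = LinearModel().to(device)
loss_fn = nn.MSELoss()                          # loss criterion
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
```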

3. Train model#

PyTorch steps in training (a loop sketch follows this list):

  1. Forward pass - The model goes through all of the training data once, performing its forward() function calculations (compute model(x_train)).

  2. Calculate the loss - The model’s outputs (predictions) are compared to the ground truth and evaluated to see how wrong they are (loss = loss_fn(y_pred, y_train)).

  3. Zero gradients - The optimizer's gradients are set to zero (they accumulate by default) so they can be recalculated for the current training step (optimizer.zero_grad()).

  4. Perform backpropagation on the loss - Computes the gradient of the loss with respect to every model parameter to be updated (each parameter with requires_grad=True). This is known as backpropagation, hence "backwards" (loss.backward()).

  5. Step the optimizer (gradient descent) - Update the parameters with requires_grad=True with respect to the loss gradients in order to improve them (optimizer.step()).
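
These five steps map onto a loop like the sketch below; it reuses the model, loss_fn, optimizer, and data tensors assumed in the previous sketches.

```python
# Step 3 sketch: the PyTorch training loop, one line per item above.
epochs = 100
X_train, y_train = X_train.to(device), y_train.to(device)

for epoch in range(epochs):
    model.train()
    y_pred = model(X_train)            # 1. forward pass
    loss = loss_fn(y_pred, y_train)    # 2. calculate the loss
    optimizer.zero_grad()              # 3. zero gradients
    loss.backward()                    # 4. backpropagation
    optimizer.step()                   # 5. step the optimizer
```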

Libraries to use#

  • sklearn for ML models. This package is widely used and implements almost every classical ML model: Random Forest Regression,…

  • pytorch for deep learning models

  • There is also a package, skorch, that converts pytorch models to sklearn models, so some benefits of the sklearn library can be used. Read more; a sketch follows below.
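
A sketch of the skorch idea, assuming skorch's NeuralNetRegressor wrapper and the LinearModel module from the build-model sketch above.

```python
# Wrap a PyTorch module as an sklearn-style estimator with skorch.
import numpy as np
from skorch import NeuralNetRegressor

net = NeuralNetRegressor(LinearModel, max_epochs=20, lr=0.01)

X = np.random.rand(100, 3).astype(np.float32)   # toy data, as before
y = np.random.rand(100, 1).astype(np.float32)
net.fit(X, y)                          # sklearn-style API: fit / predict / score
```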