Machine Learning#

Overview of Machine Learning

We will learn how machine learning models data, typically with predictive functions. Machine learning includes many techniques, but here we will focus on only those necessary to transition into deep learning. For example, random forests, support vector machines, and nearest neighbors are widely used machine learning techniques that are effective but not covered here.

What about the model?

We want a model capable of handling our inputs and producing something in the shape of our outputs.

Big Data#

Additional Dimensions

  • Complexity: multiple sources and data streams

  • Variability

    • Unpredictable data flows

    • Social media trending

Why Big Data is important

  • Data contains information

  • Information leads to insights

  • Insights help in making better decisions

How to derive insights from data?

→ Machine Learning

Conclusions:

  • Data is nothing without insights

  • Machine Learning is the key to deriving insights from data

  • Big Data and Machine Learning have huge potential

Algorithms in ML#

The picture below shows an overview of machine learning:

https://github.com/thangckt/note_ml/blob/main/notebook/0_basic_MLDL/image/1_1_machine-learning.png?raw=1

Supervised Learning#

Given features, we want our model to predict labels. See more. A minimal scikit-learn sketch follows the list below.

  • Classification

    • Decision Trees

    • Naive Bayes Classification

  • Regression

    • Ordinary Least Squares Regression

    • Logistic Regression

    • Support Vector Machines

    • Ensemble Methods
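
As a concrete illustration, here is a minimal scikit-learn sketch of supervised classification with a decision tree; the dataset and hyperparameter choices are illustrative, not taken from this lesson.

```python
# Minimal supervised-learning sketch with scikit-learn (illustrative only):
# fit a decision tree classifier on labeled data, then score on held-out data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)                  # features (N, D), labels (N,)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(max_depth=3)          # max_depth is a hyperparameter
clf.fit(X_train, y_train)                          # learn from labeled data
print(clf.score(X_test, y_test))                   # accuracy on unseen data
```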

Unsupervised Learning#

There are no labels in this type of learning. A minimal scikit-learn sketch follows the list below.

  • Clustering

    • Centroid-based algorithms

    • Connectivity-based algorithms

    • Density-based algorithms

    • Probabilistic algorithms

  • Dimensionality Reduction

    • Principal Component Analysis

    • Independent Component Analysis

    • Singular Value Decomposition

  • Neural networks / Deep Learning
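
A minimal scikit-learn sketch of the two main ideas above, clustering (centroid-based KMeans) and dimensionality reduction (PCA), on random data; the shapes and numbers are illustrative.

```python
# Minimal unsupervised-learning sketch (illustrative): no labels are used.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                    # unlabeled features (N=100, D=5)

labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)  # centroid-based clustering
X_2d = PCA(n_components=2).fit_transform(X)              # reduce D=5 -> D=2
print(labels[:10], X_2d.shape)
```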

Reinforcement Learning#

In reinforcement learning there are no fixed labels: an agent learns by interacting with an environment, taking actions and receiving rewards, with the goal of maximizing cumulative reward.

The Ingredients#

Machine learning is the fitting of models \(\hat{f}(\vec{x})\) to data \((\vec{x}, y)\) that we know came from some "data generation" process \(f(\vec{x})\). First, some definitions (a small numpy sketch follows them):

Features

    set of \(N\) vectors \(\{\vec{x}_i\}\) of dimension \(D\). Can be reals, integers, etc.

Labels

    set of \(N\) integers or reals \(\{y_i\}\). \(y_i\) is usually a scalar

Labeled Data

    set of \(N\) tuples \(\{\left(\vec{x}_i, y_i\right)\}\)

Unlabeled Data

    set of \(N\) features \(\{\vec{x}_i\}\) that may have unknown \(y\) labels

Data generation process

    The unseen process \(f(\vec{x})\) that takes a given feature vector in and returns a real label \(y\) (what we’re trying to model)

Model

    A function \(\hat{f}(\vec{x})\) that takes a given feature vector in and returns a predicted \(\hat{y}\)

Predictions

     \(\hat{y}\), our predicted output for a given input \(\vec{x}\).
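
To make these definitions concrete, here is a small numpy sketch; the linear data generation process and all names are assumptions for illustration only.

```python
# Mapping the definitions above to code (illustrative names and a toy f).
import numpy as np

N, D = 100, 3
rng = np.random.default_rng(42)

X = rng.normal(size=(N, D))          # features: N vectors of dimension D
w_true = np.array([1.0, -2.0, 0.5])  # hidden parameters of the toy process

def f(x):                            # data generation process (normally unseen)
    return x @ w_true

y = f(X)                             # labels: (X, y) together are labeled data

def f_hat(x, w):                     # model: returns a predicted y_hat
    return x @ w

y_hat = f_hat(X, np.zeros(D))        # predictions of an untrained model
```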

Note

The content in this part is primarily from:

See also

  1. Introductory Machine Learning

  2. Two reviews of machine learning in materials

  3. A review of machine learning in computational chemistry

  4. A review of machine learning in metals

Terminologies in ML#

  • The patterns: the learned parameters in the model, i.e., the parameters to be found in the relationship between inputs and outputs. For example, in the linear model \(y = ax + b\), the learned patterns are the weight \(a\) and the bias \(b\); see the sketch after this list.

  • Hidden units: neurons in hidden layers

  • Hyperparameters: all user-chosen parameters of the model (e.g., learning rate, number of layers, number of neurons per layer,…)

  • Epoch: one complete pass through the entire training dataset

  • Loss function: measures how wrong your model's predictions are. The higher the loss, the worse your model. It is sometimes called the "loss criterion", "criterion", or "cost function".
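
The sketch below ties these terms together for the linear model \(y = ax + b\): the patterns are \(a\) and \(b\), and the mean squared error (one common loss function, assumed here) measures how wrong the predictions are.

```python
# Patterns and loss for the linear model y = a*x + b (toy numbers).
import numpy as np

x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + 1.0                    # ground truth generated with a=2, b=1

a, b = 0.0, 0.0                      # the patterns (parameters) to be found
y_pred = a * x + b

loss = np.mean((y_pred - y) ** 2)    # MSE loss: the higher, the worse the model
print(f"MSE loss: {loss:.3f}")
```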

Workflow in ML#

This workflow works with PyTorch. See this lesson.

1. Prepare data#

  1. Prepare inputs and outputs in a format suitable for the ML framework to be used (e.g., PyTorch only works with data in the form of torch.Tensor)

  2. Split the data into train and test sets (sometimes: train, validation, and test); see the sketch after this list
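
A minimal sketch of this step, assuming the raw data are numpy arrays; the shapes and the 80/20 split are illustrative.

```python
# Step 1 sketch: convert numpy data to torch.Tensor and split train/test.
import numpy as np
import torch

X = np.random.rand(100, 3).astype(np.float32)   # toy inputs (N=100, D=3)
y = np.random.rand(100, 1).astype(np.float32)   # toy outputs

X_t = torch.from_numpy(X)                       # PyTorch works on torch.Tensor
y_t = torch.from_numpy(y)

n_train = int(0.8 * len(X_t))                   # simple 80/20 split
X_train, y_train = X_t[:n_train], y_t[:n_train]
X_test, y_test = X_t[n_train:], y_t[n_train:]
```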

2. Build model#

  1. Constructing a model by subclassing nn.Module

  2. Defining a loss function and optimizer.

One more step to consider: setting up device-agnostic code (so our model can run on the CPU, or on the GPU if one is available).
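
A minimal sketch of this step, including the device-agnostic setup; the model architecture is an assumed toy example.

```python
# Step 2 sketch: subclass nn.Module, pick a loss function and an optimizer.
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"   # device-agnostic

class LinearModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(in_features=3, out_features=1)

    def forward(self, x):                       # called during the forward pass
        return self.linear(x)

model = LinearModel().to(device)
loss_fn = nn.MSELoss()                          # loss criterion
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
```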

3. Train model#

PyTorch steps in training (a loop sketch follows this list):

  1. Forward pass - The model goes through all of the training data once, performing its forward() function calculations (compute model(x_train)).

  2. Calculate the loss - The model’s outputs (predictions) are compared to the ground truth and evaluated to see how wrong they are (loss = loss_fn(y_pred, y_train)).

  3. Zero gradients - The optimizer's gradients are set to zero (they accumulate by default) so they can be recalculated for the current training step (optimizer.zero_grad()).

  4. Perform backpropagation on the loss - Computes the gradient of the loss with respect to every model parameter to be updated (each parameter with requires_grad=True). This is known as backpropagation, hence "backwards" (loss.backward()).

  5. Step the optimizer (gradient descent) - Update the parameters with requires_grad=True with respect to the loss gradients in order to improve them (optimizer.step()).
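
These five steps map onto a loop like the sketch below; it reuses the model, loss_fn, optimizer, and data tensors assumed in the previous sketches.

```python
# Step 3 sketch: the PyTorch training loop, one line per item above.
epochs = 100
X_train, y_train = X_train.to(device), y_train.to(device)

for epoch in range(epochs):
    model.train()
    y_pred = model(X_train)            # 1. forward pass
    loss = loss_fn(y_pred, y_train)    # 2. calculate the loss
    optimizer.zero_grad()              # 3. zero gradients
    loss.backward()                    # 4. backpropagation
    optimizer.step()                   # 5. step the optimizer
```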

Libraries to use#

  • sklearn for ML models. This package is widely used and implements almost every classical ML model: Random Forest Regression,…

  • pytorch for deep learning models

  • There is also a package, skorch, that converts pytorch models to sklearn models, so some benefits of the sklearn library can be used. Read more; a sketch follows below.
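
A sketch of the skorch idea, assuming skorch's NeuralNetRegressor wrapper and the LinearModel module from the build-model sketch above.

```python
# Wrap a PyTorch module as an sklearn-style estimator with skorch.
import numpy as np
from skorch import NeuralNetRegressor

net = NeuralNetRegressor(LinearModel, max_epochs=20, lr=0.01)

X = np.random.rand(100, 3).astype(np.float32)   # toy data, as before
y = np.random.rand(100, 1).astype(np.float32)
net.fit(X, y)                          # sklearn-style API: fit / predict / score
```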