Hello everyone! As the title suggests, in this blog we will go through the different types of machine learning and the processes involved in a machine learning project. But first, let's understand how machine learning works.
How does Machine Learning Work?
- The Machine Learning algorithm is trained using a training dataset and a model is created.
- Then the trained model works on a test dataset and generates predictions.
- The prediction is evaluated for accuracy and if the accuracy is acceptable, the Machine Learning model is deployed. If the accuracy is not acceptable, then the Machine Learning model is trained with other techniques.
- The type of technique (or algorithm) of ML used in a particular project depends on the desired output.
- The process of selecting better techniques starts with examining the dataset.
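The loop above (train, predict, evaluate, deploy or retrain) can be sketched in a few lines of code. This is a minimal illustration only: the toy dataset, the threshold "model", and the 0.9 accuracy cutoff are all invented for the example.

```python
# Minimal sketch of the train -> predict -> evaluate -> deploy loop.
# The "model" here is just a decision threshold learned from labeled examples.

train_data = [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 1)]  # (feature, label)
test_data = [(1.5, 0), (3.5, 1)]                       # unseen data

def train(data):
    """Learn a threshold: the midpoint between the two class means."""
    zeros = [x for x, y in data if y == 0]
    ones = [x for x, y in data if y == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2

def predict(threshold, x):
    return 1 if x >= threshold else 0

def evaluate(threshold, data):
    correct = sum(predict(threshold, x) == y for x, y in data)
    return correct / len(data)

threshold = train(train_data)               # 1. train on the training set
accuracy = evaluate(threshold, test_data)   # 2. generate and score predictions

if accuracy >= 0.9:                         # 3. deploy only if acceptable
    print("deploy model, accuracy =", accuracy)
else:
    print("retrain with other techniques")
```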
TYPES OF DATA
In general, there are different ways to train a machine learning model, each with its own advantages and disadvantages. To understand the pros and cons of each type of machine learning, the first step is to look at the kind of data it ingests. Here, we mostly have two kinds of data — labeled data and unlabeled data.
Labeled data has both the input and output parameters in a completely machine-readable format, but it requires a lot of human labor to label the data to begin with. Unlabeled data, on the other hand, has only one or none of the parameters in a machine-readable form. This eliminates the need for human labor but requires more complex solutions. So, to choose the right approach, we first examine the type of data.
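To make the distinction concrete, here is how the two kinds of data might look in code. The features and labels are invented purely for illustration:

```python
# Labeled data: each input comes with a known output (the label).
labeled = [
    {"height_cm": 180, "weight_kg": 80, "label": "adult"},
    {"height_cm": 120, "weight_kg": 25, "label": "child"},
]

# Unlabeled data: inputs only; the algorithm must find structure on its own.
unlabeled = [
    {"height_cm": 175, "weight_kg": 70},
    {"height_cm": 110, "weight_kg": 20},
]

print(all("label" in row for row in labeled))        # every row is labeled
print(any("label" in row for row in unlabeled))      # no row is labeled
```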
TYPES OF MACHINE LEARNING ALGORITHMS
Supervised learning is one of the most basic types of machine learning. In supervised learning, you can think of the learning as being guided by a teacher: we have a labeled dataset that acts as the teacher, and its role is to train the model. Even though the data needs to be labeled accurately for this method to work, supervised learning is extremely powerful when used in the right circumstances.
In supervised learning, the ML algorithm is given a small training dataset to work with. This training dataset is a smaller part of the bigger dataset and serves to give the algorithm a basic idea of the problem, solution, and data points to be dealt with. The training dataset is also very similar to the final dataset in its characteristics and provides the algorithm with the labeled parameters required for the problem.
The algorithm then finds relationships between the parameters given, essentially establishing a cause and effect relationship between the variables in the dataset. At the end of the training, the algorithm has an idea of how the data works and the relationship between the input and the output.
This solution is then deployed for use with the final dataset, which it learns from in the same way as the training dataset. This means that supervised machine learning algorithms can continue to improve even after being deployed, discovering new patterns and relationships as they train themselves on new data.
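A tiny supervised example: given labeled (input, output) pairs, fit the line y = w·x + b using one-variable least squares. The four data points are made up so that the true relationship is y = 2x + 1.

```python
# Supervised learning sketch: learn the relationship between labeled
# inputs (xs) and outputs (ys) with one-variable least squares.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]   # labels; here y = 2x + 1 exactly

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form least-squares solution for a single feature.
w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
b = mean_y - w * mean_x

print("learned:", w, b)          # the cause-and-effect relationship
print("prediction for x=5:", w * 5.0 + b)
```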
Unsupervised machine learning holds the advantage of being able to work with unlabeled data. This means that human labor is not required to make the dataset machine-readable, allowing much larger datasets to be worked on by the program. The model learns through observation and finds structures in the data. Once the model is given a dataset, it automatically finds patterns and relationships in the dataset by creating clusters in it.
In supervised learning, the labels allow the algorithm to find the exact nature of the relationship between any two data points. However, unsupervised learning does not have labels to work off of, resulting in the creation of hidden structures. Relationships between data points are perceived by the algorithm abstractly, with no input required from human beings.
The creation of these hidden structures is what makes unsupervised learning algorithms versatile. Instead of a defined and set problem statement, unsupervised learning algorithms can adapt to the data by dynamically changing hidden structures. This offers more post-deployment development than supervised learning algorithms.
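Clustering is the classic illustration of this. Below is a minimal 1-D k-means sketch with k=2; the unlabeled points and the naive initialization are invented for the example.

```python
# Unsupervised learning sketch: 1-D k-means with k=2 on unlabeled points.
points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.8]   # no labels; two obvious groups
centers = [points[0], points[3]]           # naive initialization

for _ in range(10):
    # Assign each point to its nearest center.
    clusters = [[], []]
    for p in points:
        idx = 0 if abs(p - centers[0]) <= abs(p - centers[1]) else 1
        clusters[idx].append(p)
    # Move each center to the mean of its cluster.
    centers = [sum(c) / len(c) for c in clusters]

print(sorted(centers))   # the discovered "hidden structure"
```

No human ever told the algorithm which group each point belongs to; the two clusters emerge from the data itself.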
Reinforcement learning directly takes inspiration from how human beings learn from data in their lives. It features an algorithm that improves upon itself and learns from new situations using a trial-and-error method. Favorable outputs are encouraged or ‘reinforced’, and non-favorable outputs are discouraged or ‘punished’.
Based on the psychological concept of conditioning, reinforcement learning works by putting the algorithm in a work environment with an interpreter and a reward system. In every iteration of the algorithm, the output result is given to the interpreter, which decides whether the outcome is favorable or not.
In the case of the program finding the correct solution, the interpreter reinforces the solution by providing a reward to the algorithm. If the outcome is not favorable, the algorithm is forced to reiterate until it finds a better result. In most cases, the reward system is directly tied to the effectiveness of the result.
In typical reinforcement learning use-cases, such as finding the shortest route between two points on a map, the solution is not an absolute value. Instead, it takes on a score of effectiveness, expressed in a percentage value. The higher this percentage value is, the more reward is given to the algorithm. Thus, the program is trained to give the best possible solution for the best possible reward. And again, once trained it gets ready to predict the new data presented to it.
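The reward-and-reinforce loop can be sketched with tabular Q-learning, one of the simplest reinforcement learning algorithms. The environment here is a made-up 4-cell corridor where the agent starts at cell 0 and is rewarded only for reaching cell 3; all hyperparameters are chosen arbitrarily for illustration.

```python
import random

# Reinforcement learning sketch: tabular Q-learning on a 4-cell corridor.
N_STATES, GOAL = 4, 3
ACTIONS = [-1, +1]                        # move left / move right
q = [[0.0, 0.0] for _ in range(N_STATES)]
alpha, gamma, epsilon = 0.5, 0.9, 0.2     # learning rate, discount, exploration

random.seed(0)
for _ in range(100):                      # episodes of trial and error
    state = 0
    while state != GOAL:
        if random.random() < epsilon:
            a = random.randrange(2)                        # explore
        else:
            a = max((0, 1), key=lambda i: q[state][i])     # exploit
        nxt = min(max(state + ACTIONS[a], 0), N_STATES - 1)
        reward = 1.0 if nxt == GOAL else 0.0               # the "interpreter"
        # Reinforce: nudge the value of (state, action) toward the reward.
        q[state][a] += alpha * (reward + gamma * max(q[nxt]) - q[state][a])
        state = nxt

# After training, the greedy policy moves right in every non-goal state.
policy = [max((0, 1), key=lambda i: q[s][i]) for s in range(GOAL)]
print(policy)
```

Favorable moves (toward the goal) accumulate higher Q-values and get chosen more often, which is exactly the "reinforced vs. punished" dynamic described above.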
Now that we know the types of models in machine learning, let's look at the processes involved in a typical machine learning (or data science) project.
PROCESSES OF MACHINE LEARNING
1 — Data Collection
- The quantity & quality of your data determines the accuracy of your model.
- Using pre-collected data, such as datasets from Kaggle, UCI, etc., still fits into this step.
2 — Data Preparation
- Wrangle data and prepare it for training.
- Cleaning data may require some of these techniques — remove duplicates, correct errors, deal with missing values, normalization, data type conversions, etc.
- Randomize data, which erases the effects of the particular order in which our data was collected and/or prepared.
- Visualize data to help detect relevant relationships between variables or class imbalances or perform other exploratory analyses.
- Split into training and evaluation sets
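The cleaning, randomizing, and splitting steps above can be sketched together. The raw rows below (including the duplicate and the missing value) are invented for the example, and the 80/20 split is just one common choice.

```python
import random

# Data-preparation sketch: dedupe, drop missing values, normalize,
# shuffle, and split a toy dataset of (feature, label) rows.
raw = [(1.0, 0), (2.0, 0), (2.0, 0), (None, 1), (3.0, 1),
       (4.0, 1), (5.0, 1), (6.0, 1)]

rows = list(dict.fromkeys(raw))                      # remove duplicates
rows = [(x, y) for x, y in rows if x is not None]    # deal with missing values

xs = [x for x, _ in rows]
lo, hi = min(xs), max(xs)
rows = [((x - lo) / (hi - lo), y) for x, y in rows]  # min-max normalization

random.seed(42)
random.shuffle(rows)                                 # randomize the order

split = int(0.8 * len(rows))                         # 80/20 train/eval split
train, evaluation = rows[:split], rows[split:]
print(len(train), "training rows,", len(evaluation), "evaluation rows")
```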
3 — Choose a Model
- Different algorithms suit different tasks; choose the right one for your problem.
4 — Train the Model
- The goal of training is to answer a question or make a prediction correctly as often as possible.
- Linear regression example: the algorithm would need to learn values for m (or W) and b in y = m·x + b, where x is the input and y is the output.
- Each iteration of the process is a training step.
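A training loop for the linear regression example above can be sketched with plain gradient descent. The data points and learning rate are made up; the true relationship here is y = 3x + 2.

```python
# Training sketch: learn W and b in y = W*x + b by gradient descent.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0, 5.0, 8.0, 11.0]   # generated from y = 3x + 2

W, b = 0.0, 0.0
lr = 0.05                    # learning rate

for step in range(2000):     # each iteration is one training step
    # Gradients of mean squared error with respect to W and b.
    grad_W = sum(2 * (W * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (W * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    W -= lr * grad_W
    b -= lr * grad_b

print("learned W, b:", round(W, 3), round(b, 3))
```

Each step nudges W and b in the direction that reduces the prediction error, so the answers get "correct as often as possible" over time.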
5 — Evaluate the Model
- Uses some metric or combination of metrics to “measure” the objective performance of the model.
- Test the model against previously unseen data.
- This unseen (validation) data is meant to be somewhat representative of how the model will perform in the real world, but it still helps tune the model (as opposed to test data, which does not).
- A good train/test split ratio is 80/20, 70/30, or similar, depending on the domain, data availability, dataset particulars, etc.
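A minimal example of measuring a metric on unseen data; here the metric is accuracy, and both the true and predicted labels are invented for illustration.

```python
# Evaluation sketch: compare predictions against true labels on unseen data.
y_true = [0, 1, 1, 0, 1, 0, 1, 1, 0, 1]   # ground-truth labels
y_pred = [0, 1, 0, 0, 1, 0, 1, 1, 1, 1]   # the model's predictions

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
print(accuracy)   # 8 of 10 predictions match
```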
6 — Parameter Tuning
- This step refers to hyperparameter tuning, which is an “art form” as opposed to a science.
- Tune model parameters for improved performance.
- Hyperparameters of a simple model may include the number of training steps (epochs), the learning rate, initialization values and distributions, etc.
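A simple grid search over the two hyperparameters just mentioned, learning rate and number of training steps, reusing the gradient-descent line fit (data and candidate values are made up for the example):

```python
# Hyperparameter-tuning sketch: try several (learning rate, steps) pairs
# and keep the configuration with the lowest final training loss.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0, 5.0, 8.0, 11.0]   # generated from y = 3x + 2

def train_and_score(lr, steps):
    W, b = 0.0, 0.0
    for _ in range(steps):
        grad_W = sum(2 * (W * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
        grad_b = sum(2 * (W * x + b - y) for x, y in zip(xs, ys)) / len(xs)
        W -= lr * grad_W
        b -= lr * grad_b
    # Mean squared error after training.
    return sum((W * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

grid = [(lr, steps) for lr in (0.001, 0.01, 0.05) for steps in (100, 1000)]
best = min(grid, key=lambda cfg: train_and_score(*cfg))
print("best (learning rate, steps):", best)
```

In practice the score should come from a validation set rather than the training data, and libraries offer more systematic tools, but the "try configurations, keep the best" idea is the same.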
7 — Make Predictions
- Now, we make our model predict on new data or the test dataset. The predictions are compared to the known outputs, and an accuracy score is generated.
This brings us to the end of this blog. Make sure you follow me for upcoming blogs on this topic.