Reinforcement Learning — Reward-Oriented Intelligence
Introduction to Reinforcement Learning with deep intuition of Markov Decision Process (MDP), Markov Chains and Discounted Rate γ.
As we can see in the image above, the robot is thinking about what to do next. That is essentially Reinforcement Learning, i.e. making computers learn by making various decisions. Let’s look at the definition:
Reinforcement learning (RL) is an area of machine learning concerned with how software agents ought to take actions in an environment in order to maximize the notion of cumulative reward. Reinforcement learning is one of three basic machine learning paradigms, alongside supervised learning and unsupervised learning.
You might find this definition difficult to digest, but don’t worry; even I find formal definitions hard to parse. So let me put it simply: Reinforcement Learning is a type of Machine Learning in which the computer learns from its environment, gets a reward on successfully completing a task, and aims to maximize the total reward by the end of all tasks. In earlier blogs, I have already covered the supervised and unsupervised Machine Learning algorithms with math intuition, and now it’s time to learn reinforcement learning.
Reinforcement Learning has great scope in the future; it is often described as the hope of true artificial intelligence. It is growing rapidly, producing a wide variety of learning algorithms for different applications, so it is important to be familiar with its techniques.
Terms in Reinforcement Learning
- Agent. The program you train to perform specific tasks is an agent.
- Environment. The surroundings (real or virtual) in which the agent performs actions.
- Action. A move made by the agent, which causes a status change in the environment.
- Rewards. The evaluation or score of an action performed by an agent, which can be positive or negative.
We can understand this terminology by looking at a reinforcement learning robot; it will surely be interesting.
This is basically a plastic-cleaning robot whose main aim is to collect plastic garbage from the floor. The robot works this way:
- Gets +10 points when it successfully picks up a piece of plastic.
- Gets -10 when it hits a person.
- Gets -50 when it falls off.
- Gets +50 when it successfully collects all the garbage within the desired time.
Here, our Robot is the Agent, the room’s Floor is the Environment, picking up garbage is an Action, and the points earned are the Rewards.
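To make the terminology concrete, here is a minimal sketch of how this reward scheme could be written down in code. The event names are hypothetical labels I made up for the situations above; the point values are the ones from the example.

```python
# A minimal sketch of the plastic-cleaning robot's reward scheme.
# The event names are hypothetical labels for the situations described above.
REWARDS = {
    "picked_plastic": +10,   # successfully picks up a piece of plastic
    "hit_person":     -10,   # bumps into a person
    "fell_off":       -50,   # falls off
    "all_collected":  +50,   # collects all garbage within the desired time
}

def reward_for(event: str) -> int:
    """Return the reward the agent receives for a given event."""
    return REWARDS.get(event, 0)  # neutral events give 0 reward

# Example: the events of one run and the total reward collected
events = ["picked_plastic", "picked_plastic", "hit_person", "all_collected"]
total_reward = sum(reward_for(e) for e in events)
print(total_reward)  # 10 + 10 - 10 + 50 = 60
```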
So far, we are done with the basic definition and the terminology.
Characteristics of Reinforcement Learning(RL)
RL does exploitation: RL's main aim is to maximize its rewards, so it exploits the actions it already knows to be rewarding.
RL is dynamic in nature: RL has no single fixed answer (output); its actions depend on its surroundings. Most of the time, RL acts by trial and error, which makes it dynamic.
RL requires exploration: RL needs to gather information about its environment by exploring in order to perform better. In general, RL has to balance exploration against exploitation (see the sketch after this list).
RL is a multi-decision system: RL forms a chain of decisions to perform a given task, taking a decision at each stage on the way to completing the work.
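To see how the exploration-exploitation balance is often handled in practice, here is a minimal sketch of ε-greedy action selection. The action names, value estimates, and the ε value are invented purely for illustration.

```python
import random

def epsilon_greedy(action_values: dict, epsilon: float = 0.1) -> str:
    """Pick an action: explore with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        # Exploration: try a random action to gather new information
        return random.choice(list(action_values))
    # Exploitation: pick the action currently believed to be best
    return max(action_values, key=action_values.get)

# Hypothetical value estimates for three actions
q_estimates = {"move_left": 1.2, "move_right": 0.4, "pick_up": 2.7}
print(epsilon_greedy(q_estimates, epsilon=0.1))  # usually "pick_up", occasionally random
```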
These were the basic characteristics of RL; I hope you got them well. Further in the blog, we will discuss the decision process of RL.
Introducing the Markov Decision Process (MDP)
Markov decision processes give us a way to formalize sequential decision-making. This formalization is the basis for structuring problems that are solved with reinforcement learning. This can be designed as:
- Set of states, S
- Set of actions, A
- Reward function, R
- Policy, π
- Value, V
This process of selecting an action(A) from a given state(S1), transitioning to a new state(S2), and receiving a reward(R1) happens sequentially over and over again, which creates something called a trajectory that shows the sequence of states, actions, and rewards.
The set of actions we take defines our policy (π), and the rewards we get in return define our value (V). Our task here is to maximize our rewards by choosing the correct policy.
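As a rough sketch of this interaction loop, the snippet below collects a trajectory of (state, action, reward) tuples. The environment here is a made-up stand-in with a step() method, not a real library API; it only illustrates the state, action, reward cycle described above.

```python
import random

class ToyEnvironment:
    """A made-up three-state environment used only to illustrate the MDP loop."""
    def __init__(self):
        self.state = "S1"

    def step(self, action: str):
        """Apply an action, move to a random next state, and return (next_state, reward)."""
        next_state = random.choice(["S1", "S2", "S3"])
        reward = 1 if action == "good_action" else 0
        self.state = next_state
        return next_state, reward

def policy(state: str) -> str:
    """A trivial policy: always choose the same action regardless of state."""
    return "good_action"

env = ToyEnvironment()
trajectory = []                 # the sequence of (state, action, reward) tuples
state = env.state
for t in range(5):              # five time steps
    action = policy(state)                  # select action A from state S
    next_state, reward = env.step(action)   # environment transitions and grants reward R
    trajectory.append((state, action, reward))
    state = next_state

print(trajectory)
print("sum of rewards along this trajectory:", sum(r for (_, _, r) in trajectory))
```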
The classic agent-environment interaction diagram illustrates this entire idea. Let’s break it down into steps.
- At time T1, the environment is in its initial state S1.
- The agent observes the current state and selects action A1.
- The action transitions the environment from state S1 to state S2.
- The environment grants the agent reward R1 for this transition.
- This process then starts over for the next time step T2, now with the initial state S2.
Note: This process continues until the final goal is reached. At each stage, a new state is formed and a new decision is taken, finally forming a Markov Chain.
A Markov Chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.
p(S2, R1 | S1, A1) = Pr{ S(t+1) = S2, R(t+1) = R1 | S(t) = S1, A(t) = A1 }, where t is the current time step.
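Here is a minimal sketch of what such transition probabilities can look like in code. The states, rewards, and probabilities below are invented purely for illustration; the point is that the distribution over (next state, reward) depends only on the current state and action.

```python
import random

# Hypothetical dynamics p(next_state, reward | state, action):
# each (state, action) pair maps to a list of (next_state, reward, probability) triples.
TRANSITIONS = {
    ("S1", "A1"): [("S2", 1, 0.8), ("S1", 0, 0.2)],
    ("S2", "A1"): [("S3", 5, 0.5), ("S1", 0, 0.5)],
}

def sample_next(state: str, action: str):
    """Sample (next_state, reward) using only the current state and action (Markov property)."""
    outcomes = TRANSITIONS[(state, action)]
    r = random.random()
    cumulative = 0.0
    for next_state, reward, prob in outcomes:
        cumulative += prob
        if r <= cumulative:
            return next_state, reward
    return outcomes[-1][:2]  # guard against floating-point rounding

print(sample_next("S1", "A1"))  # e.g. ('S2', 1) with probability 0.8
```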
Expected Return (G): The agent's goal is to maximize the expected return, i.e. the sum of all the rewards it receives.
G = R1 + R2 + R3 + ….. + R(T), where T is the final time step
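As a tiny sketch, the (undiscounted) return is just the sum of the rewards collected in an episode; the reward values here are arbitrary.

```python
# Hypothetical rewards R1..RT collected during one episode
rewards = [10, 10, -10, 50]

# G = R1 + R2 + ... + RT
G = sum(rewards)
print(G)  # 60
```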
Episodic vs continuous tasks
Episodic tasks are tasks that have a terminal state (an end). In RL, an episode is one agent-environment interaction from the initial state to the final state.
For example, in a car racing video game, you start the game (initial state) and play the game until it is over (final state). This is called an episode. Once the game is over, you start the next episode by restarting the game, and you will begin from the initial state irrespective of the position you were in the previous game. So, each episode is independent of the other.
In a continuous task, there is no terminal state. Continuous tasks will never end. For example, a personal assistance robot does not have a terminal state.
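In code, the difference usually shows up as whether the interaction loop ever terminates and resets. A rough sketch, with the per-step reward and the terminal condition simulated by random numbers purely for illustration:

```python
import random

def run_episode(max_steps: int = 100) -> float:
    """Run one episodic task: interact until a terminal state is reached, then stop."""
    total_reward = 0.0
    for t in range(max_steps):
        reward = random.random()      # stand-in for the reward from one step
        total_reward += reward
        if random.random() < 0.1:     # stand-in for reaching the terminal state
            break                     # the episode ends here
    return total_reward

# Episodic: each call starts from the initial state, independent of previous episodes
returns = [run_episode() for _ in range(3)]
print(returns)

# Continuous: there is no terminal state, so the loop would simply never break
# (e.g. a personal-assistance robot that keeps running indefinitely).
```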
Now, in order to focus on maximizing the return, we need to take care of how the current action will affect the future. For that, we introduce the discounted rate.
Discounted Return
Rather than maximizing the expected return of rewards, the agent’s goal will instead be to maximize the expected discounted return of rewards.
The discount factor essentially determines how much the reinforcement learning agent cares about rewards in the distant future relative to those in the immediate future. If γ=0, the agent will be completely myopic and only learn about actions that produce an immediate reward; if γ is close to 1, the agent weighs future rewards almost as heavily as immediate ones. The value of γ usually lies between 0 and 1, i.e. [0, 1].
Thus the value of G changes as follows:
G = R1 + γR2 + γ²R3 + γ³R4 + ….
G = R1 + γ(R2 + γR3 + γ²R4 + ….)
G = R1 + γG1, where G1 = R2 + γR3 + γ²R4 + …. is the discounted return starting from the next time step.
Even though this is an infinite sum, it yields a finite result as long as γ < 1 and the rewards are bounded. For example, if every reward equals 1, the sum becomes the geometric progression 1 + γ + γ² + …, which converges to 1/(1-γ).
While the agent does consider all of the expected future rewards when selecting an action, the more immediate rewards influence it more strongly than rewards expected further in the future, because of the discount rate.
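A minimal sketch of computing the discounted return, both directly and via the recursive form G = R1 + γG1; the reward sequence and the value of γ are arbitrary.

```python
GAMMA = 0.9  # discount factor, chosen arbitrarily within [0, 1]

def discounted_return(rewards: list) -> float:
    """G = R1 + γ·R2 + γ²·R3 + ..., computed directly."""
    return sum((GAMMA ** k) * r for k, r in enumerate(rewards))

def discounted_return_recursive(rewards: list) -> float:
    """The same value via the recursion G = R1 + γ·G1."""
    if not rewards:
        return 0.0
    return rewards[0] + GAMMA * discounted_return_recursive(rewards[1:])

rewards = [1, 1, 1, 1, 1]
print(discounted_return(rewards))            # 1 + 0.9 + 0.81 + ...
print(discounted_return_recursive(rewards))  # identical result

# With an infinite stream of rewards equal to 1, the sum would converge to 1/(1-γ):
print(1 / (1 - GAMMA))  # ≈ 10, the geometric-series limit
```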
Summary
- First, we discussed the meaning of Reinforcement Learning.
- We learned major terminology with an example.
- Saw the characteristics of RL.
- Then completed the major part of the Markov decision process.
- Finally, ended up with the discounted rate.
Next time we’ll build on the ideas from our introduction to MDPs and discounted return to see how we can measure “how good” any particular state or any particular action is for the agent under a given policy. I’ll see ya in the next one!