January 10, 2018

Gradient descent is one of the most important concepts in machine learning. It is at the heart of many techniques that give machines the power to optimize current solutions - to literally “learn” to do things better. In this post, I’ll explain what gradient descent actually is, and implement it in Python from scratch.

The goal of machine learning is to let machines figure out how to make the best predictions on their own. This is achieved by crunching numbers over a training dataset and using an algorithm to generate the most accurate prediction model it can.

But how could a machine know which model is the most accurate one? Well, we often define a loss/cost function to measure how inaccurate our model’s predictions are - when we have the minimum loss possible, we should be able to say that our model is the most accurate one.

Let’s say we have a single-variable quadratic loss function (our prediction model has only one parameter) like in the image above, and we want to find its minimum. A machine can’t spot the valley in the plot intuitively like you and I can, so one idea is to simply check and compare every point on that function. Yes, this is comprehensive, but it’s impractical in the real world: we almost always have multiple variables instead of just one, so the problem quickly becomes intractable. So what can we do? Don’t panic, *gradient descent* comes to the rescue!

Gradient descent is all about finding the best parameters for our prediction function that minimizes the loss. The idea is pretty simple:

- Calculate the gradient (slope) at the current position on our loss function, to see which direction heads down into the valley.
- Move in that direction for some distance.
- Repeat the two previous steps until moving forward no longer heads downhill. That position is then our approximated (local) minimum of the function.
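Before we touch any real data, the three steps above can be sketched in a few lines of Python on a toy one-variable loss (the function and names here are just illustrative):

```python
def toy_gradient_descent(x_start, learning_rate=0.1, steps=100):
    """Minimize f(x) = (x - 3)**2, whose slope at x is 2 * (x - 3)."""
    x = x_start
    for _ in range(steps):
        gradient = 2 * (x - 3)         # step 1: slope at the current position
        x -= learning_rate * gradient  # step 2: move downhill some distance
    return x                           # step 3 is the loop itself

print(toy_gradient_descent(0.0))  # approaches 3.0, the true minimum
```

Starting from 0, each step moves a fraction of the way toward the valley at $x=3$, which is exactly the behavior we will build for real below.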

But how do you actually use this concept in practice? I love learning by doing, so let’s implement gradient descent in Python on some real-life data!

There is an interesting dataset on Kaggle, which contains sales data for more than 16,500 games from 1980 to 2016, along with each game’s platform, publisher and other info. It seems obvious that a game’s regional sales numbers should correlate strongly with its degree of global success. Let’s explore that!

Although various gradient descent algorithms can be found in almost every powerful ML library out there, we will implement vanilla gradient descent from scratch for learning purposes. The Xbox One has been a very popular gaming console from Microsoft since its initial release in 2013, so we should have lots of titles and sales data available here. Let’s see if we can fit North American sales and total worldwide sales into a linear regression model. We’ll be using NumPy and Pandas for data wrangling, and Matplotlib for visualization.

The first step of any data analysis or ML project is, of course, importing the data. You may need to sign up for a Kaggle account before downloading the dataset, but once the CSV file is ready, it’s time to fire up a Jupyter Notebook (strongly recommended) or a code editor!

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# assuming the dataset is placed under the ~/projects directory
data_raw = pd.read_csv('~/projects/vgsales.csv')
print(data_raw.info())
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16598 entries, 0 to 16597
Data columns (total 11 columns):
Rank 16598 non-null int64
Name 16598 non-null object
Platform 16598 non-null object
Year 16327 non-null float64
Genre 16598 non-null object
Publisher 16540 non-null object
NA_Sales 16598 non-null float64
EU_Sales 16598 non-null float64
JP_Sales 16598 non-null float64
Other_Sales 16598 non-null float64
Global_Sales 16598 non-null float64
dtypes: float64(6), int64(1), object(4)
memory usage: 1.4+ MB
None
```

This shows that we have 11 columns in our dataset in total. That’s a lot of data! We only care about `NA_Sales` and `Global_Sales` of Xbox One titles though, so let’s extract that information. First, let’s have a look at the `Platform` column and see how “Xbox One” is designated in this dataset. This kind of information (metadata like abbreviations, naming conventions, metrics…) normally comes with the dataset, but unfortunately that’s not the case this time.

```
# get the unique values inside the 'Platform' column (a pandas Series)
data_raw['Platform'].unique()
```

```
array(['Wii', 'NES', 'GB', 'DS', 'X360', 'PS3', 'PS2', 'SNES', 'GBA',
'3DS', 'PS4', 'N64', 'PS', 'XB', 'PC', '2600', 'PSP', 'XOne', 'GC',
'WiiU', 'GEN', 'DC', 'PSV', 'SAT', 'SCD', 'WS', 'NG', 'TG16', '3DO',
'GG', 'PCFX'], dtype=object)
```

Cool! It seems that `'XOne'` should be our Xbox One. We are now ready to make a copy of the data we want to work with, so the whole process will be cleaner. We will then make a scatter plot to see if `Global_Sales` does have some intuitive correlation with `NA_Sales`. We should always explore and play with the dataset visually before moving on, to build an intuition that can help us make decisions.

```
# select Xbox One rows
data_xone = data_raw.loc[data_raw['Platform'] == 'XOne']
# extract NA_Sales and Global_Sales columns,
# we'd also like to create 2 string "constants" for our column names
COL_NA = 'NA_Sales'
COL_G = 'Global_Sales'
data = data_xone[[COL_NA, COL_G]]
# let's plot!
plt.figure(figsize=(8,6))
plt.title('Xbox One Sales Exploration')
plt.xlabel('North American Sales in Millions')
plt.ylabel('Global Sales in Millions')
plt.scatter(data[COL_NA], data[COL_G], c='red', alpha=0.3)
plt.show()
```

Awesome! It already looks super clear to our human eyes/brains that those two attributes have a strong linear relationship!

From our sneak peek with the scatter plot, we know that `NA_Sales` and `Global_Sales` may fit nicely into a linear model. That is something like $y=m\cdot x+b$, where in this case $x$ is our `NA_Sales` data, and $y$ is the `Global_Sales` we want to predict based on $x$. Our mission now is to find the combination of parameters - slope $m$ and y-intercept $b$ - which makes our line fit all the data points as closely as possible.

To evaluate how good our parameters are, we will need to define a loss function. Here we use the “mean squared error” (MSE), a simple function defined as below.

$MSE=\dfrac {1}{n}\sum ^{n}_{i=1}\left( P_{i}-\hat {P}_{i}\right)^{2}$

Note that $\hat{P}$ here is our global sales prediction, and $P$ is the actual global sales number of the sample. In Python, we could implement the MSE as below. I’m going to write two functions here: one for making predictions, one for calculating the MSE.

```
def global_sales_estimator(m, b):
    """Returns an estimator function with m and b baked in"""
    def predict(na_sales):
        return m * na_sales + b
    return predict

def calc_mse(estimator_func, dataframe, label_input, label_real):
    """
    To predict global sales, let label_input='NA_Sales'
    and label_real='Global_Sales'
    """
    total_loss = 0.0
    for index, row in dataframe.iterrows():
        p_hat = estimator_func(row[label_input])  # prediction for this sample
        loss = (row[label_real] - p_hat) ** 2     # squared error
        total_loss += loss
    return total_loss / dataframe.shape[0]        # mean over all samples
```
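As a side note, looping with `iterrows` is easy to follow but slow on big frames. If you’re comfortable with NumPy, the same MSE can be computed in one vectorized expression - here’s a sketch equivalent to `calc_mse` above (the function name is mine):

```python
import numpy as np

def calc_mse_vectorized(m, b, x, y):
    """MSE of the predictions m * x + b against the actual values y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.mean((y - (m * x + b)) ** 2))

# on our dataset this would be e.g.:
# calc_mse_vectorized(2, 0, data[COL_NA], data[COL_G])
```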

To test if this all works, let’s select some arbitrary parameters, say, `m=2, b=0`, and calculate the MSE. Note that “f-strings” are a new feature available from Python 3.6, released in December 2016… which actually ain’t that new at all 😉. (You should always use f-strings! There is an excellent post about them.)

```
test_m = 2
test_b = 0
test_estimator = global_sales_estimator(test_m, test_b)
test_mse = calc_mse(test_estimator, data, COL_NA, COL_G)
# using formatted string literals (Python 3.6 feature)
print(f'For m={test_m} and b={test_b}, our loss is {test_mse}')
```

`For m=2 and b=0, our loss is 0.1450413145539906`

Let’s also draw our line over that scatter plot.

```
plt.figure(figsize=(8,6))
plt.title('Xbox One Sales Exploration')
plt.xlabel('North American Sales in Millions')
plt.ylabel('Global Sales in Millions')
plt.scatter(data[COL_NA], data[COL_G], c='red', alpha=0.3)
plt.plot(data[COL_NA], test_estimator(data[COL_NA]), c='blue', alpha=0.8)
plt.show()
```

That’s a bit off. Apparently North Americans are not buying half of all Xbox One games in the world! Well, let’s ask our machine to learn the model itself, using gradient descent!

Remember how I explained gradient descent at the beginning? Our mission is to find the “valley” of our loss function, without actually calculating and comparing every point on it.

Since we are working on a linear regression model with MSE, the loss function is a quadratic function that is pretty easy to differentiate by hand. However, we have two parameters in this case, $m$ and $b$. So let’s calculate the partial derivatives with respect to both parameters - the minimum loss should occur when both parameters are the best pick.

$\dfrac {\partial MSE}{\partial m}=\frac{2}{n} \sum_{i=1}^{n} x_i \left(b + m x_i - y_i\right)$

$\dfrac {\partial MSE}{\partial b}=\frac{2}{n} \sum_{i=1}^{n} \left(b + m x_i - y_i\right)$
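For readers who want the intermediate step: both formulas come from the chain rule applied to the MSE definition above, writing $y_i$ for the actual value $P_i$ and $m x_i + b$ for the prediction $\hat{P}_i$. For $m$, for example:

$\dfrac {\partial MSE}{\partial m}=\dfrac{1}{n}\sum_{i=1}^{n} 2\left(y_i - m x_i - b\right)\cdot\left(-x_i\right)=\frac{2}{n} \sum_{i=1}^{n} x_i \left(b + m x_i - y_i\right)$

The derivation for $b$ is identical, except the inner derivative is $-1$ instead of $-x_i$.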

That was the first step. The second step is that once we figure out which way is downhill (the direction opposite the gradient), we update our parameters in that direction for some “distance”. That’s where the concept of the *learning rate* comes in: it simply means how far we advance each time we pause and ponder our current gradient. You can think of it as an “update resolution” rather than a “learning rate”. With a higher learning rate (lower resolution), we advance downhill more quickly, thus “learning faster”; however, our approximation of the minimum will be less accurate, and a rate that is too high can even overshoot the valley entirely. We also want to relax our pace as we get closer to the valley - we do this by multiplying the learning rate and the gradient together, since the gradient shrinks near the minimum.
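To make that trade-off concrete, here’s a tiny illustration (the numbers are arbitrary) on the simplest quadratic loss $f(x)=x^2$, whose gradient is $2x$: a modest rate creeps toward the minimum at 0, while a rate that is too large overshoots further and further each step.

```python
def descend(learning_rate, steps=50):
    """Gradient descent on f(x) = x**2 (gradient 2*x), starting at x = 1."""
    x = 1.0
    for _ in range(steps):
        x -= learning_rate * 2 * x
    return x

print(descend(0.1))  # small steps: ends up very close to 0
print(descend(1.1))  # steps too large: |x| keeps growing, the descent diverges
```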

Let’s write some code!

```
def partial_d(m, b, dataframe, label_input, label_real):
    """Returns a pair of partial derivatives with respect to m and b"""
    n = dataframe.shape[0]
    s_m = 0
    s_b = 0
    for index, row in dataframe.iterrows():
        s_m += row[label_input] * (b + m * row[label_input] - row[label_real])
        s_b += b + m * row[label_input] - row[label_real]
    return ((2 / n) * s_m), ((2 / n) * s_b)

def gradient_descent_step(current_m, current_b, dataframe, learning_rate):
    """Returns the new pair of m and b parameters we should try next time"""
    dm, db = partial_d(current_m, current_b, dataframe, COL_NA, COL_G)
    return (current_m - learning_rate * dm), (current_b - learning_rate * db)
```

Now that we have defined the math part, let’s ask the machine to learn on its own and see how it works! We will run a loop and record the loss of every iteration/step, until the loss stops improving - that is, until we have reached the best approximated (local) minimum of our loss function. Let’s choose a learning rate of `0.1`, and ask the machine to start with our guess `m=2, b=0`. As you might have guessed, the choice of learning rate and starting point has a strong influence on the result and is worth a post of its own, so we won’t discuss it here for now.

```
def learn_linear_model(m_start, b_start, dataframe, learning_rate):
    # starting point
    m = m_start
    b = b_start
    estimator = global_sales_estimator(m, b)
    loss = calc_mse(estimator, dataframe, COL_NA, COL_G)
    all_steps = [{'m': m, 'b': b, 'loss': loss}]
    while True:
        new_m, new_b = gradient_descent_step(m, b, dataframe, learning_rate)
        new_estimator = global_sales_estimator(new_m, new_b)
        new_loss = calc_mse(new_estimator, dataframe, COL_NA, COL_G)
        if new_loss > loss:
            break
        m, b, estimator, loss = new_m, new_b, new_estimator, new_loss
        all_steps.append({'m': m, 'b': b, 'loss': loss})
    return all_steps

training_process = learn_linear_model(2, 0, data, 0.1)
iterations = len(training_process)
m = training_process[-1]['m']
b = training_process[-1]['b']
loss = training_process[-1]['loss']
estimator = global_sales_estimator(m, b)
print(f'After {iterations} iterations, our program has finally found an optimal combo: \n'
      f'm = {m} and b = {b}, the loss is {loss}')
```

```
After 235 iterations, our program has finally found an optimal combo:
m = 1.5757523651031329 and b = 0.04682234460848791, the loss is 0.057121976578976816
```
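By the way, simple linear regression also has a closed-form least-squares solution, so we can sanity-check our result: on the real data, `np.polyfit(data[COL_NA], data[COL_G], deg=1)` should land very close to the $m$ and $b$ above. Here’s the idea on synthetic points that lie exactly on a known line:

```python
import numpy as np

# points generated from y = 1.5 * x + 0.05, so polyfit should recover m and b
x = np.array([0.1, 0.5, 1.0, 2.0, 3.5])
y = 1.5 * x + 0.05
m_exact, b_exact = np.polyfit(x, y, deg=1)  # degree-1 least-squares fit
print(m_exact, b_exact)  # ~1.5 and ~0.05
```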

Pretty neat! Let’s plot out our new findings!

```
plt.figure(figsize=(8,6))
plt.title('Xbox One Sales Exploration')
plt.xlabel('North American Sales in Millions')
plt.ylabel('Global Sales in Millions')
plt.scatter(data[COL_NA], data[COL_G], c='red', alpha=0.3)
plt.plot(data[COL_NA], estimator(data[COL_NA]), c='blue', alpha=0.8)
plt.show()
```

Quite accurate!

Awesome! Now that we have found our best-fitting linear model by writing our own gradient descent function from scratch, what did the learning process actually look like? How was our model improving? Let’s play with some visualizations! Visualization is always my favorite tool in data science.

Let’s recap: what have we done so far? We used gradient descent to find the minimum of our loss function. Let’s plot a 3D graph here to illustrate the relationship of `m`, `b` and `loss`.

```
# we'll need additional modules to plot 3D graphs
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(111, projection='3d')
# let's plot the relationship of m, b and loss within the following ranges
M = np.arange(1.0, 3.0, 0.01)     # m from 1.0 up to (not including) 3.0, step 0.01
B = np.arange(-0.05, 0.10, 0.001) # b from -0.05 up to (not including) 0.10, step 0.001
X, Y = np.meshgrid(M, B)
Z = calc_mse(global_sales_estimator(X, Y), data, COL_NA, COL_G)
surf = ax.plot_surface(X, Y, Z, cmap='Oranges_r')
ax.set_title('Exploring the Relationship of m, b and loss')
ax.set_xlabel('m')
ax.set_ylabel('b')
ax.set_zlabel('Loss')
plt.show()
```

Also, how is the accuracy of our model improving over every iteration?

```
plt.figure()
plt.title('Loss vs. Iteration')
plt.xlabel('Iteration')
plt.ylabel('Loss')
Y = [step['loss'] for step in training_process]
X = range(0, len(Y))
plt.plot(X, Y, c='red')
plt.show()
```

As you can see, our model was not improving much after about 50 iterations. This is because in our implementation, we only asked the training to stop once the loss is truly as low as we can get with our learning rate. Most of the time, we don’t need that “absolute best” approximation. Settling for a “good enough” model - one where the loss is no longer improving much - can often save us tons of time and electricity bills.
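If you want to encode that “good enough” idea, one common tweak (not in our implementation above) is a tolerance-based stopping rule: stop as soon as an iteration improves the loss by less than some threshold. A generic sketch, with an arbitrary tolerance value:

```python
def descend_until_tolerance(step_func, loss_func, start, tolerance=1e-6):
    """Repeat gradient steps until the loss improves by less than tolerance."""
    params = start
    loss = loss_func(params)
    while True:
        new_params = step_func(params)
        new_loss = loss_func(new_params)
        if loss - new_loss < tolerance:  # barely improving (or getting worse)
            return params, loss
        params, loss = new_params, new_loss

# toy usage: minimize f(x) = x**2 with gradient 2*x and learning rate 0.1
best_x, best_loss = descend_until_tolerance(
    lambda x: x - 0.1 * 2 * x, lambda x: x ** 2, start=1.0)
print(best_x, best_loss)
```

Raising the tolerance trades a little accuracy for far fewer iterations, which is exactly the trade-off the plot above suggests.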

I hope you have enjoyed our journey implementing gradient descent from scratch! Please share this post with your friends if you find it helpful, and leave a comment below if you have spotted an error or something that needs to be improved.

Thanks for reading, until next time!