Here $n$ is the number of observations and $x^{(i)}$ is the value of the independent variable associated with
the observation $y^{(i)}$. Note that in Equation \ref{eq:cost} we average by $2n$ and not $n$: this
simplifies the expression of the partial derivatives, as we will see below. It is a purely
cosmetic choice which does not affect the gradient descent (see [[https://math.stackexchange.com/questions/884887/why-divide-by-2m][here]] for more information). The next
step is to compute $\min_w J(w)$ over the weights $w_i$ by performing gradient descent (see [[https://towardsdatascience.com/gradient-descent-demystified-bc30b26e432a][here]]). Thus we