2600: The Problem of Gradient Descent as First Order Taylor's Approximation

Arthur Hau and Gemini 2.0

May 2, 2025

Taylor's Theorem is invoked in many real-world applications as the basis for approximating a function by a first-order or higher-order Taylor polynomial.

Suppose there is a well-behaved function \(f\) whose derivatives \(f^{(i)}\) exist for all \(i = 1, \ldots, n\). Roughly, Taylor's Theorem states that there exists \(c \in (a, x)\) such that \[ f(x) = f(a) + f'(a) (x-a) + \frac{1}{2} f''(a) (x-a)^2 + \ldots + \frac{1}{(n-1)!} f^{(n-1)}(a) (x-a)^{n-1} + \frac{1}{n!} f^{(n)}(c) (x - a)^n. \] In the simplest first-order approximation case, we have

\[ f(x) = f(a) + f'(a) (x-a) + \frac{1}{2} f''(c) (x-a)^2, \tag{1} \]

where the last term, \(\frac{1}{2} f''(c) (x-a)^2\) (in general \(\frac{1}{n!} f^{(n)}(c) (x - a)^n\)), is often called the error of approximation.
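As a rough numerical illustration of the first-order formula and its error term, the sketch below compares \(f(a) + f'(a)(x-a)\) with the true value for \(f = \exp\), \(a = 0\), \(x = 0.5\). The function choice, the evaluation points, and the helper names `first_order_taylor` and `lagrange_error_bound` are assumptions made for this example only; they do not come from the text above.

```python
import math


def first_order_taylor(f, df, a, x):
    """Return f(a) + f'(a) * (x - a), the first-order Taylor approximation."""
    return f(a) + df(a) * (x - a)


def lagrange_error_bound(d2f_max, a, x):
    """Upper bound on |(1/2) f''(c) (x - a)^2| given max |f''| on [a, x]."""
    return 0.5 * d2f_max * (x - a) ** 2


# Illustrative choice (not from the text): f = exp, so f' = f'' = exp.
a, x = 0.0, 0.5
approx = first_order_taylor(math.exp, math.exp, a, x)
actual_error = abs(math.exp(x) - approx)

# exp is increasing, so the maximum of |f''| on [a, x] is exp(x).
bound = lagrange_error_bound(math.exp(x), a, x)

print(f"approximation = {approx:.4f}")
print(f"actual error  = {actual_error:.4f}")
print(f"error bound   = {bound:.4f}")
```

The actual error stays below the bound because \(c\) lies somewhere in \((a, x)\), and \(f''(c)\) cannot exceed the maximum of \(f''\) on that interval.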

How should we interpret equation (1)? For this equation to give a good approximation, we first need to check the signs of the first and second derivatives of \(f\). Since we have arbitrarily chosen \(a < x\), for \(f(a) + f'(a) (x-a)\) to be a good approximation of \(f(x)\), we need to check the signs of both \(f'(a)\) and \(f''(a)\). Given \(x-a > 0\), there are several cases to consider.

Case 1 \(f'(\eta) > 0\) and \(f''(\eta) > 0\) for all \(\eta \in [a,x]\).

Since \(a < x\), adding \(f'(a) (x-a)\) to \(f(a)\) increases the approximated value of \(f(x)\). This is a good approximation if \(f(x) > f(a)\), which holds when \(f' > 0\) and \(f''>0\).

Case 2 \(f'(\eta) < 0\) and \(f''(\eta) > 0\) for all \(\eta \in [a,x]\).

This is the case in which the first-order Taylor series is a bad approximation. Adding \(f'(a) (x-a)\) to \(f(a)\) decreases the approximated value of \(f(x)\) from its starting value \(f(a)\), which gives a worse approximation because \(f(x) > f(a)\) given \(f''> 0\). Reducing the approximated value downward from \(f(a)\) is unwise.

Case 3 \(f'(\eta) > 0\) and \(f''(\eta) < 0\) for all \(\eta \in [a,x]\).

\(a < x\) implies that adding \(f'(a) (x-a)\) to \(f(a)\) increases the approximated value of \(f(x)\).

\(a<x\), \(f'>0\), and \(f''<0\) imply that \(f(x) > f(a)\). So the increase in the approximated value improves upon the original approximated value \(f(a)\).

Case 4 \(f'(\eta) < 0\) and \(f''(\eta) < 0\) for all \(\eta \in [a,x]\).

I will leave this as an exercise to the reader.

In the case of \(f(x)=1/x\), \(f'(x) = -1/x^2\), which is negative for all \(x \neq 0\). On the other hand, \(f''(x) = 2/x^3\), which is positive for \(x>0\) and negative for \(x<0\).

Let's now take a look at the following statement. Most proofs of Taylor's Theorem rely on declaring a statement similar to the following for arbitrary \(n\):

\[ F(x,n) = f(b) - f(x) - f'(x) (b-x) - \frac{f''(x)}{2!} (b-x)^2 - \ldots - \frac{f^{(n-1)}(x)}{(n-1)!}(b-x)^{n-1}. \tag{2} \]

What's wrong with this seemingly innocent statement? Let's rewrite this as a sequence.

\[ S_n = F(x,n) - f(b) + f(x) + f'(x) (b-x) + \frac{f''(x)}{2!} (b-x)^2 + \ldots + \frac{f^{(n-1)}(x)}{(n-1)!}(b-x)^{n-1}. \tag{3} \]

Does the sequence \(S_n\) converge or diverge? Why is this important? It is important because in mathematical induction, we need to assume that a statement like (2) holds for any "arbitrary" \(n\). Whether (2) holds for any arbitrary \(n\) depends on whether the sequence \(S_n\) given in (3) converges or diverges.

It is well known that in the case of \(f(x)=1/x\), the Taylor series expansion about \(b\) gives

\[ f(x) = \frac{1}{x}= \frac{1}{b} - \frac{(x-b)}{b^2} + \frac{(x-b)^2}{b^3} - \frac{(x-b)^3}{b^4} + \cdots = \sum_{n=0}^{\infty} \frac{(-1)^n}{b^{n+1}} (x-b)^n . \]
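To see how the partial sums of this series behave, here is a minimal numerical sketch. It assumes \(b = 1\) and two illustrative evaluation points, one with \(|x-b| < |b|\) and one with \(|x-b| > |b|\). That threshold is the standard geometric-series condition and is stated here as background; the helper name `partial_sum` is hypothetical.

```python
def partial_sum(x, b, n_terms):
    """Sum of the first n_terms terms of sum_n (-1)^n (x - b)^n / b^(n + 1)."""
    return sum((-1) ** n * (x - b) ** n / b ** (n + 1) for n in range(n_terms))


b = 1.0

# |x - b| = 0.5 < |b|: the partial sums settle near 1 / x.
inside = partial_sum(1.5, b, 60)

# |x - b| = 1.5 > |b|: the partial sums grow in magnitude with more terms.
outside_20 = partial_sum(2.5, b, 20)
outside_40 = partial_sum(2.5, b, 40)

print(f"x = 1.5: partial sum = {inside:.6f}, 1/x = {1 / 1.5:.6f}")
print(f"x = 2.5: 20 terms = {outside_20:.3e}, 40 terms = {outside_40:.3e}")
```

In this sketch the series tracks \(1/x\) at the first point and moves further away from it at the second as terms are added.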

Let's rewrite it in the form of (2) and we have

\[ F(x,n) = \frac{1}{b} - \sum_{i=n-1}^{\infty} \frac{(-1)^i}{b^{i+1}} (x-b)^i \]

In our sequence notation, (3) becomes

\[ S_n = F(x,n) - \frac{1}{b} + \sum_{i = n -1}^{\infty} \frac{(-1)^i}{b^{i+1}} (x-b)^i \]

Now, it is obvious that the declaration of (2) is equivalent to saying that \(S_n = 0\), which is obviously absurd. And it is well known that the Taylor series expansions of most functions that contain quotients of polynomials do not converge everywhere; the series for \(1/x\) above, for instance, converges only for \(|x-b| < |b|\). It should now be clear that the reason is that Taylor ignored the signs of \(f'\) and \(f''\) when he used his series as an approximation of \(f\). For many \(f\)'s, the Taylor series, which involves only polynomials, gives rise to bad approximations regardless of whether the order is high or low.

Conclusion

Taylor created a formula that approximates a function \(f\) by a polynomial of order \(n\), called the Taylor series. This series is commonly acknowledged to be highly useful in many real-world applications. However, by ignoring the second and higher derivatives of a function \(f\), even the first-order Taylor series approximation can be misleading. If we use our knowledge of the second derivative of the function, we can change the sign of the first-order term and come up with a better approximation. This is exactly what computer simulations do. Instead of taking a large step \(x-a\) in the first-order term, a computer uses a small step \(h\). If the sign is wrong, the computer takes the next step in the opposite direction. If instead the computer has computed the sign of \(f''\), the sign of \(h\) will be set correctly even in the first iteration.
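The small-step procedure described above can be sketched roughly as follows. This is one possible reading of it, not a definitive implementation: the helper name `step_with_reversal`, the fixed step size, the accept-or-reverse rule, and the quadratic test function are all assumptions made for illustration.

```python
def step_with_reversal(f, x0, h=0.1, n_steps=50):
    """Walk downhill on f with a fixed step size h (hypothetical helper).

    A trial step is accepted only if it decreases f; otherwise the sign
    of h is flipped so that the next trial goes in the opposite direction.
    """
    x = x0
    for _ in range(n_steps):
        trial = x + h
        if f(trial) < f(x):
            x = trial  # the step direction was right: accept the step
        else:
            h = -h     # wrong sign: try the opposite direction next time
    return x


# Illustrative objective (an assumption, not from the text): minimum at x = 2.
def objective(x):
    return (x - 2.0) ** 2


x_final = step_with_reversal(objective, x0=0.0)
print(f"final x = {x_final:.6f}")
```

With a fixed step the walk stops improving once no trial step in either direction lowers the function value, so the result is only accurate to within the step size.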

Friday, June 19, 2026
