Chapter 1 数学基础 | Mathematical Preliminaries¶

约 1160 个字 1 张图片预计阅读时间 6 分钟

本讲概述

本讲主要介绍数值计算中的误差类型和误差分析，以及IEEE 754浮点数表示。

1.2 舍入误差和计算机算术 | Roundoff Errors and Computer Arithmetic¶

1.2.1 截断误差和舍入误差 | Truncation Error and Roundoff Error¶

截断误差: 使用截断的(或者说有限的)求和来近似无穷级数的和->产生的误差
- 理论： $e=\sum\limits_{n=0}^{\infty} \frac{1}{n!}$
- 计算机: $e=\sum\limits_{n=0}^{N} \frac{1}{n!}$
舍入误差: 当计算机执行实数计算时产生的误差。这是因为计算机中的算术运算涉及到的数字只有有限位数。
- 理论： $0.3333333 \dots$
- 计算机: $0.333333$

1.2.2 截断法和舍入法 | Chopping and Rounding¶

使用舍入法或截断法产生的误差都叫做舍入误差。

Chopping: 0.1119 -> 0.111，直接截断
Rounding: 0.1119 -> 0.112，四舍五入

函数表示

Given a real number $y = 0. d_{1} d_{2} d_{3} \dots d_{k} d_{k + 1} \dots \times 10^{n}$ (which is called Normalized decimal floating-point form of a real number)

f l (y) = {\begin{cases} 0. d_{1} d_{2} d_{3} \dots d_{k} \times 10^{n} & /* chopping */ \\ c h o p (y + 5 \times 10^{n - (k + 1)}) & /* rounding */ \end{cases}

1.2.3 绝对误差与相对误差 | Absolute error and relative error¶

Denote $p^{*}$ as the approximation of $p$ .

Absolute error: $| p^{*} - p |$
Relative error: $\frac{| p^{*} - p |}{| p |}$

近似产生的误差

chopping:

\frac{| y - f l (y) |}{| y |} = \frac{0. d_{k + 1} d_{k + 2} \dots \times 10^{n - k}}{0. d_{1} d_{2} d_{3} \dots d_{k} \times 10^{n}} = \frac{0. d_{k + 1} d_{k + 2} \dots}{0. d_{1} d_{2} d_{3} \dots d_{k}} \times 10^{- k} < \frac{1}{0.1} \times 10^{- k} = 10^{- k + 1}

Rounding:

\frac{| y - f l (y) |}{| y |} \leq \frac{0.5 \times 10^{n - k}}{0. d_{1} d_{2} d_{3} \dots d_{k} \times 10^{n}} = \frac{0.5}{0.1} \times 10^{- k} = 0.5 \times 10^{- k + 1}

误差产生

两个近乎相等的数相减二导致有效数字的相消，例如： $a = 0.123456789, b = 0.123456788$ ，两者本来有9位有效数字，但是相减后只剩下1位有效数字。
如果有限位的表示或计算产生了误差，则除以一个小数（或者乘以一个大数）会放大绝对误差。

误差计算举例：

Evaluate $f (x) = x^{3} - 6.1 x^{2} + 3.2 x + 1.5$ at $x = 4.71$ using 3-digit arithmetic(3位有效数字).

	$x$	$x^{2}$	$x^{3}$	$6.1 x^{2}$	$3.2 x$
exact	$4.71$	$22.1841$	$104.487111$	$135.32301$	$15.072$
Chopping	$4.71$	$22.1$	$104.$	$6.1 * 22.1 = 134.81 \approx 134.$	$15.0$
Rounding	$4.71$	$22.2$	$105.$	$6.1 * 22.2 = 135.42 \approx 135.$	$15.1$

Exact value: $f (4.71) = 104.487111 - 135.32301 + 15.072 + 1.5 = - 14.263899$

Chopping: $f (4.71) = 104 - 134 + 15.0 + 1.5 = - 13.5$

Relative error: $\frac{| - 14.263899 + 13.5 |}{| - 14.263899 |} \approx 5.35 %$

Rounding: $f (4.71) = 105 - 135 + 15.1 + 1.5 = - 13.4$

Relative error: $\frac{| - 14.263899 + 13.4 |}{| - 14.263899 |} \approx 6.06 %$

可见，有时候舍入误差比截断误差更大。

但现在的误差还是太大了！介绍——秦九韶算法！

乘法较多会导致式子的误差较大，此时可以使用秦九韶算法（其实就是不断提取 $x$ ）来减少乘法的次数。

把一个多项式 $f (x) = a_{n} x^{n} + a_{n - 1} x^{n - 1} + \dots + a_{1} x + a_{0}$ 写成如下形式：

f (x) = ((\dots (a_{n} x + a_{n - 1}) x + a_{n - 2}) x + \dots) x + a_{1}) x + a_{0}

然后从最内层开始计算。

将上例改写成秦九韶算法：

f (x) = ((x - 6.1) x + 3.2) x + 1.5

Chopping:

\begin{aligned} ((4.71 - 6.1) 4.71 + 3.2) 4.71 + 1.5 \\ = & (- 1.39 * 4.71 + 3.2) 4.71 + 1.5 \\ = & (- 6.54 + 3.2) 4.71 + 1.5 \\ = & - 3.34 * 4.71 + 1.5 \\ = & - 15.7 + 1.5 \\ = & - 14.2 \end{aligned}

Relative error: $\frac{| - 14.263899 + 14.2 |}{| - 14.263899 |} \approx 0.44 %$

Rounding:

\begin{aligned} ((4.71 - 6.1) 4.71 + 3.2) 4.71 + 1.5 \\ = & (- 1.39 * 4.71 + 3.2) 4.71 + 1.5 \\ = & (- 6.55 + 3.2) 4.71 + 1.5 \\ = & - 3.35 * 4.71 + 1.5 \\ = & - 15.8 + 1.5 \\ = & - 14.3 \end{aligned}

Relative error: $\frac{| - 14.263899 + 14.3 |}{| - 14.263899 |} \approx 0.25 %$

可见，此时误差明显减小了。

1.3 算法和收敛性 | Algorithms and Convergence¶

稳定性 | Stability¶

一个算法，如果初始数据的小变化会导致最终结果的小变化，则称为稳定(stable)；否则称为不稳定(unstable)。

一个算法，如果只有在某些初始数据的选择下才是稳定的，则称为条件稳定(conditionally stable)。

误差增长 | The growth of errors¶

假设 $E_{0} > 0$ 是初始误差(initial error)， $E_{n}$ 是第 $n$ 步的误差。

如果 $E_{n} \approx C n E_{0}$ ，则称为线性增长(linear growth)。

线性增长的误差通常是无法避免的，当 $C$ 和 $E_{0}$ 都很小的时候，结果通常是可以接受的。

如果 $E_{n} \approx C^{n} E_{0}$ ，则称为指数增长(exponential growth)。

指数增长的误差应该避免，因为即使 $E_{0}$ 很小， $C^{n}$ 也会变得很大。这会导致不可接受的不准确性。

收敛速度 | Convergence rate¶

使用 $O$ 符号来表示收敛速度。

假设 $lim_{h \to 0} F (h) = L$ ， $lim_{h \to 0} G (h) = 0$ 。如果存在正常数 $K$ 使得

| F (h) - L | \leq K | G (h) |

对于足够小的 $h$ 成立，则记为 $F (h) = L + O (G (h))$

我们通常取 $G (h) = h^{p}, (p > 0)$ ，并对最大的 $p$ 值感兴趣。

求当 $h \to 0$ 时下列函数的收敛速度

$F (h) = \frac{\sin h}{h}$ , $lim_{h \to 0} \frac{\sin h}{h} = 1$

解: $F (h) - L = \frac{\sin h}{h} - 1 = \frac{\sin h - h}{h} \sim \frac{h - \frac{h^{3}}{6} + o (h^{3}) - h}{h} \sim - \frac{h^{2}}{6} \leq K | h^{2} |$

所以收敛速度为 $O (h^{2})$ 。

1.4 IEEE 754 FLOATING POINT REPRESENTATION¶

二进制的科学计数法： $x = \pm 1. d_{1} d_{2} d_{3} \dots d_{k} \times 2^{n}$ ，这里的 $d_{i}$ 是二进制的数字, $1. d_{1} d_{2} d_{3} \dots d_{k}$ 为有效位数(Significand)。

在计算机里有两种表示方法，分别是Single(32位)和double(64位)。均为下图的形式。

二进制

x = (- 1)^{S} \times (1 + F r a c t i o n) \times 2^{E x p o n e n t - B i a s}

$S$ 代表符号位(sign bit)。 $S$ 为 $0$ 表示非负数， $S$ 为 $1$ 表示负数。
有效位数( $S i g n i f i c a n d = 1 + F r a c t i o n$ ) 中 $1$ 是默认的，我们只存储小数点后面的位数(即Fraction)。
$F r a c t i o n$ 代表尾数部分(也称作mantissa)。
Single: $23$ $b i t s$
Double: $52$ $b i t s$
$E x p o n e n t$ 表示指数部分(也称作characteristic)。
- Single: $8$ $b i t s$
- Double: $11$ $b i t s$
- excess representation(偏移表示法) $=$ actual exponent $-$ bias，其中：
  - Single: $b i a s = 127$ $b i t s$
  - Double: $b i a s = 1023$ $b i t s$
- 指数为全 $0$ 和全 $1$ 时用作特殊值处理

Single & Double的值域

Single的值域

最小值：
Exponent = 00000001, Fraction = 000...000
$x = \pm (1 + 0.0) \times 2^{1 - 127} = \pm 1.0 \times 2^{- 126} \approx \pm 1.2 \times 10^{- 38}$
最大值：
Exponent = 11111110, Fraction = 111...111
$x = \pm (1 + 0.111 \dots 111) \times 2^{254 - 127} \approx \pm 2.0 \times 2^{127} \approx \pm 3.4 \times 10^{38}$

Double的值域

最小值：
Exponent = 00000000001, Fraction = 000...000
$x = \pm (1 + 0.0) \times 2^{1 - 1023} = \pm 1.0 \times 2^{- 1022} \approx \pm 2.2 \times 10^{- 308}$
最大值：
Exponent = 11111111110, Fraction = 111...111
$x = \pm (1 + 0.111 \dots 111) \times 2^{2046 - 1023} \approx \pm 2.0 \times 2^{1023} \approx \pm 1.8 \times 10^{308}$

Single & Double的十进制精度

Single的十进制精度

最小分度: $2^{- 23}$
$\log_{10} 2^{- 23} \approx - 6.92$
所以约为小数点后六位有效数字

Double的值域

最小分度: $2^{- 52}$
$\log_{10} 2^{- 52} \approx - 15.65$
所以约为小数点后十六位有效数字

十进制->IEEE754 浮点数

Represent $- 0.75$

Convert to binary: $0.75 = 0.11$
Normalize: $0.11 = 1.1 \times 2^{- 1}$
Sign bit: $S = 1$
Exponent:
- Single: $- 1 + 127 = 126 = 01111110$
- Double: $- 1 + 1023 = 1022 = 01111111110$
Fraction:
- Single: $0.1 = 100 \dots 00$ ，共23位
- Double: $0.1 = 100 \dots 00000$ ，共52位 $∴$ IEEE 754 Representation:
- Single: $1 01111110 100 \dots 00$
- Double: $1 01111111110 100 \dots 00000$

IEEE754 浮点数->十进制

What number is represented by the single-precision float 11000000101000…00

Sign bit: $S = 1$
Exponent: $E = 10000001 = 129$
Actual exponent: $E - 127 = 2$
Fraction: $01000 \dots 00$
Significand: $1.01000 \dots 00 = 1.25$
$∴$ Number: $- 1.25 \times 2^{2} = - 5.0$