Linear Regression

One Variable

Based on C1_W1_Lab03_Model_Representation_Soln

f_{w,b}(x^{(i)}) = wx^{(i)} + b \tag{1}

Size (1000 sqft)	Price (1000s of dollars)
1.0	300
2.0	500

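Notes on the key calls:
plt.style.use()
Tells matplotlib to apply a style.
The argument can be a URL or a path that points to a custom mplstyle file.
You can also place your own mplstyle file in the mpl_configdir/stylelib folder and then refer to the style by its file name; mpl_configdir can be looked up with matplotlib.get_configdir().
The argument may also be a list, in which case the styles from several mplstyle files are combined, with later entries overriding earlier ones.
In plt.style.use('./deeplearning.mplstyle'), ./ refers to the current working directory (../ would be the parent directory). If the style file sits in a subfolder named ML of the working directory, the path becomes './ML/deeplearning.mplstyle'.
shape[0]
.shape returns the dimensions of an array as a tuple, and shape[0] is the length of the first dimension (usually the rows). For a one-dimensional array only shape[0] exists, and it equals the number of elements. In NumPy, shape (n,) is a 1-D array with n elements, whereas shape (n,1) is a 2-D array with n rows and 1 column.
scatter()
The x and y arguments of scatter() are usually 1-D.
show(), close(), draw()
pyplot functions for displaying, closing, and redrawing figures.
plot()
legend()
Adds a legend that describes the plotted data.
Usage:
legend(): collects the label of each plotted data series and, by default, places the legend where matplotlib judges it fits best (loc='best').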
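The difference between shape (n,) and shape (n,1) described above can be checked directly. This is a minimal sketch; the array names a and b are chosen here for illustration only.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])   # 1-D array, shape (3,)
b = a.reshape(-1, 1)            # 2-D column, shape (3, 1)

# For the 1-D array only shape[0] exists; it is the number of elements.
print(a.shape, a.shape[0], len(a))
# The 2-D array has a row count and a column count.
print(b.shape, b.shape[0], b.shape[1])
```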

import numpy as np
import matplotlib.pyplot as plt
plt.style.use('./ML/deeplearning.mplstyle')
x_train=np.array([1.0,2.0])
y_train=np.array([300.0,500.0])
m=x_train.shape[0] # or: m=len(x_train)
# print(m) #2
# Plot the data points
plt.scatter(x_train,y_train,c='r',marker='x',label='Actual Values')
# Set the title
plt.title("Housing Prices")
# Set the y-axis label
plt.ylabel('Price (in 1000s of dollars)')
# Set the x-axis label
plt.xlabel('Size (1000 sqft)')
# plt.show() is deferred so that the prediction can be added to the same figure
# Set the two parameters; adjust them until the model output fits the data
w=200
b=100
# Model output
def compute_model_output(x,w,b):
    """
    Computes the prediction of a linear model
    Args:
      x (ndarray (m,)): Data, m examples 
      w,b (scalar)    : model parameters  
    Returns
      y (ndarray (m,)): target values
    """
    m = x.shape[0]
    f_wb = np.zeros_like(x)
    for i in range(m):
        f_wb[i]=w*x[i]+b
    return f_wb
# Plot the model output
tmp_f_wb = compute_model_output(x_train, w, b)
# Plot our model prediction
plt.plot(x_train, tmp_f_wb, c='b',label='Our Prediction')

# Plot the data points
# plt.scatter(x_train, y_train, marker='x', c='r',label='Actual Values')        
# # Set the title
# plt.title("Housing Prices")
# # Set the y-axis label
# plt.ylabel('Price (in 1000s of dollars)')
# # Set the x-axis label
# plt.xlabel('Size (1000 sqft)')
# plt.legend()
# plt.show()
# Make a prediction and show it on the plot
x_i = 1.2
cost_1200sqft = w * x_i + b    
print(f"${cost_1200sqft:.0f} thousand dollars")
plt.scatter(x_i, cost_1200sqft, marker='x', c='green',s=80,label='Prediction Values')  
plt.legend()      
plt.show()

Cost Function

Based on C1_W1_Lab04_Cost_function_Soln
The cost equation for one variable is:

J(w,b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})^2 \tag{1}

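The cost measures how well the model fits the training data: the smaller the cost, the closer the predictions are to the targets.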
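As a small worked example of the cost equation, the sketch below evaluates J(w,b) on the two-point housing data for a perfect fit and for a poor one. The helper name compute_cost_vec is introduced here for illustration; it is a vectorized variant, not the lab's own function.

```python
import numpy as np

def compute_cost_vec(x, y, w, b):
    """Squared-error cost J(w,b) for one feature, computed without a loop."""
    m = x.shape[0]
    err = w * x + b - y              # f_wb(x_i) - y_i for every example
    return np.sum(err ** 2) / (2 * m)

x_train = np.array([1.0, 2.0])
y_train = np.array([300.0, 500.0])

cost_fit = compute_cost_vec(x_train, y_train, 200, 100)   # predictions 300 and 500
cost_off = compute_cost_vec(x_train, y_train, 100, 100)   # errors of -100 and -200
print(cost_fit, cost_off)
```

With w=100 and b=100 the squared errors are 10000 and 40000, so the cost is 50000 / (2 * 2) = 12500, while w=200 and b=100 reproduce both targets and give a cost of 0.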
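matplotlib.widgets is matplotlib's GUI widget module; its controls, such as buttons and sliders, let you change the displayed result interactively.
from lab_utils_uni import plt_intuition, plt_stationary, plt_update_onclick, soup_bowl imports these helper functions from lab_utils_uni.py.
In Python, exponentiation is written ** rather than ^ (^ is bitwise XOR).
Details of the helper function plt_intuition:
w_array = np.arange(*w_range, 5) is equivalent to w_array = np.arange(0, 400, 5), where 5 is the step.
@interact(w=(*w_range,10),continuous_update=False) provides the interactive slider, where 10 is the slider step.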
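The argument unpacking in np.arange(*w_range, 5) can be verified in isolation. A minimal sketch using the same w_range as the helper:

```python
import numpy as np

w_range = np.array([200 - 200, 200 + 200])   # [0, 400]
w_array = np.arange(*w_range, 5)             # * unpacks start and stop; 5 is the step

# The stop value 400 itself is excluded, as with range().
print(w_array[:4], w_array[-1], len(w_array))
```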

def plt_intuition(x_train, y_train):

    w_range = np.array([200-200,200+200]) # 0-400
    tmp_b = 100

    w_array = np.arange(*w_range, 5) 
    cost = np.zeros_like(w_array)
    for i in range(len(w_array)):
        tmp_w = w_array[i]
        cost[i] = compute_cost(x_train, y_train, tmp_w, tmp_b)

    @interact(w=(*w_range,10),continuous_update=False)
    def func( w=150): # plot with w=150 by default
        f_wb = np.dot(x_train, w) + tmp_b

        fig, ax = plt.subplots(1, 2, constrained_layout=True, figsize=(8,4))
        fig.canvas.toolbar_position = 'bottom'

        mk_cost_lines(x_train, y_train, w, tmp_b, ax[0])
        plt_house_x(x_train, y_train, f_wb=f_wb, ax=ax[0])
        # draw the cost curve on the second panel
        ax[1].plot(w_array, cost)
        cur_cost = compute_cost(x_train, y_train, w, tmp_b)
        ax[1].scatter(w,cur_cost, s=100, color=dldarkred, zorder= 10, label= f"cost at w={w}")
        ax[1].hlines(cur_cost, ax[1].get_xlim()[0],w, lw=4, color=dlpurple, ls='dotted')
        ax[1].vlines(w, ax[1].get_ylim()[0],cur_cost, lw=4, color=dlpurple, ls='dotted')
        ax[1].set_title("Cost vs. w, (b fixed at 100)")
        ax[1].set_ylabel('Cost')
        ax[1].set_xlabel('w')
        ax[1].legend(loc='upper center')
        fig.suptitle(f"Minimize Cost: Current Cost = {cur_cost:0.0f}", fontsize=12)
        plt.show()

Full code:

import numpy as np
import matplotlib.pyplot as plt
from lab_utils_uni import plt_intuition, plt_stationary, plt_update_onclick, soup_bowl
plt.style.use('./deeplearning.mplstyle')
x_train=np.array([1,2])
y_train=np.array([300,500])
def compute_cost(x, y, w, b): 
    """
    Computes the cost function for linear regression.
    
    Args:
      x (ndarray (m,)): Data, m examples 
      y (ndarray (m,)): target values
      w,b (scalar)    : model parameters  
    
    Returns
        total_cost (float): The cost of using w,b as the parameters for linear regression
               to fit the data points in x and y
    """
    m=len(x)
    cost_sum=0  # accumulator for the squared errors (must be initialized before the loop)
    for i in range(m):
        cost=x[i]*w+b-y[i]
        cost_sum+=cost**2
    cost_total=(1 / (2 * m)) * cost_sum  
    return cost_total
plt_intuition(x_train,y_train)
# A larger data set
x_train = np.array([1.0, 1.7, 2.0, 2.5, 3.0, 3.2])
y_train = np.array([250, 300, 480,  430,   630, 730,])
plt.close('all') 
fig, ax, dyn_items = plt_stationary(x_train, y_train)
updater = plt_update_onclick(fig, ax, x_train, y_train, dyn_items)
soup_bowl()

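Gradient Descent for Linear Regression

Based on C1_W1_Lab05_Gradient_Descent_Soln
The model $f_{w,b}(x^{(i)})$, the cost $J(w,b)$, the gradient descent update, and the gradients are: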

f_{w,b}(x^{(i)}) = wx^{(i)} + b\tag{1}

J(w,b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})^2\tag{2}

\begin{align*} \text{repeat}&\text{ until convergence:} \; \lbrace \newline \; w &= w - \alpha \frac{\partial J(w,b)}{\partial w} \tag{3} \; \newline b &= b - \alpha \frac{\partial J(w,b)}{\partial b} \newline \rbrace \end{align*}

\begin{align} \frac{\partial J(w,b)}{\partial w} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})x^{(i)} \tag{4}\\ \frac{\partial J(w,b)}{\partial b} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)}) \tag{5}\\ \end{align}

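Gradient descent must update w and b simultaneously: both gradients are computed from the current w and b before either parameter is changed. In code, write the new values into temporaries (temp_w, temp_b) and assign them afterwards, so that the update of b never sees the already-updated w.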
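The effect of the simultaneous update can be seen in a single step. This is a minimal sketch on the two-point data set; the helper name gradients and the value alpha=0.01 are assumptions made for the illustration.

```python
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([300.0, 500.0])

def gradients(w, b):
    """Gradients of J(w,b) following equations (4) and (5)."""
    err = w * x + b - y
    return np.mean(err * x), np.mean(err)

alpha, w, b = 0.01, 0.0, 0.0

# Simultaneous update: both gradients are evaluated at the old (w, b).
dj_dw, dj_db = gradients(w, b)
temp_w = w - alpha * dj_dw
temp_b = b - alpha * dj_db

# Sequential update (not what the algorithm specifies): b's gradient sees the new w.
w_seq = w - alpha * gradients(w, b)[0]
b_seq = b - alpha * gradients(w_seq, b)[1]

print(temp_w, temp_b)
print(w_seq, b_seq)
```

Starting from w=b=0 the gradients are -650 and -400, so the simultaneous step gives w=6.5 and b=4.0, while the sequential variant ends with a different b (3.9025).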
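linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None)
Returns num evenly spaced samples over the interval [start, stop] (stop is included when endpoint=True). The spacing does not have to be an integer.
list.append() adds an element of any type to the end of a list.
For a = [1,2,3,4,5], a[-1] is 5, the last item of the list.
Rounding in Python: math.ceil() rounds up to the nearest integer, math.floor() rounds down to the nearest integer (toward negative infinity), and the built-in round() rounds to the nearest integer, with ties going to the nearest even integer in Python 3.
To print about 10 progress lines over the run, print whenever i % math.ceil(num_iters/10) == 0.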
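The rounding functions and the print interval can be tried on their own. A minimal sketch; num_iters=5000 is an example value.

```python
import math

# Rounding up, rounding down, and rounding down for a negative number.
print(math.ceil(2.1), math.floor(2.9), math.floor(-2.1))
# round() sends ties to the nearest even integer.
print(round(2.5), round(3.5), round(2.6))

num_iters = 5000
interval = math.ceil(num_iters / 10)
printed = [i for i in range(num_iters) if i % interval == 0]
print(interval, len(printed), printed[:3])
```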
Full code

import math, copy
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('./deeplearning.mplstyle')
from lab_utils_uni import plt_house_x, plt_contour_wgrad, plt_divergence, plt_gradients
# Load our data set
x_train = np.array([1.0, 2.0])   #features
y_train = np.array([300.0, 500.0])   #target value
#Function to calculate the cost
def compute_cost(x, y, w, b):
   
    m = x.shape[0] 
    cost = 0
    
    for i in range(m):
        f_wb = w * x[i] + b
        cost = cost + (f_wb - y[i])**2
    total_cost = 1 / (2 * m) * cost

    return total_cost
def compute_gradient(x, y, w, b): 
    """
    Computes the gradient for linear regression 
    Args:
      x (ndarray (m,)): Data, m examples 
      y (ndarray (m,)): target values
      w,b (scalar)    : model parameters  
    Returns
      dj_dw (scalar): The gradient of the cost w.r.t. the parameters w
      dj_db (scalar): The gradient of the cost w.r.t. the parameter b     
     """
    m=x.shape[0]
    dj_dw=dj_db=0
    for i in range(m):
        dj_dw_temp=(1/m)*(w*x[i]+b-y[i])*x[i]
        dj_db_temp=(1/m)*(w*x[i]+b-y[i])
        dj_dw+=dj_dw_temp
        dj_db+=dj_db_temp
    return dj_dw,dj_db
plt_gradients(x_train,y_train, compute_cost, compute_gradient)
plt.show()
def gradient_descent(x, y, w_in, b_in, alpha, num_iters, cost_function, gradient_function): 
    """
    Performs batch gradient descent to fit w,b. Updates w,b by taking 
    num_iters gradient steps with learning rate alpha
    
    Args:
      x (ndarray (m,))  : Data, m examples 
      y (ndarray (m,))  : target values
      w_in,b_in (scalar): initial values of model parameters  
      alpha (float):     Learning rate
      num_iters (int):   number of iterations to run gradient descent
      cost_function:     function to call to produce cost
      gradient_function: function to call to produce gradient
      
    Returns:
      w (scalar): Updated value of parameter after running gradient descent
      b (scalar): Updated value of parameter after running gradient descent
      J_history (List): History of cost values
      p_history (list): History of parameters [w,b] 
      """
    w=w_in
    b=b_in
    J_history=[]
    p_history=[]
    for i in range(num_iters):
        dj_dw,dj_db=gradient_function(x,y,w,b)
        w=w-alpha*dj_dw
        b=b-alpha*dj_db
        if i<100000:      # cap the history size to limit memory use
            J_history.append(cost_function(x,y,w,b))
            p_history.append([w,b])
        # Print the cost at 10 evenly spaced iterations (every iteration if num_iters < 10)
        if i% math.ceil(num_iters/10) == 0:
            print(f"Iteration {i:4}: Cost {J_history[-1]:0.2e} ",
                  f"dj_dw: {dj_dw: 0.3e}, dj_db: {dj_db: 0.3e}  ",
                  f"w: {w: 0.3e}, b:{b: 0.5e}")
    return w,b,J_history,p_history
plt.close()
w_i=0
b_i=0
iterations=5000
alpha=0.02
w_f,b_f,J_h,p_h=gradient_descent(x_train,y_train,w_i,b_i,alpha,iterations,compute_cost,compute_gradient)
print(f'Final parameters after gradient descent: w={w_f}, b={b_f}')
# plot cost versus iteration  
fig, (ax1, ax2) = plt.subplots(1, 2, constrained_layout=True, figsize=(12,4))
ax1.plot(J_h)
ax2.plot(1000 + np.arange(len(J_h[1000:])), J_h[1000:]) # cost from iteration 1000 onward
a=1000 + np.arange(len(J_h[1000:]))
b=J_h[1000:]
a1=len(a); b1=len(b)
print(f'x values of the second plot: {a}; lengths of the x and y arrays: {a1}, {b1}') # inspect the x axis of the second plot
ax1.set_title("Cost vs. iteration");  ax2.set_title("Cost vs. iteration (tail)")
ax1.set_ylabel('Cost')            ;  ax2.set_ylabel('Cost') 
ax1.set_xlabel('iteration step')  ;  ax2.set_xlabel('iteration step') 
plt.show()
# Predict with the learned w, b
print(f"1000 sqft house prediction {w_f*1.0 + b_f:0.1f} Thousand dollars")
print(f"1200 sqft house prediction {w_f*1.2 + b_f:0.1f} Thousand dollars")
print(f"2000 sqft house prediction {w_f*2.0 + b_f:0.1f} Thousand dollars")

Multiple Variable Linear Regression

Based on C1_W2_Lab02_Multiple_Variable_Soln

Size (sqft)	Number of Bedrooms	Number of floors	Age of Home	Price (1000s dollars)
2104	5	1	45	460
1416	3	2	40	232
852	2	1	35	178

\mathbf{X} = \begin{pmatrix} x^{(0)}_0 & x^{(0)}_1 & \cdots & x^{(0)}_{n-1} \\ x^{(1)}_0 & x^{(1)}_1 & \cdots & x^{(1)}_{n-1} \\ \cdots \\ x^{(m-1)}_0 & x^{(m-1)}_1 & \cdots & x^{(m-1)}_{n-1} \end{pmatrix}

notation:

$\mathbf{x}^{(i)}$ is a vector containing example i: $\mathbf{x}^{(i)} = (x^{(i)}_0, x^{(i)}_1, \cdots,x^{(i)}_{n-1})$
$x^{(i)}_j$ is element j in example i. The superscript in parentheses indicates the example number, while the subscript indicates the element.

$\mathbf{w}$ is a vector with $n$ elements.
- Each element contains the parameter associated with one feature.
- In our dataset, n is 4.
- Notionally, we draw this as a column vector:

\mathbf{w} = \begin{pmatrix} w_0 \\ w_1 \\ \cdots\\ w_{n-1} \end{pmatrix}

$b$ is a scalar parameter.
The model's prediction with multiple variables is given by the linear model:

f_{\mathbf{w},b}(\mathbf{x}) = w_0x_0 + w_1x_1 +... + w_{n-1}x_{n-1} + b \tag{1}

or in vector notation:

f_{\mathbf{w},b}(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b \tag{2}
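Equations (1) and (2) describe the same prediction. The sketch below evaluates both forms for the first row of the table, using the initial parameter values that appear later in these notes; the variable names f_loop and f_dot are illustrative.

```python
import numpy as np

x = np.array([2104, 5, 1, 45])     # one example with n = 4 features
w = np.array([0.39133535, 18.75376741, -53.36032453, -26.42131618])
b = 785.1811367994083

# Equation (1): explicit sum over the features.
f_loop = b
for j in range(x.shape[0]):
    f_loop += w[j] * x[j]

# Equation (2): the same prediction written as a dot product.
f_dot = np.dot(w, x) + b

print(f_loop, f_dot)
```

Both forms agree up to floating-point rounding, and the result is very close to the listed price of 460.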

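np.set_printoptions() controls how NumPy arrays are displayed when printed; it does not change the stored values.
np.set_printoptions(precision=None, threshold=None, linewidth=None, suppress=None, formatter=None)
1. precision: number of digits shown after the decimal point for floating-point output (default 8).
2. threshold: total number of array elements above which the output is summarized with an ellipsis (default 1000). A summarized array shows only the first and last few items along each dimension (3 at each end by default, 6 in total); arrays at or below the threshold are printed in full. Setting it to sys.maxsize (requires import sys) always prints every element.
3. linewidth: number of characters per line; further values wrap onto the next line.
4. suppress: if True, floating-point numbers are always printed in fixed-point notation; if False (the default), scientific notation is used for very small values or a very wide range of values.
5. formatter: custom formatting rules for the output.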
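The effect of precision and suppress can be compared side by side. This is a minimal sketch that uses the np.printoptions context manager so the global settings stay untouched; the sample array is made up for the illustration.

```python
import numpy as np

arr = np.array([0.000012345, 1.23456789])

default_text = str(arr)                           # default options
with np.printoptions(precision=2, suppress=True):
    short_text = str(arr)                         # fixed-point, 2 decimals

print(default_text)
print(short_text)
```

With the default options the small value forces scientific notation for the whole array; with suppress=True and precision=2 the values are shown in fixed-point form.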
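x[:,n] selects column n, that is, element n of every row; x[n,:] selects row n, that is, every element of row n.
Reading a whole row or column of a 2-D array:
a = np.array([(1,2,3),(4,5,6),(7,8,9)])
print(a[:,0]) # read the first column
print(a[0]) # read the first row
w is a 1-D array with n elements (not an n-dimensional array), where n is the number of feature columns, so its gradient is allocated as dj_dw=np.zeros((n,)). Pay particular attention to this in def compute_gradient(X, y, w, b):, where dj_dw is a vector and dj_db is a scalar.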
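The row and column slices described above can be checked on the same 3x3 array. A minimal sketch; first_col and first_row are illustrative names.

```python
import numpy as np

a = np.array([(1, 2, 3), (4, 5, 6), (7, 8, 9)])

first_col = a[:, 0]    # element 0 of every row
first_row = a[0, :]    # every element of row 0 (same as a[0])

print(first_col, first_row)
```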
Full code:

import copy, math
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('./deeplearning.mplstyle')
np.set_printoptions(precision=2)  # reduced display precision on numpy arrays
X_train = np.array([[2104, 5, 1, 45], [1416, 3, 2, 40], [852, 2, 1, 35]])
y_train = np.array([460, 232, 178])
b_init = 785.1811367994083
w_init = np.array([ 0.39133535, 18.75376741, -53.36032453, -26.42131618])
print(X_train[0])
def predict_single_loop(x, w, b): 
    """
    single predict using linear regression
    
    Args:
      x (ndarray): Shape (n,) example with multiple features
      w (ndarray): Shape (n,) model parameters    
      b (scalar):  model parameter     
      
    Returns:
      p (scalar):  prediction
    """
    f_wb=np.dot(x,w)+b
    return f_wb
x_0=X_train[0,:]
Yp0=predict_single_loop(x_0,w_init,b_init)
print(f'Prediction with the initial w, b: {Yp0}')
def compute_cost(X, y, w, b): 
    """
    compute cost
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)) : target values
      w (ndarray (n,)) : model parameters  
      b (scalar)       : model parameter
      
    Returns:
      cost (scalar): cost
    """
    m=len(X)
    f_wbi=0
    cost1=0
    for i in range(m):
        f_wbi=predict_single_loop(X[i],w,b)
        cost0=(1/(2*m))*(f_wbi-y[i])**2
        cost1=cost0+cost1
    return cost1
cost_init=compute_cost(X_train,y_train,w_init,b_init)
print(f'Cost with the initial w, b: {cost_init}')
def compute_gradient(X, y, w, b): 
    """
    Computes the gradient for linear regression 
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)) : target values
      w (ndarray (n,)) : model parameters  
      b (scalar)       : model parameter
      
    Returns:
      dj_dw (ndarray (n,)): The gradient of the cost w.r.t. the parameters w. 
      dj_db (scalar):       The gradient of the cost w.r.t. the parameter b. 
    """
    m,n = X.shape           #(number of examples, number of features)
    dj_dw=np.zeros((n,))
    dj_db=0       
    for i in range(m): 
        f_wbi=predict_single_loop(X[i],w,b)
        err=(1/m)*(f_wbi-y[i])
        for j in range(n):
            dj_dw[j]=err*X[i,j]+dj_dw[j]
        dj_db=err+dj_db
    return dj_dw,dj_db
tmp_dj_dw, tmp_dj_db = compute_gradient(X_train, y_train, w_init, b_init)
print(f'dj_db at initial w,b: {tmp_dj_db}')
print(f'dj_dw at initial w,b: \n {tmp_dj_dw}')
def gradient_descent(X, y, w_in, b_in, cost_function, gradient_function, alpha, num_iters): 
    """
    Performs batch gradient descent to learn theta. Updates theta by taking 
    num_iters gradient steps with learning rate alpha
    
    Args:
      X (ndarray (m,n))   : Data, m examples with n features
      y (ndarray (m,))    : target values
      w_in (ndarray (n,)) : initial model parameters  
      b_in (scalar)       : initial model parameter
      cost_function       : function to compute cost
      gradient_function   : function to compute the gradient
      alpha (float)       : Learning rate
      num_iters (int)     : number of iterations to run gradient descent
      
    Returns:
      w (ndarray (n,)) : Updated values of parameters 
      b (scalar)       : Updated value of parameter       
    """
    w=copy.deepcopy(w_in);b=b_in
    J_h=[]
    for i in range(num_iters):
        dj_dw, dj_db = gradient_function(X, y, w, b)
        w=w-alpha*dj_dw
        b=b-alpha*dj_db
        # Save cost J at each iteration
        if i<100000:      # prevent resource exhaustion 
            J_h.append( cost_function(X, y, w, b))

        # Print cost every at intervals 10 times or as many iterations if < 10
        if i% math.ceil(num_iters / 10) == 0:
            print(f"Iteration {i:4d}: Cost {J_h[-1]:8.2f}   ")
        
    return w, b, J_h #return final w,b and J history for graphing
        
# initialize parameters
initial_w = np.zeros_like(w_init)
initial_b = 0.
# some gradient descent settings
iterations = 9000
alpha = 5.0e-7
# run gradient descent 
w_final, b_final, J_hist = gradient_descent(X_train, y_train, initial_w, initial_b,
                                                    compute_cost, compute_gradient, 
                                                    alpha, iterations)
print(f"b,w found by gradient descent: {b_final:0.2f},{w_final} ")
m,_ = X_train.shape
# print(X_train[i])
for i in range(m):
    print(f"prediction: {np.dot(X_train[i], w_final) + b_final:0.2f}, target value: {y_train[i]}")
# plot cost versus iteration  
fig, (ax1, ax2) = plt.subplots(1, 2, constrained_layout=True, figsize=(12, 4))
ax1.plot(J_hist)
ax2.plot(100 + np.arange(len(J_hist[100:])), J_hist[100:])
ax1.set_title("Cost vs. iteration");  ax2.set_title("Cost vs. iteration (tail)")
ax1.set_ylabel('Cost')             ;  ax2.set_ylabel('Cost') 
ax1.set_xlabel('iteration step')   ;  ax2.set_xlabel('iteration step') 
plt.show()
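The double loop in compute_gradient can also be written in matrix form. The sketch below is an illustrative alternative, not the lab's implementation: compute_gradient_loop mirrors the loop structure used above, and compute_gradient_vec (a name introduced here) computes the same gradients with X.T @ err.

```python
import numpy as np

def compute_gradient_loop(X, y, w, b):
    """Loop version: accumulate the error of each example feature by feature."""
    m, n = X.shape
    dj_dw = np.zeros((n,))
    dj_db = 0.0
    for i in range(m):
        err = (np.dot(X[i], w) + b) - y[i]
        for j in range(n):
            dj_dw[j] += err * X[i, j]
        dj_db += err
    return dj_dw / m, dj_db / m

def compute_gradient_vec(X, y, w, b):
    """Vectorized version: the same gradients in matrix form."""
    m = X.shape[0]
    err = X @ w + b - y              # shape (m,)
    return X.T @ err / m, err.mean()

X_train = np.array([[2104, 5, 1, 45], [1416, 3, 2, 40], [852, 2, 1, 35]])
y_train = np.array([460, 232, 178])
w = np.array([0.39133535, 18.75376741, -53.36032453, -26.42131618])
b = 785.1811367994083

dw_loop, db_loop = compute_gradient_loop(X_train, y_train, w, b)
dw_vec, db_vec = compute_gradient_vec(X_train, y_train, w, b)
print(dw_loop, db_loop)
print(dw_vec, db_vec)
```

Both versions should agree up to floating-point rounding; the vectorized form is usually much faster on larger data sets.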

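Feature Scaling and Learning Rate (Multiple Variables)

Based on C1 W2 Lab03. Feature scaling with z-score normalization is used to improve the performance of gradient descent.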

Size (sqft)	Number of Bedrooms	Number of floors	Age of Home	Price (1000s dollars)
952	2	1	65	271.5
1244	3	2	64	232
1947	3	2	17	509.8
...	...	...	...	...

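First load the housing data: 5 columns, of which the first 4 are features and the last is the price. X = data[:,:4] takes every column except the last (price) column, and y = data[:,4] takes the price column.
sharey=True makes the subplots share the same y axis.
On fig versus ax: fig is the whole canvas and each ax is one panel on it. To label the y axis of one panel, use ax[i].set_ylabel("Price (1000's)"); the pyplot equivalents for the current axes are xlabel() and ylabel().
To make plotting with plot_cost_i_w from the helper module easier, the gradient_descent called here returns hist as a dictionary.
np.ceil(x) returns the smallest integer greater than or equal to x, as a float. For example, with save_interval = np.ceil(num_iters/10000), num_iters=9000 gives 1 and num_iters=11000 gives 2. The test i == 0 or i % save_interval == 0 then decides which iterations are recorded, and the values are printed with formatted strings.
hist["params"].append([w,b]) and hist["grads"].append([dj_dw,dj_db]) build lists of pairs; with scalar stand-ins for w and b the result looks like [[0, 1], [1, 1], [2, 1], [3, 1], [4, 1]].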
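The save interval logic can be checked without running gradient descent. A minimal sketch that only counts which iterations would be recorded:

```python
import numpy as np

for num_iters in (9000, 11000):
    save_interval = np.ceil(num_iters / 10000)   # a float: 1.0 or 2.0
    saved = [i for i in range(num_iters) if i == 0 or i % save_interval == 0]
    print(num_iters, save_interval, len(saved))
```

With 9000 iterations every step is recorded; with 11000 iterations only every second step is recorded, which keeps the history at or below 10000 entries.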
The code of this helper function is:

#This version saves more values and is more verbose than the assignment versions
def gradient_descent_houses(X, y, w_in, b_in, cost_function, gradient_function, alpha, num_iters): 
    """
    Performs batch gradient descent to learn theta. Updates theta by taking 
    num_iters gradient steps with learning rate alpha
    
    Args:
      X : (array_like Shape (m,n)    matrix of examples 
      y : (array_like Shape (m,))    target value of each example
      w_in : (array_like Shape (n,)) Initial values of parameters of the model
      b_in : (scalar)                Initial value of parameter of the model
      cost_function: function to compute cost
      gradient_function: function to compute the gradient
      alpha : (float) Learning rate
      num_iters : (int) number of iterations to run gradient descent
    Returns
      w : (array_like Shape (n,)) Updated values of parameters of the model after
          running gradient descent
      b : (scalar)                Updated value of parameter of the model after
          running gradient descent
    """
    
    # number of training examples
    m = len(X)
    
    # An array to store values at each iteration primarily for graphing later
    hist={}
    hist["cost"] = []; hist["params"] = []; hist["grads"]=[]; hist["iter"]=[];
    
    w = copy.deepcopy(w_in)  #avoid modifying global w within function
    b = b_in
    save_interval = np.ceil(num_iters/10000) # prevent resource exhaustion for long runs

    print(f"Iteration Cost          w0       w1       w2       w3       b       djdw0    djdw1    djdw2    djdw3    djdb  ")
    print(f"---------------------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|")

    for i in range(num_iters):

        # Calculate the gradient and update the parameters
        dj_db,dj_dw = gradient_function(X, y, w, b)   

        # Update Parameters using w, b, alpha and gradient
        w = w - alpha * dj_dw               
        b = b - alpha * dj_db               
      
        # Save cost J,w,b at each save interval for graphing
        if i == 0 or i % save_interval == 0:     
            hist["cost"].append(cost_function(X, y, w, b))
            hist["params"].append([w,b])
            hist["grads"].append([dj_dw,dj_db])
            hist["iter"].append(i)

        # Print cost every at intervals 10 times or as many iterations if < 10
        if i% math.ceil(num_iters/10) == 0:
            #print(f"Iteration {i:4d}: Cost {cost_function(X, y, w, b):8.2f}   ")
            cst = cost_function(X, y, w, b)
            print(f"{i:9d} {cst:0.5e} {w[0]: 0.1e} {w[1]: 0.1e} {w[2]: 0.1e} {w[3]: 0.1e} {b: 0.1e} {dj_dw[0]: 0.1e} {dj_dw[1]: 0.1e} {dj_dw[2]: 0.1e} {dj_dw[3]: 0.1e} {dj_db: 0.1e}")
       
    return w, b, hist #return w,b and history for graphing

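Three different techniques for feature scaling:
Feature scaling: essentially divide each feature by a user-selected value so that the result falls in a range between -1 and 1.
Mean normalization: $x_i := \dfrac{x_i - \mu_i}{max - min}$
Z-score normalization, which we use below: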

x^{(i)}_j = \dfrac{x^{(i)}_j - \mu_j}{\sigma_j} \tag{4}

\begin{align} \mu_j &= \frac{1}{m} \sum_{i=0}^{m-1} x^{(i)}_j \tag{5}\\ \sigma^2_j &= \frac{1}{m} \sum_{i=0}^{m-1} (x^{(i)}_j - \mu_j)^2 \tag{6} \end{align}

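Implementation note: when normalizing the features, it is important to store the values used for normalization, namely the mean and standard deviation of each feature. After learning the parameters of the model, we often want to predict the price of a house we have not seen before. Given a new x value (living area and number of bedrooms), we must first normalize x using the mean and standard deviation that were previously computed from the training set.
Here the whole X (m examples, n features) is z-score normalized column by column with np.mean() and np.std(). np.ptp(array) is equivalent to np.max(array) - np.min(array).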
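The implementation note can be made concrete with a short sketch. It normalizes the three example rows from the table above and then scales a new example with the stored training statistics; the helper name zscore_normalize and the new example x_new are assumptions made for the illustration.

```python
import numpy as np

def zscore_normalize(X):
    """Column-wise z-score; also returns mu and sigma so they can be reused."""
    mu = np.mean(X, axis=0)
    sigma = np.std(X, axis=0)
    return (X - mu) / sigma, mu, sigma

# Size (sqft), bedrooms, floors, age for the three rows shown in the table.
X_train = np.array([[952, 2, 1, 65], [1244, 3, 2, 64], [1947, 3, 2, 17]], dtype=float)
X_norm, mu, sigma = zscore_normalize(X_train)

# A new, unseen example is scaled with the TRAINING statistics, not its own.
x_new = np.array([1200, 3, 1, 40], dtype=float)
x_new_norm = (x_new - mu) / sigma

print(np.ptp(X_train, axis=0), np.ptp(X_norm, axis=0))
print(x_new_norm)
```

Normalizing the new example by its own mean and standard deviation would mix the four features together and produce inputs the model was never trained on.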
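The results look good. A few points to note:
With multiple features, a single plot can no longer show the result against all features.
The plots were generated with normalized features. Any prediction that uses parameters learned from the normalized training set must also normalize its input, so later data to be predicted has to go through the same normalization.
Full code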

import numpy as np
np.set_printoptions(precision=2)
import matplotlib.pyplot as plt
dlblue = '#0096ff'; dlorange = '#FF9300'; dldarkred='#C00000'; dlmagenta='#FF40FF'; dlpurple='#7030A0'; 
plt.style.use('./deeplearning.mplstyle')
from lab_utils_multi import  load_house_data, compute_cost, run_gradient_descent 
from lab_utils_multi import  norm_plot, plt_contour_multi, plt_equal_scale, plot_cost_i_w
# Load the data
X_train,y_train=load_house_data()
X_features = ['size(sqft)','bedrooms','floors','age']  
# Plot each feature against price to inspect the data set and its features.
fig,ax=plt.subplots(1,4,figsize=(12,3),sharey=True)
for i in range(len(ax)):
    ax[i].scatter(X_train[:,i],y_train,c='b')
    ax[i].set_xlabel(X_features[i])
ax[0].set_ylabel('Price')
# plt.show()
# set alpha to 1e-7
_, _, hist = run_gradient_descent(X_train, y_train, 10, alpha = 1e-7)
plot_cost_i_w(X_train, y_train, hist)
# num_iters=11000
# for i in range(num_iters):
#     save_interval = np.ceil(num_iters/10000)
#     print(save_interval)
# b=1
# hist={}
# hist["cost"] = []; hist["params"] = []; hist["grads"]=[]; hist["iter"]=[];
# for w in range(5):
#     hist["params"].append([w,b])
# print(hist["params"])
def zscore_normalize_features(X):
    """
    computes  X, zcore normalized by column
    
    Args:
      X (ndarray): Shape (m,n) input data, m examples, n features
      
    Returns:
      X_norm (ndarray): Shape (m,n)  input normalized by column
      mu (ndarray):     Shape (n,)   mean of each feature
      sigma (ndarray):  Shape (n,)   standard deviation of each feature
    """
    mu=np.mean(X,axis=0)
    sigma=np.std(X,axis=0)
    X_norm=(X-mu)/sigma
    return X_norm, mu, sigma
# Walk through the steps of z-score normalization. The plots below show the transformation step by step, for the size feature versus age.
X_norm, mu, sigma=zscore_normalize_features(X_train)
X_mean = (X_train - mu)
fig,ax=plt.subplots(1, 3, figsize=(12, 3))
ax[0].scatter(X_train[:,0], X_train[:,3])
ax[0].set_xlabel(X_features[0]); ax[0].set_ylabel(X_features[3]);
ax[0].set_title("unnormalized")
ax[0].axis('equal')

ax[1].scatter(X_mean[:,0], X_mean[:,3])
ax[1].set_xlabel(X_features[0]); ax[1].set_ylabel(X_features[3]);
ax[1].set_title(r"X - $\mu$")
ax[1].axis('equal')

ax[2].scatter(X_norm[:,0], X_norm[:,3])
ax[2].set_xlabel(X_features[0]); ax[2].set_ylabel(X_features[3]);
ax[2].set_title(r"Z-score normalized")
ax[2].axis('equal')
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
fig.suptitle("distribution of features before, during, after normalization")
plt.show()
"""
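The plot above shows the relationship between two of the training-set features, "age" and "size(sqft)". They are plotted with equal scale.

Left: unnormalized. The range of values (the variance) of the "size(sqft)" feature is much larger than that of age.
Middle: the first step subtracts the mean from each feature, which leaves features centered around zero. The difference is hard to see for the "age" feature, but "size(sqft)" is clearly around zero.
Right: the second step divides by the standard deviation. This leaves both features centered at zero with a similar scale.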
"""
# normalize the original features
# X_norm, X_mu, X_sigma = zscore_normalize_features(X_train)
print(f"X_mu = {mu}, \nX_sigma = {sigma}")
print(f"Peak to Peak range by column in Raw        X:{np.ptp(X_train,axis=0)}")   
print(f"Peak to Peak range by column in Normalized X:{np.ptp(X_norm,axis=0)}")

fig,ax=plt.subplots(1, 4, figsize=(12, 3))
for i in range(len(ax)):
    norm_plot(ax[i],X_train[:,i],)
    ax[i].set_xlabel(X_features[i])
ax[0].set_ylabel("count");
fig.suptitle("distribution of features before normalization")
plt.show()
fig,ax=plt.subplots(1,4,figsize=(12,3))
for i in range(len(ax)):
    norm_plot(ax[i],X_norm[:,i],)
    ax[i].set_xlabel(X_features[i])
ax[0].set_ylabel("count"); 
fig.suptitle(f"distribution of features after normalization")
plt.show()

w_norm, b_norm, hist = run_gradient_descent(X_norm, y_train, 1000, 1.0e-1, )
#predict target using normalized features
m=X_norm.shape[0];yp=np.zeros(m)
for i in range(m):
    yp[i]=np.dot(X_norm[i],w_norm)+b_norm
 # plot predictions and targets versus original features 
fig,ax=plt.subplots(1,4,figsize=(12,3),sharey=True) 
for i in range(len(ax)):
    ax[i].scatter(X_norm[:,i],y_train,label='target')
    ax[i].scatter(X_norm[:,i],yp,label='predict',c=dlorange)
    ax[i].set_xlabel(X_features[i])
ax[0].set_ylabel("Price"); ax[0].legend();
fig.suptitle("target versus prediction using z-score normalized model")
plt.show()
# First, normalize our example with the training-set mean and standard deviation.
x_house = np.array([1200, 3, 1, 40])
x_house_norm = (x_house - mu) / sigma
x_house_predict = np.dot(x_house_norm, w_norm) + b_norm
print(f'Predicted price of the house: {x_house_predict}')
plt_equal_scale(X_train, X_norm, y_train)

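Feature Engineering and Polynomial Regression

Based on C1_W2_Lab04_FeatEng_PolyReg_Soln. This example applies when the target is a nonlinear combination of the features.
What does the @ symbol do in Python? In short, it is used for decorator syntax and for matrix multiplication; for 1-D and 2-D arrays, a @ b is equivalent to np.dot(a, b).
Common uses of NumPy's reshape, where -1 lets NumPy infer that dimension:
reshape(1,-1) converts to 1 row
reshape(2,-1) converts to 2 rows
reshape(-1,1) converts to 1 column
reshape(-1,2) converts to 2 columns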
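The four reshape calls listed above can be tried on a small array. A minimal sketch with six elements, so that every shape divides evenly:

```python
import numpy as np

x = np.arange(6)                   # shape (6,)

one_row = x.reshape(1, -1)         # 1 row, columns inferred
two_rows = x.reshape(2, -1)        # 2 rows, columns inferred
one_col = x.reshape(-1, 1)         # rows inferred, 1 column
two_cols = x.reshape(-1, 2)        # rows inferred, 2 columns

print(one_row.shape, two_rows.shape, one_col.shape, two_cols.shape)
```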
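In scientific notation, a number is written as the product of a real number between 1 and 10 (the mantissa) and a power of 10. To keep the representation unique, the mantissa is always less than 10: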
782300=7.823×10^5
0.00012=1.2×10^(−4)
10000=1×10^4
Computers and calculators usually write the power of 10 with E or e (for "exponent"); for example, 1e-6 means 1 times 10 to the power of -6:
7.823E5=782300
1.2e−4=0.00012
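Writing out every digit of a very large or very small number makes its size hard to read and can waste a lot of space. In scientific notation the order of magnitude, the precision, and the value are all explicit.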
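Python accepts the e notation directly in literals and can produce it with format specifiers. A minimal sketch using the numbers from the examples above:

```python
a = 7.823e5        # 782300.0
b = 1.2e-4         # 0.00012

big_text = f"{782300:.3e}"      # 3 digits after the decimal point
small_text = f"{0.00012:.1e}"   # 1 digit after the decimal point

print(a, b)
print(big_text, small_text)
print(float("1e-6"))
```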
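When reading the code, note that gradient descent here expects X as a 2-D array and y as a 1-D array, so the input may need reshaping:
X : (array_like Shape (m,n) matrix of examples
y : (array_like Shape (m,)) target value of each example
array_like Shape (m,n) is 2-D; array_like Shape (m,) is 1-D.
Well, as expected, not a great fit. What is needed is something like $y = w_0x_0^2 + b$, that is, a polynomial feature. To accomplish this, you can modify the input data to engineer the needed features. If you swap the original data for a version that squares the x value, you can achieve $y = w_0x_0^2 + b$.
Let's try it: replace X with X**2. The model is still linear in the single engineered feature, and the fitted curve almost coincides with the data.
np.c_ concatenates arrays column-wise. Running model_w,model_b = run_gradient_descent_feng(X, y1, iterations=9000, alpha=1e-5) raised the warning RuntimeWarning: overflow encountered in scalar add cost = cost + (f_wb_i - y[i])**2
The cause is a learning rate alpha that is too large for the unscaled features: the updates diverge and the values overflow. With higher-order features, remember to normalize in order to prevent overflow, especially when fitting nonlinear curves. Conversely, in w_z,b_z=run_gradient_descent_feng(X1, y1, iterations=100000, alpha=1e-7) the value alpha=1e-7 is so small that the cost barely decreases even after that many iterations; with alpha=1e-1 the curve fits. This shows how much the choice of alpha matters.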
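The interaction between feature scale and alpha can be reproduced in a few lines. This is a minimal sketch, not the lab helper: run_gd is a plain batch gradient descent written here as a stand-in for run_gradient_descent_feng, and the iteration counts are chosen only to keep the run short.

```python
import numpy as np

def run_gd(X, y, alpha, iters):
    """Plain batch gradient descent for linear regression (illustrative stand-in)."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(iters):
        err = X @ w + b - y
        w = w - alpha * (X.T @ err) / m
        b = b - alpha * err.mean()
    cost = np.mean((X @ w + b - y) ** 2) / 2
    return w, b, cost

x = np.arange(0, 20, 1).astype(float)
y = x ** 2
X = np.c_[x, x ** 2, x ** 3]                        # engineered features
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)       # z-score by column

initial_cost = np.mean(y ** 2) / 2                  # cost at w = 0, b = 0

with np.errstate(all="ignore"):                     # silence the overflow warnings
    _, _, cost_raw = run_gd(X, y, alpha=1e-1, iters=200)
_, _, cost_norm = run_gd(X_norm, y, alpha=1e-1, iters=1000)

print(np.isfinite(cost_raw), initial_cost, cost_norm)
```

On the raw features the same alpha that works after normalization makes the updates diverge until the cost is no longer a finite number, while on the normalized features the cost falls below its starting value.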
Full code

import numpy as np
import matplotlib.pyplot as plt
from lab_utils_multi import zscore_normalize_features, run_gradient_descent_feng
np.set_printoptions(precision=2)  # reduced display precision on numpy arrays
# Create the raw data
x=np.arange(0,20,1)
x_train=x.reshape(-1,1)
y=1+x**2
print(type(y),type(x_train))
print(f'y has shape {y.shape}, x has shape {x.shape}, x_train has shape {x_train.shape}')
print(f'Training inputs: {x_train}, targets: {y}')
w1,b1=run_gradient_descent_feng(x_train,y,iterations=9000,alpha =1e-5)
print(f'w1, b1 after gradient descent: {w1}, {b1}')
plt.scatter(x,y,marker='o',c='b',label='target data')
plt.scatter(x,np.dot(x_train,w1)+b1,marker='x',c='r',label='predicted data')
plt.legend()
plt.show()
# Engineered feature: use x**2 as the single input
x_train2=x**2;x_train2=x_train2.reshape(-1,1)
w2,b2=run_gradient_descent_feng(x_train2,y,iterations=9000,alpha =1e-7)
plt.scatter(x,y,marker='o',c='b',label='target data')
plt.scatter(x,np.dot(x_train,w1)+b1,marker='x',c='r',label='predicted data')
plt.scatter(x,np.dot(x_train2,w2)+b2,marker='+',c='r',label='predicted data (x**2 feature)')
plt.legend()
plt.show()


y1=x**2
X=np.c_[x, x**2, x**3]
print(X)
model_w,model_b = run_gradient_descent_feng(X, y1, iterations=9000, alpha=1e-7)
plt.scatter(x, y1, marker='x', c='r', label="Actual Value"); plt.title("x, x**2, x**3 features")
plt.plot(x, X@model_w + model_b, label="Predicted Value"); plt.xlabel("x"); plt.ylabel("y"); plt.legend(); plt.show()
X_feature=['X','X^2','X^3']
# Below, it is clear that the x^2 feature is linear in the target value, so linear regression can easily build a model from that feature.
fig,ax=plt.subplots(1,3,figsize=(9,3),sharey=True)
for i in range(len(ax)):
    ax[i].scatter(X[:,i],y)
    ax[i].set_xlabel(X_feature[i])
ax[0].set_ylabel("y")
plt.show()
# Z-score normalization
X1=zscore_normalize_features(X,rtn_ms=False)
print(f'Peak-to-peak range by column in raw X: {np.ptp(X,axis=0)}; after z-score normalization: {np.ptp(X1,axis=0)}')
w_z,b_z=run_gradient_descent_feng(X1, y1, iterations=100000, alpha=1e-1)
plt.scatter(x, y1, marker='x', c='r', label="Actual Value"); plt.title("x, x**2, x**3 features")
plt.plot(x, X@model_w + model_b, label="Predicted Value")
plt.plot(x,X1@w_z + b_z, label="Z-Predicted Value")
plt.xlabel("x"); plt.ylabel("y"); 
plt.legend(); plt.show()
# Fit cos
y2=np.cos(x/2);X2=np.c_[x, x**2, x**3,x**4, x**5, x**6, x**7, x**8, x**9, x**10, x**11, x**12, x**13]
X2=zscore_normalize_features(X2)
w_c,b_c=run_gradient_descent_feng(X2, y2, iterations=2000000, alpha = 1e-1)
plt.title("Fit: cos");plt.scatter(x, y2, marker='x', c='r', label="Actual Value"); 
plt.plot(x, X2@w_c + b_c, label="Predicted Value")
plt.xlabel("x"); plt.ylabel("y"); 
plt.legend(); plt.show()

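Classification

Based on C1_W3_LAB01_Classification
Examples of classification problems include identifying an email as spam or not spam, and determining whether a tumor is malignant or benign. These are examples of binary classification, where there are two possible outcomes. The outcomes can be described in pairs such as "positive"/"negative", "yes"/"no", "true"/"false", or "1"/"0".
Plots of classification data sets often use symbols to indicate the outcome of an example. In the plots below, "X" represents a positive value and "O" a negative outcome.
Try printing X_train[y_train == i] first. This is boolean-mask indexing (not a comprehension): it returns the records of X_train for which y_train == i holds, and a trailing [0] then takes the element at index 0. pos = y_train == 1 in this code serves the same purpose; print(pos) gives [False False False True True True].
A classic mistake: plt.xlabel="x";plt.ylabel="y" should be plt.xlabel("x");plt.ylabel("y"). The assignment form replaces the function with a string instead of calling it.
Titles: plt.suptitle('one variable plot') sets the figure title, and ax[0].set_title('one variable plot') sets the title of one axes. The ax methods for axis settings carry a set_ prefix that the plt functions lack, for example plt.ylim(-0.2,1.1) versus ax.set_ylim(-0.2,1.1).
plt versus ax:
(1) plt.plot() uses the current figure and axes, creating them implicitly if none exist, and draws there.
(2) fig, ax = plt.subplots() creates the fig and ax objects explicitly, and ax.plot() then draws on that axes. This is the recommended approach.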
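The boolean-mask indexing used to split the examples into positive and negative groups can be tried on its own. A minimal sketch with the one-variable data from these notes:

```python
import numpy as np

x_train = np.array([0., 1, 2, 3, 4, 5])
y_train = np.array([0, 0, 0, 1, 1, 1])

pos = y_train == 1          # boolean mask, one entry per example
neg = y_train == 0

print(pos)
print(x_train[pos], x_train[neg])
print(x_train[pos][0])      # first positive example
```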
One-variable classification plot

import numpy as np
# %matplotlib widget
import matplotlib.pyplot as plt
plt.style.use('./deeplearning.mplstyle')
plt.rcParams['font.size'] = 8
from lab_utils_common import *
from plt_one_addpt_onclick import plt_one_addpt_onclick

x_train = np.array([0., 1, 2, 3, 4, 5])
y_train = np.array([0,  0, 0, 1, 1, 1])
x_train2 = np.array([[0.5, 1.5], [1,1], [1.5, 0.5], [3, 0.5], [2, 2], [1, 2.5]])
y_train2 = np.array([0, 0, 0, 1, 1, 1])
pos = y_train == 1
neg = y_train == 0
print(f'x_train[pos]={x_train[pos]} is equivalent to x_train[y_train == 1]={x_train[y_train == 1]}')
print(pos)
# ax.scatter(x_train[pos],y_train[pos],marker='x',c='r')
# ax.scatter(x_train[neg],y_train[neg],marker='o',c='b')
# ax.set_xlabel("x");ax.set_ylabel("y")
plt.scatter(x_train[pos],y_train[pos],marker='x',c='r',label='y=1')
plt.scatter(x_train[neg],y_train[neg],marker='o',c='b',label='y=0')
plt.suptitle('one variable plot');plt.ylim(-0.2,1.1)
plt.xlabel("x");plt.ylabel("y")
plt.legend()
plt.show()

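Logistic Regression with the Sigmoid Function

As discussed in the lecture videos, for a classification task we can start from the linear regression model. However, the output of a classifier should lie between 0 and 1, because the target y is either 0 or 1. This can be achieved with the "sigmoid function", which maps every input value to a value between 0 and 1. If the input is an array of numbers, the sigmoid function should be applied to each value in the array.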

g(z) = \frac{1}{1+e^{-z}}\tag{1}

A logistic regression model applies the sigmoid to the familiar linear regression model, as shown below:

f_{\mathbf{w},b}(\mathbf{x}) = g(\mathbf{w} \cdot \mathbf{x} + b ) \tag{2}

where

g(z) = \frac{1}{1+e^{-z}}\tag{3}

and

z = \mathbf{w} \cdot \mathbf{x} + b \tag{4}
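Equations (2) to (4) translate into a few lines of NumPy. This is a minimal sketch; the parameter values w, b and the example x are assumptions chosen for the illustration.

```python
import numpy as np

def sigmoid(z):
    """Equation (3), applied element-wise to scalars or arrays."""
    return 1 / (1 + np.exp(-z))

w = np.array([1.0, 1.0])    # assumed example parameters
b = -3.0
x = np.array([2.0, 2.0])    # one example with two features

z = np.dot(w, x) + b        # equation (4)
f = sigmoid(z)              # equation (2)

print(sigmoid(np.array([-10.0, 0.0, 10.0])))
print(z, f)
```

The sigmoid of 0 is exactly 0.5, large negative inputs approach 0, and large positive inputs approach 1.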

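Let's apply logistic regression to the categorical data example of tumor classification.
First, load the examples and the initial parameter values. Then try the following steps:
Click "Run Logistic Regression" to find the best logistic regression model for the given training data.
Note that the resulting model fits the data quite well.
Note that the orange line is z from equation (4) above. It does not match the line of a linear regression model. The results can be improved further by applying a threshold.
Tick the box "Toggle 0.5 threshold" to show the predictions when a threshold is applied.
These predictions look good: they match the data. Now add more data points in the large tumor size range (near 10) and re-run logistic regression.
Unlike the linear regression model, this model continues to make correct predictions.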

import numpy as np
# %matplotlib widget
import matplotlib.pyplot as plt
plt.style.use('./deeplearning.mplstyle')
plt.rcParams['font.size'] = 8
from plt_one_addpt_onclick import plt_one_addpt_onclick
from lab_utils_common import draw_vthresh
def sigmoid(z):
    g=1/(1+np.exp(-z))
    return g
z=np.arange(-10,20)
print(z)
g_z=sigmoid(z)
print(g_z)
print(np.c_[z,g_z])
fig,ax = plt.subplots(1,1,figsize=(5,3))
# Plot z vs sigmoid(z)
ax.plot(z, g_z, c="b")

ax.set_title("Sigmoid function")
ax.set_ylabel('sigmoid(z)')
ax.set_xlabel('z')
draw_vthresh(ax,0)
plt.show()

Decision boundary

Suppose you want to train a logistic regression model on this data, of the form
$f(x) = g(w_0x_0+w_1x_1 + b)$
where $g(z) = \frac{1}{1+e^{-z}}$ , which is the sigmoid function

Let's say that you trained the model and get the parameters as $b = -3, w_0 = 1, w_1 = 1$ . That is,
$f(x) = g(x_0+x_1-3)$
Recall that for logistic regression, the model is represented as

f_{\mathbf{w},b}(x) = g(\mathbf{w} \cdot \mathbf{x} + b) \tag{1}

where $g(z)$ is known as the sigmoid function and it maps all input values to values between 0 and 1:

g(z) = \frac{1}{1+e^{-z}}\tag{2}

and $\mathbf{w} \cdot \mathbf{x}$ is the vector dot product:

\mathbf{w} \cdot \mathbf{x} = w_0 x_0 + w_1 x_1

We interpret the output of the model ( $f_{\mathbf{w},b}(x)$ ) as the probability that $y=1$ given $x$ and parameterized by $w$ and $b$ .
Therefore, to get a final prediction ( $y=0$ or $y=1$ ) from the logistic regression model, we can use the following heuristic -
if $f_{\mathbf{w},b}(x) >= 0.5$ , predict $y=1$
if $f_{\mathbf{w},b}(x) < 0.5$ , predict $y=0$
Let's plot the sigmoid function to see where $g(z) >= 0.5$
As you can see, $g(z) >= 0.5$ for $z >= 0$.
For a logistic regression model, $z = \mathbf{w} \cdot \mathbf{x} + b$. Therefore,

if $\mathbf{w} \cdot \mathbf{x} + b >= 0$, the model predicts $y=1$
if $\mathbf{w} \cdot \mathbf{x} + b < 0$, the model predicts $y=0$
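The threshold rule can be sketched in a few lines of NumPy. This is a minimal illustration, not the lab's code: the helper name `predict` is hypothetical, and the parameters are the example values $w_0 = 1, w_1 = 1, b = -3$ quoted earlier.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, w, b, threshold=0.5):
    """Return a 0/1 prediction for each row of X (hypothetical helper)."""
    z = X @ w + b                      # z = w . x + b for every example
    return (sigmoid(z) >= threshold).astype(int)

X = np.array([[0.5, 1.5], [1, 1], [1.5, 0.5], [3, 0.5], [2, 2], [1, 2.5]])
w = np.array([1.0, 1.0])               # example parameters from the text
b = -3.0

y_hat = predict(X, w, b)
# With a 0.5 threshold, testing g(z) >= 0.5 is the same as testing z >= 0
same_as_sign_test = np.array_equal(y_hat, (X @ w + b >= 0).astype(int))
print(y_hat, same_as_sign_test)
```

Because the sigmoid is monotonic and $g(0) = 0.5$, the prediction only depends on the sign of $z$, which is why the decision boundary is the set of points where $z = 0$.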
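Now, let's go back to our example to understand how the logistic regression model makes predictions.

Our logistic regression model has the form

$f(x) = g(-3 + x_0+x_1)$
From what you have learnt above, you can see that this model predicts $y=1$ if $-3 + x_0+x_1 >= 0$

Let's see what this looks like graphically. We start by plotting $-3 + x_0+x_1 = 0$, which is equivalent to $x_1 = 3 - x_0$.
The line $-3 + x_0+x_1 = 0$ is called the decision boundary. Points below the line (the shaded region in the plot) are predicted $y=0$; points on or above it are predicted $y=1$.
Full code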

import numpy as np
# %matplotlib widget
import matplotlib.pyplot as plt
plt.style.use('./deeplearning.mplstyle')
plt.rcParams['font.size'] = 8
from lab_utils_common import plot_data, dlc, sigmoid, draw_vthresh
from lab_utils_multi import zscore_normalize_features, run_gradient_descent_feng
np.set_printoptions(precision=2)
X = np.array([[0.5, 1.5], [1,1], [1.5, 0.5], [3, 0.5], [2, 2], [1, 2.5]])
y = np.array([0, 0, 0, 1, 1, 1])
print(y.shape);w,b=run_gradient_descent_feng(X,y,iterations=10000, alpha = 1e-3)
# Print the fitted multivariate function; the bias b is appended once, after all the w terms
wb=''
for i in range(len(w)):
    wb+=f'({w[i]}*x[{i}])+'
wb+=f'({b})'
print(wb)
x0=np.arange(0,6)
x1=3-x0
fig,ax=plt.subplots(1,1,figsize=(5,5))
ax.plot(x0,x1,c='b');ax.axis([0,5,0,5])
ax.set_xlabel('$x_0$');ax.set_ylabel('$x_1$')
ax.fill_between(x0,x1,facecolor='green', alpha=0.3)
plot_data(X,y,ax)
plt.show()

Logistic regression loss

Based on C1_W3_Lab05_Cost. Let's get a surface plot of the cost using the squared error cost:

J(w,b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})^2

where

f_{w,b}(x^{(i)}) = sigmoid(wx^{(i)} + b )

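The resulting surface is far from smooth: with the sigmoid inside $f_{w,b}$ the squared error cost is not convex, so gradient descent can get stuck in local minima. Logistic regression therefore uses a different loss.
Loss is a measure of the difference between a single example's prediction and its target value, while
cost is a measure of the losses over the training set.
It is defined as follows: $loss(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)})$ is the cost for a single data point, which is:
$$loss(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)}) = -y^{(i)} \log\left(f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) - \left( 1 - y^{(i)}\right) \log \left( 1 - f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) \tag{2}$$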
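To get a feel for the loss formula, the sketch below evaluates it for a confident correct prediction and a confident wrong one. The function name `logistic_loss` and the probe values 0.9 and 0.1 are chosen here purely for illustration.

```python
import numpy as np

def logistic_loss(f, y):
    """Loss for one example: -y*log(f) - (1-y)*log(1-f), valid for 0 < f < 1."""
    return -y * np.log(f) - (1 - y) * np.log(1 - f)

good = logistic_loss(0.9, 1)    # target 1, prediction close to 1: small loss
bad = logistic_loss(0.1, 1)     # target 1, prediction close to 0: large loss
mirror = logistic_loss(0.1, 0)  # target 0, prediction close to 0: small loss again
print(good, bad, mirror)
```

Only one of the two terms is active for a given example: the first when $y^{(i)}=1$, the second when $y^{(i)}=0$. The loss grows without bound as the prediction approaches the wrong extreme.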

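Logistic regression cost

The cost is a measure of the losses over the training set (their average). For logistic regression, the cost function is of the form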

J(\mathbf{w},b) = \frac{1}{m} \sum_{i=0}^{m-1} \left[ loss(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)}) \right] \tag{1}

where

$loss(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)})$ is the cost for a single data point, as given in the previous section,
m is the number of training examples in the data set, and:

\begin{align} f_{\mathbf{w},b}(\mathbf{x}^{(i)}) &= g(z^{(i)})\tag{3} \\ z^{(i)} &= \mathbf{w} \cdot \mathbf{x}^{(i)}+ b\tag{4} \\ g(z^{(i)}) &= \frac{1}{1+e^{-z^{(i)}}}\tag{5} \end{align}

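compute_cost_logistic
The algorithm loops over all the examples, computing the loss for each one and accumulating the total:
Create a variable outside the loop to store the cost.
Loop over all examples in the training set.
Calculate z_i, equation (4) above.
Predict f_wb_i = g(z_i), where g is the sigmoid function, equation (3) above. The sigmoid (5) is provided as a library function.
Calculate the loss for this example, equation (2) above.
Add this loss to the total cost variable created outside the loop.
After the loop, return the total divided by the number of examples.
Note that gz=sigmoid(np.dot(w,X[i])+b) must also sit inside the loop: g(z_i) is a single number, because w and X[i] are both 1-D arrays of shape (n,) and their dot product is a scalar.
The full code follows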

import numpy as np
# %matplotlib widget
import matplotlib.pyplot as plt
plt.style.use('./deeplearning.mplstyle')
plt.rcParams['font.size'] = 8
from lab_utils_common import  plot_data, sigmoid, dlc
import numpy as np
X = np.array([[0.5, 1.5], [1,1], [1.5, 0.5], [3, 0.5], [2, 2], [1, 2.5]])
y = np.array([0, 0, 0, 1, 1, 1])
fig,ax = plt.subplots(1,1,figsize=(4,4))
plot_data(X, y, ax)

# Set both axes to be from 0-4
ax.axis([0, 4, 0, 3.5])
ax.set_ylabel('$x_1$', fontsize=12)
ax.set_xlabel('$x_0$', fontsize=12)
plt.show()
def compute_cost_logistic(X, y, w, b):
    """
    Computes cost

    Args:
      X (ndarray): Shape (m,n) matrix of examples with n features
      y (ndarray): Shape (m,)  target values
      w (ndarray): Shape (n)   parameters for prediction   
      b (scalar):              parameter  for prediction 
      
    Returns:
      cost (scalar): cost
    """
    m=X.shape[0]
    loss=0
    for i in range(m):
        gz=sigmoid(np.dot(w,X[i])+b)
        loss_i=-y[i]*np.log(gz)-(1-y[i])*np.log(1-gz)
        loss+=loss_i
    loss=(1/m)*loss
    return loss    
# Check with test values
w=np.array([1,1]);b=-3;print(y.shape,w.shape,w)
print(compute_cost_logistic(X, y, w, b))
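The loop in compute_cost_logistic can also be written without an explicit for loop. The following is a sketch of one vectorized alternative, not the lab's implementation; the name `compute_cost_logistic_vec` is made up here, and `sigmoid` is redefined locally so the block does not depend on lab_utils_common.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compute_cost_logistic_vec(X, y, w, b):
    """Vectorized logistic cost: mean of the per-example losses."""
    f = sigmoid(X @ w + b)                            # predictions, shape (m,)
    loss = -y * np.log(f) - (1 - y) * np.log(1 - f)   # per-example loss, shape (m,)
    return loss.mean()

X = np.array([[0.5, 1.5], [1, 1], [1.5, 0.5], [3, 0.5], [2, 2], [1, 2.5]])
y = np.array([0, 0, 0, 1, 1, 1])
w = np.array([1.0, 1.0])
b = -3.0

cost = compute_cost_logistic_vec(X, y, w, b)
print(cost)
```

X @ w computes all m dot products at once, so the result should agree with the loop version up to floating-point rounding.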

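Gradient descent for logistic regression

Based on C1_W3_Lab6. In this lab you will: update the gradient descent code for logistic regression, and explore gradient descent on a familiar data set. Note that reshape(-1,) flattens data into the 1-D form (m,).
Recall that the gradient descent algorithm uses the gradient calculation: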

\begin{align*} &\text{repeat until convergence:} \; \lbrace \\ & \; \; \;w_j = w_j - \alpha \frac{\partial J(\mathbf{w},b)}{\partial w_j} \tag{1} \; & \text{for j := 0..n-1} \\ & \; \; \; \; \;b = b - \alpha \frac{\partial J(\mathbf{w},b)}{\partial b} \\ &\rbrace \end{align*}

Where each iteration performs simultaneous updates on $w_j$ for all $j$ , where

\begin{align*} \frac{\partial J(\mathbf{w},b)}{\partial w_j} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - \mathbf{y}^{(i)})x_{j}^{(i)} \tag{2} \\ \frac{\partial J(\mathbf{w},b)}{\partial b} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - \mathbf{y}^{(i)}) \tag{3} \end{align*}

m is the number of training examples in the data set
$f_{\mathbf{w},b}(x^{(i)})$ is the model's prediction, while $y^{(i)}$ is the target
For a logistic regression model
$z = \mathbf{w} \cdot \mathbf{x} + b$
$f_{\mathbf{w},b}(x) = g(z)$
where $g(z)$ is the sigmoid function:
$g(z) = \frac{1}{1+e^{-z}}$
The gradient descent algorithm implementation has two components:

The loop implementing equation (1) above. This is gradient_descent below and is generally provided to you in optional and practice labs.
The calculation of the current gradient, equations (2,3) above. This is compute_gradient_logistic below. You will be asked to implement it in this week's practice lab.
compute_gradient_logistic implements equations (2) and (3) above for all $w_j$ and $b$ .
There are many ways to implement this. One is outlined below:
initialize variables to accumulate dj_dw and dj_db
for each example
- calculate the error for that example $g(\mathbf{w} \cdot \mathbf{x}^{(i)} + b) - y^{(i)}$
- for each input value $x_{j}^{(i)}$ in this example,
  - multiply the error by the input $x_{j}^{(i)}$ , and add to the corresponding element of dj_dw. (equation 2 above)
- add the error to dj_db (equation 3 above)
divide dj_db and dj_dw by total number of examples (m)
Note that $\mathbf{x}^{(i)}$ in numpy is X[i,:] or X[i], and $x_{j}^{(i)}$ is X[i,j]
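In the partial-derivative helper compute_gradient_logistic, dj_dw must be an ndarray of shape (n,). Initializing it with dj_dw=np.zeros_like(n) raises an error, while dj_dw = np.zeros((n,)) works. The former creates a 0-dimensional array of shape (), because n is a scalar; the latter creates a 1-D array of shape (n,), so indexing dj_dw[j] fails only in the former case. np.zeros_like(X[0]) also yields a 1-D array of shape (n,), where n is the number of columns of X; it inherits the dtype of X, so X should be a float array.
Gradient descent code: the code below implements equation (1) above. Take a moment to locate the functions in the routine and compare them with the equations above.
In this function, looping with for j in range(n): w[j]=w[j]-alpha*dj_dw[j] is unnecessary: the returned dj_dw is already a 1-D array of shape (n,), so w=w-alpha*dj_dw updates every w[j], for j from 0 to n-1, in one step. The two if checks (the one that saves the cost and the print check if i % math.ceil(num_iters/10)==0:) must sit inside the for loop so that the cost is computed, saved, and printed as training progresses.
To draw the decision boundary: z=w0*x0+w1*x1+b and z=0 is the threshold, so the boundary is w0*x0+w1*x1+b=0. Setting x1=0 gives the point (-b/w0, 0) and setting x0=0 gives the point (0, -b/w1) in the (x0,x1) plane; plotting the straight line through these two points draws the boundary.
A gradient descent function that handles a 2-D X (ndarray) of shape (m,n) also works for a single feature: convert the 1-D x_train = np.array([0., 1, 2, 3, 4, 5]) into a 2-D array with one column using x_train=x_train.reshape(-1,1). There is one column because there is only one feature.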
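The two boundary points follow directly from solving $w_0 x_0 + w_1 x_1 + b = 0$ on each axis. A minimal sketch, with a hypothetical helper name and made-up parameter values chosen so that the two intercepts differ:

```python
import numpy as np

def boundary_intercepts(w, b):
    """Axis intercepts of w[0]*x0 + w[1]*x1 + b = 0 (assumes both weights are nonzero)."""
    x0_intercept = -b / w[0]   # the point (x0_intercept, 0), obtained by setting x1 = 0
    x1_intercept = -b / w[1]   # the point (0, x1_intercept), obtained by setting x0 = 0
    return x0_intercept, x1_intercept

w = np.array([2.0, 1.0])       # illustrative values, not trained parameters
b = -4.0
x0_i, x1_i = boundary_intercepts(w, b)

# Both points lie on the boundary, i.e. z = 0 there
z_a = w[0] * x0_i + w[1] * 0.0 + b
z_b = w[0] * 0.0 + w[1] * x1_i + b
print(x0_i, x1_i, z_a, z_b)
```

Note which weight goes with which axis: the x0-axis intercept divides by w[0] and the x1-axis intercept by w[1]. Swapping them is easy to miss when the two trained weights happen to be close in value.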
Full code

import numpy as np
# %matplotlib widget
import matplotlib.pyplot as plt
plt.style.use('./deeplearning.mplstyle')
plt.rcParams['font.size'] = 8
import copy, math
from lab_utils_common import  dlc, plot_data, plt_tumor_data, sigmoid, compute_cost_logistic
from plt_quad_logistic import plt_quad_logistic, plt_prob
# Start with the same two-feature data set used in the decision boundary lab.
X_train = np.array([[0.5, 1.5], [1,1], [1.5, 0.5], [3, 0.5], [2, 2], [1, 2.5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
# As before, plot this data with the plot_data helper
# pos = y_train == 1
# neg = y_train == 0
# pos = pos.reshape(-1,)  #work with 1D or 1D y vectors
# neg = neg.reshape(-1,)
# print(pos,neg,pos.shape,neg.shape)
fig,ax=plt.subplots(1,1,figsize=(6,6))
plot_data(X_train,y_train,ax)
ax.set_xlabel('$x_0$');ax.set_ylabel('$x_1$');
plt.show()
def compute_gradient_logistic(X, y, w, b): 
    """
    Computes the gradient for logistic regression 
 
    Args:
      X : (ndarray Shape (m,n)) data, m examples with n features 
      y : (ndarray Shape (m,))  target values 
      w : (ndarray Shape (n,))  parameters of the model      
      b : (scalar)              parameter of the model   
    Returns
      dj_dw: (ndarray Shape (n,)) The gradient of the cost w.r.t. the parameters w. 
      dj_db: (scalar)             The gradient of the cost w.r.t. the parameter b. 
    """
    m,n=X.shape
    dj_db=0
    dj_dw=np.zeros_like(X[0])  # or dj_dw = np.zeros((n,)); zeros_like keeps the dtype of X, so X must be float
    for i in range(m):
        fwb_i=sigmoid(np.dot(X[i],w)+b)
        err=fwb_i-y[i]
        dj_db+=(1/m)*err
        for j in range(n):
            dj_dw[j]+=(1/m)*err*X[i,j]
    return dj_dw,dj_db
# Test
X_tmp = np.array([[0.5, 1.5], [1,1], [1.5, 0.5], [3, 0.5], [2, 2], [1, 2.5]])
y_tmp = np.array([0, 0, 0, 1, 1, 1])
w = np.array([2.,3.])
b = 1.
dj_dw, dj_db = compute_gradient_logistic(X_tmp, y_tmp, w, b)
print(f"dj_db, non-vectorized version: {dj_db}" )
print(f"dj_dw, non-vectorized version: {dj_dw.tolist()}" )
dj_dw1=np.zeros_like(2);dj_dw2=np.zeros(2);print(dj_dw1,dj_dw2,dj_dw1.shape,dj_dw2.shape)
def gradient_descent(X, y, w_in, b_in, alpha, num_iters): 
    """
    Performs batch gradient descent
    
    Args:
      X (ndarray): Shape (m,n)    matrix of examples 
      y (ndarray): Shape (m,)     target value of each example
      w_in (ndarray): Shape (n,)  Initial values of parameters of the model
      b_in (scalar):              Initial value of parameter of the model
      alpha (float):              Learning rate
      num_iters (int):            number of iterations to run gradient descent
      
    Returns:
      w (ndarray): Shape (n,)     Updated values of parameters
      b (scalar):                 Updated value of parameter 
    """
    w=copy.deepcopy(w_in)
    b=b_in
    m,n=X.shape
    J_history = [] # stores the cost at each iteration
    for i in range(num_iters):
        dj_dw, dj_db = compute_gradient_logistic(X, y, w, b)  
        b=b-alpha*dj_db 
        w=w-alpha*dj_dw
        if i<100000:      # save the cost at each iteration, capped to limit memory use
            J_history.append(compute_cost_logistic(X, y, w, b))
        if i % math.ceil(num_iters/10)==0:
            print(f"Iteration {i:4d}: Cost {J_history[-1]} ")
    return w, b, J_history  # return final w,b and J history for graphing
w_in  = np.zeros_like(X_train[0])
b_in  = 0.
alpha = 0.1
num_iters = 10000
w_out, b_out, _ = gradient_descent(X_train, y_train, w_in, b_in, alpha, num_iters) 
print(f"\nupdated parameters: w:{w_out}, b:{b_out}")        
fig,ax = plt.subplots(1,1,figsize=(5,4))
# plot the probability 
plt_prob(ax, w_out, b_out)

# Plot the original data
ax.set_ylabel(r'$x_1$')
ax.set_xlabel(r'$x_0$')   
ax.axis([0, 4, 0, 3.5])
plot_data(X_train,y_train,ax)   
# Plot the decision boundary through (x0, 0) and (0, x1)
x0 = -b_out/w_out[0]   # x0-axis intercept: set x1 = 0
x1 = -b_out/w_out[1]   # x1-axis intercept: set x0 = 0
print(w_out[0],w_out[1],x0,x1)
ax.plot([0,x0],[x1,0], c=dlc["dlblue"], lw=1)
plt.show() 
# Check that the same functions also work with a single feature
x_train = np.array([0., 1, 2, 3, 4, 5])
y_train = np.array([0,  0, 0, 1, 1, 1])
x_train=x_train.reshape(-1,1)
print(x_train)
w_in1=np.array([1.]);b_in1=1.  # w must have shape (n,), here n=1
w_out1, b_out1, _ = gradient_descent(x_train, y_train, w_in1, b_in1, alpha, num_iters) 
print(f"\nupdated parameters: w:{w_out1}, b:{b_out1}")
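The gradient equations (2) and (3) can also be evaluated without the nested loops. The sketch below is one vectorized alternative under the same shape conventions (X is (m,n), y is (m,), w is (n,)); the name `compute_gradient_logistic_vec` is made up here and `sigmoid` is redefined locally to keep the block self-contained.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compute_gradient_logistic_vec(X, y, w, b):
    """Vectorized gradient of the (unregularized) logistic cost."""
    m = X.shape[0]
    err = sigmoid(X @ w + b) - y   # f_wb(x_i) - y_i for every example, shape (m,)
    dj_dw = X.T @ err / m          # equation (2) for all j at once, shape (n,)
    dj_db = err.mean()             # equation (3), a scalar
    return dj_dw, dj_db

X_tmp = np.array([[0.5, 1.5], [1, 1], [1.5, 0.5], [3, 0.5], [2, 2], [1, 2.5]])
y_tmp = np.array([0, 0, 0, 1, 1, 1])
w_tmp = np.array([2.0, 3.0])
b_tmp = 1.0

dj_dw_v, dj_db_v = compute_gradient_logistic_vec(X_tmp, y_tmp, w_tmp, b_tmp)
print(dj_dw_v, dj_db_v)
```

X.T @ err forms, for each feature j, the sum over examples of the error times X[i,j], which is exactly what the inner loop accumulates one term at a time.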

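Regularized cost and gradient

Regularization can fix overfitting.
In this lab:
Extend the previous linear and logistic cost functions with a regularization term.
Rerun the earlier overfitting example with a regularization term added.
The lecture slides show the cost and gradient functions for both linear and logistic regression. Note:
Cost
The cost functions differ significantly between linear and logistic regression, but adding regularization to the equations is the same.
Gradient
The gradient functions for linear and logistic regression are very similar. They differ only in the implementation of $f_{wb}$ .
Cost function for regularized linear regression
The equation for the cost function of regularized linear regression is: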

J(\mathbf{w},b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})^2 + \frac{\lambda}{2m} \sum_{j=0}^{n-1} w_j^2 \tag{1}

where:

f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = \mathbf{w} \cdot \mathbf{x}^{(i)} + b \tag{2}

Compare this to the cost function without regularization (which you implemented in a previous lab), which is of the form:

J(\mathbf{w},b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})^2

The difference is the regularization term,
$\frac{\lambda}{2m} \sum_{j=0}^{n-1} w_j^2$
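Including this term incentivizes gradient descent to minimize the size of the parameters. Note that in this example the parameter $b$
is not regularized. This is standard practice.
Below is an implementation of equations (1) and (2). Note that this uses a standard pattern for this course, a for loop over all m examples.
The regularization term and the cost term are computed side by side and then added: do not nest the loops. Putting the j loop inside the i loop adds the regularization term m times. Also, f_wb for linear regression must not use the sigmoid; that belongs to logistic regression.
A classic mistake is for j in range(n): J_cost1+=J_cost0+(lambda_/(2*m))*(w[j]**2). This statement adds J_cost0 n times, once per j, when it should be added only once.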
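One way to make the "two separate sums" structure hard to get wrong is to compute each term as a single expression. The following is a sketch, not the lab's code: the function name is made up, and the tiny data set is invented so the numbers can be checked by hand.

```python
import numpy as np

def compute_cost_linear_reg_vec(X, y, w, b, lambda_=1.0):
    """Regularized squared-error cost with the two terms computed independently."""
    m = X.shape[0]
    err = X @ w + b - y                         # linear model, no sigmoid
    cost = (err @ err) / (2 * m)                # sum over the m examples
    reg = (lambda_ / (2 * m)) * (w @ w)         # sum over the n weights; b is not regularized
    return cost + reg

# Hand-checkable example values (illustrative only)
X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, 2.0])
w = np.array([0.5, -0.5])
b = 0.1

unreg = compute_cost_linear_reg_vec(X, y, w, b, lambda_=0.0)
reg_cost = compute_cost_linear_reg_vec(X, y, w, b, lambda_=0.7)
print(unreg, reg_cost, reg_cost - unreg)
```

The difference between the two results is exactly $\frac{\lambda}{2m}\sum_j w_j^2$, which gives a quick sanity check for a loop-based implementation: if the difference is m or n times too large, a term is being added inside the wrong loop.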
Cost function for regularized logistic regression
For regularized logistic regression, the cost function is of the form

J(\mathbf{w},b) = \frac{1}{m} \sum_{i=0}^{m-1} \left[ -y^{(i)} \log\left(f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) - \left( 1 - y^{(i)}\right) \log \left( 1 - f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) \right] + \frac{\lambda}{2m} \sum_{j=0}^{n-1} w_j^2 \tag{3}

where:

f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = sigmoid(\mathbf{w} \cdot \mathbf{x}^{(i)} + b) \tag{4}

Compare this to the cost function without regularization (which you implemented in a previous lab):

J(\mathbf{w},b) = \frac{1}{m}\sum_{i=0}^{m-1} \left[ -y^{(i)} \log\left(f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) - \left( 1 - y^{(i)}\right) \log \left( 1 - f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right)\right]

As was the case in linear regression above, the difference is the regularization term, which is
$\frac{\lambda}{2m} \sum_{j=0}^{n-1} w_j^2$

Including this term incentivizes gradient descent to minimize the size of the parameters. Note, in this example, the parameter $b$ is not regularized. This is standard practice.
Gradient descent with regularization
The basic algorithm for running gradient descent does not change with regularization. It is:

\begin{align*} &\text{repeat until convergence:} \; \lbrace \\ & \; \; \;w_j = w_j - \alpha \frac{\partial J(\mathbf{w},b)}{\partial w_j} \tag{1} \; & \text{for j := 0..n-1} \\ & \; \; \; \; \;b = b - \alpha \frac{\partial J(\mathbf{w},b)}{\partial b} \\ &\rbrace \end{align*}

Where each iteration performs simultaneous updates on $w_j$ for all $j$ .

What changes with regularization is computing the gradients.
The gradient calculation for both linear and logistic regression are nearly identical, differing only in computation of $f_{\mathbf{w}b}$ .

\begin{align*} \frac{\partial J(\mathbf{w},b)}{\partial w_j} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})x_{j}^{(i)} + \frac{\lambda}{m} w_j \tag{2} \\ \frac{\partial J(\mathbf{w},b)}{\partial b} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}) \tag{3} \end{align*}

m is the number of training examples in the data set
$f_{\mathbf{w},b}(x^{(i)})$ is the model's prediction, while $y^{(i)}$ is the target
For a linear regression model
$f_{\mathbf{w},b}(x) = \mathbf{w} \cdot \mathbf{x} + b$
For a logistic regression model
$z = \mathbf{w} \cdot \mathbf{x} + b$
$f_{\mathbf{w},b}(x) = g(z)$
where $g(z)$ is the sigmoid function:
$g(z) = \frac{1}{1+e^{-z}}$

The term which adds regularization is the $\frac{\lambda}{m} w_j $.
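In def compute_gradient_linear_reg(X, y, w, b, lambda_):, the statement dj_dw+=(1/m)*(fwb_i-y[i])*X[i,j]+(lambda_/m)*w[j] has two problems. First, dj_dw should be dj_dw[j]; without the index, each term is added to every element of dj_dw. Second, the term +(lambda_/m)*w[j] must not sit inside the loop over i, because in equation (2) it is outside the summation; inside the loop it would be added m times.
Full code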

import numpy as np
import matplotlib.pyplot as plt
from  plt_overfit import overfit_example, output
from lab_utils_common import sigmoid
np.set_printoptions(precision=8)

def compute_cost_linear_reg(X, y, w, b, lambda_ = 1):
    """
    Computes the cost over all examples
    Args:
      X (ndarray (m,n): Data, m examples with n features
      y (ndarray (m,)): target values
      w (ndarray (n,)): model parameters  
      b (scalar)      : model parameter
      lambda_ (scalar): Controls amount of regularization
    Returns:
      total_cost (scalar):  cost 
    """
    m,n=X.shape;fwb_i=J_cost0=J_cost1=0
    for i in range(m):
        fwb_i=np.dot(X[i],w)+b
        J_cost0+=(1/(2*m))*((fwb_i-y[i])**2)
    for j in range(n):
        J_cost1+=(lambda_/(2*m))*(w[j]**2)
    total_cost=J_cost1+J_cost0
    return total_cost
# Test
np.random.seed(1)
X_tmp = np.random.rand(5,6)
y_tmp = np.array([0,1,0,1,0])
w_tmp = np.random.rand(X_tmp.shape[1]).reshape(-1,)-0.5
b_tmp = 0.5
lambda_tmp = 0.7
cost_tmp = compute_cost_linear_reg(X_tmp, y_tmp, w_tmp, b_tmp, lambda_tmp)

print("Regularized cost:", cost_tmp)
def compute_cost_logistic_reg(X, y, w, b, lambda_ = 1):
    """
    Computes the cost over all examples
    Args:
      X (ndarray (m,n): Data, m examples with n features
      y (ndarray (m,)): target values
      w (ndarray (n,)): model parameters  
      b (scalar)      : model parameter
      lambda_ (scalar): Controls amount of regularization
    Returns:
      total_cost (scalar):  cost 
    """
    m,n=X.shape;Jz_cost0=Jz_cost1=0
    for i in range(m):
        gz=sigmoid(np.dot(X[i],w)+b)
        Jz_cost0+=(1/m)*(-y[i]*np.log(gz)-(1-y[i])*np.log(1-gz))
    for j in range(n):
        Jz_cost1+=(lambda_/(2*m))*(w[j]**2)
    total_cost=Jz_cost0+Jz_cost1
    return total_cost
# Test
np.random.seed(1)
X_tmp = np.random.rand(5,6)
y_tmp = np.array([0,1,0,1,0])
w_tmp = np.random.rand(X_tmp.shape[1]).reshape(-1,)-0.5
b_tmp = 0.5
lambda_tmp = 0.7
cost_tmp = compute_cost_logistic_reg(X_tmp, y_tmp, w_tmp, b_tmp, lambda_tmp)
print("Regularized cost:", cost_tmp)
# Gradient function for regularized linear regression
def compute_gradient_linear_reg(X, y, w, b, lambda_): 
    """
    Computes the gradient for linear regression 
    Args:
      X (ndarray (m,n): Data, m examples with n features
      y (ndarray (m,)): target values
      w (ndarray (n,)): model parameters  
      b (scalar)      : model parameter
      lambda_ (scalar): Controls amount of regularization
      
    Returns:
      dj_dw (ndarray (n,)): The gradient of the cost w.r.t. the parameters w. 
      dj_db (scalar):       The gradient of the cost w.r.t. the parameter b. 
    """
    m,n=X.shape
    dj_db=fwb_i=0
    dj_dw=np.zeros((n,))
    for i in range(m):
        fwb_i=np.dot(X[i],w)+b
        dj_db+=(1/m)*(fwb_i-y[i])
        for j in range(n): 
            dj_dw[j]+=(1/m)*(fwb_i-y[i])*X[i,j]
    for j in range(n): 
        dj_dw[j]+=(lambda_/m)*w[j]
    return dj_dw,dj_db


# Test
np.random.seed(1)
X_tmp = np.random.rand(5,3)
y_tmp = np.array([0,1,0,1,0])
w_tmp = np.random.rand(X_tmp.shape[1])
b_tmp = 0.5
lambda_tmp = 0.7
dj_dw_tmp, dj_db_tmp =  compute_gradient_linear_reg(X_tmp, y_tmp, w_tmp, b_tmp, lambda_tmp)

print(f"dj_db: {dj_db_tmp}", )
print(f"Regularized dj_dw:\n {dj_dw_tmp.tolist()}", )

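Gradient for regularized logistic regression

Gradient with regularization
The gradient of the regularized cost function consists of the partial derivatives of the cost with respect to the parameters $w$ and $b$ :

\begin{align*} \frac{\partial J(\mathbf{w},b)}{\partial b} &= \frac{1}{m} \sum_{i=0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}) \tag {1} \\ \frac{\partial J(\mathbf{w},b)}{\partial w_j} &= \left( \frac{1}{m} \sum_{i=0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}) x_j^{(i)} \right) + \frac{\lambda}{m} w_j \quad\, \mbox{for $j=0...(n-1)$} \tag {2} \end{align*}

You will implement a function named compute_gradient_reg that returns
$\frac{\partial J(\mathbf{w},b)}{\partial w},\frac{\partial J(\mathbf{w},b)}{\partial b}$
The only difference from the linear case is that f here uses the sigmoid function. If the result for b comes out as a one-element array, so that the output reads dj_db: [0.34], the scalar can be extracted with dj_db[0]
In print(f'{x1:.18f}'), the .18f format prints 18 digits after the decimal point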
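Equations (1) and (2) for the regularized logistic gradient can be sketched as follows. This is one possible vectorized form, not the practice lab's reference solution: the function name `compute_gradient_logistic_reg` and the seeded random test data are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compute_gradient_logistic_reg(X, y, w, b, lambda_):
    """Gradient of the regularized logistic cost; b is not regularized."""
    m = X.shape[0]
    err = sigmoid(X @ w + b) - y              # f_wb(x_i) - y_i, shape (m,)
    dj_db = err.mean()                        # equation (1), a scalar
    dj_dw = X.T @ err / m                     # data term of equation (2)
    dj_dw = dj_dw + (lambda_ / m) * w         # regularization term, added once
    return dj_dw, dj_db

rng = np.random.default_rng(1)                # illustrative random data
X_tmp = rng.random((5, 3))
y_tmp = np.array([0, 1, 0, 1, 0])
w_tmp = rng.random(3)
b_tmp = 0.5

dw_reg, db_reg = compute_gradient_logistic_reg(X_tmp, y_tmp, w_tmp, b_tmp, 0.7)
dw_0, db_0 = compute_gradient_logistic_reg(X_tmp, y_tmp, w_tmp, b_tmp, 0.0)
print(dw_reg, db_reg)
```

Because err.mean() reduces a shape (m,) array, dj_db comes back as a scalar here, so no [0] indexing is needed. Comparing the results for lambda_ = 0.7 and lambda_ = 0 isolates the regularization term, which should equal (lambda_/m) * w.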

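Neurons and layers in Tensorflow

Based on C2_W1_Lab01. In this lab we explore the inner workings of neurons/units and layers. In particular, the lab draws parallels to the models you mastered in Course 1, the regression/linear model and the logistic model. The lab introduces Tensorflow and demonstrates how these models are implemented in that framework.

Neuron without activation - regression/linear model

Regression/linear model
The function implemented by a neuron with no activation is the same as in Course 1, linear regression: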

f_{\mathbf{w},b}(x^{(i)}) = \mathbf{w}\cdot x^{(i)} + b \tag{1}

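We can define a layer with one neuron or unit and compare it to the familiar linear regression function.
linear_layer = tf.keras.layers.Dense(units=1, activation = 'linear', )
activation = 'linear' means the layer outputs the linear model directly, with no activation applied.
w, b= linear_layer.get_weights() returns the weights; the weights of a layer are its w and b. A newly created layer has no weights yet: they are instantiated the first time the layer is called on an input, for example a1 = linear_layer(X_train[0].reshape(1,1)). The input to a layer must be a 2-D array, so X_train[0] needs .reshape(1,1). The weights w are randomly initialized to small numbers and the bias b is initialized to zero by default.
a1 = linear_layer(X_train[0].reshape(1,1)) and alin = np.dot(set_w,X_train[0].reshape(1,1)) + set_b produce the same value, which shows that with these weights the layer performs the linear regression computation.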
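The computation a linear unit performs can be reproduced in plain NumPy, which also makes the shape requirements visible. This is a NumPy stand-in written for illustration, not the Keras layer itself; `linear_unit` is a hypothetical name, and the weights are the w=200, b=100 values used in these notes.

```python
import numpy as np

def linear_unit(X, W, b):
    """Dense layer with linear activation: X (m,n) @ W (n,units) + b (units,)."""
    return X @ W + b

X_train = np.array([[1.0], [2.0]], dtype=np.float32)   # 2 examples, 1 feature
set_w = np.array([[200.0]])                             # shape (n_features, units) = (1, 1)
set_b = np.array([100.0])                               # shape (units,)

a_first = linear_unit(X_train[0].reshape(1, 1), set_w, set_b)   # one example, kept 2-D
a_all = linear_unit(X_train, set_w, set_b)                      # all examples at once
print(a_first, a_all.shape)
```

X_train[0] alone has shape (1,), which is why it is reshaped to (1,1) before being passed in: the first axis indexes examples and the second indexes features.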
Full code

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras import Sequential
from tensorflow.keras.losses import MeanSquaredError, BinaryCrossentropy
from tensorflow.keras.activations import sigmoid
from lab_utils_common import dlc
from lab_neurons_utils import plt_prob_1d, sigmoidnp, plt_linear, plt_logistic
plt.style.use('./deeplearning.mplstyle')
import logging
logging.getLogger("tensorflow").setLevel(logging.ERROR)
tf.autograph.set_verbosity(0)
# Data
# We use an example from Course 1, linear regression on house prices
X_train=np.array([[1.0],[2.0]],dtype=np.float32)
Y_train=np.array([[300.0],[500.0]],dtype=np.float32)  
fig, ax = plt.subplots(1,1)
ax.scatter(X_train, Y_train, marker='x', c='r', label="Data Points")
ax.legend( fontsize='xx-large')
ax.set_ylabel('Price (in 1000s of dollars)', fontsize='xx-large')
ax.set_xlabel('Size (1000 sqft)', fontsize='xx-large')
plt.show()
# Define a layer with one neuron or unit and compare it to the familiar linear regression function.
linear_layer=tf.keras.layers.Dense(units=1,activation='linear')
linear_layer.get_weights()  
a1 = linear_layer(X_train[0].reshape(1,1))
print(a1)
w,b=linear_layer.get_weights()  
print(f'w={w},b={b}')
set_w=np.array([[200]])
set_b=np.array([100])
linear_layer.set_weights([set_w,set_b])
print(linear_layer.get_weights())
a1 = linear_layer(X_train[0].reshape(1,1))
print(a1)
alin = np.dot(set_w,X_train[0].reshape(1,1)) + set_b
print(alin)
prediction_tf=linear_layer(X_train)
prediction_np=np.dot(X_train,set_w)+set_b
plt_linear(X_train, Y_train, prediction_tf, prediction_np)

Neuron with sigmoid activation

The function implemented by a neuron/unit with a sigmoid activation is the same as in Course 1, logistic regression:

f_{\mathbf{w},b}(x^{(i)}) = g(\mathbf{w}x^{(i)} + b) \tag{2}

where $$g(x) = sigmoid(x)$$
Let's set $w$ and $b$ to some known values and check the model.
Unlike the previous section, the layer uses activation = 'sigmoid', and the layer variable is accordingly named logistic_layer
Full code

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras import Sequential
from tensorflow.keras.losses import MeanSquaredError, BinaryCrossentropy
from tensorflow.keras.activations import sigmoid
from lab_utils_common import dlc
from lab_neurons_utils import plt_prob_1d, sigmoidnp, plt_linear, plt_logistic
plt.style.use('./deeplearning.mplstyle')
import logging
logging.getLogger("tensorflow").setLevel(logging.ERROR)
# Neuron with sigmoid activation
X_train=np.array([0,1,2,3,4,5],dtype=np.float32).reshape(-1,1)
Y_train=np.array([0,0,0,1,1,1],dtype=np.float32).reshape(-1,1)
pos = Y_train == 1
neg = Y_train == 0
X_train[pos]
print(X_train[pos])
fig,ax = plt.subplots(1,1,figsize=(4,3))
ax.scatter(X_train[pos], Y_train[pos], marker='x', s=80, c = 'red', label="y=1")
ax.scatter(X_train[neg], Y_train[neg], marker='o', s=100, label="y=0", facecolors='none', 
              edgecolors=dlc["dlblue"],lw=3)
ax.set_ylim(-0.08,1.1)
ax.set_ylabel('y', fontsize=12)
ax.set_xlabel('x', fontsize=12)
ax.set_title('one variable plot')
ax.legend(fontsize=12)
plt.show()
model=Sequential(
    [
        tf.keras.layers.Dense(1,input_dim=1,activation='sigmoid',name='L1')
    ]
)
model.summary()
logistic_layer = model.get_layer('L1')
w,b = logistic_layer.get_weights()
print(w,b) # initially w is random and b=0
print(w.shape,b.shape)
set_w=np.array([[2]])
set_b=np.array([-4.5])
logistic_layer.set_weights([set_w,set_b])
print(logistic_layer.get_weights())
a1 = model.predict(X_train[0].reshape(1,1))
print(a1)
alog = sigmoidnp(np.dot(set_w,X_train[0].reshape(1,1)) + set_b)
print(alog)
plt_logistic(X_train, Y_train, model, set_w, set_b, pos, neg)

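Two-layer sigmoid neural network: the "coffee roasting" network

Based on C2_W1_LAB02
How to read plt.scatter(X[y==1,0], X[y==1,1]): the boolean mask y==1 selects the rows of X whose label is 1, and the second index picks column 0 or column 1, so the call plots the two features of the positive examples against each other.
axis=0: operate along the first dimension; axis=1: operate along the second dimension; axis=-1: operate along the last dimension
What np.tile() does: np.tile(A, reps) builds a new array by repeating A the number of times given by reps along each axis, so np.tile(Xn,(1000,1)) stacks 1000 copies of Xn row-wise and leaves the columns unchanged.
When building the model, the shape in tf.keras.Input(shape=(2,)) matches the shape printed by print(Xt[0].shape). astype(int) converts an array to integer type.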
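The NumPy idioms listed above (boolean-mask indexing, the axis argument, np.tile, astype) can be tried on a toy array. The data below is invented purely so the results are easy to verify by eye.

```python
import numpy as np

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])   # 3 examples, 2 features (toy data)
y = np.array([1, 0, 1])

# Boolean mask on rows, integer index on columns
pos_x0 = X[y == 1, 0]            # feature 0 of the positive examples
pos_x1 = X[y == 1, 1]            # feature 1 of the positive examples

# axis=0 reduces over rows (one value per column); axis=-1 is the last axis
col_mean = X.mean(axis=0)
row_mean = X.mean(axis=-1)

# np.tile repeats the array: 4 copies along rows, 1 along columns
Xt = np.tile(X, (4, 1))
Yt = np.tile(y.reshape(-1, 1), (4, 1))

# astype(int) turns a boolean comparison into 0/1 decisions
decisions = (np.array([0.2, 0.7]) >= 0.5).astype(int)
print(pos_x0, pos_x1, col_mean, row_mean, Xt.shape, Yt.shape, decisions)
```

Tiling does not add new information; it only repeats the same rows, so each pass over the tiled set visits every original example several times.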
Full code

import numpy as np
import matplotlib.pyplot as plt
plt.style.use('./deeplearning.mplstyle')
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from lab_utils_common import dlc
from lab_coffee_utils import load_coffee_data, plt_roast, plt_prob, plt_layer, plt_network, plt_output_unit
import logging
logging.getLogger("tensorflow").setLevel(logging.ERROR)
tf.autograph.set_verbosity(0)
X,Y = load_coffee_data()
print(X.shape, Y.shape)
print(X,Y)
plt_roast(X,Y)

print(f"Temperature Max, Min pre normalization: {np.max(X[:,0]):0.2f}, {np.min(X[:,0]):0.2f}")
print(f"Duration    Max, Min pre normalization: {np.max(X[:,1]):0.2f}, {np.min(X[:,1]):0.2f}")
norm_l = tf.keras.layers.Normalization(axis=-1)
norm_l.adapt(X)  # learns mean, variance
Xn = norm_l(X)
print(f"Temperature Max, Min post normalization: {np.max(Xn[:,0]):0.2f}, {np.min(Xn[:,0]):0.2f}")
print(f"Duration    Max, Min post normalization: {np.max(Xn[:,1]):0.2f}, {np.min(Xn[:,1]):0.2f}")
# Tile/copy the data to increase the training set size and reduce the number of training epochs.
Xt=np.tile(Xn,(1000,1))
Yt=np.tile(Y,(1000,1))
print(Xt[0].shape)
print(Xt.shape,Yt.shape)  
# Build the "coffee roasting network" described in lecture. There are two layers with sigmoid activations, as shown below:
tf.random.set_seed(1234)  # applied to achieve consistent results
model = Sequential(
    [
        tf.keras.Input(shape=(2,)),
        Dense(3, activation='sigmoid', name = 'layer1'),
        Dense(1, activation='sigmoid', name = 'layer2')
     ]
)
model.summary()
W1, b1 = model.get_layer("layer1").get_weights()
W2, b2 = model.get_layer("layer2").get_weights()
print(f"W1{W1.shape}:\n", W1, f"\nb1{b1.shape}:", b1)
print(f"W2{W2.shape}:\n", W2, f"\nb2{b2.shape}:", b2)
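# The following statements will be described in detail in Week 2. For now:

# The model.compile statement defines a loss function and specifies a compile optimization.
# The model.fit statement runs gradient descent and fits the weights to the data.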
model.compile(
    loss = tf.keras.losses.BinaryCrossentropy(),
    optimizer = tf.keras.optimizers.Adam(learning_rate=0.01),
)

model.fit(
    Xt,Yt,            
    epochs=10,
)
W1, b1 = model.get_layer("layer1").get_weights()
W2, b2 = model.get_layer("layer2").get_weights()
print("W1:\n", W1, "\nb1:", b1)
print("W2:\n", W2, "\nb2:", b2)
W1 = np.array([
    [-8.94,  0.29, 12.89],
    [-0.17, -7.34, 10.79]] )
b1 = np.array([-9.87, -9.28,  1.01])
W2 = np.array([
    [-31.38],
    [-27.86],
    [-32.79]])
b2 = np.array([15.54])
model.get_layer("layer1").set_weights([W1,b1])
model.get_layer("layer2").set_weights([W2,b2])
X_test = np.array([
    [200,13.9],  # positive example
    [200,17]])   # negative example
X_testn = norm_l(X_test)
predictions = model.predict(X_testn)
print("predictions = \n", predictions)
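# Epochs and batches
# In the fit statement above, the number of epochs was set to 10. This specifies that the entire data set should be applied during training 10 times. During training, you see output describing the progress of training that looks like this:

# Epoch 1/10
# 6250/6250 [==============================] - 6s 910us/step - loss: 0.1782
# The first line, Epoch 1/10, describes which epoch the model is currently running. For efficiency, the training data set is broken into "batches". The default batch size in Tensorflow is 32. There are 200000 examples in our expanded data set, or 6250 batches. The notation on the second line, 6250/6250 [====, describes which batch has been executed.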
yhat = np.zeros_like(predictions)
for i in range(len(predictions)):
    if predictions[i] >= 0.5:
        yhat[i] = 1
    else:
        yhat[i] = 0
print(f"decisions = \n{yhat}")
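The 6250 figure in the epoch note above is just the number of examples divided by the batch size, rounded up. A small arithmetic sketch; the figure of 200 original examples is inferred here from the 200000 tiled examples and the tiling factor of 1000.

```python
import math

n_original = 200                      # inferred: 200000 tiled examples / 1000 copies
n_examples = n_original * 1000        # size of the tiled training set
batch_size = 32                       # default batch size noted above
epochs = 10

steps_per_epoch = math.ceil(n_examples / batch_size)   # batches per pass over the data
total_updates = steps_per_epoch * epochs                # gradient updates over the whole run
print(steps_per_epoch, total_updates)
```

math.ceil matters when the data set size is not a multiple of the batch size: the last batch is then smaller but still counts as one step.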

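Building the "coffee roasting" network in NumPy

Based on C2_W1_LAB03
lambda functions
def my_dense(a_in, W, b, g): the input a_in is actually X[i], the i-th row of X
def my_sequential(x, W1, b1, W2, b2): chains the layers of the network together
def my_predict(X, W1, b1, W2, b2): passes each row of X through my_sequential and collects the outputs; the output here has shape (m,1)
Full code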

import numpy as np
import matplotlib.pyplot as plt
plt.style.use('./deeplearning.mplstyle')
import tensorflow as tf
from lab_utils_common import dlc, sigmoid
from lab_coffee_utils import load_coffee_data, plt_roast, plt_prob, plt_layer, plt_network, plt_output_unit
import logging
logging.getLogger("tensorflow").setLevel(logging.ERROR)
tf.autograph.set_verbosity(0)
X,Y = load_coffee_data()
print(X.shape, Y.shape)
plt_roast(X,Y)
# Normalize the data
# To match the previous lab, we normalize the data. See that lab for more details
print(f"Temperature Max, Min pre normalization: {np.max(X[:,0]):0.2f}, {np.min(X[:,0]):0.2f}")
print(f"Duration    Max, Min pre normalization: {np.max(X[:,1]):0.2f}, {np.min(X[:,1]):0.2f}")
norm_l = tf.keras.layers.Normalization(axis=-1)
norm_l.adapt(X)  # learns mean, variance
Xn = norm_l(X)
print(f"Temperature Max, Min post normalization: {np.max(Xn[:,0]):0.2f}, {np.min(Xn[:,0]):0.2f}")
print(f"Duration    Max, Min post normalization: {np.max(Xn[:,1]):0.2f}, {np.min(Xn[:,1]):0.2f}")
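# In the first optional lab, you constructed a neuron in NumPy and in Tensorflow and noted their similarity.
# A layer simply contains multiple neurons/units. As described in lecture, one can use a for loop to visit each unit (j) in the layer,
# perform the dot product of the weights for that unit (W[:,j]), and add the bias for the unit (b[j]) to form z. An activation function g(z) can then be applied to that result.
# Let's try that below to build a "dense layer" subroutine.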
def my_dense(a_in, W, b, g):
    """
    Computes dense layer
    Args:
      a_in (ndarray (n, )) : Data, 1 example 
      W    (ndarray (n,j)) : Weight matrix, n features per unit, j units
      b    (ndarray (j, )) : bias vector, j units  
      g    activation function (e.g. sigmoid, relu..)
    Returns
      a_out (ndarray (j,))  : j units
    """
    m=W.shape[1]
    a_out=np.zeros((m,))
    for j in range(m):
        a_out[j]=g(np.dot(a_in,W[:,j])+b[j])
    return a_out
# The following cell builds a two-layer neural network using the my_dense subroutine above.
def my_sequential(x, W1, b1, W2, b2):
    a1=my_dense(x,W1,b1,sigmoid)
    a2=my_dense(a1,W2,b2,sigmoid)
    return a2
W1_tmp = np.array( [[-8.93,  0.29, 12.9 ], [-0.1,  -7.32, 10.81]] )
b1_tmp = np.array( [-9.82, -9.28,  0.96] )
W2_tmp = np.array( [[-31.18], [-27.59], [-32.56]] )
b2_tmp = np.array( [15.41] )
def my_predict(X, W1, b1, W2, b2):
    m=X.shape[0]
    p=np.zeros((m,1))
    for i in range(m):
        p[i]=my_sequential(X[i],W1,b1,W2,b2)
    return p
X_tst = np.array([
    [200,13.9],  # positive example
    [200,17]])   # negative example
X_tstn = norm_l(X_tst)  # remember to normalize
predictions = my_predict(X_tstn, W1_tmp, b1_tmp, W2_tmp, b2_tmp)
yhat=(predictions>=0.5).astype(int)
print(f"decisions = \n{yhat}")
netf= lambda x : my_predict(norm_l(x),W1_tmp, b1_tmp, W2_tmp, b2_tmp)
plt_network(X,Y,netf)
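The per-unit loop in my_dense can be replaced by a single matrix product that handles all units and all examples at once. The sketch below is an illustrative alternative, not the lab's code: `my_dense_v` and `my_dense_loop` are names chosen here, the weights are the ones listed in these notes, and the two input rows are hypothetical values standing in for already-normalized features.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def my_dense_loop(a_in, W, b, g):
    """Loop version for one example: a_in (n,), W (n,j), b (j,) -> (j,)."""
    units = W.shape[1]
    a_out = np.zeros(units)
    for j in range(units):
        a_out[j] = g(np.dot(a_in, W[:, j]) + b[j])
    return a_out

def my_dense_v(A_in, W, b, g):
    """Vectorized version for a batch: A_in (m,n), W (n,j), b (j,) -> (m,j)."""
    return g(A_in @ W + b)

W1 = np.array([[-8.93, 0.29, 12.9], [-0.1, -7.32, 10.81]])
b1 = np.array([-9.82, -9.28, 0.96])
W2 = np.array([[-31.18], [-27.59], [-32.56]])
b2 = np.array([15.41])

Xn = np.array([[-0.47, 0.42], [-0.47, 3.16]])   # hypothetical normalized inputs

A2 = my_dense_v(my_dense_v(Xn, W1, b1, sigmoid), W2, b2, sigmoid)
A2_loop = np.array([my_dense_loop(my_dense_loop(x, W1, b1, sigmoid), W2, b2, sigmoid)
                    for x in Xn])
print(A2.shape, np.allclose(A2, A2_loop))
```

Column j of W holds the weights of unit j, so A_in @ W computes every unit's dot product for every example, and adding b broadcasts one bias per unit across the rows.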

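Multiclass classification

Multiclass classification, based on C2_W2_Multiclass
Neural networks are often used to classify data. Examples are neural networks that:
take in photos and classify the subjects in the photos as {dog, cat, horse, other}
take in a sentence and classify the "parts of speech" of its elements: {noun, verb, adjective, etc.}
A network of this type has multiple units in its final layer. Each output is associated with a category. When an input example is applied to the network, the output with the highest value is the predicted category. If the output is passed to a softmax function, the output of the softmax gives the probability of the input being in each category.
In this lab you will see an example of building a multiclass network in Tensorflow. We will then look at how the neural network makes its predictions.
Let's start by creating a four-class data set.
The number of units in the output layer equals the number of classes, and the output with the highest value corresponds to the predicted class.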
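Further reading: softmax in detail; the make_blobs cluster data generator;
the random_state parameter in sklearn models
What does the argument centers=[[0,0],[2.0,2.0]] of make_blobs mean?
centers is either the number of sample centers (classes) to generate, or the fixed center locations.
It is an int or an array of shape [n_centers, n_features], optional; if it is omitted and n_samples is an int, 3 centers are generated.
centers=[[0,0],[2.0,2.0]] means the centers are located at the coordinates (0,0) and (2,2).
About np.unique()
For a 1-D array or list, np.unique() removes duplicate elements and returns a new NumPy array of the unique elements sorted in ascending order.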
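A short illustration of np.unique on a label vector like the y produced by make_blobs. The label values below are made up; return_counts is an optional flag of np.unique that also reports how often each value occurs.

```python
import numpy as np

y = np.array([3, 1, 0, 3, 2, 1, 0, 0])       # made-up class labels

classes = np.unique(y)                        # sorted unique values
classes2, counts = np.unique(y, return_counts=True)
print(classes, counts, type(classes).__name__)
```

The result is always an ndarray, even when the input is a Python list, and len(classes) gives the number of distinct classes, which is the number of units the output layer needs.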
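Model
This lab uses a 2-layer network as shown. Unlike the binary classification networks, this network has four outputs, one for each class. Given an input example, the output with the highest value is the predicted class of the input.
Below is an example of how to construct this network in Tensorflow. Notice the output layer uses a linear rather than a softmax activation. While it is possible to include the softmax in the output layer, it is more numerically stable if linear outputs are passed to the loss function during training. If the model is used to predict probabilities, the softmax can be applied at that point.
The network has two layers: the first has 2 units with the ReLU activation, and the second is a 4-unit output layer whose softmax is applied in the loss rather than in the layer.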
model = Sequential(
[
Dense(2, activation = 'relu', name = "L1"),
Dense(4, activation = 'linear', name = "L2")
]
)
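Although the output is conceptually a 4-unit softmax, the layer itself uses 'linear'; the softmax is handled by loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
SparseCategoricalCrossentropy computes the cross-entropy for multiclass problems whose labels are integer class indices. from_logits=True tells the loss that the model outputs are raw logits, so the softmax is applied inside the loss. The default is from_logits=False, which expects the outputs to already be probabilities. A linear output with from_logits=True is numerically more accurate than activation = 'softmax' combined with the default False.
print(model.predict(X_train));print(model.predict(X_train).shape) prints the output of the output layer directly; its shape is (100,4). In this example we still need the largest of the 4 outputs for each example X_train[i] to decide which class is most likely, which np.argmax() provides.
A function can be defined with lambda: model_predict = lambda Xl: np.argmax(model.predict(Xl),axis=1). model_predict then behaves like an ordinary one-argument function, equivalent to a def model_predict(Xl) that returns the same expression.
After training, the w and b of every layer hold the fitted values, obtained with the Adam optimizer.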
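To make the roles of the linear outputs (logits), softmax, and argmax concrete, here is a NumPy sketch. It is not Keras's implementation: the function names are made up, the logits are invented values, and the loss shown is the standard log-sum-exp formulation of cross-entropy from logits.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    z = z - z.max(axis=1, keepdims=True)
    ez = np.exp(z)
    return ez / ez.sum(axis=1, keepdims=True)

def sparse_ce_from_logits(logits, y):
    """Mean cross-entropy from raw logits and integer labels, via log-sum-exp."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()

logits = np.array([[2.0, 1.0, 0.1, -1.0],     # invented outputs of a 4-unit linear layer
                   [0.0, 0.0, 5.0, 0.0]])
y = np.array([0, 2])                           # integer class labels

probs = softmax(logits)
pred_from_logits = np.argmax(logits, axis=1)
pred_from_probs = np.argmax(probs, axis=1)
loss = sparse_ce_from_logits(logits, y)
print(probs.sum(axis=1), pred_from_logits, pred_from_probs, loss)
```

Softmax preserves the ordering of the outputs, so argmax gives the same class whether it is applied to the logits or to the probabilities. That is why the predicted class can be read straight from the linear outputs, and softmax is only needed when actual probabilities are wanted.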

model.fit(
    X_train,y_train,
    epochs=200
)
l1 = model.get_layer("L1")
W1,b1 = l1.get_weights()
print(W1,b1)

The first layer has 2 units because that is just enough here: with ReLU, one unit can separate classes 0 and 1 from classes 2 and 3.
Full code

import numpy as np
import matplotlib.pyplot as plt
# %matplotlib widget
from sklearn.datasets import make_blobs
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
np.set_printoptions(precision=2)
from lab_utils_multiclass_TF import *
import logging
logging.getLogger("tensorflow").setLevel(logging.ERROR)
tf.autograph.set_verbosity(0)
# We use the Scikit-Learn make_blobs function to make a training data set with 4 categories,
# as shown in the plot below.
# make 4-class dataset for classification
classes = 4
m = 100
centers = [[-5, 2], [-2, -2], [1, 2], [5, -2]]
std = 1.0
# y here is a 1-D array of shape (m,) with values 0-3 (4 classes)
X_train, y_train = make_blobs(n_samples=m, centers=centers, cluster_std=std,random_state=30)
'''
每个点代表一个训练示例。轴 （x0，x1） 是输入，
颜色表示与示例关联的类。训练完成后，模型将呈现一个新示例 （x0，x1），并将预测类。
生成时，此数据集代表了许多现实世界的分类问题。
有多个输入要素 （x0,...,xn） 和多个输出类别。
训练模型以使用输入特征来预测正确的输出类别。
'''
plt_mc(X_train,y_train,classes, centers, std=std)
# show classes in data set
print(f"unique classes {np.unique(y_train)}")
# show how classes are represented
print(f"class representation {y_train[:10]}")
# show shapes of our dataset
print(f"shape of X_train: {X_train.shape}, shape of y_train: {y_train.shape}")
tf.random.set_seed(1234)  # applied to achieve consistent results
model = Sequential(
    [
        Dense(2, activation = 'relu',   name = "L1"),
        Dense(4, activation = 'linear', name = "L2")
    ]
)
model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=tf.keras.optimizers.Adam(0.01)
)
model.fit(
    X_train,y_train,
    epochs=200
)
print(model.predict(X_train))
print(model.predict(X_train).shape)
#make a model for plotting routines to call
# model_predict = lambda Xl: np.argmax(model.predict(Xl),axis=1)
# print(model_predict)
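'''
Above, the decision boundaries show how the model has partitioned the input space.
This very simple model has had no trouble classifying the training data.
How did it accomplish this? Let's look at the network in more detail.
Below, we pull the trained weights from the model
and use them to plot the function of each network unit.
Further down there is a more detailed explanation of the results.
You don't need to know these details to use neural networks successfully,
but it may help to build intuition about how the layers combine to solve a classification problem.
'''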
# gather the trained parameters from the first layer
l1 = model.get_layer("L1")
W1,b1 = l1.get_weights()
print(W1,b1)
# gather the trained parameters from the output layer
l2 = model.get_layer("L2")
W2, b2 = l2.get_weights()
print(W2,b2)
# plot the function of the first layer
plt_layer_relu(X_train, y_train.reshape(-1,), W1, b1, classes)  
# create the 'new features', the training examples after L1 transformation
Xl2 = np.zeros_like(X_train)
Xl2 = np.maximum(0, np.dot(X_train,W1) + b1)
plt_output_layer_linear(Xl2, y_train.reshape(-1,), W2, b2, classes,
                        x0_rng = (-0.25,np.amax(Xl2[:,0])), x1_rng = (-0.25,np.amax(Xl2[:,1])))

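Rectified Linear Unit (ReLU)

This week introduced a new activation, the Rectified Linear Unit (ReLU). Based on C2_W2_Relu.

a = max(0,z) \quad\quad\text { ReLU function}

.suptitle() adds a centered title to the figure. .tight_layout() automatically adjusts the subplot parameters so that the subplots fill the figure area. It is an experimental feature and may not work in some cases; it only checks the extents of axis labels, tick labels, and titles.
.axvline() adds a vertical line across the axes. .axhline() adds a horizontal line across the axes.
The sigmoid is best for on/off or binary situations. The ReLU provides a continuous linear relationship. Additionally, it has an "off" range where the output is zero. The "off" feature makes the ReLU a non-linear activation. Why is this needed? Let's examine this below.
The function shown is composed of linear pieces (piecewise linear). The slope is constant within each linear portion and then changes abruptly at the transition points. At a transition point a new linear function is added which, when added to the existing function, produces the new slope. The new function is added at the transition point but does not contribute to the output before that point. The non-linear activation function is responsible for disabling its input before, and sometimes after, the transition points. The exercise below provides a more concrete example.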
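To make the "a new linear piece switches on at each transition point" idea concrete, here is a minimal sketch. The two units and their weights are made up for illustration and are not taken from the lab.

```python
import numpy as np

def relu(z):
    """ReLU activation: a = max(0, z), applied element-wise."""
    return np.maximum(0, z)

x = np.linspace(0, 3, 7)          # 0, 0.5, 1, ..., 3

# Hypothetical unit 0 is active from x = 0 with slope 1.
# Hypothetical unit 1 is "off" (outputs 0) until x = 1, then adds slope 2.
unit0 = relu(x)
unit1 = relu(2 * (x - 1))

# The sum is piecewise linear: slope 1 before x = 1, slope 3 after it.
f = unit0 + unit1
print(f)
```

Without the "off" range, the sum of the two units would be a single straight line; the zero region of unit1 is what lets the slope change at x = 1.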
y[50:100]=0 sets elements 50 through 99 to 0 (the end of the slice is exclusive).
tf.keras.losses.MeanSquaredError() is the mean squared error loss function.

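The softmax function

In this lab we explore the softmax function. It is used in both softmax regression and neural networks to solve multiclass classification problems.
In both softmax regression and neural networks with softmax outputs, N outputs are generated and one output is selected as the predicted category. In both cases a vector $\mathbf{z}$ is generated by a linear function and is then passed to the softmax function. The softmax function converts $\mathbf{z}$ into a probability distribution as described below. After applying softmax, each output is between 0 and 1 and the outputs sum to 1, so that they can be interpreted as probabilities. Larger inputs correspond to larger output probabilities.
The softmax function can be written: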

a_j = \frac{e^{z_j}}{ \sum_{k=1}^{N}{e^{z_k} }} \tag{1}
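Equation (1) can be sketched in NumPy as follows. The function name `my_softmax` and the input vector are illustrative; subtracting the maximum before exponentiating is a common stability trick that leaves the result unchanged, not something the equation itself requires.

```python
import numpy as np

def my_softmax(z):
    """Softmax of a 1-D vector z, following a_j = exp(z_j) / sum_k exp(z_k)."""
    ez = np.exp(z - np.max(z))   # shifting by max(z) avoids overflow, same result
    return ez / np.sum(ez)

z = np.array([1.0, 2.0, 3.0, 4.0])   # hypothetical linear outputs
a = my_softmax(z)
print(a, a.sum())
```

The outputs are all strictly between 0 and 1, they sum to 1, and the largest input keeps the largest probability.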

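Cost:
The loss function associated with softmax, the cross-entropy loss, is:
\begin{equation}
L(\mathbf{a},y)=\begin{cases}
-\log(a_1), & \text{if $y=1$}.\\
&\vdots\\
-\log(a_N), & \text{if $y=N$}
\end{cases} \tag{3}
\end{equation}
Note that in (3) above, only the line that corresponds to the target contributes to the loss; the other lines are zero. To write the cost equation we need an "indicator function" that is 1 when the index matches the target and zero otherwise.

\mathbf{1}\{y == n\} = \begin{cases} 1, & \text{if $y==n$}.\\ 0, & \text{otherwise}. \end{cases}

The cost is now:
J(\mathbf{w},b) = -\frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{N} \mathbf{1}\left\{y^{(i)} == j\right\} \log \frac{e^{z^{(i)}_j}}{\sum_{k=1}^N e^{z^{(i)}_k} }\right]
where m is the number of examples and N is the number of outputs. This is the average of all the losses.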
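As a rough sketch, the average cross-entropy cost can be computed in NumPy like this. The helper names are hypothetical, and the labels are zero-based (0 to N-1) as in the lab code, whereas the equations count classes from 1.

```python
import numpy as np

def softmax_rows(Z):
    """Row-wise softmax of an (m, N) array of logits."""
    ez = np.exp(Z - Z.max(axis=1, keepdims=True))
    return ez / ez.sum(axis=1, keepdims=True)

def cross_entropy_cost(Z, y):
    """Average of -log(a_y) over the m examples; y holds zero-based class indices."""
    m = Z.shape[0]
    A = softmax_rows(Z)
    # Indexing with (row, true class) plays the role of the indicator function:
    # only the probability of the target class enters the loss.
    return -np.mean(np.log(A[np.arange(m), y]))

Z = np.zeros((3, 4))            # hypothetical logits: all classes equally likely
y = np.array([0, 1, 3])
cost = cross_entropy_cost(Z, y)
print(cost)                     # equals log(4) for a uniform prediction over 4 classes
```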
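print(p_nonpreferred [:,:2]) prints the first two columns of every row, whereas print(p_nonpreferred [:2]) prints the first two rows.
Combining the softmax with the loss during training gives more stable and accurate results. The printed outputs then look very different, because the model outputs are no longer probabilities; if probabilities are needed, the outputs must be passed through a softmax afterwards.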
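One way to see why the combined form is more stable is to compare the two routes on extreme logits. This is an illustrative NumPy sketch of the idea (the log-sum-exp identity), not a description of TensorFlow's actual implementation; the logits are made up.

```python
import numpy as np

z = np.array([1000.0, 0.0, -1000.0])   # hypothetical extreme logits
y = 1                                   # true class (zero-based)

# Route 1: softmax first, then log. exp(1000) overflows, so the result is unusable.
with np.errstate(over="ignore", invalid="ignore", divide="ignore"):
    a = np.exp(z) / np.sum(np.exp(z))
    naive_loss = -np.log(a[y])

# Route 2: softmax folded into the loss, -log(softmax(z)[y]) = logsumexp(z) - z[y]
z_max = np.max(z)
logsumexp = z_max + np.log(np.sum(np.exp(z - z_max)))
stable_loss = logsumexp - z[y]

print(naive_loss, stable_loss)
```

The first route cannot return a finite loss here, while the rearranged expression never exponentiates a large positive number.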
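If only the most likely category is needed, the softmax is not required. np.argmax() finds the index of the largest output directly: softmax is monotonically increasing, so the index of the largest logit is the same as the index of the largest probability.
Full code: a simple 4-class classifier with softmax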

import numpy as np
import matplotlib.pyplot as plt
plt.style.use('./deeplearning.mplstyle')
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from IPython.display import display, Markdown, Latex
from sklearn.datasets import make_blobs
# %matplotlib widget
from matplotlib.widgets import Slider
from lab_utils_common import dlc
from lab_utils_softmax import plt_softmax
import logging
logging.getLogger("tensorflow").setLevel(logging.ERROR)
tf.autograph.set_verbosity(0)
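'''
This lab discusses two ways of implementing the softmax, cross-entropy loss in TensorFlow:
the "obvious" method and the "preferred" method. The former is the most straightforward, while the latter is more numerically stable.
Let's start by creating a dataset to train a multiclass classification model.
'''
centers=[[-5,2],[-2,-2],[1,2],[5,-2]]  
X_train, y_train = make_blobs(n_samples=2000, centers=centers, cluster_std=1.0,random_state=30)
'''
Method 1: softmax activation directly in the output layer.
The model below is implemented with the softmax
as an activation in the final Dense layer.
The loss function is specified separately in the compile directive.
The loss function is SparseCategoricalCrossentropy, the loss described in (3) above.
In this model the softmax takes place in the last layer. The loss function takes in the softmax output,
which is a vector of probabilities.
'''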
# model = Sequential(
#     [ 
#         Dense(25, activation = 'relu'),
#         Dense(15, activation = 'relu'),
#         Dense(4, activation = 'softmax')    # < softmax activation here
#     ]
# )
# model.compile(
#     loss=tf.keras.losses.SparseCategoricalCrossentropy(),
#     optimizer=tf.keras.optimizers.Adam(0.001),
# )

# model.fit(
#     X_train,y_train,
#     epochs=10
# )
# p_nonpreferred = model.predict(X_train)
# print(p_nonpreferred,p_nonpreferred.shape)
# print(p_nonpreferred [:2])
# print("largest value", np.max(p_nonpreferred), "smallest value", np.min(p_nonpreferred))
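'''
Recall from lecture that more stable and accurate results can be obtained
if the softmax and loss are combined during training. This is enabled by the "preferred" organization shown here.
In the preferred organization the final layer has a linear activation. For historical reasons,
the outputs in this form are referred to as logits. The loss function has an additional argument: from_logits = True.
This informs the loss function that the softmax operation should be included in the loss calculation,
which allows for an optimized implementation.
'''
model = Sequential(
    [ 
        Dense(25, activation = 'relu'),
        Dense(15, activation = 'relu'),
        Dense(4, activation = 'linear')    # <-- linear output (logits); softmax is applied inside the loss
    ]
)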
model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=tf.keras.optimizers.Adam(0.001),
)

model.fit(
    X_train,y_train,
    epochs=10
)
'''
Notice that in the preferred model the outputs are not probabilities but can range from large negative to large positive numbers.
The output must be sent through a softmax when performing a prediction that expects a probability.
Let's look at the preferred model outputs. Unlike the first method, these are raw model outputs (logits), not probabilities.
'''
p_preferred = model.predict(X_train)
print(p_preferred,p_preferred.shape)
print(p_preferred [:2])
print("largest value", np.max(p_preferred), "smallest value", np.min(p_preferred))  
# most likely category for the first five examples
for i in range(5):
    print(f'{p_preferred[i]}, most likely category (4 classes): {np.argmax(p_preferred[i])}')
# convert the logits to probabilities with softmax
sm_preferred=tf.nn.softmax(p_preferred).numpy()
print(sm_preferred,sm_preferred.shape)
print(sm_preferred [:2])
print("largest predicted value", np.max(sm_preferred), "smallest predicted value", np.min(sm_preferred))  
for i in range(5):
    print(f'{sm_preferred[i]}, most likely category [probabilities] (4 classes): {np.argmax(sm_preferred[i])}')

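High bias and high variance

Bias reflects how far the predictions are from the target; variance reflects how spread out they are. On the training set, the prediction error falls as more polynomial terms are added, that is, the fit keeps improving; with the same parameters, the error on the validation set first falls, reaches a minimum, and then rises again.
One way to reduce bias is simply to use a larger neural network, meaning more hidden layers or more hidden units per layer.
A large gap between Jcv and Jtrain therefore indicates that you probably have a high-variance problem, and one way to try to fix high variance is to get more data.
For example, if your algorithm has high bias rather than high variance, spending weeks or months on a project to collect more data may not be the most fruitful direction; if it has high variance, however, collecting more data may help a great deal.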
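The rule of thumb above can be sketched as a small helper. Everything here is hypothetical: the function name, the baseline error, and the tolerance are illustrative choices, not values from the course.

```python
def diagnose(j_train, j_cv, baseline, tol):
    """Rough bias/variance check from training, cross-validation and baseline errors."""
    findings = []
    if j_train - baseline > tol:      # training error far above the baseline
        findings.append("high bias")
    if j_cv - j_train > tol:          # large gap between cv and training error
        findings.append("high variance")
    return findings or ["looks fine"]

# Hypothetical error values
print(diagnose(j_train=0.11, j_cv=0.15, baseline=0.10, tol=0.02))
print(diagnose(j_train=0.30, j_cv=0.31, baseline=0.10, tol=0.02))
```

In the first case more data is a reasonable next step; in the second a larger model (or more features) is more likely to help than more data.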

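Recall

Recall is the fraction of actual positives that the model correctly identifies, TP / (TP + FN); precision is the fraction of predicted positives that are actually positive, TP / (TP + FP). Informally: high precision means the model's positive predictions can be trusted; high recall means few positives are missed.
In fact, if you want to predict y = 1 only when you are very confident, you can raise the decision threshold to, say, 0.9. This gives higher precision: whenever you predict that a patient has the disease, you are probably right. The price is lower recall, because fewer of the actual positives are flagged.
The F1 score is the harmonic mean of precision and recall; its value is used to judge whether the precision/recall trade-off is acceptable.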
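A minimal sketch of these three quantities, computed from made-up confusion-matrix counts (the function name and the numbers are illustrative).

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
    return precision, recall, f1

# Hypothetical counts: 15 true positives, 5 false positives, 10 false negatives
precision, recall, f1 = precision_recall_f1(tp=15, fp=5, fn=10)
print(precision, recall, f1)
```

Because the harmonic mean is dominated by the smaller value, a model with very high precision but very low recall (or the reverse) still gets a low F1.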

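Advice for applying machine learning

Evaluating a learning algorithm (polynomial regression)
Suppose you have created a machine learning model and find that it fits your training data very well. Are you done? Not quite. The goal of creating the model is to be able to predict values for new examples.
How can you test your model's performance on new data before deploying it?
The answer has two parts:
First, split the original data set into "training" and "test" sets:
use the training data to fit the parameters of the model;
use the test data to evaluate the model on new data.
Second, develop an error function to evaluate the model.
Splitting the data set:
Lectures advised reserving 20-40% of the data set for testing. Let's use the train_test_split function to perform the split. Double-check the shapes after running the following cell.
seed() sets the integer starting value used by the random number generator. With the same seed() value, the same random numbers are generated every time; if it is not set, the system picks a value itself (for example, based on the time), so the numbers differ from run to run. See: using numpy.random.seed().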
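A quick illustration of the seeding behaviour described above, using the same np.random.seed / np.random.sample calls that gen_data relies on.

```python
import numpy as np

np.random.seed(1)
a = np.random.sample(3)

np.random.seed(1)          # same seed: the same numbers are drawn again
b = np.random.sample(3)

np.random.seed(2)          # different seed: a different sequence
c = np.random.sample(3)

print(a, b, c)
```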
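See also: pros and cons of random.sample() and numpy.random.choice() for batch sampling without replacement in Python.
See also: train_test_split.
Error calculation for model evaluation, linear regression
When evaluating a linear regression model, you average the squared error between the predicted values and the target values.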

J_\text{test}(\mathbf{w},b) = \frac{1}{2m_\text{test}}\sum_{i=0}^{m_\text{test}-1} ( f_{\mathbf{w},b}(\mathbf{x}^{(i)}_\text{test}) - y^{(i)}_\text{test} )^2 \tag{1}

Below, create a function to evaluate the error of a linear regression model on a data set:

def eval_mse(y, yhat):
    """ 
    Calculate the mean squared error on a data set.
    Args:
      y    : (ndarray  Shape (m,) or (m,1))  target value of each example
      yhat : (ndarray  Shape (m,) or (m,1))  predicted value of each example
    Returns:
      err: (scalar)             
    """
    m = len(y)
    err = 0.0
    for i in range(m):
    ### START CODE HERE ### 
        err_i=(1/(2*m))*(y[i]-yhat[i])**2
        err+=err_i
    ### END CODE HERE ### 
    return(err)
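The same quantity can also be computed without an explicit loop. This is an alternative sketch, not the graded solution; the name `eval_mse_vectorized` and the sample values are made up.

```python
import numpy as np

def eval_mse_vectorized(y, yhat):
    """Mean squared error with the 1/(2m) convention of equation (1)."""
    # Flatten both inputs so that shapes (m,) and (m,1) cannot broadcast to (m,m)
    y = np.asarray(y, dtype=float).reshape(-1)
    yhat = np.asarray(yhat, dtype=float).reshape(-1)
    return np.sum((yhat - y) ** 2) / (2 * len(y))

y_demo = np.array([2.1, 3.5, 0.6])
yhat_demo = np.array([[2.5], [3.1], [1.0]])     # deliberately shaped (m,1)
err_demo = eval_mse_vectorized(y_demo, yhat_demo)
print(err_demo)
```

The reshape matters: subtracting an (m,1) array from an (m,) array silently produces an (m,m) matrix, which would inflate the error.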

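Comparing performance on training and test data
Let's build a high-degree polynomial model to minimize training error. This uses the linear regression functions from scikit-learn; the code is in the imported utility file if you want to see the details. The steps are:
Create and fit the model. ("fit" is another name for training, or running gradient descent.)
Compute the error on the training data.
Compute the error on the test data.
See also: PolynomialFeatures explained.
Generating the data set: x_ideal and y_ideal are used to draw the ideal curve, here x_train**2 + c. The returned x_train and y_train can be split into test and train sets, or split twice to obtain cv, test, and train sets.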

def gen_data(m, seed=1, scale=0.7):
    """ generate a data set based on a x^2 with added noise """
    c = 0
    x_train = np.linspace(0,49,m)
    np.random.seed(seed)
    y_ideal = x_train**2 + c
    y_train = y_ideal + scale * y_ideal*(np.random.sample((m,))-0.5)
    x_ideal = x_train #for redraw when new data included in X
    return x_train, y_train, x_ideal, y_ideal

To produce the training, test, and cross-validation data sets, we use train_test_split twice to get three splits:

#split the data using sklearn routine 
X_train, X_, y_train, y_ = train_test_split(X,y,test_size=0.40, random_state=1)
X_cv, X_test, y_cv, y_test = train_test_split(X_,y_,test_size=0.50, random_state=1)
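To check the resulting proportions without scikit-learn, the 60/20/20 split can be mimicked with a shuffled index array. This is only an illustration of the sizes involved, not how train_test_split works internally; all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 40                                   # hypothetical number of examples
perm = rng.permutation(m)                # shuffled indices 0..m-1

n_train = m * 60 // 100                  # 60% for training
n_cv = m * 20 // 100                     # 20% for cross-validation

train_idx = perm[:n_train]
cv_idx = perm[n_train:n_train + n_cv]
test_idx = perm[n_train + n_cv:]         # the remaining 20% for testing

print(len(train_idx), len(cv_idx), len(test_idx))
```

Splitting off 40% first and then halving that remainder, as in the two train_test_split calls, gives the same three sizes.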

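max() returns the largest of its arguments; the argument can also be a sequence.
ax.legend(loc='upper left'): loc, as the name suggests, sets the location of the legend.
Finding the optimal degree:
In previous labs you found that you could create a model capable of fitting complex curves by using a polynomial (see Course 1, Week 2, Feature Engineering and Polynomial Regression lab). You also showed that increasing the degree of the polynomial can produce overfitting (see Course 1, Week 3, Overfitting lab). Let's use that knowledge here to test our ability to tell overfitting from underfitting.
Let's train the model repeatedly, increasing the degree of the polynomial on each loop iteration. lmodel.mse computes the error for each degree and stores it in an array, and numpy.argmin then returns the index of the minimum along the given axis, which identifies the best degree. An excerpt of the code: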

for degree in range(max_degree):
  lmodel=lin_model(degree+1) # degree+1 so the polynomial degree starts at 1, not 0
  lmodel.fit(X_train, y_train)
  # model selection is based on the cv error
  yhat = lmodel.predict(X_cv)
  err_cv[degree] = lmodel.mse(y_cv, yhat)
  # training error, computed the same way
  yhat = lmodel.predict(X_train)
  err_train[degree] = lmodel.mse(y_train, yhat)
optimal_degree = np.argmin(err_cv)+1
print(f"optimal degree: {optimal_degree}")

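This code gave 2 on one run and 5 on another. ~~The cause is unclear for now.~~ Answer: train_test_split shuffles the data randomly before splitting, so a different random_state produces a different split and can change the selected degree. Fix random_state to make a run reproducible, and use separate random_state values for tasks with different purposes.
Tuning regularization.
In previous labs you used regularization to reduce overfitting. As with the degree, the same method can be used to tune the regularization parameter lambda ($\lambda$) and find the most suitable value.
Let's demonstrate this by starting with a high-degree polynomial and varying the regularization parameter.
For a 2-D array, np.zeros needs the shape as a tuple, i.e. two sets of parentheses such as np.zeros((m,n)); otherwise the second number is taken as the dtype argument and an error is raised.
A double loop finds a reasonable degree and lambda at the same time; the result matches what the separate searches found, ~~so it was redundant~~ possibly because the two parameters are somewhat independent.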
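Locating the smallest entry of a 2-D error grid can be done with np.argmin plus np.unravel_index. A small sketch with made-up error values (rows standing for degrees, columns for lambda values):

```python
import numpy as np

# The shape is passed as a tuple: np.zeros((rows, cols))
err = np.zeros((3, 4))

# Hypothetical cv errors
err[:] = [[5.0, 4.0, 3.0, 6.0],
          [2.0, 1.5, 0.5, 3.0],
          [4.0, 2.0, 1.0, 7.0]]

# argmin works on the flattened array; unravel_index maps the flat position
# back to (row, column)
row, col = np.unravel_index(np.argmin(err), err.shape)
print(row, col, err[row, col])
```

With ties, np.argmin returns the first minimum in row-major order, which matches a nested loop that only updates on a strictly smaller value.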
Full code

import numpy as np
# %matplotlib widget
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.activations import relu,linear
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.optimizers import Adam

import logging
logging.getLogger("tensorflow").setLevel(logging.ERROR)

from public_tests_a1 import * 

tf.keras.backend.set_floatx('float64')
from assigment_utils import *

tf.autograph.set_verbosity(0)
# Generate some data
X,y,x_ideal,y_ideal = gen_data(18, 2, 0.7)
print("X.shape", X.shape, "y.shape", y.shape,"y_ideal",y_ideal)  
print(x_ideal,y_ideal)  
#split the data using sklearn routine 
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.33, random_state=1)
print("X_train.shape", X_train.shape, "y_train.shape", y_train.shape)
print("X_test.shape", X_test.shape, "y_test.shape", y_test.shape)

fig,ax=plt.subplots(1,1,figsize=(4,4))
ax.plot(x_ideal, y_ideal, "--", color = "orangered", label="y_ideal", lw=1)
ax.set_title("Training, Test",fontsize = 14)  
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.scatter(X_train, y_train, color = "red",           label="train")
ax.scatter(X_test, y_test,   color = dlc["dlblue"],   label="test")
ax.legend(loc='upper left')
plt.show()  
# UNQ_C1
# GRADED CELL: eval_mse
def eval_mse(y, yhat):
    """ 
    Calculate the mean squared error on a data set.
    Args:
      y    : (ndarray  Shape (m,) or (m,1))  target value of each example
      yhat : (ndarray  Shape (m,) or (m,1))  predicted value of each example
    Returns:
      err: (scalar)             
    """
    m = len(y)
    err = 0.0
    for i in range(m):
    ### START CODE HERE ### 
        err_i=(1/(2*m))*(y[i]-yhat[i])**2
        err+=err_i
    ### END CODE HERE ### 
    
    return(err)
# compare performance on training and test data
# create a model in sklearn, train on training data
degree = 10
lmodel = lin_model(degree) # model with polynomial degree 10
lmodel.fit(X_train, y_train)

# predict on training data, find training error
yhat = lmodel.predict(X_train)
err_train = lmodel.mse(y_train, yhat)

# predict on test data, find error
yhat = lmodel.predict(X_test)
err_test = lmodel.mse(y_test, yhat)
# print the errors: the training error is far smaller than the test error
print(f"training err {err_train:0.2f}, test err {err_test:0.2f}")
# plot predictions over the data range
x=np.linspace(0,int(X.max()),200)
y_pred=lmodel.predict(x)
print(f"x={x}",f"y_pred={y_pred}")
plt_train_test(X_train, y_train, X_test, y_test, x, y_pred, x_ideal, y_ideal, degree)
# Generate  data
X,y,x_ideal,y_ideal=gen_data(40,5,0.7)
print("X.shape", X.shape, "y.shape", y.shape)
# split the data using sklearn routine: 60% training, 20% cv, 20% test
X_train, X_, y_train, y_ = train_test_split(X,y,test_size=0.4, random_state=2)
X_test, X_cv, y_test, y_cv = train_test_split(X_,y_,test_size=0.5, random_state=2)
print("X_train.shape", X_train.shape, "y_train.shape", y_train.shape)
print("X_cv.shape", X_cv.shape, "y_cv.shape", y_cv.shape)
print("X_test.shape", X_test.shape, "y_test.shape", y_test.shape)

fig, ax = plt.subplots(1,1,figsize=(4,4))
ax.plot(x_ideal, y_ideal, "--", color = "orangered", label="y_ideal", lw=1)
ax.set_title("Training, CV, Test",fontsize = 14)
ax.set_xlabel("x")
ax.set_ylabel("y")

ax.scatter(X_train, y_train, color = "red",           label="train")
ax.scatter(X_cv, y_cv,       color = dlc["dlorange"], label="cv")
ax.scatter(X_test, y_test,   color = dlc["dlblue"],   label="test")
ax.legend(loc='upper left')
plt.show()  

max_degree = 9
err_train = np.zeros(max_degree)    
err_cv = np.zeros(max_degree)     
# the code below is used to plot the curve for each degree
x = np.linspace(0,int(X.max()),100)  
y_pred = np.zeros((100,max_degree))  #columns are lines to plot
print(y_pred.shape)
for degree in range(max_degree):
  lmodel=lin_model(degree+1) # degree+1 so the polynomial degree starts at 1, not 0
  lmodel.fit(X_train, y_train)
  # model selection is based on the cv error
  yhat = lmodel.predict(X_cv)
  err_cv[degree] = lmodel.mse(y_cv, yhat)
  # training error, computed the same way
  yhat = lmodel.predict(X_train)
  err_train[degree] = lmodel.mse(y_train, yhat)

  # store the predictions for plotting
  y_pred[:,degree] = lmodel.predict(x)  
optimal_degree = np.argmin(err_cv)+1
print(f"optimal degree: {optimal_degree}")
plt.close("all")
plt_optimal_degree(X_train, y_train, X_cv, y_cv, x, y_pred, x_ideal, y_ideal, 
                   err_train, err_cv, optimal_degree, max_degree)

lambda_range = np.array([0.0, 1e-6, 1e-5, 1e-4,1e-3,1e-2, 1e-1,1,10,100])
# number of lambda values, used to size the error arrays
num_steps=len(lambda_range)
degree=10  
err_cv1=np.zeros(num_steps)
err_test1=np.zeros(num_steps)
err_train1=np.zeros(num_steps)
x=np.linspace(0,int(X.max()),100)
y_pred=np.zeros((100,num_steps))
for i in range(num_steps):
    # regularization=True is assumed to be what makes lin_model use lambda_
    lmodel=lin_model(degree, regularization = True, lambda_=lambda_range[i])
    lmodel.fit(X_train,y_train)
    yhat = lmodel.predict(X_cv)
    err_cv1[i]=lmodel.mse(y_cv,yhat)
    yhat = lmodel.predict(X_test)
    err_test1[i]=lmodel.mse(y_test,yhat)
    y_pred[:,i]=lmodel.predict(x)
    yhat = lmodel.predict(X_train)
    err_train1[i] = lmodel.mse(y_train, yhat)
# index into the lambda sweep (err_cv1), not the degree sweep (err_cv)
optimal_reg_idx = np.argmin(err_cv1)
optimal_lambda=lambda_range[optimal_reg_idx]
print(f"optimal lambda: {optimal_lambda}")
plt.close("all")
plt_tune_regularization(X_train, y_train, X_cv, y_cv, x, y_pred, err_train1, err_cv1, optimal_reg_idx, lambda_range) 
"""
Use a double loop to find a suitable degree and lambda together.
The tricky part is that the shape of y_pred changes.
""" 
lambda_range = np.array([0.0, 1e-6, 1e-5, 1e-4,1e-3,1e-2, 1e-1,1,10,100])
# number of lambda values, used to size the error arrays
n=num_steps=len(lambda_range)
# degrees 1 through 9
m=max_degree=9  
err_cv2=np.zeros((m,n))
err_test2=np.zeros((m,n))
err_train2=np.zeros((m,n))
x=np.linspace(0,int(X.max()),100)
y_pred=np.zeros((100,m,n))
for j in range(max_degree):
  for i in range(num_steps):
      # j+1 so the degree starts at 1; regularization=True is assumed to be what makes lin_model use lambda_
      lmodel=lin_model(j+1, regularization = True, lambda_=lambda_range[i])
      lmodel.fit(X_train,y_train)
      yhat = lmodel.predict(X_cv)
      err_cv2[j,i]=lmodel.mse(y_cv,yhat)
      yhat = lmodel.predict(X_test)
      err_test2[j,i]=lmodel.mse(y_test,yhat)
      y_pred[:,j,i]=lmodel.predict(x)
      yhat = lmodel.predict(X_train)
      err_train2[j,i] = lmodel.mse(y_train, yhat)
      # a flat i*j index could not be mapped back to (j, i), so the errors are 2-D and y_pred is 3-D
# row (degree index) and column (lambda index) of the smallest cv error
row, colum = np.unravel_index(np.argmin(err_cv2), err_cv2.shape)
optimal_lambda=lambda_range[colum]
optimal_reg_idx = colum 
optimal_degree = row+1
print(f"optimal degree: {optimal_degree}")
print(f"optimal lambda: {optimal_lambda}")
# plt_optimal_degree(X_train, y_train, X_cv, y_cv, x, y_pred, x_ideal, y_ideal, 
#                    err_train2, err_cv2, optimal_degree, max_degree)
# plt_tune_regularization(X_train, y_train, X_cv, y_cv, x, y_pred, err_train2, err_cv2, optimal_reg_idx, lambda_range)

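Decision tree model

Purity

Entropy (impurity) is highest, and purity lowest, when dogs and cats are in a 1:1 ratio; impurity is lowest when the sample is all dogs or all cats. The curve resembles an upside-down parabola.
H(p_1) = -p_1 \text{log}_2(p_1) - p_0 \text{log}_2(p_0) = -p_1 \text{log}_2(p_1) - (1- p_1) \text{log}_2(1- p_1)
Only with log base 2 does the curve peak at exactly 1.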
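A short sketch of the entropy curve for a few values of p1, using only the standard library. The function name `entropy` is illustrative; the convention 0·log2(0) = 0 handles the pure cases.

```python
import math

def entropy(p1):
    """Binary entropy H(p1) in bits, with 0*log2(0) taken as 0."""
    if p1 == 0 or p1 == 1:
        return 0.0
    return -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)

for p1 in (0.0, 0.2, 0.5, 0.8, 1.0):
    print(p1, entropy(p1))
```

The values are symmetric around p1 = 0.5, where the entropy reaches its maximum of 1.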

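Information gain

A reduction in entropy is a gain in information, hence the name information gain. A larger information gain usually means a larger reduction in impurity and a cleaner split.
The information gain is the entropy at the root, H(0.5), minus the weighted entropy of the next level. Splitting can stop once the tree reaches a given depth or the information gain falls below a threshold.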

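Regression trees

When building a regression tree, rather than trying to reduce entropy, the impurity measure used for classification, we try to reduce the weighted variance of the values Y in each subset of the data. Just as a classification tree picks the feature that gives the largest information gain, a regression tree picks the feature that gives the largest variance reduction, which is why ear shape is chosen as the splitting feature in the lecture example.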
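A rough sketch of variance reduction for one candidate split. The target values are made up, and np.var computes the population variance by default, which is an assumption here; the course example may use the sample variance instead.

```python
import numpy as np

# Hypothetical target values (e.g. weights) on each side of a candidate split
y_left = np.array([8.0, 9.0, 10.0])
y_right = np.array([18.0, 20.0, 22.0])
y_node = np.concatenate([y_left, y_right])

# Fraction of the node's examples that go to each branch
w_left = len(y_left) / len(y_node)
w_right = len(y_right) / len(y_node)

# Variance at the node minus the weighted variance after the split
variance_reduction = np.var(y_node) - (w_left * np.var(y_left) + w_right * np.var(y_right))
print(variance_reduction)
```

The feature whose split gives the largest reduction would be chosen, in the same way the largest information gain is chosen for classification.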

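Decision tree application

1 - Packages
First, let's run the cell below to import all the packages needed for this assignment.
numpy is the fundamental package for working with matrices in Python.
matplotlib is a well-known library for plotting graphs in Python.
utils.py contains helper functions for this assignment. You do not need to modify the code in this file.
2 - Problem statement
Suppose you are starting a company that grows and sells wild mushrooms.
Since not all mushrooms are edible, you'd like to be able to tell whether a given mushroom is edible or poisonous based on its physical attributes. You have some existing data that you can use for this task.
Can you use the data to help you identify which mushrooms can be sold safely?
Note: the dataset used is for illustrative purposes only. It is not meant to be a guide for identifying edible mushrooms.
3 - Dataset
Start by loading the dataset for this task. The dataset you have collected is as follows: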

Cap Color	Stalk Shape	Solitary	Edible
Brown	Tapering	Yes	1
Brown	Enlarging	Yes	1
Brown	Enlarging	No	0
Brown	Enlarging	No	0
Brown	Tapering	Yes	1
Red	Tapering	Yes	0
Red	Enlarging	No	0
Brown	Enlarging	Yes	1
Red	Tapering	No	1
Brown	Enlarging	No	0

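Therefore:
X_train contains three features for each example
Brown Cap (a value of 1 indicates "Brown" cap color and 0 indicates "Red" cap color)
Tapering Stalk Shape (a value of 1 indicates "Tapering" and 0 indicates "Enlarging")
Solitary (a value of 1 indicates "Yes" and 0 indicates "No")
y_train is whether the mushroom is edible
y = 1 indicates edible
y = 0 indicates poisonous
For 2-D data, len(X_train) returns the number of rows, and X_train[:5] takes the first five rows, equivalent to X_train[:5,:].
4 - Decision Tree Refresher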
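In this practice lab, you will build a decision tree based on the dataset provided.

Recall that the steps for building a decision tree are as follows:

Start with all examples at the root node
Calculate the information gain for splitting on all possible features, and pick the one with the highest information gain
Split the dataset according to the selected feature, and create left and right branches of the tree
Keep repeating the splitting process until the stopping criteria are met
In this lab, you'll implement the following functions, which let you split a node into left and right branches using the feature with the highest information gain

Calculate the entropy at a node
Split the dataset at a node into left and right branches based on a given feature
Calculate the information gain from splitting on a given feature
Choose the feature that maximizes information gain
We'll then use the helper functions you've implemented to build a decision tree by repeating the splitting process until the stopping criteria are met

For this lab, the stopping criterion we've chosen is setting a maximum depth of 2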
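4.1 Calculating entropy
First, you'll write a helper function called compute_entropy that computes the entropy (a measure of impurity) at a node.

The function takes a numpy array (y) that indicates whether the examples in that node are edible (1) or poisonous (0).
Complete the compute_entropy() function below to:

Compute $p_1$, the fraction of examples that are edible (i.e. have value = 1 in y)
Then calculate the entropy as

H(p_1) = -p_1 \text{log}_2(p_1) - (1- p_1) \text{log}_2(1- p_1)

Note
The log is calculated with base 2
For implementation purposes, $0\text{log}_2(0) = 0$. That is, if p_1 = 0 or p_1 = 1, set the entropy to 0
Make sure to check that the data at a node is not empty (i.e. len(y) != 0). Return 0 if it is empty

Exercise 1
Please complete the compute_entropy() function following the instructions above.
If you get stuck, you can check out the hints presented after the cell below to help you with the implementation.
To compute p1, first count the zero elements of the array (the same numpy trick counts zero-valued pixels in an image):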

import numpy as np
a = np.array([[0,1,2],[2,0,0]])
cnt_array = np.where(a,0,1)
print(np.sum(cnt_array))

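Elements equal to 0 are mapped to 1 and everything else to 0, so the sum is the number of zeros.
Alternatively, the recommended way to count occurrences of a value a in a numpy array is len(y[y == a]), so p1 = len(y[y == 1]) / len(y).
There are three cases: y is all 0s or all 1s; y is empty; y is a normal mix. For the logarithm, NumPy provides np.log (natural log), np.log2, and np.log10; use np.log2 here.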
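For comparison, two other NumPy ways to get the same counts, shown on the y_train labels from this lab (np.count_nonzero and the mean of a boolean mask):

```python
import numpy as np

y = np.array([1, 1, 0, 0, 1, 0, 0, 1, 1, 0])

n_zero = np.count_nonzero(y == 0)   # number of poisonous examples
n_one = len(y[y == 1])              # number of edible examples
p1 = np.mean(y == 1)                # fraction of edible examples

print(n_zero, n_one, p1)
```

The mean of a boolean array treats True as 1, so it gives the fraction directly; it is only safe when the array is non-empty.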

def compute_entropy(y):
    """
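    Computes the entropy at a node
    
    Args:
        y (ndarray): 1-D numpy array indicating whether each example at the node is
            edible (`1`) or poisonous (`0`)
       
    Returns:
        entropy (float): entropy at that node
        
    """
    # You need to return the following variables correctly
    entropy = 0.
    l=len(y)
    ### START CODE HERE ###
    # count the zeros in y; the number of ones is the remainder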
    if l!=0:
        t0=np.sum(np.where(y,0,1))
        if t0==0 or t0==l:
           entropy=0 
        else:    
            t1=l-t0
            p1=t1/l
            entropy=-p1*np.log2(p1)-(1-p1)*np.log2(1-p1)      
    ### END CODE HERE ###        
    
    return entropy

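4.2 Splitting the dataset
Next, you'll write a helper function called split_dataset that takes in the data at a node and a feature to split on, and splits it into left and right branches. Later in the lab you'll implement code to calculate how good the split is.
The function takes in the training data, the list of indices of the data points at that node, and the feature to split on.
It splits the data and returns the subsets of indices for the left and the right branch.
For example, say we are at the root node (so node_indices = [0,1,2,3,4,5,6,7,8,9]) and we choose to split on feature 0, which is whether or not the example has a brown cap.
The output of the function is then left_indices = [0,1,2,3,4,7,9] and right_indices = [5,6,8]
Exercise 2
Please complete the split_dataset() function shown below
For each index in node_indices
If the value of X at that index for that feature is 1, add the index to left_indices
If the value of X at that index for that feature is 0, add the index to right_indices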
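The same split can be sketched with a boolean mask instead of a loop. This is an alternative illustration, not the graded solution; the name `split_dataset_vectorized` is hypothetical, and it assumes binary features, so anything that is not 1 goes to the right branch.

```python
import numpy as np

X_train = np.array([[1,1,1],[1,0,1],[1,0,0],[1,0,0],[1,1,1],
                    [0,1,1],[0,0,0],[1,0,1],[0,1,0],[1,0,0]])

def split_dataset_vectorized(X, node_indices, feature):
    """Split node_indices by whether X[index, feature] equals 1."""
    node_indices = np.asarray(node_indices)
    mask = X[node_indices, feature] == 1
    # Assumes binary features: every index not in the left branch goes right
    return node_indices[mask].tolist(), node_indices[~mask].tolist()

root_indices = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
left, right = split_dataset_vectorized(X_train, root_indices, feature=0)
print(left, right)
```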
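4.3 Calculating information gain
Next, you'll write a function called compute_information_gain that takes in the training data, the indices at a node, and a feature to split on, and returns the information gain from the split.
Exercise 3
Please complete the compute_information_gain() function shown below to compute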

\text{Information Gain} = H(p_1^\text{node})- (w^{\text{left}}H(p_1^\text{left}) + w^{\text{right}}H(p_1^\text{right}))

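$H(p_1^\text{node})$ is the entropy at the node
$H(p_1^\text{left})$ and $H(p_1^\text{right})$ are the entropies at the left and the right branches resulting from the split
$w^{\text{left}}$ and $w^{\text{right}}$ are the proportions of examples at the left and right branch, respectively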
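As a worked example of the formula, the sketch below evaluates the information gain at the root for each of the three mushroom features. It is a compact stand-alone version with hypothetical helper names, separate from the graded functions, and it splits with boolean masks rather than index lists.

```python
import numpy as np

X = np.array([[1,1,1],[1,0,1],[1,0,0],[1,0,0],[1,1,1],
              [0,1,1],[0,0,0],[1,0,1],[0,1,0],[1,0,0]])
y = np.array([1,1,0,0,1,0,0,1,1,0])

def entropy(labels):
    """Binary entropy of a label array; 0 for empty or pure nodes."""
    if len(labels) == 0:
        return 0.0
    p1 = np.mean(labels)
    if p1 == 0 or p1 == 1:
        return 0.0
    return -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)

def information_gain(X, y, feature):
    """H(node) minus the weighted entropy of the left (==1) and right (==0) branches."""
    left, right = y[X[:, feature] == 1], y[X[:, feature] == 0]
    w_left, w_right = len(left) / len(y), len(right) / len(y)
    return entropy(y) - (w_left * entropy(left) + w_right * entropy(right))

gains = [information_gain(X, y, f) for f in range(X.shape[1])]
print(gains)                      # roughly 0.035, 0.125, 0.278
print(int(np.argmax(gains)))      # feature with the largest gain
```

The root entropy is 1 (5 edible, 5 poisonous), and the solitary feature (index 2) removes the most impurity, so it is the one selected at the root.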


def compute_information_gain(X, y, node_indices, feature):
    
    """
    Compute the information of splitting the node on a given feature
    
    Args:
        X (ndarray):            Data matrix of shape(n_samples, n_features)
        y (array like):         list or ndarray with n_samples containing the target variable
        node_indices (ndarray): List containing the active indices. I.e, the samples being considered in this step.
        feature (int):          Index of feature to split on
   
    Returns:
        information_gain (float): Information gain from the split
    
    """    
    # Split dataset
    left_indices, right_indices = split_dataset(X, node_indices, feature)
    
    # Some useful variables
    X_node, y_node = X[node_indices], y[node_indices]
    X_left, y_left = X[left_indices], y[left_indices]
    X_right, y_right = X[right_indices], y[right_indices]
    
    # You need to return the following variables correctly
    information_gain = 0
    
    ### START CODE HERE ###
    
    # Weights 
    w_left=len(X_left)/len(X_node)
    w_right=len(X_right)/len(X_node)
    HP1_node=compute_entropy(y_node)
    HP1_left=compute_entropy(y_left)
    HP1_right=compute_entropy(y_right)
    #Weighted entropy
     
    #Information gain                                                   
    information_gain = HP1_node-(w_left*HP1_left+w_right*HP1_right)
    ### END CODE HERE ###  
    
    return information_gain

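Getting the best split
Now let's write a function that gets the best feature to split on by computing the information gain from each feature, as we did above, and returning the feature that gives the maximum information gain
Exercise 4
Please complete the get_best_split() function shown below.
The function takes in the training data, along with the indices of the data points at that node
The output of the function is the feature that gives the maximum information gain
You can use the compute_information_gain() function to iterate through the features and calculate the information gain for each. If you get stuck, you can check out the hints presented after the cell below to help you with the implementation.
X.shape[1] gives the length of the second dimension, i.e. the number of features; X.shape[0] gives the length of the first dimension, i.e. the number of examples.
To find the maximum information gain, store the gain of each feature in a list and take the index of the largest value. See also: getting the index of the largest element of a Python list or numpy array.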


def get_best_split(X, y, node_indices):   
    """
    Returns the optimal feature and threshold value
    to split the node data 
    
    Args:
        X (ndarray):            Data matrix of shape(n_samples, n_features)
        y (array like):         list or ndarray with n_samples containing the target variable
        node_indices (ndarray): List containing the active indices. I.e, the samples being considered in this step.

    Returns:
        best_feature (int):     The index of the best feature to split
    """    
    
    # Some useful variables
    num_features = X.shape[1]
    best_i=[]
    # You need to return the following variables correctly
    best_feature = -1
    
    ### START CODE HERE ###
    for i in range(num_features):
        best_i.append(compute_information_gain(X,y,node_indices,i)) 
    best_feature=best_i.index(max(best_i))
        
    ### END CODE HERE ##    
   
    return best_feature

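Building the tree
In this section, we use the functions you implemented above to generate a decision tree by successively picking the best feature to split on until the stopping criterion (a maximum depth of 2) is reached.
Full code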

import numpy as np
import matplotlib.pyplot as plt
from public_tests import *

X_train = np.array([[1,1,1],[1,0,1],[1,0,0],[1,0,0],[1,1,1],[0,1,1],[0,0,0],[1,0,1],[0,1,0],[1,0,0]])
y_train = np.array([1,1,0,0,1,0,0,1,1,0])
print("First few elements of X_train:\n", X_train[:5,:])
print("Type of X_train:",type(X_train))
print("First few elements of y_train:", y_train[:5])
print("Type of y_train:",type(y_train))
print ('Number of training examples (m):', len(X_train))


# UNQ_C1
# GRADED FUNCTION: compute_entropy

def compute_entropy(y):
    """
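    Computes the entropy at a node
    
    Args:
        y (ndarray): 1-D numpy array indicating whether each example at the node is
            edible (`1`) or poisonous (`0`)
       
    Returns:
        entropy (float): entropy at that node
        
    """
    # You need to return the following variables correctly
    entropy = 0.
    l=len(y)
    ### START CODE HERE ###
    # count the zeros in y; the number of ones is the remainder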
    if l!=0:
        t0=np.sum(np.where(y,0,1))
        if t0==0 or t0==l:
           entropy=0 
        else:    
            t1=l-t0
            p1=t1/l
            entropy=-p1*np.log2(p1)-(1-p1)*np.log2(1-p1)      
    ### END CODE HERE ###        
    
    return entropy
# Compute entropy at the root node (i.e. with all examples)
# Since we have 5 edible and 5 non-edible mushrooms, the entropy should be 1"

print("Entropy at root node: ", compute_entropy(y_train)) 

# UNIT TESTS
compute_entropy_test(compute_entropy)

# UNQ_C2
# GRADED FUNCTION: split_dataset

def split_dataset(X, node_indices, feature):
    """
    Splits the data at the given node into
    left and right branches
    
    Args:
        X (ndarray):             Data matrix of shape(n_samples, n_features)
        node_indices (ndarray):  List containing the active indices, i.e. the samples being considered at this step.
        feature (int):           Index of feature to split on
    
    Returns:
        left_indices (ndarray): Indices with feature value == 1
        right_indices (ndarray): Indices with feature value == 0
    """
    
    # You need to return the following variables correctly
    left_indices = []
    right_indices = []
    
    ### START CODE HERE ###
    for i in node_indices:
        if X[i][feature]==1:
            left_indices.append(i)
        elif X[i][feature]==0:
            right_indices.append(i)
    ### END CODE HERE ###
        
    return left_indices, right_indices
root_indices = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# Feel free to play around with these variables
# The dataset only has three features, so this value can be 0 (Brown Cap), 1 (Tapering Stalk Shape) or 2 (Solitary)
feature = 0

left_indices, right_indices = split_dataset(X_train, root_indices, feature)

print("Left indices: ", left_indices)
print("Right indices: ", right_indices)

# UNIT TESTS    
split_dataset_test(split_dataset) 
# UNQ_C3
# GRADED FUNCTION: compute_information_gain

def compute_information_gain(X, y, node_indices, feature):
    
    """
    Compute the information of splitting the node on a given feature
    
    Args:
        X (ndarray):            Data matrix of shape(n_samples, n_features)
        y (array like):         list or ndarray with n_samples containing the target variable
        node_indices (ndarray): List containing the active indices. I.e, the samples being considered in this step.
        feature (int):          Index of feature to split on
   
    Returns:
        information_gain (float): Information gain from the split
    
    """    
    # Split dataset
    left_indices, right_indices = split_dataset(X, node_indices, feature)
    
    # Some useful variables
    X_node, y_node = X[node_indices], y[node_indices]
    X_left, y_left = X[left_indices], y[left_indices]
    X_right, y_right = X[right_indices], y[right_indices]
    
    # You need to return the following variables correctly
    information_gain = 0
    
    ### START CODE HERE ###
    
    # Weights 
    w_left=len(X_left)/len(X_node)
    w_right=len(X_right)/len(X_node)
    HP1_node=compute_entropy(y_node)
    HP1_left=compute_entropy(y_left)
    HP1_right=compute_entropy(y_right)
    #Weighted entropy
     
    #Information gain                                                   
    information_gain = HP1_node-(w_left*HP1_left+w_right*HP1_right)
    ### END CODE HERE ###  
    
    return information_gain 
info_gain0 = compute_information_gain(X_train, y_train, root_indices, feature=0)
print("Information Gain from splitting the root on brown cap: ", info_gain0)
    
info_gain1 = compute_information_gain(X_train, y_train, root_indices, feature=1)
print("Information Gain from splitting the root on tapering stalk shape: ", info_gain1)

info_gain2 = compute_information_gain(X_train, y_train, root_indices, feature=2)
print("Information Gain from splitting the root on solitary: ", info_gain2)

# UNIT TESTS
compute_information_gain_test(compute_information_gain)

# UNQ_C4
# GRADED FUNCTION: get_best_split

def get_best_split(X, y, node_indices):   
    """
    Returns the optimal feature and threshold value
    to split the node data 
    
    Args:
        X (ndarray):            Data matrix of shape(n_samples, n_features)
        y (array like):         list or ndarray with n_samples containing the target variable
        node_indices (ndarray): List containing the active indices. I.e, the samples being considered in this step.

    Returns:
        best_feature (int):     The index of the best feature to split
    """    
    
    # Some useful variables
    num_features = X.shape[1]
    best_i=[]
    # You need to return the following variables correctly
    best_feature = -1
    
    ### START CODE HERE ###
    for i in range(num_features):
        best_i.append(compute_information_gain(X,y,node_indices,i)) 
    best_feature=best_i.index(max(best_i))
        
    ### END CODE HERE ##    
   
    return best_feature
best_feature = get_best_split(X_train, y_train, root_indices)
print("Best feature to split on: %d" % best_feature)

# UNIT TESTS (the code above passes the tests)
# get_best_split_test(get_best_split)
# Not graded
tree = []

def build_tree_recursive(X, y, node_indices, branch_name, max_depth, current_depth):
    """
    Build a tree using the recursive algorithm that split the dataset into 2 subgroups at each node.
    This function just prints the tree.
    
    Args:
        X (ndarray):            Data matrix of shape(n_samples, n_features)
        y (array like):         list or ndarray with n_samples containing the target variable
        node_indices (ndarray): List containing the active indices. I.e, the samples being considered in this step.
        branch_name (string):   Name of the branch. ['Root', 'Left', 'Right']
        max_depth (int):        Max depth of the resulting tree. 
        current_depth (int):    Current depth. Parameter used during recursive call.
   
    """ 

    # Maximum depth reached - stop splitting
    if current_depth == max_depth:
        formatting = " "*current_depth + "-"*current_depth
        print(formatting, "%s leaf node with indices" % branch_name, node_indices)
        return  # leaf reached: exit the function
   
    # Otherwise, get best split and split the data
    # Get the best feature and threshold at this node
    best_feature = get_best_split(X, y, node_indices) 
    tree.append((current_depth, branch_name, best_feature, node_indices))
    
    formatting = "-"*current_depth
    print("%s Depth %d, %s: Split on feature: %d" % (formatting, current_depth, branch_name, best_feature))
    
    # Split the dataset at the best feature
    left_indices, right_indices = split_dataset(X, node_indices, best_feature)
    
    # continue splitting the left and the right child. Increment current depth
    build_tree_recursive(X, y, left_indices, "Left", max_depth, current_depth+1)
    build_tree_recursive(X, y, right_indices, "Right", max_depth, current_depth+1)
build_tree_recursive(X_train, y_train, root_indices, "Root", max_depth=2, current_depth=0)

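K-means clustering

1 - Implementing K-means
The K-means algorithm is a method to automatically cluster similar data points together.
Concretely, you are given a training set $\{x^{(1)}, ..., x^{(m)}\}$, and you want to group the data into a few cohesive "clusters".
K-means is an iterative procedure that
starts by guessing the initial centroids at random, and then refines this guess by
repeatedly assigning examples to their closest centroids, and then
recomputing the centroids based on the assignments.
In pseudocode, the K-means algorithm is as follows: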
# Initialize centroids
# K is the number of clusters
centroids = kMeans_init_centroids(X, K)

for iter in range(iterations):
    # Cluster assignment step:
    # Assign each data point to the closest centroid.
    # idx[i] corresponds to the index of the centroid
    # assigned to example i
    idx = find_closest_centroids(X, centroids)

    # Move centroid step: 
    # Compute means based on centroid assignments
    centroids = compute_means(X, idx, K)

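The inner loop of the algorithm repeatedly carries out two steps:
(i) assigning each training example $x^{(i)}$ to its closest centroid, and
(ii) recomputing the mean of each centroid using the points assigned to it.
The $K$-means algorithm will always converge to some final set of means for the centroids.
However, the converged solution may not always be ideal: it can be a local optimum rather than the global one, and it depends on the initial setting of the centroids.
Therefore, in practice the K-means algorithm is usually run a few times with different random initializations.
One way to choose between the solutions obtained from different random initializations is to pick the one with the lowest cost function value (distortion).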
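The restart strategy just described can be sketched as follows. This is a minimal illustration, not the lab's code: the helper names (assign, distortion, kmeans_once, kmeans_best_of), the synthetic data and the choice to keep the old centroid when a cluster is empty are all assumptions made for this example.

```python
import numpy as np

def assign(X, centroids):
    # Squared distance from every point to every centroid, shape (m, K)
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def distortion(X, centroids, idx):
    # Mean squared distance between each point and its assigned centroid
    return float(np.mean(np.sum((X - centroids[idx]) ** 2, axis=1)))

def kmeans_once(X, K, rng, iters=20):
    # Random initialization: K distinct training examples
    centroids = X[rng.permutation(X.shape[0])[:K]].astype(float)
    for _ in range(iters):
        idx = assign(X, centroids)
        for k in range(K):
            pts = X[idx == k]
            if len(pts) > 0:          # assumption: keep the old centroid if a cluster is empty
                centroids[k] = pts.mean(axis=0)
    idx = assign(X, centroids)
    return centroids, idx, distortion(X, centroids, idx)

def kmeans_best_of(X, K, n_restarts=10, seed=0):
    # Run several random initializations and keep the lowest-distortion result
    rng = np.random.default_rng(seed)
    runs = [kmeans_once(X, K, rng) for _ in range(n_restarts)]
    return min(runs, key=lambda r: r[2])

rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(c, 0.3, size=(30, 2)) for c in ([0, 0], [4, 4], [0, 4])])
best_centroids, best_idx, best_cost = kmeans_best_of(X_demo, K=3)
print(best_centroids.shape, round(best_cost, 3))
```

Comparing runs by distortion only makes sense for a fixed K, since the distortion generally decreases as K grows.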
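1.1 Finding the closest centroids
In the "cluster assignment" phase of the K-means algorithm, the algorithm assigns every training example $x^{(i)}$ to its closest centroid, given the current positions of the centroids.
Exercise 1
Your task is to complete the code in find_closest_centroids.
This function takes the data matrix X and the locations of all centroids inside centroids.
It should output a one-dimensional array idx (with the same number of elements as X) that holds the index of the closest centroid for every training example (a value in $\{1,...,K\}$ in the course notation, i.e. $0$ to $K-1$ in the zero-indexed Python code, where $K$ is the total number of centroids).
Specifically, for every example $x^{(i)}$ we set $$c^{(i)} := j \quad \mathrm{that \; minimizes} \quad ||x^{(i)} - \mu_j||^2,$$ where
$c^{(i)}$ is the index of the centroid that is closest to $x^{(i)}$ (corresponds to idx[i] in the starter code), and
$\mu_j$ is the position (value) of the $j$'th centroid (stored in centroids in the starter code).
If you get stuck, you can check out the hints presented after the cell below to help you with the implementation.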
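Here load_data in utils differs from the one in the earlier files; a relative import via from is awkward, so I simply renamed the file.
There are m examples X[i], each a point with two coordinates, and the randomly generated centroids are also two-dimensional, so the distance here is the distance in 2D coordinates.
Error: "index 2 is out of bounds for axis 1 with size 2" was caused by a column index overflow in the line jvli=(X[i,0]-centroids[j,0])**2+(X[i,1]-centroids[j,1])**2; column indices also start at 0, and print(X[1,0]) is a quick way to check the data.
Why does initializing jvliz0=[] once and then calling jvliz0.append(jvli) give wrong results? Inside the double loop, append keeps adding to the end of the list, so the list meant to hold the distances from one X[i] to the K centroids ends up holding the distances for every example seen so far. Resetting it with jvliz0=[] at the end of each outer iteration fixes this.
The following code is recommended: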

# UNQ_C1
# GRADED FUNCTION: find_closest_centroids
def find_closest_centroids(X, centroids):
    """
    Computes the centroid memberships for every example
    
    Args:
        X (ndarray): (m, n) Input values      
        centroids (ndarray): k centroids
    
    Returns:
        idx (array_like): (m,) closest centroids
    
    """

    # Set K ，K is total number of centroids
    K = centroids.shape[0]
    m,n=X.shape
    # You need to return the following variables correctly
    idx = np.zeros(m, dtype=int)
    jvliz0=[]
    # jvliz0=np.zeros(K, dtype=int)
    ### START CODE HERE ###
    #计算每个Xi到K个质心的距离，寻找最短的那个质心
    for i in range(m):
        for j in range(K):
            jvli=float((X[i,0]-centroids[j,0])**2+(X[i,1]-centroids[j,1])**2)
            # jvli = np.linalg.norm(X[i] - centroids[j]) 
            jvliz0.append(jvli)
            # jvliz0[j]=jvli
            # print(jvliz0)
    
        idx[i]=np.argmin(jvliz0)
        jvliz0=[] #关键
    ### END CODE HERE ###
    
    return idx
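
The loop above reads only columns 0 and 1 of X, so it assumes two features. A minimal vectorized sketch that works for any number of columns, using NumPy broadcasting, could look like this (the name find_closest_centroids_vec and the demo arrays are hypothetical, not part of the lab):

```python
import numpy as np

def find_closest_centroids_vec(X, centroids):
    """Return, for each row of X, the index of the nearest centroid."""
    # (m, 1, n) - (1, K, n) broadcasts to (m, K, n)
    diff = X[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)   # squared Euclidean distances, shape (m, K)
    return np.argmin(sq_dist, axis=1)     # shape (m,)

X_demo = np.array([[1.0, 1.0], [5.0, 2.5], [8.0, 6.0]])
centroids_demo = np.array([[3, 3], [6, 2], [8, 5]])
print(find_closest_centroids_vec(X_demo, centroids_demo))  # [0 1 2]
```

Minimizing the squared distance gives the same argmin as minimizing the distance itself, so no square root is needed.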

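1.2 Computing centroid means
Given the assignment of every point to a centroid, the second phase of the algorithm recomputes, for each centroid, the mean of the points that were assigned to it and stores that mean as the new centroid.
Exercise 2
Please complete compute_centroids below to recompute the value of each centroid.
Specifically, for every centroid $\mu_k$ we set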

\mu_k = \frac{1}{|C_k|} \sum_{i \in C_k} x^{(i)}

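Here $C_k$ is the set of examples that are assigned to centroid $k$
$|C_k|$ is the number of examples in the set $C_k$
Concretely, if two examples, say $x^{(3)}$ and $x^{(5)}$, are assigned to centroid $k=2$, then you should update $\mu_2 = \frac{1}{2}(x^{(3)}+x^{(5)})$.
A way to create and use differently named variables inside a Python loop: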

K=10
for i  in range(K):
    exec(f'X_cz{i}=[1,2]')
print(X_cz0) #[1,2]

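In the code "# Find closest centroids using initial_centroids" / idx = find_closest_centroids(X, initial_centroids), idx is the cluster assignment of X, so X[idx==k] selects the rows of X assigned to centroid k. This is the same boolean-mask indexing as plt.scatter(X[y==1,0], X[y==1,1]) earlier: the m examples are stored as rows, y==1 keeps the rows of X whose label is 1, ,0 takes the first column (the x-axis data) and ,1 takes the second column (the y-axis data).
Notes on numpy mean():
mean() computes the average; the parameter used most often is axis. For an m×n matrix:
axis not set: averages all m×n values and returns a scalar
axis = 0: collapses the rows, averaging each column, and returns n values (shape (n,), or 1×n with keepdims=True)
axis = 1: collapses the columns, averaging each row, and returns m values (shape (m,), or m×1 with keepdims=True)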
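A small sketch of the boolean-mask indexing and the axis behaviour described above; the arrays X_demo and idx_demo are made-up values for illustration only.

```python
import numpy as np

X_demo = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 9.0]])
idx_demo = np.array([0, 1, 0])          # hypothetical cluster assignments

points = X_demo[idx_demo == 0]          # keeps rows 0 and 2
print(points.shape)                     # (2, 2)

print(np.mean(X_demo))                  # all six values: 4.0
print(np.mean(X_demo, axis=0))          # column means: 3.0 and 5.0
print(np.mean(X_demo, axis=1))          # row means: 1.5, 3.5 and 7.0
print(np.mean(points, axis=0))          # centroid of cluster 0: 3.0 and 5.5
```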
Helper function that recomputes the centroids:

def compute_centroids(X, idx, K):
    """
    Returns the new centroids by computing the means of the 
    data points assigned to each centroid.
    
    Args:
        X (ndarray):   (m, n) Data points
        idx (ndarray): (m,) Array containing index of closest centroid for each 
                       example in X. Concretely, idx[i] contains the index of 
                       the centroid closest to example i
        K (int):       number of centroids
    
    Returns:
        centroids (ndarray): (K, n) New centroids computed
    """
    
    # Useful variables
    m, n = X.shape
    
    # You need to return the following variables correctly
    centroids = np.zeros((K, n))
    # for i  in range(K):
    #     exec(f'X_cz{i}=[]')
    ### START CODE HERE ###
    # for i in range(m):
    #     for j in range(K):
    #         if idx[i]==j:
    #             exec(f'X_cz{j}.append(X[i])')
    for k in range(K):
        points=X[idx==k]
        centroids[k] = np.mean(points, axis = 0)
    ### END CODE HERE ## 
    return centroids

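2 - K-means on a sample dataset
After you have completed the two functions above (find_closest_centroids and compute_centroids), the next step is to run the K-means algorithm on a toy 2D dataset to help you understand how K-means works.
We encourage you to take a look at the function below (run_kMeans) to understand how it works.
Notice that the code calls the two functions you implemented in a loop.
When you run the code below, it will produce a visualization that steps through the progress of the algorithm at each iteration.
A pitfall when comparing NumPy arrays: arr.all() reduces a whole array to a single boolean (True if every element is nonzero), so it says nothing about whether two arrays are equal element by element.
Therefore the statement if centroidsnow.all()==centroidsold.all() can evaluate to True even when the old and new centroids differ, because it only compares those two booleans. To compare two multi-element arrays, use any() on their difference: any() returns False only if every element is False (zero) and returns True as soon as one element is True (nonzero), so (centroidsnow-centroidsold).any() is True whenever at least one entry differs. Here we test for False to detect convergence.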

if (centroidsnow-centroidsold).any()==False:
    print(f"共迭代{i+1}次收敛")
    break
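
As an alternative to testing the difference with any(), NumPy provides np.array_equal (exact equality) and np.allclose (equality within a tolerance). The sketch below uses made-up centroid arrays; the tolerance value is an assumption, not something taken from the lab.

```python
import numpy as np

old = np.array([[3.0, 3.0], [6.0, 2.0]])
new_same = old.copy()
new_moved = np.array([[3.0, 3.0], [6.0, 2.5]])

# The pitfall: .all() only asks "is every element nonzero?"
print(old.all() == new_moved.all())             # True, although the arrays differ

# Element-wise comparisons
print(np.array_equal(old, new_same))            # True
print(np.array_equal(old, new_moved))           # False
print(np.allclose(old, new_moved, atol=1e-8))   # False
```

With floating-point centroids, np.allclose is often the more forgiving stopping test, because it treats tiny numerical differences as convergence.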

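3 - Random initialization
The initial assignments of centroids for the example dataset were designed so that you would see the same figure as in Figure 1. In practice, a good strategy for initializing the centroids is to select random examples from the training set.
In this part of the exercise, you should understand how the function kMeans_init_centroids is implemented.
The code first randomly shuffles the indices of the examples (using np.random.permutation()).
Then it selects the first $K$ examples based on the random permutation of the indices.
This allows the examples to be selected at random without the risk of selecting the same example twice.
Note: you do not need to implement anything for this part of the exercise.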
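4 - Image compression with K-means
In this exercise, you will apply K-means to image compression.
In a straightforward 24-bit color representation of an image, each pixel is represented as three 8-bit unsigned integers (ranging from 0 to 255) that specify the red, green and blue intensity values. This encoding is often referred to as the RGB encoding.
Our image contains thousands of colors, and in this part of the exercise you will reduce the number of colors to 16.
By making this reduction, it is possible to represent (compress) the photo in an efficient way.
Specifically, you only need to store the RGB values of the 16 selected colors, and for each pixel in the image you now need to store only the index of the color at that location (only 4 bits are necessary to represent 16 possibilities).
In this part, you will use the K-means algorithm to select the 16 colors that will be used to represent the compressed image.
Concretely, you will treat every pixel in the original image as a data example and use the K-means algorithm to find the 16 colors that best group (cluster) the pixels in the three-dimensional RGB space.
Once you have computed the cluster centroids on the image, you will then use the 16 colors to replace the pixels in the original image.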
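The variable original_img that stores the image is a three-dimensional matrix:

the first two indices identify a pixel position
the third index represents red, green, or blue
For example, original_img[50, 33, 2] gives the blue intensity of the pixel at row 50 and column 33.
Processing data
To call run_kMeans, you first need to transform the matrix original_img into a two-dimensional matrix.
The code below reshapes the matrix original_img to create an $m \times 3$ matrix of pixel colors (where $m=16384 = 128\times128$).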
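Line 40 of find_closest_centroids raised ValueError: cannot convert float NaN to integer when processing the image; adding an int conversion did not help, and neither did another approach I tried.
The distance at that point comes from the formula on the preceding line. Switching the formula to np.linalg.norm(X[i] - centroids[j]) raised no error; since np.linalg.norm returns a float, I changed the formula to jvli=float((X[i,0]-centroids[j,0])**2+(X[i,1]-centroids[j,1])**2), which also ran without error. The lesson is that an algorithm that works on a simple example may still fail on a more complex one. That expression has a further problem: it assumes X has only two columns, so it handles data with three or more columns, such as images, poorly, whereas np.linalg.norm(X[i] - centroids[j]) uses every column.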
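4.3 Compressing the image
After finding the top $K=16$ colors to represent the image, you can now assign each pixel position to its closest centroid using the find_closest_centroids function.
This allows you to represent the original image using the centroid assignments of each pixel.
Notice that you have significantly reduced the number of bits required to describe the image.
The original image required 24 bits for each of the $128\times128$ pixel locations, resulting in a total size of $128 \times 128 \times 24 = 393,216$ bits.
The new representation requires some overhead storage in the form of a dictionary of 16 colors, each of which requires 24 bits (8 bits for each of the three RGB channels), but the image itself then requires only 4 bits per pixel location, because $2^4=16$ colors.
The final number of bits used is therefore $16 \times 24 + 128 \times 128 \times 4 = 65,920$ bits, which corresponds to compressing the original image by about a factor of 6.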
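The bit counts above can be checked with a few lines of arithmetic. The variable names are chosen for this sketch only.

```python
import math

height, width = 128, 128
bits_per_channel, channels = 8, 3
K = 16

original_bits = height * width * bits_per_channel * channels
index_bits = math.ceil(math.log2(K))                 # bits needed per pixel index
palette_bits = K * bits_per_channel * channels       # the 16-color dictionary
compressed_bits = palette_bits + height * width * index_bits

print(original_bits)                     # 393216
print(compressed_bits)                   # 65920
print(original_bits / compressed_bits)   # about 5.97
```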
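The image is first reshaped to m*3 because the 3 stands for the RGB channels: each row holds the three color values of one pixel, so K-means can cluster the rows and replace every pixel by one of 16 colors. m is the total number of pixels, so each row corresponds to one pixel. The original image has the three-dimensional shape (128, 128, 3).
In my tests, relative paths behave differently in VS Code and PyCharm; the VS Code project path appears to be the problem, since the paths resolve correctly in PyCharm.
My run_kMeans helper had a bug that caused an index-out-of-range error when processing the image: the variable name in its return statement had not been updated after the variable was renamed earlier in the function. After the fix, the function converged after 86 iterations in my test. The complete code is: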

import numpy as np
import matplotlib.pyplot as plt
from utils2 import *

# %matplotlib inline
X=load_data()
#test
# X1=X[:5]
# print(X1[1])
# print(X[1,0])

# UNQ_C1
# GRADED FUNCTION: find_closest_centroids
def find_closest_centroids(X, centroids):
    """
    Computes the centroid memberships for every example
    
    Args:
        X (ndarray): (m, n) Input values      
        centroids (ndarray): k centroids
    
    Returns:
        idx (array_like): (m,) closest centroids
    
    """

    # Set K ，K is total number of centroids
    K = centroids.shape[0]
    m,n=X.shape
    # You need to return the following variables correctly
    idx = np.zeros(m, dtype=int)
    # jvliz0=[]
    # jvliz0=np.zeros(K, dtype=int)
    ### START CODE HERE ###
    #计算每个Xi到K个质心的距离，寻找最短的那个质心
    for i in range(m):
        jvliz0=[] #关键
        for j in range(K):
            # jvli=float((X[i,0]-centroids[j,0])**2+(X[i,1]-centroids[j,1])**2)
            jvli = np.linalg.norm(X[i] - centroids[j]) 
            #jvli=np.sum(np.power((centroids-X[i]),2),1)
            jvliz0.append(jvli)
            # jvliz0[j]=jvli
            # print(jvliz0)
    
        idx[i]=np.argmin(jvliz0)

    ### END CODE HERE ###
    
    return idx
# Select an initial set of centroids (3 Centroids)
initial_centroids = np.array([[3,3], [6,2], [8,5]])

# Find closest centroids using initial_centroids
idx = find_closest_centroids(X, initial_centroids)

# Print closest centroids for the first three elements
print("First three elements in idx are:", idx[:3])
# UNIT TEST
from public_tests2 import *

find_closest_centroids_test(find_closest_centroids)
# K=10
# for i  in range(K):
#     exec(f'X_cz{i}=[1,2]')
# print(X_cz0)
# UNQ_C2
# GRADED FUNCTION: compute_centroids
# points=X[idx==0]
# print(points)
def compute_centroids(X, idx, K):
    """
    Returns the new centroids by computing the means of the 
    data points assigned to each centroid.
    
    Args:
        X (ndarray):   (m, n) Data points
        idx (ndarray): (m,) Array containing index of closest centroid for each 
                       example in X. Concretely, idx[i] contains the index of 
                       the centroid closest to example i
        K (int):       number of centroids
    
    Returns:
        centroids (ndarray): (K, n) New centroids computed
    """
    
    # Useful variables
    m, n = X.shape
    
    # You need to return the following variables correctly
    centroids = np.zeros((K, n))
    # for i  in range(K):
    #     exec(f'X_cz{i}=[]')
    ### START CODE HERE ###
    # for i in range(m):
    #     for j in range(K):
    #         if idx[i]==j:
    #             exec(f'X_cz{j}.append(X[i])')
    for k in range(K):
        points=X[idx==k]
        centroids[k] = np.mean(points, axis = 0)
    ### END CODE HERE ## 
    return centroids
K = 3
centroids = compute_centroids(X, idx, K)
print("The centroids are:", centroids)
# You do not need to implement anything for this part

def run_kMeans(X, initial_centroids, max_iters=10, plot_progress=False):
    """
    Runs the K-Means algorithm on data matrix X, where each row of X
    is a single example
    """
    
    # Initialize values
    m,n=X.shape
    K=initial_centroids.shape[0]
    centroidsnow=initial_centroids
    centroidsold=centroidsnow

    idx = np.zeros(m)
    
    # Run K-Means
    for i in range(max_iters):
        print(f"当前迭代次数{i+1}，最大次数{max_iters},预计还剩{max_iters-i-1}次")
        idx=find_closest_centroids(X, centroidsnow)
        centroidsold=centroidsnow
        centroidsnow = compute_centroids(X, idx, K)
        #打印比较质心位置以判断收敛情况
        print(centroidsnow==centroidsold)

         # Optionally plot progress
        if plot_progress:
            plot_progress_kMeans(X, centroidsnow, centroidsold, idx, K, i+1)


        if (centroidsnow-centroidsold).any()==False:
            print(f"共迭代{i+1}次收敛")
            break

        
     

        # Initialize values
    # m, n = X.shape
    # K = initial_centroids.shape[0]
    # centroids = initial_centroids
    # previous_centroids = centroids
    # idx = np.zeros(m)
    #
    # # Run K-Means
    # for i in range(max_iters):
    #
    #     #Output progress
    #     print("K-Means iteration %d/%d" % (i, max_iters-1))
    #
    #     # For each example in X, assign it to the closest centroid
    #     idx = find_closest_centroids(X, centroids)
    #
    #     # Optionally plot progress
    #     if plot_progress:
    #         plot_progress_kMeans(X, centroids, previous_centroids, idx, K, i)
    #         previous_centroids = centroids
    #
    #     # Given the memberships, compute new centroids
    #     centroids = compute_centroids(X, idx, K)
    plt.show()
    return centroidsnow, idx
# Load an example dataset
X = load_data()

# Set initial centroids
initial_centroids = np.array([[3,3],[6,2],[8,5]])
#K = 3

# Number of iterations
max_iters = 10

centroids, idx = run_kMeans(X, initial_centroids, max_iters, plot_progress=True)

# You do not need to modify this part
# 这个随机算法是先把X打乱然后把前K个X直接作为初试质心
def kMeans_init_centroids(X, K):
    """
    This function initializes K centroids that are to be 
    used in K-Means on the dataset X
    
    Args:
        X (ndarray): Data points 
        K (int):     number of centroids/clusters
    
    Returns:
        centroids (ndarray): Initialized centroids
    """
    
    # Randomly reorder the indices of examples
    randidx = np.random.permutation(X.shape[0])
    
    # Take the first K examples as centroids
    centroids = X[randidx[:K]]
    
    return centroids
# Load an image of a bird这一步只是加载而已
original_img = plt.imread('./data/bird_small.png')
# 可视化图像
#您可以使用下面的代码可视化刚刚加载的图像
plt.imshow(original_img)
"""
检查变量的维度
与往常一样，您将打印出变量的形状以更熟悉数据。
"""
print("Shape of original_img is:", original_img.shape)
# Divide by 255 so that all values are in the range 0 - 1
original_img = original_img / 255
# Reshape the image into an m x 3 matrix where m = number of pixels
# (in this case m = 128 x 128 = 16384)
# Each row will contain the Red, Green and Blue pixel values
# This gives us our dataset matrix X_img that we will use K-Means on.
# X_img大小为m*3 *3是因为R G B三个通道  
X_img = np.reshape(original_img, (original_img.shape[0] * original_img.shape[1], 3))
# Run your K-Means algorithm on this data
# You should try different values of K and max_iters here
K = 16                       
max_iters = 100
# Using the function you have implemented above. 
initial_centroids_t=kMeans_init_centroids(X_img,K)
# Run K-Means - this takes a couple of minutes
centroids_t,idx_t=run_kMeans(X_img,initial_centroids_t,max_iters)
# idx_t=find_closest_centroids(X_img,centroids_t)
print("Shape of idx_t:", idx_t.shape)
print("Closest centroid for the first five elements:", idx_t[:5])
# Represent image in terms of indices
X_recovered = centroids_t[idx_t,:] 
# Reshape recovered image into proper dimensions
X_recovered = np.reshape(X_recovered, original_img.shape) 
# Display original image
fig, ax = plt.subplots(1,2, figsize=(8,8))
plt.axis('off')

ax[0].imshow(original_img*255)
ax[0].set_title('Original')
ax[0].set_axis_off()



# Display compressed image
ax[1].imshow(X_recovered*255)
ax[1].set_title('Compressed with %d colours'%K)
ax[1].set_axis_off()
plt.show()

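Anomaly detection vs. supervised learning

Anomaly detection covers cases such as detecting a brand-new failure mode of an aircraft engine that has never appeared in your dataset. Supervised learning is the better fit when you have enough positive examples for the algorithm to learn what positive examples look like: with supervised learning we tend to assume that future positive examples are likely to be similar to the ones in the training set, rather than a completely new kind of fault (positive example).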

Collaborative filtering recommender systems

This is somewhat like the earlier linear regression for predicting house prices: here we predict the rating a user gives a movie from the user's preferences w and the movie's features x^i.
Notation:
| General Notation | Description | Python (if any) |
|:-------------|:------------------------------------------------------------|:--|
| $r(i,j)$ | scalar; = 1 if user j rated movie i, = 0 otherwise | |
| $y(i,j)$ | scalar; = rating given by user j on movie i (defined only if r(i,j) = 1) | |
| $\mathbf{w}^{(j)}$ | vector; parameters for user j | |
| $b^{(j)}$ | scalar; parameter for user j | |
| $\mathbf{x}^{(i)}$ | vector; feature ratings for movie i | |
| $n_u$ | number of users |num_users|
| $n_m$ | number of movies | num_movies |
| $n$ | number of features | num_features |
| $\mathbf{X}$ | matrix of vectors $\mathbf{x}^{(i)}$ | X |
| $\mathbf{W}$ | matrix of vectors $\mathbf{w}^{(j)}$ | W |
| $\mathbf{b}$ | vector of bias parameters $b^{(j)}$ | b |
| $\mathbf{R}$ | matrix of elements $r(i,j)$ | R |
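2 - Recommender systems
In this lab, you will implement the collaborative filtering learning algorithm and apply it to a dataset of movie ratings. The goal of a collaborative filtering recommender system is to generate two vectors: for each user, a "parameter vector" that embodies the movie tastes of the user; for each movie, a feature vector of the same size that embodies some description of the movie. The dot product of the two vectors plus the bias term should produce an estimate of the rating the user might give to that movie.
The figure in the lab details how these vectors are learned.
Existing ratings are provided in matrix form, as shown in that figure.
Y contains the ratings, from 0.5 to 5 inclusive in 0.5 increments, and 0 if the movie has not been rated.
R has a 1 where a movie has been rated. Movies are in rows, users in columns. Each user has a parameter vector $w^{user}$ and a bias. Each movie has a feature vector $x^{movie}$. These vectors are learned simultaneously by using the existing user/movie ratings as training data. One training example from the figure: $\mathbf{w}^{(1)} \cdot \mathbf{x}^{(1)} + b^{(1)} = 4$.
It is worth noting that the feature vector $x^{movie}$ must satisfy all the users while the user vector $w^{user}$ must satisfy all the movies; in particular, the dimension of $w^{user}$ equals the number of features in $x^{movie}$, so that the dot product is defined. This is the source of the name of this approach: all the users collaborate to generate the rating set.
Once the feature vectors and parameters are learned, they can be used to predict how a user might rate an unrated movie, as shown in the figure. The equation is an example of predicting a rating for user one on movie zero.
In this exercise, you will implement the function that computes the collaborative filtering objective function. After implementing the objective function, you will use a TensorFlow custom training loop to learn the parameters for collaborative filtering. The first step is to detail the data set and data structures that will be used in the lab.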

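3 - Movie ratings dataset
The data set is derived from the MovieLens "ml-latest-small" dataset.
[F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. https://doi.org/10.1145/2827872]

The original dataset has 9000 movies rated by 600 users. The dataset has been reduced in size to focus on movies from the years since 2000. It consists of ratings on a scale of 0.5 to 5 in 0.5 step increments. The reduced dataset has $n_u = 443$ users and $n_m= 4778$ movies.
Below, you will load the movie dataset into the variables $Y$ and $R$.
The matrix $Y$ (an $n_m \times n_u$ matrix) stores the ratings $y^{(i,j)}$. The matrix $R$ is a binary-valued indicator matrix, where $R(i,j) = 1$ if user $j$ gave a rating to movie $i$, and $R(i,j)=0$ otherwise.
Throughout this part of the exercise, you will also be working with the matrices $\mathbf{X}$, $\mathbf{W}$ and $\mathbf{b}$: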

\mathbf{X} = \begin{bmatrix} --- (\mathbf{x}^{(0)})^T --- \\ --- (\mathbf{x}^{(1)})^T --- \\ \vdots \\ --- (\mathbf{x}^{(n_m-1)})^T --- \\ \end{bmatrix} , \quad \mathbf{W} = \begin{bmatrix} --- (\mathbf{w}^{(0)})^T --- \\ --- (\mathbf{w}^{(1)})^T --- \\ \vdots \\ --- (\mathbf{w}^{(n_u-1)})^T --- \\ \end{bmatrix},\quad \mathbf{ b} = \begin{bmatrix} b^{(0)} \\ b^{(1)} \\ \vdots \\ b^{(n_u-1)} \\ \end{bmatrix}\quad

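We will start by loading the movie ratings dataset to understand the structure of the data. We will load Y and R with the movie dataset.
We will also load X, W and b with pre-computed values. These values will be learned later in the lab, but we will use the pre-computed values to develop the cost model.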

#  From the matrix, we can compute statistics like average rating.
a=Y[0, R[0, :].astype(bool)]
print(a)
tsmean =  np.mean(a)

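The statement above is hard to read at first, so take it apart from the inside out:
What does .astype(bool) do?
As the name suggests, it casts the array to boolean type, i.e. True or False.
R has as many columns as W has rows, namely the number of users, and it contains only 0/1, so R marks whether a user has rated a movie. R[0, :].astype(bool) contains exactly 5 True values, and a ends up with 5 elements as well.
An np.array has a shape attribute but a Python list does not; convert the list to an array first and then read shape: list_shape = np.array(list_01).shape
If R[0, :].astype(bool) were all False, a would be an empty array and np.mean(a) would return nan. So Y[0, R[0, :].astype(bool)] takes row 0 of Y and keeps, out of the 443 columns, only those where R[0, :].astype(bool) is True, i.e. the ratings given to the first movie (index 0) by the users who rated it.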
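The masking step can be tried on a tiny example. Y_demo and R_demo below are invented 2-movie by 4-user matrices, not the lab's data.

```python
import numpy as np

# Hypothetical 2-movie x 4-user example (0 means "not rated")
Y_demo = np.array([[5.0, 0.0, 3.0, 0.0],
                   [0.0, 0.0, 0.0, 0.0]])
R_demo = np.array([[1, 0, 1, 0],
                   [0, 0, 0, 0]])

mask = R_demo[0, :].astype(bool)     # True for users 0 and 2
rated = Y_demo[0, mask]              # ratings actually given to movie 0
print(rated, rated.mean())           # the two ratings 5 and 3, mean 4.0

empty = Y_demo[1, R_demo[1, :].astype(bool)]
print(empty.shape)                   # (0,): no user rated movie 1
```

Averaging only the masked entries is what keeps the unrated zeros from pulling the mean down.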
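4 - Collaborative filtering learning algorithm
Now, you will begin implementing the collaborative filtering learning algorithm. You will start by implementing the objective function.

The collaborative filtering algorithm in the setting of movie recommendations considers a set of $n$-dimensional parameter vectors $\mathbf{x}^{(0)},...,\mathbf{x}^{(n_m-1)}$, $\mathbf{w}^{(0)},...,\mathbf{w}^{(n_u-1)}$ and $b^{(0)},...,b^{(n_u-1)}$, where the model predicts the rating for movie $i$ by user $j$ as $y^{(i,j)} = \mathbf{w}^{(j)}\cdot \mathbf{x}^{(i)} + b^{(j)}$. Given a dataset that consists of a set of ratings produced by some users on some movies, you wish to learn the parameter vectors $\mathbf{x}^{(0)},...,\mathbf{x}^{(n_m-1)},\mathbf{w}^{(0)},...,\mathbf{w}^{(n_u-1)}$ and $b^{(0)},...,b^{(n_u-1)}$ that produce the best fit (minimize the squared error). You will complete the code in cofiCostFunc (cofi_cost_func in the code below) to compute the cost function for collaborative filtering.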
4.1 Collaborative filtering cost function
The collaborative filtering cost function is given by

J({\mathbf{x}^{(0)},...,\mathbf{x}^{(n_m-1)},\mathbf{w}^{(0)},b^{(0)},...,\mathbf{w}^{(n_u-1)},b^{(n_u-1)}})= \frac{1}{2}\sum_{(i,j):r(i,j)=1}(\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)} - y^{(i,j)})^2 +\underbrace{ \frac{\lambda}{2} \sum_{j=0}^{n_u-1}\sum_{k=0}^{n-1}(\mathbf{w}^{(j)}_k)^2 + \frac{\lambda}{2}\sum_{i=0}^{n_m-1}\sum_{k=0}^{n-1}(\mathbf{x}_k^{(i)})^2 }_{regularization} \tag{1}

The first summation in (1) is "for all $i$, $j$ where $r(i,j)$ equals $1$" and could be written:

= \frac{1}{2}\sum_{j=0}^{n_u-1} \sum_{i=0}^{n_m-1}r(i,j)*(\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)} - y^{(i,j)})^2 +\text{regularization}

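You should now write cofiCostFunc (cofi_cost_func in the code below) to return this cost.
Exercise 1
For loop implementation:
Start by implementing the cost function using for loops. Consider developing the cost function in two steps. First, develop the cost function without regularization. A test case that does not include regularization is provided below to test your implementation. Once that is working, add regularization and run the tests that include regularization. Note that you should accumulate the cost for user $j$ and movie $i$ only if $R(i,j)=1$.
Regularization just squares each element of the W array and the X array and sums all the squared values. You can use np.square() and np.sum().
Indexing a numpy.ndarray as R(i) raises an error; square brackets are required, e.g. R[i].
The collaborative filtering cost function in code: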

def cofi_cost_func(X, W, b, Y, R, lambda_):
    """
    Returns the cost for the content-based filtering
    Args:
      X (ndarray (num_movies,num_features)): matrix of item features
      W (ndarray (num_users,num_features)) : matrix of user parameters
      b (ndarray (1, num_users)            : vector of user parameters
      Y (ndarray (num_movies,num_users)    : matrix of user ratings of movies
      R (ndarray (num_movies,num_users)    : matrix, where R(i, j) = 1 if the i-th movies was rated by the j-th user
      lambda_ (float): regularization parameter
    Returns:
      J (float) : Cost
    """
    nm, nu = Y.shape
    J = 0
    # ### START CODE HERE ###
    for j in range(nu):
        Wj=W[j,:]
        #b只能取一个数
        bj=b[0,j]
        for i in range(nm):
            Xi=X[i, :]
            Yij=Y[i, j]
            r=R[i,j]
            J+=np.square(r*(np.dot(Wj,Xi)+bj-Yij))
            
    J += lambda_ * (np.sum(np.square(W)) + np.sum(np.square(X)))
    J=(1/2)*J

    ### END CODE HERE ###

    return J

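Changing r=R[i,j] to R=R[i,j] raises an error: R is the matrix passed in as a parameter, so overwriting it with a scalar breaks the R[i,j] lookup on the next iteration. Take care with parenthesis placement in the np.square call. The regularization here just squares every element of the W and X arrays, so np.square can be applied to them directly.
Vectorized implementation
It is important to create a vectorized implementation to compute $J$, since it will later be called many times during optimization. The linear algebra used is not the focus of this series, so the implementation is provided. If you are an expert in linear algebra, feel free to create your own version without referencing the code below.
Run the code below and verify that it produces the same results as the non-vectorized version.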

def cofi_cost_func_v(X, W, b, Y, R, lambda_):
    """
    Returns the cost for the content-based filtering
    Vectorized for speed. Uses tensorflow operations to be compatible with custom training loop.
    Args:
      X (ndarray (num_movies,num_features)): matrix of item features
      W (ndarray (num_users,num_features)) : matrix of user parameters
      b (ndarray (1, num_users)            : vector of user parameters
      Y (ndarray (num_movies,num_users)    : matrix of user ratings of movies
      R (ndarray (num_movies,num_users)    : matrix, where R(i, j) = 1 if the i-th movies was rated by the j-th user
      lambda_ (float): regularization parameter
    Returns:
      J (float) : Cost
    """
    j = (tf.linalg.matmul(X, tf.transpose(W)) + b - Y)*R
    J = 0.5 * tf.reduce_sum(j**2) + (lambda_/2) * (tf.reduce_sum(X**2) + tf.reduce_sum(W**2))
    return J

Vectorization speeds the computation up because it replaces the Python for loops with array operations.
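
To check that the loop and vectorized forms agree without needing TensorFlow, here is a NumPy-only sketch. The functions cost_loop and cost_vec and the random test data are assumptions made for this illustration; they mirror the two functions above but are not the lab's code.

```python
import numpy as np

def cost_loop(X, W, b, Y, R, lambda_):
    # Double loop over users j and movies i, counting only rated entries
    nm, nu = Y.shape
    J = 0.0
    for j in range(nu):
        for i in range(nm):
            J += R[i, j] * (np.dot(W[j], X[i]) + b[0, j] - Y[i, j]) ** 2
    J += lambda_ * (np.sum(W ** 2) + np.sum(X ** 2))
    return J / 2

def cost_vec(X, W, b, Y, R, lambda_):
    err = (X @ W.T + b - Y) * R          # (num_movies, num_users); unrated entries zeroed
    return 0.5 * np.sum(err ** 2) + (lambda_ / 2) * (np.sum(X ** 2) + np.sum(W ** 2))

rng = np.random.default_rng(0)
nm, nu, n = 5, 4, 3                      # hypothetical sizes
X_d = rng.normal(size=(nm, n))
W_d = rng.normal(size=(nu, n))
b_d = rng.normal(size=(1, nu))
R_d = rng.integers(0, 2, size=(nm, nu))
Y_d = R_d * rng.integers(1, 6, size=(nm, nu))

same = np.isclose(cost_loop(X_d, W_d, b_d, Y_d, R_d, 1.5),
                  cost_vec(X_d, W_d, b_d, Y_d, R_d, 1.5))
print(same)   # True
```

Because R contains only 0 and 1, multiplying the error by R before squaring gives the same sum as multiplying the squared error by R.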

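Learning movie recommendations

After you have finished implementing the collaborative filtering cost function, you can start training your algorithm to make movie recommendations for yourself.
In the cell below, you can enter your own movie choices. The algorithm will then make recommendations for you! We have filled out some values according to our preferences, but after you have things working with our choices, you should change this to match your tastes. A list of all movies in the dataset is in the file small_movie_list.csv.
In movieList, movieList_df = load_Movie_List_pd(), the first return value is used to print movie titles and the second to print the contents of the csv file, including the index.
The following line stores the indices of the movies I rated, in ascending order: my_rated = [i for i in range(len(my_ratings)) if my_ratings[i] > 0]
The new code that rates movies by index is: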

movieList, movieList_df = load_Movie_List_pd()
# print(movieList_df)

my_ratings = np.zeros(num_movies)          #  Initialize my ratings

# Check the file small_movie_list.csv for id of each movie in our dataset
# For example, Toy Story 3 (2010) has ID 2700, so to rate it "5", you can set
my_ratings[2700] = 5

#Or suppose you did not enjoy Persuasion (2007), you can set
my_ratings[2609] = 2;

# We have selected a few movies we liked / did not like and the ratings we
# gave are as follows:
my_ratings[929]  = 5   # Lord of the Rings: The Return of the King, The
my_ratings[246]  = 5   # Shrek (2001)
my_ratings[2716] = 3   # Inception
my_ratings[1150] = 5   # Incredibles, The (2004)
my_ratings[382]  = 2   # Amelie (Fabuleux destin d'Amélie Poulain, Le)
my_ratings[366]  = 5   # Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001)
my_ratings[622]  = 5   # Harry Potter and the Chamber of Secrets (2002)
my_ratings[988]  = 3   # Eternal Sunshine of the Spotless Mind (2004)
my_ratings[2925] = 1   # Louis Theroux: Law & Disorder (2008)
my_ratings[2937] = 1   # Nothing to Declare (Rien à déclarer)
my_ratings[793]  = 5   # Pirates of the Caribbean: The Curse of the Black Pearl (2003)
#用于存储被我评价的电影对应的序号,并按升序排列
my_rated = [i for i in range(len(my_ratings)) if my_ratings[i] > 0]
print(my_rated)


print('\nNew user ratings:\n')
for i in range(len(my_ratings)):
    if my_ratings[i] > 0 :
        print(f'Rated {my_ratings[i]} for  {movieList_df.loc[i,"title"]}');

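Now add the new ratings above to the Y and R arrays, using np.c_ to join them column-wise. Note that a one-dimensional NumPy array has no row or column orientation; np.c_ treats it as a column.
tf.random.normal draws the requested number of random values from a normal distribution with the given parameters.
The Adam optimizer is also used here: it combines a first moment estimate of the gradient (its mean) with a second moment estimate (its uncentered variance) to compute the update step.
Now, let's train the collaborative filtering model. This will learn the parameters X, W and b.
The operations involved in learning w, b and x simultaneously do not fall into the typical "layers" offered in the TensorFlow neural network package. Consequently, the flow used in Course 2 (Model, Compile(), Fit(), Predict()) is not directly applicable. Instead, we can use a custom training loop.
Recall from earlier labs the steps of gradient descent.
Repeat until convergence:
compute the forward pass
compute the derivatives of the loss relative to the parameters
update the parameters using the learning rate and the computed derivatives
TensorFlow has the ability to calculate the derivatives for you, as shown in the code. Within the tf.GradientTape() section, operations on TensorFlow variables are tracked. When tape.gradient() is later called, it returns the gradient of the loss relative to the tracked variables. The gradients can then be applied to the parameters using an optimizer. This is a very brief introduction to a useful feature of TensorFlow and other machine learning frameworks. Further information can be found by investigating "custom training loops" within the framework of interest.
6 - Recommendations
Below, we compute the ratings for all the movies and users and display the movies that are recommended. These are based on the movies and ratings entered as my_ratings[] above. To predict the rating of movie $i$ for user $j$, you compute $\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)}$. This can be computed for all ratings using matrix multiplication.
In practice, additional information can be utilized to enhance our predictions. In the figure from the lab, the predicted ratings for the first few hundred movies lie in a small range. We can augment the above by selecting from those top movies the ones that have high average ratings and more than 20 ratings. This section uses a Pandas data frame, which has many handy sorting features.
I have run the code once for now but have not fully understood the part after gradient descent; to revisit later (11/3).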

import numpy as np
import tensorflow as tf
from tensorflow import keras
from recsys_utils import *
#Load data
X, W, b, num_movies, num_features, num_users = load_precalc_params_small()
Y, R = load_ratings_small()
print("Y", Y.shape, "R", R.shape)
print("X", X.shape)
print("W", W.shape)
print("b", b.shape)
print("num_features", num_features)
print("num_movies",   num_movies)
print("num_users",    num_users)
#  From the matrix, we can compute statistics like average rating.
# list = []
# for i in range(443):
#     list.append(False)
# print(np.array(list).shape)
a=Y[0, R[0, :].astype(bool)]
# b=R[0, :].astype(bool)

# print(b.shape)
# print('True个数：', np.sum(b!=0))
# print(b)
# print(a)
tsmean = np.mean(a)
print(f"Average rating for movie 1 : {tsmean:0.3f} / 5" )


# GRADED FUNCTION: cofi_cost_func
# UNQ_C1
def cofi_cost_func(X, W, b, Y, R, lambda_):
    """
    Returns the cost for the content-based filtering
    Args:
      X (ndarray (num_movies,num_features)): matrix of item features
      W (ndarray (num_users,num_features)) : matrix of user parameters
      b (ndarray (1, num_users)            : vector of user parameters
      Y (ndarray (num_movies,num_users)    : matrix of user ratings of movies
      R (ndarray (num_movies,num_users)    : matrix, where R(i, j) = 1 if the i-th movies was rated by the j-th user
      lambda_ (float): regularization parameter
    Returns:
      J (float) : Cost
    """
    nm, nu = Y.shape
    J = 0
    # ### START CODE HERE ###
    for j in range(nu):
        Wj=W[j,:]
        #b只能取一个数
        bj=b[0,j]
        for i in range(nm):
            Xi=X[i, :]
            Yij=Y[i, j]
            r=R[i,j]
            J+=np.square(r*(np.dot(Wj,Xi)+bj-Yij))

    J += lambda_ * (np.sum(np.square(W)) + np.sum(np.square(X)))
    J=(1/2)*J

    ### END CODE HERE ###

    return J


# Reduce the data set size so that this runs faster
num_users_r = 4
num_movies_r = 5
num_features_r = 3

X_r = X[:num_movies_r, :num_features_r]
W_r = W[:num_users_r,  :num_features_r]
b_r = b[0, :num_users_r].reshape(1,-1)
Y_r = Y[:num_movies_r, :num_users_r]
R_r = R[:num_movies_r, :num_users_r]

# Evaluate cost function
J = cofi_cost_func(X_r, W_r, b_r, Y_r, R_r, 0);
print(f"Cost: {J:0.2f}")
# Evaluate cost function with regularization
J = cofi_cost_func(X_r, W_r, b_r, Y_r, R_r, 1.5);
print(f"Cost (with regularization): {J:0.2f}")
def cofi_cost_func_v(X, W, b, Y, R, lambda_):
    """
    Returns the cost for the content-based filtering
    Vectorized for speed. Uses tensorflow operations to be compatible with custom training loop.
    Args:
      X (ndarray (num_movies,num_features)): matrix of item features
      W (ndarray (num_users,num_features)) : matrix of user parameters
      b (ndarray (1, num_users)            : vector of user parameters
      Y (ndarray (num_movies,num_users)    : matrix of user ratings of movies
      R (ndarray (num_movies,num_users)    : matrix, where R(i, j) = 1 if the i-th movies was rated by the j-th user
      lambda_ (float): regularization parameter
    Returns:
      J (float) : Cost
    """
    j = (tf.linalg.matmul(X, tf.transpose(W)) + b - Y)*R
    J = 0.5 * tf.reduce_sum(j**2) + (lambda_/2) * (tf.reduce_sum(X**2) + tf.reduce_sum(W**2))
    return J
# Evaluate cost function
J = cofi_cost_func_v(X_r, W_r, b_r, Y_r, R_r, 0);
print(f"Cost: {J:0.2f}")

# Evaluate cost function with regularization
J = cofi_cost_func_v(X_r, W_r, b_r, Y_r, R_r, 1.5);
print(f"Cost (with regularization): {J:0.2f}")

movieList, movieList_df = load_Movie_List_pd()
# print(movieList_df)

my_ratings = np.zeros(num_movies)          #  Initialize my ratings

# Check the file small_movie_list.csv for id of each movie in our dataset
# For example, Toy Story 3 (2010) has ID 2700, so to rate it "5", you can set
my_ratings[2700] = 5

#Or suppose you did not enjoy Persuasion (2007), you can set
my_ratings[2609] = 2;

# We have selected a few movies we liked / did not like and the ratings we
# gave are as follows:
my_ratings[929]  = 5   # Lord of the Rings: The Return of the King, The
my_ratings[246]  = 5   # Shrek (2001)
my_ratings[2716] = 3   # Inception
my_ratings[1150] = 5   # Incredibles, The (2004)
my_ratings[382]  = 2   # Amelie (Fabuleux destin d'Amélie Poulain, Le)
my_ratings[366]  = 5   # Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001)
my_ratings[622]  = 5   # Harry Potter and the Chamber of Secrets (2002)
my_ratings[988]  = 3   # Eternal Sunshine of the Spotless Mind (2004)
my_ratings[2925] = 1   # Louis Theroux: Law & Disorder (2008)
my_ratings[2937] = 1   # Nothing to Declare (Rien à déclarer)
my_ratings[793]  = 5   # Pirates of the Caribbean: The Curse of the Black Pearl (2003)
#用于存储被我评价的电影对应的序号,并按升序排列
my_rated = [i for i in range(len(my_ratings)) if my_ratings[i] > 0]
print(my_rated)


print('\nNew user ratings:\n')
for i in range(len(my_ratings)):
    if my_ratings[i] > 0 :
        print(f'Rated {my_ratings[i]} for  {movieList_df.loc[i,"title"]}');
# Reload ratings and add new ratings
Y, R = load_ratings_small()
Y = np.c_[my_ratings, Y]
R = np.c_[(my_ratings != 0).astype(int), R]

# Normalize the Dataset
Ynorm, Ymean = normalizeRatings(Y, R)
#  Useful Values
num_movies, num_users = Y.shape
num_features = 100
"""
以下代码用于初始化W,X,b以便
作用在梯度下降中
"""
# Set Initial Parameters (W, X), use tf.Variable to track these variables
tf.random.set_seed(1234) # for consistent results
W = tf.Variable(tf.random.normal((num_users,  num_features),dtype=tf.float64),  name='W')
X = tf.Variable(tf.random.normal((num_movies, num_features),dtype=tf.float64),  name='X')
b = tf.Variable(tf.random.normal((1,          num_users),   dtype=tf.float64),  name='b')

# Instantiate an optimizer.Adam优化
optimizer = keras.optimizers.Adam(learning_rate=1e-1)


# 已知成本函数运行梯度下降迭代
iterations = 200
lambda_ = 1
for iter in range(iterations):
    # Use TensorFlow’s GradientTape
    # to record the operations used to compute the cost
    with tf.GradientTape() as tape:

        # Compute the cost (forward pass included in cost)
        cost_value = cofi_cost_func_v(X, W, b, Ynorm, R, lambda_)

    # Use the gradient tape to automatically retrieve
    # the gradients of the trainable variables with respect to the loss
    grads = tape.gradient( cost_value, [X,W,b] )

    # Run one step of gradient descent by updating
    # the value of the variables to minimize the loss.
    optimizer.apply_gradients( zip(grads, [X,W,b]) )

    # Log periodically.
    if iter % 20 == 0:
        print(f"Training loss at iteration {iter}: {cost_value:0.1f}")

# Make a prediction using trained weights and biases 做出预测
p = np.matmul(X.numpy(), np.transpose(W.numpy())) + b.numpy()

#restore the mean 可能和之前标准化过有关系
pm = p + Ymean

my_predictions = pm[:,0]


# sort predictions 对预测结果进行降序排序
ix = tf.argsort(my_predictions, direction='DESCENDING')
"""
由于上文使用了降序，所以这里取前17个评分
最高的电影由于每次迭代产生的W,b不同所以预测出的
电影可能不一样，如果不在之前我的评论中则
显示该电影的预测分数
"""
for i in range(17):
    j = ix[i]
    if j not in my_rated:
        print(f'Predicting rating {my_predictions[j]:0.2f} for movie {movieList[j]}')
"""
如果之前对电影做出过评论
则显示原始值和预测值
"""
print('\n\nOriginal vs Predicted ratings:\n')
for i in range(len(my_ratings)):
    if my_ratings[i] > 0:
        print(f'Original {my_ratings[i]}, Predicted {my_predictions[i]:0.2f} for {movieList[i]}')
"""
利用额外的信息来增强我们的预测其实不应该放在最后
为了防止歧义因此注释掉
"""
# filter=(movieList_df["number of ratings"] > 20)
# movieList_df["pred"] = my_predictions
# movieList_df = movieList_df.reindex(columns=["pred", "mean rating", "number of ratings", "title"])
# movieList_df.loc[ix[:300]].loc[filter].sort_values("mean rating", ascending=False)

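Deep learning for content-based filtering

In this exercise, you will implement content-based filtering using a neural network to build a recommender system for movies.
2 - Movie ratings dataset
The data set is derived from the MovieLens ml-latest-small dataset.
[F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. https://doi.org/10.1145/2827872]
The original dataset has 9000 movies rated by 600 users, with ratings on a scale of 0.5 to 5 in 0.5 step increments. The dataset has been reduced in size to focus on movies from the years since 2000 and popular genres. The reduced dataset has $n_u = 395$ users and $n_m= 694$ movies. For each movie, the dataset provides a movie title, release date, and one or more genres. For example, "Toy Story 3" was released in 2010 and has several genres: "Adventure|Animation|Children|Comedy|Fantasy|IMAX". This dataset contains little information about users other than their ratings. It is used to create training vectors for the neural networks described below.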
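2.1 Content-based filtering with a neural network
In the collaborative filtering lab, you generated two vectors, a user vector and an item/movie vector, whose dot product predicts a rating. Those vectors were derived solely from the ratings.
Content-based filtering also generates a user feature vector and a movie feature vector, but it recognizes that other information about the user and/or the movie may be available and may improve the prediction. The additional information is fed to a neural network, which then generates the user and movie vectors, as shown below.
The movie content provided to the network is a combination of the original data and some "engineered features". Recall the feature engineering discussion and lab from Course 1, Week 2, Lab 4. The original features are the year the movie was released and the movie's genres, presented as a one-hot vector. There are 14 genres. The engineered feature is an average rating derived from the user ratings. A movie with multiple genres has one training vector per genre.
The user content consists only of engineered features. A per-genre average rating is computed for each user. In addition, a user id, a rating count, and a rating average are available, but they are not included in the training or prediction content. They are useful for interpreting the data.
The training set consists of all the ratings made by the users in the dataset. The user and movie/item vectors are presented together to the network above as a training set. The user vector is the same for all the movies rated by that user.
Below, let's load and display some of the data.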

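np.genfromtxt parses delimited text into an array. For a multi-dimensional array (or a nested list), len() returns the number of rows, that is, the size of the first dimension.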
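
A quick way to check this is to parse a few lines of in-memory text. The column values below are invented, and StringIO merely stands in for a CSV file on disk.

```python
from io import StringIO
import numpy as np

# Four made-up rows with three columns each (e.g. id, year, average rating).
csv_text = StringIO("1,2010,4.0\n2,2011,3.5\n3,2012,2.0\n4,2013,4.5")
data = np.genfromtxt(csv_text, delimiter=",")

print(data.shape)         # (4, 3): 4 rows, 3 columns
print(len(data))          # 4, the number of rows
print(data.shape[1] - 1)  # 2, the column count without the id column
```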
Load the data:

# Load Data, set configuration variables
item_train, user_train, y_train, item_features, user_features, item_vecs, movie_dict, user_to_genre = load_data()

num_user_features = user_train.shape[1] - 3  # remove userid, rating count and ave rating during training
num_item_features = item_train.shape[1] - 1  # remove movie id at train time
uvs = 3  # user genre vector start
ivs = 3  # item genre vector start
u_s = 3  # start of columns to use in training, user
i_s = 1  # start of columns to use in training, items
scaledata = True  # applies the standard scalar to data if true
print(f"Number of training vectors: {len(item_train)}")

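.shape[1] returns the number of columns; subtracting 1 drops the first column (the movie id). In content_item_train.csv, the last 14 columns of 0/1 values
indicate which of the 14 genres the movie belongs to.
Scale the data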

# scale training data
if scaledata:
    item_train_save = item_train
    user_train_save = user_train

    scalerItem = StandardScaler()
    scalerItem.fit(item_train)
    item_train = scalerItem.transform(item_train)

    scalerUser = StandardScaler()
    scalerUser.fit(user_train)
    user_train = scalerUser.transform(user_train)

    print(np.allclose(item_train_save, scalerItem.inverse_transform(item_train)))
    print(np.allclose(user_train_save, scalerUser.inverse_transform(user_train)))

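This uses sklearn's data preprocessing. fit(): calculates the parameters μ and σ and stores them on the scaler object.
Explanation: put simply, fit computes the intrinsic statistics of the training set X, such as its mean, variance, maximum, and minimum (which ones depends on the tool). It can be thought of as a training step.
transform(): applies the transformation to a particular dataset using these calculated parameters.
Explanation: building on fit, transform performs the standardization, dimensionality reduction, normalization, or other operation (depending on the tool used, e.g. PCA or StandardScaler).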
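
To make the fit/transform split concrete, here is a minimal NumPy sketch of the idea. SimpleStandardScaler is a hypothetical stand-in written for this note, not sklearn's implementation, and it assumes no column has zero variance.

```python
import numpy as np

class SimpleStandardScaler:
    """Illustrative stand-in for a standard scaler (assumes non-constant columns)."""

    def fit(self, X):
        # "Training" step: learn the per-column mean and standard deviation.
        self.mean_ = X.mean(axis=0)
        self.scale_ = X.std(axis=0)
        return self

    def transform(self, X):
        # Apply the stored parameters to any dataset with the same columns.
        return (X - self.mean_) / self.scale_

    def inverse_transform(self, X_scaled):
        # Undo the scaling to get back the original units.
        return X_scaled * self.scale_ + self.mean_

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
scaler = SimpleStandardScaler().fit(X)
X_scaled = scaler.transform(X)

print(np.allclose(X_scaled.mean(axis=0), 0.0))             # True
print(np.allclose(X, scaler.inverse_transform(X_scaled)))  # True
```

The last line mirrors the np.allclose check in the lab code: scaling followed by inverse_transform should return the original data.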
y_train contains the users' actual ratings of the movies and is used in later steps.
Split and shuffle the data

item_train, item_test = train_test_split(item_train, train_size=0.80, shuffle=True, random_state=1)
user_train, user_test = train_test_split(user_train, train_size=0.80, shuffle=True, random_state=1)
y_train, y_test       = train_test_split(y_train,    train_size=0.80, shuffle=True, random_state=1)
print(f"movie/item training data shape: {item_train.shape}")
print(f"movie/item test  data shape: {item_test.shape}")
# The scaled, shuffled data now has a mean of zero.
pprint_train(user_train, user_features, uvs, u_s, maxcount=5)

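3 - Neural network for content-based filtering
Now, let's construct a neural network as described in the figure above. It will have two networks that are combined by a dot product. You will construct the two networks. In this example, they will be identical. Note that these networks do not need to be the same. If the user content were substantially larger than the movie content, you might increase the complexity of the user network relative to the movie network. In this case the content is similar, so the networks are the same.
Use a Keras sequential model:
The first layer is a dense layer with 256 units and a relu activation.
The second layer is a dense layer with 128 units and a relu activation.
The third layer is a dense layer with num_outputs units and a linear or no activation.
The rest of the network will be provided. The provided code does not use the Keras sequential model but instead uses the Keras functional API. This format allows more flexibility in how components are interconnected.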
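
The architecture described above could be sketched roughly as follows. This is not the lab's provided code: the feature counts are assumed values, build_tower is a hypothetical helper, and any normalization of the output vectors is left out.

```python
import numpy as np
import tensorflow as tf

num_user_features = 14   # assumed: 17 user columns minus id, count, average
num_item_features = 16   # assumed: 17 item columns minus the movie id
num_outputs = 32         # size of the user and movie feature vectors

def build_tower(num_outputs):
    """Hypothetical helper: one of the two identical sub-networks."""
    return tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(num_outputs),  # linear (no activation)
    ])

user_NN = build_tower(num_outputs)
item_NN = build_tower(num_outputs)

# Functional API: two inputs, one tower each, combined by a dot product.
input_user = tf.keras.layers.Input(shape=(num_user_features,))
input_item = tf.keras.layers.Input(shape=(num_item_features,))
vu = user_NN(input_user)
vm = item_NN(input_item)
output = tf.keras.layers.Dot(axes=1)([vu, vm])

model = tf.keras.Model([input_user, input_item], output)
model.summary()
```

The functional API is what makes the two-input layout possible; a single Sequential model only supports one input and one chain of layers.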
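With the neural network built, you will use the model below to make predictions in several situations.
Predictions for a new user
First, we'll create a new user and have the model suggest movies for that user. After trying this on the example user content, feel free to change the user content to match your own preferences and see what the model suggests. Note that ratings are between 0.5 and 5.0, inclusive, in half-step increments.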

new_user_id = 5000
new_rating_ave = 1.0
new_action = 1.0
new_adventure = 1
new_animation = 1
new_childrens = 1
new_comedy = 5
new_crime = 1
new_documentary = 1
new_drama = 1
new_fantasy = 1
new_horror = 1
new_mystery = 1
new_romance = 5
new_scifi = 5
new_thriller = 1
new_rating_count = 3

user_vec = np.array([[new_user_id, new_rating_count, new_rating_ave,
                      new_action, new_adventure, new_animation, new_childrens,
                      new_comedy, new_crime, new_documentary,
                      new_drama, new_fantasy, new_horror, new_mystery,
                      new_romance, new_scifi, new_thriller]])

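If you did create a user above, it is worth noting that the network was trained
to predict a user rating given a user vector that includes a set of user genre ratings.
Simply providing a maximum rating for a single genre and minimum ratings for the rest
may not be meaningful to the network if there were no users with similar sets of ratings.
Predictions for an existing user
Let's look at the predictions for "User 36", one of the users in the dataset.
We can compare the model's predicted ratings with the ratings the user actually gave.
Note that movies with multiple genres show up multiple times in the training data.
For example, "The Time Machine" has three genres: Adventure, Action, Sci-Fi.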

uid =  36 
# form a set of user vectors. This is the same vector, transformed and repeated.
user_vecs, y_vecs = get_user_vecs(uid, scalerUser.inverse_transform(user_train), item_vecs, user_to_genre)

# scale the vectors and make predictions for all movies. Return results sorted by rating.
sorted_index, sorted_ypu, sorted_items, sorted_user = predict_uservec(user_vecs, item_vecs, model, u_s, i_s, scaler, 
                                                                      scalerUser, scalerItem, scaledata=scaledata)
sorted_y = y_vecs[sorted_index]

#print sorted predictions
print_existing_user(sorted_ypu, sorted_y.reshape(-1,1), sorted_user, sorted_items, item_features, ivs, uvs, movie_dict, maxcount = 10)

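Finding similar items
The neural network above produces two feature vectors, a user feature vector $v_u$ and a movie feature vector $v_m$. These are 32-entry vectors whose values are difficult to interpret. However, similar items will have similar vectors. This information can be used to make recommendations. For example, if a user has rated "Toy Story 3" highly, one could recommend similar movies by selecting movies with similar movie feature vectors.

A similarity measure is the squared distance between the two vectors $\mathbf{v_m^{(k)}}$ and $\mathbf{v_m^{(i)}}$:
$$\left\Vert \mathbf{v_m^{(k)}} - \mathbf{v_m^{(i)}} \right\Vert^2 = \sum_{l=1}^{n}\left(v_{m_l}^{(k)} - v_{m_l}^{(i)}\right)^2\tag{1}$$
In this formula the superscripts $(k)$ and $(i)$ index the movies, and the subscript $l$ indexes the $n$ entries of the movie feature vector. Computing the distance for every pair of movies gives a squared-distance matrix. The diagonal compares each movie with itself, so the values along the diagonal are masked and left out of the calculation. The nearest movie to each movie can then be found by taking the minimum along each row.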
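
The distance matrix and the masked diagonal can be sketched in a few lines of NumPy. The four 2-entry vectors below are invented stand-ins for the 32-entry movie vectors.

```python
import numpy as np

# Made-up movie feature vectors: one row per movie.
vm = np.array([[0.0, 0.0],
               [1.0, 0.0],
               [5.0, 5.0],
               [5.0, 6.0]])

# dist[k, i] = squared distance between movie k and movie i (equation 1).
diff = vm[:, None, :] - vm[None, :, :]
dist = (diff ** 2).sum(axis=2)

# Mask the diagonal so a movie is never reported as its own nearest neighbor.
m_dist = np.ma.masked_array(dist, mask=np.eye(len(vm), dtype=bool))

# The minimum of each row gives the index of the closest other movie.
nearest = m_dist.argmin(axis=1)
print(nearest)  # [1 0 3 2]
```

Without the mask every row's minimum would be the zero on the diagonal, and each movie would simply match itself.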

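Reinforcement Learning

The Q(s,a) function

Q(s,a) is the return obtained by taking action a in state s and then behaving optimally afterwards. Comparing Q(s,a) across the available actions gives max Q. I am not sure about the following variable; it seems to be the probability of a misstep (moving in the wrong direction).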
# Probability of going in the wrong direction
misstep_prob = 0.0
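
To see how Q(s,a) and misstep_prob interact, here is a small sketch that iterates the Bellman equation on a six-state line world. The rewards (100 and 40 at the two ends), the discount factor 0.5 and the helper name compute_q are assumptions chosen for illustration, not values taken from these notes.

```python
import numpy as np

def compute_q(rewards, gamma=0.5, misstep_prob=0.0, n_iters=100):
    """Hypothetical helper: Q-values for a line world with terminal end states."""
    n = len(rewards)
    q = np.zeros((n, 2))      # column 0 = go left, column 1 = go right
    q[0, :] = rewards[0]      # terminal states just collect their reward
    q[-1, :] = rewards[-1]
    for _ in range(n_iters):
        v = q.max(axis=1)     # best achievable return from each state
        for s in range(1, n - 1):
            left, right = v[s - 1], v[s + 1]
            # With probability misstep_prob the move goes the opposite way.
            q[s, 0] = rewards[s] + gamma * ((1 - misstep_prob) * left + misstep_prob * right)
            q[s, 1] = rewards[s] + gamma * ((1 - misstep_prob) * right + misstep_prob * left)
    return q

rewards = [100, 0, 0, 0, 0, 40]
q = compute_q(rewards, gamma=0.5, misstep_prob=0.0)
print(q[1])  # [50.  12.5]
print(q[4])  # [ 6.25 20.  ]
```

With misstep_prob = 0.0 the moves are deterministic. Raising it mixes in the value of the opposite neighbor, which changes every Q-value of the non-terminal states.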

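Deep Q-Learning - Lunar Lander

In this assignment, you will train an agent to land a lunar lander safely on a landing pad on the surface of the moon.
1 - Import packages
We will use the following packages:
numpy is a package for scientific computing in Python.
deque will be the data structure for the memory buffer.
namedtuple will be used to store the experience tuples.
The gym toolkit is a collection of environments that can be used to test reinforcement learning algorithms. Note that this notebook uses gym version 0.24.0.
PIL.Image and pyvirtualdisplay are needed to render the Lunar Lander environment.
We will use several modules from the tensorflow.keras framework to build deep learning models.
utils is a module that contains helper functions for this assignment. You do not need to modify the code in this file.
Run the cell below to import all the necessary packages.
pyvirtualdisplay raised an error. To reinstall the package, you can run the pip command directly in a Jupyter notebook cell by prefixing it with an exclamation mark.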

# Set up a virtual display to render the Lunar Lander environment.
Display(visible=0, size=(840, 480)).start();

This line raises an error, probably because it is not supported on Windows and has to be run on Linux.
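
One possible workaround, assuming the error really does come from the platform, is to start the virtual display only on Linux. maybe_start_display is a hypothetical helper written for this note and has not been checked against the lab's environment.

```python
import sys

def maybe_start_display(platform=sys.platform):
    """Hypothetical helper: start a virtual display only when running on Linux."""
    if not platform.startswith("linux"):
        print("Skipping the virtual display: not running on Linux.")
        return None
    # Imported here so that other platforms never need the package.
    from pyvirtualdisplay import Display
    display = Display(visible=0, size=(840, 480))
    display.start()
    return display

# In the notebook this would replace the Display(...).start() line:
# display = maybe_start_display()
```

On other platforms the function just returns None, so the rest of the notebook can run, although rendering the environment may still need a different approach.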

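Supplementary Machine Learning Background

argmax is an operator that returns the argument (or set of arguments) at which a function attains its maximum. Given a function y = f(x), the result x0 = argmax(f(x)) means that f(x) attains the maximum of its range at x = x0. If several points give the same maximum value, argmax(f(x)) is a set of points. In other words, argmax(f(x)) is the point x (or the set of points x) at which f(x) is largest. "arg" stands for argument, which here means the independent variable.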
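
A short NumPy sketch shows the difference between the maximum value and the argmax, including the case of several maximizers. The functions f and g are made up for illustration. Note that np.argmax returns only the first maximizing index, so the full set has to be collected separately.

```python
import numpy as np

x = np.array([-2, -1, 0, 1, 2])

# f has a single maximum: f(0) = 4.
f = 4 - x ** 2
x0 = x[np.argmax(f)]
print(x0, f.max())   # 0 4

# g reaches its maximum value 2 at two points, x = -2 and x = 2.
g = np.abs(x)
first = x[np.argmax(g)]        # np.argmax keeps only the first maximizer
maximizers = x[g == g.max()]   # the full argmax set
print(first)         # -2
print(maximizers)    # [-2  2]
```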
Understanding P, NP, NP-complete, and NP-hard problems