:: AsciiDoc

1. Intro
2. Input Output Shape
3. Data Splitting

1. Intro

pip install -U scikit-learn

1.1. make_circles

Use Scikit-Learn's make_circles() to generate
two circles of differently coloured dots.

from sklearn.datasets import make_circles

# create 1000 samples
n_sample = 1000

# create circles
# noise: add a little Gaussian noise to the dots
# random_state: fix the random seed so we get the same values on every run
X, y = make_circles(n_sample, noise=0.03, random_state=42)

print(f"First 5 X Feature:\n{X[:5]}")
print(f"\nFirst 5 y Label:\n{y[:5]}")

First 5 X Feature:
[[ 0.75424625  0.23148074]
 [-0.75615888  0.15325888]
 [-0.81539193  0.17328203]
 [-0.39373073  0.69288277]
 [ 0.44220765 -0.89672343]]

First 5 y Label:
[1 1 1 1 0]
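
To see what the two circles look like underneath the noise, the sketch below regenerates the data with noise disabled and measures each point's distance from the origin. It relies on make_circles() placing the outer circle at radius 1 and the inner circle at radius `factor` (0.8 by default). The names X_clean, y_clean and radius are illustrative only and not part of the original notes.

```python
import numpy as np
from sklearn.datasets import make_circles

# Same sample count as above, but without noise so the geometry is exact
X_clean, y_clean = make_circles(n_samples=1000, noise=None, random_state=42)

# Distance of every point from the origin
radius = np.linalg.norm(X_clean, axis=1)

# Label 0 is the outer circle, label 1 is the inner circle
outer_radius = radius[y_clean == 0]
inner_radius = radius[y_clean == 1]

print(f"Outer circle (label 0) radius: {outer_radius.min():.3f} to {outer_radius.max():.3f}")
print(f"Inner circle (label 1) radius: {inner_radius.min():.3f} to {inner_radius.max():.3f}")
```

With noise=0.03, as in the notes, the radii scatter slightly around these two values instead of matching them exactly.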

1.2. Make DataFrame

It looks like there are two X values for each y value:

# Make DataFrame of circle data
import pandas as pd
circle = pd.DataFrame({"X1": X[:, 0],"X2": X[:, 1],"Label": y})
circle.head(10)

	X1	X2	Label
0	0.754246	0.231481	1
1	-0.756159	0.153259	1
2	-0.815392	0.173282	1
3	-0.393731	0.692883	1
4	0.442208	-0.896723	0
5	-0.479646	0.676435	1
6	-0.013648	0.803349	1
7	0.771513	0.147760	1
8	-0.169322	-0.793456	1
9	-0.121486	1.021509	0

1.3. Class Value

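Each pair of X features (X1 and X2) has a label y of either 0 or 1.
This tells us the problem is binary classification, since there are only two options (0 or 1).
How many values does each class have? 500 each, so the classes are nicely balanced.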

# count the samples per label
circle.Label.value_counts()

Label
1    500
0    500
Name: count, dtype: int64

1.4. Visualization

# visualize with plot
import matplotlib.pyplot as plt
plt.scatter(x=X[:, 0],y=X[:, 1],c=y,cmap=plt.cm.RdYlBu);
plt.savefig("VisualizeClassificationDataReadyPart.svg")

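The question now is how to build a PyTorch neural network that classifies the dots as red (0) or blue (1).
Note: this dataset is usually considered a toy problem in ML, a problem used to try things out and test them.
But it captures the major key of classification:
you have data represented as numerical values, and you build a model that can classify it.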

2. Input Output Shape

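One of the most common errors in DL is the shape error.
A mismatch between the shapes of tensors and tensor operations causes errors in the model,
and there is no foolproof way to make sure they never happen.
So keep asking: what is the shape of the input, and what is the shape of the output?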

# check shape of our feature and label
X.shape, y.shape

((1000, 2), (1000,))

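The first dimension matches for both: there are 1000 X and 1000 y.
But what is the second dimension of X? Viewing the values and shape of a single sample (feature and label) often helps.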

# view first example of feature and label
X_sample = X[0]
y_sample = y[0]
print(f"Value for one sample of X: {X_sample} and the same for y: {y_sample}")
print(f"Shape for one sample of X: {X_sample.shape} and the same for y: {y_sample.shape}")

Value for one sample of X: [0.75424625 0.23148074] and the same for y: 1
Shape for one sample of X: (2,) and the same for y: ()

This means the second dimension of X holds two features (a vector),
whereas y holds a single value (a scalar): two inputs for one output.
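
As a small illustration of the shape errors this section describes, the sketch below multiplies a batch of two-feature samples by a weight matrix. Both weight matrices are made-up placeholders (assumptions, not part of the original notes). NumPy is used because the data is still in NumPy arrays at this point; the same matching rule applies to PyTorch tensors.

```python
import numpy as np

# Five samples with two features each, matching the (n, 2) layout of X
batch = np.ones((5, 2))

# Hypothetical weights: the inner dimensions must match (2 features in, 1 value out)
good_weights = np.ones((2, 1))
bad_weights = np.ones((3, 1))  # expects 3 features, so it cannot be applied to batch

output = batch @ good_weights
print(f"Output shape: {output.shape}")

mismatch_message = None
try:
    batch @ bad_weights
except ValueError as error:
    # NumPy reports the mismatching dimension in the error message
    mismatch_message = str(error)
    print(f"Shape error: {mismatch_message}")
```

Checking `.shape` before each operation, as done above for X and y, is usually the quickest way to locate this kind of mismatch.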

3. Data Splitting

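Turn the data into tensors and create a train and test split.
Having examined the input and output shapes of the data, we can now prepare it for PyTorch and modelling. Specifically:
A: Turn the data into tensors (it is currently in NumPy arrays); PyTorch prefers working with PyTorch tensors.
B: Split the data into a training set and a test set. The model is trained on the training set
to learn the patterns between X and y, and those learned patterns are then evaluated on the test set.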

# turn data into tensor
import torch
X = torch.from_numpy(X).type(torch.float)
y = torch.from_numpy(y).type(torch.float)

# view first five sample
X[:5], y[:5]

(tensor([[ 0.7542,  0.2315],
         [-0.7562,  0.1533],
         [-0.8154,  0.1733],
         [-0.3937,  0.6929],
         [ 0.4422, -0.8967]]),
 tensor([1., 1., 1., 1., 0.]))
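
The `.type(torch.float)` call in the conversion above matters because torch.from_numpy() keeps the NumPy dtype, while PyTorch uses 32-bit floats by default. The sketch below shows the dtypes before and after the cast on two small stand-in arrays. The arrays and variable names are assumptions for illustration; they mimic the 64-bit float features and 64-bit integer labels that make_circles() typically returns.

```python
import numpy as np
import torch

# Stand-ins for the make_circles output: float64 features, int64 labels
features = np.array([[0.75, 0.23], [-0.76, 0.15]], dtype=np.float64)
labels = np.array([1, 0], dtype=np.int64)

# from_numpy keeps the NumPy dtype
features_raw = torch.from_numpy(features)
labels_raw = torch.from_numpy(labels)
print(features_raw.dtype, labels_raw.dtype)

# .type(torch.float) casts both to 32-bit floats
features_f32 = features_raw.type(torch.float)
labels_f32 = labels_raw.type(torch.float)
print(features_f32.dtype, labels_f32.dtype)
```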

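Now that the data is in tensor format, split it into training and test sets
using Scikit-Learn's train_test_split() function.
We use test_size=0.2 (80% training, 20% testing).
Because the split happens randomly across the data, we set random_state=42
so that the split is reproducible.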

# split data into train and test set
from sklearn.model_selection import train_test_split

# 20% test, 80% train
# random_state=42: make the random split reproducible
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

len(X_train), len(X_test), len(y_train), len(y_test)

(800, 200, 800, 200)

This gives 800 training samples and 200 testing samples.
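
To check what random_state does for reproducibility, the sketch below splits a small stand-in dataset twice with the same seed and compares the results. The arrays and names (features, labels, and the _demo variables) are assumptions for illustration and not part of the original notes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Small stand-in dataset: 10 samples, 2 features, alternating labels
features = np.arange(20).reshape(10, 2)
labels = np.array([0, 1] * 5)

# Two splits with the same seed
first = train_test_split(features, labels, test_size=0.2, random_state=42)
second = train_test_split(features, labels, test_size=0.2, random_state=42)

# Compare train/test features and labels pairwise
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print(f"Same split with the same seed: {same_split}")

X_train_demo, X_test_demo, y_train_demo, y_test_demo = first
print(len(X_train_demo), len(X_test_demo))
```

Leaving random_state unset would make each call draw a fresh shuffle, so results could differ between runs.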