1. Intro

pip install -U scikit-learn

1.1. make_circles

  • Use Scikit-Learn's make_circles() to generate
    two circles of differently coloured dots;

from sklearn.datasets import make_circles

# number of samples to generate
n_sample = 1000

# create the two circles
# noise: standard deviation of the Gaussian noise added to each dot
# random_state: fix the random seed so we get the same values on every run
X,y = make_circles(n_samples=n_sample,noise=0.03,random_state=42)

print(f"First 5 X Feature:\n{X[:5]}")
print(f"\nFirst 5 y Label:\n{y[:5]}")
First 5 X Feature:
[[ 0.75424625  0.23148074]
 [-0.75615888  0.15325888]
 [-0.81539193  0.17328203]
 [-0.39373073  0.69288277]
 [ 0.44220765 -0.89672343]]

First 5 y Label:
[1 1 1 1 0]
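
As a quick sanity check on what make_circles() produces, the sketch below regenerates the data without noise and measures each dot's distance from the origin. It relies on scikit-learn's default factor=0.8 (the ratio of the inner radius to the outer radius), which the call above does not set explicitly; the names X_clean, y_clean and radius are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_circles

# noise=None (the default) keeps every dot exactly on its circle
X_clean, y_clean = make_circles(n_samples=1000, noise=None, random_state=42)

# distance of each dot from the origin
radius = np.sqrt((X_clean ** 2).sum(axis=1))

# label 0 is the outer circle, label 1 is the inner circle
print(f"Outer circle (label 0) mean radius: {radius[y_clean == 0].mean():.2f}")
print(f"Inner circle (label 1) mean radius: {radius[y_clean == 1].mean():.2f}")
```

This matches the first five samples above: the fifth dot lies roughly one unit from the origin and carries label 0, while the others sit near radius 0.8 and carry label 1.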

1.2. Make DataFrame

  • It looks like there are two X values for each y value:

# Make DataFrame of circle data
import pandas as pd
circle = pd.DataFrame({"X1": X[:, 0],"X2": X[:, 1],"Label": y})
circle.head(10)
         X1        X2  Label
0  0.754246  0.231481      1
1 -0.756159  0.153259      1
2 -0.815392  0.173282      1
3 -0.393731  0.692883      1
4  0.442208 -0.896723      0
5 -0.479646  0.676435      1
6 -0.013648  0.803349      1
7  0.771513  0.147760      1
8 -0.169322 -0.793456      1
9 -0.121486  1.021509      0

1.3. Class Value

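  • It looks like each pair of X features (X1 and X2) has a label y of either 0 or 1;

  • This tells us the problem is binary classification, since there are only two options (0 or 1);

  • How many values does each class have? 500 each, so the classes are nicely balanced;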

# count how many samples each label has
circle.Label.value_counts()
Label
1    500
0    500
Name: count, dtype: int64
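
The same balance check can be done directly on the label array, without a DataFrame. This is only a sketch: it regenerates the labels with the same arguments as above, and the names y_check, counts and proportions are illustrative.

```python
import numpy as np
from sklearn.datasets import make_circles

# regenerate the labels with the same settings as above
_, y_check = make_circles(n_samples=1000, noise=0.03, random_state=42)

# np.bincount counts how often each non-negative integer label occurs
counts = np.bincount(y_check)

# share of the dataset that belongs to each class
proportions = counts / counts.sum()

print(f"Count per class: {counts}")
print(f"Proportion per class: {proportions}")
```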

1.4. Visualization

# visualize with plot
import matplotlib.pyplot as plt
plt.scatter(x=X[:, 0],y=X[:, 1],c=y,cmap=plt.cm.RdYlBu);
plt.savefig("VisualizeClassificationDataReadyPart.svg")
[Figure: VisualizeClassificationDataReadyPart.svg, scatter plot of X[:, 0] against X[:, 1], coloured by label]
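  • The goal: build a PyTorch neural network that classifies the dots as red (0) or blue (1);

  • Note: in ML this dataset is usually considered a toy problem, one used to try things out and test them;

  • But it captures the major key of classification:
    some data is represented as numerical values, and we build a model that can classify it;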

2. Input Output Shape

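  • One of the most common errors in DL is the shape error;

  • Mismatched shapes between tensors and tensor operations cause errors in the model,
    and there is no foolproof way to make sure they never happen;

  • So keep asking: what is the shape of the input, and what is the shape of the output?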

# check the shapes of our features and labels
X.shape, y.shape
((1000, 2), (1000,))
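  • It looks like the first dimension matches for both: there are 1000 X and 1000 y;

  • But what is the second dimension of X? Viewing the values and shapes of a single sample (feature and label) often helps;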

# view the first sample of features and labels
X_sample = X[0]
y_sample = y[0]
print(f"Value for one sample of X: {X_sample} and the same for y: {y_sample}")
print(f"Shape for one sample of X: {X_sample.shape} and the same for y: {y_sample.shape}")
Value for one sample of X: [0.75424625 0.23148074] and the same for y: 1
Shape for one sample of X: (2,) and the same for y: ()
  • This means the second dimension of X holds two features (a vector),
    whereas each y holds a single value (a scalar): two inputs for one output;
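
To make the shape error from the start of this section concrete, here is a minimal sketch of a matching and a mismatching matrix multiplication. The tensors features and weights are random stand-ins, not part of the circle data; they only mirror its layout of two input features and one output value.

```python
import torch

# a batch of 5 samples, each with 2 features (same layout as X)
features = torch.rand(5, 2)

# a weight matrix that maps 2 input features to 1 output value
weights = torch.rand(2, 1)

# inner dimensions match (2 and 2): (5, 2) @ (2, 1) -> (5, 1)
output = features @ weights
print(output.shape)

# inner dimensions do not match (2 and 5), so PyTorch raises a RuntimeError
try:
    features @ features
    mismatch_raised = False
except RuntimeError as err:
    mismatch_raised = True
    print(f"Shape error: {err}")
```

Transposing the second operand (features @ features.T) would make the inner dimensions match again, which is a common fix for this kind of error.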

3. Data Splitting

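  • Turn the data into tensors and create train and test splits;

  • With the input and output shapes understood, the data can now be prepared for PyTorch and modelling. Specifically:

  • A: Turn the data into tensors (it is currently in NumPy arrays); PyTorch prefers to work with PyTorch tensors;

  • B: Split the data into a training set and a test set. The model is trained on the training set
    to learn the patterns between X and y, and those learned patterns are then evaluated on the test set;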

# turn data into tensor
import torch
X = torch.from_numpy(X).type(torch.float)
y = torch.from_numpy(y).type(torch.float)

# view the first five samples
X[:5], y[:5]
(tensor([[ 0.7542,  0.2315],
         [-0.7562,  0.1533],
         [-0.8154,  0.1733],
         [-0.3937,  0.6929],
         [ 0.4422, -0.8967]]),
 tensor([1., 1., 1., 1., 0.]))
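
The .type(torch.float) call matters because NumPy and PyTorch use different default float types. The sketch below shows the difference on a small made-up array; array64, tensor64 and tensor32 are illustrative names.

```python
import numpy as np
import torch

# NumPy creates float64 arrays by default
array64 = np.array([0.75, -0.76, 0.44])

# torch.from_numpy keeps the NumPy dtype, so this tensor is float64
tensor64 = torch.from_numpy(array64)

# torch.float is an alias for torch.float32, PyTorch's default float type
tensor32 = tensor64.type(torch.float)

print(array64.dtype, tensor64.dtype, tensor32.dtype)
print(torch.get_default_dtype())
```

Layers created with torch.nn use the default dtype for their parameters, so converting the data up front avoids dtype mismatch errors later on.
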
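  • Now that the data is in tensor format, split it into a training set and a test set;

  • Use Scikit-Learn's train_test_split() function;

  • Use test_size=0.2 (80% training, 20% testing);
    because the split is made at random, also set random_state=42;

  • This makes the split reproducible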

# split data into train and test set
from sklearn.model_selection import train_test_split

# 20% test,80% train
# random_state=42:make random split reproducible
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

len(X_train), len(X_test), len(y_train), len(y_test)
(800, 200, 800, 200)
  • This gives 800 training samples and 200 testing samples
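
A minimal sketch of what random_state buys: splitting the same data twice with the same seed returns identical subsets. X_demo and y_demo are small made-up arrays standing in for the circle data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# stand-in data: 10 samples with 2 features each, and alternating 0/1 labels
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

# two independent splits that share the same random_state
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# each split is [X_train, X_test, y_train, y_test]; compare them pairwise
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

print(f"Same split both times: {same_split}")
print(f"Train size: {len(split_a[0])}, test size: {len(split_a[1])}")
```

Leaving random_state unset would draw a fresh shuffle on every call, so results could differ from run to run.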