1. Intro
pip install -U scikit-learn
1.1. make_circles
Use Scikit-Learn's make_circles() to generate two circles of differently coloured dots:
from sklearn.datasets import make_circles
# create 1000 samples
n_sample = 1000
# create the circles
# noise: add a little Gaussian noise to each dot
# random_state: fix the random seed so we get the same values on every run
X,y = make_circles(n_sample,noise=0.03,random_state=42)
print(f"First 5 X Feature:\n{X[:5]}")
print(f"\nFirst 5 y Label:\n{y[:5]}")
First 5 X Feature:
[[ 0.75424625 0.23148074]
[-0.75615888 0.15325888]
[-0.81539193 0.17328203]
[-0.39373073 0.69288277]
[ 0.44220765 -0.89672343]]
First 5 y Label:
[1 1 1 1 0]
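As a quick sanity check on what the two labels mean, the sketch below compares each class's mean distance from the origin. It assumes scikit-learn's defaults for make_circles, in particular factor=0.8, which the call above does not set; under those defaults label 0 should be the outer circle and label 1 the inner one. The names radius, outer_radius and inner_radius are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_circles

# same call as above (factor is left at its default of 0.8)
X, y = make_circles(1000, noise=0.03, random_state=42)

# distance of every point from the origin
radius = np.linalg.norm(X, axis=1)

# mean radius per class: roughly 1 for the outer circle, roughly 0.8 for the inner one
outer_radius = radius[y == 0].mean()
inner_radius = radius[y == 1].mean()
print(f"label 0 mean radius: {outer_radius:.2f}")
print(f"label 1 mean radius: {inner_radius:.2f}")
```

This agrees with the first five samples printed above: the only point labelled 0 lies about 1.0 from the origin, while the points labelled 1 lie about 0.8 away.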
1.2. Make DataFrame
It looks like there are two X values for every y value:
# Make DataFrame of circle data
import pandas as pd
circle = pd.DataFrame({"X1": X[:, 0],"X2": X[:, 1],"Label": y})
circle.head(10)
| | X1 | X2 | Label |
|---|---|---|---|
| 0 | 0.754246 | 0.231481 | 1 |
| 1 | -0.756159 | 0.153259 | 1 |
| 2 | -0.815392 | 0.173282 | 1 |
| 3 | -0.393731 | 0.692883 | 1 |
| 4 | 0.442208 | -0.896723 | 0 |
| 5 | -0.479646 | 0.676435 | 1 |
| 6 | -0.013648 | 0.803349 | 1 |
| 7 | 0.771513 | 0.147760 | 1 |
| 8 | -0.169322 | -0.793456 | 1 |
| 9 | -0.121486 | 1.021509 | 0 |
1.3. Class Value
# count how many samples belong to each label
circle.Label.value_counts()
Label
1 500
0 500
Name: count, dtype: int64
1.4. Visualization
# visualize the data with a scatter plot
import matplotlib.pyplot as plt
plt.scatter(x=X[:, 0],y=X[:, 1],c=y,cmap=plt.cm.RdYlBu);
plt.savefig("VisualizeClassificationDataReadyPart.svg")
2. Input Output Shape
# check the shapes of our features and labels
X.shape, y.shape
((1000, 2), (1000,))
# view the first example of features and labels
X_sample = X[0]
y_sample = y[0]
print(f"Value for one sample of X: {X_sample} and the same for y: {y_sample}")
print(f"Shape for one sample of X: {X_sample.shape} and the same for y: {y_sample.shape}")
Value for one sample of X: [0.75424625 0.23148074] and the same for y: 1
Shape for one sample of X: (2,) and the same for y: ()
This means the second dimension of X holds two features (a vector), while each y holds a single value (a scalar): two inputs for every output.
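The input and output sizes can also be read off the array shapes instead of being hard-coded. The sketch below is one possible way to do that; in_features and out_features are hypothetical names chosen for illustration and do not appear in the original code.

```python
from sklearn.datasets import make_circles

X, y = make_circles(1000, noise=0.03, random_state=42)

# number of values in one input sample (the length of each feature vector)
in_features = X.shape[1]

# a 1-D label array means one scalar per sample; otherwise use its second dimension
out_features = 1 if y.ndim == 1 else y.shape[1]

print(f"inputs per sample: {in_features}, outputs per sample: {out_features}")
```

Any model fitted to this data would need to accept in_features values per sample and produce out_features values in return.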
3. Data Splitting
# turn the data into tensors
import torch
X = torch.from_numpy(X).type(torch.float)
y = torch.from_numpy(y).type(torch.float)
# view the first five samples
X[:5], y[:5]
(tensor([[ 0.7542, 0.2315],
[-0.7562, 0.1533],
[-0.8154, 0.1733],
[-0.3937, 0.6929],
[ 0.4422, -0.8967]]),
tensor([1., 1., 1., 1., 0.]))
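The .type(torch.float) call matters because torch.from_numpy keeps the dtype of the NumPy array, and NumPy floats default to float64, whereas PyTorch's default float type is float32. A minimal sketch of the difference, using a small hand-written array (arr, as_is and as_float are illustrative names, not from the original):

```python
import numpy as np
import torch

# NumPy creates float64 arrays by default
arr = np.array([[0.75, 0.23], [-0.76, 0.15]])

# from_numpy preserves the NumPy dtype
as_is = torch.from_numpy(arr)

# .type(torch.float) casts to float32
as_float = torch.from_numpy(arr).type(torch.float)

print(as_is.dtype, as_float.dtype)
```

Casting once at this point avoids dtype mismatches later, when the data meets PyTorch objects created with the default float32 type.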
# split the data into train and test sets
from sklearn.model_selection import train_test_split
# 20% test,80% train
# random_state=42:make random split reproducible
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)
len(X_train), len(X_test), len(y_train), len(y_test)
(800, 200, 800, 200)
This gives 800 training samples and 200 testing samples.
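A plain random split does not guarantee that both classes stay evenly represented in each set. The sketch below shows one optional variation, not what the code above does: passing stratify=y to train_test_split so the 50/50 class ratio is preserved. It works on the NumPy arrays before any tensor conversion, and the *_s variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split

X, y = make_circles(1000, noise=0.03, random_state=42)

# stratify=y keeps the class proportions the same in the train and test sets
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# count the samples of each class in both sets
train_counts = np.bincount(y_train_s)
test_counts = np.bincount(y_test_s)
print(f"train class counts: {train_counts}")
print(f"test class counts: {test_counts}")
```

With balanced data like this, an unstratified split usually lands close to 50/50 anyway, so stratification matters more when the classes are imbalanced.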