p.66 Scikit-Learn design
※設計思想が分かる
(1)Transformer (データ変換)
(2)Estimators. (学習): fit_transform(データ変換も含む)
(3)Predictor (予測)
_学習結果
-----------[p.67 Handling text and categorical attributes]----------------------------------------------
Handling text and categorical attributes
*the categorical attribute ocean_proximity (前にテキスト属性のため除外して計算した)
*機械学習は数値で作業する為、テキストを数値に変換(LabelEncoder)する
<前回データ準備>
import os
import pandas as pd
HOUSING_PATH = "datasets/housing"
def load_housing_data(housing_path=HOUSING_PATH):
csv_path = os.path.join(housing_path, "housing.csv")
return pd.read_csv(csv_path)
housing=load_housing_data()
housing.columns #列名確認
Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
'total_bedrooms', 'population', 'households', 'median_income',
'median_house_value', 'ocean_proximity'],
dtype='object')
<数値に変換:LabelEncoder>
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
housing_cat = housing["ocean_proximity"]
housing_cat_encoded = encoder.fit_transform(housing_cat)
housing_cat_encoded
array([3, 3, 3, ..., 1, 1, 1], dtype=int64)
print(encoder.classes_)
['<1H OCEAN' 'INLAND' 'ISLAND' 'NEAR BAY' 'NEAR OCEAN'] #(=>category:0,1,2,3,4)
print(housing_cat)
0 NEAR BAY
1 NEAR BAY
2 NEAR BAY
:
20637 INLAND
20638 INLAND
20639 INLAND
Name: ocean_proximity, dtype: object
<テキストtoバイナリ数値変換:OneHotEncoder>
['<1H OCEAN' 'INLAND' 'ISLAND' 'NEAR BAY' 'NEAR OCEAN'] #(=>category:0,1,2,3,4)
categories 0 and 4 are more similar than categories 0 and 1.
一般的な解決方法は、カテゴリ毎に1つのバイナリ属性を作成すること
one-hotencoding=only one attribute will be equal to 1 (hot)
the others will be 0(cold).
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1,1))
housing_cat_1hot
<20640x5 sparse matrix of type '<class 'numpy.float64'>'
with 20640 stored elements in Compressed Sparse Row format>
Note thatfit_transform() expects a 2D array but housing_cat_encoded is a 1D array,
so we need to reshape it:
<バイナリ数値変換+メモリ節約:LabelBinarizer>
メモリ節約(sparse matrix:0が多い行列)にtoarray()
toarray(): SciPy sparse matrix-(変換)->NumPy array;
housing_cat_1hot.toarray()
array([[ 0., 0., 0., 1., 0.],
[ 0., 0., 0., 1., 0.],
[ 0., 0., 0., 1., 0.],
...,
[ 0., 1., 0., 0., 0.],
[ 0., 1., 0., 0., 0.],
[ 0., 1., 0., 0., 0.]])
OneHotEncoder + toarray() => LabelBinarizer
2機能をまとめた関数
from sklearn.preprocessing import LabelBinarizer
encoder = LabelBinarizer()
housing_cat_1hot = encoder.fit_transform(housing_cat)
housing_cat_1hot
array([[0, 0, 0, 1, 0],
[0, 0, 0, 1, 0],
[0, 0, 0, 1, 0],
...,
[0, 1, 0, 0, 0],
[0, 1, 0, 0, 0],
[0, 1, 0, 0, 0]])
※出力デフォルトNumPy arrayになる
※sparse_output=Trueで、sparse matrix形式出力が得られる
Custom transformers
独自変数用の変換スクリプトを作成したい時。
bedrooms_per_roomを追加
独自クラスにmethod(fit(), transform() and fit_transform().)作成。
BaseクラスにBaseEstimator追加
two extra methods (get_params(), set_params())=hyperparameter tuningに使う
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6
class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
def __init__(self, add_bedrooms_per_room = True): # no *args or **kargs
self.add_bedrooms_per_room = add_bedrooms_per_room
def fit(self, X, y=None): #変換なし
return self # nothing else to do
def transform(self, X, y=None): #変換あり
rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
population_per_household = X[:, population_ix] / X[:, household_ix]
if self.add_bedrooms_per_room:
bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
else:
return np.c_[X, rooms_per_household, population_per_household]
attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)
housing_extra_attribs
array([[-122.23, 37.88, 41.0, ..., 'NEAR BAY', 6.984126984126984,
2.5555555555555554],
[-122.22, 37.86, 21.0, ..., 'NEAR BAY', 6.238137082601054,
2.109841827768014],
[-122.24, 37.85, 52.0, ..., 'NEAR BAY', 8.288135593220339,
2.8022598870056497],
...,
[-121.22, 39.43, 17.0, ..., 'INLAND', 5.20554272517321,
2.325635103926097],
[-121.32, 39.43, 18.0, ..., 'INLAND', 5.329512893982808,
2.1232091690544412],
[-121.24, 39.37, 16.0, ..., 'INLAND', 5.254716981132075,
2.616981132075472]], dtype=object)
attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=True)
housing_extra_attribs2 = attr_adder.transform(housing.values)
housing_extra_attribs2
array([[-122.23, 37.88, 41.0, ..., 6.984126984126984, 2.5555555555555554,
0.14659090909090908],
[-122.22, 37.86, 21.0, ..., 6.238137082601054, 2.109841827768014,
0.15579659106916466],
[-122.24, 37.85, 52.0, ..., 8.288135593220339, 2.8022598870056497,
0.12951601908657123],
...,
[-121.22, 39.43, 17.0, ..., 5.20554272517321, 2.325635103926097,
0.21517302573203195],
[-121.32, 39.43, 18.0, ..., 5.329512893982808, 2.1232091690544412,
0.21989247311827956],
[-121.24, 39.37, 16.0, ..., 5.254716981132075, 2.616981132075472,
0.22118491921005387]], dtype=object)
hyperparameterにadd_bedrooms_per_roomを設定。
デフォルトではTrue
データ準備手順自動化で時間節約になる。
Feature scaling
※最も重要な変換の1つ
※例外を除き、機械学習アルゴリズムの入力数値属性は、通常異なるスケールを持つ。
全属性のスケールを同じにするには、2つの一般的な方法がある。
①min-max scaling (=normalization) MinMaxScaler
= rescaled so that they end up ranging from 0 to 1.
=feature_range hyperparameter(0~1)
②標準化(=standardization) StandardScaler
= 平均値で引算し平均値0にする。分散で割り、分散値を1にする。
= 最大最小値が固定でないので、Neural Network(0-1)では問題になる。
=外れ値(Outlier)による影響が少ない
ヘルプの出し方。
from sklearn.preprocessing import MinMaxScaler
MinMaxScaler?