Skip to content

Latest commit

 

History

History
215 lines (154 loc) · 6.84 KB

170703_p67.md

File metadata and controls

215 lines (154 loc) · 6.84 KB

p.66 Scikit-Learn design

※設計思想が分かる
(1)Transformer (データ変換)
(2)Estimators. (学習): fit_transform(データ変換も含む)
(3)Predictor  (予測)

_学習結果

-----------[p.67 Handling text and categorical attributes]----------------------------------------------

Handling text and categorical attributes

*the categorical attribute ocean_proximity (前にテキスト属性のため除外して計算した)
*機械学習は数値で作業する為、テキストを数値に変換(LabelEncoder)する

<前回データ準備>
import os
import pandas as pd
HOUSING_PATH = "datasets/housing"
def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

housing=load_housing_data()

housing.columns #列名確認

Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value', 'ocean_proximity'],
      dtype='object')



<数値に変換:LabelEncoder>

from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
housing_cat = housing["ocean_proximity"]
housing_cat_encoded = encoder.fit_transform(housing_cat)
housing_cat_encoded

array([3, 3, 3, ..., 1, 1, 1], dtype=int64)

print(encoder.classes_)

['<1H OCEAN' 'INLAND' 'ISLAND' 'NEAR BAY' 'NEAR OCEAN'] #(=>category:0,1,2,3,4)

print(housing_cat)

0        NEAR BAY
1        NEAR BAY
2        NEAR BAY
:
20637      INLAND
20638      INLAND
20639      INLAND
Name: ocean_proximity, dtype: object


<テキストtoバイナリ数値変換:OneHotEncoder>

['<1H OCEAN' 'INLAND' 'ISLAND' 'NEAR BAY' 'NEAR OCEAN'] #(=>category:0,1,2,3,4)
categories 0 and 4 are more similar than categories 0 and 1.


一般的な解決方法は、カテゴリ毎に1つのバイナリ属性を作成すること
one-hotencoding=only one attribute will be equal to 1 (hot) 
the others will be 0(cold).

from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1,1))
housing_cat_1hot


<20640x5 sparse matrix of type '<class 'numpy.float64'>'
	with 20640 stored elements in Compressed Sparse Row format>

Note thatfit_transform() expects a 2D array but housing_cat_encoded is a 1D array, 
so we need to reshape it:



<バイナリ数値変換+メモリ節約:LabelBinarizer>


メモリ節約(sparse matrix:0が多い行列)にtoarray()
toarray(): SciPy sparse matrix-(変換)->NumPy array;

housing_cat_1hot.toarray()

array([[ 0.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  1.,  0.],
       ..., 
       [ 0.,  1.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.]])

OneHotEncoder + toarray() => LabelBinarizer
2機能をまとめた関数

from sklearn.preprocessing import LabelBinarizer
encoder = LabelBinarizer()
housing_cat_1hot = encoder.fit_transform(housing_cat)
housing_cat_1hot


array([[0, 0, 0, 1, 0],
       [0, 0, 0, 1, 0],
       [0, 0, 0, 1, 0],
       ..., 
       [0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0]])

※出力デフォルトNumPy arrayになる
※sparse_output=Trueで、sparse matrix形式出力が得られる


Custom transformers

独自変数用の変換スクリプトを作成したい時。
bedrooms_per_roomを追加

独自クラスにmethod(fit(), transform() and fit_transform().)作成。
BaseクラスにBaseEstimator追加
two extra methods (get_params(), set_params())=hyperparameter tuningに使う

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6
class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True): # no *args or **kargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None): #変換なし
        return self # nothing else to do
    def transform(self, X, y=None): #変換あり
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)
housing_extra_attribs

array([[-122.23, 37.88, 41.0, ..., 'NEAR BAY', 6.984126984126984,
        2.5555555555555554],
       [-122.22, 37.86, 21.0, ..., 'NEAR BAY', 6.238137082601054,
        2.109841827768014],
       [-122.24, 37.85, 52.0, ..., 'NEAR BAY', 8.288135593220339,
        2.8022598870056497],
       ..., 
       [-121.22, 39.43, 17.0, ..., 'INLAND', 5.20554272517321,
        2.325635103926097],
       [-121.32, 39.43, 18.0, ..., 'INLAND', 5.329512893982808,
        2.1232091690544412],
       [-121.24, 39.37, 16.0, ..., 'INLAND', 5.254716981132075,
        2.616981132075472]], dtype=object)

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=True)
housing_extra_attribs2 = attr_adder.transform(housing.values)
housing_extra_attribs2

array([[-122.23, 37.88, 41.0, ..., 6.984126984126984, 2.5555555555555554,
        0.14659090909090908],
       [-122.22, 37.86, 21.0, ..., 6.238137082601054, 2.109841827768014,
        0.15579659106916466],
       [-122.24, 37.85, 52.0, ..., 8.288135593220339, 2.8022598870056497,
        0.12951601908657123],
       ..., 
       [-121.22, 39.43, 17.0, ..., 5.20554272517321, 2.325635103926097,
        0.21517302573203195],
       [-121.32, 39.43, 18.0, ..., 5.329512893982808, 2.1232091690544412,
        0.21989247311827956],
       [-121.24, 39.37, 16.0, ..., 5.254716981132075, 2.616981132075472,
        0.22118491921005387]], dtype=object)

hyperparameterにadd_bedrooms_per_roomを設定。
デフォルトではTrue
データ準備手順自動化で時間節約になる。


Feature scaling

※最も重要な変換の1つ
※例外を除き、機械学習アルゴリズムの入力数値属性は、通常異なるスケールを持つ。

全属性のスケールを同じにするには、2つの一般的な方法がある。
①min-max scaling (=normalization) MinMaxScaler
     = rescaled so that they end up ranging from 0 to 1.
  =feature_range hyperparameter(0~1)

②標準化(=standardization) StandardScaler
     = 平均値で引算し平均値0にする。分散で割り、分散値を1にする。
     = 最大最小値が固定でないので、Neural Network(0-1)では問題になる。
    =外れ値(Outlier)による影響が少ない


ヘルプの出し方。
from sklearn.preprocessing import MinMaxScaler
MinMaxScaler?