Skip to content

Latest commit

 

History

History
154 lines (94 loc) · 4.61 KB

170605_p57.md

File metadata and controls

154 lines (94 loc) · 4.61 KB

p.57-67 
----------------------------------------------------------------------------

*Discover and visualize the data to gain insights

<Data prepared>

Also, if the training set is very large, you may want to sample an exploration
set, to make manipulations easy and fast. In our case, the set is quite small 
so you can just work directly on the full set.


----------------------------------------------------------------------------

+Visualizing geographical data\


import os
import pandas as pd
HOUSING_PATH = "datasets/housing"
def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

housing=load_housing_data()

%matplotlib inline
import matplotlib.pyplot as plt
housing.plot(kind="scatter", x="longitude", y="latitude")
plt.savefig("p57_1.png") 

geographical information (latitude and longitude)

housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)  [alpha=透過度]
plt.savefig("p57_2.png") 


high density areas, namely the Bay
area and around Los Angeles and San Diego

Now let’s look at the housing prices. The radius of each circle will represent the district’s
population (using option s), and the color represents the price (option c). We
will use a predefined color map (option cmap) called “jet”, which ranges from blue
(low values) to red (high prices).

housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
s=housing["population"]/100, label="population",
c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,
            )
plt.legend()
plt.savefig("p57_3.png")

住宅価格が場所に非常に関連している、
主クラスタ検出に、クラスタリングアルゴリズムが有用
海隣接属性は有用かもしれませんが、北カリフォルニアでは
沿岸地区の住宅価格は高くないので、単純なルールではない

----------------------------------------------------------------------------

+Looking for correlations

the standard correlation
coefficient (also called Pearson’s r) between every pair of attributes 


相関係数=1に近い場合は強い正の相関、-1に近いときは、強い負の相関



属性間相関関係をチェックするもう1つの方法は、Pandasのscater_matrix関数利用。
各数値属性をペアでプロット。


from pandas.tools.plotting import scatter_matrix
attributes = ["median_house_value", "median_income", "total_rooms", "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))
plt.savefig("p57_4.png")

4変数についてscatter plotでデータをプロットしてみる。


相関しているincomeとhouse_valueのプロットを抽出。
50万の値の所に注意。35万付近にもデータを確認>もっと詳しいデータを要求する。

housing.plot(kind="scatter", x="median_income", y="median_house_value", alpha=0.1)
plt.savefig("p57_5.png")




-----------------------------------------------------------------------------------------------------------------------------------------------------------
+Experimenting with attribute combinations


housing["rooms_per_household"] = housing["total_rooms"]/housing["population"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"]=housing["population"]/housing["households"]


corr_matrix = housing.corr()


corr_matrix["median_house_value"].sort_values(ascending=False)


median_house_value          1.000000
median_income               0.688075
rooms_per_household         0.209482
total_rooms                 0.134153
housing_median_age          0.105623
households                  0.065843
total_bedrooms              0.049686
population_per_household   -0.023737
population                 -0.024650
longitude                  -0.045967
latitude                   -0.144160
bedrooms_per_room          -0.255880
Name: median_house_value, dtype: float64


housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()



housing.dropna(subset=["total_bedrooms"]) # option 1
housing.drop("total_bedrooms", axis=1) # option 2
median = housing["total_bedrooms"].median()
housing["total_bedrooms"].fillna(median) # option 3


from sklearn.preprocessing import Imputer
imputer = Imputer(strategy="median")



[p.64]Prepare the data for machine learning algorithms


Data cleaning


Scikit-Learn’s API is remarkably well designed. The main design principles are17:
*Consistency.
*Inspection:
*Non-proliferation of classes.
*Composition
*Sensible defaults. 

[p.67]Handling text and categorical attributes [--> NEXT]