Decision Tree 집값 예측(회귀)

September 24, 2022

목표

Decision Tree 모델 사용하기
모델 최적화하기

# 예측하기 좋은 최적의 질문을 만들어 학습하는 모델
from sklearn.tree import DecisionTreeRegressor

house_model = DecisionTreeRegressor()

2. 모델 학습

데이터 로딩 후 탐색

import pandas as pd

# 데이터 로딩
train = pd.read_csv('./data/house/train.csv')
test = pd.read_csv('./data/house/test.csv')
train.head()

	Id	Suburb	Address	Rooms	Type	Method	SellerG	Date	Distance	Postcode	…	Car	Landsize	BuildingArea	YearBuilt	CouncilArea	Lattitude	Longtitude	Regionname	Propertycount	Price
0	5467	Rosanna	22 Douglas St	2	h	S	Miles	19/11/2016	11.4	3084	…	1.0	757	NaN	NaN	Banyule	-37.74280	145.07000	Eastern Metropolitan	3540	1200000
1	4365	North Melbourne	103/25 Byron St	1	u	SP	Jellis	16/07/2016	2.3	3051	…	1.0	0	60.0	2012.0	Melbourne	-37.80200	144.95160	Northern Metropolitan	6821	450000
2	9741	Surrey Hills	4/40 Durham Rd	3	u	SP	Noel	17/06/2017	10.2	3127	…	1.0	149	NaN	NaN	Boroondara	-37.82971	145.09007	Southern Metropolitan	5457	780000
3	11945	Cheltenham	3/33 Sunray Av	2	t	S	Buxton	29/07/2017	17.9	3192	…	1.0	171	NaN	NaN	Kingston	-37.96304	145.06421	Southern Metropolitan	9758	751000
4	4038	Mont Albert	7/27 High St	3	t	S	Fletchers	15/10/2016	11.8	3127	…	2.0	330	148.0	2001.0	Whitehorse	-37.81670	145.10700	Eastern Metropolitan	2079	1310000

5 rows × 22 columns

# 전체 데이터 갯수 파악
train.shape, test.shape

((10185, 22), (3395, 21))

# 컬럼 확인
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10185 entries, 0 to 10184
Data columns (total 22 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 Id             10185 non-null  int64
 Suburb         10185 non-null  object
 Address        10185 non-null  object
 Rooms          10185 non-null  int64
 Type           10185 non-null  object
 Method         10185 non-null  object
 SellerG        10185 non-null  object
 Date           10185 non-null  object
 Distance       10185 non-null  float64
 Postcode       10185 non-null  int64
Bedroom2       10185 non-null  int64
Bathroom       10185 non-null  int64
Car            10142 non-null  float64
Landsize       10185 non-null  int64
BuildingArea   5367 non-null   float64
YearBuilt      6153 non-null   float64
CouncilArea    9174 non-null   object
Lattitude      10185 non-null  float64
Longtitude     10185 non-null  float64
Regionname     10185 non-null  object
Propertycount  10185 non-null  int64
Price          10185 non-null  int64
dtypes: float64(6), int64(8), object(8)
memory usage: 1.7+ MB

# 기술 통계, include='all' 범주형 통계 값 포함(고유값, 최빈값, ...)
train.describe(include='all')

	Id	Suburb	Address	Rooms	Type	Method	SellerG	Date	Distance	Postcode	…	Car	Landsize	BuildingArea	YearBuilt	CouncilArea	Lattitude	Longtitude	Regionname	Propertycount	Price
count	10185.000000	10185	10185	10185.000000	10185	10185	10185	10185	10185.000000	10185.000000	…	10142.000000	10185.000000	5367.000000	6153.000000	9174	10185.000000	10185.000000	10185	10185.000000	1.018500e+04
unique	NaN	310	10066	NaN	3	5	243	58	NaN	NaN	…	NaN	NaN	NaN	NaN	33	NaN	NaN	8	NaN	NaN
top	NaN	Reservoir	2 Bruce St	NaN	h	S	Nelson	27/05/2017	NaN	NaN	…	NaN	NaN	NaN	NaN	Moreland	NaN	NaN	Southern Metropolitan	NaN	NaN
freq	NaN	261	3	NaN	7106	6753	1156	364	NaN	NaN	…	NaN	NaN	NaN	NaN	887	NaN	NaN	3525	NaN	NaN
mean	6802.613942	NaN	NaN	2.943250	NaN	NaN	NaN	NaN	10.198213	3105.172607	…	1.613883	573.426411	154.137372	1964.904599	NaN	-37.809763	144.995347	NaN	7447.172018	1.077961e+06
std	3926.702100	NaN	NaN	0.952794	NaN	NaN	NaN	NaN	5.866640	90.198740	…	0.959076	4550.757180	614.711880	37.603561	NaN	0.079922	0.104255	NaN	4354.473015	6.364301e+05
min	3.000000	NaN	NaN	1.000000	NaN	NaN	NaN	NaN	0.000000	3000.000000	…	0.000000	0.000000	0.000000	1196.000000	NaN	-38.182550	144.431810	NaN	249.000000	1.310000e+05
25%	3384.000000	NaN	NaN	2.000000	NaN	NaN	NaN	NaN	6.200000	3044.000000	…	1.000000	178.000000	93.920000	1940.000000	NaN	-37.857700	144.929500	NaN	4380.000000	6.500000e+05
50%	6838.000000	NaN	NaN	3.000000	NaN	NaN	NaN	NaN	9.300000	3084.000000	…	2.000000	448.000000	127.000000	1970.000000	NaN	-37.802900	145.000130	NaN	6543.000000	9.050000e+05
75%	10223.000000	NaN	NaN	4.000000	NaN	NaN	NaN	NaN	13.000000	3149.000000	…	2.000000	652.000000	175.000000	2000.000000	NaN	-37.756710	145.059280	NaN	10331.000000	1.330000e+06
max	13577.000000	NaN	NaN	8.000000	NaN	NaN	NaN	NaN	48.100000	3977.000000	…	10.000000	433014.000000	44515.000000	2018.000000	NaN	-37.408530	145.526350	NaN	21650.000000	7.650000e+06

11 rows × 22 columns

문제와 정답 추출
일단은 결측치가 존재하는 컬럼과 문자형태의 컬럼은 배제

X_train = train[['Propertycount', 'Rooms', 'Bedroom2']]
y_train = train['Price']

# train -> 7.5 : 2.5비율로 train2, validation 데이터로 나누기
from sklearn.model_selection import train_test_split

X_train2, X_val, y_train2, y_val = train_test_split(X_train, y_train, random_state=3)
X_train2.shape, X_val.shape, y_train2.shape, y_val.shape

((7638, 3), (2547, 3), (7638,), (2547,))

house_model.fit(X_train2, y_train2)

DecisionTreeRegressor()

3. 모델 예측

pred = house_model.predict(X_val)
pred

array([ 451840. , 1344925. , 684462.5, …, 1843500. , 1300000. , 2272500. ])

4. 모델 평가

MAE(Mean Absolute Error, 평균 절대값 오차) 활용한 평가

from sklearn.metrics import mean_absolute_error
error = mean_absolute_error(y_val, pred)
error

255227.3455560717

캐글에 업로드 하기

X_test = test[['Propertycount', 'Rooms', 'Bedroom2']]
X_test.shape

(3395, 3)

test_pre = house_model.predict(X_test)
test_pre

array([ 382333.33333333, 522500. , 700000. , …, 710500. , 651357.14285714, 1042205.88235294])

# 정답지 파일 로딩
submission = pd.read_csv("./data/house/sample_submission.csv")
submission

	Id	Price
0	3189	0
1	2539	0
2	9171	0
3	4741	0
4	12455	0
…	…	…
3390	12276	0
3391	4618	0
3392	12913	0
3393	11741	0
3394	1072	0

3395 rows × 2 columns

test.head()

	Id	Suburb	Address	Rooms	Type	Method	SellerG	Date	Distance	Postcode	…	Bathroom	Car	Landsize	BuildingArea	YearBuilt	CouncilArea	Lattitude	Longtitude	Regionname	Propertycount
0	3189	Hawthorn	22/9 Lisson Gr	1	u	S	Biggin	19/11/2016	4.6	3122	…	1	1.0	0	52.0	1970.0	Boroondara	-37.82610	145.02690	Southern Metropolitan	11308
1	2539	Fitzroy	113/300 Young St	1	u	SP	Jellis	19/11/2016	1.6	3065	…	1	1.0	0	52.0	2011.0	Yarra	-37.79740	144.97990	Northern Metropolitan	5825
2	9171	Greenvale	7 Murray Ct	5	h	S	Barry	3/06/2017	20.4	3059	…	3	5.0	1750	310.0	1990.0	Hume	-37.65439	144.89113	Northern Metropolitan	4864
3	4741	Port Melbourne	172 Albert St	2	h	S	hockingstuart	10/12/2016	3.8	3207	…	1	0.0	106	70.0	1910.0	Port Phillip	-37.83460	144.93730	Southern Metropolitan	8648
4	12455	Brunswick West	47 Everett St	4	h	VB	Nelson	9/09/2017	5.2	3055	…	2	2.0	600	180.0	2004.0	NaN	-37.75465	144.94144	Northern Metropolitan	7082

5 rows × 21 columns

submission['Price'] = test_pre
submission.head()

	Id	Price
0	3189	3.823333e+05
1	2539	5.225000e+05
2	9171	7.000000e+05
3	4741	9.572778e+05
4	12455	1.294818e+06

# csv파일로 저장
submission.to_csv("./data/house/myPrediction.csv",
                 index=False)

다른 컬럼을 이용해보자.

결측치가 있는 컬럼
- 데이터를 버린다. -> drop, dropna
- 데이터를 채운다. -> fillna
  - 기술통계 활용
  - 모델 활용 -> 결측치를 정답, 주변컬럼을 문제로 설정
문자형태의 컬럼
- 문자타입 -> 숫자타입 변경(인코딩)
  - 라벨 인코딩 -> 임의의 숫자를 글자에 부여
  - 원핫 인코딩 -> 0과1을 이용해서 변환

test.columns

Index([‘Id’, ‘Suburb’, ‘Address’, ‘Rooms’, ‘Type’, ‘Method’, ‘SellerG’, ‘Date’, ‘Distance’, ‘Postcode’, ‘Bedroom2’, ‘Bathroom’, ‘Car’, ‘Landsize’, ‘BuildingArea’, ‘YearBuilt’, ‘CouncilArea’, ‘Lattitude’, ‘Longtitude’, ‘Regionname’, ‘Propertycount’], dtype=‘object’)

train.corr()

| | Id | Rooms | Distance | Postcode | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude | Propertycount | Price | h | t | u | Method_label | Regionname_label | | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | | Id | 1.000000 | 0.100741 | 0.293943 | 0.092843 | 0.116731 | 0.040834 | 0.099056 | 0.026959 | 0.024208 | 0.104456 | 0.040772 | 0.056133 | 0.008075 | -0.051498 | 0.114506 | -0.041921 | -0.099133 | 0.011529 | 0.267904 | | Rooms | 0.100741 | 1.000000 | 0.294621 | 0.053791 | 0.937357 | 0.588991 | 0.406340 | 0.024508 | 0.115067 | -0.059154 | 0.012710 | 0.104277 | -0.081669 | 0.497539 | 0.507407 | -0.028028 | -0.543692 | 0.003428 | 0.166209 | | Distance | 0.293943 | 0.294621 | 1.000000 | 0.430208 | 0.294438 | 0.120574 | 0.268467 | 0.024602 | 0.108834 | 0.245371 | -0.134113 | 0.239075 | -0.052206 | -0.164388 | 0.213490 | -0.012074 | -0.228571 | -0.093669 | 0.510509 | | Postcode | 0.092843 | 0.053791 | 0.430208 | 1.000000 | 0.059840 | 0.108927 | 0.052475 | 0.027304 | 0.061235 | 0.021593 | -0.408562 | 0.450340 | 0.057250 | 0.111511 | -0.024155 | -0.008960 | 0.032696 | 0.016300 | -0.055349 | | Bedroom2 | 0.116731 | 0.937357 | 0.294438 | 0.059840 | 1.000000 | 0.579122 | 0.402223 | 0.024382 | 0.112563 | -0.045148 | 0.012189 | 0.105248 | -0.080096 | 0.473742 | 0.485838 | -0.027327 | -0.520256 | 0.005557 | 0.165695 | | Bathroom | 0.040834 | 0.588991 | 0.120574 | 0.108927 | 0.579122 | 1.000000 | 0.319449 | 0.038824 | 0.106673 | 0.154065 | -0.072667 | 0.116156 | -0.050214 | 0.464396 | 0.169435 | 0.119998 | -0.267174 | 0.084622 | 0.015280 | | Car | 0.099056 | 0.406340 | 0.268467 | 0.052475 | 0.402223 | 0.319449 | 1.000000 | 0.023537 | 0.096233 | 0.113786 | -0.003670 | 0.067147 | -0.024338 | 0.235853 | 0.253054 | -0.014562 | -0.270764 | 0.012097 | 0.112987 | | Landsize | 0.026959 | 0.024508 | 0.024602 | 0.027304 | 0.024382 | 0.038824 | 0.023537 | 1.000000 | 0.546089 | 0.040293 | 0.010407 | 0.010326 | -0.009704 | 0.040665 | 0.022618 | -0.018688 | -0.012693 | 0.029722 | 0.022445 | | BuildingArea | 0.024208 | 0.115067 | 0.108834 | 0.061235 | 0.112563 | 0.106673 | 0.096233 | 0.546089 | 1.000000 | 0.021732 | 0.052281 | -0.033764 | -0.029216 | 0.082028 | 0.060088 | -0.004645 | -0.064177 | -0.001155 | 0.060186 | | YearBuilt | 0.104456 | -0.059154 | 0.245371 | 0.021593 | -0.045148 | 0.154065 | 0.113786 | 0.040293 | 0.021732 | 1.000000 | 0.060937 | -0.017734 | 0.006838 | -0.324162 | -0.393885 | 0.298914 | 0.230626 | 0.027549 | 0.130729 | | Lattitude | 0.040772 | 0.012710 | -0.134113 | -0.408562 | 0.012189 | -0.072667 | -0.003670 | 0.010407 | 0.052281 | 0.060937 | 1.000000 | -0.359648 | 0.060830 | -0.210428 | 0.108648 | -0.035545 | -0.096862 | -0.027790 | 0.141548 | | Longtitude | 0.056133 | 0.104277 | 0.239075 | 0.450340 | 0.105248 | 0.116156 | 0.067147 | 0.010326 | -0.033764 | -0.017734 | -0.359648 | 1.000000 | 0.056610 | 0.206866 | -0.008365 | 0.006092 | 0.005237 | 0.022174 | -0.006846 | | Propertycount | 0.008075 | -0.081669 | -0.052206 | 0.057250 | -0.080096 | -0.050214 | -0.024338 | -0.009704 | -0.029216 | 0.006838 | 0.060830 | 0.056610 | 1.000000 | -0.046582 | -0.065180 | -0.020591 | 0.085853 | -0.005943 | -0.184688 | | Price | -0.051498 | 0.497539 | -0.164388 | 0.111511 | 0.473742 | 0.464396 | 0.235853 | 0.040665 | 0.082028 | -0.324162 | -0.210428 | 0.206866 | -0.046582 | 1.000000 | 0.391641 | -0.061995 | -0.392931 | 0.029182 | -0.237278 | | h | 0.114506 | 0.507407 | 0.213490 | -0.024155 | 0.485838 | 0.169435 | 0.253054 | 0.022618 | 0.060088 | -0.393885 | 0.108648 | -0.008365 | -0.065180 | 0.391641 | 1.000000 | -0.453989 | -0.807573 | -0.065250 | 0.191063 | | t | -0.041921 | -0.028028 | -0.012074 | -0.008960 | -0.027327 | 0.119998 | -0.014562 | -0.018688 | -0.004645 | 0.298914 | -0.035545 | 0.006092 | -0.020591 | -0.061995 | -0.453989 | 1.000000 | -0.158859 | 0.008709 | -0.037968 | | u | -0.099133 | -0.543692 | -0.228571 | 0.032696 | -0.520256 | -0.267174 | -0.270764 | -0.012693 | -0.064177 | 0.230626 | -0.096862 | 0.005237 | 0.085853 | -0.392931 | -0.807573 | -0.158859 | 1.000000 | 0.066537 | -0.186580 | | Method_label | 0.011529 | 0.003428 | -0.093669 | 0.016300 | 0.005557 | 0.084622 | 0.012097 | 0.029722 | -0.001155 | 0.027549 | -0.027790 | 0.022174 | -0.005943 | 0.029182 | -0.065250 | 0.008709 | 0.066537 | 1.000000 | -0.070270 | | Regionname_label | 0.267904 | 0.166209 | 0.510509 | -0.055349 | 0.165695 | 0.015280 | 0.112987 | 0.022445 | 0.060186 | 0.130729 | 0.141548 | -0.006846 | -0.184688 | -0.237278 | 0.191063 | -0.037968 | -0.186580 | -0.070270 | 1.000000 |

Type 컬럼 전처리

# 원한인코딩 함수
type_onehot = pd.get_dummies(train['Type'])
type_onehot

	h	t	u
0	1	0	0
1	0	0	1
2	0	0	1
3	0	1	0
4	0	1	0
…	…	…	…
10180	1	0	0
10181	1	0	0
10182	1	0	0
10183	1	0	0
10184	0	1	0

10185 rows × 3 columns

type_onehot_test = pd.get_dummies(test['Type'])
type_onehot_test

	h	t	u
0	0	0	1
1	0	0	1
2	1	0	0
3	1	0	0
4	1	0	0
…	…	…	…
3390	1	0	0
3391	1	0	0
3392	0	1	0
3393	1	0	0
3394	0	0	1

3395 rows × 3 columns

train = pd.concat([train, type_onehot], axis=1)
train.head()

	Id	Suburb	Address	Rooms	Type	Method	SellerG	Date	Distance	Postcode	…	YearBuilt	CouncilArea	Lattitude	Longtitude	Regionname	Propertycount	Price	h	t	u
0	5467	Rosanna	22 Douglas St	2	h	S	Miles	19/11/2016	11.4	3084	…	NaN	Banyule	-37.74280	145.07000	Eastern Metropolitan	3540	1200000	1	0	0
1	4365	North Melbourne	103/25 Byron St	1	u	SP	Jellis	16/07/2016	2.3	3051	…	2012.0	Melbourne	-37.80200	144.95160	Northern Metropolitan	6821	450000	0	0	1
2	9741	Surrey Hills	4/40 Durham Rd	3	u	SP	Noel	17/06/2017	10.2	3127	…	NaN	Boroondara	-37.82971	145.09007	Southern Metropolitan	5457	780000	0	0	1
3	11945	Cheltenham	3/33 Sunray Av	2	t	S	Buxton	29/07/2017	17.9	3192	…	NaN	Kingston	-37.96304	145.06421	Southern Metropolitan	9758	751000	0	1	0
4	4038	Mont Albert	7/27 High St	3	t	S	Fletchers	15/10/2016	11.8	3127	…	2001.0	Whitehorse	-37.81670	145.10700	Eastern Metropolitan	2079	1310000	0	1	0

5 rows × 25 columns

test = pd.concat([test, type_onehot_test], axis=1)
test.head()

	Id	Suburb	Address	Rooms	Type	Method	SellerG	Date	Distance	Postcode	…	BuildingArea	YearBuilt	CouncilArea	Lattitude	Longtitude	Regionname	Propertycount	h	u
0	3189	Hawthorn	22/9 Lisson Gr	1	u	S	Biggin	19/11/2016	4.6	3122	…	52.0	1970.0	Boroondara	-37.82610	145.02690	Southern Metropolitan	11308	0	1
1	2539	Fitzroy	113/300 Young St	1	u	SP	Jellis	19/11/2016	1.6	3065	…	52.0	2011.0	Yarra	-37.79740	144.97990	Northern Metropolitan	5825	0	1
2	9171	Greenvale	7 Murray Ct	5	h	S	Barry	3/06/2017	20.4	3059	…	310.0	1990.0	Hume	-37.65439	144.89113	Northern Metropolitan	4864	1	0
3	4741	Port Melbourne	172 Albert St	2	h	S	hockingstuart	10/12/2016	3.8	3207	…	70.0	1910.0	Port Phillip	-37.83460	144.93730	Southern Metropolitan	8648	1	0
4	12455	Brunswick West	47 Everett St	4	h	VB	Nelson	9/09/2017	5.2	3055	…	180.0	2004.0	NaN	-37.75465	144.94144	Northern Metropolitan	7082	1	0

5 rows × 24 columns

Method 컬럼 전처리

# 라벨인코딩
# 캐글의 metadata 정보 참고
method_dict = {
    'S' : 0,
    'SP' : 1,
    'PI' : 2,
    'PN' : 3,
    'SN' : 4,
    'NB' : 5,
    'VB' : 6,
    'W' : 7,
    'SA' : 8,
    'SS' : 9
}
method_label = train['Method'].map(method_dict)
method_label_test = test['Method'].map(method_dict)
method_label

      0
      1
      1
      0
      0
        ..
  0
  2
  0
  2
  0
Name: Method, Length: 10185, dtype: int64

train['Method_label'] = method_label
test['Method_label'] = method_label_test
train.head()

	Id	Suburb	Address	Rooms	Type	Method	SellerG	Date	Distance	Postcode	…	CouncilArea	Lattitude	Longtitude	Regionname	Propertycount	Price	h	t	u	Method_label
0	5467	Rosanna	22 Douglas St	2	h	S	Miles	19/11/2016	11.4	3084	…	Banyule	-37.74280	145.07000	Eastern Metropolitan	3540	1200000	1	0	0	0
1	4365	North Melbourne	103/25 Byron St	1	u	SP	Jellis	16/07/2016	2.3	3051	…	Melbourne	-37.80200	144.95160	Northern Metropolitan	6821	450000	0	0	1	1
2	9741	Surrey Hills	4/40 Durham Rd	3	u	SP	Noel	17/06/2017	10.2	3127	…	Boroondara	-37.82971	145.09007	Southern Metropolitan	5457	780000	0	0	1	1
3	11945	Cheltenham	3/33 Sunray Av	2	t	S	Buxton	29/07/2017	17.9	3192	…	Kingston	-37.96304	145.06421	Southern Metropolitan	9758	751000	0	1	0	0
4	4038	Mont Albert	7/27 High St	3	t	S	Fletchers	15/10/2016	11.8	3127	…	Whitehorse	-37.81670	145.10700	Eastern Metropolitan	2079	1310000	0	1	0	0

5 rows × 26 columns

결측치와 문자형태를 처리해서 다양한 컬럼으로 학습해보자.

train, test 데이터에 원하는 결측치,인코딩 처리 실시
원하는 컬럼 선택
train을 train2와 val로 분리
모델 학습 후 평가
test 데이터를 예측해 kaggle에 업로드

Car 컬럼 전처리

# Car 컬럼의 결측치 채우기
train['Car'].describe()

count    10142.000000
mean         1.613883
std          0.959076
min          0.000000
25%          1.000000
50%          2.000000
75%          2.000000
max         10.000000
Name: Car, dtype: float64

# 중앙값으로 채우기
train['Car'] = train['Car'].fillna(train['Car'].median())
test['Car'] = test['Car'].fillna(train['Car'].median())

Regionname 컬럼 전처리

train['Regionname'].value_counts()

Southern Metropolitan         3525
Northern Metropolitan         2912
Western Metropolitan          2219
Eastern Metropolitan          1096
South-Eastern Metropolitan     348
Eastern Victoria                37
Northern Victoria               30
Western Victoria                18
Name: Regionname, dtype: int64

regionname_dict = {
    'Southern Metropolitan' : 0,
    'Northern Metropolitan' : 1,
    'Western Metropolitan' : 2,
    'Eastern Metropolitan' : 3,
    'South-Eastern Metropolitan' : 4,
    'Eastern Victoria' : 5,
    'Northern Victoria' : 6,
    'Western Victoria' : 7
}

regionname_label = train['Regionname'].map(regionname_dict)
regionname_label_test = test['Regionname'].map(regionname_dict)
regionname_label

train['Regionname_label'] = regionname_label
test['Regionname_label'] = regionname_label_test
train.head()

	Id	Suburb	Address	Rooms	Type	Method	SellerG	Date	Distance	Postcode	…	Lattitude	Longtitude	Regionname	Propertycount	Price	h	t	u	Method_label	Regionname_label
0	5467	Rosanna	22 Douglas St	2	h	S	Miles	19/11/2016	11.4	3084	…	-37.74280	145.07000	Eastern Metropolitan	3540	1200000	1	0	0	0	3
1	4365	North Melbourne	103/25 Byron St	1	u	SP	Jellis	16/07/2016	2.3	3051	…	-37.80200	144.95160	Northern Metropolitan	6821	450000	0	0	1	1	1
2	9741	Surrey Hills	4/40 Durham Rd	3	u	SP	Noel	17/06/2017	10.2	3127	…	-37.82971	145.09007	Southern Metropolitan	5457	780000	0	0	1	1	0
3	11945	Cheltenham	3/33 Sunray Av	2	t	S	Buxton	29/07/2017	17.9	3192	…	-37.96304	145.06421	Southern Metropolitan	9758	751000	0	1	0	0	0
4	4038	Mont Albert	7/27 High St	3	t	S	Fletchers	15/10/2016	11.8	3127	…	-37.81670	145.10700	Eastern Metropolitan	2079	1310000	0	1	0	0	3

5 rows × 27 columns

CouncilArea 전처리

X_train = train[['Rooms', 'Distance', 'Bedroom2', 'Bathroom', 'Car', 'Landsize', 'Propertycount', 'h', 't', 'u', 'Method_label', 'Regionname_label', 'Lattitude', 'Longtitude']]
y_train = train['Price']

X_test = test[['Rooms', 'Distance', 'Bedroom2', 'Bathroom', 'Car', 'Landsize', 'Propertycount', 'h', 't', 'u', 'Method_label', 'Regionname_label', 'Lattitude', 'Longtitude']]
X_train.head()

	Rooms	Distance	Bedroom2	Bathroom	Car	Landsize	Propertycount	h	t	u	Method_label	Regionname_label	Lattitude	Longtitude
0	2	11.4	2	1	1.0	757	3540	1	0	0	0	3	-37.74280	145.07000
1	1	2.3	1	1	1.0	0	6821	0	0	1	1	1	-37.80200	144.95160
2	3	10.2	2	1	1.0	149	5457	0	0	1	1	0	-37.82971	145.09007
3	2	17.9	2	1	1.0	171	9758	0	1	0	0	0	-37.96304	145.06421
4	3	11.8	3	2	2.0	330	2079	0	1	0	0	3	-37.81670	145.10700

X_test.head()

| | Rooms | Distance | Bedroom2 | Bathroom | Car | Landsize | Propertycount | h | t | u | Method_label | Regionname_label | Lattitude | Longtitude | | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | | 0 | 1 | 4.6 | 1 | 1 | 1.0 | 0 | 11308 | 0 | 0 | 1 | 0 | 0 | -37.82610 | 145.02690 | | 1 | 1 | 1.6 | 1 | 1 | 1.0 | 0 | 5825 | 0 | 0 | 1 | 1 | 1 | -37.79740 | 144.97990 | | 2 | 5 | 20.4 | 5 | 3 | 5.0 | 1750 | 4864 | 1 | 0 | 0 | 0 | 1 | -37.65439 | 144.89113 | | 3 | 2 | 3.8 | 2 | 1 | 0.0 | 106 | 8648 | 1 | 0 | 0 | 0 | 0 | -37.83460 | 144.93730 | | 4 | 4 | 5.2 | 4 | 2 | 2.0 | 600 | 7082 | 1 | 0 | 0 | 6 | 1 | -37.75465 | 144.94144 |

검증데이터 분리

X_train3, X_val3, y_train3, y_val3 = train_test_split(X_train, y_train, test_size=0.2, random_state=916)

모델 정의

house_model2 = DecisionTreeRegressor()

모델 학습

house_model2.fit(X_train3, y_train3)

DecisionTreeRegressor()

모델 예측

pred = house_model2.predict(X_val3)

모델 평가

mean_absolute_error(y_val3, pred)

245946.3264604811

캐글 제출

submission['Price'] = house_model2.predict(X_test)
submission.head()

	Id	Price
0	3189	348000.0
1	2539	597000.0
2	9171	725500.0
3	4741	1000000.0
4	12455	1610000.0

submission.to_csv("./data/house/myPrediction.csv",
                 index=False)

모델 최적화

모델복잡도 제어하기
KNN은 이웃의 숫자로 모델의 복잡도 제어
- 이웃의 숫자가 커질수록 단순,
- 숫자가 적어질수록 복잡해진다.
DecisionTree는 질문의 깊이로 모델의 복잡도를 제어(max_depth)
- 깊이가 얕으면 단순,
- 깊이가 깊으면 복잡해진다.

train_score_list = []
val_score_list = []

for d in range(1, 30):
    m = DecisionTreeRegressor(max_depth=d)
    m.fit(X_train3, y_train3)
    pred_train = m.predict(X_train3)
    pred_val = m.predict(X_val3)
    
    score_train = mean_absolute_error(y_train3, pred_train)
    score_val = mean_absolute_error(y_val3, pred_val)
    train_score_list.append(score_train)
    val_score_list.append(score_val)
    print(d, ":", score_val)

: 417733.49974970485
: 343806.0797298899
: 296665.1121222014
: 266627.09369623766
: 250658.9323204579
: 240011.4545997507
: 228006.44228511458
: 225544.86401968033
: 223158.24205112088
: 222866.61170201903
: 222247.53782971326
: 226750.59009251397
: 232785.61967598234
: 236449.60651906623
: 238339.72952208374
: 235256.46193522654
: 239032.74728354442
: 246129.8931403534
: 245102.87083875437
: 248124.2068576584
: 248081.48209490062
: 247554.49879324844
: 247564.5162821142
: 243753.52724594992
: 248604.58184694266
: 244163.59204712813
: 244919.02258222876
: 245388.98429062346
: 248771.09327442318

# 시각화
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 5))
plt.plot(range(1, 30), train_score_list)
plt.plot(range(1, 30), val_score_list)
plt.show()

max_depth가 11일때 가장 낮은 MAE값을 얻을 수 있다.

Twitter Facebook LinkedIn

minyoongi96

Decision Tree 집값 예측(회귀)

목표

2. 모델 학습

3. 모델 예측

4. 모델 평가

캐글에 업로드 하기

다른 컬럼을 이용해보자.

결측치와 문자형태를 처리해서 다양한 컬럼으로 학습해보자.

모델 최적화

공유하기

댓글남기기

참고

S3 CORS 테스트해보기

S3 버킷 정책 설정 & 암호화하기

S3 생성 & 퍼블릭 ACL 설정하여 외부 접근해보기

S3에 대한 기본 개념