# -*- coding: utf-8 -*-
"""307DBSCAN.ipynb
Automatically generated by Colaboratory.
Original file is located at
https://colab.research.google.com/drive/1wesAUODTosN-MWjyzc_nhPLdRG9qxALC
This training material is developed by Dr. Yang Song at UNC Wilmington.
This work is supported by NSF #2230046. You may use this material for non-commercial purposes, but please retain proper attribution by keeping this notice.
### Assume that Dr. Yang Song is a poor PhD student, and he cannot say no to his advisor...
Dr. Song has a clustering problem to solve, and he sampled 20% of the data for testing purposes. He has built the model below with `t4.8k_sample.csv`.
The full dataset is [here](https://drive.google.com/file/d/1Te8tCSFKzSVSTJmVtZEMr8m0CPdeZewo/view?usp=sharing) and the sampled dataset is [here](https://drive.google.com/file/d/11WQT9PUmmJ_af2CS8UOYFaC1AVGxqFy5/view?usp=sharing).
Below is his finished code; you don't need to make any changes.
"""
from google.colab import drive
drive.mount('/content/drive')
import pandas as pd
import numpy as np
# The data file is in Google Drive at root/Colab Notebooks/data/t4.8k_sample.csv
# For the full data, it can be read like below:
# df = pd.read_csv("data/t4.8k.txt", sep=" ", names=['x','y'])
df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/data/t4.8k_sample.csv')
df.head()
df.shape
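# Optional: a quick scatter plot of the raw points is a useful sanity check on the
# coordinate scale before clustering. This is a minimal sketch, not part of Dr. Song's
# original notebook; it assumes the first two columns are the x/y coordinates.
import matplotlib.pyplot as plt
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], s=2)
plt.title("t4.8k_sample: raw points")
plt.show()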
from sklearn.cluster import DBSCAN
# This is the model Dr. Song tuned with the sampled data.
db = DBSCAN(eps=15, min_samples=15, metric='euclidean')
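# For reference, eps is commonly chosen from a k-distance plot: sort each point's
# distance to its min_samples-th nearest neighbor and look for a "knee" in the curve.
# A minimal sketch of how eps=15 might have been chosen; this is an illustration,
# not Dr. Song's actual tuning code:
from sklearn.neighbors import NearestNeighbors
dists, _ = NearestNeighbors(n_neighbors=15).fit(df).kneighbors(df)
plt.plot(np.sort(dists[:, -1]))
plt.ylabel("distance to 15th nearest neighbor")
plt.show()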
labels = db.fit_predict(df)
df["cluster_lables"] = labels
df.head(2)
df.cluster_labels.value_counts()
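# Optional: visualize the clusters (DBSCAN labels noise points as -1). A minimal
# sketch, not part of the original notebook:
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c=df["cluster_labels"], s=2, cmap="tab20")
plt.title("DBSCAN clusters (eps=15, min_samples=15)")
plt.show()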
# On HPC, this line should be changed, too!
df.to_csv("/content/drive/My Drive/Colab Notebooks/data/t4.8k_output.csv")
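# A minimal sketch of the HPC variant: there is no Google Drive mount there, so the
# read/write lines above would use local paths instead. The "data/" directory below
# is an assumption (the exact layout depends on your cluster), not part of the
# original code:
# df = pd.read_csv("data/t4.8k.txt", sep=" ", names=['x','y'])
# df.to_csv("data/t4.8k_output.csv")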