forked from wildingka/decisiontreeid3
data.py
import numpy as np
import os
import csv
def load_data(data_path):
    """
    Associated test case: tests/test_data.py

    Reading and manipulating data is a vital skill for machine learning.

    This function loads the data in the csv file at data_path into two numpy arrays:
    features (of size NxK) and targets (of size Nx1), where N is the number of rows
    and K is the number of features.

    data_path leads to a comma-delimited csv file in which each row corresponds to a
    different example. Each row contains binary features for that example
    (e.g. chocolate, fruity, caramel, etc.). The last column holds the label for the
    example: how likely it is to win a head-to-head matchup with another candy
    bar.

    This function reads in the csv file and splits each row across two numpy arrays.
    The first array contains the features for each row. For example, in candy-data.csv
    the features are:

    chocolate,fruity,caramel,peanutyalmondy,nougat,crispedricewafer,hard,bar,pluribus

    The second array contains the targets for each row. The targets are in the last
    column of the csv file (labeled 'class'). The first row of the csv file contains
    the labels for each column and shouldn't be read into an array.

    Example:
    chocolate,fruity,caramel,peanutyalmondy,nougat,crispedricewafer,hard,bar,pluribus,class
    1,0,1,0,0,1,0,1,0,1

    should be turned into:

    [1,0,1,0,0,1,0,1,0] (features) and [1] (targets).

    This should be done for each row in the csv file, making arrays of size NxK and Nx1.

    Args:
        data_path (str): path to csv file containing the data

    Output:
        features (np.array): numpy array of size NxK containing the K features
        targets (np.array): numpy array of size Nx1 containing the label for each example.
        attribute_names (list): list of strings containing names of each attribute
            (headers of csv)
    """
    # Skip the header row and read the remaining rows as floats.
    # ndmin=2 keeps the array two-dimensional even when the file has a single data row.
    raw = np.loadtxt(data_path, delimiter=",", skiprows=1, ndmin=2)
    features = raw[:, :-1]
    targets = raw[:, -1]
    # Read only the header row and drop the name of the label column ('class').
    attribute_names = list(np.loadtxt(data_path, delimiter=",", max_rows=1, dtype=str))[:-1]
    return features, targets, attribute_names
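# The loading steps described in the load_data docstring can also be done in a single
# pass over the file with the csv module. The sketch below is illustrative only: the
# name load_data_single_pass, the temporary example file, and its second data row are
# assumptions and are not part of the original assignment.

```python
import csv
import os
import tempfile

import numpy as np


def load_data_single_pass(data_path):
    """Hypothetical variant of load_data that opens the file only once."""
    with open(data_path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)  # first row holds the column names
        rows = [[float(value) for value in row] for row in reader if row]
    # Reshape so that an empty file still yields a 2-D array with K+1 columns.
    data = np.array(rows, dtype=float).reshape(len(rows), len(header))
    return data[:, :-1], data[:, -1], header[:-1]


# Write the docstring's example row, plus one made-up row, to a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("chocolate,fruity,caramel,peanutyalmondy,nougat,"
              "crispedricewafer,hard,bar,pluribus,class\n")
    tmp.write("1,0,1,0,0,1,0,1,0,1\n")
    tmp.write("0,1,0,0,0,0,0,0,1,0\n")
    example_path = tmp.name

example_features, example_targets, example_names = load_data_single_pass(example_path)
os.remove(example_path)
print(example_features.shape, example_targets.shape, example_names[:3])
```

# Reading the header and the data in the same pass avoids parsing the file twice, at
# the cost of building an intermediate Python list before converting to numpy.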
def train_test_split(features, targets, fraction):
    """
    Randomly split features and targets into training and testing sets. N points are
    sampled from the data for training and the remaining (features.shape[0] - N)
    points are used for testing, where:

        N = int(features.shape[0] * fraction)

    Returns train_features (size NxK), train_targets (Nx1), test_features (size MxK
    where M is the number of remaining points in the data), and test_targets (Mx1).

    Special case: when fraction is 1.0, the training and test splits should be exactly
    the same (i.e. return the entire feature and target arrays for both splits).

    Args:
        features (np.array): numpy array containing features for each example
        targets (np.array): numpy array containing labels corresponding to each example.
        fraction (float between 0.0 and 1.0): fraction of examples to be drawn for training

    Returns:
        train_features: subset of features containing N examples to be used for training.
        train_targets: subset of targets corresponding to train_features.
        test_features: subset of features containing M examples to be used for testing.
        test_targets: subset of targets corresponding to test_features.
    """
    if fraction > 1.0:
        raise ValueError('N cannot be bigger than number of examples!')
    elif fraction == 1.0:
        # Special case: train and test are both the full dataset.
        train_features = features
        train_targets = targets
        test_features = features
        test_targets = targets
    else:
        total = len(features)
        n = int(fraction * total)
        possibilities = np.arange(total)
        # Draw n distinct row indices for training; sorting preserves the original row order.
        chosen = np.sort(np.random.choice(possibilities, n, replace=False))
        # Boolean mask selecting every row that was not drawn for training.
        notchosen = ~np.isin(possibilities, chosen)
        train_features = features[chosen, :]
        train_targets = targets[chosen]
        test_features = features[notchosen, :]
        test_targets = targets[notchosen]
    return train_features, train_targets, test_features, test_targets
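

# The split described in the train_test_split docstring can also be expressed with a
# single shuffled index array and a seeded generator, which makes the split
# reproducible. This is a sketch under stated assumptions: permutation_split, its
# seed parameter, and the demo arrays are hypothetical and not part of the original file.

```python
import numpy as np


def permutation_split(features, targets, fraction, seed=None):
    """Hypothetical variant of train_test_split built on a shuffled index array."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0.0 and 1.0")
    if fraction == 1.0:
        # Special case from the docstring: both splits are the full dataset.
        return features, targets, features, targets
    rng = np.random.default_rng(seed)
    order = rng.permutation(features.shape[0])
    n = int(features.shape[0] * fraction)
    # The first n shuffled indices go to training, the rest to testing.
    train_idx = np.sort(order[:n])
    test_idx = np.sort(order[n:])
    return features[train_idx], targets[train_idx], features[test_idx], targets[test_idx]


demo_features = np.arange(20).reshape(10, 2)
demo_targets = np.arange(10)
train_x, train_y, test_x, test_y = permutation_split(
    demo_features, demo_targets, 0.8, seed=0
)
print(train_x.shape, test_x.shape)  # (8, 2) (2, 2)
```

# Slicing one permutation guarantees that the two index sets are disjoint and together
# cover every row, so no separate membership mask is needed.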