Before modeling, understanding the data is the key step. We therefore carry out Exploratory Data Analysis (EDA) to understand the data as fully as possible. That is the main scope of this work.
For fun, the global shark attacks dataset is used in this work. Let's follow the steps below.
- Data acquisition
- Dataset
- Preprocessing
- Data mining: EDA, correlation matrix, visualization
- Feature engineering
Download the dataset from the Kaggle link below.
🦆 Kaggle data Global shark attacks file
First, read the CSV file and inspect all variables.
import pandas as pd

# Load the dataset with the ISO-8859-1 encoding
df0 = pd.read_csv("attacks.csv", encoding="ISO-8859-1")
# Describe the dataset (describe must be called, otherwise the bound method is printed)
print('================================================')
print(df0.describe())
print('================================================')
print('total size :', df0.shape)
The total size of the data is 25723 rows and 24 columns.
Case Number | Date | Year | Type | Country | Area | Location | Activity | Name | Sex | Age | Injury | Fatal (Y/N) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2018.05.26.a | 26-May-2018 | 2018 | Unprovoked | USA | Florida | Daytona | Standing | male | M | 12 | Minor injury to foot | N |
Time | Species | Investigator or Source | pdf | href formula | href | Case Number.1 | Case Number.2 | original order | Unnamed: 22 | Unnamed: 23 |
---|---|---|---|---|---|---|---|---|---|---|
14h00 | nan | K. McMurray, Tracking Sharks.com | 2018.05.26.a-DaytonaBeach.pdf | http://sharkattackfile.net/spreadsheets/pdf_directory/2018.05.26.a-DaytonaBeach.pdf | http://sharkattackfile.net/spreadsheets/pdf_directory/2018.05.26.a-DaytonaBeach.pdf | 2018.05.26.a | 2018.05.26.a | 6294 | nan | nan |
In detail, it is necessary to understand every column, its contents, and whether its values are numerical or categorical.
Case Number
: Date

Date
: LITR, date when the incident happened.

Year
: LITR, year

Type
: provoked, unprovoked, ...

Country

Area
: area, especially states in the USA

Location
: exact location

Activity
: what kind of activity the victims were doing when they were attacked by sharks - [delete]

Name
: name of the victim

Sex
: LITR, sex. Male or Female

Age
: LITR, age

Injury
: parts wounded in the attack

Fatal (Y/N)
: seriousness

Time
: exact time or a time slot

Species
: what kind of shark - [delete]

Investigator or Source
: name of whoever investigated the incident - [delete]

pdf
: title of the investigation file - [delete]

href formula
: directory of the file - [delete, duplicate]

href
: same as href formula

Case Number.1
: case number, same as the date - [delete, duplicate]

Case Number.2
: same as Case Number.1 - [delete, duplicate]

original order
: order of incidents - [delete]

Unnamed: 22
: all nan - [delete]

Unnamed: 23
: all nan
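
The columns tagged [delete] in the list above could be removed in one step. The following is a minimal sketch, not the author's actual code: the helper `drop_marked_columns`, the constant `COLUMNS_TO_DROP`, and the tiny stand-in frame are illustrative assumptions, and the column names are taken from the list as written (the names in the real file may differ slightly, for example in whitespace).

```python
import pandas as pd

# Tiny stand-in frame using a few of the column names listed above.
# The values are placeholders, not real records.
sample = pd.DataFrame({
    "Case Number": ["2018.05.26.a"],
    "Year": [2018],
    "Species": [None],
    "href formula": ["x.pdf"],
    "Unnamed: 22": [None],
})

# Hypothetical constant: the columns tagged [delete] in the list above.
COLUMNS_TO_DROP = [
    "Activity", "Species", "Investigator or Source", "pdf", "href formula",
    "Case Number.1", "Case Number.2", "original order", "Unnamed: 22",
]

def drop_marked_columns(df, columns=COLUMNS_TO_DROP):
    """Return a copy of df without the listed columns; absent names are skipped."""
    return df.drop(columns=columns, errors="ignore")

trimmed = drop_marked_columns(sample)
print(list(trimmed.columns))
```

Passing `errors="ignore"` keeps the call from failing when a listed name is not present, which is convenient while the exact column spelling is still being checked.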
There are several errors, including missing values, duplicates, noise, categorical values, and mixed values. Data cleaning and preprocessing are therefore strongly required, and different approaches should then be considered for the data analysis.
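
One way to put numbers on these issues is a per-column summary of types, missing values, and distinct values, plus a count of fully duplicated rows. This is a rough sketch on a made-up toy frame: `quality_report` is a hypothetical helper name, and the toy values only imitate the kind of mixed entries described above.

```python
import pandas as pd

def quality_report(df):
    """Per-column dtype, missing-value count, and number of distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(dropna=True),
    })

# Toy data (not real records): Age mixes numeric-looking and free-text
# entries, one value is missing, and the last row repeats the first.
toy = pd.DataFrame({
    "Age": ["12", None, "30s", "12"],
    "Sex": ["M", "F", "M", "M"],
    "Year": [2018, 2017, 2017, 2018],
})

report = quality_report(toy)
n_duplicates = int(toy.duplicated().sum())

print(report)
print("fully duplicated rows:", n_duplicates)
```

On the real data the same call would be `quality_report(df0)`. Columns with a text dtype and a large `n_unique` are candidates for closer inspection of mixed values.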
Before moving to EDA, preprocessing is required because of several issues:
categorical data -> numerical data
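
As an illustration of the categorical-to-numerical step, the sketch below encodes two columns on a toy frame. It is one possible approach, not necessarily the one used later in this work: the new column names `fatal_flag` and `sex_code` are hypothetical, and the toy values (including the `"UNKNOWN"` entry) are made up.

```python
import pandas as pd

# Toy data (not real records) using two column names from the dataset.
toy = pd.DataFrame({
    "Sex": ["M", "F", "M", None],
    "Fatal (Y/N)": ["N", "Y", "UNKNOWN", "N"],
})

# Binary column: explicit mapping; any value outside the map becomes NaN,
# so unexpected entries stay visible instead of being silently encoded.
toy["fatal_flag"] = toy["Fatal (Y/N)"].map({"Y": 1, "N": 0})

# Nominal column: integer codes from pandas' categorical dtype.
# Categories are sorted, and missing values receive the code -1.
toy["sex_code"] = toy["Sex"].astype("category").cat.codes

print(toy)
```

An explicit mapping suits columns with a small, known set of values, while category codes are a quick option for columns with many levels. For models that should not read an order into the codes, one-hot encoding with `pd.get_dummies` is a common alternative.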