🦈 Global Shark Attacks data

Before modeling, data understand is the key part. Thus, we are going to do Exploratory Data Analysis (EDA) and trying to understand data fully. That is the main scope of the work. Let's follow the steps.

For fun, the global shark attacks dataset has been considered in this work.

☾ Table of contents

Data acquisition
Dataset
Preprocessing
Data mining : EDA, correaltion matrix, visualization
Features engineering

☺︎ Data acquisition

download data set from the kaggle link below

🦆 Kaggle data Global shark attacks file

☺︎ Dataset

First, you read csv file and check all variables.

df0 = pd.read_csv("attacks.csv",encoding = "ISO-8859-1")

#..... describe dataset 
print('================================================')
print(df0.describe)
print('================================================')
print('total size :',df0.shape)

☻ Variables

Total size of data is 25723 rows and 24 columns.

Case Number	Date	Year	Type	country	Area	Location	Activity	Name	Sex	Age	Injury	Fatal (Y/N)
2018.05.26.a	26-May-2018	2018	Unproviked	USA	Florida	Daytona	Standing	male	M	12	Minor injury to foot	N

Time	Species	Investigator or Source	pdf	href formula	href	Case Number.1	Case Number.2	original order	Unnamed: 22	Unnamed: 23
14h00	nan	K. McMurray, Tracking Sharks.com	2018.05.26.a-DaytonaBeach.pdf	http://sharkattackfile.net/spreadsheets/pdf_directory/2018.05.26.a-DaytonaBeach.pdf	http://sharkattackfile.net/spreadsheets/pdf_directory/2018.05.26.a-DaytonaBeach.pdf	2018.05.26.a	2018.05.26.a	6294	nan	nan

In details, it is necessary to understand all columns, contents, and values whether numerical or categorical.

Case Number : Date
Date : LITR, date when incident happended.
Year : LITR, year
Type : provoke, unprovoked, ...
Country
Area : area, espeically states in the USA
Location : exact location
Activity : what kinda activities when they were attacked by sharks
[delete] Name : name of victims
Sex : LITR, sex. Male or Female
Age : LITR, age
Injury : wounded parts in the attack
Fatal (Y/N) : seriousness
Time : Exact time or a time slot
Species : what kinda sharks
[delete] Investigator or Source : name who invested the incident
[delete] pdf : title of the investing file
[delete] href formula : directory of file
[delete, duplicate] href : same w/ href formula
Case Number.1 : Casenumber, same w/ date
[delete, duplicate] Case Number.2 : same w/ Case Number.1
[delete, duplicate]original order : order of incidents
[delete] Unnamed: 22 : all nan
[delete] Unnamed: 23 : all nan

☻ Issues

There are several erorrs including missing values, duplication, noise, categorical values, mixing values etc. Thus, data cleaning and preprocessing are strongly required and then different approaches should be considered for data analysis.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

🦈 Global Shark Attacks data

☾ Table of contents

☺︎ Data acquisition

☺︎ Dataset

☻ Variables

☻ Issues

☺︎ Preprocessing

☻ Data cleaning

☻ Encoding

☺︎ Data mining

☻ Data visualization

☻ Features correlation

☻ Reference

☻ image sources

Files

README.md

Latest commit

History

README.md

File metadata and controls

🦈 Global Shark Attacks data

☾ Table of contents

☺︎ Data acquisition

☺︎ Dataset

☻ Variables

☻ Issues

☺︎ Preprocessing

☻ Data cleaning

☻ Encoding

☺︎ Data mining

☻ Data visualization

☻ Features correlation

☻ Reference

☻ image sources