Skip to content

Data analysis project on school-age digital connectivity, exploring internet access disparities among children aged 3-17 years across various global regions. Using UNICEF's "School-Age-Digital-Connectivity-A1data.csv" dataset, the project delves into differences based on income levels and rural-urban divides, showcasing insights through statistical

Notifications You must be signed in to change notification settings

prottayislive/eda-school-connectivity

Repository files navigation

Report on Data Cleaning Process of School-Age Digital Connectivity Dataset

View Notebook for code and visualisations

Overview

The data cleaning process was applied to the "School-Age-Digital-Connectivity-A1data.csv" dataset. The primary goal was to prepare the dataset for further analysis by addressing various data quality issues. The dataset contains information about the digital connectivity of children in school attendance age, with internet connection at home.

Data Cleaning Steps

  1. Initial Data Loading and Inspection:

    • The dataset was loaded into a pandas DataFrame using pd.read_csv.
    • Initial examination of the data was performed using df.head() to understand the structure and content.
  2. Whitespace Trimming:

    • Whitespaces were stripped from column names and string values within the dataset. This step is crucial for consistency and to avoid errors during data manipulation caused by unnoticed trailing or leading spaces.
  3. Missing Values Analysis:

    • Missing values in the dataset were identified using df.isnull().sum(). Addressing missing values is vital for ensuring the integrity of statistical analyses.
  4. Column Statistics Compilation:

    • Each column's unique values and data types were analyzed to understand the data's diversity and structure. This step helps in identifying data inconsistencies and errors.
  5. Data Type Conversion:

    • Percentage columns were converted from string to float types after stripping the '%' character. Accurate data type representation is crucial for numerical computations.
  6. Handling Missing Values:

    • Missing values in specific columns were replaced with the respective column's median. This method is a standard practice for dealing with missing numeric data, as it preserves the overall distribution of the data.
  7. Conversion to Integer Types:

    • After handling missing values, certain columns were converted to integers for consistency and ease of analysis.
  8. Duplicate Rows Removal:

    • Duplicate rows were identified and removed. This step is essential to prevent skewed analysis results due to repeated entries.
  9. Typo Correction in 'Income Group' Column:

    • A typo was corrected in the 'Income Group' column to ensure consistency in categorical data.
  10. Extraction and Conversion of 'Time period' Column:

    • The 'Time period' column was extracted and converted to an integer type. This step ensures that time-related data is in a consistent and analyzable format.
  11. Removal of Rows with Impossible Time Periods:

    • Rows with time periods beyond 2023 were removed as they are not possible and could represent data entry errors.
  12. Irrelevant Columns Removal:

    • Columns not relevant to the analysis ('ISO3', 'Region', 'Sub-region', 'Data source', 'Time period') were dropped. Removing irrelevant data simplifies the dataset and focuses analysis on pertinent information.
  13. Final Dataset Description:

    • A descriptive statistics summary of the cleaned dataset was obtained using df.describe() to provide an overview of the data's distribution post-cleanup.
  14. Saving the Cleaned Data:

    • The cleaned dataset was saved to a new CSV file "cleaned_A1data.csv" for further analysis. This step is crucial for documenting the data cleaning process and providing a clean dataset for subsequent tasks.

Discussion of Results from Data Exploration of School-Age Digital Connectivity Dataset

1. Income Group Analysis

The boxplot visualization and mean calculation of the "Poorest (Wealth quintile)" across different income groups revealed critical insights:

  • High vs. Low Income Groups: There is a stark contrast in digital connectivity between high-income (mean: 30.4%) and low-income groups (mean: 2.13%). This disparity highlights the digital divide, where children in wealthier nations have significantly better access to internet connectivity than those in poorer countries.
  • Income Group Variability: The boxplot visualization likely showed a wide range of variability within each income group. This suggests that even within income categories, there are significant differences in digital connectivity among countries.

2. Rural and Urban Residence Comparison

The median comparison between rural (13.5%) and urban (40.5%) residences indicates a pronounced digital divide based on the living environment:

  • Urban Advantage: The higher median in urban areas suggests that urban children have better access to the internet compared to their rural counterparts. This could be due to better infrastructure and availability of services in urban areas.
  • Rural Challenges: The lower median in rural areas points to challenges such as limited infrastructure, which hampers internet access. This gap is a concern for equitable educational and developmental opportunities for rural children.

3. Top 10 Countries Analysis

Identifying the top 10 countries with the highest internet connectivity in rural and urban areas provides a nuanced understanding of geographical disparities:

  • Leading Countries: Countries like the Dominican Republic and Serbia, which lead in rural and urban connectivity respectively, may have effective policies or infrastructural advantages that support digital access.
  • Comparative Insights: The differences between the top countries in rural and urban settings may reflect varied national priorities or geographical challenges that impact internet connectivity.

4. Upper Middle Income Group Disparity

The statistical analysis within the "Upper middle income (UM)" group showed a significant disparity between the poorest and richest quintiles:

  • Economic Inequality Impact: Despite being in the same income group, there is a large gap between the poorest (mean: 32.67%) and richest (mean: 83.85%) in terms of internet access. This suggests that economic inequality within a country can significantly affect digital connectivity.
  • Policy Implications: These findings indicate the need for targeted policies within countries to bridge the digital divide, not just between but also within income groups.

Overall Implications

The results from this data exploration underscore the multifaceted nature of the digital divide. While income level is a significant determinant of digital connectivity, factors like urban vs. rural residence and internal economic disparities within countries also play crucial roles. The findings highlight the need for nuanced policy interventions that address these varied aspects to ensure equitable digital access for all children, which is essential for their educational and social development in an increasingly digital world.

About

Data analysis project on school-age digital connectivity, exploring internet access disparities among children aged 3-17 years across various global regions. Using UNICEF's "School-Age-Digital-Connectivity-A1data.csv" dataset, the project delves into differences based on income levels and rural-urban divides, showcasing insights through statistical

Topics

Resources

Stars

Watchers

Forks