SocialED Datasets

This repository contains the datasets used by the SocialED Python library for social event detection tasks.

📁 Repository Structure

SocialED_dataset
├── npy_data/          # Preprocessed datasets in .npy format
├── raw_data/          # Original raw datasets
└── README.md

📊 Dataset Overview

This repository includes 15 widely used datasets for social event detection, covering multiple languages and various event types:

| Dataset | Language | Events | Texts | Long tail |
|---|---|---|---|---|
| Event2012 | English | 503 | 68,841 | No |
| Event2018 | French | 257 | 64,516 | No |
| Arabic_Twitter | Arabic | 7 | 9,070 | No |
| MAVEN | English | 164 | 10,242 | No |
| CrisisLexT26 | English | 26 | 27,933 | No |
| CrisisLexT6 | English | 6 | 60,082 | No |
| CrisisMMD | English | 7 | 18,082 | No |
| CrisisNLP | English | 11 | 25,976 | No |
| HumAID | English | 19 | 76,484 | No |
| Mix_Data | English | 5 | 78,489 | No |
| KBP | English | 100 | 85,569 | No |
| Event2012_100 | English | 100 | 15,019 | Yes |
| Event2018_100 | French | 100 | 19,944 | Yes |
| Arabic_7 | Arabic | 7 | 3,022 | Yes |
| CrisisLexT7 | English | 7 | 1,959 | Yes |
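
To pick a dataset by language or by long-tail status, the overview rows can be encoded as plain records and filtered. The sketch below is only an illustration: the `DatasetInfo` type and the `select` helper are hypothetical names and are not part of the SocialED library; the values are copied from the overview.

```python
from typing import NamedTuple, Optional


class DatasetInfo(NamedTuple):
    """One row of the dataset overview (hypothetical helper type)."""
    name: str
    language: str
    events: int
    texts: int
    long_tail: bool


# Values copied from the dataset overview in this README.
OVERVIEW = [
    DatasetInfo("Event2012", "English", 503, 68_841, False),
    DatasetInfo("Event2018", "French", 257, 64_516, False),
    DatasetInfo("Arabic_Twitter", "Arabic", 7, 9_070, False),
    DatasetInfo("MAVEN", "English", 164, 10_242, False),
    DatasetInfo("CrisisLexT26", "English", 26, 27_933, False),
    DatasetInfo("CrisisLexT6", "English", 6, 60_082, False),
    DatasetInfo("CrisisMMD", "English", 7, 18_082, False),
    DatasetInfo("CrisisNLP", "English", 11, 25_976, False),
    DatasetInfo("HumAID", "English", 19, 76_484, False),
    DatasetInfo("Mix_Data", "English", 5, 78_489, False),
    DatasetInfo("KBP", "English", 100, 85_569, False),
    DatasetInfo("Event2012_100", "English", 100, 15_019, True),
    DatasetInfo("Event2018_100", "French", 100, 19_944, True),
    DatasetInfo("Arabic_7", "Arabic", 7, 3_022, True),
    DatasetInfo("CrisisLexT7", "English", 7, 1_959, True),
]


def select(rows, language: Optional[str] = None, long_tail: Optional[bool] = None):
    """Return dataset names matching the given filters (None means 'any')."""
    return [
        row.name
        for row in rows
        if (language is None or row.language == language)
        and (long_tail is None or row.long_tail == long_tail)
    ]


long_tail_names = select(OVERVIEW, long_tail=True)
french_names = select(OVERVIEW, language="French")
print(long_tail_names)
print(french_names)
```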

📝 Dataset Descriptions

General Event Detection Datasets

  • Event2012 [Paper]

    • 68,841 annotated English tweets
    • 503 distinct event categories
    • Collected over a continuous 29-day period
    • Rich temporal context for event analysis
  • Event2018 [Paper]

    • 64,516 annotated French tweets
    • 257 event categories
    • 23 consecutive days of data
    • Valuable insights into French social media patterns
  • Arabic_Twitter

    • 9,070 annotated Arabic tweets
    • 7 major catastrophic events
    • Focus on crisis-related social media behavior

Crisis-Related Datasets

  • CrisisLexT26

    • 27,933 tweets covering 26 crisis events
    • Focus on emergency situations
  • CrisisLexT6

    • 60,082 tweets documenting 6 major crises
    • Detailed public communication patterns
  • CrisisMMD

    • 18,082 manually annotated tweets
    • 7 major natural disasters in 2017
    • Multimodal data including text and images
  • CrisisNLP

    • 25,976 tweets spanning 11 events
    • Human-annotated data
    • Specialized crisis information analysis
  • HumAID

    • 76,484 manually annotated tweets
    • 19 major natural disasters (2016-2019)
    • Diverse disaster types and locations

Mixed and Specialized Datasets

  • MAVEN [Paper]

    • 10,242 annotated texts
    • 164 event types
    • Domain-agnostic event detection
  • Mix_Data

    • Composite dataset including:
      • ICWSM2018: 21,571 expert-labeled tweets
      • ISCRAM2013: 4,676 annotated tweets
      • ISCRAM2018: 49,804 tweets
      • BigCrisisData: 2,438 classified tweets
  • KBP

    • 85,569 tweets
    • 100 event types
    • Long-tail event detection
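
The Mix_Data total reported in the overview (78,489 texts) can be cross-checked against its four listed components. The snippet below is a small arithmetic check using only the counts given in this README; the variable names are illustrative.

```python
# Component sizes of Mix_Data as listed in this README.
mix_data_components = {
    "ICWSM2018": 21_571,
    "ISCRAM2013": 4_676,
    "ISCRAM2018": 49_804,
    "BigCrisisData": 2_438,
}

# Total reported for Mix_Data in the dataset overview.
reported_total = 78_489

mix_data_total = sum(mix_data_components.values())
print(mix_data_total == reported_total)
```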

🔧 Usage

These datasets are ready to use with the SocialED library. You can find:

  • Preprocessed data in npy_data/
  • Original data in raw_data/
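
As a minimal sketch of reading a preprocessed file directly, assuming the files in `npy_data/` are standard NumPy `.npy` arrays, `numpy.load` can be used. The README does not document the file names or array layout, so the example writes and reads a stand-in file; `load_npy_dataset`, `summarize`, and `demo_dataset.npy` are hypothetical names, not part of SocialED.

```python
import tempfile
from pathlib import Path

import numpy as np


def load_npy_dataset(path, allow_pickle=False):
    """Load one array from a .npy file.

    allow_pickle is off by default; enable it only for files you trust,
    since object arrays are stored with pickle.
    """
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"No such dataset file: {path}")
    return np.load(path, allow_pickle=allow_pickle)


def summarize(array):
    """Return basic facts about a loaded array."""
    return {"shape": array.shape, "dtype": str(array.dtype)}


# Stand-in for a file under npy_data/ (real file names are not given here).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_dataset.npy"
    np.save(demo_path, np.array([[0, 1], [2, 3], [4, 5]], dtype=np.int64))
    demo = load_npy_dataset(demo_path)
    demo_summary = summarize(demo)

print(demo_summary)
```

To try this on the repository data, replace `demo_path` with the path to a file under `npy_data/`. If a file holds Python objects (for example, mixed text and labels), loading it would need `allow_pickle=True`.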

📚 Citation

If you use these datasets in your research, please cite both the original dataset papers and the SocialED library:

@misc{zhang2024socialedpythonlibrarysocial,
      title={SocialED: A Python Library for Social Event Detection}, 
      author={Kun Zhang and Xiaoyan Yu and Pu Li and Hao Peng and Philip S. Yu},
      year={2024},
      eprint={2412.13472},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2412.13472},
}

🔗 Related Links

📄 License

This dataset collection is released under the same license as the SocialED library. Please refer to individual dataset papers for their specific terms of use.