This is the first publicly available dataset for the Albanian language. It contains more than 3 million news articles from various Albanian news sources (see list below).
After scraping all of the news sites through their WordPress APIs, we merged the data into this file. To identify the origin of each news article, we also added the source to each post.
All available articles from the first one posted on each page until 27.08.2020 are stored in the file.
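The scrape-and-merge step described above can be sketched roughly as follows. This is a minimal illustration, not the code used to build the dataset: it assumes each site exposes the standard WordPress REST route `/wp-json/wp/v2/posts`, and the helper names (`posts_url`, `fetch_all_posts`, `tag_source`, `merge_sources`) and the `source` field name are hypothetical.

```python
import json
from urllib.error import HTTPError
from urllib.parse import urlencode
from urllib.request import urlopen


def posts_url(site, page, per_page=100):
    """Build the standard WordPress REST API URL for one page of posts."""
    query = urlencode({"per_page": per_page, "page": page})
    return f"{site.rstrip('/')}/wp-json/wp/v2/posts?{query}"


def fetch_all_posts(site, per_page=100):
    """Yield every post a site exposes, page by page (performs network requests)."""
    page = 1
    while True:
        try:
            with urlopen(posts_url(site, page, per_page)) as resp:
                batch = json.load(resp)
        except HTTPError:
            # Simplification: WordPress returns an HTTP error once the page
            # number is past the last page, so any HTTP error ends the loop.
            break
        if not batch:
            break
        yield from batch
        page += 1


def tag_source(posts, source):
    """Return copies of the posts with a 'source' field added (assumed field name)."""
    return [{**post, "source": source} for post in posts]


def merge_sources(posts_by_site):
    """Merge per-site post lists into one list, tagging each post with its origin."""
    merged = []
    for site, posts in posts_by_site.items():
        merged.extend(tag_source(posts, site))
    return merged
```

For example, `merge_sources({site: list(fetch_all_posts(site)) for site in sites})` would produce a single list in which every post carries its originating site, where `sites` is a list of base URLs such as the ones listed in this document. A real scraper would also want rate limiting and retry handling, which this sketch leaves out.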
The articles were taken from the following news sites:
- https://www.gazetaexpress.com/
- https://insajderi.com/
- https://gazetablic.com
- https://ballkani.info/
- https://indeksonline.net/
- https://klankosova.tv/ (removed in V2 because the API is no longer public)
- https://kallxo.com/
- https://lajmi.net/
- https://telegrafi.com/
- https://www.kungulli.com/ (Satire)
Download the dataset: The Kosovo News Articles Dataset is available for download on Kaggle.