NPR Morning Edition Website Song Scraper

NPR plays some great chill interlude music throughout the day, but only a snippet of each song. I wanted a playlist I could jam to while working, so I wrote these scripts to scrape the last week's interlude song titles and artists from NPR's website, search YouTube (without API calls) for the first result for each search term, and create a YouTube playlist of all the songs.

There are several files included here. The goal was to browse a show's archive on npr.org and automatically make playlists going back as far as you wanted, e.g. the last 4 weeks. There are two problems with this. First, the method for creating the YouTube playlists doesn't use the API or login credentials, so it only works for <50 videos; more videos would require multiple playlist links. Second, the autoscrolling NPR website doesn't allow the pages to be scraped correctly. I have all my code up here in case anyone wants to contribute a fix...
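The "multiple playlist links" workaround could be sketched roughly as below. This is an illustration, not the code in this repo: the `watch_videos?video_ids=` URL form is an assumption about how a login-free playlist link is built, the function names are hypothetical, and the chunk size of 49 is just a conservative reading of the "<50 videos" limit above.

```python
from typing import Iterator, List

# Assumed cap: the README says the login-free method only works for <50 videos.
MAX_IDS_PER_PLAYLIST = 49

# Assumed URL form for an anonymous playlist link; not confirmed by this repo.
PLAYLIST_BASE = "https://www.youtube.com/watch_videos?video_ids="


def chunk_ids(video_ids: List[str], size: int = MAX_IDS_PER_PLAYLIST) -> Iterator[List[str]]:
    """Yield consecutive slices of video_ids, each at most `size` long."""
    for start in range(0, len(video_ids), size):
        yield video_ids[start:start + size]


def playlist_urls(video_ids: List[str], size: int = MAX_IDS_PER_PLAYLIST) -> List[str]:
    """Build one playlist link per chunk of video IDs."""
    return [PLAYLIST_BASE + ",".join(chunk) for chunk in chunk_ids(video_ids, size)]
```

With 120 video IDs this would produce three links (49, 49, and 22 videos), so a multi-week scrape ends up as several playlists rather than one.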

Beware: YouTube will catch this after a few runs and block your searches until you manually attempt a search and fill in the captcha they show you.

Files:

InfiniteScraperPlaylistMaker.py: requires you to enter the URL to scrape. This would be, for example, a particular episode of "Morning Edition": https://www.npr.org/programs/morning-edition/2018/12/20/678513405?showDate=2018-12-20 It will then scrape all the songs for that date and build a playlist for it. This is the manual way of doing it. This script only requires BeautifulSoup to run.
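If you want to sanity-check an episode URL before handing it to the script, something like the following sketch might help. It is not part of the repo: `describe_episode_url` is a hypothetical helper, and it assumes episode URLs follow the same shape as the example above (a `/programs/<show>/...` path with a `showDate` query parameter).

```python
from urllib.parse import urlparse, parse_qs


def describe_episode_url(url: str) -> dict:
    """Return the program slug and show date found in an NPR episode URL.

    Assumes the URL looks like the Morning Edition example in this README.
    """
    parts = urlparse(url)
    segments = [s for s in parts.path.split("/") if s]
    if parts.netloc != "www.npr.org" or len(segments) < 2 or segments[0] != "programs":
        raise ValueError("not an npr.org program episode URL: %s" % url)
    # showDate may be missing; report None rather than guessing a date.
    show_date = parse_qs(parts.query).get("showDate", [None])[0]
    return {"program": segments[1], "show_date": show_date}
```

The returned date could be used, for instance, to label the playlist built for that episode.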

InfiniteSearcherScraperPlaylistMaker.py: Running this one with "testing=1" will list the last week's songs in CSV format in the terminal. You can add an argument to get multiple weeks of shows' songs. Running this outside of test mode for more than 1 week of songs will definitely get you flagged as a bot on YouTube...

This script starts a Chrome client on the NPR archive page and uses Selenium to scroll down a number of times, loading previous weeks onto the page before scraping the bands and such. Since each day has about 15-20 songs, passing the argument 3 will get 3 weeks of data, which is the last 15 shows, or approx. 3 weeks * 5 shows a week * 20 songs per show = 300 songs. YouTube will call you out on this, and the code will start erroring out when the requests return nothing. The next time you manually visit YouTube and try to search, they'll make you do a captcha and let you know you've been put on the naughty list.

Also, since this method only allows <50 videos per playlist, this will create one playlist for each week of shows (about 20 songs). This script requires Selenium, BeautifulSoup, and a chromedriver installed in the working directory.
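The scroll-to-load step described above might look something like this sketch. It is not the repo's actual code: `scroll_to_load` and its parameters are hypothetical, and the pause length is a guess. `driver` is expected to be a Selenium WebDriver (for example one created with `webdriver.Chrome()` and already pointed at the archive page with `driver.get(...)`).

```python
import time


def scroll_to_load(driver, times: int, pause_seconds: float = 2.0) -> str:
    """Scroll to the bottom of the page `times` times and return the page HTML.

    Each scroll is followed by a pause so the site has a chance to load
    the next batch of content before the next scroll.
    """
    for _ in range(times):
        # Standard Selenium call: run JavaScript in the page to jump to the bottom.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause_seconds)
    # page_source holds the HTML as currently rendered, including loaded content.
    return driver.page_source
```

The returned HTML can then be passed to BeautifulSoup for parsing. A fixed pause is the simplest approach but also the most fragile; if the page loads slowly, some weeks may be missing from the result.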