
Update Scraping script to scrape from many sources

I’ve made some changes to the script so that it scrapes from several sources (http://kiosko.net, http://nytimes.com, http://elpais.com), and other sources can be added easily. For each source there are two methods, build_source_issues and save_source_issues. The first constructs the URI of each issue image from a pattern that differs from one source to another; the second scrapes the images and saves them to disk in their source-specific folders. I’ve written some comments to clarify parts of the code.
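To make the two-step structure concrete, here is a minimal sketch of what such a pair of methods could look like. It is not the code from the gist: the example.com pattern, the method signatures, and the injectable fetcher are all assumptions made for illustration.

```ruby
require 'date'
require 'fileutils'
require 'open-uri'

# Hypothetical URI pattern; each real source has its own pattern in the script.
EXAMPLE_PATTERN = 'http://example.com/covers/%Y/%m/%d/front.jpg'

# Step 1: build one issue URI per day between two dates (inclusive).
def build_source_issues(pattern, first_day, last_day)
  (first_day..last_day).map do |day|
    { date: day, uri: day.strftime(pattern) }
  end
end

# Step 2: fetch every issue and store it in the folder for that source.
# The fetcher is a parameter (an assumption of this sketch) so the method
# can be tried without network access; by default it downloads the URI.
def save_source_issues(issues, folder, fetcher = ->(uri) { URI.open(uri, &:read) })
  FileUtils.mkdir_p(folder)
  issues.map do |issue|
    path = File.join(folder, "#{issue[:date].strftime('%Y-%m-%d')}.jpg")
    File.binwrite(path, fetcher.call(issue[:uri]))
    path
  end
end
```

Adding a source under this layout would mostly mean supplying a new pattern and a new folder, which is probably why new sources are easy to plug in.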

Note: to scrape from a specific source you should comment out the others, as you can see in the code. For example, to scrape from the New York Times you should uncomment lines 15 and 32 and comment out lines 14 and 31. Also, if you want to run the script on https://scraperwiki.com/ you should comment out line 3, and don’t try to scrape from elpais, because ScraperWiki doesn’t have the “RMagick” gem installed.

Information about sources used in the script

1. http://kiosko.net 

Date limits: there is no specific starting date. Scraping starts from 2008, but most of the newspapers are only available from 2011 onwards; the script is able to scrape the range [2008-2012].

Image resolution: [750×1072]

2. http://nytimes.com

Date limits: first issue available date is 2002/01/24

Image resolution: [348×640] the resolution is not enough for coding!

3. http://elpais.com

Date limits: first issue available date is 2012/03/01

Image resolution: [765×1133] the resolution of produced images can be changed!

Script on Github

https://gist.github.com/2925910/

One reply on “Update Scraping script to scrape from many sources”

Note on kiosko.net images:
The image size varies from newspaper to newspaper depending on its proportions; all of them keep a width of 750 px.
On kiosko.net the New York Times is 1375 px high and El País 1110 px.

Image URLs look like http://img.kiosko.net/2012/06/23/es/elpais.750.jpg and are also available at http://img.kiosko.net/2012/06/23/es/elpais.jpg (without the ‘750’).
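The two example URLs above suggest a date/country/newspaper pattern, which could be sketched in Ruby as follows. The helper name kiosko_image_url and its parameters are hypothetical, and only the ‘es’/‘elpais’ values are confirmed by the examples; other country codes and newspaper names are assumptions about how the site is organised.

```ruby
require 'date'

# Builds a kiosko.net cover URL following the two examples quoted above.
# Pass width: nil to get the variant without the '750' segment.
def kiosko_image_url(date, country, paper, width: 750)
  suffix = width ? ".#{width}" : ''
  format('http://img.kiosko.net/%s/%s/%s%s.jpg',
         date.strftime('%Y/%m/%d'), country, paper, suffix)
end

day = Date.new(2012, 6, 23)
with_width    = kiosko_image_url(day, 'es', 'elpais')
without_width = kiosko_image_url(day, 'es', 'elpais', width: nil)
```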

Note: we should talk about “size” of the image, and not image “resolution”.

It is worth mentioning that we have looked at http://www.newseum.org/todaysfrontpages/ and have not yet found a good pattern for scraping or for going back in time. We are contacting them to see if it is possible to use their database.
