I’ve finished the scraping script, and it’s public at https://gist.github.com/2925910. To run it, just pass the file to Ruby [ruby scraper.rb]. It will generate the folders (the directory paths are set for Linux; if you are on Windows you should modify them first), download the images (you can change the variable values in the get_issues method to fetch different newspapers or dates), and write the log to stdout.
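
On the directory point, here is a minimal sketch of how the output location could be built in one place with File.join, so the folder layout is easier to change. The helper name output_path and the base_dir parameter are hypothetical and not part of the gist; the sketch assumes image URLs shaped like the ones kiosko.net serves in the script and the day-month-year file name the script logs.

```ruby
# Hypothetical helper (not in the gist): maps an image URL to a local file path.
# Assumes URLs shaped like http://img.kiosko.net/2012/02/03/es/elpais.750.jpg
def output_path(url, base_dir = "pics")
  parts = url.split("/")
  year, month, day = parts[-5], parts[-4], parts[-3]
  file_name = parts[-1]                # e.g. "elpais.750.jpg"
  newspaper = file_name.split(".")[0]  # e.g. "elpais"
  File.join(base_dir, newspaper, "#{day}-#{month}-#{year}-#{file_name}")
end

puts output_path("http://img.kiosko.net/2012/02/03/es/elpais.750.jpg")
```

Changing base_dir is then the only edit needed to save the images somewhere else.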

I’ve also contacted Newseum to ask whether their “only today” front page database is available for PageOneX.

Script to scrape front-page images of newspapers from kiosko.net

require "fileutils"
require "open-uri"

class Scraper

  # Entry point: edit the values below to choose newspapers and dates.
  def self.get_issues
    # Sample of the countries and their newspapers from http://kiosko.net/
    counteries_newspapers = {
      "es" => ["elpais", "abc"],
      "de" => ["faz", "bild"],
      "fr" => ["lemonde", "lacroix"],
      "it" => ["corriere_della_sera", "ilmessaggero"],
      "uk" => ["the_times"],
      "us" => ["wsj", "newyork_times", "usa_today"]
    }

    year = 2012
    month = 2
    start_day = 3
    end_day = 5

    Scraper.scrape year, month, start_day, end_day, counteries_newspapers

    puts "Scraping is done"
  end

  # Builds every image URL for the date range and downloads each one.
  def self.scrape(year, month, start_day, end_day, counteries_newspapers)
    domain = "http://img.kiosko.net/"
    newspapers = Scraper.newspapers counteries_newspapers
    paths = []

    issues = Scraper.issues_dates year, month, start_day, end_day
    issues.each do |issue|
      newspapers.each do |newspaper|
        paths << domain + issue + newspaper
      end
    end

    paths.each do |path|
      begin
        # URI.open comes from open-uri; plain open no longer fetches URLs on Ruby 3.0+.
        URI.open(path) do |source|
          Scraper.save_issues path, source
        end
      rescue => e
        # Log the failure and carry on with the next image.
        puts e.message + " => " + path
      end
    end
  end


  # Writes one downloaded image to pics/<newspaper>/<day>-<month>-<year>-<file name>.
  def self.save_issues(path, source)
    file_name = path.split('/').last

    File.open("pics/#{path.split("/")[-1].split(".")[0]}/" + "#{path.split('/')[-3]}-#{path.split('/')[-4]}-#{path.split('/')[-5]}-" + file_name, "wb") do |file|
      file.write(source.read)
      puts "done => #{path.split('/')[-3]}-#{path.split('/')[-4]}-#{path.split('/')[-5]}-" + file_name
    end
  end


  # Returns the URL suffix for each newspaper and creates its output folder.
  def self.newspapers(counteries_newspapers)
    newspapers = []
    counteries_newspapers.each do |country, newspaper|
      newspaper.each do |_newspaper|
        newspapers << "/#{country}/#{_newspaper}.750.jpg"
        # mkdir_p does not fail if the folder is left over from an earlier run.
        FileUtils.mkdir_p "pics/#{_newspaper}"
      end
    end
    newspapers
  end

  # Returns date path segments such as "2012/02/03" for start_day..end_day.
  # Passing end_day = 0 yields only start_day.
  def self.issues_dates(year, month, start_day, end_day)
    day = start_day
    days = []
    number_of_issues = 1
    number_of_issues = end_day - start_day + 1 unless end_day == 0
    number_of_issues.times do
      # Zero-pad month and day to two digits.
      days << format("%d/%02d/%02d", year, month, day)
      day += 1
    end
    days
  end
end

FileUtils.mkdir_p "pics"
Scraper.get_issues
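
The issues_dates method takes a start and end day within a single month. As a rough sketch of one way to cover ranges that cross a month or year boundary, the date segments could be generated from Ruby's Date class instead. The method name issue_paths and its two Date parameters are hypothetical and not part of the gist.

```ruby
require "date"

# Hypothetical alternative to issues_dates (not in the gist): takes two Date
# objects so the range can cross month or year boundaries.
def issue_paths(first_day, last_day)
  # A Range of Date objects steps one day at a time.
  (first_day..last_day).map { |date| date.strftime("%Y/%m/%d") }
end

puts issue_paths(Date.new(2012, 2, 28), Date.new(2012, 3, 1))
```

Each string has the same year/month/day shape that the script appends to the img.kiosko.net domain.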

