Ahmd has been working on a scraper in Ruby for the front pages at Kiosko.net
I’ve finished the scraping script, and it’s public at https://gist.github.com/2925910. To run the script, just pass the file to ruby [ruby scraper.rb]. It will generate the folders (the directory paths are set for Linux; if you are on Windows you should modify them first), download the images (you can change the variable values in the get_issues method to get different newspapers), and write the log to stdout.
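For instance, to scrape a different set of papers or dates you would edit the values at the top of get_issues along these lines (a sketch; "elmundo" is an assumed paper slug, check kiosko.net for the exact identifiers):

```ruby
# Hypothetical values for the variables in Scraper.get_issues:
# two Spanish papers for the first week of March 2012.
counteries_newspapers = { "es" => ["elpais", "elmundo"] }
year      = 2012
month     = 3
start_day = 1
end_day   = 7
```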
Check the script below.
I’ve also contacted Newseum to see if their “only today” front page database is available for PageOneX.
# Script to scrape front page images of newspapers from kiosko.net
require "fileutils"
require "open-uri"

class Scraper
  def self.get_issues
    # Sample of countries and their newspapers from http://kiosko.net/
    counteries_newspapers = {
      "es" => ["elpais", "abc"],
      "de" => ["faz", "bild"],
      "fr" => ["lemonde", "lacroix"],
      "it" => ["corriere_della_sera", "ilmessaggero"],
      "uk" => ["the_times"],
      "us" => ["wsj", "newyork_times", "usa_today"]
    }
    year      = 2012
    month     = 2
    start_day = 3
    end_day   = 5
    Scraper.scrape year, month, start_day, end_day, counteries_newspapers
    puts "Scraping is done"
  end

  def self.scrape(year, month, start_day, end_day, counteries_newspapers)
    domain = "http://img.kiosko.net/"
    newspapers = Scraper.newspapers counteries_newspapers
    issues = Scraper.issues_dates year, month, start_day, end_day
    # Build the full image URL for every date/newspaper combination
    paths = []
    issues.each do |issue|
      newspapers.each do |newspaper|
        paths << domain + issue + newspaper
      end
    end
    paths.each do |path|
      begin
        Scraper.save_issues path, open(path)
      rescue OpenURI::HTTPError => e
        # Some front pages are missing on the server; log and keep going
        puts e.message + " => " + path
        puts
      end
    end
  end

  def self.save_issues(path, source)
    # path looks like http://img.kiosko.net/2012/02/03/es/elpais.750.jpg,
    # so segments[-5..-3] are the year, month and day
    segments  = path.split("/")
    file_name = segments.last
    paper     = file_name.split(".").first
    date      = "#{segments[-3]}-#{segments[-4]}-#{segments[-5]}"
    open("pics/#{paper}/#{date}-#{file_name}", "wb") do |file|
      file.write(source.read)
      puts "done => #{date}-#{file_name}"
    end
  end

  def self.newspapers(counteries_newspapers)
    newspapers = []
    counteries_newspapers.each do |country, papers|
      papers.each do |paper|
        newspapers << "/#{country}/#{paper}.750.jpg"
        # mkdir_p does not raise if the folder already exists
        FileUtils.mkdir_p "pics/#{paper}"
      end
    end
    newspapers
  end

  def self.issues_dates(year, month, start_day, end_day)
    day = start_day
    days = []
    number_of_issues = 1
    number_of_issues = end_day - start_day + 1 unless end_day == 0
    number_of_issues.times do
      # Zero-pad month and day to match kiosko.net's YYYY/MM/DD layout
      days << format("%04d/%02d/%02d", year, month, day)
      day += 1
    end
    days
  end
end

FileUtils.mkdir_p "pics"
Scraper.get_issues
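For reference, this is the shape of the image URL the script assembles, as a standalone sketch (the date segment comes from issues_dates, the country/paper segment from newspapers):

```ruby
# Sketch of the URL construction: image host + YYYY/MM/DD + /country/paper.750.jpg
domain = "http://img.kiosko.net/"
issue  = format("%04d/%02d/%02d", 2012, 2, 3)
paper  = "/es/elpais.750.jpg"
url    = domain + issue + paper
puts url  # => "http://img.kiosko.net/2012/02/03/es/elpais.750.jpg"
```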