Ahmd has been working on a scraper in Ruby for the front pages at Kiosko.net

I’ve finished the scraping script, and it’s public at https://gist.github.com/2925910. To run it, just pass the file to Ruby (ruby scraper.rb). It will create the folders (the directory paths are set for Linux; if you are on Windows you should modify them first), download the images (you can change the variable values in the get_issues method to get different newspapers), and write the log to stdout.

Check the script below.

I’ve also contacted Newseum to ask whether their “only today” front-page database is available for PageOneX.

Script to scrape front-page images of newspapers from kiosko.net


require "fileutils"
require "open-uri"

class Scraper

  def self.get_issues
    # Sample of the countries and their newspapers from http://kiosko.net/
    countries_newspapers = {
      "es" => ["elpais", "abc"],
      "de" => ["faz", "bild"],
      "fr" => ["lemonde", "lacroix"],
      "it" => ["corriere_della_sera", "ilmessaggero"],
      "uk" => ["the_times"],
      "us" => ["wsj", "newyork_times", "usa_today"]
    }

    year = 2012
    month = 2
    start_day = 3
    end_day = 5

    Scraper.scrape year, month, start_day, end_day, countries_newspapers

    puts "Scraping is done"
  end

  def self.scrape(year, month, start_day, end_day, countries_newspapers)
    domain = "http://img.kiosko.net/"
    newspapers = Scraper.newspapers countries_newspapers
    paths = []

    # Build one URL per newspaper per issue date,
    # e.g. http://img.kiosko.net/2012/02/03/es/elpais.750.jpg
    issues = Scraper.issues_dates year, month, start_day, end_day
    issues.each do |issue|
      newspapers.each do |newspaper|
        paths << domain + issue + newspaper
      end
    end

    paths.each do |path|
      begin
        open(path) do |source|
          Scraper.save_issues path, source
        end
      rescue => e
        puts e.message + " => " + path
        puts
      end
    end
  end

  def self.save_issues(path, source)
    file_name = path.split("/").last
    # Save as pics/<newspaper>/<day>-<month>-<year>-<file_name>
    date_prefix = "#{path.split('/')[-3]}-#{path.split('/')[-4]}-#{path.split('/')[-5]}-"
    open("pics/#{file_name.split('.')[0]}/" + date_prefix + file_name, "wb") do |file|
      file.write(source.read)
      puts "done => " + date_prefix + file_name
    end
  end

  def self.newspapers(countries_newspapers)
    newspapers = []
    countries_newspapers.each do |country, papers|
      papers.each do |newspaper|
        newspapers << "/#{country}/#{newspaper}.750.jpg"
        FileUtils.mkdir "pics/#{newspaper}"
      end
    end
    newspapers
  end

  def self.issues_dates(year, month, start_day, end_day)
    day = start_day
    days = []
    number_of_issues = 1
    number_of_issues = end_day - start_day + 1 unless end_day == 0

    number_of_issues.times do
      # Zero-pad day and month so they match kiosko.net's URL scheme
      days << format("%d/%02d/%02d", year, month, day)
      day += 1
    end
    days
  end
end

FileUtils.mkdir "pics"
Scraper.get_issues
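The issues_dates method above counts days by hand, so a date range that crosses a month boundary would produce invalid paths. A minimal alternative sketch (not part of the script, and the method name issue_date_paths is my own) uses Ruby's Date class to build the same kiosko.net-style date segments while handling month and year rollover:

```ruby
require "date"

# Build "YYYY/MM/DD" path segments for every day in the range, inclusive.
# Date ranges iterate day by day, so month/year boundaries are handled for us.
def issue_date_paths(start_date, end_date)
  (start_date..end_date).map { |d| d.strftime("%Y/%m/%d") }
end

puts issue_date_paths(Date.new(2012, 2, 3), Date.new(2012, 2, 5))
# prints 2012/02/03, 2012/02/04 and 2012/02/05, one per line
```

Swapping this in would let the scraper take two Date arguments instead of the year/month/start_day/end_day quartet.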
