
First Scraping script for Kiosko.net

Ahmd has been working on a scraper in Ruby for the front pages at Kiosko.net.

I’ve finished the scraping script, and it’s public at https://gist.github.com/2925910. To run it, just pass the file to Ruby [ruby scraper.rb]. It will create the output folders (the directory paths are set up for Linux; if you are on Windows you should modify them first), download the images (you can change the variable values in the get_issues method to get different newspapers or dates), and write a log to stdout.
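For context, the script works because kiosko.net serves every front page at a predictable image URL built from the date, the country code, and the newspaper slug. A minimal sketch of that scheme, as the script assembles it (issue_url is a hypothetical helper name, not part of the gist):

```ruby
# Sketch of the kiosko.net image URL scheme the script relies on.
# issue_url is a hypothetical helper, not part of the gist itself.
def issue_url(year, month, day, country, paper)
  date_path = format("%04d/%02d/%02d", year, month, day)  # e.g. "2012/02/03"
  "http://img.kiosko.net/#{date_path}/#{country}/#{paper}.750.jpg"
end

puts issue_url(2012, 2, 3, "es", "elpais")
# http://img.kiosko.net/2012/02/03/es/elpais.750.jpg
```

The ".750.jpg" suffix selects the 750-pixel-wide rendition that the script downloads.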

Check the script below.

I’ve also contacted Newseum to see if their “only today” front-page database is available for PageOneX.

Script to scrape front-page images of newspapers from kiosko.net


require "fileutils"
require "open-uri"

class Scraper

  def self.get_issues
    # Sample of countries and their newspapers from http://kiosko.net/
    countries_newspapers = {
      "es" => ["elpais", "abc"],
      "de" => ["faz", "bild"],
      "fr" => ["lemonde", "lacroix"],
      "it" => ["corriere_della_sera", "ilmessaggero"],
      "uk" => ["the_times"],
      "us" => ["wsj", "newyork_times", "usa_today"]
    }

    year = 2012
    month = 2
    start_day = 3
    end_day = 5

    Scraper.scrape year, month, start_day, end_day, countries_newspapers

    puts "Scraping is done"
  end

  def self.scrape(year, month, start_day, end_day, countries_newspapers)
    domain = "http://img.kiosko.net/"
    newspapers = Scraper.newspapers countries_newspapers
    paths = []

    issues = Scraper.issues_dates year, month, start_day, end_day
    issues.each do |issue|
      newspapers.each do |newspaper|
        paths << domain + issue + newspaper
      end
    end

    # Download each front page; log and skip issues that are missing.
    paths.each do |path|
      begin
        open(path) do |source|
          Scraper.save_issues(path, source)
        end
      rescue OpenURI::HTTPError => e
        puts e.message + " => " + path
        puts
      end
    end

  end

  def self.save_issues(path, source)
    file_name = path.split("/").last

    # Save as pics/<paper>/<day>-<month>-<year>-<file_name>
    open("pics/#{path.split("/")[-1].split(".")[0]}/" \
         "#{path.split("/")[-3]}-#{path.split("/")[-4]}-#{path.split("/")[-5]}-" + file_name, "wb") do |file|
      file.write(source.read)
      puts "done => #{path.split("/")[-3]}-#{path.split("/")[-4]}-#{path.split("/")[-5]}-" + file_name
    end

  end

  def self.newspapers(countries_newspapers)
    newspapers = []
    countries_newspapers.each do |country, papers|
      papers.each do |paper|
        newspapers << "/#{country}/#{paper}.750.jpg"
        FileUtils.mkdir_p "pics/#{paper}"  # mkdir_p: no error if the folder already exists
      end
    end
    newspapers
  end

  def self.issues_dates(year, month, start_day, end_day)
    day = start_day
    days = []

    number_of_issues = 1
    number_of_issues = end_day - start_day + 1 unless end_day == 0

    number_of_issues.times do
      # Zero-pad month and day so the path matches kiosko.net's YYYY/MM/DD scheme
      days << format("%04d/%02d/%02d", year, month, day)
      day += 1
    end
    days
  end

end

FileUtils.mkdir_p "pics"
Scraper.get_issues
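To make the string-slicing in save_issues easier to follow: it splits the URL on "/", takes index -1 for the image file, and indices -3, -4, -5 for day, month, and year. Worked out on a sample path:

```ruby
# How save_issues maps a kiosko.net URL to a local file, step by step.
path  = "http://img.kiosko.net/2012/02/03/es/elpais.750.jpg"
parts = path.split("/")  # [..., "2012", "02", "03", "es", "elpais.750.jpg"]

paper = parts[-1].split(".")[0]  # "elpais" -> the per-newspaper folder
local = "pics/#{paper}/#{parts[-3]}-#{parts[-4]}-#{parts[-5]}-#{parts[-1]}"

puts local
# pics/elpais/03-02-2012-elpais.750.jpg
```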
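As a design note, the hand-rolled zero-padding in issues_dates could also be delegated to Ruby's built-in Date class, which handles padding and month/year rollover for free. A minimal sketch (issue_dates is a hypothetical alternative, not in the gist):

```ruby
require "date"

# Hypothetical alternative to issues_dates: a Date range instead of a
# day counter, formatted straight into kiosko.net's YYYY/MM/DD scheme.
def issue_dates(year, month, start_day, end_day)
  (Date.new(year, month, start_day)..Date.new(year, month, end_day)).map do |d|
    d.strftime("%Y/%m/%d")
  end
end

p issue_dates(2012, 2, 3, 5)
# ["2012/02/03", "2012/02/04", "2012/02/05"]
```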
