Ahmd has been working on a scraper in Ruby for the front pages at Kiosko.net
I’ve finished the scraping script, and it’s public at https://gist.github.com/2925910. To run the script, just pass the file to ruby [ruby scraper.rb]. It will generate the folders (the directory paths are set for Linux; if you are on Windows you should modify them first), download the images (you can change the variable values in the get_issues method to get different newspapers), and write the log to stdout.
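For instance, to scrape a different set of papers or dates you would edit the values at the top of get_issues along these lines (a sketch; "elmundo" is an assumed paper slug, check kiosko.net for the exact identifiers):

```ruby
# Hypothetical values for the variables in Scraper.get_issues:
# two Spanish papers for the first week of March 2012.
counteries_newspapers = { "es" => ["elpais", "elmundo"] }
year      = 2012
month     = 3
start_day = 1
end_day   = 7
```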
Check the script below.
I’ve also contacted Newseum to see if their “only today” front page database is available for PageOneX.
# Script to scrape front page images of newspapers from kiosko.net
require "fileutils"
require "open-uri"

class Scraper
  def self.get_issues
    # Sample of countries and their newspapers from http://kiosko.net/
    counteries_newspapers = {
      "es" => ["elpais", "abc"],
      "de" => ["faz", "bild"],
      "fr" => ["lemonde", "lacroix"],
      "it" => ["corriere_della_sera", "ilmessaggero"],
      "uk" => ["the_times"],
      "us" => ["wsj", "newyork_times", "usa_today"]
    }
    year      = 2012
    month     = 2
    start_day = 3
    end_day   = 5
    Scraper.scrape year, month, start_day, end_day, counteries_newspapers
    puts "Scraping is done"
  end

  def self.scrape(year, month, start_day, end_day, counteries_newspapers)
    domain = "http://img.kiosko.net/"
    newspapers = Scraper.newspapers counteries_newspapers
    issues = Scraper.issues_dates year, month, start_day, end_day
    # Build the full image URL for every date/newspaper combination
    paths = []
    issues.each do |issue|
      newspapers.each do |newspaper|
        paths << domain + issue + newspaper
      end
    end
    paths.each do |path|
      begin
        Scraper.save_issues path, open(path)
      rescue OpenURI::HTTPError => e
        # Some front pages are missing on the server; log and keep going
        puts e.message + " => " + path
        puts
      end
    end
  end

  def self.save_issues(path, source)
    # path looks like http://img.kiosko.net/2012/02/03/es/elpais.750.jpg,
    # so segments[-5..-3] are the year, month and day
    segments  = path.split("/")
    file_name = segments.last
    paper     = file_name.split(".").first
    date      = "#{segments[-3]}-#{segments[-4]}-#{segments[-5]}"
    open("pics/#{paper}/#{date}-#{file_name}", "wb") do |file|
      file.write(source.read)
      puts "done => #{date}-#{file_name}"
    end
  end

  def self.newspapers(counteries_newspapers)
    newspapers = []
    counteries_newspapers.each do |country, papers|
      papers.each do |paper|
        newspapers << "/#{country}/#{paper}.750.jpg"
        # mkdir_p does not raise if the folder already exists
        FileUtils.mkdir_p "pics/#{paper}"
      end
    end
    newspapers
  end

  def self.issues_dates(year, month, start_day, end_day)
    day = start_day
    days = []
    number_of_issues = 1
    number_of_issues = end_day - start_day + 1 unless end_day == 0
    number_of_issues.times do
      # Zero-pad month and day to match kiosko.net's YYYY/MM/DD layout
      days << format("%04d/%02d/%02d", year, month, day)
      day += 1
    end
    days
  end
end

FileUtils.mkdir_p "pics"
Scraper.get_issues
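For reference, this is the shape of the image URL the script assembles, as a standalone sketch (the date segment comes from issues_dates, the country/paper segment from newspapers):

```ruby
# Sketch of the URL construction: image host + YYYY/MM/DD + /country/paper.750.jpg
domain = "http://img.kiosko.net/"
issue  = format("%04d/%02d/%02d", 2012, 2, 3)
paper  = "/es/elpais.750.jpg"
url    = domain + issue + paper
puts url  # => "http://img.kiosko.net/2012/02/03/es/elpais.750.jpg"
```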