Tag: ruby

PageOneX Development Status

Post author By numeroteca
Post date July 6, 2012

We are now very close to the first release, version 0.1, which will give the user the following basic features: creating an account and creating a Thread with basic information: name, start date, end date (within the same month, for the alpha version only), description, the choice and number of newspapers, and the Topics to code with. For each Topic the user can add a color and a description.

The user then starts coding the scraped images (open issues) with the selected Topics. The color of each highlighted area is based on the Topic's color. The alpha version allows at most two highlighted areas; if the user tries to add another one, the system prompts them either to clear the current highlighted areas or to skip adding the new one. The display that follows coding is not finished yet.

Here are the first UI drafts and some of the open issues.

Tags rails, ruby, ui

code Scraping

Update Scraping script to scrape from many sources

Post author By numeroteca
Post date June 20, 2012

I’ve made some changes to the script so that it can scrape from different sources (http://kiosko.net, http://nytimes.com, http://elpais.com), and other sources can be added easily. Each source has two methods, build_source_issues and save_source_issues. The first constructs the URI of each issue image from a pattern that differs from one source to another; the second scrapes the images and saves them to disk in their own folders. I’ve written some comments to clarify parts of the code.

Note that to scrape from a specific source you have to comment out the others, as you can see in the code. For example, to scrape from the New York Times, un-comment line 15 and line 32 and comment out line 14 and line 31. If you want to run the script on https://scraperwiki.com/, comment out line 3 and don’t try to scrape from elpais, because ScraperWiki does not have the “RMagick” gem installed.
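The build/save split described above could look roughly like the sketch below. It is not the code from the gist: the method names, the kiosko.net URL layout, and the `fetch:` parameter are assumptions made for illustration. The downloader is passed in as a lambda so the save step can be tried without touching the network.

```ruby
require 'date'
require 'fileutils'
require 'open-uri'

# Build step: map each date in the range to the URI of that day's front page.
# The URL layout used here is an assumed pattern, not taken from the gist.
def build_kiosko_issues(first_day, last_day, country, paper)
  (first_day..last_day).each_with_object({}) do |date, issues|
    issues[date] = "http://img.kiosko.net/#{date.strftime('%Y/%m/%d')}/#{country}/#{paper}.750.jpg"
  end
end

# Save step: download every issue into the given folder, one file per date.
# `fetch` is a hypothetical hook; by default it reads the image over HTTP.
# Returns the list of file paths that were written.
def save_issues(issues, folder, fetch: ->(uri) { URI.open(uri, &:read) })
  FileUtils.mkdir_p(folder)
  issues.each_with_object([]) do |(date, uri), saved|
    begin
      path = File.join(folder, "#{date}.jpg")
      File.binwrite(path, fetch.call(uri))
      saved << path
    rescue StandardError => e
      # A missing issue should not stop the whole run; log it and move on.
      warn "skipped #{uri}: #{e.message}"
    end
  end
end
```

Adding another source would then mean writing one more build method with its own pattern; the save step can stay the same.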

Information about sources used in the script

1. http://kiosko.net

Date limits: there is no specific starting date. Scraping starts from 2008, but most of the newspapers are only available from 2011 onward. The script is able to scrape the range [2008-2012].

Image resolution: [750×1072]

2. http://nytimes.com

Date limits: the first available issue is dated 2002/01/24

Image resolution: [348×640]; this resolution is not high enough for coding!

3. http://elpais.com

Date limits: the first available issue is dated 2012/03/01

Image resolution: [765×1133]; the resolution of the produced images can be changed!
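The limits listed above can be kept in a small lookup table so that a date is checked before any request is made. This is only a sketch: the constant and method names are hypothetical, and the exact start day for kiosko.net (January 1, 2008) is an assumption, since only the year is known.

```ruby
require 'date'

# First available issue and image size per source, as listed above.
# kiosko's exact first day is assumed; only the year 2008 is given.
SOURCES = {
  kiosko:  { first_issue: Date.new(2008, 1, 1),  resolution: [750, 1072] },
  nytimes: { first_issue: Date.new(2002, 1, 24), resolution: [348, 640] },
  elpais:  { first_issue: Date.new(2012, 3, 1),  resolution: [765, 1133] }
}.freeze

# True when the source is known and the date is not before its first issue.
def issue_available?(source, date)
  info = SOURCES[source]
  !info.nil? && date >= info[:first_issue]
end
```

For example, `issue_available?(:elpais, Date.new(2012, 2, 29))` is false, because the first elpais issue is dated one day later.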

Script on Github

https://gist.github.com/2925910/

Tags elpais, kiosko, nytimes, ruby, scraper

code Scraping

First Scraping script for Kiosko.net

Post author By numeroteca
Post date June 13, 2012

Ahmd has been working on a scraper in Ruby for the front pages at Kiosko.net

I’ve finished the scraping script, and it’s public at https://gist.github.com/2925910. To run the script, just pass the file to ruby [ruby scraper.rb]. It will generate the folders (the directories are set for Linux; if you are on Windows you should modify them first), download the images (you can change the variable values in the get_issues method to get different newspapers), and write the log to stdout.
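One way to avoid editing the directories per operating system is to build them with File.join from a single base folder. The sketch below assumes that approach; it is not how the gist does it, and the `prepare_folders` name and the `issues` sub-folder are made up for illustration.

```ruby
require 'fileutils'

# Create one folder per newspaper under base_dir and return the paths.
# File.join leaves the choice of base folder (and its drive or root) to the caller.
def prepare_folders(base_dir, newspapers)
  newspapers.map do |paper|
    folder = File.join(base_dir, 'issues', paper)
    FileUtils.mkdir_p(folder) # no error if the folder already exists
    folder
  end
end
```

Calling `prepare_folders(Dir.home, %w[elpais nytimes])` would create the folders under the current user's home directory on either system.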

Check the script below.

I’ve also contacted Newseum to see if their “only today” front page database is available for PageOneX.

Script to scrape front page images of newspapers from kiosko.net

Tags kiosko, newseum, ruby, scraper