Ahmd has been working on a scraper in Ruby for the front pages at Kiosko.net.
I’ve finished the scraping script, and it’s public at https://gist.github.com/2925910. To run it, just pass the file to ruby [ruby scraper.rb]. It will generate the folders (the directory paths are set for Linux; if you are on Windows you should modify them first), download the images (you can change the variable values in the get_issues method to get different newspapers), and write the log to stdout.
Check the script below.
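The actual script lives in the gist linked above. What follows is not that script, only a minimal sketch of the three steps described (create folders, download images, log to stdout). The image host, the URL layout, the folder layout, and the signatures of issue_url, issue_path and get_issues are assumptions made for illustration; only the method name get_issues comes from the post.

```ruby
require "date"
require "fileutils"
require "open-uri"

# ASSUMPTION: host and URL layout are illustrative, not taken from the gist.
IMAGE_HOST = "http://img.kiosko.net"

# Build the image URL for one newspaper on one date (hypothetical helper).
def issue_url(date, country, paper)
  "#{IMAGE_HOST}/#{date.strftime('%Y/%m/%d')}/#{country}/#{paper}.750.jpg"
end

# Build the local file path; File.join avoids hard-coding a path separator.
def issue_path(base_dir, date, paper)
  File.join(base_dir, paper, "#{date.strftime('%Y-%m-%d')}.jpg")
end

# Download every paper for every day in the range and log each result to stdout.
# papers is a list of [country, paper] pairs, e.g. [["es", "elpais"]].
def get_issues(base_dir, papers, first_day, last_day)
  (first_day..last_day).each do |date|
    papers.each do |country, paper|
      url  = issue_url(date, country, paper)
      path = issue_path(base_dir, date, paper)
      FileUtils.mkdir_p(File.dirname(path)) # create the folder if it is missing
      begin
        URI.open(url) { |remote| File.binwrite(path, remote.read) }
        puts "saved #{url} -> #{path}"
      rescue OpenURI::HTTPError, SocketError, SystemCallError => e
        puts "skipped #{url} (#{e.message})"
      end
    end
  end
end

# Example call (performs network requests):
# get_issues("frontpages", [["es", "elpais"]], Date.new(2012, 6, 1), Date.new(2012, 6, 11))
```

Building paths with File.join rather than fixed strings is one way to avoid editing the directories by hand when moving between Linux and Windows.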
I’ve also contacted Newseum to see if their “only today” front page database is available for PageOneX.
Script to scrape front page images of newspapers from kiosko.net
June 11th 2012. 10am-12.45pm EST
People: Ahmd + Pablo
Check the live notes of the meeting at http://brownbag.me:9001/p/120611pageonex
Crossposting from numeroteca.org.
View this datavis full size at gigapan.
Today’s post presents the tool we are building this summer: PageOneX. The idea behind it is to move the process of coding newspaper front pages online and make it easier, so that this kind of visualization is available to researchers, advocacy groups and anyone interested. I’ll give some background about this process.
How things started
Approximately one year ago I started diving into the front page world. It was days after the occupations of squares in many cities in Spain, and I was living in Boston. I made a front page visualization to show what people were talking about: the media blackout on the indignados #15M movement. You can read more about the story in the Civic Media blog. Since then I’ve been making more visualizations around the front pages of print newspapers, testing different methods and possible ways to use them. I’ve also made a tool, built in Processing, to scrape front pages from kiosko.net and build an .svg matrix.
I’ve met with Nathan Mathias, who is developing Media Meter http://mediameter.org/ (development version http://mmdev.media.mit.edu/, please do not share the link yet), a tool built on Ruby on Rails to crowdsource the analysis of news and test intercoder reliability. They are offering their code (the github link will come soon). We still have to figure out whether we go for it and, if we do, how the collaboration will work: fork it or stay on the same platform. Nathan, the main developer, is offering online support and to join the conference calls once a week. He is willing to extend the tool to more uses.
I think it is a great idea to not start from scratch and to have Nathan’s help. At the same time I am worried about starting our project with many things that we do not need. Any thoughts on this?