Most Used Text Tags on the Internet (MUTTI)

I wanted to try Ruby. After some thinking I found a good project: a script that shows stats for the most used text tags on the Internet (MUTTI).

What needed to be done

  • Given a URL, the script should fetch the HTML
  • Parse the HTML for known text tags
  • Store the results
  • Ability to control from the console
  • Create a web-page showing the results

Doing it in Ruby

Given a URL the script should fetch the HTML:
Fetching was pretty easy with Net::HTTP. I also wanted some automation, so I added an option to store the URLs in a YAML file called site_list.yaml.
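
A minimal sketch of how this step could look. It is not the original script: the method names (load_site_list, normalize_url, fetch_html) are made up for illustration, and it assumes site_list.yaml holds a plain YAML list of URLs.

```ruby
require "net/http"
require "uri"
require "yaml"

# Read the site list; assumes the file holds a plain YAML array of URLs.
def load_site_list(path)
  sites = YAML.safe_load(File.read(path))
  sites.is_a?(Array) ? sites : []
end

# Add a scheme when the URL was given as "www.website.com/index.html".
def normalize_url(url)
  url =~ %r{\Ahttps?://}i ? url : "http://#{url}"
end

# Download the HTML body; returns nil unless the server answers with 2xx.
# Redirects are not followed in this sketch.
def fetch_html(url)
  response = Net::HTTP.get_response(URI.parse(normalize_url(url)))
  response.is_a?(Net::HTTPSuccess) ? response.body : nil
end
```

With these in place, something like load_site_list("site_list.yaml").each { |site| fetch_html(site) } would walk the whole list.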

Parse the HTML:
Ruby has very nice support for regular expressions.
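
As a rough illustration, counting tags with a regular expression might look like the sketch below. The tag list and the names TEXT_TAGS, TAG_PATTERN and count_text_tags are assumptions; the original script may recognise a different set of "text tags".

```ruby
# Hypothetical list of text tags to look for.
TEXT_TAGS = %w[b i u em strong p h1 h2 h3 h4 h5 h6 font span blockquote pre code].freeze

# Matches an opening tag such as <b> or <B class="x">, but not </b> or <br>.
TAG_PATTERN = /<\s*(#{TEXT_TAGS.join("|")})\b/i

# Returns a hash of lowercase tag name => number of opening tags found.
def count_text_tags(html)
  counts = Hash.new(0)
  html.scan(TAG_PATTERN).flatten.each { |name| counts[name.downcase] += 1 }
  counts
end
```

A regular expression is good enough for rough statistics like these, though it will also count tags that sit inside comments or scripts.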

Store the results:
Using YAML it was very easy to save and load the results.
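
Saving and loading could be as short as the following sketch. The file layout (a hash of site => tag counts) and the method names save_results and load_results are assumptions, not taken from the original script.

```ruby
require "yaml"

# Write the results hash to disk as YAML.
def save_results(path, results)
  File.write(path, YAML.dump(results))
end

# Load the results back; an absent or empty file gives an empty hash.
def load_results(path)
  return {} unless File.exist?(path)
  YAML.safe_load(File.read(path)) || {}
end
```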

Ability to control from the console:
It was pretty hard to find any documentation that covered how to access arguments passed from the console. In the end I downloaded a Ruby script and saw how they did it.
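
For reference, Ruby exposes the console arguments as the ARGV array. The sketch below shows one way to route the commands listed under Options; the dispatch method and the returned symbols are hypothetical, not the original code.

```ruby
# Map the first console argument to an action and an optional target.
def dispatch(args)
  command = args.first
  case command
  when nil          then [:usage, nil]
  when "start_auto" then [:start_auto, nil]
  when "reset"      then [:reset, nil]
  when "gen_stats"  then [:gen_stats, nil]
  else                   [:fetch_site, command] # anything else is treated as a URL
  end
end

# ARGV holds the arguments given after the script name on the command line.
if __FILE__ == $PROGRAM_NAME
  action, target = dispatch(ARGV)
  puts "action: #{action}, target: #{target.inspect}"
end
```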

Create a web-page showing the results:
It went pretty smoothly, thanks to YAML.
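
Once the counts are loaded from YAML, turning them into a page is mostly string building. This is a sketch with an assumed layout (a single table sorted by count); the names escape_html and stats_html are made up for illustration.

```ruby
# Escape the few characters that matter inside HTML text.
def escape_html(text)
  text.to_s.gsub(/[&<>"]/, "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;")
end

# Build a simple HTML page from a hash of tag name => count,
# with the most used tags first.
def stats_html(counts)
  rows = counts.sort_by { |tag, count| [-count, tag] }.map do |tag, count|
    "    <tr><td>#{escape_html(tag)}</td><td>#{count}</td></tr>"
  end
  [
    "<html>",
    "<head><title>MUTTI stats</title></head>",
    "<body>",
    "  <table>",
    "    <tr><th>Tag</th><th>Count</th></tr>",
    *rows,
    "  </table>",
    "</body>",
    "</html>"
  ].join("\n")
end
```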

The Ruby experience

I had coded in Python for some months, and Ruby feels a lot like Python. The best thing about Ruby is the @ prefix for instance variables: it is much nicer than Python's self, which has to be declared in every method of a class. The worst things that I encountered:

  • Debugging is hard because Ruby's error messages don't give much helpful information.
  • When you print a Hash or an Array, the representation is of no use. Python has a much better representation.
  • Blocks are very hard to use.
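
On the printing complaint, one thing that may help: in older Ruby versions, puts on a Hash or an Array runs the elements together, while p (which calls inspect) shows the structure. A small sketch with made-up sample data, which also shows a block in its most common form:

```ruby
# Sample data, not real statistics.
counts = { "b" => 3, "p" => 7 }
tags = ["b", "p"]

# p prints the inspect form, which keeps keys, values and brackets visible.
p counts
p tags

# A block is the code between the braces; each calls it once per pair.
counts.each { |tag, count| puts "#{tag}: #{count}" }
```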

The script

Demo:
Click here to see some stats

Download:
Click here to download the script

Usage

First edit site_list.yaml with a text editor. Then from a console type: ruby mutti_html.rb start_auto.

Options

Download, parse and store a site:
ruby mutti_html.rb www.website.com/index.html

Reset the database:
ruby mutti_html.rb reset

Generate a stat HTML file:
ruby mutti_html.rb gen_stats

Code 30. Jan 2005
© 2000-2009 amix. Powered by Skeletonz.