Most Used Text Tags on the Internet (MUTTI)
I wanted to try Ruby. After some thinking I found a good project: a script that shows stats for the most used text tags on the Internet (MUTTI).
What needed to be done
- Given a URL the script should fetch the HTML
- Parse the HTML for known text tags
- Store the results
- Ability to control from the console
- Create a web-page showing the results
Doing it in Ruby
Given a URL the script should fetch the HTML:
It was pretty easy to fetch using Net::HTTP. I also wanted some automation, so I made an option to store the URLs in a YAML file called site_list.yaml.
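A minimal sketch of how this step might look (not the original script). The helper names to_uri, fetch_html and load_site_list are made up for illustration. It also assumes that site_list.yaml holds a plain list of addresses and that a missing scheme means http://.

```ruby
require "net/http"
require "uri"
require "yaml"

# Turn a bare address such as "www.website.com/index.html" into a URI.
# Prepending "http://" when no scheme is given is an assumption.
def to_uri(address)
  address = "http://#{address}" unless address =~ %r{\Ahttps?://}i
  URI.parse(address)
end

# Fetch the body of a page. Redirects are not followed in this sketch;
# anything other than a 2xx response raises an error.
def fetch_html(address)
  response = Net::HTTP.get_response(to_uri(address))
  unless response.is_a?(Net::HTTPSuccess)
    raise "could not fetch #{address}: HTTP #{response.code}"
  end
  response.body
end

# Read the list of addresses. Assumes the YAML file is a list of strings.
def load_site_list(path = "site_list.yaml")
  YAML.load_file(path)
end
```

With these in place, something like `load_site_list.each { |site| fetch_html(site) }` would walk through the whole list.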
Parse the HTML:
Ruby has very nice support for regular expressions.
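As an illustration, counting opening tags with String#scan could look roughly like this. The TEXT_TAGS list and the count_tags name are hypothetical, since the list used by the actual script is not shown here. A regular expression is also not a real HTML parser, so tags inside comments or scripts would be counted too.

```ruby
# Hypothetical list of "text tags"; adjust to taste.
TEXT_TAGS = %w[b i u em strong p h1 h2 h3 blockquote].freeze

# Count the opening tags in html that appear in TEXT_TAGS.
def count_tags(html)
  counts = Hash.new(0)
  # "<", a tag name, then whitespace, "/" or ">" (closing tags start
  # with "</" and therefore do not match).
  html.scan(/<([a-z][a-z0-9]*)(?=[\s\/>])/i) do |(name)|
    name = name.downcase
    counts[name] += 1 if TEXT_TAGS.include?(name)
  end
  counts
end
```

The lookahead keeps `<b>` from also matching `<br>` or `<body>`, and the `i` flag together with `downcase` counts `<B>` and `<b>` as the same tag.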
Store the results:
Using YAML it was very easy to save and load the results.
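A sketch of the save and load round trip, assuming the results are a simple Hash of tag name to count. The file name mutti_db.yaml and the three helper names are invented for this example.

```ruby
require "yaml"

# Hypothetical file name for the stored results.
DB_FILE = "mutti_db.yaml"

# Load stored counts; a missing or empty file gives an empty Hash.
def load_results(path = DB_FILE)
  return {} unless File.exist?(path)
  YAML.load_file(path) || {}
end

# Write the counts back as YAML.
def save_results(counts, path = DB_FILE)
  File.write(path, counts.to_yaml)
end

# Add freshly counted tags to the stored totals.
def merge_results(stored, fresh)
  stored.merge(fresh) { |_tag, old, new| old + new }
end
```

Because the file is plain text, it can also be inspected or edited by hand.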
Ability to control from the console:
It was pretty hard to find any documentation that covered how to access arguments passed from the console. In the end I downloaded a Ruby script and saw how they did it.
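For reference, Ruby puts the command-line arguments in the global array ARGV (the script name itself is not included). A rough sketch of dispatching on them follows. The parse_command helper and the symbols it returns are hypothetical; only the command names start_auto, reset and gen_stats come from the script's usage.

```ruby
# Decide what to do from the argument list.
# Returns [action, target]; target is only set for a fetch.
def parse_command(args)
  case args.first
  when nil          then [:usage, nil]
  when "start_auto" then [:start_auto, nil]
  when "reset"      then [:reset, nil]
  when "gen_stats"  then [:gen_stats, nil]
  else                   [:fetch, args.first]
  end
end

# Only runs when the file is executed directly, not when it is required.
if __FILE__ == $PROGRAM_NAME
  action, target = parse_command(ARGV)
  puts "action=#{action} target=#{target.inspect}"
end
```

Taking the array as a parameter instead of reading ARGV inside the method makes the logic easy to try out in irb.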
Create a web-page showing the results:
It went pretty smoothly, thanks to YAML.
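Turning the stored counts into a page could be sketched like this. The stats_html name and the table layout are assumptions, not the output of the real script. Tag names are assumed to be plain letters and digits, so no HTML escaping is done.

```ruby
# Build a small HTML page from a Hash of tag name => count,
# most used tag first (ties sorted by name).
def stats_html(counts)
  rows = counts.sort_by { |tag, n| [-n, tag] }.map do |tag, n|
    "<tr><td>#{tag}</td><td>#{n}</td></tr>"
  end
  <<~HTML
    <html>
    <head><title>MUTTI stats</title></head>
    <body>
    <table>
    <tr><th>Tag</th><th>Count</th></tr>
    #{rows.join("\n")}
    </table>
    </body>
    </html>
  HTML
end
```

Writing the result with `File.write("stats.html", stats_html(counts))` would give a static page (stats.html is again just an example name).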
The Ruby experience
I had coded in Python for some months, and Ruby feels a lot like Python. The best thing about Ruby is the @ prefix for instance variables: it is much better than Python's self, which has to be declared in every method of a class. The worst things that I encountered:
- Debugging is hard because Ruby's error messages give little helpful information.
- When one prints a Hash or an Array, the representation is of no use. Python has a much better representation.
- Blocks are very hard to use.
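To make these points concrete, here is a small made-up class (TagStats is not part of the script) that uses an @ instance variable, takes a block, and is printed with p. Kernel#p prints the inspect form of an object, which is usually more readable for a Hash or an Array than what puts gives.

```ruby
class TagStats
  def initialize
    @counts = Hash.new(0) # instance variable; no explicit self parameter
  end

  # Add n occurrences of tag; returns self so calls can be chained.
  def add(tag, n = 1)
    @counts[tag] += n
    self
  end

  # Yield each tag and its count to the caller's block, most used first.
  def each_by_count
    @counts.sort_by { |tag, n| [-n, tag] }.each { |tag, n| yield tag, n }
  end
end

stats = TagStats.new
stats.add("p", 5).add("b", 2)

# p shows the object together with the contents of @counts.
p stats

# The block receives one tag and count per call.
stats.each_by_count { |tag, n| puts "#{tag}: #{n}" }
```

A block is roughly an anonymous function handed to the method, and yield calls it once per item.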
The script
Demo:
Click here to see some stats
Download:
Click here to download the script
Usage
First edit site_list.yaml with a text editor. Then from a console type: ruby mutti_html.rb start_auto.
Options
Download, parse and store a site:
ruby mutti_html.rb www.website.com/index.html
Reset the database:
ruby mutti_html.rb reset
Generate a stat HTML file:
ruby mutti_html.rb gen_stats