I am a big fan of the Mobwars game on Facebook. It is pretty fun, but there are lots of tedious things that you have to do before you start making money. One of these is attacking other users to boost your rating or steal money.
As a fun task, I decided to see what I could do with page scraping and Mechanize. Here are some examples of how to use it.
Before doing anything, you have to log in to Facebook. Mechanize keeps sessions and cookies around, so once you log in, you are good for the entire run of your script.
To log in to Facebook, all you have to do is find the login form, fill in the username and password, and submit it:
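require 'rubygems'
require 'mechanize'

# $username is not defined in this snippet; set it before this code runs.
$fbUrl = "http://www.facebook.com"
$agent = WWW::Mechanize.new
$agent.user_agent_alias = "Mac FireFox"
page = $agent.get($fbUrl)
# Only log in if we landed on the logged-out welcome page.
if page.title == "Welcome to Facebook! | Facebook"
  loginf = page.form('loginform')
  loginf.email = $username
  # Prompt for the password if it has not been set yet.
  if not $pwd
    print "Enter your password: "
    $pwd = $stdin.gets.chomp
  end
  loginf.pass = $pwd
  $agent.submit(loginf, loginf.buttons.first)
end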
Once you have done this, you can access any app pages on Facebook. Some apps are Flash based; others are plain HTML and JavaScript.
For Mobwars, it is just HTML with very little JavaScript involved. One great benefit of Mechanize is that it uses Hpricot to represent the pages. This means you can use XPath expressions to pull out data that would otherwise be hard to get at.
One thing that you need to do quite often is get your stats. They are shown on every page you load and include your cash, health, energy, stamina, and exp. They all sit in a table with a fairly well-defined format:
<table id="app8743457343_statusMenu" >
  <tr>
    <td>
      <div class="wrapOuter wrap3outer">Cash:$0 </div>
    </td>
    <td>
      <div class="wrapOuter wrap3outer">Health: 159/210 </div>
    </td>
    <td>
      <div class="wrapOuter wrap3outer">Energy: 8/35 </div>
    </td>
    <td>
      <div class="wrapOuter wrap3outer">Stamina: 10/17 </div>
    </td>
    <td>
      <div class="wrapOuter wrap3outer">Exp: 7558</div>
    </td>
    <td>
      <div class="wrapOuter wrap3outer">
        Level: 25
        <div style="overflow: hidden; height: 3px; background-color: #000000;">
          <div style="overflow: hidden; background-color: #999999; background-image: url('http://i306.photobucket.com/albums/nn250/mobwars/site/progress_small.gif'); width: 86%;"> </div>
        </div>
      </div>
    </td>
  </tr>
</table>
This is the function that we will use to scrape that section:
def getStats(page=nil)
  page = $agent.get($homeUrl) if page.nil?
  cashStr = ""
  page.search("//table[@id='app8743457343_statusMenu']") do |row|
    row.search("//td/div/text()").each do |result|
      result = result.to_s.strip
      if result =~ /([a-zA-Z]+): ([0-9]+)\/([0-9]+)/
        case $1
        when "Health" then $health = $2; $maxHealth = $3
        when "Energy" then $energy = $2; $maxEnergy = $3
        when "Stamina" then $stam = $2; $maxStam = $3
        end
      end
      if result =~ /([a-zA-Z]+):[ ]?([\$,0-9]+)/
        case $1
        when "Cash" then cashStr = $2
        when "Exp" then $exp = $2
        end
      end
    end
  end
  if ($health.to_f / $maxHealth.to_i) < 0.2
    $hospital = true
  else
    $hospital = false
  end
  cashStr.gsub!(/\$/,'')
  cashStr.gsub!(/,/,'')
  $cash = cashStr.to_i
  hosp = ""
  hosp = "!" if $hospital
  puts "H=" + $health + hosp + "/" + $maxHealth + " E=" + $energy + "/" + $maxEnergy + " S=" + $stam + "/" + $maxStam
  puts "CASH=" + $cash.to_s
end
Because the stats appear on every page, I allow the page to be passed in as a parameter. If another function is doing something and needs to know the stats, it can just pass in its result page and the stats will be updated from that.
The first thing we do is find the stats table:
page.search("//table[@id='app8743457343_statusMenu']") do |row|
This finds the table with the given id and runs the block once for each match. Note that, despite its name, the block variable row holds the table element itself rather than an individual row.
The next section searches inside that table and pulls out the text of all the div tags:
row.search("//td/div/text()").each do |result|
After that, we strip the extra whitespace from each line, leaving very simple strings. These strings come in two formats: one with the type and a current/max pair, and another with the type and a raw number. We have two expressions to match them, and we save the values into global variables so that all functions can access them.
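The two expressions can be tried out without touching Facebook at all. The following is a minimal sketch, assuming the stripped strings look like the ones in the table above. The name parse_stats is made up for this example, the cash value is invented to show the comma handling, and the function returns a hash instead of setting the globals that getStats uses. Unlike getStats, it records any label it sees, so Level ends up in the hash too.

```ruby
# Hypothetical helper, not part of the original script: parses the stripped
# div texts into a hash instead of setting globals.
def parse_stats(lines)
  stats = {}
  lines.each do |line|
    line = line.strip
    if line =~ /\A([a-zA-Z]+): ?(\d+)\/(\d+)/
      # "Health: 159/210" style: current and maximum value
      key = $1.downcase
      stats[key.to_sym] = $2.to_i
      stats["max_#{key}".to_sym] = $3.to_i
    elsif line =~ /\A([a-zA-Z]+): ?([\$,0-9]+)/
      # "Cash:$0" or "Exp: 7558" style: drop "$" and "," before converting
      stats[$1.downcase.to_sym] = $2.delete('$,').to_i
    end
  end
  stats
end

# Sample strings modelled on the table above (the cash amount is invented).
sample = ['Cash:$1,250 ', 'Health: 159/210 ', 'Energy: 8/35 ',
          'Stamina: 10/17 ', 'Exp: 7558', 'Level: 25']
stats = parse_stats(sample)
puts stats.inspect
```

Returning a hash keeps the parsing step easy to test on its own; the values could still be copied into globals afterwards if the rest of the script expects them there.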
You would write a function like this for each action that you want to accomplish, and then tie it together with some logic to perform whatever functions you may need.
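As a rough illustration of that kind of glue logic, the sketch below picks the next step from a hash of stats. The function next_action and the action names (:heal, :attack, :wait) are hypothetical, and so is the idea that attacking needs stamina; only the 20% health threshold is taken from getStats.

```ruby
# Hypothetical glue logic: decide what the script should do next.
# Only the 20% health threshold comes from getStats; the rest is assumed.
def next_action(stats)
  health_ratio = stats[:health].to_f / stats[:max_health]
  return :heal   if health_ratio < 0.2     # too hurt to fight
  return :attack if stats[:stamina] > 0    # assumes attacking costs stamina
  :wait                                    # nothing to spend, try again later
end

puts next_action(health: 159, max_health: 210, stamina: 10)
puts next_action(health: 30,  max_health: 210, stamina: 10)
puts next_action(health: 159, max_health: 210, stamina: 0)
```

Each returned symbol would then map to one of the scraping functions described above.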
As a general note, page scraping is inherently unreliable. As soon as the page layout changes, your scraping breaks and you have to go back and fix it. Expect to do this fairly often.
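One way to soften this is to have the script fail loudly when the values it expects are missing, instead of carrying on with empty ones. A minimal sketch, assuming the stats have been collected into a hash; check_stats!, REQUIRED_STATS, and ScrapeError are names invented for this example:

```ruby
# Hypothetical guard: raise a clear error when the layout has changed and
# the scraper no longer finds the values it expects.
class ScrapeError < StandardError; end

REQUIRED_STATS = [:cash, :health, :max_health, :energy, :stamina, :exp]

def check_stats!(stats)
  missing = REQUIRED_STATS.reject { |key| stats.key?(key) }
  unless missing.empty?
    raise ScrapeError, "page layout may have changed, missing: #{missing.join(', ')}"
  end
  stats
end

begin
  # Deliberately incomplete stats to trigger the error.
  check_stats!(cash: 0, health: 159, max_health: 210)
rescue ScrapeError => e
  puts e.message
end
```

A check like this does not stop the breakage, but it turns a confusing failure later in the run into an obvious one at the point where the page was read.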