Extracting images from text mail archives

When you back up or save emails, one common format is plain text. Attachments are then stored in the file as base64-encoded data. I wrote this script to find the known signatures of images at the start of base64 attachments and write the images out to disk.
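
Each signature is the base64 encoding of the format's leading magic bytes, which is why it can be matched at the start of a line without decoding anything. Here is a minimal sketch of how such a prefix could be derived. `base64_prefix` is a hypothetical helper for illustration, not part of the script:

```ruby
require 'base64'

# Hypothetical helper (not part of the script): encode only whole
# 3-byte groups, so the resulting characters do not depend on
# whatever bytes follow the magic number in a real file.
def base64_prefix(magic)
  whole = magic.bytesize - (magic.bytesize % 3)
  Base64.strict_encode64(magic.byteslice(0, whole))
end

puts base64_prefix("GIF89a")               # GIF header          => R0lGODlh
puts base64_prefix("\x89PNG\r\n\x1A\n".b)  # PNG signature       => iVBORw0K
puts base64_prefix("\xFF\xD8\xFF".b)       # JPEG start of image => /9j/
```

Longer signatures match more of the header and so give fewer false positives, at the cost of missing files whose headers differ slightly.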

Pass it the name of the file you want to read, or it will read from stdin. Ruby performs well at this kind of job: on my system it processed a test file at 50+ MB/sec. Full code after the break.

require 'base64'
require 'zlib'

in_base64 = false
attachment = []
ext = ""
crcs = {}
file_base = 0
line_count = 0

signatures = {
  "TU0AKgAKzHiAN1oM" => ".tiff",
  "R0lGODlh" => ".gif",
  "iVBORw0KGgoAAAANSUhEU" => ".png",
  "/9j/4AAQSkZJRgA" => ".jpg"
}


ARGF.each do |line|
  line_count+=1
  # If we aren't in the middle of an attachment, look
  # for an image signature at the start of the line
  if not in_base64
    signatures.each do |sig, extension|
      if line.start_with? sig
        in_base64 = true
        # Start a fresh buffer with the first base64 line
        attachment = [line]
        ext = extension
        break
      end
    end
  else
    if line.start_with? "--"
      # End of the base64 block (MIME boundary): write out the file
      encoded = attachment.join
      length = encoded.length
      crc = Zlib.crc32(encoded)
      # A matching CRC32 and length means this file was already written
      if crcs[crc] == length
        puts "Duplicate file, skipping"
      else
        puts "Writing file #{file_base}#{ext}"
        File.open("#{file_base}#{ext}", 'wb') do |f|
          f.write(Base64.decode64(encoded))
        end
        file_base += 1
        crcs[crc] = length
      end
      in_base64 = false
      ext = ""
    else
      #middle of a base64 block, save the line
      attachment << line
    end
  end
end
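
To try the approach without a real archive, the same line-by-line state machine can be wrapped in a method and fed a synthetic message. This is a rough sketch only: `extract_images`, the trimmed `SIGNATURES` table, and the sample mail are all hypothetical. Unlike the script above, it also flushes an attachment that runs to the end of the input without a closing boundary.

```ruby
require 'base64'
require 'stringio'

# Trimmed, illustrative copy of the signature table.
SIGNATURES = {
  "R0lGODlh" => ".gif",
  "iVBORw0KGgoAAAANSUhEU" => ".png"
}.freeze

# Hypothetical helper: returns [extension, decoded_bytes] pairs found in io.
def extract_images(io, signatures = SIGNATURES)
  found = []
  ext = nil      # nil means we are not inside a base64 block
  buffer = []
  io.each_line do |line|
    if ext.nil?
      match = signatures.find { |sig, _| line.start_with?(sig) }
      next unless match
      ext = match[1]
      buffer = [line]
    elsif line.start_with?("--")
      # MIME boundary: the attachment is complete
      found << [ext, Base64.decode64(buffer.join)]
      ext = nil
    else
      buffer << line
    end
  end
  # Flush an attachment that reaches end of input without a boundary
  found << [ext, Base64.decode64(buffer.join)] if ext
  found
end

# Test bytes that merely start with the GIF header; not a valid image.
gif = "GIF89a".b + "\x01\x00\x01\x00\x00\x00\x00".b
mail = "Subject: test\n\n--boundary\n" \
       "Content-Transfer-Encoding: base64\n\n" +
       Base64.encode64(gif) + "--boundary--\n"

images = extract_images(StringIO.new(mail))
images.each { |e, data| puts "#{e}: #{data.bytesize} bytes" }
```

Returning the decoded bytes instead of writing files keeps the sketch easy to test. Writing to disk and skipping duplicates would be layered on top, as in the script.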