[Forensics-changes] [cewl] 01/02: Imported Upstream version 5.1

Joao Eriberto Mota Filho eriberto at moszumanska.debian.org
Wed Dec 31 13:01:27 UTC 2014


This is an automated email from the git hooks/post-receive script.

eriberto pushed a commit to branch debian
in repository cewl.

commit c5f38e9f3c9a080d252cdc2ba02890220df2410f
Author: Joao Eriberto Mota Filho <eriberto at debian.org>
Date:   Wed Dec 31 10:59:54 2014 -0200

    Imported Upstream version 5.1
---
 README      | 183 +++++++++++
 cewl.rb     | 995 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 cewl_lib.rb | 234 ++++++++++++++
 fab.rb      |  88 ++++++
 4 files changed, 1500 insertions(+)

diff --git a/README b/README
new file mode 100644
index 0000000..32fec46
--- /dev/null
+++ b/README
@@ -0,0 +1,183 @@
+CeWL - Custom Word List generator
+=================================
+
+Copyright(c) 2012, Robin Wood <robin at digininja.org>
+
+Based on a discussion on PaulDotCom about creating custom word lists by
+spidering a target's website and collecting unique words, I decided to write
+CeWL, the Custom Word List generator. CeWL is a Ruby app which spiders a
+given URL to a specified depth, optionally following external links, and
+returns a list of words which can then be used for password crackers such
+as John the Ripper.
+
+By default, CeWL sticks to just the site you have specified and will go to a
+depth of 2 links; this behaviour can be changed by passing arguments. Be
+careful if setting a large depth and allowing it to go offsite, you could end
+up drifting on to a lot of other domains. All words of three characters and
+over are output to stdout. This length can be increased and the words can be
+written to a file rather than the screen so the app can be automated.
+
+CeWL also has an associated command line app, FAB (Files Already Bagged),
+which uses the same meta data extraction techniques to create author/creator
+lists from files already downloaded.
+
+Change Log
+==========
+
+Version 5.0
+-----------
+
+Adds proxy support from the command line and the ability to pass in
+credentials for both basic and digest authentication.
+
+A few other smaller bug fixes as well.
+
+Version 4.3
+-----------
+
+CeWL now sorts the words found by count and optionally (new --count argument)
+includes the word count in the output. I've left the words in the case they
+appear in on the pages, so "Product" is different to "product"; I figure that
+if the list is being used for password generation then the case may be
+significant, so let the user strip it if they want to. There are also more
+improvements to the stability of the spider in this release.
+
+By default, CeWL sticks to just the site you have specified and will go to a
+depth of 2 links; this behaviour can be changed by passing arguments. Be
+careful if setting a large depth and allowing it to go offsite, you could end
+up drifting on to a lot of other domains. All words of three characters
+and over are output to stdout. This length can be increased and the words can
+be written to a file rather than the screen so the app can be automated.
+
+Version 4.2
+-----------
+
+Fixes a pretty major bug that I found while fixing a smaller bug for @yorikv.
+The bug was related to a hack I had to put in place because of a problem I was
+having with the spider. While I was looking into it I spotted this line, which
+is the one the spider uses to find new links in downloaded pages:
+
+	web_page.scan(/href="(.*?)"/i).flatten.map do |link|
+
+This is fine if all the links look like this:
+
+	<a href="test.php">link</a>
+
+But if the link looks like either of these:
+
+	<a href='test.php'>link</a>
+	<a href=test.php>link</a>
+
+the regex will fail so the links will be ignored.
+
+To fix this up I've had to override the function that parses the page to find
+all the links. Rather than use a regex, I've changed it to use Nokogiri, which
+is designed to parse a page looking for links rather than just running through
+it with a custom regex. This brings in a new dependency but I think it is worth
+it for the fix to the functionality. I also found another bug where a link like
+this:
+
+	<a href='#name'>local</a>
+
+which should be ignored, as it just links to an internal name, was actually
+being translated to '/#name', which may unintentionally mean referencing the
+index page. I've fixed this one as well after a lot of debugging to find out
+how best to do it.
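+
+In essence, the new link extraction boils down to something like this (a
+simplified sketch of the override that appears further down in cewl.rb, where
+web_page holds the raw page body):
+
+	require 'nokogiri'
+
+	doc = Nokogiri::HTML(web_page)
+	links = doc.css('a').map { |a| a['href'] }.compact
+	links = links.reject { |link| link.start_with?('#') }  # drop in-page anchors such as #name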
+
+A final addition is to allow a user to specify a depth of 0 which allows CeWL
+to spider a single page.
+
+I'm only putting this out as a point release as I'd like to rewrite the
+spidering to use a better spider; that will come out as the next major release.
+
+Version 4.0/4.1
+---------------
+
+The main change in version 4.0/1 is the upgrade to run with Ruby 1.9.x; this
+has been tested on various machines and on BT5, as that is a popular platform
+for running it, and it appears to run fine. Another minor change is that up to
+version 4 all HTML tags were stripped out before the page was parsed for words,
+which meant that text in alt and title tags was missed. I now grab the text
+from those tags before stripping the HTML to give those extra few words.
+
+Version 3
+---------
+
+Addresses a problem spotted by Josh Wright. The Spider gem doesn't handle
+JavaScript redirection URLs, for example an index page containing just the
+following:
+
+	<script language="JavaScript">
+	self.location.href =
+	'http://www.FOO.com/FOO/connect/FOONet/Top+Navigator/Home';
+	</script>
+
+wasn't spidered because the redirect wasn't picked up. I now scan through a
+page looking for any lines containing location.href= and then add the given
+URL to the list of pages to spider. 
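+
+The scan is roughly equivalent to this simplified sketch (a_url, body and
+url_stack are the names used later in cewl.rb; the real code also strips the
+match out of the body and fixes up relative URLs):
+
+	body.scan(/location\.href\s*=\s*["']([^"']+)["']/i).flatten.each do |j_url|
+		url_stack.push a_url => j_url  # queue the redirect target for spidering
+	end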
+
+Version 2
+---------
+
+Version 2 of CeWL can also create two new lists, a list of email addresses
+found in mailto links and a list of author/creator names collected from meta
+data found in documents on the site. It can currently process documents in
+Office pre 2007, Office 2007 and PDF formats. This user data can then be used
+to create the list of usernames to be used in association with the password
+list.
+
+Pronunciation
+=============
+Seeing as I was asked, CeWL is pronounced "cool".
+
+Installation
+============
+CeWL needs the rubygems package to be installed along with the following gems:
+
+* mime-types
+* mini_exiftool
+* rubyzip
+* spider
+
+All these gems can be installed by running "gem install <gem name>" as root.
+The mini_exiftool gem also requires the exiftool application to be installed.
+
+Then just save CeWL to a directory and make it executable.
+
+The project page on my site gives some tips on solving common problems people
+have encountered while running CeWL - http://www.digininja.org/projects/cewl.php
+
+Usage
+=====
+Usage: cewl [OPTION] ... URL
+	--help, -h: show help
+	--depth x, -d x: depth to spider to, default 2
+	--min_word_length, -m: minimum word length, default 3
+	--offsite, -o: let the spider visit other sites
+	--write, -w file: write the output to the file
+	--ua, -u user-agent: useragent to send
+	--no-words, -n: don't output the wordlist
+	--meta, -a: include meta data
+	--meta_file file: file for metadata output
+	--email, -e: include email addresses
+	--email_file file: file for email output
+	--meta-temp-dir directory: the temporary directory used by exiftool when parsing files, default /tmp
+	-v: verbose
+
+	URL: The site to spider.
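+
+	For example, spidering to a depth of 3 and writing all words of at least
+	5 characters to a file might look like this (hypothetical target):
+
+	cewl --depth 3 --min_word_length 5 --write wordlist.txt http://www.example.com/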
+
+Ruby Doc
+========
+CeWL is commented up in Ruby Doc format.
+
+Licence
+=======
+This project is released under the Creative Commons Attribution-Share Alike
+2.0 UK: England & Wales licence.
+
+( http://creativecommons.org/licenses/by-sa/2.0/uk/ )
+
+
+Alternatively, you can use GPL-3+ instead of the original licence.
+
+( http://opensource.org/licenses/GPL-3.0 )
diff --git a/cewl.rb b/cewl.rb
new file mode 100755
index 0000000..854281e
--- /dev/null
+++ b/cewl.rb
@@ -0,0 +1,995 @@
+#!/usr/bin/env ruby
+
+# == CeWL: Custom Word List Generator
+#
+# CeWL will spider a target site and generate up to three lists:
+#
+# * A word list of all unique words found on the target site
+# * A list of all email addresses found in mailto links
+# * A list of usernames/author details from meta data found in any documents on the site
+#
+# == Usage
+#
+# cewl [OPTION] ... URL
+#
+# -h, --help:
+#	show help
+#
+# --depth x, -d x:
+#	depth to spider to, default 2
+#
+# --min_word_length, -m:
+#	minimum word length, default 3
+#
+# --email file, -e
+# --email_file file:
+#	include any email addresses found during the spidering; email_file is an
+#	optional output file, if not given the output is added to the default output
+#
+# --meta file, -a
+# --meta_file file:
+#	include any meta data found during the spidering; meta_file is an optional
+#	output file, if not given the output is added to the default output
+#
+# --no-words, -n
+#	don't output the wordlist
+#
+# --offsite, -o:
+#	let the spider visit other sites
+#
+# --write, -w file:
+#	write the words to the file
+#
+# --ua, -u user-agent:
+#	useragent to send
+#
+# --meta-temp-dir directory:
+#	the temporary directory used by exiftool when parsing files, default /tmp
+#
+# --keep, -k:
+#   keep the documents that are downloaded
+#
+# --count, -c:
+#   show the count for each of the words found
+#
+# -v
+#	verbose
+#
+# URL: The site to spider.
+#
+# Author:: Robin Wood (robin at digi.ninja)
+# Copyright:: Copyright (c) Robin Wood 2014
+# Licence:: CC-BY-SA 2.0 or GPL-3+
+#
+
+VERSION = "5.1"
+
+puts"CeWL #{VERSION} Robin Wood (robin at digi.ninja) (http://digi.ninja)"
+puts
+
+begin
+	require 'getoptlong'
+	require 'spider'
+	require 'nokogiri'
+	require 'net/http'
+rescue LoadError => e
+	# catch the error and provide feedback on installing the missing gem
+	if e.to_s =~ /cannot load such file -- (.*)/
+		missing_gem = $1
+		puts "\nError: #{missing_gem} gem not installed\n"
+		puts "\t use: \"gem install #{missing_gem}\" to install the required gem\n\n"
+		exit
+	else
+		puts "There was an error loading the gems:"
+		puts
+		puts e.to_s
+		exit
+	end
+end
+
+require './cewl_lib'
+
+# Doing this so I can override the allowed? function which normally checks
+# the robots.txt file
+class MySpider<Spider
+	@@proxy_host = nil
+	@@proxy_port = nil
+	@@proxy_username = nil
+	@@proxy_password = nil
+
+	@@auth_type = nil
+	@@auth_user = nil
+	@@auth_password = nil
+	@@verbose = false
+
+	def self.proxy (host, port = nil, username = nil, password = nil)
+		@@proxy_host = host
+		port = 8080 if port.nil?
+		@@proxy_port = port
+		@@proxy_username = username
+		@@proxy_password = password
+	end
+
+	def self.auth_creds (type, user, password)
+		@@auth_type = type
+		@@auth_user = user
+		@@auth_password = password
+	end
+
+	def self.verbose (val)
+		@@verbose = val
+	end
+
+	# Create an instance of MySpiderInstance rather than SpiderInstance
+	def self.start_at(a_url, &block)
+		rules = RobotRules.new('Ruby Spider 1.0')
+		a_spider = MySpiderInstance.new({nil => a_url}, [], rules, [])
+		a_spider.auth_type = @@auth_type
+		a_spider.auth_user = @@auth_user
+		a_spider.auth_password = @@auth_password
+
+		a_spider.proxy_host = @@proxy_host
+		a_spider.proxy_port = @@proxy_port
+		a_spider.proxy_username = @@proxy_username
+		a_spider.proxy_password = @@proxy_password
+
+		a_spider.verbose = @@verbose
+		block.call(a_spider)
+		a_spider.start!
+	end
+end
+
+# My version of the spider class which allows all files
+# to be processed
+class MySpiderInstance<SpiderInstance
+	attr_writer :auth_type
+	attr_writer :auth_user
+	attr_writer :auth_password
+
+	attr_writer :proxy_host
+	attr_writer :proxy_port
+	attr_writer :proxy_username
+	attr_writer :proxy_password
+
+	attr_writer :verbose
+
+	# Force all files to be allowed
+	# Normally the robots.txt file will be honoured
+	def allowed?(a_url, parsed_url)
+		true
+	end
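+	# Reworked spider loop: pops queued URLs, filters them through
+	# allowable_url?, runs the page callbacks and pushes each newly discovered
+	# URL back onto the queue, exiting cleanly if Ctrl-C is pressed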
+	def start! #:nodoc: 
+		interrupted = false
+		trap("SIGINT") { interrupted = true } 
+		begin
+			next_urls = @next_urls.pop
+			tmp_n_u = {}
+			next_urls.each do |prior_url, urls|
+				x = []
+				urls.each_line do |a_url|
+					x << [a_url, (URI.parse(a_url) rescue nil)]
+				end
+				y = []
+				x.select do |a_url, parsed_url|
+					y << [a_url, parsed_url] if allowable_url?(a_url, parsed_url)
+				end
+				y.each do |a_url, parsed_url|
+					@setup.call(a_url) unless @setup.nil?
+					get_page(parsed_url) do |response|
+						do_callbacks(a_url, response, prior_url)
+						#tmp_n_u[a_url] = generate_next_urls(a_url, response)
+						#@next_urls.push tmp_n_u
+						generate_next_urls(a_url, response).each do |a_next_url|
+							#puts 'pushing ' + a_next_url
+							@next_urls.push a_url => a_next_url
+						end
+						#exit if interrupted
+					end
+					@teardown.call(a_url) unless @teardown.nil?
+					exit if interrupted
+				end
+			end
+		end while !@next_urls.empty?
+	end
+
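+	# Fetch a single page, optionally through a proxy, handling HTTPS and
+	# basic/digest authentication; redirect targets are pushed back onto the
+	# URL queue and the response is yielded to the given block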
+	def get_page(uri, &block) #:nodoc:
+		@seen << uri
+		
+		begin
+			if @proxy_host.nil?
+				http = Net::HTTP.new(uri.host, uri.port)
+
+				if uri.scheme == 'https'
+					http.use_ssl = true
+					http.verify_mode = OpenSSL::SSL::VERIFY_NONE
+				end
+			else
+				proxy = Net::HTTP::Proxy(@proxy_host, @proxy_port, @proxy_username, @proxy_password)
+				begin
+					if uri.scheme == 'https'
+						http = proxy.start(uri.host, uri.port, :use_ssl => true, :verify_mode => OpenSSL::SSL::VERIFY_NONE)
+					else
+						http = proxy.start(uri.host, uri.port)
+					end
+				rescue => e
+					puts "Failed to connect to the proxy"
+					exit
+				end
+			end
+			
+			req = Net::HTTP::Get.new(uri.request_uri, @headers)
+			
+			if !@auth_type.nil?
+				case @auth_type
+					when "digest"
+						uri.user = @auth_user
+						uri.password = @auth_password
+
+						res = http.request req
+
+						if not res['www-authenticate'].nil?
+							digest_auth = Net::HTTP::DigestAuth.new
+							auth = digest_auth.auth_header uri, res['www-authenticate'], 'GET'
+
+							req = Net::HTTP::Get.new uri.request_uri
+							req.add_field 'Authorization', auth
+						end
+
+					when "basic"
+						req.basic_auth @auth_user, @auth_password
+				end
+			end
+			res = http.request(req)
+			
+			if res.redirect?
+				#puts "redirect url"
+				base_url = uri.to_s[0, uri.to_s.rindex('/')]
+				new_url = URI.parse(construct_complete_url(base_url,res['Location']))
+
+				# If auth is used then a name:pass@ gets added, this messes the tree
+				# up so easiest to just remove it
+				current_uri = uri.to_s.gsub(/:\/\/[^:]*:[^@]*@/, "://")
+				@next_urls.push current_uri => new_url.to_s
+			elsif res.code == "401"
+				puts "Authentication required, can't continue on this branch - #{uri}" if @verbose
+			else
+				block.call(res)
+			end
+		rescue  => e
+			puts "Unable to connect to the site, run in verbose mode for more information"
+			if @verbose
+				puts
+				puts "The following error may help:"
+				puts e.to_s
+			end
+			exit
+		end
+	end
+	# Overriding so that I can get it to ignore direct names - i.e. #name
+	def construct_complete_url(base_url, additional_url, parsed_additional_url = nil) #:nodoc:
+		if additional_url =~ /^#/
+			return nil
+		end
+		parsed_additional_url ||= URI.parse(additional_url)
+		case parsed_additional_url.scheme
+			when nil
+				u = base_url.is_a?(URI) ? base_url : URI.parse(base_url)
+				if additional_url[0].chr == '/'
+					"#{u.scheme}://#{u.host}#{additional_url}"
+				elsif u.path.nil? || u.path == ''
+					"#{u.scheme}://#{u.host}/#{additional_url}"
+				elsif u.path[0].chr == '/'
+					"#{u.scheme}://#{u.host}#{u.path}/#{additional_url}"
+				else
+					"#{u.scheme}://#{u.host}/#{u.path}/#{additional_url}"
+				end
+			else
+				additional_url
+		end
+	end
+
+	# Overriding the original spider one as it doesn't find hrefs very well
+	def generate_next_urls(a_url, resp) #:nodoc:
+		web_page = resp.body
+		if URI.parse(a_url).path == ""
+			base_url = a_url
+		else
+			base_url = a_url[0, a_url.rindex('/')]
+		end
+
+		doc = Nokogiri::HTML(web_page)
+		links = doc.css('a').map{ |a| a['href'] }
+		links.map do |link|
+			begin
+				if link.nil?
+					nil
+				else
+					begin
+						parsed_link = URI.parse(link)
+						if parsed_link.fragment == '#'
+							nil
+						else
+							construct_complete_url(base_url, link, parsed_link)
+						end
+					rescue
+						nil
+					end
+				end
+			rescue => e
+				puts "There was an error generating URL list"
+				puts "Error: " + e.inspect
+				puts e.backtrace
+				exit
+			end
+		end.compact
+	end
+end
+
+# A node for a tree
+class TreeNode
+	attr :value
+	attr :depth
+	attr :key
+	attr :visited, true
+	def initialize(key, value, depth)
+		@key=key
+		@value=value
+		@depth=depth
+		@visited=false
+	end
+
+	def to_s
+		if key==nil
+			return "key=nil value="+@value+" depth="+@depth.to_s+" visited="+@visited.to_s
+		else
+			return "key="+@key+" value="+@value+" depth="+@depth.to_s+" visited="+@visited.to_s
+		end
+	end
+	def to_url_hash
+		return({@key=>@value})
+	end
+end
+
+# A tree structure
+class Tree
+	attr :data
+	@max_depth
+	@children
+
+	# Get the maximum depth the tree can grow to
+	def max_depth
+		@max_depth
+	end
+
+	# Set the max depth the tree can grow to
+	def max_depth=(val)
+		@max_depth=Integer(val)
+	end
+	
+	# As this is used to work out if there are any more nodes to process, it isn't a true empty check
+	def empty?
+		if !@data.visited
+			return false
+		else
+			@children.each { |node|
+				if !node.data.visited
+					return false
+				end
+			}
+		end
+		return true
+	end
+
+	# The constructor
+	def initialize(key=nil, value=nil, depth=0)
+		@data=TreeNode.new(key,value,depth)
+		@children = []
+		@max_depth = 2
+	end
+
+	# Iterator
+	def each
+		yield @data
+		@children.each do |child_node|
+			child_node.each { |e| yield e }
+		end
+	end
+
+	# Remove an item from the tree
+	def pop
+		if !@data.visited
+			@data.visited=true
+			return @data.to_url_hash
+		else
+			@children.each { |node|
+				if !node.data.visited
+					node.data.visited=true
+					return node.data.to_url_hash
+				end
+			}
+		end
+		return nil
+	end
+
+	# Push an item onto the tree
+	def push(value)
+		key=value.keys.first
+		value=value.values_at(key).first
+
+		if key==nil
+			@data=TreeNode.new(key,value,0)
+		else
+			# if the depth is 0 then don't add anything to the tree
+			if @max_depth == 0
+				return
+			end
+			if key==@data.value
+				child=Tree.new(key,value, @data.depth+1)
+				@children << child
+			else
+				@children.each { |node|
+					if node.data.value==key && node.data.depth<@max_depth
+						child=Tree.new(key,value, node.data.depth+1)
+						@children << child
+					end
+				}
+			end
+		end
+	end
+end
+
+opts = GetoptLong.new(
+	[ '--help', '-h', GetoptLong::NO_ARGUMENT ],
+	[ '--keep', '-k', GetoptLong::NO_ARGUMENT ],
+	[ '--depth', '-d', GetoptLong::OPTIONAL_ARGUMENT ],
+	[ '--min_word_length', "-m" , GetoptLong::REQUIRED_ARGUMENT ],
+	[ '--no-words', "-n" , GetoptLong::NO_ARGUMENT ],
+	[ '--offsite', "-o" , GetoptLong::NO_ARGUMENT ],
+	[ '--write', "-w" , GetoptLong::REQUIRED_ARGUMENT ],
+	[ '--ua', "-u" , GetoptLong::REQUIRED_ARGUMENT ],
+	[ '--meta-temp-dir', GetoptLong::REQUIRED_ARGUMENT ],
+	[ '--meta_file', GetoptLong::REQUIRED_ARGUMENT ],
+	[ '--email_file', GetoptLong::REQUIRED_ARGUMENT ],
+	[ '--meta', "-a" , GetoptLong::NO_ARGUMENT ],
+	[ '--email', "-e" , GetoptLong::NO_ARGUMENT ],
+	[ '--count', '-c', GetoptLong::NO_ARGUMENT ],
+	[ '--auth_user', GetoptLong::REQUIRED_ARGUMENT ],
+	[ '--auth_pass', GetoptLong::REQUIRED_ARGUMENT ],
+	[ '--auth_type', GetoptLong::REQUIRED_ARGUMENT ],
+	[ '--proxy_host', GetoptLong::REQUIRED_ARGUMENT ],
+	[ '--proxy_port', GetoptLong::REQUIRED_ARGUMENT ],
+	[ '--proxy_username', GetoptLong::REQUIRED_ARGUMENT ],
+	[ '--proxy_password', GetoptLong::REQUIRED_ARGUMENT ],
+	[ "--verbose", "-v" , GetoptLong::NO_ARGUMENT ]
+)
+
+# Display the usage
+def usage
+	puts "Usage: cewl [OPTION] ... URL
+	--help, -h: show help
+	--keep, -k: keep the downloaded file
+	--depth x, -d x: depth to spider to, default 2
+	--min_word_length, -m: minimum word length, default 3
+	--offsite, -o: let the spider visit other sites
+	--write, -w file: write the output to the file
+	--ua, -u user-agent: useragent to send
+	--no-words, -n: don't output the wordlist
+	--meta, -a: include meta data
+	--meta_file file: output file for meta data
+	--email, -e: include email addresses
+	--email_file file: output file for email addresses
+	--meta-temp-dir directory: the temporary directory used by exiftool when parsing files, default /tmp
+	--count, -c: show the count for each word found
+
+	Authentication
+	--auth_type: digest or basic
+	--auth_user: authentication username
+	--auth_pass: authentication password
+	
+	Proxy Support
+	--proxy_host: proxy host
+	--proxy_port: proxy port, default 8080
+	--proxy_username: username for proxy, if required
+	--proxy_password: password for proxy, if required
+
+	--verbose, -v: verbose
+
+	URL: The site to spider.
+
+"
+	exit
+end
+
+verbose=false
+ua=nil
+url = nil
+outfile = nil
+email_outfile = nil
+meta_outfile = nil
+offsite = false
+depth = 2
+min_word_length=3
+email=false
+meta=false
+wordlist=true
+meta_temp_dir="/tmp/"
+keep=false
+show_count = false
+auth_type = nil
+auth_user = nil
+auth_pass = nil
+
+proxy_host = nil
+proxy_port = nil
+proxy_username = nil
+proxy_password = nil
+
+begin
+	opts.each do |opt, arg|
+		case opt
+		when '--help'
+			usage
+		when "--count"
+			show_count = true
+		when "--meta-temp-dir"
+			if !File.directory?(arg)
+				puts "Meta temp directory is not a directory\n"
+				exit
+			end
+			if !File.writable?(arg)
+				puts "The meta temp directory is not writable\n"
+				exit
+			end
+			meta_temp_dir=arg
+			if meta_temp_dir !~ /.*\/$/
+				meta_temp_dir+="/"
+			end
+		when "--keep"
+			keep=true
+		when "--no-words"
+			wordlist=false
+		when "--meta_file"
+			meta_outfile = arg
+		when "--meta"
+			meta=true
+		when "--email_file"
+			email_outfile = arg
+		when "--email"
+			email=true
+		when '--min_word_length'
+			min_word_length=arg.to_i
+			if min_word_length<1
+				usage
+			end
+		when '--depth'
+			depth=arg.to_i
+			if depth < 0
+				usage
+			end
+		when '--offsite'
+			offsite=true
+		when '--ua'
+			ua=arg
+		when '--verbose'
+			verbose=true
+		when '--write'
+			outfile=arg
+		when "--proxy_password"
+			proxy_password = arg
+		when "--proxy_username"
+			proxy_username = arg
+		when "--proxy_host"
+			proxy_host = arg
+		when "--proxy_port"
+			proxy_port = arg.to_i
+		when "--auth_pass"
+			auth_pass = arg
+		when "--auth_user"
+			auth_user = arg
+		when "--auth_type"
+			if arg =~ /(digest|basic)/i
+				auth_type=$1.downcase
+				if auth_type == "digest"
+					begin
+						require "net/http/digest_auth"
+					rescue LoadError => e
+						# catch the error and provide feedback on installing the gem
+						puts "\nError: To use digest auth you require the net-http-digest_auth gem, to install it use:\n\n"
+						puts "\t\"gem install net-http-digest_auth\"\n\n"
+						exit
+					end
+				end
+			else
+				puts "Invalid authentication type, please specify either basic or digest"
+				exit
+			end
+		end
+	end
+rescue
+	usage
+end
+
+if !auth_type.nil? and (auth_user.nil? or auth_pass.nil?)
+	puts "If using basic or digest auth you must provide a username and password\n\n"
+	exit
+end
+
+if auth_type.nil? and (!auth_user.nil? or !auth_pass.nil?)
+	puts "Authentication details provided but no mention of basic or digest"
+	exit
+end
+
+if ARGV.length != 1
+	puts "Missing url argument (try --help)"
+	exit 0
+end
+
+url = ARGV.shift
+
+# Must have protocol
+if url !~ /^http(s)?:\/\//
+	url="http://"+url
+end
+
+# The spider doesn't work properly if there isn't a / on the end
+if url !~ /\/$/
+#	Commented out for Yori
+#	url=url+"/"
+end
+
+word_hash = {}
+email_arr=[]
+url_stack=Tree.new
+url_stack.max_depth=depth
+usernames=Array.new()
+
+# Do the checks here so we don't do all the processing then find we can't open the file
+if !outfile.nil?
+	begin
+		outfile_file=File.new(outfile,"w")
+	rescue
+		puts "Couldn't open the output file for writing"
+		exit
+	end
+else
+	outfile_file=$stdout
+end
+
+if !email_outfile.nil? and email
+	begin
+		email_outfile_file=File.new(email_outfile,"w")
+	rescue
+		puts "Couldn't open the email output file for writing"
+		exit
+	end
+else
+	email_outfile_file = outfile_file
+end
+
+if !meta_outfile.nil? and meta
+	begin
+		meta_outfile_file=File.new(meta_outfile,"w")
+	rescue
+		puts "Couldn't open the metadata output file for writing"
+		exit
+	end
+else
+	meta_outfile_file = outfile_file
+end
+
+begin
+	if verbose
+		puts "Starting at " + url
+	end
+
+	if !proxy_host.nil?
+		MySpider.proxy(proxy_host, proxy_port, proxy_username, proxy_password)
+	end
+
+	if !auth_type.nil?
+		MySpider.auth_creds(auth_type, auth_user, auth_pass)
+	end
+	MySpider.verbose(verbose)
+	
+	MySpider.start_at(url) do |s|
+		if ua!=nil
+			s.headers['User-Agent'] = ua
+		end
+
+		s.add_url_check do |a_url|
+			#puts "checking page " + a_url
+			allow=true
+			# Extensions to ignore
+			if a_url =~ /(\.zip$|\.gz$|\.zip$|\.bz2$|\.png$|\.gif$|\.jpg$|^#)/
+				if verbose
+					puts "Ignoring internal link or graphic: "+a_url
+				end
+				allow=false
+			else
+				if /^mailto:(.*)/i.match(a_url)
+					if email
+						email_arr<<$1
+						if verbose
+							puts "Found #{$1} on page #{a_url}"
+						end
+					end
+					allow=false
+				else
+					if !offsite
+						a_url_parsed = URI.parse(a_url)
+						url_parsed = URI.parse(url)
+#							puts 'comparing ' + a_url + ' with ' + url
+
+						allow = (a_url_parsed.host == url_parsed.host)
+
+						if !allow && verbose
+							puts "Offsite link, not following: "+a_url
+						end
+					end
+				end
+			end
+			allow
+		end
+
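+		# For every page fetched successfully: append meta description/keywords
+		# and alt/title attribute text to the body, follow JavaScript redirects,
+		# then extract email addresses, document meta data and the word list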
+		s.on :success do |a_url, resp, prior_url|
+
+			if verbose
+				if prior_url.nil?
+					puts "Visiting: #{a_url}, got response code #{resp.code}"
+				else
+					puts "Visiting: #{a_url} referred from #{prior_url}, got response code #{resp.code}"
+				end
+			end
+			body=resp.body.to_s
+
+			# get meta data
+			if /.*<meta.*description.*content\s*=[\s'"]*(.*)/i.match(body)
+				description=$1
+				body += description.gsub(/[>"\/']*/, "") 
+			end 
+
+			if /.*<meta.*keywords.*content\s*=[\s'"]*(.*)/i.match(body)
+				keywords=$1
+				body += keywords.gsub(/[>"\/']*/, "") 
+			end 
+
+#				puts body
+#				while /mailto:([^'">]*)/i.match(body)
+#					email_arr<<$1
+#					if verbose
+#						puts "Found #{$1} on page #{a_url}"
+#					end
+#				end 
+
+			while /(location.href\s*=\s*["']([^"']*)['"];)/i.match(body)
+				full_match = $1
+				j_url = $2
+				if verbose
+					puts "Javascript redirect found " + j_url
+				end
+
+				re = Regexp.escape(full_match)
+
+				body.gsub!(/#{re}/,"")
+
+				if j_url !~ /https?:\/\//i
+
+# Broken, needs real domain adding here
+# http://docs.seattlerb.org/net-http-digest_auth/Net/HTTP/DigestAuth.html
+
+					domain = "http://ninja.dev/"
+					j_url = domain + j_url
+					if verbose
+						puts "Relative URL found, adding domain to make " + j_url
+					end
+				end
+
+				x = {a_url=>j_url}
+				url_stack.push x
+			end
+
+			# strip comment tags
+			body.gsub!(/<!--/, "")
+			body.gsub!(/-->/, "")
+
+			# If you want to add more attribute names to include, just add them to this array
+			attribute_names = [
+								"alt",
+								"title",
+							]
+
+			attribute_text = ""
+
+			attribute_names.each { |attribute_name|
+				body.gsub!(/#{attribute_name}="([^"]*)"/) { |attr| attribute_text += $1 + " " }
+			}
+
+			if verbose
+				puts "Attribute text found:"
+				puts attribute_text
+				puts
+			end
+
+			body += " " + attribute_text
+
+			# strip html tags
+			words=body.gsub(/<\/?[^>]*>/, "") 
+
+			# check if this is needed
+			words.gsub!(/&[a-z]*;/, "") 
+
+			# may want 0-9 in here as well in the future but for now limit it to a-z so
+			# you can't sneak any nasty characters in
+			if /.*\.([a-z]+)(\?.*$|$)/i.match(a_url)
+				file_extension=$1
+			else
+				file_extension=""
+			end
+
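+			# When meta data extraction is enabled, write the response body to a
+			# temp file (kept with its original name if --keep was given for
+			# recognised document types) and run it through the process_file
+			# extractor in cewl_lib.rb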
+			if meta
+				begin
+					if keep and file_extension =~ /^((doc|dot|ppt|pot|xls|xlt|pps)[xm]?)|(ppam|xlsb|xlam|pdf|zip|gz|zip|bz2)$/
+						if /.*\/(.*)$/.match(a_url)
+							output_filename=meta_temp_dir+$1
+							if verbose
+								puts "Keeping " + output_filename
+							end
+						else
+							# shouldn't ever get here as the regex above should always be able to pull the filename out of the url, 
+							# but just in case
+							output_filename=meta_temp_dir+"cewl_tmp"
+							output_filename += "."+file_extension unless file_extension==""
+						end
+					else
+						output_filename=meta_temp_dir+"cewl_tmp"
+						output_filename += "."+file_extension unless file_extension==""
+					end
+					out=File.new(output_filename, "w")
+					out.print(resp.body)
+					out.close
+
+					meta_data=process_file(output_filename, verbose)
+					if(meta_data!=nil)
+						usernames+=meta_data
+					end
+				rescue => e
+					puts "Couldn't open the meta temp file for writing - " + e.inspect
+					exit
+				end
+			end
+
+			# don't get words from these file types. Most will have been blocked by the url_check function but
+			# some are let through, such as .css, so that they can be checked for email addresses
+
+			# this is a bad way to do this but it is either whitelist or blacklist extensions and
+			# the list of either is quite long, may as well blacklist and let extra through
+			# that can then be weeded out later rather than stop things that could be useful
+			begin
+				if file_extension !~ /^((doc|dot|ppt|pot|xls|xlt|pps)[xm]?)|(ppam|xlsb|xlam|pdf|zip|gz|zip|bz2|css|png|gif|jpg|#)$/
+					begin
+						if email
+							# Split the file down based on the email address regexp
+							#words.gsub!(/\b([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})\b/i)
+							#p words
+
+							# If you want to pull email addresses from the contents of files found, such as word docs then move
+							# this block outside the if statement
+							# I've put it in here as some docs contain email addresses that have nothing to do with the target
+							# so give false positive type results
+							words.each_line do |word|
+								while /\b([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})\b/i.match(word)
+									if verbose
+										puts "Found #{$1} on page #{a_url}"
+									end
+									email_arr<<$1
+									word=word.gsub(/#{$1}/, "")
+								end
+							end
+						end
+					rescue => e
+						puts "There was a problem generating the email list"
+						puts "Error: " + e.inspect
+						puts e.backtrace
+					end
+				
+					if wordlist
+						# remove any symbols
+						words.gsub!(/[^a-z0-9]/i," ")
+						# add to the array
+						words.split(" ").each do |word|
+							if word.length >= min_word_length
+								if !word_hash.has_key?(word)
+									word_hash[word] = 0
+								end
+								word_hash[word] += 1
+							end
+						end
+					end
+				end
+			rescue => e
+				puts "There was a problem handling word generation"
+				puts "Error: " + e.inspect
+			end
+		end
+		s.store_next_urls_with url_stack
+
+	end
+rescue Errno::ENOENT
+	puts "Invalid URL specified"
+	puts
+	exit
+rescue => e
+	puts "Couldn't access the site"
+	puts
+	puts "Error: " + e.inspect
+	puts e.backtrace
+	exit
+end
+
+#puts "end of main loop"
+
+if wordlist
+	puts "Words found\n\n" if verbose
+
+	sorted_wordlist = word_hash.sort_by do |word, count| -count end
+	sorted_wordlist.each do |word, count|
+		if show_count
+			outfile_file.puts word + ', ' + count.to_s
+		else
+			outfile_file.puts word
+		end
+	end
+end
+
+#puts "end of wordlist loop"
+
+if email
+	puts "Dumping email addresses to file" if verbose
+
+	email_arr.delete_if { |x| x.chomp==""}
+	email_arr.uniq!
+	email_arr.sort!
+
+	if (wordlist||verbose) && email_outfile.nil?
+		outfile_file.puts
+	end
+	if email_outfile.nil?
+		outfile_file.puts "Email addresses found"
+		outfile_file.puts email_arr.join("\n")
+	else
+		email_outfile_file.puts email_arr.join("\n")
+	end
+end
+
+#puts "end of email loop"
+
+if meta
+	puts "Dumping meta data to file" if verbose
+	usernames.delete_if { |x| x.chomp==""}
+	usernames.uniq!
+	usernames.sort!
+
+	if (email||wordlist) && meta_outfile.nil?
+		outfile_file.puts
+	end
+	if meta_outfile.nil?
+		outfile_file.puts "Meta data found"
+		outfile_file.puts usernames.join("\n")
+	else
+		meta_outfile_file.puts usernames.join("\n")
+	end
+end
+
+#puts "end of meta loop"
+
+if meta_outfile!=nil
+	meta_outfile_file.close
+end
+
+if email_outfile!=nil
+	email_outfile_file.close
+end
+
+if outfile!=nil
+	outfile_file.close
+end
diff --git a/cewl_lib.rb b/cewl_lib.rb
new file mode 100644
index 0000000..b049ffe
--- /dev/null
+++ b/cewl_lib.rb
@@ -0,0 +1,234 @@
+# == CeWL Library: Library to outsource reusable features
+#
+# Author:: Robin Wood (robin at digininja.org)
+# Copyright:: Copyright (c) Robin Wood 2013
+# Licence:: GPL
+#
+
+begin
+	require 'mini_exiftool'
+	require "zip"
+	require "rexml/document"
+	require 'mime'
+	require 'mime-types'
+	include REXML
+rescue LoadError => e
+	# catch the error and provide feedback on installing the missing gem
+	if e.to_s =~ /cannot load such file -- (.*)/
+		missing_gem = $1
+		puts "\nError: #{missing_gem} gem not installed\n"
+		puts "\t use: \"gem install #{missing_gem}\" to install the required gem\n\n"
+		exit
+	else
+		puts "There was an error loading the gems:"
+		puts
+		puts e.to_s
+		exit
+	end
+end
+
+# Override the MiniExiftool class so that I can modify the parse_line
+# method and force all encoding to ISO-8859-1. Without this the app bombs
+# on some machines as it is unable to parse UTF-8
+class MyMiniExiftool<MiniExiftool
+	def parse_line line
+		line.force_encoding('ISO-8859-1')
+		super	
+	end
+end
+
+# == Synopsis
+#
+# This library contains functions to evaluate files found while running CeWL
+#
+# Author:: Robin Wood (dninja at gmail.com)
+# Copyright:: Copyright (c) Robin Wood 2010
+# Licence:: GPL
+#
+
+# Get data from a pdf file using regexps
+def get_pdf_data(pdf_file, verbose)
+	meta_data=[]
+	begin
+		interesting_fields=Array.[]("/Author")
+
+		f=File.open(pdf_file)
+		f.each_line{ |line|
+			line.force_encoding('ISO-8859-1')
+			if /pdf:Author='([^']*)'/.match(line)
+				if verbose
+					puts "Found pdf:Author: "+$1
+				end
+				meta_data<<$1.to_s.chomp unless $1.to_s==""
+			end
+			if /xap:Author='([^']*)'/i.match(line)
+				if verbose
+					puts "Found xap:Author: "+$1
+				end
+				meta_data<<$1.to_s.chomp unless $1.to_s==""
+			end
+			if /dc:creator='([^']*)'/i.match(line)
+				if verbose
+					puts "Found dc:creator: "+$1
+				end
+				meta_data<<$1.to_s.chomp unless $1.to_s==""
+			end
+			if /\/Author ?\(([^\)]*)\)/i.match(line)
+				if verbose
+					puts "Found Author: "+$1
+				end
+				meta_data<<$1.to_s.chomp unless $1.to_s==""
+			end
+			if /<xap:creator>(.*)<\/xap:creator>/i.match(line)
+				if verbose
+					puts "Found pdf:creator: "+$1
+				end
+				meta_data<<$1.to_s.chomp unless $1.to_s==""
+			end
+			if /<xap:Author>(.*)<\/xap:Author>/i.match(line)
+				if verbose
+					puts "Found xap:Author: "+$1
+				end
+				meta_data<<$1.to_s.chomp unless $1.to_s==""
+			end
+			if /<pdf:Author>(.*)<\/pdf:Author>/i.match(line)
+				if verbose
+					puts "Found pdf:Author: "+$1
+				end
+				meta_data<<$1.to_s.chomp unless $1.to_s==""
+			end
+			if /<dc:creator>(.*)<\/dc:creator>/i.match(line)
+				if verbose
+					puts "Found dc:creator: "+$1
+				end
+				meta_data<<$1.to_s.chomp unless $1.to_s==""
+			end
+			
+		}
+		return meta_data
+	rescue => e
+		if verbose
+			puts "There was an error processing the document - " + e.message
+		end
+	end
+	return meta_data
+end
+
+# Get data from files using exiftool
+def get_doc_data(doc_file, verbose)
+	data=[]
+	begin
+		interesting_fields=Array.[]("Author","LastSavedBy","Creator")
+		file = MyMiniExiftool.new(doc_file)
+
+		interesting_fields.each{ |field_name|
+			if file.tags.include?(field_name)
+				data<<file[field_name].to_s
+			end
+		}
+	rescue => e
+		if verbose
+			puts "There was an error processing the document - " + e.message
+		end
+	end
+	return data
+end
+
+# Get data from Office 2007 documents by unzipping the relevant XML files then
+# checking for known fields
+def get_docx_data(docx_file, verbose)
+	meta_data=[]
+
+	interesting_fields=Array.[]("cp:coreProperties/dc:creator","cp:coreProperties/cp:lastModifiedBy")
+	interesting_files=Array.[]("docProps/core.xml")
+
+	begin
+		Zip::ZipFile.open(docx_file) { |zipfile|
+			interesting_files.each { |file|
+				if zipfile.find_entry(file)
+					xml=zipfile.read(file)
+
+					doc=Document.new(xml)
+					interesting_fields.each { |field|
+						element=doc.elements[field]
+						#puts element.get_text unless element==nil||element.get_text==nil
+						meta_data<<element.get_text.to_s.chomp unless element==nil||element.get_text==nil
+					}
+				end
+			}
+		}
+	rescue => e
+		if verbose
+			# not a zip file
+			puts "File probably not a zip file - " + e.message
+		end
+	end
+	return meta_data
+end
+
+# Take the file given, try to work out what type of file it is, then pass it
+# to the relevant function to try to grab meta data
+def process_file(filename, verbose=false)
+	meta_data=nil
+
+	begin
+
+		if File.file?(filename) && File.exist?(filename)
+			mime_types=MIME::Types.type_for(filename)
+			if(mime_types.size==0)
+				if(verbose)
+					puts "Empty mime type"
+				end
+				return meta_data
+			end
+			if verbose
+				puts "Checking "+filename
+				puts "  Mime type="+mime_types.join(", ")
+				puts
+			end
+			if mime_types.include?("application/word") || mime_types.include?("application/excel") || mime_types.include?("application/powerpoint")
+				if verbose
+					puts "  Mime type says original office document"
+				end
+				meta_data=get_doc_data(filename, verbose)
+			else
+				if mime_types.include?("application/pdf")
+					if verbose
+						puts "  Mime type says PDF"
+					end
+					# Running both my own regexp and exiftool on pdfs as I've found exif misses some data
+					meta_data=get_doc_data(filename, verbose)
+					meta_data+=get_pdf_data(filename, verbose)
+				else
+					# list taken from http://en.wikipedia.org/wiki/Microsoft_Office_2007_file_extensions
+					if filename =~ /(.(doc|dot|ppt|pot|xls|xlt|pps)[xm]$)|(.ppam$)|(.xlsb$)|(.xlam$)/
+						if verbose
+							puts "  File extension says 2007 style office document"
+						end
+						meta_data=get_docx_data(filename, verbose)
+					elsif filename =~ /.php$|.aspx$|.cfm$|.asp$|.html$|.htm$/
+						if verbose
+							puts "  Language file, can ignore"
+						end
+					else
+						if verbose
+							puts "  Unknown file type"
+						end
+					end
+				end
+			end
+			if meta_data!=nil
+				if verbose
+					if meta_data.length > 0
+						puts "  Found "+meta_data.join(", ")+"\n"
+					end
+				end
+			end
+		end
+	rescue => e
+		puts "Problem in process_file function"
+		puts "Error: " + e.message
+	end
+
+	return meta_data
+end
diff --git a/fab.rb b/fab.rb
new file mode 100644
index 0000000..37e3832
--- /dev/null
+++ b/fab.rb
@@ -0,0 +1,88 @@
+#!/usr/bin/env ruby
+
+# == FAB: Files Already Bagged
+#
+# This script can be run against files already
+# downloaded from a target site to generate a list
+# of usernames and email addresses based on meta
+# data contained within them.
+#
+# To see a list of file types which can be processed
+# see cewl_lib.rb
+#
+# == Usage
+#
+# fab [OPTION] ... filename/list
+#
+# -h, --help:
+#    show help
+#
+# -v
+#    verbose
+#
+# filename/list: the file or list of files to check
+#
+# Author:: Robin Wood (robin at digininja.org)
+# Copyright:: Copyright (c) Robin Wood 2011
+# Licence:: GPL
+#
+
+require "rubygems"
+require 'getoptlong'
+require "./cewl_lib.rb"
+
+opts = GetoptLong.new(
+	[ '--help', '-h', GetoptLong::NO_ARGUMENT ],
+	[ "-v" , GetoptLong::NO_ARGUMENT ]
+)
+
+def usage
+	puts "FAB -- Files Already Bagged
+
+Usage: fab [OPTION] ... filename/list
+	-h, --help: show help
+	-v: verbose
+	
+	filename/list: the file or list of files to check
+
+"
+	exit
+end
+
+verbose=false
+
+begin
+	opts.each do |opt, arg|
+		case opt
+		when '--help'
+			usage
+		when '-v'
+			verbose=true
+		end
+	end
+rescue
+	usage
+end
+
+if ARGV.length < 1
+	puts "Missing filename/list (try --help)"
+	exit 0
+end
+
+meta_data=[]
+
+ARGV.each { |param|
+	data=process_file(param, verbose)
+	if(data!=nil)
+		meta_data+=data
+	end
+}
+
+meta_data.delete_if { |x| x.chomp==""}
+meta_data.uniq!
+meta_data.sort!
+if meta_data.length==0
+	puts "No data found\n"
+else
+	puts meta_data.join("\n")
+end

-- 
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/forensics/cewl.git


