[Forensics-changes] [cewl] 01/02: Imported Upstream version 5.1
Joao Eriberto Mota Filho
eriberto at moszumanska.debian.org
Wed Dec 31 13:01:27 UTC 2014
This is an automated email from the git hooks/post-receive script.
eriberto pushed a commit to branch debian
in repository cewl.
commit c5f38e9f3c9a080d252cdc2ba02890220df2410f
Author: Joao Eriberto Mota Filho <eriberto at debian.org>
Date: Wed Dec 31 10:59:54 2014 -0200
Imported Upstream version 5.1
---
README | 183 +++++++++++
cewl.rb | 995 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
cewl_lib.rb | 234 ++++++++++++++
fab.rb | 88 ++++++
4 files changed, 1500 insertions(+)
diff --git a/README b/README
new file mode 100644
index 0000000..32fec46
--- /dev/null
+++ b/README
@@ -0,0 +1,183 @@
+CeWL - Custom Word List generator
+=================================
+
+Copyright(c) 2012, Robin Wood <robin at digininja.org>
+
+Based on a discussion on PaulDotCom about creating custom word lists by
+spidering a target's website and collecting unique words, I decided to write
+CeWL, the Custom Word List generator. CeWL is a Ruby app which spiders a
+given URL to a specified depth, optionally following external links, and
+returns a list of words which can then be fed to password crackers such
+as John the Ripper.
+
+By default, CeWL sticks to just the site you have specified and will go to a
+depth of 2 links; this behaviour can be changed by passing arguments. Be
+careful if setting a large depth and allowing it to go offsite, as you could
+end up drifting on to a lot of other domains. All words of three characters
+and over are output to stdout. This length can be increased and the words can
+be written to a file rather than the screen so the app can be automated.
+
+CeWL also has an associated command line app, FAB (Files Already Bagged),
+which uses the same meta data extraction techniques to create author/creator
+lists from files you have already downloaded.
+
+Change Log
+==========
+
+Version 5.0
+-----------
+
+Adds proxy support from the command line and the ability to pass in
+credentials for both basic and digest authentication.
+
+A few other smaller bug fixes as well.
+
+Version 4.3
+-----------
+
+CeWL now sorts the words found by count and optionally (new --count argument)
+includes the word count in the output. I've left the words in the case they
+appear in the pages, so "Product" is different to "product"; I figure that if
+the list is being used for password generation then the case may be
+significant, so let the user strip it if they want to. There are also more
+improvements to the stability of the spider in this release.
+
+Version 4.2
+-----------
+
+Fixes a pretty major bug that I found while fixing a smaller bug for @yorikv.
+The bug was related to a hack I had to put in place because of a problem I was
+having with the spider. While I was looking into it I spotted this line, which
+is the one the spider uses to find new links in downloaded pages:
+
+ web_page.scan(/href="(.*?)"/i).flatten.map do |link|
+
+This is fine if all the links look like this:
+
+ <a href="test.php">link</a>
+
+But if the link looks like either of these:
+
+ <a href='test.php'>link</a>
+ <a href=test.php>link</a>
+
+the regex will fail so the links will be ignored.
+
+To fix this up I've had to override the function that parses the page to find
+all the links. Rather than use a regex, I've changed it to use Nokogiri, which
+is designed to parse a page looking for links rather than just running through
+it with a custom regex. This brings in a new dependency but I think it is
+worth it for the fix to the functionality. I also found another bug where a
+link like this:
+
+ <a href='#name'>local</a>
+
+which should be ignored as it just links to an internal anchor, was actually
+being translated to '/#name', which could unintentionally reference the index
+page. I've fixed this one as well after a lot of debugging to find the best
+way to do it.
+
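+For reference, the Nokogiri-based extraction boils down to something like this
+(a simplified sketch, where page_body holds the downloaded HTML; the real code
+also resolves each link against the base URL):
+
+ doc = Nokogiri::HTML(page_body)
+ links = doc.css('a').map { |a| a['href'] }.compact
+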
+A final addition is to allow a user to specify a depth of 0, which restricts
+CeWL to spidering just the single page given.
+
+I'm only putting this out as a point release as I'd like to rewrite the
+spidering to use a better spider; that will come out as the next major release.
+
+Version 4.0/4.1
+---------------
+
+The main change in version 4.0/4.1 is the upgrade to run with Ruby 1.9.x. This
+has been tested on various machines and on BT5, as that is a popular platform
+for running it, and it appears to run fine. Another minor change: up to
+version 4 all HTML tags were stripped out before the page was parsed for
+words, which meant that text in alt and title tags was missed. I now grab the
+text from those tags before stripping the HTML to give those extra few words.
+
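+Roughly speaking (a simplified sketch, not the exact code used), the attribute
+text is collected before the tags are stripped:
+
+ attribute_text = ""
+ ["alt", "title"].each do |attr|
+   body.scan(/#{attr}="([^"]*)"/i) { |match| attribute_text += match[0] + " " }
+ end
+ words = (body + " " + attribute_text).gsub(/<\/?[^>]*>/, "")
+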
+Version 3
+---------
+
+Addresses a problem spotted by Josh Wright. The Spider gem doesn't handle
+JavaScript redirection URLs, for example an index page containing just the
+following:
+
+ <script language="JavaScript">
+ self.location.href =
+ 'http://www.FOO.com/FOO/connect/FOONet/Top+Navigator/Home';
+ </script>
+
+wasn't spidered because the redirect wasn't picked up. I now scan through a
+page looking for any lines containing location.href= and then add the given
+URL to the list of pages to spider.
+
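+The detection is essentially a regexp scan over the page body, along these
+lines (simplified; the real code also rewrites relative redirect targets):
+
+ redirect_urls = body.scan(/location\.href\s*=\s*["']([^"']*)["']/i).flatten
+ # each URL found is then added to the list of pages to spider
+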
+Version 2
+---------
+
+Version 2 of CeWL can also create two new lists: a list of email addresses
+found in mailto links, and a list of author/creator names collected from meta
+data found in documents on the site. It can currently process documents in
+pre-2007 Office, Office 2007 and PDF formats. This user data can then be used
+to create the list of usernames to be used in association with the password
+list.
+
+Pronunciation
+=============
+Seeing as I was asked, CeWL is pronounced "cool".
+
+Installation
+============
+CeWL needs the rubygems package to be installed along with the following gems:
+
+* mime-types
+* mini_exiftool
+* rubyzip
+* spider
+
+All these gems can be installed by running "gem install xxx" as root. The
+mini_exiftool gem also requires the exiftool application to be installed.
+
+Then just save CeWL to a directory and make it executable.
+
+The project page on my site gives some tips on solving common problems people
+have encountered while running CeWL - http://www.digininja.org/projects/cewl.php
+
+Usage
+=====
+Usage: cewl [OPTION] ... URL
+ --help, -h: show help
+ --depth x, -d x: depth to spider to, default 2
+ --min_word_length, -m: minimum word length, default 3
+ --offsite, -o: let the spider visit other sites
+ --write, -w file: write the output to the file
+ --ua, -u user-agent: useragent to send
+ --no-words, -n: don't output the wordlist
+ --meta, -a: include meta data
+ --meta_file file: file for metadata output
+ --email, -e: include email addresses
+ --email_file file: file for email output
+ --meta-temp-dir directory: the temporary directory used by exiftool when parsing files, default /tmp
+ -v: verbose
+
+ URL: The site to spider.
+
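+An example run, spidering to depth 1 with a minimum word length of 5 and
+writing the word list to a file, might look like this:
+
+ cewl -d 1 -m 5 -w words.txt http://www.example.org/
+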
+Ruby Doc
+========
+CeWL is commented up in Ruby Doc format.
+
+Licence
+=======
+This project is released under the Creative Commons Attribution-Share Alike 2.0
+UK: England & Wales licence.
+
+( http://creativecommons.org/licenses/by-sa/2.0/uk/ )
+
+
+Alternatively, you can use GPL-3+ instead of the original license.
+
+( http://opensource.org/licenses/GPL-3.0 )
diff --git a/cewl.rb b/cewl.rb
new file mode 100755
index 0000000..854281e
--- /dev/null
+++ b/cewl.rb
@@ -0,0 +1,995 @@
+#!/usr/bin/env ruby
+
+# == CeWL: Custom Word List Generator
+#
+# CeWL will spider a target site and generate up to three lists:
+#
+# * A word list of all unique words found on the target site
+# * A list of all email addresses found in mailto links
+# * A list of usernames/author details from meta data found in any documents on the site
+#
+# == Usage
+#
+# cewl [OPTION] ... URL
+#
+# -h, --help:
+# show help
+#
+# --depth x, -d x:
+# depth to spider to, default 2
+#
+# --min_word_length, -m:
+# minimum word length, default 3
+#
+# --email file, -e
+# --email_file file:
+# include any email addresses found during the spider, email_file is optional output file, if
+# not included the output is added to default output
+#
+# --meta file, -a
+# --meta_file file:
+# include any meta data found during the spider, meta_file is optional output file, if
+# not included the output is added to default output
+#
+# --no-words, -n
+# don't output the wordlist
+#
+# --offsite, -o:
+# let the spider visit other sites
+#
+# --write, -w file:
+# write the words to the file
+#
+# --ua, -u user-agent:
+# useragent to send
+#
+# --meta-temp-dir directory:
+# the temporary directory used by exiftool when parsing files, default /tmp
+#
+# --keep, -k:
+# keep the documents that are downloaded
+#
+# --count, -c:
+# show the count for each of the words found
+#
+# -v
+# verbose
+#
+# URL: The site to spider.
+#
+# Author:: Robin Wood (robin at digi.ninja)
+# Copyright:: Copyright (c) Robin Wood 2014
+# Licence:: CC-BY-SA 2.0 or GPL-3+
+#
+
+VERSION = "5.1"
+
+puts"CeWL #{VERSION} Robin Wood (robin at digi.ninja) (http://digi.ninja)"
+puts
+
+begin
+ require 'getoptlong'
+ require 'spider'
+ require 'nokogiri'
+ require 'net/http'
+rescue LoadError => e
+ # catch the error and provide feedback on installing the gem
+ if e.to_s =~ /cannot load such file -- (.*)/
+ missing_gem = $1
+ puts "\nError: #{missing_gem} gem not installed\n"
+ puts "\t use: \"gem install #{missing_gem}\" to install the required gem\n\n"
+ exit
+ else
+ puts "There was an error loading the gems:"
+ puts
+ puts e.to_s
+ exit
+ end
+end
+
+require './cewl_lib'
+
+# Doing this so I can override the allowed? function which normally checks
+# the robots.txt file
+class MySpider<Spider
+ @@proxy_host = nil
+ @@proxy_port = nil
+ @@proxy_username = nil
+ @@proxy_password = nil
+
+ @@auth_type = nil
+ @@auth_user = nil
+ @@auth_password = nil
+ @@verbose = false
+
+ def self.proxy (host, port = nil, username = nil, password = nil)
+ @@proxy_host = host
+ port = 8080 if port.nil?
+ @@proxy_port = port
+ @@proxy_username = username
+ @@proxy_password = password
+ end
+
+ def self.auth_creds (type, user, password)
+ @@auth_type = type
+ @@auth_user = user
+ @@auth_password = password
+ end
+
+ def self.verbose (val)
+ @@verbose = val
+ end
+
+ # Create an instance of MySpiderInstance rather than SpiderInstance
+ def self.start_at(a_url, &block)
+ rules = RobotRules.new('Ruby Spider 1.0')
+ a_spider = MySpiderInstance.new({nil => a_url}, [], rules, [])
+ a_spider.auth_type = @@auth_type
+ a_spider.auth_user = @@auth_user
+ a_spider.auth_password = @@auth_password
+
+ a_spider.proxy_host = @@proxy_host
+ a_spider.proxy_port = @@proxy_port
+ a_spider.proxy_username = @@proxy_username
+ a_spider.proxy_password = @@proxy_password
+
+ a_spider.verbose = @@verbose
+ block.call(a_spider)
+ a_spider.start!
+ end
+end
+
+# My version of the spider class which allows all files
+# to be processed
+class MySpiderInstance<SpiderInstance
+ attr_writer :auth_type
+ attr_writer :auth_user
+ attr_writer :auth_password
+
+ attr_writer :proxy_host
+ attr_writer :proxy_port
+ attr_writer :proxy_username
+ attr_writer :proxy_password
+
+ attr_writer :verbose
+
+ # Force all files to be allowed
+ # Normally the robots.txt file will be honoured
+ def allowed?(a_url, parsed_url)
+ true
+ end
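+ # Overridden to drive the crawl from CeWL's own URL tree (the Tree class
+ # below) and to exit cleanly when the user hits Ctrl-C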
+ def start! #:nodoc:
+ interrupted = false
+ trap("SIGINT") { interrupted = true }
+ begin
+ next_urls = @next_urls.pop
+ tmp_n_u = {}
+ next_urls.each do |prior_url, urls|
+ x = []
+ urls.each_line do |a_url|
+ x << [a_url, (URI.parse(a_url) rescue nil)]
+ end
+ y = []
+ x.select do |a_url, parsed_url|
+ y << [a_url, parsed_url] if allowable_url?(a_url, parsed_url)
+ end
+ y.each do |a_url, parsed_url|
+ @setup.call(a_url) unless @setup.nil?
+ get_page(parsed_url) do |response|
+ do_callbacks(a_url, response, prior_url)
+ #tmp_n_u[a_url] = generate_next_urls(a_url, response)
+ #@next_urls.push tmp_n_u
+ generate_next_urls(a_url, response).each do |a_next_url|
+ #puts 'pushing ' + a_next_url
+ @next_urls.push a_url => a_next_url
+ end
+ #exit if interrupted
+ end
+ @teardown.call(a_url) unless @teardown.nil?
+ exit if interrupted
+ end
+ end
+ end while !@next_urls.empty?
+ end
+
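+ # Overridden to add proxy support, basic/digest authentication and HTTPS handling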
+ def get_page(uri, &block) #:nodoc:
+ @seen << uri
+
+ begin
+ if @proxy_host.nil?
+ http = Net::HTTP.new(uri.host, uri.port)
+
+ if uri.scheme == 'https'
+ http.use_ssl = true
+ http.verify_mode = OpenSSL::SSL::VERIFY_NONE
+ end
+ else
+ proxy = Net::HTTP::Proxy(@proxy_host, @proxy_port, @proxy_username, @proxy_password)
+ begin
+ if uri.scheme == 'https'
+ http = proxy.start(uri.host, uri.port, :use_ssl => true, :verify_mode => OpenSSL::SSL::VERIFY_NONE)
+ else
+ http = proxy.start(uri.host, uri.port)
+ end
+ rescue => e
+ puts "Failed to connect to the proxy"
+ exit
+ end
+ end
+
+ req = Net::HTTP::Get.new(uri.request_uri, @headers)
+
+ if !@auth_type.nil?
+ case @auth_type
+ when "digest"
+ uri.user = @auth_user
+ uri.password = @auth_password
+
+ res = http.request req
+
+ if not res['www-authenticate'].nil?
+ digest_auth = Net::HTTP::DigestAuth.new
+ auth = digest_auth.auth_header uri, res['www-authenticate'], 'GET'
+
+ req = Net::HTTP::Get.new uri.request_uri
+ req.add_field 'Authorization', auth
+ end
+
+ when "basic"
+ req.basic_auth @auth_user, @auth_password
+ end
+ end
+ res = http.request(req)
+
+ if res.redirect?
+ #puts "redirect url"
+ base_url = uri.to_s[0, uri.to_s.rindex('/')]
+ new_url = URI.parse(construct_complete_url(base_url,res['Location']))
+
+ # If auth is used then a name:pass@ gets added, this messes the tree
+ # up so easiest to just remove it
+ current_uri = uri.to_s.gsub(/:\/\/[^:]*:[^@]*@/, "://")
+ @next_urls.push current_uri => new_url.to_s
+ elsif res.code == "401"
+ puts "Authentication required, can't continue on this branch - #{uri}" if @verbose
+ else
+ block.call(res)
+ end
+ rescue => e
+ puts "Unable to connect to the site, run in verbose mode for more information"
+ if @verbose
+ puts
+ puts"The following error may help:"
+ puts e.to_s
+ end
+ exit
+ end
+ end
+ # Overriding so that I can get it to ignore in-page anchor links - i.e. #name
+ def construct_complete_url(base_url, additional_url, parsed_additional_url = nil) #:nodoc:
+ if additional_url =~ /^#/
+ return nil
+ end
+ parsed_additional_url ||= URI.parse(additional_url)
+ case parsed_additional_url.scheme
+ when nil
+ u = base_url.is_a?(URI) ? base_url : URI.parse(base_url)
+ if additional_url[0].chr == '/'
+ "#{u.scheme}://#{u.host}#{additional_url}"
+ elsif u.path.nil? || u.path == ''
+ "#{u.scheme}://#{u.host}/#{additional_url}"
+ elsif u.path[0].chr == '/'
+ "#{u.scheme}://#{u.host}#{u.path}/#{additional_url}"
+ else
+ "#{u.scheme}://#{u.host}/#{u.path}/#{additional_url}"
+ end
+ else
+ additional_url
+ end
+ end
+
+ # Overriding the original spider one as it doesn't find hrefs very well
+ def generate_next_urls(a_url, resp) #:nodoc:
+ web_page = resp.body
+ if URI.parse(a_url).path == ""
+ base_url = a_url
+ else
+ base_url = a_url[0, a_url.rindex('/')]
+ end
+
+ doc = Nokogiri::HTML(web_page)
+ links = doc.css('a').map{ |a| a['href'] }
+ links.map do |link|
+ begin
+ if link.nil?
+ nil
+ else
+ begin
+ parsed_link = URI.parse(link)
+ if parsed_link.fragment == '#'
+ nil
+ else
+ construct_complete_url(base_url, link, parsed_link)
+ end
+ rescue
+ nil
+ end
+ end
+ rescue => e
+ puts "There was an error generating URL list"
+ puts "Error: " + e.inspect
+ puts e.backtrace
+ exit
+ end
+ end.compact
+ end
+end
+
+# A node for a tree
+class TreeNode
+ attr :value
+ attr :depth
+ attr :key
+ attr :visited, true
+ def initialize(key, value, depth)
+ @key=key
+ @value=value
+ @depth=depth
+ @visited=false
+ end
+
+ def to_s
+ if key==nil
+ return "key=nil value="+ at value+" depth="+ at depth.to_s+" visited="+ at visited.to_s
+ else
+ return "key="+ at key+" value="+ at value+" depth="+ at depth.to_s+" visited="+ at visited.to_s
+ end
+ end
+ def to_url_hash
+ return({@key=>@value})
+ end
+end
+
+# A tree structure
+class Tree
+ attr :data
+ @max_depth
+ @children
+
+ # Get the maximum depth the tree can grow to
+ def max_depth
+ @max_depth
+ end
+
+ # Set the max depth the tree can grow to
+ def max_depth=(val)
+ @max_depth=Integer(val)
+ end
+
+ # As this is used to work out if there are any more nodes to process, it isn't a true empty check
+ def empty?
+ if !@data.visited
+ return false
+ else
+ @children.each { |node|
+ if !node.data.visited
+ return false
+ end
+ }
+ end
+ return true
+ end
+
+ # The constructor
+ def initialize(key=nil, value=nil, depth=0)
+ @data=TreeNode.new(key,value,depth)
+ @children = []
+ @max_depth = 2
+ end
+
+ # Iterator
+ def each
+ yield @data
+ @children.each do |child_node|
+ child_node.each { |e| yield e }
+ end
+ end
+
+ # Remove an item from the tree
+ def pop
+ if !@data.visited
+ @data.visited=true
+ return @data.to_url_hash
+ else
+ @children.each { |node|
+ if !node.data.visited
+ node.data.visited=true
+ return node.data.to_url_hash
+ end
+ }
+ end
+ return nil
+ end
+
+ # Push an item onto the tree
+ def push(value)
+ key=value.keys.first
+ value=value.values_at(key).first
+
+ if key==nil
+ @data=TreeNode.new(key,value,0)
+ else
+ # if the depth is 0 then don't add anything to the tree
+ if @max_depth == 0
+ return
+ end
+ if key==@data.value
+ child=Tree.new(key,value, @data.depth+1)
+ @children << child
+ else
+ @children.each { |node|
+ if node.data.value==key && node.data.depth<@max_depth
+ child=Tree.new(key,value, node.data.depth+1)
+ @children << child
+ end
+ }
+ end
+ end
+ end
+end
+
+opts = GetoptLong.new(
+ [ '--help', '-h', GetoptLong::NO_ARGUMENT ],
+ [ '--keep', '-k', GetoptLong::NO_ARGUMENT ],
+ [ '--depth', '-d', GetoptLong::OPTIONAL_ARGUMENT ],
+ [ '--min_word_length', "-m" , GetoptLong::REQUIRED_ARGUMENT ],
+ [ '--no-words', "-n" , GetoptLong::NO_ARGUMENT ],
+ [ '--offsite', "-o" , GetoptLong::NO_ARGUMENT ],
+ [ '--write', "-w" , GetoptLong::REQUIRED_ARGUMENT ],
+ [ '--ua', "-u" , GetoptLong::REQUIRED_ARGUMENT ],
+ [ '--meta-temp-dir', GetoptLong::REQUIRED_ARGUMENT ],
+ [ '--meta_file', GetoptLong::REQUIRED_ARGUMENT ],
+ [ '--email_file', GetoptLong::REQUIRED_ARGUMENT ],
+ [ '--meta', "-a" , GetoptLong::NO_ARGUMENT ],
+ [ '--email', "-e" , GetoptLong::NO_ARGUMENT ],
+ [ '--count', '-c', GetoptLong::NO_ARGUMENT ],
+ [ '--auth_user', GetoptLong::REQUIRED_ARGUMENT ],
+ [ '--auth_pass', GetoptLong::REQUIRED_ARGUMENT ],
+ [ '--auth_type', GetoptLong::REQUIRED_ARGUMENT ],
+ [ '--proxy_host', GetoptLong::REQUIRED_ARGUMENT ],
+ [ '--proxy_port', GetoptLong::REQUIRED_ARGUMENT ],
+ [ '--proxy_username', GetoptLong::REQUIRED_ARGUMENT ],
+ [ '--proxy_password', GetoptLong::REQUIRED_ARGUMENT ],
+ [ "--verbose", "-v" , GetoptLong::NO_ARGUMENT ]
+)
+
+# Display the usage
+def usage
+ puts "Usage: cewl [OPTION] ... URL
+ --help, -h: show help
+ --keep, -k: keep the downloaded file
+ --depth x, -d x: depth to spider to, default 2
+ --min_word_length, -m: minimum word length, default 3
+ --offsite, -o: let the spider visit other sites
+ --write, -w file: write the output to the file
+ --ua, -u user-agent: useragent to send
+ --no-words, -n: don't output the wordlist
+ --meta, -a: include meta data
+ --meta_file file: output file for meta data
+ --email, -e: include email addresses
+ --email_file file: output file for email addresses
+ --meta-temp-dir directory: the temporary directory used by exiftool when parsing files, default /tmp
+ --count, -c: show the count for each word found
+
+ Authentication
+ --auth_type: digest or basic
+ --auth_user: authentication username
+ --auth_pass: authentication password
+
+ Proxy Support
+ --proxy_host: proxy host
+ --proxy_port: proxy port, default 8080
+ --proxy_username: username for proxy, if required
+ --proxy_password: password for proxy, if required
+
+ --verbose, -v: verbose
+
+ URL: The site to spider.
+
+"
+ exit
+end
+
+verbose=false
+ua=nil
+url = nil
+outfile = nil
+email_outfile = nil
+meta_outfile = nil
+offsite = false
+depth = 2
+min_word_length=3
+email=false
+meta=false
+wordlist=true
+meta_temp_dir="/tmp/"
+keep=false
+show_count = false
+auth_type = nil
+auth_user = nil
+auth_pass = nil
+
+proxy_host = nil
+proxy_port = nil
+proxy_username = nil
+proxy_password = nil
+
+begin
+ opts.each do |opt, arg|
+ case opt
+ when '--help'
+ usage
+ when "--count"
+ show_count = true
+ when "--meta-temp-dir"
+ if !File.directory?(arg)
+ puts "Meta temp directory is not a directory\n"
+ exit
+ end
+ if !File.writable?(arg)
+ puts "The meta temp directory is not writable\n"
+ exit
+ end
+ meta_temp_dir=arg
+ if meta_temp_dir !~ /.*\/$/
+ meta_temp_dir+="/"
+ end
+ when "--keep"
+ keep=true
+ when "--no-words"
+ wordlist=false
+ when "--meta_file"
+ meta_outfile = arg
+ when "--meta"
+ meta=true
+ when "--email_file"
+ email_outfile = arg
+ when "--email"
+ email=true
+ when '--min_word_length'
+ min_word_length=arg.to_i
+ if min_word_length<1
+ usage
+ end
+ when '--depth'
+ depth=arg.to_i
+ if depth < 0
+ usage
+ end
+ when '--offsite'
+ offsite=true
+ when '--ua'
+ ua=arg
+ when '--verbose'
+ verbose=true
+ when '--write'
+ outfile=arg
+ when "--proxy_password"
+ proxy_password = arg
+ when "--proxy_username"
+ proxy_username = arg
+ when "--proxy_host"
+ proxy_host = arg
+ when "--proxy_port"
+ proxy_port = arg.to_i
+ when "--auth_pass"
+ auth_pass = arg
+ when "--auth_user"
+ auth_user = arg
+ when "--auth_type"
+ if arg =~ /(digest|basic)/i
+ auth_type=$1.downcase
+ if auth_type == "digest"
+ begin
+ require "net/http/digest_auth"
+ rescue LoadError => e
+ # catch the error and provide feedback on installing the gem
+ puts "\nError: To use digest auth you require the net-http-digest_auth gem, to install it use:\n\n"
+ puts "\t\"gem install net-http-digest_auth\"\n\n"
+ exit
+ end
+ end
+ else
+ puts "Invalid authentication type, please specify either basic or digest"
+ exit
+ end
+ end
+ end
+rescue
+ usage
+end
+
+if !auth_type.nil? and (auth_user.nil? or auth_pass.nil?)
+ puts "If using basic or digest auth you must provide a username and password\n\n"
+ exit
+end
+
+if auth_type.nil? and (!auth_user.nil? or !auth_pass.nil?)
+ puts "Authentication details provided but no mention of basic or digest"
+ exit
+end
+
+if ARGV.length != 1
+ puts "Missing url argument (try --help)"
+ exit 0
+end
+
+url = ARGV.shift
+
+# Must have protocol
+if url !~ /^http(s)?:\/\//
+ url="http://"+url
+end
+
+# The spider doesn't work properly if there isn't a / on the end
+if url !~ /\/$/
+# Commented out for Yori
+# url=url+"/"
+end
+
+word_hash = {}
+email_arr=[]
+url_stack=Tree.new
+url_stack.max_depth=depth
+usernames=Array.new()
+
+# Do the checks here so we don't do all the processing then find we can't open the file
+if !outfile.nil?
+ begin
+ outfile_file=File.new(outfile,"w")
+ rescue
+ puts "Couldn't open the output file for writing"
+ exit
+ end
+else
+ outfile_file=$stdout
+end
+
+if !email_outfile.nil? and email
+ begin
+ email_outfile_file=File.new(email_outfile,"w")
+ rescue
+ puts "Couldn't open the email output file for writing"
+ exit
+ end
+else
+ email_outfile_file = outfile_file
+end
+
+if !meta_outfile.nil? and meta
+ begin
+ meta_outfile_file=File.new(meta_outfile,"w")
+ rescue
+ puts "Couldn't open the metadata output file for writing"
+ exit
+ end
+else
+ meta_outfile_file = outfile_file
+end
+
+begin
+ if verbose
+ puts "Starting at " + url
+ end
+
+ if !proxy_host.nil?
+ MySpider.proxy(proxy_host, proxy_port, proxy_username, proxy_password)
+ end
+
+ if !auth_type.nil?
+ MySpider.auth_creds(auth_type, auth_user, auth_pass)
+ end
+ MySpider.verbose(verbose)
+
+ MySpider.start_at(url) do |s|
+ if ua!=nil
+ s.headers['User-Agent'] = ua
+ end
+
+ s.add_url_check do |a_url|
+ #puts "checking page " + a_url
+ allow=true
+ # Extensions to ignore
+ if a_url =~ /(\.zip$|\.gz$|\.bz2$|\.png$|\.gif$|\.jpg$|^#)/
+ if verbose
+ puts "Ignoring internal link or graphic: "+a_url
+ end
+ allow=false
+ else
+ if /^mailto:(.*)/i.match(a_url)
+ if email
+ email_arr<<$1
+ if verbose
+ puts "Found #{$1} on page #{a_url}"
+ end
+ end
+ allow=false
+ else
+ if !offsite
+ a_url_parsed = URI.parse(a_url)
+ url_parsed = URI.parse(url)
+# puts 'comparing ' + a_url + ' with ' + url
+
+ allow = (a_url_parsed.host == url_parsed.host)
+
+ if !allow && verbose
+ puts "Offsite link, not following: "+a_url
+ end
+ end
+ end
+ end
+ allow
+ end
+
+ s.on :success do |a_url, resp, prior_url|
+
+ if verbose
+ if prior_url.nil?
+ puts "Visiting: #{a_url}, got response code #{resp.code}"
+ else
+ puts "Visiting: #{a_url} referred from #{prior_url}, got response code #{resp.code}"
+ end
+ end
+ body=resp.body.to_s
+
+ # get meta data
+ if /.*<meta.*description.*content\s*=[\s'"]*(.*)/i.match(body)
+ description=$1
+ body += description.gsub(/[>"\/']*/, "")
+ end
+
+ if /.*<meta.*keywords.*content\s*=[\s'"]*(.*)/i.match(body)
+ keywords=$1
+ body += keywords.gsub(/[>"\/']*/, "")
+ end
+
+# puts body
+# while /mailto:([^'">]*)/i.match(body)
+# email_arr<<$1
+# if verbose
+# puts "Found #{$1} on page #{a_url}"
+# end
+# end
+
+ while /(location.href\s*=\s*["']([^"']*)['"];)/i.match(body)
+ full_match = $1
+ j_url = $2
+ if verbose
+ puts "Javascript redirect found " + j_url
+ end
+
+ re = Regexp.escape(full_match)
+
+ body.gsub!(/#{re}/,"")
+
+ if j_url !~ /https?:\/\//i
+
+ # Relative redirect target, prefix it with the scheme and host of the page it was found on
+ parsed_a_url = URI.parse(a_url)
+ domain = "#{parsed_a_url.scheme}://#{parsed_a_url.host}/"
+ j_url = domain + j_url
+ if verbose
+ puts "Relative URL found, adding domain to make " + j_url
+ end
+ end
+
+ x = {a_url=>j_url}
+ url_stack.push x
+ end
+
+ # strip comment tags
+ body.gsub!(/<!--/, "")
+ body.gsub!(/-->/, "")
+
+ # If you want to add more attribute names to include, just add them to this array
+ attribute_names = [
+ "alt",
+ "title",
+ ]
+
+ attribute_text = ""
+
+ attribute_names.each { |attribute_name|
+ body.gsub!(/#{attribute_name}="([^"]*)"/) { |attr| attribute_text += $1 + " " }
+ }
+
+ if verbose
+ puts "Attribute text found:"
+ puts attribute_text
+ puts
+ end
+
+ body += " " + attribute_text
+
+ # strip html tags
+ words=body.gsub(/<\/?[^>]*>/, "")
+
+ # check if this is needed
+ words.gsub!(/&[a-z]*;/, "")
+
+ # may want 0-9 in here as well in the future but for now limit it to a-z so
+ # you can't sneak any nasty characters in
+ if /.*\.([a-z]+)(\?.*$|$)/i.match(a_url)
+ file_extension=$1
+ else
+ file_extension=""
+ end
+
+ if meta
+ begin
+ if keep and file_extension =~ /^((doc|dot|ppt|pot|xls|xlt|pps)[xm]?)|(ppam|xlsb|xlam|pdf|zip|gz|bz2)$/
+ if /.*\/(.*)$/.match(a_url)
+ output_filename=meta_temp_dir+$1
+ if verbose
+ puts "Keeping " + output_filename
+ end
+ else
+ # shouldn't ever get here as the regex above should always be able to pull the filename out of the url,
+ # but just in case
+ output_filename=meta_temp_dir+"cewl_tmp"
+ output_filename += "."+file_extension unless file_extension==""
+ end
+ else
+ output_filename=meta_temp_dir+"cewl_tmp"
+ output_filename += "."+file_extension unless file_extension==""
+ end
+ out=File.new(output_filename, "w")
+ out.print(resp.body)
+ out.close
+
+ meta_data=process_file(output_filename, verbose)
+ if(meta_data!=nil)
+ usernames+=meta_data
+ end
+ rescue => e
+ puts "Couldn't open the meta temp file for writing - " + e.inspect
+ exit
+ end
+ end
+
+ # don't get words from these file types. Most will have been blocked by the url_check function but
+ # some are let through, such as .css, so that they can be checked for email addresses
+
+ # this is a bad way to do this but it is either white or black list extensions and
+ # the list of either is quite long, may as well black list and let extra through
+ # that can then be weeded out later rather than stop things that could be useful
+ begin
+ if file_extension !~ /^((doc|dot|ppt|pot|xls|xlt|pps)[xm]?)|(ppam|xlsb|xlam|pdf|zip|gz|bz2|css|png|gif|jpg|#)$/
+ begin
+ if email
+ # Split the file down based on the email address regexp
+ #words.gsub!(/\b([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})\b/i)
+ #p words
+
+ # If you want to pull email addresses from the contents of files found, such as word docs then move
+ # this block outside the if statement
+ # I've put it in here as some docs contain email addresses that have nothing to do with the target
+ # so give false positive type results
+ words.each_line do |word|
+ while /\b([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})\b/i.match(word)
+ if verbose
+ puts "Found #{$1} on page #{a_url}"
+ end
+ email_arr<<$1
+ word=word.gsub(/#{$1}/, "")
+ end
+ end
+ end
+ rescue => e
+ puts "There was a problem generating the email list"
+ puts "Error: " + e.inspect
+ puts e.backtrace
+ end
+
+ if wordlist
+ # remove any symbols
+ words.gsub!(/[^a-z0-9]/i," ")
+ # add to the array
+ words.split(" ").each do |word|
+ if word.length >= min_word_length
+ if !word_hash.has_key?(word)
+ word_hash[word] = 0
+ end
+ word_hash[word] += 1
+ end
+ end
+ end
+ end
+ rescue => e
+ puts "There was a problem handling word generation"
+ puts "Error: " + e.inspect
+ end
+ end
+ s.store_next_urls_with url_stack
+
+ end
+rescue Errno::ENOENT
+ puts "Invalid URL specified"
+ puts
+ exit
+rescue => e
+ puts "Couldn't access the site"
+ puts
+ puts "Error: " + e.inspect
+ puts e.backtrace
+ exit
+end
+
+#puts "end of main loop"
+
+if wordlist
+ puts "Words found\n\n" if verbose
+
+ sorted_wordlist = word_hash.sort_by do |word, count| -count end
+ sorted_wordlist.each do |word, count|
+ if show_count
+ outfile_file.puts word + ', ' + count.to_s
+ else
+ outfile_file.puts word
+ end
+ end
+end
+
+#puts "end of wordlist loop"
+
+if email
+ puts "Dumping email addresses to file" if verbose
+
+ email_arr.delete_if { |x| x.chomp==""}
+ email_arr.uniq!
+ email_arr.sort!
+
+ if (wordlist||verbose) && email_outfile.nil?
+ outfile_file.puts
+ end
+ if email_outfile.nil?
+ outfile_file.puts "Email addresses found"
+ outfile_file.puts email_arr.join("\n")
+ else
+ email_outfile_file.puts email_arr.join("\n")
+ end
+end
+
+#puts "end of email loop"
+
+if meta
+ puts "Dumping meta data to file" if verbose
+ usernames.delete_if { |x| x.chomp==""}
+ usernames.uniq!
+ usernames.sort!
+
+ if (email||wordlist) && meta_outfile.nil?
+ outfile_file.puts
+ end
+ if meta_outfile.nil?
+ outfile_file.puts "Meta data found"
+ outfile_file.puts usernames.join("\n")
+ else
+ meta_outfile_file.puts usernames.join("\n")
+ end
+end
+
+#puts "end of meta loop"
+
+if meta_outfile!=nil
+ meta_outfile_file.close
+end
+
+if email_outfile!=nil
+ email_outfile_file.close
+end
+
+if outfile!=nil
+ outfile_file.close
+end
diff --git a/cewl_lib.rb b/cewl_lib.rb
new file mode 100644
index 0000000..b049ffe
--- /dev/null
+++ b/cewl_lib.rb
@@ -0,0 +1,234 @@
+# == CeWL Library: Library to outsource reusable features
+#
+# Author:: Robin Wood (robin at digininja.org)
+# Copyright:: Copyright (c) Robin Wood 2013
+# Licence:: GPL
+#
+
+begin
+ require 'mini_exiftool'
+ require "zip"
+ require "rexml/document"
+ require 'mime'
+ require 'mime-types'
+ include REXML
+rescue LoadError => e
+ # catch the error and provide feedback on installing the gem
+ if e.to_s =~ /cannot load such file -- (.*)/
+ missing_gem = $1
+ puts "\nError: #{missing_gem} gem not installed\n"
+ puts "\t use: \"gem install #{missing_gem}\" to install the required gem\n\n"
+ exit
+ else
+ puts "There was an error loading the gems:"
+ puts
+ puts e.to_s
+ exit
+ end
+end
+
+# Override the MiniExiftool class so that I can modify the parse_line
+# method and force all encoding to ISO-8859-1. Without this the app bombs
+# on some machines as it is unable to parse UTF-8
+class MyMiniExiftool<MiniExiftool
+ def parse_line line
+ line.force_encoding('ISO-8859-1')
+ super
+ end
+end
+
+# == Synopsis
+#
+# This library contains functions to evaluate files found while running CeWL
+#
+# Author:: Robin Wood (dninja at gmail.com)
+# Copyright:: Copyright (c) Robin Wood 2010
+# Licence:: GPL
+#
+
+# Get data from a pdf file using regexps
+def get_pdf_data(pdf_file, verbose)
+ meta_data=[]
+ begin
+ interesting_fields=Array.[]("/Author")
+
+ f=File.open(pdf_file)
+ f.each_line{ |line|
+ line.force_encoding('ISO-8859-1')
+ if /pdf:Author='([^']*)'/.match(line)
+ if verbose
+ puts "Found pdf:Author: "+$1
+ end
+ meta_data<<$1.to_s.chomp unless $1.to_s==""
+ end
+ if /xap:Author='([^']*)'/i.match(line)
+ if verbose
+ puts "Found xap:Author: "+$1
+ end
+ meta_data<<$1.to_s.chomp unless $1.to_s==""
+ end
+ if /dc:creator='([^']*)'/i.match(line)
+ if verbose
+ puts "Found dc:creator: "+$1
+ end
+ meta_data<<$1.to_s.chomp unless $1.to_s==""
+ end
+ if /\/Author ?\(([^\)]*)\)/i.match(line)
+ if verbose
+ puts "Found Author: "+$1
+ end
+ meta_data<<$1.to_s.chomp unless $1.to_s==""
+ end
+ if /<xap:creator>(.*)<\/xap:creator>/i.match(line)
+ if verbose
+ puts "Found pdf:creator: "+$1
+ end
+ meta_data<<$1.to_s.chomp unless $1.to_s==""
+ end
+ if /<xap:Author>(.*)<\/xap:Author>/i.match(line)
+ if verbose
+ puts "Found xap:Author: "+$1
+ end
+ meta_data<<$1.to_s.chomp unless $1.to_s==""
+ end
+ if /<pdf:Author>(.*)<\/pdf:Author>/i.match(line)
+ if verbose
+ puts "Found pdf:Author: "+$1
+ end
+ meta_data<<$1.to_s.chomp unless $1.to_s==""
+ end
+ if /<dc:creator>(.*)<\/dc:creator>/i.match(line)
+ if verbose
+ puts "Found dc:creator: "+$1
+ end
+ meta_data<<$1.to_s.chomp unless $1.to_s==""
+ end
+
+ }
+ return meta_data
+ rescue => e
+ if verbose
+ puts "There was an error processing the document - " + e.message
+ end
+ end
+ return meta_data
+end
+
+# Get data from files using exiftool
+def get_doc_data(doc_file, verbose)
+ data=[]
+ begin
+ interesting_fields=Array.[]("Author","LastSavedBy","Creator")
+ file = MyMiniExiftool.new(doc_file)
+
+ interesting_fields.each{ |field_name|
+ if file.tags.include?(field_name)
+ data<<file[field_name].to_s
+ end
+ }
+ rescue => e
+ if verbose
+ puts "There was an error processing the document - " + e.message
+ end
+ end
+ return data
+end
+
+# Get data from Office 2007 documents by unzipping the relevant XML files then
+# checking for known fields
+def get_docx_data(docx_file, verbose)
+ meta_data=[]
+
+ interesting_fields=Array.[]("cp:coreProperties/dc:creator","cp:coreProperties/cp:lastModifiedBy")
+ interesting_files=Array.[]("docProps/core.xml")
+
+ begin
+ Zip::ZipFile.open(docx_file) { |zipfile|
+ interesting_files.each { |file|
+ if zipfile.find_entry(file)
+ xml=zipfile.read(file)
+
+ doc=Document.new(xml)
+ interesting_fields.each { |field|
+ element=doc.elements[field]
+ #puts element.get_text unless element==nil||element.get_text==nil
+ meta_data<<element.get_text.to_s.chomp unless element==nil||element.get_text==nil
+ }
+ end
+ }
+ }
+ rescue => e
+ if verbose
+ # not a zip file
+ puts "File probably not a zip file - " + e.message
+ end
+ end
+ return meta_data
+end
+
+# Take the file given, try to work out what type of file it is then pass it
+# to the relevant function to try to grab meta data
+def process_file(filename, verbose=false)
+ meta_data=nil
+
+ begin
+
+ if File.file?(filename) && File.exist?(filename)
+ mime_types=MIME::Types.type_for(filename)
+ if(mime_types.size==0)
+ if(verbose)
+ puts "Empty mime type"
+ end
+ return meta_data
+ end
+ if verbose
+ puts "Checking "+filename
+ puts " Mime type="+mime_types.join(", ")
+ puts
+ end
+ if mime_types.include?("application/word") || mime_types.include?("application/excel") || mime_types.include?("application/powerpoint")
+ if verbose
+ puts " Mime type says original office document"
+ end
+ meta_data=get_doc_data(filename, verbose)
+ else
+ if mime_types.include?("application/pdf")
+ if verbose
+ puts " Mime type says PDF"
+ end
+ # Running both my own regexp and exiftool on pdfs as I've found exif misses some data
+ meta_data=get_doc_data(filename, verbose)
+ meta_data+=get_pdf_data(filename, verbose)
+ else
+ # list taken from http://en.wikipedia.org/wiki/Microsoft_Office_2007_file_extensions
+ if filename =~ /(.(doc|dot|ppt|pot|xls|xlt|pps)[xm]$)|(.ppam$)|(.xlsb$)|(.xlam$)/
+ if verbose
+ puts " File extension says 2007 style office document"
+ end
+ meta_data=get_docx_data(filename, verbose)
+ elsif filename =~ /.php$|.aspx$|.cfm$|.asp$|.html$|.htm$/
+ if verbose
+ puts " Language file, can ignore"
+ end
+ else
+ if verbose
+ puts " Unknown file type"
+ end
+ end
+ end
+ end
+ if meta_data!=nil
+ if verbose
+ if meta_data.length > 0
+ puts " Found "+meta_data.join(", ")+"\n"
+ end
+ end
+ end
+ end
+ rescue => e
+ puts "Problem in process_file function"
+ puts "Error: " + e.message
+ end
+
+ return meta_data
+end
diff --git a/fab.rb b/fab.rb
new file mode 100644
index 0000000..37e3832
--- /dev/null
+++ b/fab.rb
@@ -0,0 +1,88 @@
+#!/usr/bin/env ruby
+
+# == FAB: Files Already Bagged
+#
+# This script can be run against files already
+# downloaded from a target site to generate a list
+# of usernames and email addresses based on meta
+# data contained within them.
+#
+# To see a list of file types which can be processed
+# see cewl_lib.rb
+#
+# == Usage
+#
+# fab [OPTION] ... filename/list
+#
+# -h, --help:
+# show help
+#
+# -v
+# verbose
+#
+# filename/list: the file or list of files to check
+#
+# Author:: Robin Wood (robin at digininja.org)
+# Copyright:: Copyright (c) Robin Wood 2011
+# Licence:: GPL
+#
+
+require "rubygems"
+require 'getoptlong'
+require "./cewl_lib.rb"
+
+opts = GetoptLong.new(
+ [ '--help', '-h', GetoptLong::NO_ARGUMENT ],
+ [ "-v" , GetoptLong::NO_ARGUMENT ]
+)
+
+def usage
+ puts"xx
+
+Usage: xx [OPTION] ... filename/list
+ -h, --help: show help
+ -v: verbose
+
+ filename/list: the file or list of files to check
+
+"
+ exit
+end
+
+verbose=false
+
+begin
+ opts.each do |opt, arg|
+ case opt
+ when '--help'
+ usage
+ when '-v'
+ verbose=true
+ end
+ end
+rescue
+ usage
+end
+
+if ARGV.length < 1
+ puts "Missing filename/list (try --help)"
+ exit 0
+end
+
+meta_data=[]
+
+ARGV.each { |param|
+ data=process_file(param, verbose)
+ if(data!=nil)
+ meta_data+=data
+ end
+}
+
+meta_data.delete_if { |x| x.chomp==""}
+meta_data.uniq!
+meta_data.sort!
+if meta_data.length==0
+ puts "No data found\n"
+else
+ puts meta_data.join("\n")
+end
--
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/forensics/cewl.git