[game-data-packager] 01/01: add a spider to automate Wikipedia urls lookup

Alexandre Detiste detiste-guest at moszumanska.debian.org
Sun Sep 27 22:40:32 UTC 2015


This is an automated email from the git hooks/post-receive script.

detiste-guest pushed a commit to branch master
in repository game-data-packager.

commit beef605bec68b7d9aa3f821872dc3d1fae103119
Author: Alexandre Detiste <alexandre.detiste at gmail.com>
Date:   Mon Sep 28 00:39:41 2015 +0200

    add a spider to automate Wikipedia urls lookup
---
 data/wikipedia.csv | 65 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 tools/spider.py    | 59 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 124 insertions(+)

diff --git a/data/wikipedia.csv b/data/wikipedia.csv
new file mode 100644
index 0000000..6343ce7
--- /dev/null
+++ b/data/wikipedia.csv
@@ -0,0 +1,65 @@
+arx;http://vi.wikipedia.org/wiki/Arx_Fatalis
+atlantis;http://en.wikipedia.org/wiki/Indiana_Jones_and_the_Fate_of_Atlantis
+black-cauldron;http://en.wikipedia.org/wiki/The_Black_Cauldron_%28game%29
+brokensword1;http://en.wikipedia.org/wiki/Broken_Sword:_The_Shadow_of_the_Templars
+chex;http://en.wikipedia.org/wiki/Chex_Quest
+comi;http://en.wikipedia.org/wiki/Monkey_Island
+dig;http://en.wikipedia.org/wiki/The_Dig
+discworld-1;http://en.wikipedia.org/wiki/Discworld_%28video_game%29
+doom2;http://en.wikipedia.org/wiki/Doom_II
+drbrain1;http://en.wikipedia.org/wiki/Castle_of_Dr._Brain
+drbrain2;http://en.wikipedia.org/wiki/The_Island_of_Dr._Brain
+feeble-files;http://en.wikipedia.org/wiki/The_Feeble_Files
+fullthrottle;http://en.wikipedia.org/wiki/Full_Throttle_(1995_video_game)
+future-wars;http://en.wikipedia.org/wiki/Future_Wars
+glory1;http://en.wikipedia.org/wiki/Quest_for_Glory:_So_You_Want_to_Be_a_Hero
+glory2;http://en.wikipedia.org/wiki/Quest_for_Glory:_Trial_by_Fire
+glory3;http://en.wikipedia.org/wiki/Quest_for_Glory_III:_Wages_of_War
+gobliiins;http://en.wikipedia.org/wiki/Gobliiins
+gobliins2;http://en.wikipedia.org/wiki/Gobliiins
+goblins3;http://en.wikipedia.org/wiki/Gobliiins
+goldrush;http://en.wikipedia.org/wiki/Gold_Rush%21
+grimfandango;http://en.wikipedia.org/wiki/Grim_Fandango
+hexen;http://en.wikipedia.org/wiki/Hexen
+inherit;http://en.wikipedia.org/wiki/Inherit_the_Earth:_Quest_for_the_Orb
+kingsquest1;http://en.wikipedia.org/wiki/King%27s_Quest_I:_Quest_for_the_Crown
+kingsquest2;http://en.wikipedia.org/wiki/King%27s_Quest_II:_Romancing_the_Throne
+kingsquest3;http://en.wikipedia.org/wiki/King%27s_Quest_III:_To_Heir_Is_Human
+kingsquest4;http://en.wikipedia.org/wiki/King%27s_Quest_IV:_The_Perils_of_Rosella
+kingsquest5;http://en.wikipedia.org/wiki/King%27s_Quest_V:_Absence_Makes_the_Heart_Go_Yonder!
+kingsquest6;http://en.wikipedia.org/wiki/King%27s_Quest_VI:_Heir_Today,_Gone_Tomorrow
+kyrandia1;http://en.wikipedia.org/wiki/The_Legend_of_Kyrandia
+kyrandia2;http://en.wikipedia.org/wiki/The_Legend_of_Kyrandia
+kyrandia3;http://en.wikipedia.org/wiki/The_Legend_of_Kyrandia
+lands-of-lore;http://en.wikipedia.org/wiki/Lands_of_Lore:_The_Throne_of_Chaos
+larry1;http://en.wikipedia.org/wiki/Leisure_Suit_Larry_in_the_Land_of_the_Lounge_Lizards
+larry2;http://en.wikipedia.org/wiki/Leisure_Suit_Larry_Goes_Looking_for_Love_%28in_Several_Wrong_Places%29
+larry3;http://en.wikipedia.org/wiki/Leisure_Suit_Larry_3:_Passionate_Patti_in_Pursuit_of_the_Pulsating_Pectorals
+larry5;http://en.wikipedia.org/wiki/Leisure_Suit_Larry_5:_Passionate_Patti_Does_a_Little_Undercover_Work
+larry6;http://en.wikipedia.org/wiki/Leisure_Suit_Larry_6:_Shape_Up_or_Slip_Out%21
+last-crusade;http://en.wikipedia.org/wiki/Indiana_Jones_and_the_Last_Crusade:_The_Graphic_Adventure
+loom;http://en.wikipedia.org/wiki/Loom_(video_game)
+lost-in-time;http://en.wikipedia.org/wiki/Lost_in_Time_%28video_game%29
+manhole;http://en.wikipedia.org/wiki/The_Manhole
+manhunter1;http://en.wikipedia.org/wiki/Manhunter:_New_York
+maniacmansion;http://en.wikipedia.org/wiki/Maniac_Mansion
+nomouth;http://en.wikipedia.org/wiki/I_Have_No_Mouth%2C_and_I_Must_Scream_%28computer_game%29
+policequest1;http://en.wikipedia.org/wiki/Police_Quest
+policequest2;http://en.wikipedia.org/wiki/Police_Quest
+policequest3;http://en.wikipedia.org/wiki/Police_Quest
+return-to-zork;http://en.wikipedia.org/wiki/Return_to_Zork
+sam-and-max;http://en.wikipedia.org/wiki/Sam_%26_Max_Hit_the_Road
+sherlock-holmes1;https://en.wikipedia.org/wiki/The_Lost_Files_of_Sherlock_Holmes
+simon1;http://en.wikipedia.org/wiki/Simon_The_Sorcerer
+soltys;http://pl.wikipedia.org/wiki/So%C5%82tys_%28gra_komputerowa%29
+spacequest1;http://en.wikipedia.org/wiki/Space_Quest_I:_The_Sarien_Encounter
+spacequest2;http://en.wikipedia.org/wiki/Space_Quest_II:_Vohaul%27s_Revenge
+spacequest3;http://en.wikipedia.org/wiki/Space_Quest_III:_The_Pirates_of_Pestulon
+spacequest4;http://en.wikipedia.org/wiki/Space_Quest_IV:_Roger_Wilco_and_the_Time_Rippers
+spacequest5;http://en.wikipedia.org/wiki/Space_Quest_V:_The_Next_Mutation
+t7g;http://en.wikipedia.org/wiki/The_7th_Guest
+tentacle;http://en.wikipedia.org/wiki/Day_of_the_Tentacle
+toonstruck;http://en.wikipedia.org/wiki/Toonstruck
+zak;http://en.wikipedia.org/wiki/Zak_McKracken_and_the_Alien_Mindbenders
+zork-inquisitor;http://en.wikipedia.org/wiki/Zork:_Grand_Inquisitor
+zork-nemesis;http://en.wikipedia.org/wiki/Zork_Nemesis
diff --git a/tools/spider.py b/tools/spider.py
new file mode 100755
index 0000000..e06093c
--- /dev/null
+++ b/tools/spider.py
@@ -0,0 +1,59 @@
+#!/usr/bin/python3
+# encoding=utf-8
+#
+# Copyright © 2015 Alexandre Detiste <alexandre at detiste.be>
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License
+# as published by the Free Software Foundation; either version 2
+# of the License, or (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+#
+# You can find the GPL license text on a Debian system under
+# /usr/share/common-licenses/GPL-2.
+
+# A simple spider that locates Wikipedia URLs
+# in per-engine wiki pages.
+# Games already listed in the CSV are not rescanned.
+
+import time
+import urllib.request
+from bs4 import BeautifulSoup
+from game_data_packager import load_games
+
+CSV = 'data/wikipedia.csv'
+
+urls = dict()
+with open(CSV, 'r', encoding='utf8') as f:
+    for line in f:
+        line = line.strip()
+        if not line:
+            continue
+        shortname, url = line.split(';', 1)
+        urls[shortname] = url
+
+def is_wikipedia(href):
+    return href and "wikipedia" in href
+
+for shortname, game in load_games().items():
+    if not game.wiki:
+        continue
+    if shortname in urls:
+        continue
+
+    url = game.wikibase + game.wiki
+    with urllib.request.urlopen(url) as html:
+        soup = BeautifulSoup(html, 'lxml')
+    for tag in soup.find_all(href=is_wikipedia):
+        urls[shortname] = tag['href']
+
+    # be polite: pause between requests
+    time.sleep(1)
+
+# write it back
+with open(CSV, 'w', encoding='utf8') as f:
+    for shortname in sorted(urls.keys()):
+        f.write(shortname + ';' + urls[shortname] + '\n')
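
Note on link selection: the loop in tools/spider.py stores every matching href in turn, so the last "wikipedia" link on a page wins. That may be why some entries, such as arx, point at a non-English edition. A minimal sketch of one alternative follows, assuming the caller has already collected the hrefs from the page. The helper name pick_wikipedia_url and its preferred_lang parameter are hypothetical and not part of game-data-packager.

```python
from urllib.parse import urlsplit


def pick_wikipedia_url(hrefs, preferred_lang="en"):
    """Return the best Wikipedia link from hrefs, or None if there is none.

    Hypothetical helper: prefers the first link on
    <preferred_lang>.wikipedia.org and otherwise falls back to the first
    link on any other wikipedia.org subdomain.
    """
    fallback = None
    for href in hrefs:
        if not href:
            continue
        # Compare the host name, not the whole URL, so that a path such as
        # http://example.org/wikipedia is not mistaken for a match.
        host = (urlsplit(href).hostname or "").lower()
        if not host.endswith(".wikipedia.org"):
            continue
        if host == preferred_lang + ".wikipedia.org":
            return href
        if fallback is None:
            fallback = href
    return fallback
```

With Beautiful Soup, the hrefs could be gathered with something like (tag['href'] for tag in soup.find_all('a', href=True)) and passed straight in. A None result would simply leave the game without an entry, as the current script does when no link matches.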

-- 
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/pkg-games/game-data-packager.git


