[gopher] Gopher++ scrapped & Internet Archive -style thingy
Kim Holviala
kim at holviala.com
Tue Apr 20 09:25:54 UTC 2010
As part of my project to code a neat search engine to cover the whole
Gopherspace I've (partially) crawled sites and snooped and researched a
lot of stuff.
Let's just say that the Gopherspace is small, but interesting. I'm glad
I started crawling :-).
Anyway.
Whatever I've written about the gopher++ extra headers can now be
considered obsolete. I found a few live sites which just cannot accept
anything other than a selector<CRLF>, so there's no way I can insert
extra headers without breaking stuff. Those sites even break with type 7
queries (and gopher+), so I'm kind of giving up now.
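For the curious, the lowest-common-denominator request really is that
minimal. Here's a sketch in Python of what a client can safely send to
those old servers — just the selector and CRLF, nothing appended
(function names are mine, not from any real client):

```python
import socket

def build_request(selector):
    """A plain RFC 1436 request is just the selector followed by CRLF.

    No tabs, no gopher+ fields, no extra header lines -- anything more
    than this confuses the oldest servers.
    """
    return selector.encode("ascii", "replace") + b"\r\n"

def gopher_fetch(host, selector, port=70, timeout=10):
    """Fetch one resource using only the bare request old servers expect."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_request(selector))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)
```

Send that and read until the server closes the connection — that's the
whole protocol those sites understand.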
All code regarding the header extensions has been scrapped and deleted;
it's all gone for good. The good thing is that my code is now 100%
compatible with ALL early-90s servers, but the bad thing is that the
neat charset conversion thingy is now all gone and we're back to 7-bit
US-ASCII (or non-working Latin/UTF). Oh, well.
As my search engine's indexer is an offline one, my spider basically
crawls around and saves all type 0 & 1 files to a local cache hierarchy.
This was mostly accidental, but I managed to create something very much
like The Internet Archive, but for gopher. Basically, you give the cache
manager a URL and it gives you back the cached page (if it has it) AND
it mangles menus so that as long as the pages are in cache you'll stay
in the cache.
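The menu mangling could look roughly like this — a sketch, assuming
standard tab-separated RFC 1436 menu lines (type+display, selector,
host, port) and a `cache.q?` query selector like in the URLs below; the
function name and defaults are made up for illustration:

```python
def rewrite_menu(menu_text, cache_host, cache_port=70,
                 cache_selector="/cache.q?"):
    """Rewrite a Gopher menu so every item points back into the cache.

    For each real item we build a gopher:// URL for the original target
    and wrap it in the cache query selector, so the reader stays inside
    the cache as long as the pages are cached.
    """
    out = []
    for line in menu_text.splitlines():
        if line == "." or "\t" not in line:
            out.append(line)            # terminator / malformed line: keep as-is
            continue
        parts = line.split("\t")
        if len(parts) < 4:
            out.append(line)
            continue
        display, selector, host, port = parts[0], parts[1], parts[2], parts[3]
        itemtype = display[0] if display else "i"
        if itemtype == "i":             # info lines have no real target
            out.append(line)
            continue
        target = "gopher://%s:%s/%s%s" % (host, port, itemtype, selector)
        new_selector = cache_selector + target
        out.append("\t".join([display, new_selector,
                              cache_host, str(cache_port)]))
    return "\n".join(out)
```

Every non-info item ends up pointing at the cache server, which decides
whether to serve the cached copy or fall through to the original.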
It's kind of like a combination of Google's cache and archive.org, only
it works better than either of those...
Here's a cached copy of (partial) Floodgap:
gopher://gophernicus.org/1/cache.q?gopher://gopher.floodgap.com
It even cached itself:
gopher://gophernicus.org/1/cache.q?gopher://gophernicus.org
Notice how the cached Floodgap is much faster than the original one ;D.
I wish there was something like this for teh web....
<turtleneck shirt mode on>
One more thing,
</turtleneck>
I'll be crawling everything in about a month or so, so now is the time
to fix your robots.txt if you don't want your files to end up in the cache.
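(Assuming your server follows the usual web-style convention of serving
a robots.txt from the root selector, an entry like this would keep a
directory out of the crawl — the path is just an example:

```
User-agent: *
Disallow: /private/
```

An empty Disallow, or no robots.txt at all, means everything gets
crawled.)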
- Kim
More information about the Gopher-Project mailing list