[Debian Wiki] crawler not allowed to perform ?action=raw

Frank Lin PIAT fpiat at klabs.be
Tue May 11 06:26:51 UTC 2010


retitle 569191 crawler not allowed to perform ?action=raw
thanks

Andreas B. Mundt wrote:
> we use GET to download a wikipage and further process the data to
> prepare the manual of Debian Edu. The command:
> 	GET "http://wiki.debian.org/DebianEdu/Documentation/Lenny/AllInOne?action=raw"
> works fine in Lenny, but stopped working in squeeze where "You are not
> allowed to access this!" is returned. If you remove "?action=raw" from
> the URL everything is fine. Is this intended, and do we have to provide a
> header?

Damyan Ivanov wrote:
> On Lenny (works)
> ================
> User-Agent: lwp-request/0.810
> 
> On Sid (breaks)
> ===============
> User-Agent: lwp-request/5.834 libwww-perl/5.834

Yes, this is standard MoinMoin behavior.
The wiki engine has surge protection mechanisms to keep web
crawlers (and users) from DoS'ing the wiki.
Well-known web crawlers (including libwww-perl/*) are only allowed to
fetch HTML-rendered pages, so a request with ?action=raw is refused.
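
To illustrate the idea, here is a simplified sketch in Python of a
User-Agent based filter. It is not MoinMoin's actual code: the pattern,
the set of allowed actions, and the name is_forbidden are all
assumptions made up for this example.

```python
import re

# Hypothetical crawler pattern; a real wiki configures its own list.
CRAWLER_AGENTS = re.compile(r"libwww-perl|wget|googlebot", re.IGNORECASE)

# Assumption for this sketch: crawlers may only request the rendered page.
ALLOWED_FOR_CRAWLERS = {"show"}


def is_forbidden(user_agent, action="show"):
    """Return True when a known crawler requests a non-rendering action."""
    if not CRAWLER_AGENTS.search(user_agent or ""):
        # Not recognised as a crawler: no restriction applies in this sketch.
        return False
    return action not in ALLOWED_FOR_CRAWLERS
```

Under these assumptions, the Sid agent string ("lwp-request/5.834
libwww-perl/5.834") matches the crawler pattern and is refused for
action=raw, while the Lenny string ("lwp-request/0.810") does not match
and passes, which is consistent with the behavior reported above.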

As was mentioned, you should change your crawler's User-Agent string
(use something meaningful, so the admin can get in touch with you
rather than just blacklisting the "offending" IPs).
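
As a rough sketch of what that could look like, here is a small Python
fetcher that sends a descriptive User-Agent. The agent string, the
contact address, and the function names are placeholders invented for
this example, and the UTF-8 decoding is an assumption. Whether the wiki
accepts a given string is up to its configuration.

```python
from urllib.request import Request, urlopen

# Hypothetical agent string: name the tool and give the admin a contact.
USER_AGENT = "debian-edu-doc-fetch/0.1 (contact: maintainer@example.org)"


def build_raw_request(page_url, user_agent=USER_AGENT):
    """Build a request for the raw wiki text with an explicit User-Agent."""
    return Request(page_url + "?action=raw",
                   headers={"User-Agent": user_agent})


def fetch_raw(page_url, user_agent=USER_AGENT, timeout=30):
    """Download the raw page text (assumes the wiki serves UTF-8)."""
    request = build_raw_request(page_url, user_agent)
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling
fetch_raw("http://wiki.debian.org/DebianEdu/Documentation/Lenny/AllInOne")
would then request the same page as the GET command quoted above, but
with an agent string that identifies the tool and its maintainer.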

Thanks,

Franklin



