Hello @mvdbos
I haven't found time to look into the robots.txt filter discussed in the other issue. Sorry! I stumbled on a new question you might be able to shed some light on:
I'm trying to filter out URLs that have been redirected to an external host. I'm keen to implement this as a PostFetchFilter to keep it all within the spider. Is it possible to get the final URL (after redirects) in a PostFetchFilter? It seems like only the original URL is part of the Resource.
Appreciate any ideas on how you would approach this.
Cheers,
Peter
Hi @spekulatius, my apologies for the very late reply.
One way (not tested by me) could be this:
When you construct the Guzzle request handler, set the allow_redirects option with its track_redirects sub-option set to true, i.e. 'allow_redirects' => ['track_redirects' => true]. Guzzle then records every redirect it follows, in order, in the X-Guzzle-Redirect-History (URIs) and X-Guzzle-Redirect-Status-History (status codes) response headers.
If I am not mistaken, Resource contains the entire response (ResponseInterface), which you can use to inspect the headers.
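A minimal, untested sketch of the check itself, assuming redirect tracking is enabled as described above. The helper names `finalUrl` and `redirectedExternally` are hypothetical and not part of the spider or Guzzle; they take the original URL plus the list of values from the X-Guzzle-Redirect-History header (what `getHeader('X-Guzzle-Redirect-History')` returns on a PSR-7 response) and compare hosts only:

```php
<?php

/**
 * Return the last URL in the redirect history, or the original URL
 * when no redirect was followed (empty history).
 *
 * @param string[] $redirectHistory values of X-Guzzle-Redirect-History, in order
 */
function finalUrl(string $originalUrl, array $redirectHistory): string
{
    if ($redirectHistory === []) {
        return $originalUrl;
    }

    return (string) end($redirectHistory);
}

/**
 * True when the final URL's host differs from the original URL's host.
 * Assumption: "external" means a different host; subdomains count as external.
 *
 * @param string[] $redirectHistory values of X-Guzzle-Redirect-History, in order
 */
function redirectedExternally(string $originalUrl, array $redirectHistory): bool
{
    $originalHost = (string) parse_url($originalUrl, PHP_URL_HOST);
    $finalHost = (string) parse_url(finalUrl($originalUrl, $redirectHistory), PHP_URL_HOST);

    // Host names are case-insensitive, so compare without regard to case.
    return strcasecmp($originalHost, $finalHost) !== 0;
}
```

Inside a PostFetchFilter you would presumably feed this the resource's original URL and the header values from its response, and return true from the filter when `redirectedExternally(...)` is true. How exactly those are read from Resource is an assumption on my side, so please check it against the current code.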