Ubuntu - wget - Ignore robots.txt

By default, wget respects the robots.txt file and therefore skips any files that it disallows.

The robots exclusion standard is purely advisory: robots.txt lists rules telling search engines and other robots which files they should not access, but a robot may choose to ignore them.
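For illustration, a minimal robots.txt might look like the following; the paths shown are hypothetical:

```
User-agent: *
Disallow: /private/
Disallow: /tmp/
```

A compliant crawler reading this file would avoid everything under /private/ and /tmp/, while a robot that ignores the standard could still fetch those paths.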

wget can be told to ignore those rules, so it downloads the excluded files anyway. Pass the -e option as shown next.

wget -e robots=off -r http://somesite.com
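The flag can be combined with wget's standard recursion and politeness options. A sketch, using the placeholder host from above:

```shell
# -e robots=off : ignore robots.txt (and robots meta directives)
# -r            : download recursively
# -np           : never ascend to the parent directory
# --wait=1      : pause one second between requests to reduce server load
wget -e robots=off -r -np --wait=1 http://somesite.com/
```

Adding --wait and -np keeps the recursive crawl contained and less aggressive, which is advisable when deliberately bypassing robots.txt.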
ubuntu/wget/ignore_robots.txt.1575499236.txt.gz · Last modified: 2020/07/15 09:30 (external edit)
