
Is the mirror complete and up-to-date?
There is a plethora of switches that I would encourage you to explore, but one that I find particularly useful mirrors even those parts of the site that the developer tells Google, Yahoo, and other crawlers to omit. Developers use robots.txt files or robots meta tags to tell crawlers to skip archival or search indexing. They usually do this to keep something from being searchable, which is certainly of interest in a pen test. Appending the -s0 option tells the tool to ignore these exclusion rules, so excluded content is not left out of our mirror or our scans.
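To see which areas a site is asking crawlers to avoid, it can help to read the robots.txt rules directly. The following is a minimal sketch using only the Python standard library; the sample rules, the example.test host, and the helper name disallowed_paths are illustrative assumptions rather than anything taken from a real target.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /admin/
Disallow: /backup/   # old site archives
"""


def disallowed_paths(robots_text):
    """Collect every non-empty Disallow path, regardless of user agent."""
    paths = []
    for raw_line in robots_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            if path:
                paths.append(path)
    return paths


# The standard-library parser answers the question a polite crawler asks:
# "am I allowed to fetch this URL?"
parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS.splitlines())

hidden = disallowed_paths(SAMPLE_ROBOTS)
admin_allowed = parser.can_fetch("*", "http://example.test/admin/login")
index_allowed = parser.can_fetch("*", "http://example.test/index.html")
```

Here hidden lists the paths a well-behaved crawler would skip, which are exactly the places a mirror made with robots handling disabled would still include.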
Updating our mirror is a pretty simple process. Assuming you are storing it in a non-volatile location (/tmp is cleared on reboot), you can rerun the update manually or assign a cron job to refresh your mirror's content and links automatically. A prompt will confirm that you mean to update the cache; assuming you do, it will remove the lock file and perform the update. I would recommend exploring the archival switches that let you keep older files or perform incremental updates to make better use of your time. Older copies may hold content that was removed to cover an inadvertent disclosure, so it is well worth hoarding some historical files until they have been scanned.
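A scheduled update could be wrapped in a small script along these lines. This is a sketch under stated assumptions: that the mirroring tool is HTTrack (which the -s0 option suggests), that its --update shortcut refreshes an existing project when run from the project directory, and that the project keeps its cache in an hts-cache subdirectory. The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_update_command():
    """Return the command that refreshes an existing mirror.

    Assumption: HTTrack's --update shortcut reuses the options stored
    in the project's cache, so no URL or switches are repeated here.
    """
    return ["httrack", "--update"]


def update_mirror(project_dir):
    """Run the update inside project_dir and return the exit code."""
    project = Path(project_dir)
    # Assumption: an existing HTTrack project has an hts-cache directory.
    # Refusing to run without it avoids starting a fresh mirror by mistake.
    if not (project / "hts-cache").is_dir():
        raise FileNotFoundError(f"no mirror cache found in {project}")
    result = subprocess.run(build_update_command(), cwd=project, check=False)
    return result.returncode
```

A cron job could then call update_mirror with the path of the mirror on persistent storage. Copying the project directory to a dated folder before each run is one simple way to keep the older files for later scanning.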
