# gagamba

A simple spider that crawls a site and checks for bad links (404s, etc.).
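The core idea, roughly: fetch each link and flag any whose response is an error status. A minimal sketch, assuming the `requests` library (the function name `check_link` is illustrative, not from this codebase):

```python
import requests

def check_link(url):
    """Return (url, status) so bad links (404s etc.) can be reported."""
    resp = requests.get(url, timeout=10)
    return url, resp.status_code

url, status = check_link("https://example.com/")
if status >= 400:
    print("Bad link: {url} -> {status}".format(url=url, status=status))
```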

## Blog post

See this article for more information.

## Rough notes kept during development

debug_print("Links found --> ")
for link in links:
    debug_print("-- DEBUG --> {link}".format(link=link))
debug_print("--------")
regex = r'<a[\s\S]*?href=["\'](\S*?)["\']>'
m = re.findall(regex, r.text, re.MULTILINE)

## TODO

- Check for missing images
- Fix the stack overflow caused by too many levels of recursion when spidering NDP (see the iterative-crawl sketch after this list)
- Try running it against a local copy of the site (e.g. https://localhost:3000), which would be a lot faster
- Use an HTML parser or similar to process links, since the regex is probably not robust (see the BeautifulSoup sketch below)
- Add checks for links that point offsite, but do not crawl them
- Add exception handling, because some sites raise an exception rather than returning an error code, for example when the site doesn't exist (see the fetch sketch below)
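For the recursion-depth issue, one option is an iterative crawl with an explicit queue and a visited set. A sketch under those assumptions, not the project's actual crawl code (`get_links` stands in for whatever fetches a page and extracts its hrefs):

```python
from collections import deque

def crawl(start_url, get_links):
    """Breadth-first crawl using an explicit queue, so depth is bounded
    by memory rather than Python's recursion limit."""
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen
```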
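For link extraction, an HTML parser handles the cases the regex misses (attributes in any order, extra whitespace, single vs. double quotes). A sketch using BeautifulSoup, assuming it is an acceptable dependency:

```python
from bs4 import BeautifulSoup

def extract_links(html):
    """Return all href values from anchor tags in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]
```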
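For the exception-handling item: with `requests`, a nonexistent host raises an exception (e.g. a `ConnectionError` on DNS failure) instead of returning a status code, so the fetch needs a try/except. A sketch:

```python
import requests

def fetch_status(url):
    """Return the HTTP status code, or None if the request itself failed
    (for example, when the site doesn't exist)."""
    try:
        return requests.get(url, timeout=10).status_code
    except requests.exceptions.RequestException:
        return None
```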