Theo Todman's Web Page - Notes Pages


Website Documentation

Website Generator Documentation - Spider

(Work In Progress: output at 23/08/2024)

Previous VersionsNote ReferencesNote Citations


This document covers1 the following functions performed by clicking buttons on the front screen:-

  1. Run Web Spider (cmdSpider_Click)

To see the Code, click on the procedure name above.


Spider
  1. I note here that the files generated by the Backup2 system are utilised by the Spider. The two processes need to be synchronised, especially as the Spider as currently architected only works under Windows 10 – so only on my Laptop.
  2. This process (governed by Spider_Ctrl, and then Spider_Scurry) interrogates the files on my hard drive that form the local copy of my website, and after determining the directory structure, digs out all the hyperlinks by recursively passing through the files as text files using the VBA FileSystemObject.
  3. It is possible to buy, or download free, web-spiders that will check hyperlink integrity across the internet when pointed at a base-URL, but (in my experience) these take forever to run on large sites (if only because the timeout limit has to be set such as to avoid false negatives). Hence, I wrote my own Spider to run quickly on a local drive.
  4. While the Spider will record external links, it does not check them. However, I have a completely separate sub-system that performs this function: see Web Links3.
  5. Despite this being a local checker with no timeout delays, it still takes a long time to run] – currently – September 2003, and on my new desktop – it takes about 3 hours4, so there are various parameters that can be set to control what’s done on a particular run.
  6. Since the slave database5 nearly breaks the 2Gb limit during the full Spider run – despite repeated compacts & repairs – the location of the various files has been under review. I’ve also had a project on the go that removes the common element (C:\Theo's Files\Birkbeck) from all links recorded in the tables. While the process runs to completion, it currently doesn’t work correctly and says there are 500k broken links! There aren’t; the bug is under investigation!
  7. The slave database is now only used to hold the following tables:-
  8. The main database did break the 2Gb limit during the (first, abortive) September 2021 run, so I removed the following tables to a new ‘Spider’ database (Spider.accdb):-
  9. Otherwise, the following three tables are parked in the Backups database:-
  10. I’d thought of using the ‘.Run’ technology exemplified in Sub AppAccess_Test, but – while I proved the technology OK – it doesn’t seem necessary at the moment. That explains the relocation of some of the smaller tables in Spider.
  11. Stepping back a bit, there are three main factors involved in the decisions to hold various tables in the same or different databases:
    1. Database size – especially bloating during the Spider run.
    2. Efficiency of queries: cross-database joins used to be very inefficient.
    3. Ability to compact & repair slave databases (but not the master database) mid-run.
  12. These days – with solid state disks and lots of memory – I think the second factor is less important as the tables are effectively all in memory.
  13. Consequently, some of the architecture may no longer be strictly required.
  14. The three main control tables are opened by cmdSpider_Click, namely:- These can be used to control / limit the run.
  15. Spider_Control: This is a single-row table that contains:-
    • Statistics from the last run.
    • A flag “Update_Since_Last_Run” which, if set to “Yes” limits the run to interrogating pages that have a last-changed timestamp subsequent to the last Spider run. This is less useful than it might be, in that I currently run the Spider after I've regenerated the entire site (using Full Website Re-Gen6), so that in practice almost all7 pages are scanned.
    • Another flag “Stop_Spider” which is checked at the start of the recursive Spider_Scurry and which will stop the run if set to “Yes”. The idea is that the control table should be open in another copy of MS Access so that if you want to stop the run without crashing the program, you can do so by setting this flag8.
  16. Directory_Structure: This table contains one row for each sub-directory (including the root) within the website. It is maintained by the Spider itself as far as the number of rows is concerned. However, the user can set a couple of flags against each row:-
    • “Do_Not_Parse”: this can be set to “Yes” to ignore this directory (and any sub-directories). Any corresponding rows in Site_Map are left unchanged.
    • “Updates_Only”: this can be set to over-ride the “Update_Since_Last_Run” flag in Spider_Control for this directory (and any sub-directories).
  17. Site_Map:
    • This table contains one row for each object (HTML page, PDF, GIF, Word document, and the like) in my website.
    • It is maintained by the Spider as far as the number of rows is concerned. The Spider records, amongst other things, how long it took for the object to be parsed (“Time_To_Update”, in milliseconds).
    • The user can set a flag “Block_Update” that will cause the object not to be parsed in this coming run. This is because some pages are very large and take a very long time to parse.
  18. Blocked_Spider_Files
    • This is a query that selects only those rows of Site_Map that have Block_Update = “Yes”, though it is possible to add others from Site_Map by toggling to the table, setting the flag, and bouncing the query.
    • This is a “fine-tuning” way of restricting the operation of the Spider.
  19. During the Spider process, the three main tables:- Are updated via their local analogues, which are maintained during the bulk of the processing, until the Sub Spider_Copy is called at the end. There is a bit of fancy footwork to ensure the various options don’t either leave redundant links in place, or delete live ones.
  20. Spider_Copy also performs the processing to create the “full links” column in the Raw_Links table.
  21. Spider_Copy also segregates out9 the Section-link within a page, and then runs10 the Spider_Missing_Internal_Links query to pick up broken links within the site.
  22. As noted above, the Spider takes a long time to run. In addition, the activity on the Slave Database that contains the Raw_Links table expands the table from its “resting” size of under 600Mb to over the 2Gb limit11 beyond which the database becomes corrupt. Consequently, I have added two diagnostic and repair functions during the processing-
    1. I use debug.print to timestamp the various sub-processes within the run.
    2. I call the function Compact_Repair periodically to Compact & Repair the Slave Database. If this fails, as it sometimes seems to, the process as a whole can fail.
  23. These diagnostics12 reveal that:-
    1. Spider_Scurry is the longest-running process, as might be expected; it takes 4.4 hours based on the database / website at its current size (1st September 2021).
    2. The process Full_Link_Up_Levels_Gen, called by Spider_Copy, used to take about 3 hours, but now takes about 8 minutes, most of which is taken up with compacts & repairs.
    3. No other process takes more than a few minutes.
  24. How the Link-Checking works: The above is all very well, but how does the actual checking work? Well:-
  25. Description of the detailed processing now follows:-
    • Spider_Ctrl: as might be expected, this controls the other processing. It asks questions about the run required so that parameters can be set. It then calls the two main processes below, and finally calls the three processes that output the Web-Links Test Webpages.
    • Spider_Scurry: this is self-referential procedure that combs the links recursively until a “leaf” page is found with no further links. Presumably there’s a check for loops!
    • Spider_Copy: this process:-
      • Updates the Directory_Structure, Directory_Fine_Structure and Site_Map tables. The first two tables are in the Backups database, the third is in the main database.
      • Updates the Raw_Links table in the Slave database, using the following procedures:-
        1. Full_Link_Same_Directory_Gen: this process simply sets the full link in the simple case where the link is to the same directory. Because so many links are involved (circa 0.5m13 of this category at the moment), this cannot be an update query, but is run through in code, with the slave database being compacted and repaired every 200,000 records.
        2. Full_Link_Up_Levels_Gen: this process is similar to the above14, but caters for the case where links are not to the same directory. In this case, the raw link will have a variable number of “../” directory-shifters, to get back to the Site root-directory, followed by the address from there. For some reason there’s a query of the Directory_Fine_Structure table.
        3. Full_Link_Sections_Fix
  26. … to be completed …

This Note is awaiting further attention15.



In-Page Footnotes:

Footnote 1: Footnotes 4, 12, 13: Footnote 5: Footnote 7: Footnote 8: Footnote 9: Footnote 10: Footnote 11: Footnote 14:


Table of the Previous 7 Versions of this Note:

Date Length Title
06/07/2023 00:43:12 14522 Website Generator Documentation - Spider & System Backups
28/09/2022 10:24:58 10894 Website Generator Documentation - Spider & Slave Database
01/10/2021 13:17:46 10895 Website Generator Documentation - Spider & Slave Database
04/10/2020 00:27:22 9370 Website Generator Documentation - Spider & Slave Database
05/04/2019 10:36:29 9377 Website Generator Documentation - Spider & Slave Database
13/01/2015 19:07:41 5310 Website Generator Documentation - Spider & Slave Database
02/06/2013 10:58:05 347 Website Generator Documentation - Spider & Slave Database



Note last updated Reference for this Topic Parent Topic
06/12/2025 01:36:04 986 (Website Generator Documentation - Spider) Website Generator Documentation - Control Page


Summary of Notes Referenced by This Note

Awaiting Attention (Documentation) Website Generator Documentation - Backups Website Generator Documentation - Full Website Re-Gen Website Generator Documentation - Prune Website Website Generator Documentation - Web Links

To access information, click on one of the links in the table above.




Summary of Notes Citing This Note

Status: Priority Task List (2025 - November), 2, 3 Status: Summary (2025 - September), 2, 3 Status: Web-Tools (2025 - September) Website - Outstanding Developments (2025 - December), 2 Website Generator Documentation - Backups, 2, 3
Website Generator Documentation - Control Page Website Generator Documentation - Functors, 2 Website Generator Documentation - Web Links, 2    

To access information, click on one of the links in the table above.




Text Colour Conventions

  1. Blue: Text by me; © Theo Todman, 2025




© Theo Todman, June 2007 - Dec 2025.Please address any comments on this page to theo@theotodman.com.File output:
Website Maintenance Dashboard
Return to Top of this PageReturn to Theo Todman's Philosophy PageReturn to Theo Todman's Home Page