Theo Todman's Web Page - Notes Pages


Website Documentation

Website Generator Documentation - Spider & System Backups

(Work In Progress: output at 11/09/2023 19:28:42)



This document covers the following functions performed by clicking buttons on the front screen:-

  1. Backups (cmdBackup_Click)
  2. Run Web Spider (cmdSpider_Click)

To see the Code, click on the procedure names above.

Backup
  1. There are four options:-
    1. Back-up the System: This is a complex procedure that backs up my C: drive to a 2Tb flash drive. It also has procedures for restores that need reviewing. There are options to back up only the changes since the last backup, to back up the website as well, and to avoid certain files, using the following parameter tables:-
      Backup_Control
      Backup_Directory_Structure
      Backup_Site_Map
      All this is fully documented below. The control-sub is Backup_Ctrl.
    2. De-duplicate the Backup Disk: This is a complex procedure that became necessary when my 2Tb backup disk filled up. It indexes the disk, determines which files are duplicates (based on name, size and last-edited date) and deletes all but the earliest and latest copies, logging what it has done. The procedure is fully explained below. The control-sub is Backup_Prune_Ctrl.
    3. Search the Backup_Site_Map table: uses Backup_Site_Map_Search, which searches for the requested string in file names.
    4. Search the Full_Backup_Site_Map table: uses Full_Backup_Site_Map_Search, which searches for the requested string in file names.
  2. Back-up the System:
    1. Control-sub: Backup_Ctrl
    2. To be supplied in due course …
  3. De-duplicate the Backup Disk:
    1. Control-sub: Backup_Prune_Ctrl
    2. Other Subs used:-
      Backup_Prune_Scurry
      Compact_Repair
      Flag_For_Deletion
      Zap_Duplicate_Files
    3. Tables used:-
      Full_Backup_Site_Map_Temp
      Full_Backup_Directory_Structure_Temp
      Full_Backup_Site_Map
      Full_Backup_Directory_Structure
    4. Queries used:-
      Full_Backup_Site_Map_Temp_Delete_Control
      Full_Backup_Site_Map_Temp_Delete_Failed
      Full_Backup_Directory_Structure_Add
      Full_Backup_Site_Map_Add
      Full_Backup_Site_Map_Dups_Temp_Gen
      Full_Backup_Site_Map_Temp_Delete_Flag
    5. Summary:-
      • The aim is to scan the backup drive and maintain a record of its directory structure and contents. As noted above, the intention is to retain only the first and last copies of identical files (a hypothetical sketch of this rule is given after this list).
      • While the backed-up files themselves are on the backup drive, the full record of what has been backed up lies in the “Backups_Prune” database; the record of the latest backup, together with the record of the backup runs and the parameters used (in table Backup_History), is in the “Backups” database.
      • The first job is to maintain the directory structure of the backup drive as held in the “Backups_Prune” database. To save space, each directory of the Backup disk is given a unique long-integer ID, incremented as directories are added. These IDs are used in the tables that record the ‘Site Maps’. Note that while the contents of the directories on the Backup disk may be deleted piecemeal by the pruning process, the directories themselves are (currently) retained. The very first step is to determine the next Directory ID to allocate. New directories are loaded to a temporary table before being merged with the full directory structure.
      • Once the directories have been brought up to date, their contents are likewise brought up to date, again using a temporary table before merging with the full site map.
      • The next step is to flag for deletion those items on the full site map other than the first and last copy of each file. Note that, as this process has been run before, items already deleted are ignored.
      • Finally, those items flagged for deletion, but not already deleted, are deleted.
    6. Backup Directory Structure Maintenance
    7. Backup Site Map Maintenance
    8. Flagging Site Map Items for Deletion
    9. Deleting Site Map Items Flagged for Deletion
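
As a concrete illustration of the ‘keep only the first and last copy’ rule, here is a minimal hypothetical sketch in Access VBA. It is not the real Backup_Prune_Ctrl / Flag_For_Deletion code: only the Full_Backup_Site_Map table name comes from the list above, while the field names (ID, File_Name, File_Size, File_Date, Backup_Date, Deleted, Delete_Flag) are invented for illustration. The idea is simply to sort the site map so that identical copies (same name, size and last-edited date) sit together, earliest backup first, and to flag everything between the first and the last copy of each group.

    ' Hypothetical sketch only: the field names are invented for
    ' illustration and are not the real Backups_Prune schema.
    Sub Flag_Duplicate_Copies_Sketch()
        Dim db As DAO.Database
        Dim rs As DAO.Recordset
        Dim sKey As String, sPrevKey As String
        Dim lPrevID As Long

        Set db = CurrentDb
        ' Sort so that identical copies (same name, size and last-edited
        ' date) are adjacent, earliest backup first; ignore copies that
        ' earlier runs have already deleted.
        Set rs = db.OpenRecordset( _
            "SELECT ID, File_Name, File_Size, File_Date, Backup_Date " & _
            "FROM Full_Backup_Site_Map WHERE Deleted = False " & _
            "ORDER BY File_Name, File_Size, File_Date, Backup_Date")

        sPrevKey = ""
        lPrevID = 0
        Do Until rs.EOF
            sKey = rs!File_Name & "|" & rs!File_Size & "|" & rs!File_Date
            If sKey <> sPrevKey Then
                ' First copy of a new group: always kept.
                sPrevKey = sKey
                lPrevID = 0
            Else
                ' A later copy exists, so the previous candidate "last copy"
                ' (if it wasn't also the first) is now an intermediate copy
                ' and gets flagged for deletion.
                If lPrevID <> 0 Then
                    db.Execute "UPDATE Full_Backup_Site_Map " & _
                        "SET Delete_Flag = True WHERE ID = " & lPrevID, dbFailOnError
                End If
                lPrevID = rs!ID   ' current row is the new candidate last copy
            End If
            rs.MoveNext
        Loop
        rs.Close
    End Sub

The real process, as described above, works through the temporary tables and queries listed earlier (with the actual deletions presumably done by Zap_Duplicate_Files), and logs what it has done; the sketch only illustrates the grouping rule.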



Spider
  1. This process (governed by Spider_Ctrl, and then Spider_Scurry) interrogates the files on my hard drive that form the local copy of my website and, after determining the directory structure, digs out all the hyperlinks by recursively working through the files as text, using the VBA FileSystemObject (a much-simplified sketch is given at the end of this section).
  2. It is possible to buy, or download free, web-spiders that will check hyperlink integrity across the internet when pointed at a base-URL, but (in my experience) these take forever to run on large sites (if only because the timeout limit has to be set such as to avoid false negatives). Hence, I wrote my own Spider to run quickly on a local drive.
  3. While the Spider will record external links, it does not check them. However, I have a completely separate sub-system that performs this function: see Web Links.
  4. Despite this being a local checker with no timeout delays, it still takes a long time to run: currently – September 2023, on my new desktop – it takes about 3 hours, so there are various parameters that can be set to control what’s done on a particular run.
  5. Since the slave database nearly breaks the 2Gb limit during the full Spider run – despite repeated compacts & repairs – the location of the various files has been under review. I’ve also had a project on the go that removes the common element (C:\Theo's Files\Birkbeck) from all links recorded in the tables. While the process runs to completion, it currently doesn’t work correctly and says there are 500k broken links! There aren’t; the bug is under investigation!
  6. The slave database is now only used to hold the following tables:-
  7. The main database did break the 2Gb limit during the (first, abortive) September 2021 run, so I removed the following tables to a new ‘Spider’ database (Spider.accdb):-
  8. Otherwise, the following three tables are parked in the Backups database:-
  9. I’d thought of using the ‘.Run’ technology exemplified in Sub AppAccess_Test, but – while I proved the technology OK – it doesn’t seem necessary at the moment. That explains the relocation of some of the smaller tables in Spider.
  10. Stepping back a bit, there are three main factors involved in the decisions to hold various tables in the same or different databases:
    1. Database size – especially bloating during the Spider run.
    2. Efficiency of queries: cross-database joins used to be very inefficient.
    3. Ability to compact & repair slave databases (but not the master database) mid-run.
  11. These days – with solid state disks and lots of memory – I think the second factor is less important as the tables are effectively all in memory.
  12. Consequently, some of the architecture may no longer be strictly required.
  13. The three main control tables – Spider_Control, Directory_Structure and Site_Map, described below – are opened by cmdSpider_Click. These can be used to control / limit the run.
  14. Spider_Control: This is a single-row table that contains:-
    • Statistics from the last run.
    • A flag “Update_Since_Last_Run” which, if set to “Yes”, limits the run to interrogating pages that have a last-changed timestamp subsequent to the last Spider run. This is less useful than it might be, in that I currently run the Spider after I've regenerated the entire site (using Full Website Re-Gen), so that in practice almost all pages are scanned.
    • Another flag “Stop_Spider”, which is checked at the start of the recursive Spider_Scurry and which will stop the run if set to “Yes”. The idea is that the control table should be open in another copy of MS Access so that, if you want to stop the run without crashing the program, you can do so by setting this flag.
  15. Directory_Structure: This table contains one row for each sub-directory (including the root) within the website. It is maintained by the Spider itself as far as the number of rows is concerned. However, the user can set a couple of flags against each row:-
    • “Do_Not_Parse”: this can be set to “Yes” to ignore this directory (and any sub-directories). Any corresponding rows in Site_Map are left unchanged.
    • “Updates_Only”: this can be set to over-ride the “Update_Since_Last_Run” flag in Spider_Control for this directory (and any sub-directories).
  16. Site_Map:
    • This table contains one row for each object (HTML page, PDF, GIF, Word document, and the like) in my website.
    • It is maintained by the Spider as far as the number of rows is concerned. The Spider records, amongst other things, how long it took for the object to be parsed (“Time_To_Update”, in milliseconds).
    • The user can set a flag “Block_Update” that will cause the object not to be parsed in the coming run. This is useful because some pages are very large and take a very long time to parse.
  17. Blocked_Spider_Files
    • This is a query that selects only those rows of Site_Map that have Block_Update = “Yes”, though it is possible to add others from Site_Map by toggling to the table, setting the flag, and bouncing the query.
    • This is a “fine-tuning” way of restricting the operation of the Spider.
  18. During the Spider process, the three main tables are updated via their local analogues, which are maintained during the bulk of the processing, until the Sub Spider_Copy is called at the end. There is a bit of fancy footwork to ensure the various options don’t either leave redundant links in place, or delete live ones.
  19. Spider_Copy also performs the processing to create the “full links” column in the Raw_Links table.
  20. Spider_Copy also segregates out the Section-link within a page, and then runs the Spider_Missing_Internal_Links query to pick up broken links within the site.
  21. As noted above, the Spider takes a long time to run. In addition, the activity on the Slave Database that contains the Raw_Links table expands the table from its “resting” size of under 600Mb to over the 2Gb limit beyond which the database becomes corrupt. Consequently, I have added two diagnostic and repair functions during the processing:-
    1. I use debug.print to timestamp the various sub-processes within the run.
    2. I call the function Compact_Repair periodically to Compact & Repair the Slave Database. If this fails, as it sometimes seems to, the process as a whole can fail.
  22. These diagnostics reveal that:-
    1. Spider_Scurry is the longest-running process, as might be expected; it takes 4.4 hours based on the database / website at its current size (1st September 2021).
    2. The process Full_Link_Up_Levels_Gen, called by Spider_Copy, used to take about 3 hours, but now takes about 8 minutes, most of which is taken up with compacts & repairs.
    3. No other process takes more than a few minutes.
  23. How the Link-Checking works: The above is all very well, but how does the actual checking work? Well:-
  24. Description of the detailed processing now follows:-
    • Spider_Ctrl: as might be expected, this controls the other processing. It asks questions about the run required so that parameters can be set. It then calls the two main processes below, and finally calls the three processes that output the Web-Links Test Webpages.
    • Spider_Scurry: this is a self-referential (i.e. recursive) procedure that combs the links until a “leaf” page with no further links is found. Presumably there’s a check for loops!
    • Spider_Copy: this process:-
      • Updates the Directory_Structure, Directory_Fine_Structure and Site_Map tables. The first two tables are in the Backups database, the third is in the main database.
      • Updates the Raw_Links table in the Slave database, using the following procedures:-
        1. Full_Link_Same_Directory_Gen: this process sets the full link in the simple case where the link is to the same directory. Because so many links are involved (circa 0.5m of this category at the moment), this cannot be an update query, but is run through in code, with the slave database being compacted and repaired every 200,000 records.
        2. Full_Link_Up_Levels_Gen: this process is similar to the above, but caters for the case where links are not to the same directory. In this case, the raw link will have a variable number of “../” directory-shifters, to get back to the Site root-directory, followed by the address from there (cf. the sketch at the end of this section). For some reason there’s a query of the Directory_Fine_Structure table.
        3. Full_Link_Sections_Fix
  25. … to be completed …
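
To make the above a little more concrete, here is a minimal hypothetical sketch in Access VBA of the recursive scan and of the resolution of “../” links. It is not the real Spider_Scurry / Full_Link_Up_Levels_Gen code: it assumes the recursion runs over the directory tree, looks only for href="..." attributes, and simply prints what it finds rather than writing to Raw_Links or maintaining Site_Map; the procedure names ending in _Sketch are invented.

    ' Hypothetical sketch only: a much-simplified stand-in for
    ' Spider_Scurry / Full_Link_Up_Levels_Gen, assuming the recursion is
    ' over the directory tree and that links sit in href="..." attributes.
    Sub Spider_Scurry_Sketch(ByVal sFolder As String)
        Dim oFSO As Object, oFolder As Object, oFile As Object, oSub As Object
        Dim oTS As Object
        Dim sText As String, sLink As String
        Dim lPos As Long, lEnd As Long

        Set oFSO = CreateObject("Scripting.FileSystemObject")
        Set oFolder = oFSO.GetFolder(sFolder)

        ' Read each HTML file in this directory as plain text and dig out
        ' its hyperlinks.
        For Each oFile In oFolder.Files
            If LCase(oFSO.GetExtensionName(oFile.Name)) Like "htm*" Then
                Set oTS = oFSO.OpenTextFile(oFile.Path, 1)
                sText = oTS.ReadAll
                oTS.Close
                lPos = InStr(1, sText, "href=""", vbTextCompare)
                Do While lPos > 0
                    lEnd = InStr(lPos + 6, sText, """")
                    If lEnd = 0 Then Exit Do
                    sLink = Mid$(sText, lPos + 6, lEnd - lPos - 6)
                    ' Section-links ("#...") and external links would be
                    ' segregated / merely recorded; internal links are
                    ' resolved to a full local path.
                    If Left$(sLink, 1) <> "#" And InStr(sLink, "://") = 0 Then
                        Debug.Print oFile.Path & " -> " & _
                            Resolve_Link_Sketch(oFolder.Path, sLink)
                    End If
                    lPos = InStr(lEnd + 1, sText, "href=""", vbTextCompare)
                Loop
            End If
        Next oFile

        ' Recurse into sub-directories; the recursion bottoms out when a
        ' directory has no sub-directories left.
        For Each oSub In oFolder.SubFolders
            Spider_Scurry_Sketch oSub.Path
        Next oSub
    End Sub

    ' Resolve a raw relative link, with its variable number of "../"
    ' directory-shifters, into a full local path (cf. the description of
    ' Full_Link_Up_Levels_Gen above).
    Function Resolve_Link_Sketch(ByVal sDir As String, ByVal sLink As String) As String
        Do While Left$(sLink, 3) = "../" And InStrRev(sDir, "\") > 0
            sLink = Mid$(sLink, 4)                        ' drop one "../"
            sDir = Left$(sDir, InStrRev(sDir, "\") - 1)   ' go up one directory
        Loop
        Resolve_Link_Sketch = sDir & "\" & Replace(sLink, "/", "\")
    End Function

Calling Spider_Scurry_Sketch on the local site root (the common element mentioned above, C:\Theo's Files\Birkbeck) would print one line per internal link; the real Spider instead records the links (and, via Spider_Copy, their full versions) in Raw_Links, and maintains the statistics in Spider_Control and Site_Map.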

This Note is awaiting further attention.





Table of the Previous 7 Versions of this Note:

Date Length Title
06/07/2023 00:43:12 14522 Website Generator Documentation - Spider & System Backups
28/09/2022 10:24:58 10894 Website Generator Documentation - Spider & Slave Database
01/10/2021 13:17:46 10895 Website Generator Documentation - Spider & Slave Database
04/10/2020 00:27:22 9370 Website Generator Documentation - Spider & Slave Database
05/04/2019 10:36:29 9377 Website Generator Documentation - Spider & Slave Database
13/01/2015 19:07:41 5310 Website Generator Documentation - Spider & Slave Database
02/06/2013 10:58:05 347 Website Generator Documentation - Spider & Slave Database



Note last updated: 11/09/2023 19:28:44
Reference for this Topic: 986 (Website Generator Documentation - Spider & System Backups)
Parent Topic: Website Generator Documentation - Control Page


Summary of Notes Referenced by This Note

Awaiting Attention (Documentation)
Website Generator Documentation - Full Website Re-Gen
Website Generator Documentation - Prune Website
Website Generator Documentation - Web Links

To access information, click on one of the links in the table above.




Summary of Notes Citing This Note

Status: Priority Task List (2023 - September)
Status: Summary (2023 - June), 2, 3, 4
Status: Web-Tools (2023 - June)
Website - Outstanding Developments (2023 - September), 2
Website Generator Documentation - Control Page, 2
Website Generator Documentation - Functors, 2
Website Generator Documentation - Web Links, 2

To access information, click on one of the links in the table above.




Text Colour Conventions

  1. Blue: Text by me; © Theo Todman, 2023




© Theo Todman, June 2007 - Sept 2023. Please address any comments on this page to theo@theotodman.com.