Theo Todman's Web Page - Notes Pages
Website Documentation
Website Generator Documentation - Spider & System Backups
(Work In Progress: output at 11/09/2023 19:28:42)
This document covers the following functions performed by clicking buttons on the front screen:-
- Backups (cmdBackup_Click)
- Run Web Spider (cmdSpider_Click)
To see the Code, click on the procedure names above.
Backup
- There are four options:-
- Back-up the System: This is a complex procedure that backs up my C: drive to a 2Tb flash drive. It also has procedures for restores that need reviewing. There are options to back up only the changes since the last backup, to back up the website as well, and to avoid certain files, using the following parameter tables:-
→ Backup_Control
→ Backup_Directory_Structure
→ Backup_Site_Map
All this is fully documented below. The control-sub is Backup_Ctrl.
- De-duplicate the Backup Disk: This is a complex procedure that became necessary when my 2Tb backup disk filled up. It indexes the disk and determines which files are duplicates (based on name, size and last-edited date) and deletes all but the earliest and latest, logging what it has done. The procedure is fully explained below. The control-sub is Backup_Prune_Ctrl.
- Search the Backup_Site_Map table: uses Backup_Site_Map_Search, which searches for the requested string in file names.
- Search the Full_Backup_Site_Map table: uses Full_Backup_Site_Map_Search, which likewise searches for the requested string in file names (a sketch of the sort of query involved is given below).
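- Both searches amount to a wildcard match on the file-name column. The fragment below is a minimal sketch of the sort of query involved, not the actual Backup_Site_Map_Search code: the column name File_Name is an assumption, and the real sub doubtless presents its results rather than just listing them in the Immediate window.

    ' Hypothetical sketch only: assumes Backup_Site_Map has a text column File_Name.
    Public Sub Backup_Site_Map_Search_Sketch(strSearch As String)
        Dim db As DAO.Database
        Dim rst As DAO.Recordset
        Dim strSQL As String

        Set db = CurrentDb
        ' Wildcard match on the file name; Replace() guards against embedded quotes
        strSQL = "SELECT File_Name FROM Backup_Site_Map " & _
                 "WHERE File_Name LIKE '*" & Replace(strSearch, "'", "''") & "*';"
        Set rst = db.OpenRecordset(strSQL, dbOpenSnapshot)

        Do While Not rst.EOF
            Debug.Print rst!File_Name       ' List the hits in the Immediate window
            rst.MoveNext
        Loop

        rst.Close
        Set db = Nothing
    End Sub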
- Back-up the System:
- Control-sub: Backup_Ctrl
- To be supplied in due course …
- De-duplicate the Backup Disk:
- Control-sub: Backup_Prune_Ctrl
- Other Subs used:-
→ Backup_Prune_Scurry
→ Compact_Repair
→ Flag_For_Deletion
→ Zap_Duplicate_Files
- Tables used:-
→ Full_Backup_Site_Map_Temp
→ Full_Backup_Directory_Structure_Temp
→ Full_Backup_Site_Map
→ Full_Backup_Directory_Structure
- Queries used:-
→ Full_Backup_Site_Map_Temp_Delete_Control
→ Full_Backup_Site_Map_Temp_Delete_Failed
→ Full_Backup_Directory_Structure_Add
→ Full_Backup_Site_Map_Add
→ Full_Backup_Site_Map_Dups_Temp_Gen
→ Full_Backup_Site_Map_Temp_Delete_Flag
- Summary:-
- The aim is to scan the backup drive and maintain its directory structure and contents. As noted above, the intention is to retain only the first and last copies of identical files.
- While the backed-up files themselves are on the backup drive, the full record of what has been backed up lies in the “Backups_Prune” database; the record of the latest backup, together with the record of the backup runs (and the parameters used) – in table Backup_History – is in the “Backups” database.
- The first job is to maintain the directory structure of the backup drive as held in the “Backups_Prune” database. To save space, each directory on the Backup disk is given a unique long-integer ID, incremented as directories are added. These IDs are used in the tables that record the ‘Site Maps’. Note that while the contents of the directories on the Backup disk may be deleted piecemeal by the pruning process, the directories themselves are (currently) retained. The very first job is to determine the next Directory Id to allocate. New directories are loaded into a temporary table before being merged with the full directory structure.
- Once the directories have been brought up to date, their contents are likewise brought up to date, again using a temporary table before merging with the full site map.
- The next step is to flag for deletion all copies of each item on the full site map other than the first and the last. Note that, as this process has been run before, items already deleted are ignored. (A sketch of this flagging logic is given at the end of this summary.)
- Finally, those items flagged for deletion, but not already deleted, are deleted.
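- As a concrete illustration of the flagging step, the sketch below walks the site map in name / size / last-edited-date / backup-date order and flags every copy of a file other than the first and the last. It is only an illustration of the logic: the column names (File_Name, File_Size, File_Date, Backup_Date, Flag_For_Deletion, Deleted) are my assumptions about the Full_Backup_Site_Map layout, and the real process works through the Full_Backup_Site_Map_Dups_Temp_Gen and Full_Backup_Site_Map_Temp_Delete_Flag queries listed above.

    ' Sketch only: column names are assumptions, not the real Full_Backup_Site_Map layout.
    Public Sub Flag_For_Deletion_Sketch()
        Dim db As DAO.Database
        Dim rst As DAO.Recordset
        Dim strKey As String, strPrevKey As String
        Dim varPrev As Variant, varHere As Variant
        Dim blnPrevIsFirst As Boolean

        Set db = CurrentDb
        ' Order so that copies of the "same" file (name + size + last-edited date) sit
        ' together, oldest backup first, newest backup last
        Set rst = db.OpenRecordset( _
            "SELECT File_Name, File_Size, File_Date, Backup_Date, Flag_For_Deletion " & _
            "FROM Full_Backup_Site_Map WHERE Deleted = False " & _
            "ORDER BY File_Name, File_Size, File_Date, Backup_Date;", dbOpenDynaset)

        strPrevKey = ""
        Do While Not rst.EOF
            strKey = rst!File_Name & "|" & rst!File_Size & "|" & rst!File_Date
            If strKey = strPrevKey And Not blnPrevIsFirst Then
                ' The previous row is neither the first copy nor (now) the last: flag it
                varHere = rst.Bookmark
                rst.Bookmark = varPrev
                rst.Edit
                rst!Flag_For_Deletion = True
                rst.Update
                rst.Bookmark = varHere
            End If
            blnPrevIsFirst = (strKey <> strPrevKey)     ' Did the current row start a new group?
            varPrev = rst.Bookmark
            strPrevKey = strKey
            rst.MoveNext
        Loop

        rst.Close
        Set db = Nothing
    End Sub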
- Backup Directory Structure Maintenance
- Backup Site Map Maintenance
- Flagging Site Map Items for Deletion
- Deleting Site Map Items Flagged for Deletion
Spider
- This process (governed by Spider_Ctrl, and then Spider_Scurry) interrogates the files on my hard drive that form the local copy of my website and, after determining the directory structure, digs out all the hyperlinks by recursively reading the files as text using the VBA FileSystemObject (a minimal sketch of this kind of traversal is given after these introductory bullets).
- It is possible to buy, or download free, web-spiders that will check hyperlink integrity across the internet when pointed at a base-URL, but (in my experience) these take forever to run on large sites (if only because the timeout limit has to be set high enough to avoid false negatives). Hence, I wrote my own Spider to run quickly on a local drive.
- While the Spider will record external links, it does not check them. However, I have a completely separate sub-system that performs this function: see Web Links1.
- Despite this being a local checker with no timeout delays, it still takes a long time to run – currently (September 2023, on my new desktop) it takes about 3 hours2 – so there are various parameters that can be set to control what’s done on a particular run.
- Since the slave database3 nearly breaks the 2Gb limit during the full Spider run – despite repeated compacts & repairs – the location of the various files has been under review. I’ve also had a project on the go that removes the common element (C:\Theo's Files\Birkbeck) from all links recorded in the tables. While the process runs to completion, it currently doesn’t work correctly and says there are 500k broken links! There aren’t; the bug is under investigation!
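- For orientation, the fragment below shows the general shape of this kind of recursive scan using the FileSystemObject: read each page in a directory as text, pull out the href targets, then recurse into the sub-directories. It is a minimal sketch of the approach, not the actual Spider_Scurry code: the .htm filter, the regular expression and the idea of just printing the links are all simplifying assumptions (the real Spider writes to the local analogue of Raw_Links and honours the control flags described below).

    ' Minimal sketch of a recursive link-harvest with the FileSystemObject.
    ' Not the actual Spider_Scurry code: the .htm filter and the regular
    ' expression are simplifying assumptions.
    Public Sub Spider_Sketch(strFolder As String)
        Dim fso As Object, fld As Object, fil As Object, fldSub As Object
        Dim rex As Object, mtch As Object
        Dim strText As String

        Set fso = CreateObject("Scripting.FileSystemObject")
        Set rex = CreateObject("VBScript.RegExp")
        rex.Global = True
        rex.IgnoreCase = True
        rex.Pattern = "href\s*=\s*""([^""]+)"""         ' Crude: double-quoted hrefs only

        Set fld = fso.GetFolder(strFolder)

        For Each fil In fld.Files
            If LCase$(fso.GetExtensionName(fil.Path)) = "htm" And fil.Size > 0 Then
                strText = fso.OpenTextFile(fil.Path, 1).ReadAll     ' 1 = ForReading
                For Each mtch In rex.Execute(strText)
                    ' The real Spider records these links in a table rather than printing them
                    Debug.Print fil.Path & " -> " & mtch.SubMatches(0)
                Next mtch
            End If
        Next fil

        For Each fldSub In fld.SubFolders
            Spider_Sketch fldSub.Path                   ' Recurse into sub-directories
        Next fldSub
    End Sub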
- The slave database is now only used to hold the following tables:-
- The main database did break the 2Gb limit during the (first, abortive) September 2021 run, so I removed the following tables to a new ‘Spider’ database (Spider.accdb):-
- Otherwise, the following three tables are parked in the Backups database:-
- I’d thought of using the ‘.Run’ technology exemplified in Sub AppAccess_Test, but – while I proved the technology OK – it doesn’t seem necessary at the moment. That explains the relocation of some of the smaller tables in Spider.
- Stepping back a bit, there are three main factors involved in the decisions to hold various tables in the same or different databases:
- Database size – especially bloating during the Spider run.
- Efficiency of queries: cross-database joins used to be very inefficient.
- Ability to compact & repair slave databases (but not the master database) mid-run.
- These days – with solid state disks and lots of memory – I think the second factor is less important as the tables are effectively all in memory.
- Consequently, some of the architecture may no longer be strictly required.
- The three main control tables – described below – are opened by cmdSpider_Click and can be used to control / limit the run (a sketch of how their flags might be consulted in code is given after their descriptions):-
- Spider_Control: This is a single-row table that contains:-
- Statistics from the last run.
- A flag “Update_Since_Last_Run” which, if set to “Yes” limits the run to interrogating pages that have a last-changed timestamp subsequent to the last Spider run. This is less useful than it might be, in that I currently run the Spider after I've regenerated the entire site (using Full Website Re-Gen4), so that in practice almost all5 pages are scanned.
- Another flag “Stop_Spider” which is checked at the start of the recursive Spider_Scurry and which will stop the run if set to “Yes”. The idea is that the control table should be open in another copy of MS Access so that if you want to stop the run without crashing the program, you can do so by setting this flag6.
- Directory_Structure: This table contains one row for each sub-directory (including the root) within the website. It is maintained by the Spider itself as far as the number of rows is concerned. However, the user can set a couple of flags against each row:-
- “Do_Not_Parse”: this can be set to “Yes” to ignore this directory (and any sub-directories). Any corresponding rows in Site_Map are left unchanged.
- “Updates_Only”: this can be set to over-ride the “Update_Since_Last_Run” flag in Spider_Control for this directory (and any sub-directories).
- Site_Map:
- This table contains one row for each object (HTML page, PDF, GIF, Word document, and the like) in my website.
- It is maintained by the Spider as far as the number of rows is concerned. The Spider records, amongst other things, how long it took for the object to be parsed (“Time_To_Update”, in milliseconds).
- The user can set a flag “Block_Update” that will cause the object not to be parsed in this coming run. This is because some pages are very large and take a very long time to parse.
- Blocked_Spider_Files
- This is a query that selects only those rows of Site_Map that have Block_Update = “Yes”, though it is possible to add others from Site_Map by toggling to the table, setting the flag, and bouncing the query.
- This is a “fine-tuning” way of restricting the operation of the Spider.
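- To make the interplay of these flags concrete, the fragment below shows one way the checks might look in code, using DLookup against Spider_Control and Directory_Structure. It is a sketch only: the flags are shown as “Yes”/“No” text values as described above (they may well be Yes/No fields in practice), and the Directory_Structure key field (ID) is an assumption.

    ' Sketch only: illustrates how the control flags described above combine.
    ' The field names follow the descriptions above; the ID key is an assumption.
    Public Function Spider_Skip_Directory(lngDirID As Long) As Boolean
        Dim blnStop As Boolean, blnDoNotParse As Boolean

        ' "Stop_Spider" is checked at the start of every Spider_Scurry call
        blnStop = (Nz(DLookup("Stop_Spider", "Spider_Control"), "No") = "Yes")

        ' "Do_Not_Parse" skips this directory (and, via the recursion, its sub-directories)
        blnDoNotParse = (Nz(DLookup("Do_Not_Parse", "Directory_Structure", _
                            "ID = " & lngDirID), "No") = "Yes")

        Spider_Skip_Directory = blnStop Or blnDoNotParse
    End Function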
- During the Spider process, the three main tables are updated via their local analogues, which are maintained during the bulk of the processing, until the Sub Spider_Copy is called at the end. There is a bit of fancy footwork to ensure the various options neither leave redundant links in place nor delete live ones.
- Spider_Copy also performs the processing to create the “full links” column in the Raw_Links table.
- Spider_Copy also segregates out7 the Section-link within a page, and then runs8 the Spider_Missing_Internal_Links query to pick up broken links within the site.
- As noted above, the Spider takes a long time to run. In addition, the activity on the Slave Database that contains the Raw_Links table expands the table from its “resting” size of under 600Mb to over the 2Gb limit9 beyond which the database becomes corrupt. Consequently, I have added two diagnostic and repair functions during the processing:-
- I use debug.print to timestamp the various sub-processes within the run.
- I call the function Compact_Repair periodically to Compact & Repair the Slave Database. If this fails, as it sometimes seems to, the process as a whole can fail. (A sketch of the mechanism is given after the findings below.)
- These diagnostics10 reveal that:-
- Spider_Scurry is the longest-running process, as might be expected; it takes 4.4 hours based on the database / website at its current size (1st September 2021).
- The process Full_Link_Up_Levels_Gen, called by Spider_Copy, used to take about 3 hours, but now takes about 8 minutes, most of which is taken up with compacts & repairs.
- No other process takes more than a few minutes.
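- The timestamping and mid-run compaction amount to something like the fragment below: Debug.Print with Now() brackets each sub-process, and Access's built-in Application.CompactRepair is used to shrink the bloated slave (which must not be open in the current Access session). The paths are placeholders and the real Compact_Repair function will differ in detail; this is just a sketch of the mechanism.

    ' Sketch of the diagnostic timestamps and the periodic compact & repair.
    ' Paths are placeholders; the real Compact_Repair sub will differ in detail.
    Public Sub Compact_Slave_Sketch(strSlavePath As String)
        Dim strTempPath As String

        Debug.Print "Compact & repair started:  " & Now()

        ' CompactRepair cannot work in place: compact to a temporary file, then swap it back
        strTempPath = strSlavePath & ".compacted"
        If Dir$(strTempPath) <> "" Then Kill strTempPath

        If Application.CompactRepair(strSlavePath, strTempPath, True) Then
            Kill strSlavePath
            Name strTempPath As strSlavePath
        Else
            Debug.Print "Compact & repair FAILED - slave left untouched"
        End If

        Debug.Print "Compact & repair finished: " & Now()
    End Sub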
- How the Link-Checking works: The above is all very well, but how does the actual checking work? Well:-
- Description of the detailed processing now follows:-
- Spider_Ctrl: as might be expected, this controls the other processing. It asks questions about the run required so that parameters can be set. It then calls the two main processes below, and finally calls the three processes that output the Web-Links Test Webpages.
- Spider_Scurry: this is a self-referential procedure that combs the links recursively until a “leaf” page is found with no further links. Presumably there’s a check for loops!
- Spider_Copy: this process:-
- Updates the Directory_Structure, Directory_Fine_Structure and Site_Map tables. The first two tables are in the Backups database, the third is in the main database.
- Updates the Raw_Links table in the Slave database, using the following procedures:-
- Full_Link_Same_Directory_Gen: this process simply sets the full link in the simple case where the link is to the same directory. Because so many links are involved (circa 0.5m11 of this category at the moment), this cannot be an update query, but is run through in code, with the slave database being compacted and repaired every 200,000 records.
- Full_Link_Up_Levels_Gen: this process is similar to the above12, but caters for the case where links are not to the same directory. In this case, the raw link will have a variable number of “../” directory-shifters to get back to the Site root-directory, followed by the address from there (a sketch of this resolution is given after this list). For some reason there’s a query of the Directory_Fine_Structure table.
- Full_Link_Sections_Fix
- … to be completed …
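- The “../” handling in Full_Link_Up_Levels_Gen comes down to resolving a raw relative link against the directory of the page it was found in. The function below is a sketch of that resolution only, assuming the page's directory and the raw link are already to hand as strings; the real process works through the Raw_Links table in bulk, with the periodic compacts & repairs noted above.

    ' Sketch of resolving a raw relative link (with optional "../" directory-shifters)
    ' against the directory of the page that contains it. Illustration only.
    Public Function Resolve_Raw_Link(strPageDir As String, strRawLink As String) As String
        Dim astrDirs() As String
        Dim strLink As String
        Dim lngTop As Long

        strLink = strRawLink
        astrDirs = Split(strPageDir, "\")   ' e.g. "C:\Site\Papers\Sub" -> C:, Site, Papers, Sub
        lngTop = UBound(astrDirs)

        ' Each leading "../" discards one directory level
        Do While Left$(strLink, 3) = "../"
            strLink = Mid$(strLink, 4)
            lngTop = lngTop - 1
        Loop

        ReDim Preserve astrDirs(lngTop)
        Resolve_Raw_Link = Join(astrDirs, "\") & "\" & Replace(strLink, "/", "\")
    End Function

    ' e.g. Resolve_Raw_Link("C:\Site\Papers\Sub", "../../Notes/Note1.htm") = "C:\Site\Notes\Note1.htm"
    '      Resolve_Raw_Link("C:\Site\Papers\Sub", "Page2.htm")             = "C:\Site\Papers\Sub\Page2.htm"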
This Note is awaiting further attention13.
In-Page Footnotes:
Footnotes 2, 10, 11:
- I need to write a Functor to import these and other stats associated with this process into this document.
Footnote 3:
- As its name (Web_Generator_Performance.accdb) indicates, it used to hold performance statistics, but the code and data have now been removed from the system as being of very little value and as themselves slowing down the system.
Footnote 5:
- The exceptions are those pages that - for one reason or another - aren't regenerated.
- See Prune Website for a process that "prunes" pages that ought to be regenerated, but aren't, because they have become redundant.
- This is an important process, but there are two issues:-
- It only prunes items from the local copy of my website. I’ll need to tidy the live site manually, and therefore need the information to do so.
- This is a dangerous activity, and could easily prune items that need to be retained. It needs documenting in the Note above.
Footnote 6:
- It ought to be possible to stop the program by Ctrl-break, but my experience is that with computationally-intensive programs it’s not possible to get a look in, and in any case you then have to step through the code until a convenient break-point for termination is reached.
- To set this flag, you need to open another copy of MS Access and then open the Spider database, and then open and set the flag in the Spider_Control table.
- Double-clicking .mdb files doesn't work any better than Ctrl-break! You have to go via Windows Start to open MS Access, and then choose the file.
- The last two times I tried using this function, it didn’t work!
- The first time was right at the end, when I thought the process was looping, and it had no effect. Processing ended normally.
- The second time was at the beginning – before it had completed the root directory – and the process then seemed to carry on to complete the processing without the iterative use of Spider_Scurry (I wasn’t watching the process and didn’t respond to any request to continue). This resulted in the Spider database reaching the 2Gb limit. Processing aborted as the database was corrupt (though it compacted and repaired down to 5Mb – the directory structures had been emptied).
- It needs investigation.
Footnote 7:
- What does this mean, and why is it important?
Footnote 8:
- Or used to ... this is currently commented out!
Footnote 9:
- This 2Gb limit stems from MS Access originally being a 32-bit application, which only allows 4Gb to be addressed.
- But it seems that MS used the high-order bit to indicate whether the memory was system-only or not, thus halving the available addressability.
- See Quora: MS Access 2Gb Limit.
Footnote 12:
- Though it currently has a compact/repair every 200,000 records.
Text Colour Conventions
- Blue: Text by me; © Theo Todman, 2023