Theo Todman's Web Page - Notes Pages
Website Generator Documentation - Spider & Slave Database
(Text as at 28/09/2022 10:24:58)
This document covers the following functions performed by clicking buttons on the front screen:-
- Backups (cmdBackup_Click)
- Run Web Spider (cmdSpider_Click)
To see the Code, click on the procedure names above.
Backup
- This compacts & repairs, and then backs up the slave database to the name Web_Generator_Performance_Temp_YYMMDD, using function Compact_Repair.
- I need to add a compact, repair and back-up function for the Spider database.
- Note that there is an entirely separate subsystem (admittedly based on this one) to back-up the file-system on this (or any other) PC – see Backup_Ctrl, etc.
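The naming convention for the backup copy can be sketched as follows. This is Python rather than the Generator's own VBA, and backup_name is a hypothetical helper, not the Compact_Repair function itself; it only illustrates how a YYMMDD suffix might be formed.

```python
from datetime import date

def backup_name(day: date, stem: str = "Web_Generator_Performance_Temp") -> str:
    """Return the backup name for a given day: the stem plus a YYMMDD suffix."""
    return f"{stem}_{day.strftime('%y%m%d')}"

print(backup_name(date(2022, 9, 28)))  # Web_Generator_Performance_Temp_220928
```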
Spider
- This process (governed by Spider_Ctrl, and then Spider_Scurry) interrogates the files on my hard drive that form the local copy of my website and, after determining the directory structure, digs out all the hyperlinks by recursively reading the files as text using the VBA FileSystemObject.
- It is possible to buy, or download free, web-spiders that will check hyperlink integrity across the internet when pointed at a base-URL, but (in my experience) these take forever to run on large sites (if only because the timeout limit has to be set such as to avoid false negatives). Hence, I wrote my own Spider to run quickly on a local drive.
- While the Spider will record external links, it does not check them. However, I have a completely separate sub-system that performs this function: see Web Links [1].
- Despite this being a local checker with no timeout delays, it still takes a long time to run, so there are various parameters that can be set to control what’s done on a particular run.
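The general idea of a local link-harvester can be sketched as below. This is an illustrative Python sketch, not the VBA in Spider_Scurry: the names LinkCollector, links_in_text, is_external and scan_site are hypothetical, and it assumes links appear as href attributes of anchor tags in .htm/.html files.

```python
import os
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag fed to the parser."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def links_in_text(html_text):
    """Return the hyperlinks found in one page, in document order."""
    collector = LinkCollector()
    collector.feed(html_text)
    return collector.links

def is_external(link):
    """External links are recorded but not checked by a local spider."""
    return link.lower().startswith(("http://", "https://"))

def scan_site(root):
    """Walk the local copy of a site and map each HTML file to its links."""
    found = {}
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            if name.lower().endswith((".htm", ".html")):
                path = os.path.join(folder, name)
                with open(path, encoding="utf-8", errors="replace") as fh:
                    found[path] = links_in_text(fh.read())
    return found
```

Because nothing is fetched over the network, there is no timeout to tune; the cost is purely that of reading and parsing the local files.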
- The slave database is now [2] only used to hold the following tables:-
- Since the slave database nearly breaks the 2Gb limit during the full Spider run, the location of the various files is under review.
- The main database did break the 2Gb limit during the (first, abortive) September 2021 run, so I removed the following tables to a new ‘Spider’ database:-
- I’d thought of using the ‘.Run’ technology exemplified in Sub AppAccess_Test but, while I proved that the technology works, it doesn’t seem necessary at the moment. That explains the relocation of some of the smaller tables in Spider.
- Stepping back a bit, there are three main factors involved in the decisions to hold various tables in the same or different databases:
- Database size – especially bloating during the Spider run.
- Efficiency of queries: cross-database joins used to be very inefficient.
- Ability to compact & repair slave databases (but not the master database) mid-run.
- These days – with solid state disks and lots of memory – I think the second factor is less important as the tables are effectively all in memory.
- Consequently, some of the architecture may no longer be strictly required.
- The three main control tables are opened by cmdSpider_Click, namely:-
- These can be used to control / limit the run.
- Spider_Control: This is a single-row table that contains:-
- Statistics from the last run, together with
- A flag “Update_Since_Last_Run” which, if set to “Yes”, limits the run to interrogating pages that have a last-changed timestamp subsequent to the last Spider run. This is less useful than it might be, in that I currently run the Spider after I've regenerated the entire site (using Full Website Re-Gen [3]), so that in practice almost all [4] pages are scanned.
- Another flag “Stop_Spider” which is checked at the start of the recursive Spider_Scurry and which will stop the run if set to “Yes”. The idea is that the control table should be open in another copy of MS Access so that if you want to stop the run without crashing the program, you can do so by setting this flag [5].
- Directory_Structure: This table contains one row for each sub-directory (including the root) within the website. It is maintained by the Spider itself as far as the number of rows is concerned. However, the user can set a couple of flags against each row:-
- “Do_Not_Parse”: this can be set to “Yes” to ignore this directory (and any sub-directories). Any corresponding rows in Site_Map are left unchanged.
- “Updates_Only”: this can be set to over-ride the “Update_Since_Last_Run” flag in Spider_Control for this directory (and any sub-directories).
- Site_Map: This table contains one row for each object (HTML page, PDF, GIF, Word document, and the like) in my website.
- It is maintained by the Spider as far as the number of rows is concerned. The Spider records, amongst other things, how long it took for the object to be parsed (“Time_To_Update”, in milliseconds).
- The user can set a flag “Block_Update” that will cause the object not to be parsed in this coming run. This is because some pages are very large and take a very long time to parse.
- This is a query that selects only those rows of Site_Map that have Block_Update = “Yes”, though it is possible to add others from Site_Map by toggling to the table, setting the flag, and bouncing the query.
- This is a “fine-tuning” way of restricting the operation of the Spider.
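How the control flags might combine can be sketched as follows. This is a Python illustration under stated assumptions, not the Generator's VBA: inherited and should_parse are hypothetical names, a directory flag is assumed to apply to sub-directories unless a nearer directory sets its own value, and “Updates_Only” is assumed to replace the global “Update_Since_Last_Run” setting wherever it is set.

```python
from datetime import datetime
from posixpath import dirname

def inherited(dir_flags, directory, key):
    """Return the nearest non-None value of `key`, from `directory` up to the root ('')."""
    while True:
        value = dir_flags.get(directory, {}).get(key)
        if value is not None:
            return value
        if directory in ("", "/"):
            return None
        directory = dirname(directory)

def should_parse(page, changed, last_run, update_since_last_run, dir_flags, blocked):
    """Decide whether one page is parsed on this run."""
    directory = dirname(page)
    if page in blocked:                                   # Block_Update = "Yes"
        return False
    if inherited(dir_flags, directory, "Do_Not_Parse"):   # directory is ignored
        return False
    override = inherited(dir_flags, directory, "Updates_Only")
    updates_only = update_since_last_run if override is None else override
    # With updates-only in force, only pages changed since the last run qualify.
    return changed > last_run if updates_only else True

last_run = datetime(2022, 9, 1)
flags = {"Notes": {"Do_Not_Parse": True}, "Papers": {"Updates_Only": True}}
print(should_parse("Papers/P1.htm", datetime(2022, 8, 1), last_run, False, flags, set()))  # False
```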
- During the Spider process, the three main tables:-
- Are updated via their local analogues, which are maintained during the bulk of the processing, until the Sub Spider_Copy is called at the end. There is a bit of fancy footwork to ensure the various options neither leave redundant links in place nor delete live ones.
- Spider_Copy also performs the processing to create the “full links” column in the Raw_Links table.
- Spider_Copy also segregates out [6] the Section-link within a page, and then runs [7] the Spider_Missing_Internal_Links query to pick up broken links within the site.
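One plausible reading of these two steps is sketched below in Python (not the Generator's VBA or the actual query). It assumes the “Section-link” is the part of a link after “#”, and that a link is “missing” when its page part does not appear in the site map; split_section and missing_internal_links are hypothetical names.

```python
from urllib.parse import urldefrag

def split_section(link):
    """Split 'page.htm#Section_2' into its page part and its section part."""
    page, section = urldefrag(link)
    return page, section

def missing_internal_links(full_links, site_pages):
    """Return, sorted, the links whose target page is not in the site map."""
    missing = set()
    for link in full_links:
        page, _section = split_section(link)
        # A bare '#section' link points within the current page, so skip it.
        if page and page not in site_pages:
            missing.add(link)
    return sorted(missing)

print(missing_internal_links(["a.htm#x", "b.htm", "#top"], {"a.htm"}))  # ['b.htm']
```

Separating the section first matters because the site map holds pages, not sections; comparing the unsplit link would flag every section link as broken.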
- As noted above, the Spider takes a long time to run. In addition, the activity on the Slave Database, which contains the Raw_Links table, expands the database from its “resting” size of under 600Mb to over the 2Gb limit [8], beyond which the database becomes corrupt. Consequently, I have added two diagnostic and repair functions during the processing:-
- I use Debug.Print to timestamp the various sub-processes within the run.
- I call the function Compact_Repair periodically to Compact & Repair the Slave Database. If this fails, as it sometimes seems to, the process as a whole can fail.
- These diagnostics reveal that:-
- Spider_Scurry is the longest-running process, as might be expected; it can take up to 5 hours based on the database / website at its current size.
- The only other significant process is Full_Link_Up_Levels_Gen, called by Spider_Copy, which takes about 3 hours.
- No other process takes more than a few minutes.
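The timestamping diagnostic can be illustrated with the Python sketch below. It is a stand-in for the Debug.Print approach rather than a copy of it: timed and longest_step are hypothetical names, and durations are collected in a dictionary instead of being printed.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, log):
    """Record under `label` how long the enclosed step took, in seconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        log[label] = time.perf_counter() - start

def longest_step(log):
    """Return the label of the step with the greatest recorded duration."""
    return max(log, key=log.get)

log = {}
with timed("Spider_Scurry", log):
    sum(range(100_000))      # placeholder for the real work
with timed("Spider_Copy", log):
    sum(range(1_000))        # placeholder for the real work
for label, seconds in log.items():
    print(f"{label}: {seconds:.4f}s")
```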
- Description of the detailed processing now follows:-
- Spider_Ctrl: as might be expected, this controls the other processing. It asks questions about the run required so that parameters can be set. It then calls the two main processes below, and finally calls the three processes that output the Web-Links Test Webpages.
- Spider_Scurry: this is a self-referential procedure that combs the links recursively until a “leaf” page is found with no further links. Presumably there’s a check for loops!
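A recursive crawl of this kind, with a loop check and a stop flag polled at the start of each call, might look like the Python sketch below. It is not the Spider_Scurry code: scurry, links_of and stop_requested are hypothetical names, and whether the real procedure guards against loops in this way is not stated above.

```python
def scurry(page, links_of, stop_requested, seen=None):
    """Depth-first walk of the link graph from `page`; returns pages in visit order.

    links_of: callable returning the internal links found on a page.
    stop_requested: callable polled at the start of every call, standing in
        for a re-read of the Stop_Spider flag.
    """
    if seen is None:
        seen = {}   # a dict keeps insertion order, so it doubles as the visit log
    if stop_requested() or page in seen:
        return list(seen)   # stop cleanly, or avoid looping on a visited page
    seen[page] = True
    for target in links_of(page):
        scurry(target, links_of, stop_requested, seen)
    return list(seen)

graph = {"index": ["a", "b"], "a": ["index", "c"], "b": [], "c": ["a"]}
print(scurry("index", lambda p: graph.get(p, []), lambda: False))
# ['index', 'a', 'c', 'b']
```

The “seen” record is what makes the recursion terminate: without it, the mutual links between “index”, “a” and “c” in the example would recurse indefinitely.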
- Spider_Copy: this process:-
- Updates the Directory_Structure, Directory_Fine_Structure and Site_Map tables. The first two tables are in the Backups database; the third is in the main database.
- Updates the Raw_Links table in the Slave database, using the following procedures:-
- Full_Link_Same_Directory_Gen: this process simply uses the query Full_Link_Same_Directory_Updt to set the full link in the simple case where the link is to the same directory. Because so many links are involved (circa 0.5m of this category at the moment), this cannot be an update query, but is run through in code, with the slave database being compacted and repaired every 300,000 records.
- Full_Link_Up_Levels_Gen: this process is similar to the above [9], but caters for the case where links are not to the same directory. In this case, the raw link will have a variable number of “../” directory-shifters, to get back to the Site root-directory, followed by the address from there. For some reason there’s a query of the Directory_Fine_Structure table.
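The two full-link cases, and the idea of compacting every so many records, can be sketched in Python as below. This is an assumption-laden illustration, not the VBA procedures: full_link, levels_up and resolve_in_batches are hypothetical names, paths are taken to be relative to the site root with “/” separators, and compact is any caller-supplied routine standing in for Compact_Repair.

```python
import posixpath

def full_link(page_dir, raw_link):
    """Resolve a raw link found on a page in `page_dir` to a path from the site root."""
    return posixpath.normpath(posixpath.join(page_dir, raw_link))

def levels_up(raw_link):
    """Count the leading '../' directory-shifters in a raw link."""
    count = 0
    while raw_link.startswith("../"):
        count += 1
        raw_link = raw_link[3:]
    return count

def resolve_in_batches(rows, batch_size, compact):
    """Resolve (page_dir, raw_link) rows, calling `compact` after each full batch."""
    resolved = []
    for done, (page_dir, raw_link) in enumerate(rows, start=1):
        resolved.append(full_link(page_dir, raw_link))
        if done % batch_size == 0:
            compact()   # stand-in for a mid-run compact and repair
    return resolved

print(full_link("Notes", "Notes_2.htm"))               # Notes/Notes_2.htm
print(full_link("Notes/Sub", "../../Papers/P1.htm"))   # Papers/P1.htm
```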
- … to be completed …
This Note is awaiting further attention [10].
In-Page Footnotes:
Footnote 2:
- As its name indicates, it used to hold performance statistics, but the code and data have now been removed from the system as being of very little value, and themselves slowing down the system.
Footnote 4:
- The exceptions are those pages that - for one reason or another - aren't regenerated.
- See Prune Website for a process that "prunes" pages that ought to be regenerated, but aren't, because they have become redundant.
Footnote 5:
- It ought to be possible to stop the program by Ctrl-break, but my experience is that with computationally-intensive programs it’s not possible to get a look in, and in any case you then have to step through the code until a convenient break-point for termination is reached.
- To set this flag, you need to open another copy of MS Access and then open a second copy of the Generator, and then open and set the flag in the Spider_Control table.
- Double-clicking .mdb files doesn't work any better than Ctrl-break!
Footnote 6:
- What does this mean, and why is it important?
Footnote 7:
- Or used to ... this is currently commented out!
Footnote 8:
- This stems from MS Access originally being a 32-bit application, which allows 4Gb to be addressed.
- But it seems that MS used the high-order bit to indicate whether the memory was system-only or not, thus halving the available addressability.
- See Quora: MS Access 2Gb Limit.
Footnote 9:
- Though currently has a compact/repair every 200,000 records.
Text Colour Conventions
- Blue: Text by me; © Theo Todman, 2022