Theo Todman's Web Page - Notes Pages


Website Documentation

Website Generator Documentation - Spider & Slave Database

(Work In Progress: output at 09/07/2018 00:39:11)

(For earlier versions of this Note, see the table at the end)


This document covers¹ the following functions performed by clicking buttons on the front screen:-

  1. Backups + ZoomSearch (cmdBackup_Click)
  2. Run Web Spider (cmdSpider_Click)

To see the Code, click on the procedure names above.

Backup
  1. This compacts & repairs, and then backs up, the slave database under the name Web_Generator_Performance_Temp_YYMMDD (where YYMMDD is the run date), using function Compact_Repair; a minimal sketch follows this list.
  2. The slave database is now² used only to hold the Raw_Links table.
  3. Note that there is an entirely separate subsystem (admittedly based on this one) to back up the file-system on this (or any other) PC – see Backup_Ctrl, etc.
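
The following is a minimal sketch of the compact-and-backup step, assuming DAO's CompactDatabase method. The paths, the .mdb extension and the procedure name are illustrative only – the real Compact_Repair may differ:-

Sub Compact_Repair_Sketch()
    ' Illustrative source and target paths - the real Compact_Repair
    ' presumably works out the slave database's location for itself.
    Dim strSource As String, strTarget As String
    strSource = CurrentProject.Path & "\Web_Generator_Performance.mdb"
    strTarget = CurrentProject.Path & "\Web_Generator_Performance_Temp_" & _
                Format(Date, "yymmdd") & ".mdb"
    ' CompactDatabase will not overwrite, so clear any backup already
    ' taken today.
    If Dir(strTarget) <> "" Then Kill strTarget
    ' Compacting & repairing to a new name yields the backup in one step.
    DBEngine.CompactDatabase strSource, strTarget
End Sub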

Spider
  1. This process (governed by Spider_Ctrl, which then calls Spider_Scurry) interrogates the files on my hard drive that form the local copy of my website and, after determining the directory structure, digs out all the hyperlinks by recursively passing through the files as text files using the VBA FileSystemObject. A minimal sketch of this recursive walk appears after this list.
  2. It is possible to buy, or download free, web-spiders that will check hyperlink integrity across the internet when pointed at a base URL, but (in my experience) these take forever to run on large sites, if only because the timeout limit has to be set high enough to avoid false negatives. Hence, I wrote my own Spider to run quickly on a local drive. While it will record external links, it cannot check them. Indeed, I don’t know how to do this, and some sites may be resistant to robots.
  3. Despite this being a local checker with no timeout delays, it still takes a long time to run, so there are various parameters that can be set to control what’s done on a particular run.
  4. The three main control tables, which can be used to control or limit the run, are opened by cmdSpider_Click, namely:-
  5. Spider_Control: This is a single-row table that contains:-
    • Statistics from the last run, together with
    • A flag “Update_Since_Last_Run” which, if set to “Yes”, limits the run to interrogating files whose last-changed timestamp is subsequent to the last Spider run.
    • Another flag “Stop_Spider”, which is checked at the start of the recursive Spider_Scurry and which will stop the run if set to “Yes” (this check is shown in the sketch after this list). The idea is that the control table should be open in another copy of MS Access so that, if you want to stop the run without crashing the program, you can do so by setting this flag³.
  6. Directory_Structure: This table contains one row for each sub-directory (including the root) within the website. It is maintained by the Spider itself as far as the number of rows is concerned. However, the user can set a couple of flags against each row:-
    • “Do_Not_Parse”: this can be set to “Yes” to ignore this directory (and any sub-directories). Any corresponding rows in Site_Map are left unchanged.
    • “Updates_Only”: this can be set to override the “Update_Since_Last_Run” flag in Spider_Control for this directory (and any sub-directories).
  7. Site_Map:
    • This table contains one row for each object (HTML page, PDF, GIF, Word document, and the like) in my website.
    • It is maintained by the Spider as far as the number of rows is concerned. The Spider records, amongst other things, how long it took for the object to be parsed (“Time_To_Update”, in milliseconds).
    • The user can set a flag “Block_Update” that will cause the object not to be parsed in the coming run. This is because some pages are very large and take a very long time to parse.
  8. Blocked_Spider_Files is a query that selects only those rows of Site_Map that have Block_Update = “Yes”, though it is possible to add others from Site_Map by toggling to the table, setting the flag, and bouncing the query. This is a “fine-tuning” way of restricting the operation of the Spider; a reconstruction of the query appears after this list.
  9. During the Spider process, the three main tables are updated via their local analogues, which are maintained during the bulk of the processing until the Sub Spider_Copy is called at the end. There is a bit of fancy footwork to ensure that the various options neither leave redundant links in place nor delete live ones.
  10. Sub Spider_Copy also performs the processing to create the “full links” column in the Raw_Links table. It also segregates out the Section-link within a page (a sketch of this split appears after this list), and then runs the Spider_Missing_Internal_Links query to pick up broken links within the site.
  11. … to be completed …
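
The following is a minimal sketch of the recursive walk described in items 1 and 5 above, assuming the Scripting.FileSystemObject and the single-row Spider_Control table. Apart from FileSystemObject, Spider_Control, Stop_Spider and the table and flag names quoted above, the identifiers are illustrative:-

Sub Spider_Scurry_Sketch(ByVal strFolder As String)
    Dim fso As Object, fld As Object, fil As Object, fldSub As Object
    Dim rs As DAO.Recordset, ts As Object, strLine As String

    ' Check the Stop_Spider flag at the start of each recursion, so
    ' that an operator can halt the run from another copy of Access.
    Set rs = CurrentDb.OpenRecordset("Spider_Control")
    If rs!Stop_Spider = "Yes" Then rs.Close: Exit Sub
    rs.Close

    Set fso = CreateObject("Scripting.FileSystemObject")
    Set fld = fso.GetFolder(strFolder)

    ' Read each file as plain text and dig out the hyperlinks.
    For Each fil In fld.Files
        ' If Update_Since_Last_Run = "Yes", files whose
        ' fil.DateLastModified precedes the last run would be skipped.
        Set ts = fso.OpenTextFile(fil.Path, 1)       ' 1 = ForReading
        Do While Not ts.AtEndOfStream
            strLine = ts.ReadLine
            ' ... scan strLine for href= / src= targets and write rows
            ' to the local analogue of Raw_Links (omitted) ...
        Loop
        ts.Close
    Next fil

    ' Recurse into sub-directories (Do_Not_Parse checks omitted).
    For Each fldSub In fld.SubFolders
        Spider_Scurry_Sketch fldSub.Path
    Next fldSub
End Sub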
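
The Blocked_Spider_Files query of item 8 is, in essence, a single filter on Site_Map. The SQL below is a guess at its definition (the real saved query may select or sort further columns), and the surrounding Sub is purely illustrative:-

Sub Open_Blocked_Spider_Files_Sketch()
    Dim rs As DAO.Recordset
    ' A guess at the SQL behind Blocked_Spider_Files; the saved query
    ' itself could equally be opened with
    ' CurrentDb.OpenRecordset("Blocked_Spider_Files").
    Set rs = CurrentDb.OpenRecordset( _
        "SELECT * FROM Site_Map WHERE Block_Update = 'Yes'")
    ' MoveLast forces the full count before reporting it.
    If Not rs.EOF Then rs.MoveLast
    Debug.Print rs.RecordCount & " blocked file(s)"
    rs.Close
End Sub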
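
The following illustrates the segregation of the Section-link mentioned in item 10: splitting a raw link such as “page.htm#Section” into its file part and its within-page anchor. The function name and calling convention are my assumptions, not necessarily those of Spider_Copy:-

Function Split_Section_Link(ByVal strLink As String, _
                            ByRef strSection As String) As String
    ' Returns the file part of strLink; strSection receives the
    ' within-page anchor (the text after "#"), or "" if there is none.
    Dim lngHash As Long
    lngHash = InStr(strLink, "#")
    If lngHash = 0 Then
        strSection = ""
        Split_Section_Link = strLink
    Else
        strSection = Mid$(strLink, lngHash + 1)
        Split_Section_Link = Left$(strLink, lngHash - 1)
    End If
End Function

Thus, for an illustrative link “Notes/Notes_98.htm#Top”, the function returns “Notes/Notes_98.htm” and sets strSection to “Top”.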

This Note is awaiting further attention⁴.



In-Page Footnotes:

Footnote 1:

Footnote 2: It used – as its name indicates – to hold performance statistics, but the code and data have now been removed from the system as being of very little value.

Footnote 3: It ought to be possible to stop the program by Ctrl-break, but my experience is that with computationally-intensive programs it’s not possible to get a look in, and in any case you then have to step through the code until a convenient break-point for termination is reached.



Table of the Previous 2 Versions of this Note:

Date                  Length  Title
13/01/2015 19:07:41     5310  Website Generator Documentation - Spider & Slave Database
02/06/2013 10:58:05      347  Website Generator Documentation - Spider & Slave Database



Note last updated: 09/07/2018 00:39:12
Reference for this Topic: 986 (Website Generator Documentation - Spider & Slave Database)
Parent Topic: Website Generator Documentation - Control Page

Summary of Note Links from this Page

Awaiting Attention (Documentation)        

To access information, click on one of the links in the table above.




Summary of Note Links to this Page

Website Generator Documentation - Control Page, 2
Website Generator Documentation - Web Links, 2

To access information, click on one of the links in the table above.




Text Colour Conventions

  1. Black: Printable Text by me; © Theo Todman, 2018
  2. Blue: Text by me; © Theo Todman, 2018




© Theo Todman, June 2007 - July 2018. Please address any comments on this page to theo@theotodman.com.