Theo Todman's Web Page - Notes Pages


Website Documentation

Website Generator Documentation - Web Links

(Work In Progress: output at 03/09/2018 13:25:51)

(For earlier versions of this Note, see the table at the end)


Purpose of this Note

This Note documents how external web links (“WebRefs”) are captured, encoded, checked and rendered by the website generator.

Overview of Use of WebRefs
  1. Initially, web references are entered into Notes, Book or Paper Abstracts & Comments, or Author Narratives in their standard URL form, recognised by either or both of “HTTP” and “WWW”.
  2. Subsequent processes interrogate the table rows corresponding to these objects, search for the above indicators of the start of a URL, and then look for various possible terminators marking its end. URLs without “HTTP” have it added. Having found a candidate URL, the process interrogates a table to see whether the URL has been used before, and its key is thereby determined; otherwise the URL is added to the table and a unique integer key is generated.
  3. This key is then used to encode the reference in the form “+WnnnnW+” in the table record in which it was found.
  4. Subsequently, whenever a web-page is formatted, this encoding is interrogated and the web-link reconstituted (the round trip is sketched after this list).
    • The default display text is “Link”, though this can be overridden by adding the required text to the “Display Text” column in the Webrefs_Table.
    • An option exists to show the URL explicitly, as in “printable Notes”.
    • If the ID is flagged as defunct, the link is followed by “Defunct” in brackets.
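A minimal sketch of this round trip – in Python rather than the VBA the site appears to use, and with a column layout that is my paraphrase of Webrefs_Table, not its real schema:

    webrefs_table = {}      # key -> {"url", "display", "defunct"}
    next_key = 1

    def register(url):
        """Return the existing key for a URL, or allocate a new one."""
        global next_key
        for key, row in webrefs_table.items():
            if row["url"] == url:
                return key
        key, next_key = next_key, next_key + 1
        webrefs_table[key] = {"url": url, "display": None, "defunct": False}
        return key

    key = register("http://www.example.com/page")
    encoded = "+W%dW+" % key               # what gets stored in the object's record
    row = webrefs_table[key]               # ... and, at formatting time ...
    html = '<a href="%s">%s</a>' % (row["url"], row["display"] or "Link")
    if row["defunct"]:
        html += " (Defunct)"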

Code / Functions
Spider
  1. See Spider_Ctrl and this Note.
  2. This process crawls the local copy of my website, using Spider_Scurry, and digs out all the links – internal and external (a sketch of this link-digging follows this list). These are originally logged to Raw_Links and then (via WebLinkCheck) to Webrefs_Table if they are not already present.
  3. These processes find links by searching for the HTML tags, the URLs within the tags having been supplied either from manually-encoded HREFs, or ultimately from Webrefs_Table itself, so this should really only add rows for manually-encoded HREFs.
  4. However, there was a case recently where a raft of rows were added by the Spider. These rows are loaded with Issue set to “Created by Spider”.
  5. I investigated the above, and it seemed to be a problem for Notes and Notes_Archive whereby manually encoded hyperlinks had URLs that couldn’t be found on Webrefs_Table. Probably this was down to the original matching entries having been corrected – either for trivial reasons (secured, trailing slash) or because they had genuinely changed. I have hopefully corrected this problem with a new routine (Translate_Hrefs_To_Webrefs) that encodes these hyperlinks in the +WW+ format.
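A sketch of the link-digging, in Python rather than the VBA of Spider_Scurry; the file extensions and the regex are my assumptions:

    import os
    import re

    HREF = re.compile(r'href\s*=\s*"([^"]+)"', re.IGNORECASE)

    def dig_links(root):
        """Walk a local copy of the site and log every anchor target."""
        raw_links = []                             # plays the role of Raw_Links
        for folder, _dirs, files in os.walk(root):
            for name in files:
                if not name.lower().endswith((".htm", ".html", ".shtml")):
                    continue
                path = os.path.join(folder, name)
                with open(path, encoding="utf-8", errors="replace") as f:
                    for url in HREF.findall(f.read()):
                        raw_links.append((path, url))
        return raw_links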

Error Conditions
  1. The Error Conditions currently generated automatically are as follows (the classification logic is sketched after this section):-
    • Manual Check OK: Set manually. This routine doesn’t check again.
    • Created by Spider: Loaded by Spider_Scurry if there’s no record. These situations should only occur very rarely, where there is a hard-coded “HREF”.
    • Timeout: If a slow response means the default 3,000 (or, on re-check, 6,000) checks are exceeded.
    • File Type Uncheckable: My routines don’t work for Word or PowerPoint documents (.doc, .docx, .pps), or .mp3 files. When I spot these, the WebRefs will be flagged as “Manual Check”.
    • URL Not Found: if the URL returned is Link (Defunct) or Link.
    • Page Not Found: if initially “URL Not Found”, but the requested URL contains “.htm” or “.shtm”.
    • Document Not Found: if initially “URL Not Found”, but not overridden to “Page Not Found”, and the last 6 characters of the Requested URL contain a “.”.
    • URL Secured: if the URL returned is identical to that requested, but with “https” rather than “http”.
    • URL with trailing slash: if the URL returned is identical to that requested except that one or the other has a trailing “/”.
    • URL Differs: Any difference other than those listed above.
    • URL Translated OK: Set manually, when using the Translate_Webrefs function. Naturally, there should be no pages to which this error applies.
  2. If an error of “URL Differs” is returned, and the Requested URL contains “youtube”, then this error is blanked out if either or both of the requested and returned URLs terminate with “&t=” (followed by the time in seconds) and the URLs are otherwise identical when this suffix is removed.
  3. These settings may be overridden manually, so occasionally they don’t have the precise meaning above.
  4. In particular, sometimes the website returns a “not found” page without changing the URL, so I don’t recognise the problem – but when I do find out I set the Error Condition manually.
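A hedged sketch of the automatic classification just described – a Python stand-in with my own function name and a simplified notion of “found”; the real routine works on the Access table rows:

    import re

    def classify(requested, returned, found=True):
        """Rough paraphrase of the automatic Error Conditions above."""
        if not found:                              # the "URL Not Found" family
            if ".htm" in requested or ".shtm" in requested:
                return "Page Not Found"
            if "." in requested[-6:]:
                return "Document Not Found"
            return "URL Not Found"
        if returned == requested:
            return ""                              # no error
        if returned == requested.replace("http://", "https://", 1):
            return "URL Secured"
        if returned.rstrip("/") == requested.rstrip("/"):
            return "URL with trailing slash"
        # The YouTube concession: blank the error if the URLs differ only
        # by a trailing "&t=<seconds>".
        if "youtube" in requested:
            def strip_t(u):
                return re.sub(r"&t=\d+$", "", u)
            if strip_t(requested) == strip_t(returned):
                return ""
        return "URL Differs"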

Detailed Processing
  1. Functions / Processes Used:-
  2. WebEncode
    • The comment says “This is a new routine to convert hard-coded external hyperlinks into my +WW+ format”.
    • Searches are made for the start of a URL – either “HTTP” or “WWW”, whichever comes first.
    • If the prospective URL is preceded by an “HREF”, the link is taken to be manually encoded, so is ignored (and remains “as is” in the ensuing HTML). Otherwise …
    • URLs that start with “WWW” are prefixed with “HTTP” before being written to Webrefs_Table.
    • Checks for “WRx”, where x is a delimiter. This is to allow for WebRefs with characters in them that would normally cause “end of URL” prematurely. Instead, this tells the function to look for the termination character “x” as (immediately following) the end of the URL. The “WRx” is deleted.
    • Now checks for the termination of the URL. The earliest of:-
      → Space
      → chr(9)
      → chr(10)
      → chr(13)
      → “<”
      → “)”
      → “|”
      → “;”
      → “, ” (note the space: commas are OK within URLs)
    • If none of these terminators is found, then it is assumed that the URL continues to the end of the text.
    • There’s then a check for an open-ended list of identifiers that can indicate the end of the URL, but can also be part of it. I currently check for:-
      → “.”
      → “:”, and
      → “)”.
      Basically, I assume – subject to the further checks below – that these would not be the final characters of the URL, so just lop them off from the provisional URL if they had been initially presumed to be the last character, and assume that any earlier occurrences would truly be part of the URL.
    • Some further adjustments to the above:-
      1. “.” can be a URL-terminator (as in Link). Currently I only allow for “Jnr.”, but will add others if they turn up.
      2. “)” can also be a URL-terminator (as in Link or Link). I have a clever little routine that counts the opening and closing brackets in the putative URL: if the counts are equal, the final closing bracket is part of the URL, else not (see the sketch at the end of this subsection).
      3. Some combinations of such terminators won’t work – these have to be sorted manually.
    • A search is made for the URL in the Webrefs_Table. If it’s found, the URL is replaced in the document as +WnnnnW+, where “nnnn” is the ID returned. If it’s not found, and is > 12 characters long, it is added to the Webrefs_Table, and replaced as above.
    • Note that no hyperlinks are created by this process.
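The terminator scan and the balanced-bracket rule lend themselves to a short sketch. This is Python and a simplification: in particular, the real routine also treats “)” as a scan terminator before applying the bracket rule, and the “WRx” override is omitted:

    def end_of_url(text, start):
        """Return the URL that begins at text[start:], per the rules above."""
        terminators = [" ", "\t", "\n", "\r", "<", "|", ";", ", "]
        cut = len(text)                    # default: URL runs to end of text
        for t in terminators:
            pos = text.find(t, start)
            if pos != -1:
                cut = min(cut, pos)
        url = text[start:cut]
        # A trailing "." or ":" is presumed to be punctuation, not part of
        # the URL ("Jnr." being the one exception allowed so far).
        if url.endswith((".", ":")) and not url.endswith("Jnr."):
            url = url[:-1]
        # Balanced-bracket rule: a final ")" is kept only when the URL
        # contains as many "(" as ")"; otherwise it closes prose brackets.
        while url.endswith(")") and url.count("(") != url.count(")"):
            url = url[:-1]
        return url

    # e.g. end_of_url("(see http://en.wikipedia.org/wiki/Lisp_(language))", 5)
    #      -> "http://en.wikipedia.org/wiki/Lisp_(language)"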
  3. Convert_Webrefs
    • Called by cmdRecalculate_Click, and invokes WebEncode (above) for all changed objects.
    • Convert_Webrefs is called with the type of Object – Paper, Book, Author, Note, Note_Archive.
    • In the case of Papers and Books, there’s a doubling up whereby both the Abstract and the Comment are addressed.
    • The business of this Function is to use WebEncode to encode all URLs in the Object. If the Object returned differs from that sent – i.e. some translations of raw URLs have taken place – the Object is updated. This is the only place where this now takes place, so it is important that this process is run!
    • Currently, the process decides which Objects – of the relevant type – to process by using Maintainable_Objects (and other tables), which are supposed to show which objects have changed.
    • Since the above process doesn’t always work, I need to add a process of forcing the encoding by checking all objects of the chosen type for the presence of HTTP or WWW, and updating the rows returned.
  4. Reference_Webrefs
    • Searches are made for items of the form “+WnnnnW+”.
    • There are length checks for “nnnn” – anything greater than 5 characters is rejected, as is any non-numeric string – like the one just appearing in this document!
    • Valid IDs are used to interrogate the Webrefs_Table.
    • If the reference is not found,
      1. “Missing Reference” is substituted for the “+WnnnnW+”.
      2. A debug message is output, and
      3. A row is added to WebRef_Missing_IDs.
    • Otherwise,
      1. If the table row does not have the “defunct reference” set, the reference is formatted as a hyperlink, to open in the same tab, using the URL retrieved.
      2. The link name is defaulted to “Link”, but is overridden if a Display Text had been entered.
      3. If the “Show Link” parameter was set, the URL is displayed (in brackets).
      4. If the table row does have the “defunct reference” set, the (failing) hyperlink is retained, but with “Defunct” (in brackets) after it.
    • If the Calling type <> “X”, then
      1. Cross_Reference_Add is called to add a cross-reference.
      2. A “name” HTML tag is added so that this place in the page can be linked to. (This decode-and-format pass is sketched below.)
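A sketch of the decode-and-format pass, with the usual caveats – Python stand-in, webrefs_table as paraphrased earlier; the real routine also records cross-references and missing IDs in tables:

    import re

    PATTERN = re.compile(r"\+W(\d{1,5})W\+")   # >5 digits, or non-numeric: no match

    def render(text, webrefs_table, show_link=False):
        def repl(match):
            key = int(match.group(1))
            row = webrefs_table.get(key)
            if row is None:
                print("Missing WebRef ID:", key)   # stands in for the debug message
                return "Missing Reference"
            label = row["display"] or "Link"       # default display text
            html = '<a href="%s">%s</a>' % (row["url"], label)
            if show_link:                          # the "printable Notes" option
                html += " (%s)" % row["url"]
            if row["defunct"]:                     # keep the failing link, flag it
                html += " (Defunct)"
            return html
        return PATTERN.sub(repl, text)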
  5. Webrefs_Update
    • This function requires its own Note, though most of its functionality was described above under “Code / Functions, section 2”.
    • However, as noted above, it checks the URLs in Webrefs_Table against the Web, and updates the table with the URL returned, where this differs from the requested, and logs any problems encountered.
    • For Timeout re-checks, the number of checks is doubled to 6,000, and any Timeout more than an hour old is selected. (A minimal per-URL check is sketched below.)
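For illustration, a minimal per-URL check in Python using the third-party requests library – the live checks run from the database, and this sketch collapses the status handling to the two simplest outcomes:

    import requests

    def check_url(requested, timeout=10):
        """Return (returned_url, error_condition) for one WebRef."""
        try:
            resp = requests.get(requested, timeout=timeout, allow_redirects=True)
        except requests.Timeout:
            return requested, "Timeout"
        except requests.RequestException:
            return requested, "URL Not Found"
        if resp.status_code == 404:
            return resp.url, "URL Not Found"
        return resp.url, ""                # resp.url is the final, redirected URL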
  6. Translate_Webrefs
    • This function translates selected +WnnnnW+ references to +WmmmmW+ in the tables containing the relevant objects: namely, Authors, Books (Abstracts & Comments, separately), Notes, Notes_Archive & Papers (Abstracts & Comments, separately).
    • The code is a bit clunky, having 7 loops in sequence; the token rewrite itself is sketched below.
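The substitution amounts to a keyed rewrite of the +WnnnnW+ tokens; a Python sketch, with the old-to-new mapping assumed:

    import re

    def translate(text, old_to_new):
        """Rewrite +WnnnnW+ to +WmmmmW+ wherever the map has an entry."""
        def repl(match):
            old = int(match.group(1))
            return "+W%dW+" % old_to_new.get(old, old)   # unmapped keys pass through
        return re.sub(r"\+W(\d{1,5})W\+", repl, text)

    # e.g. translate("See +W1234W+.", {1234: 9876}) -> "See +W9876W+."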
  7. Map_Webrefs
    • This routine creates the table WebRef_Maps, which shows which WebRefs feature in which objects.
    • Rows are created for all the primary objects; ie. for Authors, Books, Notes, Notes_Archive and Papers.
    • The grunt work is undertaken by Map_WebRefs_Mapper.
    • The error handler “Resume Next” is used where the same WebRef is multiply attested in the same Object, to circumvent the duplicate-key error (the mapping step is sketched below).
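A sketch of the mapping step; collecting the IDs into a set per Object is the Python equivalent of letting “Resume Next” swallow the duplicate-key errors:

    import re

    def map_webrefs(objects):
        """objects maps (object_type, object_id) -> text; returns map rows."""
        rows = []
        for (obj_type, obj_id), text in objects.items():
            ids = {int(n) for n in re.findall(r"\+W(\d{1,5})W\+", text)}
            for webref_id in sorted(ids):          # one row per distinct WebRef
                rows.append((obj_type, obj_id, webref_id))
        return rows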





Table of the Previous 2 Versions of this Note:

Date Length Title
05/04/2018 10:48:00 22402 Website Generator Documentation - Web Links
05/01/2018 00:11:31 22260 Website Generator Documentation - Web Links



Note last updated: 03/09/2018 13:25:52
Reference for this Topic: 1247 (Website Generator Documentation - Web Links)
Parent Topic: Website Generator Documentation - Control Page

Summary of Note Links from this Page

Awaiting Attention (Documentation)
Website Generator Documentation - Spider & Slave Database





Summary of Note Links to this Page

Website - Progress to Date, 2
Website Generator Documentation - Control Page
Website Generator Documentation - Spider & Slave Database









© Theo Todman, June 2007 - Sept 2018. Please address any comments on this page to theo@theotodman.com.