Theo Todman's Web Page - Notes Pages
Website Documentation
Website Generator Documentation - Web Links
(Work In Progress: output at 22/09/2023 04:00:49)
Purpose of this Note
- The main point of this Note is to describe the overall functioning of the routines that create and maintain the links to external websites from my own site.
- The problem with external links is that they get out of date or become defunct. It is also difficult to remember or determine which pages a given link appears on, and to ensure that all pages using the same link are corrected if it changes. I have built a complex system to address these problems, but:-
- It has some bugs in it (to be detailed later, though these have now mostly been fixed), and
- It requires regular policing, facilities for which have been provided, but which may require further work in the light of on-going experience.
- There is some overlap between this Note and that for the Spider1. The relevance here is that the spider crawls a local copy of my website and digs out all the hyperlinks (most of which are internal to my site) into the Raw_Links table.
- There are - as of 22/09/2023 03:54:53 - about 2.61 million rows in the Raw_Links table, but only around 97.5k are external links (of Link_Type = "Web"), which is still rather a lot, though most are "repeats"! There are about 9.4k active unique external links.
- I’ve now added a process to extract all the encoded hyperlinks in my database for use in cross-checking. This is a much simpler process than the Spider as it only runs against well-formatted items in database objects rather than on HTML pages.
- See the Section ‘Outstanding Developments’ at the end of this Note for issues with Internet Explorer and Windows 11.
Overview of Use of WebRefs
- Initially, web references are entered into Notes, Book or Paper Abstracts & Comments, or Author Narratives in their standard URL form, recognised by either or both of “HTTP” and “WWW”.
- Subsequent processes interrogate the table rows corresponding to these objects, search for the above indicators of the start of a URL, and then look for various possibilities for its end. URLs without “HTTP” have it added. Having found a candidate URL, the Webrefs_Table is interrogated to see if it has been used before, and its key thereby determined; otherwise it is added to the table and a unique integer key generated.
- This key is then used to encode the reference in the form “+WnnnnW+” in the table record in which it was found.
- Subsequently, whenever a web-page is formatted, this encoding is interrogated and the web-link reconstituted.
- The default display text is “Link”, though this can be overridden by adding the required text to the “Display Text” column in the Webrefs_Table.
- An option exists to show the URL explicitly, as in “printable Notes”.
- If the ID is flagged as defunct, the link is followed by “Defunct” in brackets.
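- As a minimal sketch, the look-up-or-add step just described might look as follows (assuming Webrefs_Table has an AutoNumber “ID” column and a “URL” column; the real column names may differ):-
' Illustrative only - not the actual routine. Looks a URL up in Webrefs_Table,
' adding it (and generating a new key) if it has not been used before.
Public Function GetOrAddWebRefID(ByVal strURL As String) As Long
    Dim varID As Variant
    varID = DLookup("ID", "Webrefs_Table", "URL=""" & Replace(strURL, """", """""") & """")
    If IsNull(varID) Then
        Dim rs As DAO.Recordset
        Set rs = CurrentDb.OpenRecordset("Webrefs_Table", dbOpenDynaset)
        rs.AddNew
        rs!URL = strURL
        rs.Update
        rs.Bookmark = rs.LastModified          ' move to the row just added
        GetOrAddWebRefID = rs!ID               ' the newly generated unique key
        rs.Close
    Else
        GetOrAddWebRefID = CLng(varID)
    End If
End Function
' The URL found in the object is then replaced by "+W" & GetOrAddWebRefID(strURL) & "W+".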
Code / Functions
- Encoding and logging of URLs to Webrefs_Table is performed by WebEncode.
- Actual updating of the database objects containing the URLs is performed by Convert_WebRefs, which calls WebEncode with the relevant objects. Thereafter the objects in the database no longer contain URLs but only WebRefs of the form +WnnnnW+.
- Decoding of WebRefs so that hyperlinks exist and function in the relevant web pages is performed by Reference_Webrefs.
- Note that it is important – wherever the two routines WebEncode and Reference_Webrefs are run in conjunction with other decoders (for Papers, Books and Notes in Footnotes) – that WebEncode is run first and Reference_Webrefs is run after the other decoders. This is because URLs can contain “+x”, where “x” is B, P or N, which confuses the referencing routines for Books, Papers and Notes in Footnotes, as they use such strings to indicate the start of an encoded reference. The URLs therefore need to be removed by encoding first, and only decoded once the other decoding has been completed.
- The following functions are performed by clicking a button on the front screen:-
→ Update / Check Web Links: cmdWebLinks_Click
Six options are presented:-
- Update the Webrefs_Table Table
- Query qrySearch_Webrefs_Table is run. A parameter for a search is requested; null returns the whole table in ascending ID sequence, else the URLs are searched for the term.
- The URL or other information for selected IDs can then be updated. Reasons and actions include:-
→ there has been an encoding error,
→ the URL has genuinely changed,
→ an “Alt_Ref2” can be added if attempts to change the URL lead to a duplicate key as the new URL is already “taken”
→ automated checking fails for whatever reason, but a manual check proves OK. In that case, the Issue can be set to “Manual Check OK3”
→ a “Display Text” can be added or changed
→ the ID can be marked as “defunct”
→ the “Alt_Ref” or Issue can be changed or suppressed (sometimes needed during error-correction procedures)
- Check the Webrefs_Table External Links
- This invokes the sub Webrefs_Update, which checks the links in Webrefs_Table against the Web, and updates the table with the results of the check, noting any irregularities. This invokes a slave copy of Internet Explorer, basically checking that the URL returned is that requested, and recording the time taken and the number of tries4.
- The various parameters are displayed and then three options are presented:-
- Run Full Check
- Run Check for New Links since last run
- Run for Time-outs in last run
- The parameters are actually hard-coded, as follows (see the sketch after this list):-
- The “recent5” parameter is set to 10 days, though this is reset to one hour if the “timeout” option is taken.
- The maximum run-time is 0.5 hours.
- Internet Explorer is “bounced6” every minute, or after a dud URL7 or timeout.
- The maximum number of checks Internet Explorer makes for each URL is 3,000.
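- A rough sketch of the per-URL check (plain IE automation only; the real Webrefs_Update also records timings, retries, and “bounces” IE, and its names differ):-
' Illustrative only. Navigates IE to the requested URL, polling up to the
' maximum number of checks, and returns the URL IE ends up at.
Public Function CheckOneURL(ByVal strRequested As String) As String
    Const MAX_CHECKS_PER_URL As Long = 3000    ' doubled to 6,000 for Timeout re-checks
    Dim ie As Object, lngChecks As Long
    Set ie = CreateObject("InternetExplorer.Application")
    ie.Visible = False
    ie.Navigate strRequested
    Do While ie.Busy Or ie.ReadyState <> 4     ' 4 = READYSTATE_COMPLETE
        lngChecks = lngChecks + 1
        If lngChecks > MAX_CHECKS_PER_URL Then Exit Do   ' would be logged as "Timeout"
        DoEvents
    Loop
    CheckOneURL = ie.LocationURL               ' compared with the requested URL
    ie.Quit
End Function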
- Regenerate the External Links Test Pages: this invokes three procedures to produce five sets of8 web-pages:-
- Spider_WebLinks_Tester_Brief_Page_Gen:
→ Generates WebLinks_Tester_Brief.htm, which has one line for each row on Webrefs_Table, in ascending ID sequence.
- Spider_WebLinks_Tester_Page_Gen:
→ Generates WebLinks_Tester.htm, which has a page for each error condition9.
→ For each ID subject to the relevant condition all the pages that use the URL associated with the ID are listed.
- Spider_WebLinks_Tester_Page_Full_Gen:
→ Generates WebLinks_Tester_Full.htm, which has a row for each link from my website.
→ These are sorted in ascending ID sequence, with links to the pages and any issues arising.
→ This procedure has to run after the above, as it uses tables created therein.
- Spider_WebLinks_Tester_Page_Gen:
→ Run again, this time with the “Map” parameter. This uses the WebRef_Maps table created by Map_WebRefs.
→ Generates WebLinks_Tester_Map.htm, which has a page for each active error condition10.
→ For each ID subject to the relevant condition all the pages that use the URL associated with the ID are listed.
- Spider_WebLinks_Tester_Page_Full_Gen:
→ Run again, this time with the “Map” parameter.
→ Generates WebLinks_Tester_Full_Map.htm, which has a row for each link from my website.
→ These are sorted in ascending ID sequence, with links to the pages and any issues arising.
→ This procedure has to run after the above, as it uses tables created therein.
Note: Before the above procedures are run, the previously-generated pages are deleted from the local site11.
- Display Webrefs Errors: This runs three queries & displays one table:-
- Convert WebRefs:
- This is a function Translate_Webrefs that translates selected +WnnnnW+ references to +WmmmmW+ in the tables containing the relevant objects.
- The nnnn → mmmm mapping is held in the Webrefs_Translation table.
- Map WebRefs:
- This is a function Map_WebRefs that searches Authors, Books, Notes, Notes_Archive and Papers to check for WebRefs (encoded as +WnnnnW+).
- The mappings are held in the WebRef_Maps table.
- This is a different take on things from the Spider, which derives the links from the actual HREFs in the pages. As a result, the Spider:-
- Shows non-encoded links (now fairly few).
- Shows links from composite pages (Notes_Print, BookPaperAbstracts).
- Is usually out-of-date, as it’s only run monthly.
- Even so, there are more discrepancies than I’d expect, and these require investigation.
Spider
- See Spider_Ctrl and this Note14.
- This process crawls the local copy of my website, using Spider_Scurry, and digs out all the links – internal and external. These are originally logged to Raw_Links and then (via WebLinkCheck) to Webrefs_Table if they are not already present.
- These processes find links by searching for the HTML tags, the URLs within the tags having been supplied either from manually-encoded HREFs, or ultimately from Webrefs_Table itself, so this should really only add rows for manually-encoded HREFs.
- However, there was a case recently where a raft of rows were added by the Spider. These rows are loaded with Issue set to “Created by Spider”.
- I investigated the above, and it seemed to be a problem for Notes and Notes_Archive whereby manually encoded hyperlinks had URLs that couldn’t be found on Webrefs_Table. Probably this was down to correcting the original matching entries – either for trivial reasons (secured, trailing slash) or because they had genuinely changed. I’ve hopefully corrected this problem with a new routine (Translate_Hrefs_To_Webrefs) that encodes these hyperlinks in the +WW+ format, as sketched below.
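- Presumably, what Translate_Hrefs_To_Webrefs has to do for each such hyperlink is something like the following sketch (which ignores the display text between the tags and any secured / trailing-slash complications; GetOrAddWebRefID is the illustrative helper sketched earlier in this Note):-
' Illustrative only. Replaces the first <a href="...">...</a> in a piece of
' text with the corresponding +WnnnnW+ token.
Public Function HrefToWebRef(ByVal strText As String) As String
    Dim lngStart As Long, lngQuote As Long, lngClose As Long, strURL As String
    HrefToWebRef = strText
    lngStart = InStr(1, strText, "<a href=""", vbTextCompare)
    If lngStart = 0 Then Exit Function
    lngQuote = InStr(lngStart + 9, strText, """")
    If lngQuote = 0 Then Exit Function
    strURL = Mid(strText, lngStart + 9, lngQuote - lngStart - 9)
    lngClose = InStr(lngQuote, strText, "</a>", vbTextCompare)
    If lngClose = 0 Then Exit Function
    HrefToWebRef = Left(strText, lngStart - 1) & _
                   "+W" & GetOrAddWebRefID(strURL) & "W+" & _
                   Mid(strText, lngClose + 4)
End Function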
Error Conditions
- The conditions are currently as follows (most are generated automatically, though some are set manually, as noted):-
- Manual Check OK: Set manually. This routine doesn’t check again.
- Created by Spider: Loaded by Spider_Scurry if there’s no record. These situations should only occur very rarely, where there is a hard-coded “HREF”.
- Timeout: if a slow response means the default 3,000 (or 6,000 on re-check) checks are exceeded.
- File Type Uncheckable: My routines don’t work for Word or PowerPoint documents (.doc, .docx, .pps), or .mp3 files. When I spot these, the WebRefs are flagged as “Manual Check”.
- URL Not found: if the URL returned is one of two specific error URLs (both recorded as defunct links).
- Page Not Found: if initially “URL Not Found”, but the requested URL contains “.htm” or “.shtm”.
- Document Not Found: if initially “URL Not Found”, but not overridden to “Page Not Found”, and the last 6 characters of the Requested URL contain a “.”.
- URL Secured: if the URL returned is identical to that requested, but with “https” rather than “http”.
- URL with trailing slash: if the URL returned is identical to that requested except that one or the other has a trailing “/”.
- URL Differs: Any difference other than those listed above.
- URL Translated OK: Set manually. Used when using the Translate_Webrefs function. Naturally, there should be no pages to which this error applies.
- If an error of “URL Differs” is returned, and the Requested URL contains “youtube”, then this error is blanked out if either or both of the requested and returned URLs terminate with “&t=” (followed by the time in seconds) and the URLs are otherwise identical once this suffix is removed.
- These settings may be overridden manually, so occasionally they don’t have the precise meaning above.
- In particular, sometimes the website returns a “not found” page without changing the URL, so I don’t recognise the problem – but when I do find out I set the Error Condition manually.
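- As a minimal sketch, the comparison of requested and returned URLs might look like this (the “URL Not Found” / file-type checks and manual overrides are omitted, and the function names are mine, not those in the actual code):-
' Illustrative only. Classifies the difference between the requested and
' returned URLs into the conditions listed above.
Public Function ClassifyWebRefResult(ByVal strRequested As String, ByVal strReturned As String) As String
    Dim strReq As String, strRet As String
    strReq = StripTimeSuffix(strRequested)
    strRet = StripTimeSuffix(strReturned)
    If strRet = strReq Then
        ClassifyWebRefResult = ""                                   ' no issue
    ElseIf Replace(strRet, "https://", "http://") = strReq Then
        ClassifyWebRefResult = "URL Secured"
    ElseIf TrimSlash(strRet) = TrimSlash(strReq) Then
        ClassifyWebRefResult = "URL with trailing slash"
    Else
        ClassifyWebRefResult = "URL Differs"
    End If
End Function

Private Function StripTimeSuffix(ByVal strURL As String) As String
    ' Remove a YouTube "&t=NNN" suffix, per the rule described above.
    Dim lngPos As Long
    StripTimeSuffix = strURL
    If InStr(1, strURL, "youtube", vbTextCompare) > 0 Then
        lngPos = InStrRev(strURL, "&t=")
        If lngPos > 0 Then
            If IsNumeric(Mid(strURL, lngPos + 3)) Then StripTimeSuffix = Left(strURL, lngPos - 1)
        End If
    End If
End Function

Private Function TrimSlash(ByVal strURL As String) As String
    TrimSlash = strURL
    If Right(strURL, 1) = "/" Then TrimSlash = Left(strURL, Len(strURL) - 1)
End Function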
Other Detailed Processing
- Functions / Processes Used:
Note the distinction between functions performed on database rows and those performed on the web-pages created from such rows. Each function is discussed in detail in the bullets below.
- WebEncode
- The comment says “This is a new routine to convert hard-coded external hyperlinks into my +WW+ format”.
- Searches are made for the start of a URL – either “HTTP” or “WWW”, whichever comes first.
- If the prospective URL is preceded by15 an “HREF” the link is taken to be manually encoded, so is ignored (and remains “as is” in the ensuing HTML). Otherwise …
- URLs that start with “WWW” are prefixed with “HTTP16” before being written to Webrefs_Table.
- Checks for “WRx”, where x is a delimiter. This is to allow for WebRefs containing characters that would otherwise prematurely signal “end of URL”. Instead, this tells the function to treat the character “x” (immediately following the URL) as the terminator. The “WRx” is then deleted.
- Now checks for the termination of the URL. The earliest of:-
→ Space
→ chr(9)
→ chr(10)
→ chr(13)
→ “<”
→ “)”
→ “|”
→ “;”
→ “, ” (note the space: commas are OK within URLs)
- If none of these terminators is found, then it is assumed that the URL continues to the end of the text17.
- There’s then a check for an open-ended list of characters that can indicate the end of the URL, but can also be part of it. I currently check for:-
→ “.”
→ “:”, and
→ “)”.
Basically, I assume – subject to the further checks below – that these would not be the final characters of the URL, so just lop them off from the provisional URL if they had been initially presumed to be the last character, and assume that any earlier occurrences would truly be part of the URL.
- Some further adjustments to the above:-
- “.” can sometimes be the final character of a URL (as in Wikipedia: John Forbes Nash, Jr. (Defunct)). Currently I only allow for “Jnr.”, but will add others if they turn up.
- “)” can also be the final character of a URL (as in Wikipedia: Levi - The Periodic Table or Wikipedia: The Hardest Part (Coldplay song)). I have a clever little routine that adds up the opening and closing brackets in the putative URL: if the numbers are equal, the final closing bracket is part of that URL; otherwise it is not.
- Some combinations of such terminators won’t work – these have to be sorted manually.
- A search is made for the URL in the Webrefs_Table. If it’s found, the URL is replaced in the document as +WnnnnW+, where “nnnn” is the ID returned. If it’s not found, and is > 12 characters long18, it is added to the Webrefs_Table, and replaced as above.
- Note that no hyperlinks are created by this process.
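- A minimal sketch of the URL-termination logic just described (terminator list, bracket-counting, and lopping of trailing punctuation); it ignores the “WRx” override and the “Jnr.” exception, and the names are mine rather than those in WebEncode:-
' Illustrative only. Extracts the candidate URL starting at lngStart.
Public Function ExtractURL(ByVal strText As String, ByVal lngStart As Long) As String
    ' lngStart points at the "HTTP" or "WWW" that begins the candidate URL.
    Dim varTerm As Variant, lngEnd As Long, lngPos As Long, strURL As String
    lngEnd = Len(strText) + 1                        ' default: URL runs to the end of the text
    For Each varTerm In Array(" ", Chr(9), Chr(10), Chr(13), "<", ")", "|", ";", ", ")
        lngPos = InStr(lngStart, strText, varTerm)
        If lngPos > 0 And lngPos < lngEnd Then lngEnd = lngPos
    Next varTerm
    strURL = Mid(strText, lngStart, lngEnd - lngStart)
    ' If the URL was cut at a ")", count brackets: equal numbers mean the ")" belongs to the URL.
    If lngEnd <= Len(strText) Then
        If Mid(strText, lngEnd, 1) = ")" And _
           CountChar(strURL & ")", "(") = CountChar(strURL & ")", ")") Then
            strURL = strURL & ")"
        End If
    End If
    ' Lop off a trailing "." or ":" - these are usually punctuation, not part of the URL.
    Do While Len(strURL) > 0 And (Right(strURL, 1) = "." Or Right(strURL, 1) = ":")
        strURL = Left(strURL, Len(strURL) - 1)
    Loop
    ExtractURL = strURL
End Function

Private Function CountChar(ByVal strText As String, ByVal strChar As String) As Long
    CountChar = Len(strText) - Len(Replace(strText, strChar, ""))
End Function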
- Convert_Webrefs
- Called by cmdRecalculate_Click, and invokes WebEncode (above) for all changed objects.
- Convert_Webrefs is called with the type of Object – Paper, Book, Author, Note, Note_Archive.
- In the case of Papers and Books, there’s a doubling up whereby both the Abstract and the Comment are addressed.
- The business of this Function is to use WebEncode to encode all URLs in the Object. If the Object returned differs from that sent – i.e. some translations of raw URLs have taken place – the Object is updated. This is the only place where this will now take place, so it is important that this process is run!
- Currently, the process decides which Objects – of the relevant type – to process by using Maintainable_Objects (and other tables), which are supposed to show which objects have changed.
- Since the above process doesn’t always work, I need to add a process of forcing the encoding by checking all objects of the chosen type for the presence of HTTP or WWW, and updating the rows returned (a possible approach is sketched below).
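- Such a forcing process might be based on a loop along these lines (the table and column names here are purely illustrative – the real Notes table layout may differ):-
' Illustrative only. Select all Notes whose text still contains a raw URL,
' ready for re-encoding via WebEncode.
Public Sub ForceEncodeNotes()
    Dim rs As DAO.Recordset
    Set rs = CurrentDb.OpenRecordset( _
        "SELECT ID, Note_Text FROM Notes " & _
        "WHERE Note_Text LIKE '*http*' OR Note_Text LIKE '*www*'", dbOpenDynaset)
    Do While Not rs.EOF
        ' Each Note_Text would be passed through WebEncode and, if changed, written back.
        rs.MoveNext
    Loop
    rs.Close
End Sub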
- Reference_Webrefs
- Searches are made for items of the form “+WnnnnW+”.
- There are length checks for “nnnn” – anything greater than 5 characters is rejected, as is any non-numeric string – like the one just appearing in this document!
- Valid IDs are used to interrogate the Webrefs_Table.
- If the reference is not found,
- “Missing Reference” is substituted for the “+WnnnnW+”.
- A debug message is output, and
- A row is added to WebRef_Missing_IDs.
- Otherwise,
- If the table row does not have the “defunct reference” set, the reference is formatted as a hyperlink, to open in the same tab, using the URL retrieved.
- The link name is defaulted to “Link”, but is overridden if a Display Text had been entered.
- If the “Show Link” parameter was set, the URL is displayed (in brackets).
- If the table row does have the “defunct reference” set, the (failing) hyperlink is retained, but with “Defunct” (in brackets) after it.
- If the Calling type <> “X”, then
- Cross_Reference_Add is called to add a cross-reference.
- A “name” HTML tag is added so that this place in the page can be linked to19.
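- A minimal sketch of the decoding step (the column names “URL”, “Display_Text” and “Defunct” are illustrative, and the cross-referencing, “name” tag, and logging to WebRef_Missing_IDs are omitted):-
' Illustrative only. strID is the "nnnn" found between "+W" and "W+".
Public Function DecodeWebRef(ByVal strID As String, Optional ByVal blnShowLink As Boolean = False) As String
    Dim varURL As Variant, varText As Variant, blnDefunct As Boolean
    If Len(strID) > 5 Or Not IsNumeric(strID) Then
        DecodeWebRef = "Missing Reference"
        Exit Function
    End If
    varURL = DLookup("URL", "Webrefs_Table", "ID=" & CLng(strID))
    If IsNull(varURL) Then
        DecodeWebRef = "Missing Reference"      ' the real routine also logs to WebRef_Missing_IDs
        Exit Function
    End If
    varText = Nz(DLookup("Display_Text", "Webrefs_Table", "ID=" & CLng(strID)), "Link")
    blnDefunct = Nz(DLookup("Defunct", "Webrefs_Table", "ID=" & CLng(strID)), False)
    DecodeWebRef = "<a href=""" & varURL & """>" & varText & "</a>"
    If blnShowLink Then DecodeWebRef = DecodeWebRef & " (" & varURL & ")"
    If blnDefunct Then DecodeWebRef = DecodeWebRef & " (Defunct)"
End Function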
- Webrefs_Update
- This function requires its own Note20, though most of its functionality was described above under “Code / Functions, section 2”.
- However, as noted above, it checks the URLs in Webrefs_Table against the Web, and updates the table with the URL returned, where this differs from the requested, and logs any problems encountered.
- For Timeout re-checks, the number of checks is doubled to 6,000, and any Timeout more than an hour old is selected.
- Translate_Webrefs
- This function translates selected +WnnnnW+ references to +WmmmmW+ in the tables containing the relevant objects: namely, Authors, Books (Abstracts & Comments, separately), Notes, Notes_Archive & Papers (Abstracts & Comments, separately).
- The code is a bit clunky, having 7 loops in sequence.
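- The core of the translation might look like this sketch (the column names Old_ID and New_ID in Webrefs_Translation are guesses):-
' Illustrative only. Apply every nnnn -> mmmm mapping in Webrefs_Translation
' to one object's text.
Public Function TranslateWebRefs(ByVal strObjectText As String) As String
    Dim rs As DAO.Recordset
    TranslateWebRefs = strObjectText
    Set rs = CurrentDb.OpenRecordset("Webrefs_Translation", dbOpenSnapshot)
    Do While Not rs.EOF
        TranslateWebRefs = Replace(TranslateWebRefs, _
            "+W" & rs!Old_ID & "W+", "+W" & rs!New_ID & "W+")
        rs.MoveNext
    Loop
    rs.Close
End Function
- Note that, because each token ends in “W+”, replacing “+W12W+” cannot accidentally match part of “+W123W+”.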
- Map_Webrefs
- This routine creates the table WebRef_Maps, which shows which WebRefs feature in which objects.
- Rows are created for all the primary objects; ie. for Authors, Books, Notes, Notes_Archive and Papers.
- The grunt work is undertaken by Map_WebRefs_Mapper.
- The error-handling statement “Resume Next” is used to circumvent the duplicate-key error when the same WebRef appears more than once in the same Object.
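- A minimal sketch of the mapping step for one object (the WebRef_Maps column names are guesses; the “Resume Next” mirrors the duplicate-key handling just described):-
' Illustrative only. Record every +WnnnnW+ token found in one object in WebRef_Maps.
Public Sub MapWebRefsInObject(ByVal strObjectType As String, ByVal lngObjectID As Long, ByVal strText As String)
    Dim lngPos As Long, lngEnd As Long, strID As String
    Dim db As DAO.Database
    Set db = CurrentDb
    On Error Resume Next                     ' swallow duplicate-key errors
    lngPos = InStr(1, strText, "+W")
    Do While lngPos > 0
        lngEnd = InStr(lngPos + 2, strText, "W+")
        If lngEnd = 0 Then Exit Do
        strID = Mid(strText, lngPos + 2, lngEnd - lngPos - 2)
        If IsNumeric(strID) And Len(strID) <= 5 Then
            db.Execute "INSERT INTO WebRef_Maps (WebRef_ID, Object_Type, Object_ID) " & _
                       "VALUES (" & CLng(strID) & ", '" & strObjectType & "', " & lngObjectID & ")", dbFailOnError
        End If
        lngPos = InStr(lngEnd + 2, strText, "+W")
    Loop
    On Error GoTo 0
End Sub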
- Display Webrefs Errors
- I don’t think I’ve got more to say that was covered in the relevant sub-section of the Code / Functions section.
- External Links Test Pages.
- I don’t think I’ve got more to say that was covered in the relevant sub-section of the Code / Functions section.
Outstanding Developments
- The primary issue is that I run most of my procedures on my new Desktop machine, which runs Windows 11. However, Internet Explorer won’t run under Windows 11, only under Windows 10 on my Laptop.
- I have a manual work-around which simply involves copying my main Access database (Web_Generator_Prototype.accdb) from my Desktop to my Laptop, running the WebRefs checker on the laptop and then copying the database back again.
- However, in the future it would be best to use Microsoft Edge if I can work out how to do it. I asked Bing’s AI and it came up with the following21:-
- You will also need to install SeleniumBasic, a library that allows you to control browsers from VBA.
→ See Stack Overflow: Change from IE to MS Edge Browser, VBA web Scraping.
Here are the steps to set up your environment:
- Install Microsoft Edge (Chromium) or Microsoft Edge (EdgeHTML) on your computer. To confirm that you have Microsoft Edge (Chromium) installed, go to edge://settings/help in the browser, and verify the version number is Version 75 or later.
→ See Stack Overflow: Automating Edge Browser using VBA without downloading Selenium.
Download the corresponding version of Edge WebDriver from this link: Microsoft Edge WebDriver.
→ See Stack Overflow: Automating Edge Browser using VBA without downloading Selenium.
Find the path of SeleniumBasic which is C:\Users\%username%\AppData\Local\SeleniumBasic in my situation (it might also be in this path C:\Program Files\SeleniumBasic ), copy the Edge WebDriver msedgedriver.exe to this path.
→ See Stack Overflow: Change from IE to MS Edge Browser, VBA web Scraping.
In your VBA editor, go to Tools > References and check Selenium Type Library.
→ See Stack Overflow: Change from IE to MS Edge Browser, VBA web Scraping.
Once you have done these steps, you can use VBA code to create a web spider using Edge and Selenium. For example, you can use this code to open a webpage and get its title:
Dim driver As New Selenium.WebDriver
driver.Start "edge", "https://www.bing.com"
driver.Get "/"
MsgBox driver.Title
You can find more examples and documentation on how to use Selenium with VBA here: https://github.com/florentbr/SeleniumBasic/wiki. This link produces a warning message (in French!).
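- If this infrastructure does work, the IE-based URL check might translate into something like the following minimal sketch (untested, and it assumes SeleniumBasic exposes a Url property alongside Title):-
' Illustrative only. Returns the URL the browser ends up at, for comparison
' with the requested one, as the IE routine does now.
Public Function CheckOneURLEdge(ByVal strRequested As String) As String
    Dim driver As New Selenium.WebDriver
    driver.Start "edge"
    driver.Get strRequested
    CheckOneURLEdge = driver.Url        ' assumes SeleniumBasic provides a Url property
    driver.Quit
End Function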
In-Page Footnotes:
Footnote 2:
- This functionality – whatever it is – needs to be replaced!
Footnote 3:
- There will then be no further checks in subsequent runs.
- cmdWebLinks_Click issues a warning to perform a manual re-check if the latest check was more than 3 months ago.
Footnote 4:
- This process requires further explanation.
- I’d have preferred not to use IE, but it was the only one whose interface from MS Access I could find.
- One day I may be able to re-write the reference-checking routines in Python, or some other more appropriate language.
Footnote 5:
- I’m not sure what this is for, but will leave investigation until later!
- I suspect it has to do with “restarts” being required in a full run.
Footnote 6:
- ie. closed and re-opened.
- Internet Explorer frequently becomes unresponsive; this is a process to limit the occurrences, but doesn’t always work.
- It is necessary to use Task Manager to close any orphan copies of IE; this isn’t always necessary for one of the short runs, but is always so for those that have to be re-started.
Footnote 7:
- For “dud” read page or document not found, or Returned URL differs from the Requested URL in a non-trivial way.
Footnote 8:
- Because these pages were getting so large, they are now split into a number of pages, with suffixes.
Footnotes 9, 10: See the list later on …
Footnote 11:
- A manual check and deletion is needed to delete un-regenerated pages from the live site.
- This condition occurs when errors are fixed and the number of error-pages reduces.
Footnote 12:
- The query links to another copy of the Webrefs_Table to provide information on the status of the “Alt_Ref”.
- This “Alt_Ref” needs explanation. It was provided because some conditions could be fixed by changing the URL on Webrefs_Table, but this was not possible as the new URL was already “taken”, with its own ID.
- “Alt_Ref” quotes this ID, but the “wheeze” hasn’t altogether worked, and requires investigation.
Footnote 13:
- This is the core of the URL, with HTTP or WWW prefixes and any trailing slash omitted.
- They are maintained by the query WebRef_Kernels_Updt, which is invoked whenever cmdWebLinks_Click is clicked.
- The purpose of this list is to alert to situations where the URL cannot simply be amended, but where use of the (Translate_Webrefs) facility is more appropriate.
Footnote 15:
- I check the preceding 10 characters for an “HREF”.
Footnote 16:
- While this is fine, in a way, there are at least two issues with this:-
- Sometimes the prefix should be “HTTPS”, which is the URL that is returned, causing a subsequent mismatch.
- I’ve forgotten the second – or at least can’t articulate it. It’s another mismatch issue.
- Return to this footnote later!
Footnote 17:
- For some reason, I add a “.” after translating the URL to “+WnnnnW+” format.
- Check this – do I still?
Footnote 18:
- This is to allow for URLs that are intended as text, like “WWW” or “HTTP”.
Footnote 19:
- I’m not sure whether this functionality is used, but it sounds useful to determine which Webref IDs are used.
Footnote 21:
- I’ve reformatted the text slightly and added the hyperlinks as ‘See xxx’ addenda.
- I’d come across Selenium before by doing my own Googling.
- I don’t yet know whether this advice is consistent or whether it will work, but it’s the best I’ve got!
- Even after I get the infrastructure to work, I dare say there will be lots of detailed code changes.
Text Colour Conventions
- Blue: Text by me; © Theo Todman, 2023
- Mauve: Text by correspondent(s) or other author(s); © the author(s)