Theo Todman's Web Page - Notes Pages
Website Documentation
Website Generator Documentation - Cross-Referencing
(Text as at 11/04/2022 00:01:26)
Introduction
- This document covers1 the use of the table Cross_Reference, both in how it is maintained, and what it is used for.
- In the course of writing this Note I will hopefully complete item 193 in my Development Log, ie. “Complete XRef-re-engineering project”:-
- Ensure all links and link-pages use the new XRef table, and pension off the old tables.
- Write out2 specific object-identifiers, and linking thereto for Citations, rather than paragraph references. An issue – hopefully fixed – is of multiple instances of the same object in a document.
- Comment out the global variable NameRef and fix any compilation errors.
- Check all link-types still work and fix any errors.
- Complete the auto-triggering of regeneration of “associated” link pages.
- Fix update bug in Convert_Webrefs.
- Fix Bug whereby PaperSummary pages seem to have “Works-” and “Books/Papers-” Citings that refer to the same link-pages.
- Document the process!
- This is probably one of the most complex parts of my website generator, but the cross-referencing function is really the reason I embarked on this project in the first place.
The Cross_Reference Table Itself
- The record-counts appear in the following table (provided by Functor_21 using query Cross_Reference_By_Type):-
Type_Calling ↓ | Type_Called → | A | B | F | I | N | N_A | P | W | TOTAL |
A | | 39 | 37 | | | 261 | | 34 | 385 | 756 |
B | | 1,021 | 1,085 | | | 2,171 | | 748 | 853 | 5,878 |
N | | 2,663 | 5,279 | | 2,298 | 12,610 | | 21,540 | 4,404 | 48,794 |
N_A | | 17,833 | 35,547 | | 8,024 | 3,394 | 45,312 | 92,180 | 32,423 | 234,713 |
P | | 5,000 | 6,957 | 4,425 | 26 | 30,573 | | 23,495 | 7,854 | 78,330 |
TOTAL | | 26,556 | 48,905 | 4,425 | 10,348 | 49,009 | 45,312 | 137,997 | 45,919 | 368,471 |
---|
- Key:-
- A = Author
- B = Book
- I = Image
- N = Note
- N_A3 = Archived Note
- P = Paper
- W = WebRef
- “Calling” types are in the first column, “called” types are the other column headings.
- What is meant be “calling” is that references to objects of the “called” type appear in objects of the “calling” type.
- Note that Images and WebRefs, by their nature, can be called, but cannot call.
- The table ought to be self-regulating, and maybe it is. However, there are (as of 01/03/2022 05:16:16, using Functor_22 & query Cross_Reference_By_Year) the following counts of records on the table with timestamps in the years below:-
- 2015: 936
- 2016: 68
- 2017: 3
- 2018: 766
- 2019: 1
- 2020: 4,270
- 2021: 622
- 2022: 361,805
- Since the website is fully regenerated on a monthly basis, cross-references prior to the latest regeneration should not exist, and require investigation. Presumably, any such are either the result of errors during development or of interventions in the system. There are (according to Functor_23, option 1, using query Dud_Cross_References_This_Year) no record for 2022 prior to the February regeneration, which is encouraging. I have – this is down to an accidental duplicate Paper that I deleted. There needs to be a method for hoovering these up!
- In addition to the Cross_Reference table there are the following related tables:-
- Cross_Reference_Changes
- Cross_Reference_Zapper
- The Cross_Reference table is maintained by the following generic Routines & Queries:-
- Two of these queries are particularly time-consuming – in that they take about a second, which multiplies up drastically when run thousands of times. They are both Append queries, discussed later:
- A potential issue is the table’s primary key – an ID which is a Long integer with a maximum value of 2,147,483,647. It is needed to enable the deletion of records. These are replaced wholesale, so this ID is romping up. Using Functor_23 and query Cross_Reference_MaxID, as of 01/03/2022 05:16:18 it is 36,507,359 - but it's taken 7 years to get this far (though the number of links has been rising). So, it looks as though it won’t be under pressure for a while! It would be possible to reset it, or – given I have a 64-bit system – change the data type to LongLong, which has a maximum value of 9,223,372,036,854,775,807.
Detailed Processing
- Cross_Reference Deletions
- Zap_Cross_References deletes any old links prior to the addition of the new cross-references. It is called from the modules that generate the pages for which these cross-reference are being generated. So, it is called from:-
- Cross_Reference Additions
- Cross_Reference_Add is a simple routine that just adds rows to the table, based on a string of parameters supplied. However, it doesn’t add Cross References for Note 8744, which is the temporary Note used for the generation of +LL+ “Links” pages.
- The functions that directly call Cross_Reference_Add are those that convert the “+ΧΧ+” references in text of whatever sort to hyperlinks, ie:-
- Both Notes and Archived Notes are covered by the same routines.
- The Name reference is of the format Xnnnn_i, where:-
- X = A, B, N or P (Images don’t have Name references and WebRefs have their own method).
- nnnn is the object ID
- i is an incremental counter for the number of times this object has appeared in the calling object (determined by a query on the Cross_Reference table itself.
- Cross_Reference_Changes
- Following the first round of investigation and documentation, I’ve decided to delete all rows from this table more than 40 days old (or prior to the last Website Regen (as determined by query Website_Regen_Last_Run_Start) if this is earlier), using the Sub Cross_Reference_Changes_Prune (which uses Cross_Reference_Zapper). This is a temporary expedient until I introduce changes for non-Notes (Notes are already fully implemented). I’ve done this to see if it improves performance, which does seem to be the case.
- The record-counts now appear in the following table (provided by Functor_21 using query Cross_Reference_Changes_By_Type):-
Type_Calling ↓ | Type_Called → | A | B | N | P | W | TOTAL |
B | | | 45 | | | | 45 |
N | | 22,723 | 5,110 | 4 | 14,558 | 14,487 | 56,882 |
P | | 4,406 | 172 | | 473 | 6,544 | 11,595 |
TOTAL | | 27,129 | 5,327 | 4 | 15,031 | 21,031 | 68,522 |
---|
- Key:-
- A = Author
- B = Book
- I = Image
- N = Note
- N_A6 = Archived Note
- P = Paper
- W = WebRef
- Calling types are in the first column, called types are the other column headings
- Note that Images and WebRefs, by their nature, can be called, but cannot call.
- This table (according to Functor_23, option 3) has 68,522 rows, as of 01/03/2022, split by month (using Functor_22, Cross_Reference_Changes_By_Month):-
- 2022_01: 15,545
- 2022_02: 52,977
- Rows are added using two complex queries, but before describing them it’s worth describing what’s been going on. The table Cross_Reference_Zapper is populated with all the cross-references from the changed calling objects held in Cross_Reference, prior to the new ones being added in. They are removed from the Cross_Reference table ready for these new cross-references to be loaded. By the time we get to adding rows to Cross_Reference_Changes, the changes to Cross_Reference have already been applied, but comparison with Cross_Reference_Zapper tell us which pages to regenerate based on both deleted and added cross-references.
- So, the queries are:-
- Cross_Reference_Changes_Deletions_Add is run first. If anything that was deleted hasn’t been replaced, the called pages have to be regenerated.
- Cross_Reference_Changes_Additions_Add which is slow because of an inner join to the query Cross_Reference_Latest (which is a summation query on Cross_Reference_Zapper) and an outer join to the table Cross_Reference_Zapper (for which, see below).
- Something very cunning is going on here! Pages have to be regenerated whenever objects that call them have references either added or deleted, hence the two queries. Also, there needs to be some conflict avoidance.
- In order to improve the run-times of a full website regeneration (where variable Full_Regen is set to True), I’ve removed the updates of Cross_Reference_Changes (but not – of course – of Cross_Reference) from all places where they are invoked. Improvements (as determined by Functor_23, options 4 – 8) have been:-
- CreateAbstractWebPages (Paper Abstracts: run time has reduced from 8.17 hours to 1.68 hours on 06/02/2022)
- CreateAuthorsWebPages (Authors: Had already reduced to 16 minutes; now 10 minutes on 06/02/2022).
- CreateBookPaperAbstractsWebPages (Book/Paper Abstracts: run time reduced from 72 minutes to 13 minutes on 06/02/2022).
- Notes_Text_Format
→ Notes: run time reduced from 3.62 hours to 50 minutes on 06/02/2022.
→ Notes Archived: run time reduced from 2.32 hours to 1.62 hours on 06/02/2022.
This is a sensible move because – on a full re-gen – all pages are being regenerated in any case.
- Rows are deleted by cmdRecalculate_Click using SQL driven by table Page_Regen, but only for Called_Type of “N”. So, the table only contains a few very recent rows of this type, but multitudes of rows for others, as is shown in the table above. I need to explain why this is the case: if looks like deletions may just have been forgotten.
- So, what is the table actually used for? Most usages are either diagnostic or maintenance, and the only serious one seems to be Page_Regen_GEN, also invoked by cmdRecalculate_Click.
- I suspect a fault in that this function regenerates the wrong pages. So, we might be on to something here! However, most pages – ie. authors, book and paper summaries – are regenerated by the badly-named cmdPaperSummaries_Click.
- On investigation, using query Page_Regen_GEN_Test, a non-updating version of Page_Regen_GEN, there were (before Cross_Reference was truncated to the latest 40-days) 21.1k rows output to Page_Regen, including 4 to Author ID=0 and 2 to Image ID=0 (but these represented over 100k and 3k rows, respectively). Not sure of the purpose of including Images since they don’t have pages to regenerate (WebRefs are already excluded for that reason).
- Table Page_Regen is then used 4 times in cmdRecalculate_Click:-
- to warn how many Notes with be regenerated
- to delete all its rows
- to regenerate all its rows, as above
- to regenerate all “called” Notes based on the rows just created.
- No queries use the table other than in the circumstances just listed. So, it seems that the table is not used other than to regenerate Notes implicated in changes to other objects (including Notes).
- Hence, it looks like the functions envisaged for the Cross_Reference_Changes table have not been fully implemented, and that it can be truncated until they have been!
- Note that it’s not straightforward to fully implement regeneration of the “impacted” pages, as some are cross-references … more on this later.
- I now delete all rows more than 40 old days in cmdRecalculate_Click.
- Cross_Reference_Zapper
Use of Links in Cross-Reference Pages
- There are several consolidated sets of pages that are produced that show the links between various (classes of) objects on my website.
- In general, there’s an alphabetic jump table with links to Objects with titles or Authors starting with the letter in question. These jump tables link to Alphabetical lists which themselves link to individual Object-specific pages.
- The Object-Specific pages are themselves linked to from the Summary or Abstract pages for these Objects.
- These pages and routines have grown up over time and were rushed out when needed using a variety of routines using inconsistent methods with a lot of code duplication. The whole lot needs sorting out! In particular, many routines use individual cross-reference tables noted in the next section, rather than Cross_Reference.
- The list the jump tables is as below:-
- Before we get down to details, I note here that several blocks of functionality are responsible for regenerating some of the temporary tables needed for functionality or performance reasons and for generating pages that display the linkages in one form or another.
- They are segregated in accord with the documentation stubs I’ve set up, which may or may not be the most appropriate … but it’ll do for now!
- The process-groups are:-
- I’ve made a stab at documenting the first three of these. The others require documenting, which I intend to do in the first instance insofar as this is relevant to the purposes of this Note.
- Eventually, I’ll incorporate cross-reference-relevant points from these Notes in this one.
- So, these pages are generated by the following routines:-
- Further Comments on the Above:-
- Authors Summary
- These pages list all Authors, their Books and Papers (plus other authors with the same surname) and summarises their citations, with links to the corresponding Authors, Books, Papers and Notes that cite the Author.
- At the lowest level are the pages for individual Authors, created by CreateAuthorsWebPages. This has been updated to use the new reference naming convention.
- Citations are added using a union query (Authors_Cited_By_All_List_Query) of queries that all use Cross_Reference:-
→ Authors_Cited_By_Authors_List
→ Authors_Cited_By_Books_List
→ Authors_Cited_By_Notes_List
→ Authors_Cited_By_Papers_List
- This query is invoked by Authors_Cited_By_All_List_Gen to generate Authors_Cited_By_All_List.
- This table is used by AddCitations_List, which adds the actual citations, and has been updated to use the new referencing convention.
- The lists of the Author’s Books and Papers still needs to use the tables:-
→ Author_Book_Links. and
→ Author_Paper_Links
Because items can appear on Cross_Reference for other reasons, and Author won’t appear against Book or Paper in this table unless the Author’s name is explicitly encoded in the Abstract or Comment.
- To be completed!
- Book Citations
- This set of pages lists Books that are referenced by Papers or other Books.
- The Jump and Alphabetical lists are created by WebpageGenBookCitings (with parameter "Yes" for a full regen).
- The individual Book pages are generated by WebpageGenBookCitingsPage.
- During a full re-gen, the query is Book_Citings_List_New_List,driven from the materialised view – table Book_Citings_List_New, created by Book_Citings_List_New_Gen, using Book_Citings_List_New_Query, which is itself a union of:-
→ book_Book_List_New, and
→ Book_Paper_List_New
Both of which already use the Cross_Reference table.
- Otherwise, driven from the query Book_Citings_List_Changed_New, which is itself a union of:-
→ Book_Book_List_Changed_New, and
→ Book_Paper_List_Changed_New
Both of which use a rather complex series of queries … yet to be documented!
- The individual Papers pages are generated by CreateBookCitingsWebPages, which has now been tweaked to reflect the new referencing convention.
- Books-to-Notes Link Pages
- This set of pages seeks to identify those Books that are cited by Notes, divided into those that are only cited by Archived Notes, and those that are (also or only) cited by Live Notes.
- The query that selects the list of Books, and makes the segregation is Note_Book_Links_Prelim. There was a bug in the segregation algorithm, now hopefully fixed.
- The individual pages are produced by CreatePapersToNotesWebPages, with parameter ‘Book’.
- This, in turn, calls BooksToNotes_Prelims, which uses BooksToNotes_Prelims_GEN to create Cross_Reference_Prelims, items from Auto-XRef Note 874 being removed.
- This looks like it should all work as the above already uses Cross_Reference.
- But CreatePapersToNotesWebPages had to be fixed for the new referencing convention.
- It looks like there’s a problem with the referencing from the individual pages during a full re-gen.
- Paper Citations
- This set of pages lists Papers that are referenced by Books or other Papers.
- The Jump and Alphabetical lists are created by WebpageGenPaperCitings (with parameter "Yes" for a full regen).
- The individual Paper pages are generated by WebpageGenPaperCitingsPage.
- During a full re-gen, the query is Paper_Citings_List_New_List,driven from the materialised view – table Paper_Citings_List_New, created by Paper_Citings_List_New_Gen, using Paper_Citings_List_New_Query, which is itself a union of:-
→ Paper_Book_List_New, and
→ Paper_Paper_List_New
Both of which already use the Cross_Reference table.
- Otherwise, driven from the query Paper_Citings_List_Changed_New, which is itself a union of:-
→ Paper_Book_List_Changed_New, and
→ Paper_Paper_List_Changed_New
Both of which use a rather complex series of queries … yet to be documented!
- The individual Papers pages are generated by CreatePaperCitingsWebPages, which has now been tweaked to reflect the new referencing convention.
- Papers-to-Notes Link Pages
- This set of pages seeks to identify those Papers that are cited by Notes, divided into those that are only cited by Archived Notes, and those that are (also or only) cited by Live Notes.
- The query that selects the list of papers, and makes the segregation is Note_Paper_Links_Prelim. There was a bug in the segregation algorithm, now hopefully fixed.
- The individual pages are produced by CreatePapersToNotesWebPages, with parameter ‘Paper’.
- This, in turn, calls PapersToNotes_Prelims, which uses PapersToNotes_Prelims_GEN to create Cross_Reference_Prelims, items from Auto-XRef Note 874 being removed.
- This looks like it should all work as the above already uses Cross_Reference.
- But CreatePapersToNotesWebPages had to be fixed for the new referencing convention.
- It looks like there’s a problem with the referencing from the individual pages during a full re-gen.
- Pages from which individual link-pages are linked:-
- Authors: CreateAuthorsWebPages: This was covered adequately under ‘Authors Summary’ above.
- Book Summaries: CreateBookSummariesWebPages.
- Authors Citings Link: uses SQL on Cross_Reference.
- Book Citings Link: This is allegedly included in Paper Citings, but I couldn’t see that it was.
- Note Citings Link: Doesn’t provide a link.
- Paper Citings Link: uses table Book_Citings_List_New.
- Book-Paper Abstracts: CreateBookPaperAbstractsWebPages
- Authors Citings Link: Not provided.
- Book Citings Link: This is allegedly included in Paper Citings, but I couldn’t see that it was.
- Note Citings Link: uses query Book_Note_Counts, which runs off the legacy table Note_Book_Links.
- Paper Citings Link: uses table Book_Citings_List_New.
- Notes: CreateNotesWebPages, calling OutputNotesWebPage, which does the work. Important items in this context are:-
- Regenerate the Links: If selected, this runs the following procedures:-
- Regen_Note_Links
- Regen_Note_Links_Archived
- Regen_Note_Book_Links
- Regen_Note_Paper_Links
- Summary of Note Links from this Page: Uses Cross_Reference.
- Summary of Note Links to this Page: Also uses Cross_Reference. The code has now been updated to use the new referencing standard.
- Authors, Books & Papers Citing this Note: Uses AddCitations_List, which adds the actual citations, and has been updated to use the new referencing convention.
- References & Reading List: Uses AddReading_List. There is something seriously wrong with this procedure. It includes lots of irrelevant stuff, and omits lots of works actually cited.
- Notes_Archived: Re-created by OutputNotesWebPage_Archived. Important items in this context are:-
- Summary of Note Links from this Page: Uses Cross_Reference.
- Summary of Note Links to this Page: Also uses Cross_Reference. The code has now been updated to use the new referencing standard.
- Authors, Books & Papers Citing this Note:
- References & Reading List:
- Paper Summaries: CreatePaperSummariesWebPages
- Authors Citings Link: uses SQL on Cross_Reference.
- Book Citings Link: This is allegedly included in Paper Citings, but I couldn’t see that it was.
- Note Citings Link: Doesn’t provide a link.
- Paper Citings Link: uses table Paper_Citings_List_New.
- Paper Abstracts: CreateAbstractWebPages
- Authors Citings Link: uses SQL on Cross_Reference.
- Book Citings Link: This is allegedly included in Paper Citings, but I couldn’t see that it was.
- Note Citings Link: uses query Paper_Note_Counts, which runs off the legacy table Note_Paper_Links.
- Paper Citings Link: uses table Paper_Citings_List_New.
Improvements and Rationalisation Required
- Once the process has been fully documented, it needs to be improved to make the links more stable throughout the month between site-regenerations. The in-page links now contain the ID of the called object, so that the return link is exact. Where the same object is referenced multiple times within a document there is an incremental number as well.
- The Cross_Reference table was designed to be a generic replacement for – or, maybe, a supplement to16 – various cross-object tables:-
- The majority of the above tables should probably have been retired by now! However, I need to be careful as some may be used as temporary tables, ultimately populated from Cross_Reference.
Performance Improvements
- Further, I note here that there are some very complex queries that dig out the links when pages are being generated. During a full regeneration, these queries (that take a second or so to run) mount up when run thousands of times and bloat the site-regeneration time to around 11 hours. One of the main purposes of this documentation is to determine where the pinch-points are, to avoid unnecessary processing and to “materialise” certain “views” (ie. convert queries to tables).
- The materialised views currently developed bloat the size of the database, so a more subtle method may need to be developed in due course. They are:-
In-Page Footnotes:
Footnote 1:
- Or will do, once it has been completed!
Footnote 2:
- This was implemented on 2nd Feb 2022, but the rest of the processing isn’t joined up yet!
Footnote 3:
- The Type_Calling and Type_Called are actually “N”, just as for Live Notes, but I’ve added the “_A” to distinguish the two types where the corresponding Calling or Called Timestamps are non-zero.
Footnote 5:
- It looks like this is how it works – in that I changed Notes_Text_Format to adopt the new Nnnnn_i Name convention, and Note footnotes were left using the old convention, and then changed when I updated Reference_Notes (or was it the other way round!).
- However, why is it necessary? Other objects in footnotes changed happily enough immediately.
- Also, Notes_Text_Format only deals with Notes referenced in Notes.
- Is this because Note links differ in in their encoding in Footnotes to that in main text?
- If so, is there an issue with this convention elsewhere – eg, in Footnotes in Paper Abstracts, for instance?
- While I’m investigating all this, I need to sort the issue whereby the process loops if I forget the Note referencing format within Footnotes.
Footnote 6:
- The Type_Calling and Type_Called are actually “N”, just as for Live Notes, but I’ve added the “_A” to distinguish the two types where the corresponding Calling or Called Timestamps are non-zero.
- For some reason, all these rows and columns have disappeared.
- There were rows (but – unsurprisingly – no columns) in the two earliest versions of this Note.
- Investigate in due course – it may not actually be an issue.
Footnote 16:
- It’s use is that it is two-way: it shows those objects that have been impacted by changes to other objects, in that the links to it have changed.
- This functionality has only been implemented for Notes, it seems.
Table of the Previous 4 Versions of this Note:
Note last updated |
Reference for this Topic |
Parent Topic |
11/04/2022 00:01:26 |
1300 (Website Generator Documentation - Cross-Referencing) |
None |
Summary of Notes Referenced by This Note
To access information, click on one of the links in the table above.
Summary of Notes Citing This Note
To access information, click on one of the links in the table above.
Text Colour Conventions
- Blue: Text by me; © Theo Todman, 2023