When I first started this site, I made an effort to check each web page to ensure that its links were valid and that the HTML syntax was correct. Laziness soon settled in and I never seemed to get around to verifying new web pages in the rush to post them and get on with new interests. Having just renewed the contract with the web hosting provider for this site, it seems like a good time to go back to better work habits. Unfortunately, some of the tools that I used long ago are no longer maintained or even available. Luckily, I did find useful tools that I hope to use in the future and, retroactively, on the files already on the site.
This choice of tools is admittedly idiosyncratic, reflecting constraints imposed by some design decisions and particular requirements. Among the latter was the wish to use local tools instead of web-based applications. I also did not wish to install any Python scripts that required the use of Python version 2.7.x which has been deprecated for a number of years. I am sure that there are other tools and it could be worthwhile to seek them out.
Counting and Listing Files
Over the years, numerous files have been added to this site. I was curious to get a handle on their number. A short bash script can do that.
Here is the script, which could easily be modified to meet other needs as will be shown later.
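In essence, it boils down to counting regular HTML files under the root of the local copy of the site with `find`. Here is a minimal sketch; the root path is the one used throughout this post and the output labels are illustrative.

```bash
#!/bin/bash
# Minimal sketch of the counting script -- output labels are illustrative
root="/var/www/html/michel/"
echo "HTML files   : $(find "$root" -type f -name '*.html' | wc -l)"
echo "French files : $(find "$root" -type f -name '*_fr.html' | wc -l)"
echo "English files: $(find "$root" -type f -name '*_en.html' | wc -l)"
```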
Among the changes that could be made, listing all the HTML files in the site along with their relative path could be useful. While the `find` utility could be used, its alphabetical sorting of files is not the best. So I installed the `tree` package and used that utility because it sorts directories and files separately, which is what I want.
The utility's help message provides a list of command line options and by using a few of these the output is almost what I want.
Unfortunately, unneeded bare directory names are in the list. There is also unnecessary repetition of the root directory, `/var/www/html/michel/`, at the start of each file name. On the other hand, it can be useful to show the list of files as URLs. This is what is done by default in this final version of the script.
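In essence, this final version looks something like the following sketch; the real script, which can be downloaded further on, differs in its details and the prefix URL is an assumption.

```bash
#!/bin/bash
# Sketch only -- the real sitestats script (downloadable below) differs in its details
root="/var/www/html/michel/"           # root directory of the local copy of the site
prefix="https://sigmdel.ca/michel/"    # by default the files are listed as URLs
len=${#root}                           # number of characters to strip from each path

tree -fi -I 'index.html' "${root%/}" | sed -r "s/.{$len}//" | sed '/html$/!d' | sed -e "s#^#$prefix#"
```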
As can be seen, `sed`, the stream editor for filtering and transforming text, is used multiple times in the last line. Here is a quick explanation of each step.

- `sed -r "s/.{$len}//"` eliminates the first `len` characters from each string. In other words, this removes the root directory from the full path.
- `sed '/html$/!d'` eliminates any line from `stdin` that does not end in `html`. In other words, this removes bare directory names.
- `sed -e "s#^#$prefix#"` prepends the prefix to each line. Since the variable `prefix` may contain the slash character "/", the hash character "#" is used to separate the strings in the `sed` command.
There are two index files in the root directory of my site, `index_fr.html` and `index_en.html`, while the default index file, `index.html`, is a symbolic link to one of these files. Some care is needed to ensure that the symbolic link is neither counted nor listed as an HTML file. This is the reason for the `-type f` option, which ensures that the `find` command lists only regular files. Similarly, the exclusion flag `-I 'index.html'` is used in the `tree` command.
There are other complications on my site. Some HTML files are not posts but are meant to be downloaded. The web site on my desktop computer contains a directory, called `local`, with files not copied to the publicly available site. These anomalies explain the fact that the sum of French and English HTML files is less than the number of HTML files. Since I do not want these files in the list of HTML files generated by the script, I have included yet another pass through `sed`.
- `sed "/\/dnld\/\|^local/d"` eliminates any line containing `/dnld/` and any line beginning with `local`. Insert this pass through the filter just before the pass that inserts the path prefix with `| sed -e "s#^#$prefix#"`, as shown below.
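Using the variables from the sketch above, the last line of the script then becomes:

```bash
# dnld and local entries are removed just before the URL prefix is added
tree -fi -I 'index.html' "${root%/}" | sed -r "s/.{$len}//" | sed '/html$/!d' | sed "/\/dnld\/\|^local/d" | sed -e "s#^#$prefix#"
```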
The script can be downloaded: sitestats. Save the file in a directory in the search path such as `~/.local/bin` and make it executable. Don't forget to eliminate the `| sed "/\/dnld\/\|^local/d"` pipe if there is no need to deal with `dnld/` and `local/` directories.
Typically, I use that script as follows.
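That is, something along these lines, with the output redirected to a file whose name is of no importance:

```bash
# Default output: the list of HTML files as URLs
sitestats > websitefiles.txt
```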
This file contains a list of all the HTML files in the web site in alphabetical order, except for the final ten files which are the files in the root directory.
I imported that file into a spreadsheet which will be used to track some information on a file-by-file basis. In one column, I enter the date of the last time I checked the file with the Nu Html Checker (see next section), and the date of the last time the links in the file were checked is entered in a third column. The second example shows how to use `sitestats` to produce a list of HTML files that can be used to locally check the complete web site with Nu Html Checker and LinkChecker from the command line, as will be explained later.
Here is a look at the result.
It is possible to directly check all the local copies of the HTML files making up the web site with Nu HTML Checker without going through the web server. Here is how to generate the needed list of files.
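Assuming the prefix can be passed to `sitestats` as an argument (which is how my copy of the script works), the root directory itself is used as the prefix so that the list contains the true path of each file:

```bash
# The root directory doubles as the prefix, yielding full local paths
sitestats "/var/www/html/michel/" > localfiles.txt
```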
When run this way, the script starts by stripping the `/var/www/html/michel/` root from the path of each file and then ends by tacking it back onto the start of each path. I was just too lazy to rework `sitestats` after I found out that the true path of files could be used by the Nu Html Checker when run from the command line.
If the number of HTML files was incorrect because of downloadable files or because of local documents, then running the following script against any one of the three lists generated by `sitestats` will give the correct number of HTML files.
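Here is a minimal sketch of such a script; the filter pattern and the output labels are assumptions.

```bash
#!/bin/bash
# Speculative sketch -- recount the entries of a sitestats list after removing
# downloadable documents (dnld) and local-only documents; the list is the first argument
filtered=$(grep -v '/dnld/\|/local/' "$1")
echo "HTML files   : $(echo "$filtered" | grep -c 'html$')"
echo "French files : $(echo "$filtered" | grep -c '_fr\.html$')"
echo "English files: $(echo "$filtered" | grep -c '_en\.html$')"
```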
Here is an example of the output.
Nu Html Checker
The World Wide Web Consortium (W3C) provides tools for developers including the Nu HTML Checker (a.k.a. v.Nu). The W3C is adamant that its checker does not certify that a web page meets any standard.
The Nu Html Checker should not be used as a means to attempt to unilaterally enforce pass/fail conformance of documents to any particular specifications; it is intended solely as a checker, not as a pass/fail certification mechanism.
...
Why validate [then]?
...
To catch unintended mistakes—mistakes you might have otherwise missed—so that you can fix them.

Source: Nu HTML Checker
v.Nu can be used as a web-based tool as shown in the next subsection, but I prefer to install it on my desktop machine to run checks locally. How to do this is shown in the subsequent subsections.
Web-Based Checking
Click on the link: https://validator.w3.org/nu/ to access the validator. The web-based checker can verify only one file at a time. I prefer to test the local copy of my site which is on the same desktop machine in which the source code is edited.
As can be seen, the file `index.html` of the copy of the site on the desktop machine is being checked using the file upload method. That is because the validator will not use a local address such as `localhost/michel/index.html` (replacing `localhost` with `127.0.0.1` or the actual IP address of the desktop machine will not change anything).
The same file, available from the web site, can be checked using the address method as shown above. This is not as useful for me because I never correct the HTML file directly. When errors are found, I need to correct the GTML source code, generate the corrected HTML file with the preprocessor, and upload the corrected web page to my web hosting site before verifying the correction. It is much more straightforward to do all this on the desktop machine.
Local Installation of Nu Html Checker
The Nu Html Checker can also be installed locally, but it does require a Java runtime environment, version 8 or newer. As it happens, version 11 of `openjdk` is installed on my desktop machine.
So the prerequisite Java run time environment is installed. Otherwise, it can be installed easily with the usual package manager.
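Checking for, and if need be installing, a runtime goes something like this on an Ubuntu-based distribution such as Mint:

```bash
# Check which Java runtime, if any, is installed
java -version

# If none is found, install one with the package manager
sudo apt install default-jre
```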
Get the latest version of Nu Html Checker (v.Nu) from https://github.com/validator/validator/releases. Currently this is version 20.6.30. I copied the `zip` file into a subdirectory of my download directory (called `~/Téléchargements` on French language systems). I then extracted the content of the archive to my local binary directory, `~/.local/bin`.
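The commands were along these lines; the name of the subdirectory and of the archive reflect the version current at the time and should be adjusted.

```bash
# Illustrative -- adjust the directory and archive names to what was actually downloaded
cd ~/Téléchargements/vnu
unzip vnu.jar_20.6.30.zip -d ~/.local/bin
# The commands that follow assume the jar is now available as ~/.local/bin/vnu.jar;
# adjust the path if the archive extracted into a subdirectory.
```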
This local copy of the validator can be used immediately as shown in the next section. However, if you want to start the web server from the menu, then it is best to create a `.desktop` file. In that case I suggest getting a copy of the Nu Html Checker icon to be displayed in the system menu. Then create a `.desktop` file to add the checker to the Mint menu.
Here is the content of the file.
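It was along these lines; the paths to the jar and to the icon are placeholders that must point to the actual locations on the system.

```ini
[Desktop Entry]
Type=Application
Name=Nu Html Checker
Comment=Start the local HTML checker web service on port 8888
# Paths below are illustrative -- point them at the actual jar and icon locations
Exec=java -cp /home/michel/.local/bin/vnu.jar nu.validator.servlet.Main 8888
Icon=/home/michel/.local/share/icons/vnu.png
Terminal=true
Categories=Development;
```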
Notice the `Terminal=true` line. Usually, the terminal is hidden, but showing it was the easiest way I found to stop the checker once done with the application. Otherwise the process remains in the background until it is explicitly killed or the computer is rebooted.
Local Checking
To start the Nu Html Checker web server on the desktop, open a terminal from the system menu or with the keyboard shortcut Alt+Ctrl+T and enter the following command at the system prompt.
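Assuming the jar ended up in `~/.local/bin`, the command is along these lines (port 8888 to match the URL used below):

```bash
# Start the checker's built-in web service on port 8888
java -cp ~/.local/bin/vnu.jar nu.validator.servlet.Main 8888
```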
Another possibility is to use the menu entry. Search for Nu Html Checker and click on it. A terminal window will open and the Java program will be launched.
The difference in the IP address of the service is because of the difference in the way the checker was started.
Open the Nu Html Checker in a browser on the same computer using the following URL: http://localhost:8888. The same page as found at w3.org will be available locally.
As before, the Show source box is checked. When the Check button is pressed, not only will errors and warnings be shown but, following that list, the HTML source will be displayed with highlights corresponding to the errors and warnings. This makes it much easier to locate the errors in the corresponding GTML source file.
If the checker was started from the system menu, the process continues to run even after the connection to the application's web server is terminated. It can be stopped from the terminal in which the checker was launched by pressing the Ctrl+C key combination.
Checking from the Command Line
It is possible to validate more than one file at a time when using the Nu Html Checker from the command line.
The HTML files can be passed directly on the command line instead of going through the web server as shown above.
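For example, something like this, run from the directory holding the local copies of the files discussed below:

```bash
# Check a few files directly with the jar; output goes to the terminal
java -jar ~/.local/bin/vnu.jar index_fr.html index_en.html about_fr.html
```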
If there is no output, then there is no error according to the validator. That is what happened with the `index_xx.html` files. Obviously, there are errors in the `about_fr.html` file, which are clearly identified by their line and column coordinates. Checking all files in a directory is easily done. But be careful, this is recursive!
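A sketch of such a recursive check, using one of the site's subdirectories as an illustration:

```bash
# Every HTML file below the given directory is checked (the subdirectory is illustrative)
java -jar ~/.local/bin/vnu.jar --skip-non-html /var/www/html/michel/3d/
```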
Note how the actual directory containing the HTML files is specified just as if we were uploading each of the files in the directory as we did in the first example above. Trying to access the HTML files in that same directory through the local web server will not work in this case.
When a directory does not contain a default HTML file (typically named `index.html`), the local web server should not be called upon to obtain the HTML files.
How can there be only two errors or warnings in the second case? It is because the local web server transmitted a 403 error page, thus blocking access to the HTML files.
Note the addition of the `--stdout` option because, otherwise, the error and warning messages would have been sent to `stderr` and `wc` would not have seen them.
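In other words, counting the messages for a directory goes something like this:

```bash
# Count the error and warning messages; without --stdout they bypass wc
java -jar ~/.local/bin/vnu.jar --stdout --skip-non-html /var/www/html/michel/3d/ | wc -l
```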
Instead of recursively checking all files in a directory and hoping that the checker will see every file, I prefer to supply the list of files to the tool. Unfortunately, I have not found a way to do this and had to write a batch file which loops through each filename in the list passing it on to the Checker.
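The script, which I called `nvu`, is essentially a loop of the following sort; the path to the jar is an assumption.

```bash
#!/bin/bash
# nvu (sketch) -- run the Nu Html Checker on every file listed in the file given as argument
while read -r htmlfile; do
    java -jar ~/.local/bin/vnu.jar --errors-only "$htmlfile"
done < "$1"
```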
Add the `--verbose` option when running the checker if you want to see the name of the file being checked. As shown, only errors will be displayed on the terminal. As before, I saved that script in the `~/.local/bin/` directory and made it executable with the `chmod +x nvu` command.
To test the bash file, I created a file, `top_level.txt`, with the full path to all the HTML files in `/michel/`, the top level directory of my personal web site.
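The test then amounted to something like:

```bash
# top_level.txt lists the ten HTML files in the site root, one full path per line
nvu top_level.txt
```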
The script ran the local copy of Nu Html Checker against every file in that list and reported some errors.
Running the script against all files in the web site gave a disheartening total number of errors.
To get a better handle on what is going on, I modified the `nvu` script.
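The modification amounts to counting the messages reported for each file and writing the counts to a file; here is a speculative sketch.

```bash
#!/bin/bash
# Modified nvu (sketch) -- write "error-count filename" lines to result.txt
# so that the numbers can be imported into a spreadsheet
> result.txt
while read -r htmlfile; do
    n=$(java -jar ~/.local/bin/vnu.jar --stdout --errors-only "$htmlfile" | wc -l)
    echo "$n $htmlfile" >> result.txt
done < "$1"
```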
Only 8% of the HTML files on the site passed the syntax check. Clearly, I overestimated my diligence in checking the files. This could be more than the cognitive bias known as the overconfidence effect and borders on illusory superiority. Perhaps it confirms the findings of David Dunning and Justin Kruger. Wanting to assuage the pain to my bruised ego, I imported `result.txt` into a spreadsheet and discovered that a mere 10 files contained half the errors and 23 files accounted for two thirds of the errors. This highly skewed distribution of errors may be in large part attributable to the knock-on effect of some errors. Forget the trailing quotation mark on an inline style attribute or a hyperlink reference and chances are that the checker will report another two or three errors that will not need to be fixed. Incorrectly spell an internal style name in the `<head>` section and the error count will be increased by the number of times the style is used in the page. Besides, many so-called errors could just as easily be seen as warnings. They include things such as putting a width attribute in the opening tag of a table cell (as in `<td width="18">`) instead of using a style sheet. Do these observations manage to rehabilitate my sense of self-worth? Hardly; and what of invalid hyperlinks and spelling and grammatical errors? The mind shudders, but these things can also be checked.
Other Syntax Checkers
As stated in the introduction, there are numerous HTML syntax checkers. Here are a few that I have looked at.
HTML Tidy
Tidy, by the HTML Tidy Advocacy Community Group (HTACG, pronounced H-Task), is a "smart" HTML pretty printer or formatter. By smart I mean that the application will correct common mistakes such as mismatched end tags or missing end tags and so on. See What Tidy does in the documentation for more details. The same document says "It’s probable that you already have an outdated version of HTML Tidy. It comes pre-installed on Mac OS X and many distributions of GNU/Linux and other UNIX-type operating systems." However, this is not the case in Mint MATE 20.1.
While the repository does contain a `tidy` package, it is out of date. Accordingly, I downloaded the current `.deb` package to my `Downloads` directory (called `Téléchargements` on French language systems) and installed it with the `dpkg` utility.
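Installation was along these lines; the exact `.deb` file name depends on the release downloaded.

```bash
cd ~/Téléchargements
sudo dpkg -i tidy-5.8.0-linux-64bit.deb   # illustrative file name
```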
As far as I can tell, a `man` page is not installed, but there is extensive help from the command line. See it with the `tidy --help` command. Let's test-drive `tidy` on a file that Nu Html Checker found had no error.
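For example, on one of the index files that v.Nu had just passed:

```bash
# -e : report errors and warnings only, do not pretty print the file
tidy -e index_en.html
```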
Note that the `-e` option ensures the program lists errors only; there is no "pretty printed" output of the source file. Perhaps not surprisingly, `tidy` also reports that it found no errors in the file. If you want the discreet output usual for Linux utilities, then add the `-q` option.
Now let's compare the two when looking at a file which does have some errors.
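Something along these lines, using the file with known errors from before:

```bash
# Nu Html Checker's report
java -jar ~/.local/bin/vnu.jar about_fr.html
# Tidy's report (errors only, quiet)
tidy -e -q about_fr.html
```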
That's comforting, because the exact same errors are reported. The two programs are not identical by any means. As seen, Nu Html Checker can check CSS stylesheets, while Tidy can check accessibility.
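The accessibility checks are requested with an option such as the following; the level shown is illustrative, see `tidy --help` for details.

```bash
# Request accessibility checks at priority level 3 in addition to the usual error report
tidy -e -q -access 3 about_fr.html
```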
One can look up these `Accès:[a.b.c.d]` codes in the HTML Tidy Accessibility Checker documentation. As can be seen, my site is not up to the better standards in this respect.
Perhaps the most interesting thing about Tidy is its ability to fix common errors. Here is an example of what it can do. First we will display the errors in one of my HTML files, then run it through Tidy and then test the corrected output again with Nu Html Checker.
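The three steps were something like this; the file names are illustrative.

```bash
java -jar ~/.local/bin/vnu.jar somepost_en.html        # 1. list the errors found by v.Nu
tidy -q -o somepost_fixed.html somepost_en.html        # 2. let Tidy write a corrected copy
java -jar ~/.local/bin/vnu.jar somepost_fixed.html     # 3. check the corrected copy again
```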
That's a very good result. Of course, there's a but. In my case it is not the HTML output that needs to be corrected, it is the GTML source file used to generate the HTML file that must be fixed. Here is what happens when a GTML source is "corrected" by Tidy.
Original Text | Tidy Output |
---|---|
#define TITLE Web Site Offline in Previous Three Days #define ORGDATE 2021-09-1 #define ORGVERSION September 1, 2021 ##define REVDATE 2019-11-07 ##define REVVERSION November 7, 2019 #define MODAUTHOR Michel Deslierres #define LOCSTYLE .lmargin {margin-left: 15px} #define LANG en #define LANGLINK major_incident_fr #include "2_head.gtt" #include "2_topmenu.gtt" ##define LEFT ha/rpi/new_stretch_en.html ##define LEFT_TITLE Updating Raspbian to Stretch ##define RIGHT ha/rpi/guide_buster_02_en.html ##define RIGHT_TITLE Home Automation Servers on Raspbian Buster Lite ##define RIGHT2 ha/rpi/guide_buster_03_en.html ##define RIGHT2_TITLE Various Hardware with Raspbian Buster Lite #include "2_links_top.gtt" #literal ON <div class="content"> C O N T E N T H E R E <div class="scrn"> michel@hp:~$ <span class="cmd">ls /dev/tty*</span> ... /dev/tty18 /dev/tty33 /dev/tty49 /dev/tty7 /dev/ttyS20 /dev/ttyS8 /dev/tty19 /dev/tty34 /dev/tty5 /dev/tty8 /dev/ttyS21 /dev/ttyS9 /dev/tty2 /dev/tty35 /dev/tty50 /dev/tty9 /dev/ttyS22 <b>/dev/ttyUSB0</b> /dev/tty20 /dev/tty36 /dev/tty51 /dev/ttyprintk /dev/ttyS23 ... michel@hp:~$ <span class="cmd">dmesg | grep tty</span> [ 0.000000] console [tty0] enabled [25490.513501] usb 3-14: ch341-uart converter now attached to ttyUSB0 </div> C O N T E N T H E R E </div> #literal OFF #include "2_links_bottom.gtt" #include "2_foot.gtt" |
<!DOCTYPE html> <html> <head> <meta name="generator" content= "HTML Tidy for HTML5 for Linux version 5.8.0"> <title></title> </head> <body> #define TITLE Web Site Offline in Previous Three Days #define ORGDATE 2021-09-1 #define ORGVERSION September 1, 2021 ##define REVDATE 2019-11-07 ##define REVVERSION November 7, 2019 #define MODAUTHOR Michel Deslierres #define LOCSTYLE .lmargin {margin-left: 15px} #define LANG en #define LANGLINK major_incident_fr #include "2_head.gtt" #include "2_topmenu.gtt" ##define LEFT ha/rpi/new_stretch_en.html ##define LEFT_TITLE Updating Raspbian to Stretch ##define RIGHT ha/rpi/guide_buster_02_en.html ##define RIGHT_TITLE Home Automation Servers on Raspbian Buster Lite ##define RIGHT2 ha/rpi/guide_buster_03_en.html ##define RIGHT2_TITLE Various Hardware with Raspbian Buster Lite #include "2_links_top.gtt" #literal ON <div class="content"> C O N T E N T H E R E <div class="scrn"><span class="cmd">michel@hp:~$ <span class= "cmd">ls /dev/tty*</span> ... /dev/tty18 /dev/tty33 /dev/tty49 /dev/tty7 /dev/ttyS20 /dev/ttyS8 /dev/tty19 /dev/tty34 /dev/tty5 /dev/tty8 /dev/ttyS21 /dev/ttyS9 /dev/tty2 /dev/tty35 /dev/tty50 /dev/tty9 /dev/ttyS22 <b>/dev/ttyUSB0</b> /dev/tty20 /dev/tty36 /dev/tty51 /dev/ttyprintk /dev/ttyS23 ... michel@hp:~$ <span class= "cmd">dmesg | grep tty</span> [ 0.000000] console [tty0] enabled [25490.513501] usb 3-14: ch341-uart converter now attached to ttyUSB0</span></div> C O N T E N T H E R E </div> #literal OFF #include "2_links_bottom.gtt" #include "2_foot.gtt" </body> </html> |
The text with a silver background added by `tidy` will cause a problem because the `2_head.gtt` and `2_foot.gtt` templates will be expanded into the proper HTML header and footer, so there will be duplicates. Then all the GTML macro definitions that begin with `#define` are mangled because each definition must be on a single line beginning with `#define`. While it may be possible to fix this problem, I can't see how the other problem visible above would be fixed. The `scrn` style used with the `<div>` tag to show terminal commands and results is equivalent to the `pre` HTML tag, which means that the text between the opening and closing tags must keep its formatting. Unfortunately, `tidy` output cannot preserve spaces, line breaks and tabs, which means that all the formatting in a `<div class="scrn"> ... </div>` block will be lost as seen above (see Preserving original indenting not possible in the Tidy documentation).
It is unfortunate that I can't use Tidy because I think it would have automatically fixed many of the reported syntax problems.
Dr. Watson
Created more than 20 years ago, Dr. Watson is a "free service to analyze your web page on the Internet. You give it the URL of your page and Watson will get a copy of it directly from the web server. Watson can also check out many other aspects of your site, including link validity, download speed, search engine compatibility, and link popularity."
This is a web-based application that cannot be installed locally as far as I can make out. This makes it a bit impractical for checking the many older posts on my site but it could be used to verify new additions to the site. Unfortunately, there is an unspecified size constraint as I found out when I tried to check one of the more popular posts on my site and got the following error.
The limit, it turned out, was not much.
Checking CSS Files
The Nu Html Checker can verify HTML, CSS and SVG documents. However for checking CSS style sheets, I prefer to use the CSS Validation Service. The reason for this preference is that it returns a corrected version of the submitted file. It is a web-based application but it was not important for me to see if this validator can be installed locally. I have only 3 CSS style sheets and they are rarely changed.
W3C Markup Validator
As far as I can make out, before hosting Nu Html Checker, W3C already had a verification tool called W3C Markup Validation Service. It is "a perl-based CGI script that uses DTD to verify the validity of HTML3, HTML4 and XHTML documents; it also incorporates by reference the NU Validator used to validate HTML5 / HTML LS documents" (source).
If I interpret this correctly, this web-based application validates older HTML3 and HTML4 documents against their DTD, but it uses the Nu Validator when the document is HTML5. If that is accurate, it would mean that this validator would not be of much use to verify my site.
Hyperlink Checkers
The invalid link is a vexing problem for both users and creators of web content. There are two types of errors related to hyperlinks on my site: those that are entirely my fault and those created by others. Most of the self-inflicted errors are stupid spelling mistakes, simple inversions of letters while typing in a URL, or hurried changes in the name of an `id` attribute while building the menu found in most of the substantial posts. Careful verification before posting a new web page should eliminate this problem but "things happen" as "they say" (whoever "they" are, and I do know that they usually say something a bit more scatological). The other common type of error is the disappearing site. Back in 1998 Sir Tim Berners-Lee listed arguments put forth for changing URIs (Uniform Resource Identifiers) and argued their invalidity: Cool URIs don't change. The message has not reached everyone (and that includes me, unfortunately), so many links to outside resources end up pointing to something that no longer exists or that has been given a new address. Try this link https://www.google.com/not-found-file.html to see how Google reports a 404 not found error. My own version https://sigmdel.ca/michel/not-found-file.html is even more terse. Experience has shown that fixing remote site 404 errors can be time consuming because there is no indication whether the wanted resource has been entirely removed or if it remains available on the same host but with a different URL or on a different host. The latter is an inevitable consequence when individual creators who do not have a personal domain move their site to a different web hosting provider.
Numerous link checkers are available, but some will not work for me because when I created this site I made a couple of decisions which were not optimal. One was that I decided to use relative instead of absolute URLs when linking to other documents on my site. While there are arguments against this practice (Why relative URLs should be forbidden for web developers) and the W3C Link Checker will only work with absolute URLs, this would not have had much impact had I not also decided to use the `<base href="/michel/">` HTML element. This seems to confuse many link checkers, especially when it comes to internal `id` attributes used in links to specific positions within an HTML document.
As before, I am interested in tools that I can install on my desktop machine. In the end, I have installed only two hyperlink checkers and truth be told only one of them works well with my site.
LinkChecker
Luckily, LinkChecker, a Python 3 script, can handle links on my site. Version 9.4.0 is available as a package in Debian Buster while the latest version (10.0.1) is available in Debian Bullseye and Sid. Unfortunately, these packages are not available in the standard Mint 20.1 repository, so I decided to install the script in a virtual environment, but there are other methods of installing LinkChecker. The installation was a simple three-step procedure: create a virtual environment, enable it and then install the package within it with `pip` (actually `pip3` since the virtual environment is created with Python 3).
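The three steps look like this; the directory chosen for the virtual environment is arbitrary.

```bash
python3 -m venv ~/.venv/linkchecker          # 1. create the virtual environment
source ~/.venv/linkchecker/bin/activate      # 2. enable it
pip install linkchecker                      # 3. install LinkChecker (pip is pip3 here)
```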
The next step was to set up the configuration file, which is called `linkcheckerrc` and which should be in a directory named `.linkchecker` in the user's home directory. It is an INI file with section headers in square brackets `[]` and keys which are name and value pairs.
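Here is a sketch of my configuration with the settings discussed below; following the LinkChecker documentation, the `log` and `verbose` keys are placed in the `[output]` section, and the empty `[AnchorCheck]` section enables that plug-in.

```ini
# ~/.linkchecker/linkcheckerrc (sketch)
[checking]
recursionlevel=1

[filtering]
checkextern=1

[output]
log=html
verbose=1

[AnchorCheck]
```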
The `recursionlevel=1` key-value pair in the `[checking]` section will ensure that only links in the file provided on the command line are verified. Without that, the default behaviour of LinkChecker would be to follow every HTML link, check its links, and do that recursively. This unlimited recursion would be desirable when verifying all links in a web site. However, this could take a long time and it would be very discouraging when all that is desired is to check a page about to be added to the site.
The next setting, `checkextern=1` in the `[filtering]` section, ensures that all links to resources outside the web site are verified. No recursive verification of links in external HTML files is ever done, no matter the setting of `recursionlevel`.
The colour-coded HTML output, as set with the key-value pair `log=html`, makes it very easy to spot errors, especially when verbose output is enabled. Setting `verbose=1` is a good way to verify that the checker is investigating all the links in the HTML source file.
Finally, the `AnchorCheck` section header enables the `AnchorCheck` plug-in, which is important for my site. This will ensure that internal document links to elements with `id` names are verified. There are other plug-ins, including checking each file with the Nu Html Checker. All the plug-ins can be listed.
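For example, with the option below:

```bash
linkchecker --list-plugins
```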
Most settings can be set with options on the command line, except for enabling plug-ins, which can only be done in the configuration file. Settings set with command line options take precedence over settings in the configuration file and I make good use of this fact in what follows. Here is part of the output when checking one file on the site. Note the command line option `-o text`, used to display the output from LinkChecker more easily. The option overrides the `log=html` setting in the configuration file.
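The command was along these lines; the page name is a placeholder.

```bash
# somepage_en.html stands in for the page actually checked
linkchecker -o text https://sigmdel.ca/michel/somepage_en.html
```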
In this example an old post is tested without overriding or adding any settings beyond those in the configuration file shown above. While the output to `stdout` is redirected to a file, the progress reports that are shown below are not redirected.
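The invocation was something like this, with illustrative file names:

```bash
# The HTML report goes to the file; progress messages still appear on the terminal
linkchecker https://sigmdel.ca/michel/oldpost_en.html > oldpost_report.html
```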
The results can be viewed. Since `verbose` was not enabled, only warnings (there were none) and errors (there were two) are shown in the result file. The program checked 23 links, many of which referred to other pages on the web site. Within 10 seconds or so all but one link had been verified. One of those got a 404 error, meaning that the external file no longer exists. The last link timed out. The default timeout value is 60 seconds, which explains why the program ran for slightly more than a minute.
To test my whole site, I timed the following command.
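It was along the following lines; the report file name is illustrative.

```bash
# Unlimited recursion (-r -1) over the whole site, timed
time linkchecker -r -1 https://sigmdel.ca/michel/ > sitereport.html
```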
Note the `-r -1` command line option that sets the recursion level to a negative value, implying that all links on the web site will be checked recursively. I trust that the program keeps a list of visited pages to avoid infinite loops! Obviously, that pitfall was avoided because, in all, over four thousand links were checked in slightly over 18 minutes, with 126 warnings, mostly about invalid anchor names, and 102 errors such as 404 file not found errors.
What if recursion is turned off and LinkChecker is given the list of files to check? Explicitly, the checker will obtain its list of URLs to check from the file `list.txt` redirected to `stdin`.
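A sketch of that invocation:

```bash
# The recursionlevel=1 setting in the configuration file keeps the check non-recursive
linkchecker --stdin < list.txt > report.html
```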
In closing the discussion about this excellent tool, two things bear mentioning. There is a slight problem with Unicode: even when specifying the `utf_8` encoding, code points above ASCII 126 will not be displayed correctly. This is a known issue (Encoding strings doesn't work #533), but to be fair it does not materially reduce the value of the program. Also be aware that the "original" LinkChecker by Bastian Kleineidam (wummel) is still available on GitHub. There have been no updates to that code since June 2016 and it only works with Python 2.7.x. It is a bit unfortunate that there is no mention of the newer version anywhere that I could find.
Linklint
In the past, I verified links with a Perl script called `linklint`. Version 2.3.5 is available in the Mint repository and could be installed with the GUI package front ends Synaptic and mintinstall. According to the Linklint - fast html link checker home page, version 2.3.5 dated August 13, 2001 is the latest version. However, this is not the case: going down the list of archives in the download directory, there are 2.3.6.c and 2.3.6.d versions dated 2022-12-12. I installed 2.3.6.d. It's a single Perl file which I copied to `~/.local/bin`.
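The first, local, pass was run roughly as follows; the report directory matches the one mentioned below.

```bash
# Check every local link starting at the site root; reports are written to ~/linktest
linklint -doc ~/linktest -root /var/www/html/michel /@
```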
Unfortunately, it gets confused by `id` attributes and it reported hundreds of errors.
No broken links? That seems very unlikely. At the same time, there are 1510 missing named anchors, things like the `id` attribute of each section and subsection in the posts. A bit more on that later. The falsely optimistic result is in part caused by the fact that not all checks have been performed. Remote URL Checking explains that a second step is required.
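If I read that page correctly, the second pass checks the remote URLs recorded during the first pass, roughly like this:

```bash
# Second pass (sketch) -- check the remote http links cached in ~/linktest during the first pass
linklint -doc ~/linktest -net @@
```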
That result is more in line with the sorry state of my web site. The lack of support for "named anchors" is disappointing because, in both 2.3.6.c and 2.3.6.d versions of the script, the following is found.
Let's look at the details in the output file `~/linktest/errorAX.html`.
That "missing named anchor", `/michel/3d/3d/intro_openscad_01_en.html#futureversion`, looks very suspicious. Here are all the lines in the HTML file containing the string `futureversion`.
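Assuming the file lives at the expected place in the local copy of the site, they can be extracted with a simple grep:

```bash
grep -n 'futureversion' /var/www/html/michel/3d/intro_openscad_01_en.html
```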
There is nothing that obviously accounts for the extraneous `3d/` in the link calculated by `linklint`. However, I edited the HTML file, removing the `<base href="/michel/">` element and adding the `/michel/` prefix to all relative URLs in the file. On running `linklint` against this modified HTML file, the links to named anchors were all verified as correct. In my mind that confirms that the `<base href=>` HTML tag is the source of the confusion.
The conclusion appears to be that `linklint` is a viable link checker as long as the HTML files do not combine the `<base href=>` HTML tag with `id` attributes used as named anchors.
Strategy
More than 19,000 syntax errors and more than 100 invalid hyperlinks seems like an overwhelming task. Just where to begin? So far I have removed all syntax errors reported by Nu Html Checker for the 10 files in the root of my personal web site at sigmdel.ca/michel. It makes sense that the files reached by clicking on any icon at the top of each page should be error free. But where to go from there? Should I start with the 10 pages which account for half the syntax errors or should the 10 most visited pages be checked initially?
Again, the W3C provides a tool to help establish a priority. Called the Log Validator, it is "a free, simple and step-by-step tool to improve dramatically the quality of your website. Find the most popular invalid documents, broken links, etc., and prioritize the work to get them fixed." The W3C goes beyond that and provides instructions: Making your website valid: a step by step guide.
I did install the tool in my account with my web host. Details can be found in W3C LogValidator in cPanel. The initial results from using that tool were disappointing to say the least.
To start removing errors on this site, I'll begin with the most visited pages and then look at the worst pages. If I were a betting man, I would not wager on ever getting to every page.