Let’s give a cheer for the catalogers, for the metadata experts, for the people who make information access possible. I stumbled into cataloging during library school and that unplanned experience has had a huge impact. I am not a cataloger but I value what these organizers do. One of the challenges they face with online materials is making sure that things we don’t own or control remain accessible. My recent task was to find a good, inexpensive MARC 856 field link checker. I was surprised at how slim the field of options was.

This was particularly noticeable because a web site owner has almost too many options for link checking. If I want to crawl a web site – mine or someone else’s – I can do that with a variety of tools. But link checking from inside an integrated library system is a different animal.

Our catalog can generate a URL report listing the contents of all of the 856 fields (currently about 3300 URLs on the open internet). But it doesn’t have the ability to verify that the links are working. I was surprised, as I started to look, that many of the solutions libraries are using are either custom, old, or both.

Part of me thinks that I may just be in a silo and that there are obvious tools I’m missing. If you have any tools to suggest that will ingest a spreadsheet and validate links, please do!

This can create a challenge for smaller libraries whose systems may not use MARC or have much functionality beyond collection management. They are also the least likely to be able to afford paid tools, although checking links manually may end up costing more.

But there are some good free solutions out there for organizations the size of most law libraries, ones that can ingest a URL report for link checking.

Google Spreadsheet

My first stop was a blog post I saw back in 2017. A search engine optimization (SEO) consultant developed this nice little hack for link checking. This made a lot of sense because the SEO community cares a lot about links working.

In fact, you can find a lot of free link checking tools online. You can cut and paste a list of URLs into the tool and it will check them. Or you can download a tool and run it from your computer. If your list is 500 links or fewer, these are great options. I’ll look at two of them below.

But if you have a longer list or want to keep the URLs within the context of the catalog report (like which bib ID belongs to which URL), they’re not so functional. That’s why the spreadsheet idea works so nicely. You can take your current report from your catalog and just add a column.

Libraries that use LibGuides will have an idea what this looks like. John King’s instructions are pretty clear so I won’t duplicate them here. The only tweak I made to his approach was to apply conditional formatting to the URL status column.

A URL’s status is 200 when it works and something else when it doesn’t. When we say that a page returned a 404 error, that has a specific meaning: the server couldn’t find the page. In this case, though, I was only interested in whether a link returned a 200 or not.
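
Under the hood, King’s hack is a small custom function added through the Sheets script editor. This is not his exact code, just a minimal sketch of the general shape such a function takes (the name getStatusCode is my label for it):

    /**
     * Returns the HTTP status code for a URL so it can be used as a
     * custom formula in Google Sheets, e.g. =getStatusCode(C2).
     * A sketch of the technique, not John King's exact code.
     */
    function getStatusCode(url) {
      var options = {
        muteHttpExceptions: true, // report 4xx/5xx codes instead of throwing
        followRedirects: false    // surface a 301/302 rather than silently following it
      };
      try {
        return UrlFetchApp.fetch(url, options).getResponseCode();
      } catch (e) {
        return 'no connection'; // DNS failures and timeouts still throw
      }
    }

Assuming the URLs sit in column C, the status column then just calls =getStatusCode(C2) and gets dragged down, and a conditional formatting rule turns any cell equal to 200 green.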

A video showing Google Spreadsheet link checking in action. I add the formula from John King’s blog post, and then drag the formula down over the lower rows. When a 200 status is returned, the cell turns green.

The benefit of this approach is that you can just drag that formula down the URL status column for as many URLs as you have. It took about 15 minutes to check 3300 URLs. And when we run the report again, I can just paste it OVER the old report. The URL status formula will automatically run again when you paste in new content.

What I really like is that the spreadsheet can handle the default URL report. From a staff perspective, it’s a cut/paste. I tried this with Excel as well, but wasn’t able to create a macro that would do the same check as reliably.

This approach does not work if you are checking URLs that are behind a paywall or require a login. For that, I returned to an older, quirky app called Xenu.

URL List Checking

If all you need to check is a bare list of URLs, there are two good free options. The two I took a look at are Xenu and HeadMaster SEO. Both are free to acquire, both can import a text list of URLs, and their output is very similar.

Xenu has been around for a long time but doesn’t appear to have been updated since 2010. It is free but not open source. I’ve used it for web site link checking but had not used it with a spreadsheet. It’s fast, if a bit finicky, and in this mode it simply tests a list of links.

If you download the beta, you will find an option on the main menu to check a link list. The input is a text file of URLs. I stripped the URL column out of our spreadsheet and saved it as a text file. Then I imported it into Xenu and ran the check.
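
If you would rather not do that step by hand, it can be scripted from the same Google Sheet. A hypothetical sketch, assuming the URLs sit in column C of the active sheet starting at row 2 (adjust the range to your report’s layout):

    function exportUrlColumn() {
      var sheet = SpreadsheetApp.getActiveSheet();
      // Assumes URLs are in column C, rows 2 onward; adjust to your report
      var values = sheet.getRange(2, 3, sheet.getLastRow() - 1, 1).getValues();
      var urls = values.map(function (row) { return row[0]; })
                       .filter(String) // drop blank cells
                       .join('\n');
      // Writes urls.txt to the root of your Drive, ready to import into Xenu
      DriveApp.createFile('urls.txt', urls, MimeType.PLAIN_TEXT);
    }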

The link checking screen in Xenu’s beta app showing the list of library catalog URLs being tested. The status column indicates whether the page exists or not.

The interface is intuitive and skips the server error codes. A 200 still shows in green but just says “ok”. An error isn’t a 404 or a 503, it’s just text: “server error” or, if the server has disappeared, “no connection”.

You should check Xenu’s options because it will, by default, look at external links. That means that if a page on Canadiana.ca is linked in our 856 fields, Xenu will also check the links on that page. My URL list was only a couple hundred, but Xenu found over 9,000 links to check until I toggled that feature off.

The HeadMaster SEO link checker is just as fast and the interface is a bit more modern and clean. The free version is unlimited except for a cap on the number of URLs checked at a time. As far as I can tell, if you have more than 500 links to check, you can just run them in batches. It’s pretty inexpensive software if you want to go past that cap, though: for C$269, you can get a lifetime, unlimited license.
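
Batching can be scripted too. Building on the Apps Script sketches above, a hypothetical helper that splits an array of URLs into text files of 500 to stay under the free version’s cap:

    function exportInBatches(urls, batchSize) {
      // urls: an array of URL strings; batchSize: e.g. 500 for the free cap
      for (var i = 0; i < urls.length; i += batchSize) {
        var batch = urls.slice(i, i + batchSize).join('\n');
        var fileNumber = Math.floor(i / batchSize) + 1;
        // One plain-text file per batch: urls-batch-1.txt, urls-batch-2.txt, ...
        DriveApp.createFile('urls-batch-' + fileNumber + '.txt',
                            batch, MimeType.PLAIN_TEXT);
      }
    }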

HeadMaster SEO is a bit more informative than Xenu because you can see if a page still exists but the URL has changed (server code 301).

Why Error Codes Matter

Between the two, I like HeadMaster SEO’s app better than Xenu. It’s just as fast and that bit of extra detail – whether something is a 301 or a 200 – matters. When a page has moved permanently, you want that new URL. If a tool just reports back, “Yes, I found something”, you don’t really know what it found without the server code.
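
This is the same reason the followRedirects setting mattered in the spreadsheet sketch earlier. A hypothetical companion function (not a feature of either app) that captures where a moved page went:

    function getRedirectTarget(url) {
      var response = UrlFetchApp.fetch(url, {
        muteHttpExceptions: true,
        followRedirects: false // stop at the redirect so we can inspect it
      });
      var code = response.getResponseCode();
      if (code === 301 || code === 302) {
        var headers = response.getHeaders();
        // The Location header carries the new URL the cataloger wants
        return headers['Location'] || headers['location'];
      }
      return code; // 200 for a working link, an error code otherwise
    }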

Here are two examples from the report. One site has revamped its folder/slug structure so that:

http://adric.ca/resources/journal-articles

is now

http://adric.ca/useful-links/journal-articles

A link check on that will return an “ok” in Xenu because it arrives somewhere. But a library cataloger will want to know that there is a new link.

Similarly, a government document may move around. In this case:

http://cjc-ccm.gc.ca/cmslib/general/

has become

http://cjc-ccm.ca/cmslib/general/

The domain name itself has changed, but you might as well capture the new URL. As a web site owner, I know there can come a time when you turn off redirects. When that happens, a library catalog will start getting errors that it could have fixed before.

Choice?

My recommendation, in order, would be:

  • Google Spreadsheet
  • HeadMaster SEO
  • Xenu

I prefer the spreadsheet option because (a) it can use the report your library system already generates and keep the links in context, (b) it saves its state as a spreadsheet, and (c) it can be reused perpetually by just pasting in each new report.

But I’m not the person doing the work, so I will just report back to my technical services team about what I found. There are definitely pros and cons to using any of these apps. And there’s still the issue of testing links within paywalled content that requires a login.