Content Discoverability through Intelligent Indexing

2017-09-12 by Sean Harrison

Can you imagine a publishing future in which marketers will compete for the services of experienced indexers? That might seem far-fetched, but I can imagine it. In fact, I think we're not that far from it.

Search engines are not very good at returning useful results, because most search indexing is based on automatic processing of the content. We have all experienced frustration in using search engines to find information, especially when the topics we are exploring move beyond the purely factual into the realms of ideas, imagination, and feelings. But some of the most important topics about which we are searching for answers are in those “soft,” non-fact-oriented realms. In areas like these, search engines have not delivered the value they promised.


In light of this reality, I disagree with those who still put hope in automatic semantic tagging of content as a stand-in for intelligent indexing. I believe we are coming to the end of the era in which automatic search indexing is accepted as a replacement for intelligent human indexing.1 Good content discovery requires the creation of intelligent indexes by human indexers. The more thoughtfully and intelligently the index has been crafted, the more likely it will be that the search index will yield useful results.

A good index is a semantic map of the content of a book. The better the index, the better the semantic map it provides. This semantic map cannot be adequately engineered by keyword tagging or automated search indexing.

. . . the intellectual part of indexing – the analysis of meaning, significance and uniqueness, then modelling the likely behaviour of human readers and providing for their predicted access paths – cannot be automated.2

A good index has long been a valuable component of non-fiction books: It helps readers to find where particular topics are discussed. But indexes are even more important in the digital future of publishing. Not only will indexes help people discover content within a particular book, but they will help people discover content across a whole library of books and other content. In other words, indexes will become a crucial part of content and product discovery.

Here is how it will work:

Index entries will be embedded in the content and attached to a range of text — as small as a word, as large as a group of paragraphs or a section.
The embedded index entries will be used to create not just print indexes but also a semantic map of the content in that book (or article, or feature).
This semantic map will be loaded into a search engine. Using the intelligently crafted semantic map provided by the indexer, the search engine will learn what the content is about and provide intelligent results.
The reader will have a much better search experience using the search engine that is created in this way. The more thoughtfully and intelligently the index has been crafted, the more likely it will be that the search index will yield useful results.
Readers who discover content in this way are more likely to find the content they are looking for, and more likely to purchase the publication that contains that content.
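
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of any existing system: the `IndexEntry` structure, the character-offset ranges, and the substring matching in `search` are all hypothetical choices made to keep the example small.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    """One embedded index entry (hypothetical structure, for illustration)."""
    heading: str  # the index heading chosen by the human indexer
    doc_id: str   # the book, article, or feature the entry lives in
    start: int    # character offset where the indexed range begins
    end: int      # character offset where the indexed range ends (exclusive)


def build_semantic_map(entries):
    """Group entries by lowercased heading: heading -> list of entries."""
    semantic_map = defaultdict(list)
    for entry in entries:
        semantic_map[entry.heading.lower()].append(entry)
    return semantic_map


def search(semantic_map, texts, query):
    """Return (doc_id, passage) pairs for every heading containing the query."""
    needle = query.lower()
    results = []
    for heading in sorted(semantic_map):
        if needle in heading:
            for entry in semantic_map[heading]:
                passage = texts[entry.doc_id][entry.start:entry.end]
                results.append((entry.doc_id, passage))
    return results


# Sample content and entries, invented for this sketch.
texts = {
    "book-1": "Grief arrives unannounced. Budgets need quarterly review.",
}
entries = [
    IndexEntry("loss, coping with", "book-1", 0, 26),
    IndexEntry("financial planning", "book-1", 27, 57),
]

semantic_map = build_semantic_map(entries)
for doc_id, passage in search(semantic_map, texts, "coping"):
    print(doc_id, "->", passage)
```

The query "coping" finds the first passage even though that word never appears in the text itself. The match comes entirely from the heading the indexer assigned, which is the kind of result keyword-based indexing would miss.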

An excellent search experience is like excellent design: It is largely invisible, but very effective. In marketing, excellent design translates into more sales, even when buyers aren't aware of the design and don't realize they are responding to it. In product development, excellent design translates into greater customer satisfaction with the user experience. Similarly, an excellent search experience leads readers right to the content they are looking for, and can motivate them to come back again in the future and, if the search results are linked to product purchase opportunities, to buy the book.

As publishers, we don't need to wait for ebook reading software or global search engines to incorporate more intelligent indexing into their systems. In fact, as long as they don't, it is a business opportunity for us: We can use our books’ indexes to provide our customers with a better search and content-discovery experience on our own websites and on the web at large.

In order to provide these benefits, index entries need to be embedded in the content. The traditional approach to indexing, in which indexers create a word-processor document with all the page numbers, will not suffice. Such “static indexes” are fine for print publications, but digital publications need index entries to do more than just point to the print page.

Digital index entries need to link to the precise location in the content where the entry occurs, so that readers can find what they are looking for. Static print indexes can be enhanced with hyperlinks to the location of the top of the printed page, but that location is meaningless in a digital context, and it is often several screens away from the actual content, especially on small devices. What works in print does not work well in digital. By contrast, when the index entries are embedded right in the content, the compiled index can link right to the location in the content where the topic is discussed, giving the reader a much better user experience.
Digital index entries can semantically tag the content to make that content more readily discoverable by search engines, but this only works if the index entries are clearly associated with the relevant content. The most straightforward way to associate index entries with particular content is to embed the index entries in the content and associate each one with a particular range of text.
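
As a minimal sketch of the difference, the following Python compiles an index whose links point at the marked range of text rather than at the top of a printed page. The `data-index-term` attribute, the `ix-1` style ids, and the file name are assumptions made for this example; they are not the markup defined by the EPUB indexes specification.

```python
from html.parser import HTMLParser


class IndexTermCollector(HTMLParser):
    """Collect (term, anchor id) pairs from elements that carry an index term."""

    def __init__(self):
        super().__init__()
        self.targets = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        term = attributes.get("data-index-term")  # hypothetical attribute name
        anchor = attributes.get("id")
        if term and anchor:
            self.targets.append((term, anchor))


def compile_index(filename, markup):
    """Return alphabetized (term, link) pairs; each link targets the marked element."""
    collector = IndexTermCollector()
    collector.feed(markup)
    collector.close()
    return sorted(
        (term, f"{filename}#{anchor}") for term, anchor in collector.targets
    )


# Sample chapter content, invented for this sketch.
chapter = (
    '<p>Printed page references are coarse.</p>'
    '<p><span id="ix-1" data-index-term="embedded indexing">Embedded entries'
    ' travel with the text they describe.</span></p>'
    '<p id="ix-2" data-index-term="discoverability">Search improves when'
    ' entries are tied to a range of text.</p>'
)

for term, link in compile_index("chapter3.xhtml", chapter):
    print(f"{term}: {link}")
```

Because each entry is attached to an element in the content, the compiled link lands on the passage itself, whatever the screen size or pagination of the reading device.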

All of this means that we need to find, hire, and train excellent human indexers to embed index entries in our publications. Then we can use all that embedded intelligence to enhance content discovery on our own websites and in our own apps. And if we make our content available to global search engines, we can use the embedded index entries to tell them what each piece of content is “about,” enhancing the chances that search engines will connect our content with those topics.3
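
One way to tell a search engine what a page is "about" is to publish the index headings as structured data. The sketch below turns a list of headings into a JSON-LD block using the schema.org `about` property. The function name, the sample title, and the example.com URL are hypothetical, and whether or how a given search engine uses this markup is not guaranteed.

```python
import json


def about_markup(title, url, headings):
    """Build a JSON-LD script element listing index headings as 'about' topics."""
    data = {
        "@context": "https://schema.org",
        "@type": "Article",
        "name": title,
        "url": url,
        # Each index heading becomes one topic the content is about.
        "about": [{"@type": "Thing", "name": heading} for heading in headings],
    }
    body = json.dumps(data, indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>'


# Sample values, invented for this sketch.
markup = about_markup(
    "Coping with Loss",
    "https://example.com/books/coping-with-loss/chapter-1",
    ["grief", "loss, coping with"],
)
print(markup)
```

The headings would come from the index entries embedded in that piece of content, so the structured data stays in step with the indexer's work rather than being maintained separately.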

Eventually, the world around us will catch up: The EPUB standards already contain clear guidance about how to embed index entries in the content of a publication.4 Ebook reading platforms will (probably, eventually…) support the EPUB standards for embedded index entries, and expose interfaces to users that make use of these indexes in more powerful ways. When that happens, publishers who have done the groundwork in advance, by creating ebooks with properly linked and embedded index entries, will have a significant leg up over those who have not.5

Note: While originally preparing this piece, I went on Adobe Stock to find a photograph of a book index, but most of what I found was pictures of phone books, dictionaries in German, and other irrelevant things. Does everyone have this much trouble with Adobe Stock? I would have been glad to pay Adobe for a good image or group of images related to indexes; instead, I scanned a few pages from The Chicago Manual of Style. It seems that Adobe Stock needs to hire some experienced indexers to improve the semantic tagging of their photos.

1 This is true even as machine learning takes the world by storm. Machine learning cannot replace human intelligence. We will find, I think, that the best uses of machine learning are those that supplement human intelligence.

2 Bill Johncocks, “New technology and public perception,” The Indexer, vol. 30, No. 1 [March 2012], p. 10. Link: http://www.ingentaconnect.com/content/index/tiji/2012/00000030/00000001/art00003 (accessed April 25, 2016).

3 The “semantic web” is the practice of embedding index entries (often called “semantic markup”) into content on the web, in order to inform search engines as to what that content is about. There is an overview page on Wikipedia (https://en.wikipedia.org/wiki/Semantic_Web), a W3C standard (http://www.w3.org/standards/semanticweb/), and an introductory site on the topic of creating semantic web pages (http://semanticweb.org). Once you have index entries embedded in content, creating semantic web pages that use these index entries becomes a mechanical exercise.

4 See the EPUB3 indexes specification: http://www.idpf.org/epub/idx/.

5 The U.K.-based Society of Indexers has a very good introductory page on “Standards and Technologies” for indexing in the digital age: http://www.ptg-indexers.org.uk/about/technologies.htm.