Future publishing technologies and indexes

A recent thread on an indexing discussion list got me thinking that what indexers needed was a guide to everything they didn’t know about future indexing technologies. So here it is.

In the beginning was the scroll. Some time later came the book – the scroll was chopped up into fixed-size pages which were then numbered sequentially. That opened the way for the index – a collection of headings together with indicators showing on which page the required information appeared.

And now, and in the future, we are going back to the scroll. Instead of books with multiple pages, books are being published as a single, very long page. These books are not on physical paper, as with the scrolls of antiquity, of course, but on a computer screen. So imagine a web page containing the whole contents of a book. In your browser you see one screenful at a time, and can move to the next by clicking on the scroll bar. You still view the book a page at a time, but, and this is the key point, the size of the page is not fixed. You can make the browser window larger or smaller, or zoom to increase or decrease the font size, and the page size changes.

You might think that very few books are published in this way, but you would be wrong. This is exactly how the Kindle works. To publish on a Kindle the book is converted into a web page and then displayed a page at a time on an e-ink screen. By changing the font size or jumping to a particular starting point in the text, you change where your pages start and end. If you view the book on the bigger-screen Kindle DX you get a different pagination.

So the future of books is the scroll, which leaves a problem for indexes. They relied on the pages. The solution is to have the indexes point to something in the text smaller than pages – specific paragraphs, specific words or individual character positions. There are two ways of doing this – tagging and embedding.

Tagging is adding little markers in the text and then using those markers in the index. This could be done simply using the character number – instead of “page 123” we refer to “character 492,761”. Alternatively a smaller number of more convenient tags can be added by the publisher, perhaps one for each line, and the index uses those (Elsevier does this). Or perhaps the indexer themselves adds tags to the text and uses those in the index (CUP-XML does this). What the actual locators look like doesn’t really matter because when displayed on the e-book screen it is simply a link for the reader to click on.
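For readers who like to see an idea in code, here is a minimal Python sketch of why a character-number locator survives repagination while a page number does not. It assumes a deliberately simplified model in which a "page" is just a fixed number of characters; the sample text, the `index` dictionary and the `page_of` function are all made up for illustration and do not describe any publisher's system.

```python
def page_of(offset, page_size):
    """Return the 1-based page holding a 0-based character offset,
    assuming (for illustration only) pages of exactly page_size characters."""
    return offset // page_size + 1


# Hypothetical sample text and a tiny index keyed on character position.
text = "In the beginning was the scroll. Some time later came the book."
index = {
    "scroll": text.find("scroll"),
    "book": text.find("book"),
}

# The same locator lands on different pages as the page size changes,
# but it always points at the same characters in the text.
for heading, offset in index.items():
    small = page_of(offset, 20)   # small screen or large font
    large = page_of(offset, 40)   # large screen or small font
    print(heading, "at character", offset, "- page", small, "or page", large)
```

The character offset is the stable part; the page number is only worked out at the moment of display, which is why the reader only ever needs to see a link.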

Embedding actually stores the index headings themselves directly in the text at the required character position. A separate run of a computer program is required to go through the text and create an index using the character numbers we talked about in tagging, but that can be done very quickly.
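The "separate run of a computer program" might look something like the following Python sketch. The `{{index:...}}` marker syntax is invented purely for this example (real formats store embedded headings in their own ways), and `extract_index` is a hypothetical helper, not part of any indexing package. It strips the markers out of the text and records the character position in the visible text at which each heading was embedded.

```python
import re

# Hypothetical marker syntax for an embedded heading: {{index:heading}}
MARKER = re.compile(r"\{\{index:(.*?)\}\}")


def extract_index(marked_text):
    """Return (visible_text, index), where index maps each embedded heading
    to a list of character positions in the visible text."""
    index = {}
    pieces = []
    visible_length = 0
    last_end = 0
    for match in MARKER.finditer(marked_text):
        chunk = marked_text[last_end:match.start()]
        pieces.append(chunk)
        visible_length += len(chunk)
        # The marker itself takes up no room in the visible text.
        index.setdefault(match.group(1), []).append(visible_length)
        last_end = match.end()
    pieces.append(marked_text[last_end:])
    return "".join(pieces), index


marked = ("In the beginning was the {{index:scrolls}}scroll. "
          "Later came the {{index:books}}book.")
visible, index = extract_index(marked)
print(visible)
for heading in sorted(index):
    print(heading, index[heading])
```

Note that the heading ("scrolls") need not match the wording in the text ("scroll"), which is exactly what distinguishes an embedded index entry from a search.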

Storing these tags and headings in the actual text itself would, of course, be unacceptable if they could be seen by the reader, so it is necessary to have some form of text which is more complex than plain characters, and which allows information to be stored invisibly. There are many different ways of doing this as no-one has come up with a format which everyone agrees to be perfect. All word-processors have their own formats, such as MS Word with .doc and .docx, or OpenOffice with .odt. Web pages use a format called HTML which uses tags inside angle brackets to enclose invisible information, so <title> </title> indicates that the text in between the tags is to be used as a title for the browser window but would not be displayed to the reader as part of the text. XML also uses angle bracket tags but goes a step further and allows the document creator to define their own tags. So if you wanted to use <browserheading> </browserheading> instead of the <title> tag then you could do so. It also means that you can create tags for things which no-one else has thought of. As a publisher you can come up with a tag system better than anyone else and then reap the commercial rewards of that system.
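To make the idea of invisible, self-defined tags concrete, here is a small Python sketch using the standard library's XML parser. The `<indexmark>` tag and its `heading` attribute are invented for this example and are not taken from any publisher's actual tag set; the point is only that the reader's text and the indexer's headings can live in the same file without getting in each other's way.

```python
import xml.etree.ElementTree as ET

# A one-paragraph "book" with a made-up, self-defined tag carrying an index heading.
doc = '<para>In the beginning was the <indexmark heading="scrolls"/>scroll.</para>'
root = ET.fromstring(doc)

# What the reader sees: only the text, with every tag left out.
reader_text = "".join(root.itertext())

# What an index-building program sees: the headings stored in the tags.
headings = [element.get("heading") for element in root.iter("indexmark")]

print(reader_text)
print(headings)
```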

So what does this mean for the indexer? First, that indexing has to be more precise. Rather than identifying on which page a concept appears the indexer must identify exactly where the topic starts and ends, right down to the character position. That always involves more work. Second, the indexer will have to use a range of software tools or techniques to record the index information in the document. This might be software tools involving drop-down menus or special keystrokes, or techniques involving colored numbers printed on PDFs. Furthermore, these software tools and techniques will come mostly from the publishers, who are the inventors of their own systems, often designing them concentrating on facilities to handle page presentation, such as illustrations and tables, and not the work patterns of the indexer.

As I mentioned, there is no single format on which everyone agrees, nor is there any sign of one being agreed in the near future. As indexers we need to be agile. Investigate and find a format in which you think there will be demand for indexes. Create your own techniques and even tools, using word-processor macros, spreadsheets or programmable function keys on keyboards, to make your indexing process efficient for that format. Pursue work in that format. Blog and tweet about your learning experiences and maybe work will find you. If that doesn’t take off, learn another format. That adds to your portfolio of skills, your menu of services offered. Having said that, it does seem that indexing is in danger of becoming a Red Queen’s Race.



About James Lamb

James Lamb has a degree in Computer Science and Mathematics from London University, worked for over 20 years as a senior IT technician and team leader, much of that time for dealing rooms of international banks, and became a full-time, professional indexer in 2004.
This entry was posted in e-books, embedded indexing, Kindle, SIdelights (SI newsletter), technology, XML.

8 Responses to Future publishing technologies and indexes

  1. Ann Hudson says:

    Very useful article, James – gives a clear explanation.
    Best wishes,

  2. Marie-Pierre Evans says:

    Do you know – roughly – what percentage of indexes are produced using the tagging or embedding methods these days?

    • James Lamb says:

      Marie-Pierre Evans asked:

      Do you know – roughly – what percentage of indexes are produced using the tagging or embedding methods these days?

      It would be very difficult to come up with any kind of figure.
      Publishers want to use XML in order to be able to present their books on different platforms, such as e-book readers, and if they want indexes then they have to use either embedding or tagging. Cambridge University Press, for example, use XML for all their books and so all their indexes are done that way. However a concern is that one option publishers have is to provide the book without any index at all, possibly believing that the ability to search the book is an adequate substitute or possibly simply thinking that it is cheaper and they can get away with it.
      Personally around half the indexes I provide are done using either embedding or tagging, but I wouldn’t expect that to be representative of indexers as a whole.

  3. Glyn Sutcliffe says:

    Do you think many publishers will simply append free-text searching software to their ‘electronic scrolls’ and settle for this as a cost-effective option in many cases?

    Since writers divide their texts into sections irrespective of page numbers, perhaps there is a case for ‘electronic scrolls’ to be numbered by chapter and paragraph. Such numbering could then be used as index locators.
    Perhaps a modified form of the Biblical system of chapter and verse, or a report numbering system is set to have a wider application as page numbering becomes defunct.

    • James Lamb says:

      That is a good point, Glyn.
      Some e-books have text searching built in, such as Kindle, and, yes, I fear publishers may choose to believe that is an adequate substitute. I have written elsewhere why I think searches short-change the reader but they are conveniently cheap for the publisher.

      Numbered paragraphs have been in use in some books for a long time, for example in law books or in loose-leaf books which are updated frequently. Scripture, as you say, is numbered in that way (and that system was designed when books really were physical scrolls), and so indexes there refer to chapter and verse rather than pages. Those techniques are valid “tagging” techniques and would work in the e-book realm. Whether it is acceptable to have the numbering visible in the text of a normal book is another matter.

  4. Marie-Pierre Evans says:

    Do you think the SI would be interested in conducting a survey amongst its members to find out exactly? A survey which could be updated every year to see how things evolve? Or would that be too much work?

    • James Lamb says:

      Marie-Pierre Evans asked:

      Do you think the SI would be interested in conducting a survey amongst its members to find out exactly? A survey which could be updated every year to see how things evolve? Or would that be too much work?

      We could include a question in the SI annual rates survey, but whether indexers themselves would have accurate records is another matter. Of course what will be more informative is how things change year on year, but that will take several years before we get that data.

