How to create an ebook for Project Gutenberg Canada

Creation and distribution of Project Gutenberg Canada ebooks

Thank you for your interest in creating ebooks for Project Gutenberg Canada! It is volunteers like you who make Project Gutenberg possible.

You should be aware before submitting an ebook to Project Gutenberg Canada that under Canadian law a book in the public domain is exactly that: if you create a new edition of an existing book, you do not acquire any kind of copyright over the new edition. The book is still in the public domain, and other people can use the edition you have created without your knowledge or permission.

It is our standard practice to include a one-line acknowledgment of your work at the top of the ebook, as a way of thanking you for contributing the ebook.

If a book is already in the public domain of another country that has a Project Gutenberg site, or enters it at some future time, we encourage that site to use the Project Gutenberg Canada edition. Also, Project Gutenberg ebooks are meant to be distributed as widely as possible. So you may find that the ebook you created becomes widely available on the Internet!

Criteria for Project Gutenberg Canada ebooks

In general, we accept titles which

We also accept other works under some circumstances, using criteria similar to those outlined in Project Gutenberg US's Submitting Your Own Work document. We are especially interested in translations.

Copyright verification

Before you start work on creating an ebook, you should ensure that it is in the public domain of Canada. If the book was published during the author's lifetime, and the author died 51 years ago or more, the book is in the Canadian public domain.

Also, please send us a scan of the title page and of the back of the title page of the edition that you used. You may wish to do this before you start working on the ebook, if only to lessen the chance that someone else is already working on the book!

The Text ebook and the HTML ebook

Every Project Gutenberg Canada ebook must exist in a Text version. The reason for this is that ASCII files can be read on essentially all current and future computers. You will find the Text information you need here.

In addition, we ask volunteers, if at all possible, to produce HTML versions of the ebooks they create. The HTML you use to produce your ebook should be as simple as possible. Please do not use the automatic HTML conversion offered by word processors and other programs, which often produces HTML that is complex and inefficient. Coding HTML by hand is simple and pleasurable. You will find more information here.

With current internet technology, it is likely that the HTML ebook will be the one that most people read. But the basic, mandatory Project Gutenberg Canada ebook is the Text version. It will be readable decades and centuries from now.

Creating an ebook: one volunteer's personal account

Creating an ebook is a simple and pleasurable exercise. One of our volunteers has provided us with the following account of the procedure he uses to create ebooks for Project Gutenberg Canada. He uses Windows XP, but points out that the process can be followed with only minor changes on Linux and Macintosh systems.

Scanning the book

We got a new Canon scanner about six months ago. It's a wonderful scanner, a true miracle compared to scanners of 5-10 years ago. The scanning part is done through an included art program, and works well.

I scan two pages at a time, since this saves time and (as I explain below) actually simplifies proofing. I number the scans as follows:

The "000" pages are the ones at the beginning of the book which are either unnumbered or have Roman numeral page numbers. I use "001" rather than "1" because that way all of the pages are listed in their correct order when I look in a directory. If I ever have to deal with a book of more than 999 pages, I will have to use "0001" as my convention!
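The effect of zero-padding on directory order can be illustrated with a minimal Python sketch. The helper `scan_name` and the `.jpg` file names are hypothetical and only stand in for whatever naming scheme is actually used; the point is that names are sorted as text, so padded numbers sort in page order and unpadded ones do not.

```python
def scan_name(page_number: int, width: int = 3, ext: str = ".jpg") -> str:
    """Build a zero-padded file name such as '007.jpg' (hypothetical scheme)."""
    return f"{page_number:0{width}d}{ext}"


pages = (1, 2, 10, 100)

# Without padding, text sorting puts '10' and '100' ahead of '2'.
unpadded = sorted(f"{n}.jpg" for n in pages)

# With padding, text order and page order agree.
padded = sorted(scan_name(n) for n in pages)

print(unpadded)  # ['1.jpg', '10.jpg', '100.jpg', '2.jpg']
print(padded)    # ['001.jpg', '002.jpg', '010.jpg', '100.jpg']
```

For a book of more than 999 pages, calling `scan_name(n, width=4)` gives the four-digit convention mentioned above.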

I create JPEG files, for no special reason beyond that being the default, and it works.

Rotating the scans

The scans I get are tilted 90 degrees. There may be a way to avoid this, but fixing it is so easy that I haven't done any research.

I go to the directory with the scans, and open the first scan using the Windows Picture and Fax Viewer. I can do a clockwise rotation simply by clicking on the appropriate icon, or (more quickly) by pressing "Control" and "K" simultaneously.

Then I go to the next scan simply by pressing the Right Arrow key.

Once I've rotated all the scans, I go through them very quickly, just looking at the page numbers to make sure that there are no missing or duplicate scans. Since the page numbers are in the same position on every page, this is a very, very fast check.

I'm a strong believer in defining tasks rather narrowly. I find that a fairly large number of very specific tasks is easier to deal with than a small number of grouped tasks, and leads to better results.

OCR'ing the book

The OCR program included with the scanner, ScanSoft OmniPage SE, is excellent. If the original book is clearly printed, the results provided by the program are very close to being perfect.

I save the OCR output as Text with Line Breaks. Originally I saved the output in individual files, which I later assembled into one big file. Then I discovered how to set the Save As part of the OCR program so that each new OCR section was added to the previous one.

Before I start the actual proofing, I ensure that only ISO-8859-1 characters are used in the file. For this I use the Open Office word processor's "Save as Encoded Text" feature. Open Office is an open-source office suite of high quality that is available for free and exists in versions for Windows, Linux, Mac OS X and other systems as well.
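For readers who would rather check the character set with a script than with a word processor, here is a minimal Python sketch. It is not the volunteer's method, only one possible alternative; the function name `find_non_latin1` and the sample text are illustrative. It reports every character that cannot be encoded as ISO-8859-1, so that each one can be replaced by hand.

```python
def find_non_latin1(text: str):
    """Return (line, column, character) for each character outside ISO-8859-1."""
    problems = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            try:
                ch.encode("iso-8859-1")
            except UnicodeEncodeError:
                problems.append((line_no, col, ch))
    return problems


# Accented letters such as e-acute are part of ISO-8859-1;
# typographic ("curly") quotation marks are not.
sample = "A caf\u00e9 in Montr\u00e9al\nHe said \u201chello\u201d"

for line_no, col, ch in find_non_latin1(sample):
    print(f"line {line_no}, column {col}: {ch!r} (U+{ord(ch):04X})")
```

OCR output often contains curly quotes and long dashes, which are the characters most likely to be flagged.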

Proofing the text

This is pretty much what it sounds like: a careful comparison and correction of the OCR text to ensure that it matches the original book. I display the original scan using the Windows picture viewer, and the OCR with a very fine and extremely useful free program called Notepad++, or with the Notepad program that comes with Windows.

Modest though it is, my notebook computer is widescreen, so everything fits on the screen quite nicely. When I'm proofing what was the left-hand page on a scan, I move the Notepad++ window to the right, so the scan isn't blocked. When I'm proofing the right-hand page, I move the Notepad++ window to the left.

I follow the Project Gutenberg Distributed Proofreading rules while doing the proofing, for example rejoining words that were split across the end of one line and the start of the next. I keep a log of items needing research, such as split words where the hyphen might need to be retained, and so on.
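The rejoining of split words, together with the research log, could be partly automated. The following Python sketch is only an assumption about how one might do it, not a description of the Distributed Proofreading tools; the names `SPLIT_WORD` and `rejoin_split_words` are illustrative. It removes the hyphen in every case and records each joined word, so that words whose hyphen might need to be retained can be reviewed by hand afterwards.

```python
import re

# A word fragment ending in a hyphen at the end of a line, followed by the
# rest of the word (with any attached punctuation) on the next line.
SPLIT_WORD = re.compile(r"(\w+)-\n(\S+) ?")


def rejoin_split_words(text: str):
    """Rejoin words split across a line break; return (new_text, log)."""
    log = []

    def join(match):
        first, second = match.group(1), match.group(2)
        # Record every join so a human can decide later whether the
        # hyphen belongs to the word (as in "well-known").
        log.append(f"{first}-{second}")
        # The whole word moves up to the end of the first line.
        return f"{first}{second}\n"

    return SPLIT_WORD.sub(join, text), log


sample = "The exam-\nple shows a well-\nknown case.\n"
fixed, to_research = rejoin_split_words(sample)
print(fixed)
print(to_research)
```

In this sample the log would contain both "exam-ple" and "well-known"; only the second needs its hyphen restored, which is exactly the kind of decision the log is meant to support.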

One major difference from the DP method is that I am very careful to keep page number information. I put a blank line before and after each page number, and highlight the page number so that it is very easy to spot:


If the next line is the start of a sentence, I specifically indicate whether or not it is also the start of a new paragraph:


The second major divergence from DP is the way that I handle italicization. Instead of single leading and trailing underscores


I use a single leading and a double trailing underscore:


This greatly simplifies creating the HTML edition: first, I replace double underscores with "</i>", indicating the end of the italicization, and then I replace single underscores with "<i>", indicating the start of the italicization.

Later on, when creating the official Text version, I simply do a general change, replacing every double underscore with a single underscore.
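The two replacements described above are easy to express in a few lines of Python. This is a minimal sketch with illustrative function names, and it assumes that underscores appear in the proofed file only as italics markers.

```python
def italics_to_html(text: str) -> str:
    """Convert _italic__ markers to <i>italic</i>."""
    # Order matters: the closing marker "__" must be replaced first,
    # otherwise each "__" would be read as two opening markers.
    return text.replace("__", "</i>").replace("_", "<i>")


def italics_to_plain_text(text: str) -> str:
    """Convert _italic__ markers to the usual _italic_ form."""
    return text.replace("__", "_")


proofed = "She read _Emma__ twice.\n[End of _Title__ by Author]"

print(italics_to_html(proofed))
print(italics_to_plain_text(proofed))
```

Because the opening and closing markers differ, no pairing logic is needed: each marker says on its own whether it starts or ends the italics.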

I make sure that any passages of poetry are indented from the left margin, and I put in special [**REMARKS] pointing out where the original book has a section of indented prose, for example a quoted passage from a letter.

At the end of the book, I add an annotation

[End of _Title__ by Author]

to make it clear that this is the end of the book, and the ebook file is complete.

Checkover of proofed text

This involves such things as:

First major save: the PROOFED version

Actually, I'm an addict when it comes to saving intermediate versions. But this save is particularly important, because it archives the proofed OCR text before any further text manipulation happens.

Creation of the BASE version

This step involves text manipulation that is needed for both the final Text version and the final HTML version. It involves:

(1) Manually removing the page numbers from the file.

(2) Removing line breaks within paragraphs. For this I use Tom's eTextReader, a free and very useful program designed specifically for viewing Gutenberg text files and for reformatting them.

(3) Reformatting the file so that lines have a standard length of 65-70 characters. I do this using Notepad++'s TextFX Edit "ReWrap Text" tool. I have also done this rewrapping using the KATE text editor included with the free Kubuntu version of desktop Linux.

(4) Saving what is now the BASE file, the common parent of the final Text and HTML versions.
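Steps (2) and (3) can also be done with a short script instead of the tools named above. The Python sketch below is one possible approach, not the volunteer's own; the function name `rewrap` is illustrative. It assumes that paragraphs are separated by a blank line and that any paragraph starting with a space or tab (such as indented poetry) should be left exactly as it is.

```python
import textwrap


def rewrap(text: str, width: int = 70) -> str:
    """Remove line breaks inside paragraphs, then wrap to at most `width` characters."""
    result = []
    for para in text.split("\n\n"):
        # Leave empty chunks and indented paragraphs (e.g. poetry) untouched.
        if not para.strip() or para.startswith((" ", "\t")):
            result.append(para)
            continue
        # Step (2): join the paragraph's lines into one long line.
        joined = " ".join(line.strip() for line in para.splitlines() if line.strip())
        # Step (3): wrap the long line to the standard length.
        result.append(textwrap.fill(joined, width=width))
    return "\n\n".join(result)


sample = (
    "It was a bright\ncold day in April,\nand the clocks were striking.\n\n"
    "    Roses are red,\n    Violets are blue.\n\n"
    "The end."
)
print(rewrap(sample))
```

A width of 70 keeps every line within the 65-70 character range mentioned in step (3); a smaller value can be passed as `width` if shorter lines are preferred.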

Creation of the TEXT version

I simply take a copy of the BASE version, and do the following:

Creation of the HTML version

I take a copy of the BASE version, and do the following:

Final readthrough

I simply read the HTML version with all the care that I can muster, and correct (in both the Text and the HTML versions) any remaining errors I find. This is where the archived file with the page numbers becomes valuable. If I see something that looks not quite right, I copy the words involved, and do a search for this text in the saved file with page numbers. Once I have the page numbers, it's a simple matter to find the original scanned page and see what the original text was.


So this is my procedure for the moment. It will continue to evolve, but I can say that it is already a rather straightforward process that does not get me entangled in technical issues. Particularly if a book is cleanly printed, creating an ebook is highly enjoyable. Naturally, I do end up reading the book more than once, so I take great care to select books that I enjoy so much that reading them again is a pleasure rather than a duty.