Save HTML webpage to a PDF Document

Why?

You get to access it offline. It’s more readable. You can annotate, leave comments, and so on from a PDF client (Adobe Acrobat Reader, anyone?). You can even track your progress inside the document. If your client is really friendly, it can even reopen the document where you left off.

Well, for me it’s merely a matter of convenience. For some reason I always prefer a PDF manual over an HTML/CHM one (think: http://docs.python.org/download.html). Does it really matter if the PDF is three times the size of the HTML archive? Storage isn’t really a pressing concern these days, is it?

Idea:

Multiple HTML pages to a Single PDF.

Plenty of websites offer to convert an HTML document to PDF on the fly. They don’t serve the purpose here, and I’m not very big on registering everywhere!

For our little experiment, let’s pick a URL. I suggest http://www.catb.org/~esr/writings/homesteading/cathedral-bazaar/index.html.

(This is “The Cathedral and the Bazaar” by Eric S. Raymond; for further reading, go to https://en.wikipedia.org/wiki/The_Cathedral_and_the_Bazaar.)

#1 Download web pages recursively using wget

# create a working directory under home (think: less clutter) and move into it
$ mkdir -p ~/our_little_experiment && cd ~/our_little_experiment

# download the page, then recursively download every page it links to, into the current directory
$ wget -v -r http://www.catb.org/~esr/writings/homesteading/cathedral-bazaar/index.html

FINISHED --2012-10-07 20:15:47--
Total wall clock time: 2m 51s
Downloaded: 16 files, 164K in 1.9s (88.7 KB/s)
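The plain -r run above fetched exactly the 16 pages we wanted, but on other sites recursion can wander into parent directories and pull in much more. A minimal sketch of a tighter invocation follows. The function name fetch_pages and the extra flags are my additions, not part of the original run: -np (--no-parent) keeps wget below the starting directory, and -A html keeps only files whose names end in html.

```shell
# Assumption: same start URL as above; fetch_pages is a hypothetical helper name.
url='http://www.catb.org/~esr/writings/homesteading/cathedral-bazaar/index.html'

fetch_pages() {
    # -r       recurse into linked pages
    # -np      never ascend to the parent directory
    # -A html  keep only files whose names end in "html"
    wget -r -np -A html "$1"
}

# Uncomment to run (needs network access):
# fetch_pages "$url"
```

For this particular URL the result should be the same set of files; the extra flags only matter when a page links outside its own directory.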

A little later …

# let’s look at the files generated by wget (an awesome tool, btw!)
$ cd ~/our_little_experiment/www.catb.org/~esr/writings/homesteading/cathedral-bazaar

$ ls -1
ar01s02.html
ar01s03.html
ar01s04.html
ar01s05.html
ar01s06.html
ar01s07.html
ar01s08.html
ar01s09.html
ar01s10.html
ar01s11.html
ar01s12.html
ar01s13.html
ar01s14.html
ar01s15.html
ar01s16.html
index.html

# index.html is actually chapter 1, so rename it to ar01s01.html to make it sort first
$ mv index.html ar01s01.html
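Why the rename matters: the shell expands *.html in sorted order, and as far as I know htmldoc puts the files into the PDF in the order it receives them. Left alone, index.html would sort after every ar01sNN.html and chapter 1 would land at the end. Here is a small sketch that shows the ordering, using empty stand-in files in a scratch directory (not the real download):

```shell
# Scratch demo: empty stand-in files, not the pages fetched by wget
demo=$(mktemp -d)
cd "$demo" || exit 1
touch index.html ar01s02.html ar01s03.html ar01s10.html

# Glob order before the rename: index.html comes last
before=$(printf '%s ' *.html)

mv index.html ar01s01.html

# Glob order after the rename: chapter 1 comes first
after=$(printf '%s ' *.html)

echo "before: $before"
echo "after:  $after"
```

The same printf '%s ' *.html check can be run in the real download directory to preview the page order before building the PDF.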

#2 Install htmldoc

$ sudo apt-get install htmldoc

#3 Create the PDF document

$ htmldoc --webpage -t pdf14 -v -f catb_cathedral_bazaar.pdf *.html

Output: catb_cathedral_bazaar.pdf

Voila!

All the links work perfectly.

I’d be happy to guide anyone doing it on w32.

Please don’t use “win” as an abbreviation for Microsoft Windows in GNU software or
documentation. In hacker terminology, calling something a “win” is a form of praise. If you
wish to praise Microsoft Windows when speaking on your own, by all means do so, but not
in GNU software. Usually we write the name “Windows” in full, but when brevity is very
important (as in file names and sometimes symbol names), we abbreviate it to “w”. For
instance, the files and functions in Emacs that deal with Windows start with ‘w32’.

— GNU Standards

http://www.gnu.org/prep/standards/standards.html#Trademarks-1

 

Happy Hacking!

References:

http://freecode.com/projects/htmldoc

http://www.htmldoc.org/software.php
