Version 0.1.0 commit

ncraun · Aug 23, 2013 · commit aa31f3e

Showing 14 changed files with 2,098 additions and 0 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,24 @@
+# Binaries
+smoothscan
+*.o
+
+# autotools
+configure
+.deps/
+Makefile
+Makefile.in
+aclocal.m4
+autom4te.cache/
+config.h
+config.h.in
+config.log
+config.status
+depcomp
+install-sh
+missing
+stamp-h1
+
+# emacs
+\#*#
+*~
+
diff --git a/AUTHORS b/AUTHORS
@@ -0,0 +1,4 @@
+Project Lead
+------------
+
+Nate Craun <[email protected]>
diff --git a/COPYING b/COPYING
diff --git a/ChangeLog b/ChangeLog
@@ -0,0 +1,8 @@
+smoothscan 0.1.0
+
+For a high level overview of changes, added and deprecated features,
+and so on please see the NEWS file. 
+
+For a source level description of changes, check out a copy of the
+source code from the project's GitHub page, and examine the git commit
+log.
diff --git a/INSTALL b/INSTALL
@@ -0,0 +1,43 @@
+smoothscan 0.1.0
+
+Dependencies
+------------
+
+Please install the following dependencies before smoothscan.
+
+leptonica: http://leptonica.com/
+libharu: http://libharu.org/
+potrace: http://potrace.sourceforge.net/
+fontforge (compiled w/ python support): http://fontforge.org/
+python: http://www.python.org/
+
+In order for leptonica to be able to read and write image formats, you
+will have to install the library for that particular image format. See
+the leptonica documentation for more info. In order for smoothscan to
+work properly, leptonica must be compiled with at least tiff and png
+support.
+
+autotrace can be used in place of potrace, but potrace generally
+produces better quality vectorizations.
+
+Downloading
+-----------
+
+First make sure you always download the latest version of the
+software. You can get it from the project's homepage at
+<https://natecraun.net/projects/smoothscan/>
+
+Installation
+------------
+
+Installation follows the standard GNU Autotools procedure:
+
+./configure
+make
+sudo make install
+
+See ./configure --help for more information on options.
+
+Pay attention to the output from configure. It will warn you if you
+are missing any dependencies or if any problems were encountered with
+the build.
diff --git a/Makefile.am b/Makefile.am
@@ -0,0 +1,4 @@
+bin_PROGRAMS = smoothscan
+dist_bin_SCRIPTS = src/smoothscan-fontgen.py
+smoothscan_SOURCES = src/smoothscan.c src/smoothscan.h
+dist_man1_MANS = doc/smoothscan.1
diff --git a/NEWS b/NEWS
@@ -0,0 +1 @@
+smoothscan 0.1.0: first release
diff --git a/README b/README
@@ -0,0 +1,68 @@
+smoothscan 0.1.0
+
+Description
+-----------
+
+smoothscan is a tool to convert scanned text into a vectorized output
+form. Because printed text is assembled from fonts, each particular
+letter (like 'o') will have the same shape as every other 'o' in the
+document. We can take advantage of this by building a table of such
+symbols and representing each occurrence of a symbol with a reference to
+that symbol's table entry. This will save a lot of space, and a
+similar idea is used in djvu's jb2 mode and JBIG2 for PDF.
+
+smoothscan builds up this table, but instead of filling the table with
+the original raster images, it vectorizes each symbol. Vector images
+will look smoother than their raster equivalents, and can be scaled
+without introducing pixelation. These properties result in a smaller
+output file size, as well as making the scanned text images more
+readable.
+
+smoothscan saves the vectorized images into a custom TrueType font and
+embeds the font into the output pdf file. Currently each symbol is
+mapped to an arbitrary letter in the font, but in future versions you
+could run OCR on each symbol, and ensure that the 'o' image is
+associated with the 'o' character encoding in the generated font.
+
+To get good results, you must have good input. Higher resolution scans
+capture more detail about the shape of each symbol, so a higher
+quality vectorized version can be created. It's a good idea to process
+your scanned images using a tool like ScanTailor before running
+smoothscan.
+
+Currently smoothscan can only process pure black and white 1bpp images,
+but in the future support will be added for other formats, especially
+ScanTailor's Mixed output mode.
+
+smoothscan is currently targeted at GNU/Linux based systems, but
+Windows and OS X will be supported in future versions.
+
+Downloading
+-----------
+
+You can always find the latest release on the project home page at
+<https://natecraun.net/projects/smoothscan/>. Developers can check out
+the latest bleeding edge source code on the project's GitHub page:
+<https://github.com/ncraun/smoothscan>
+
+Installation
+------------
+
+Please see the INSTALL file for installation instructions.
+
+News
+----
+
+Please see the NEWS file for a high level overview of changes since
+the last release.
+
+License
+-------
+
+smoothscan is licensed under the GPLv3+. Please see the COPYING file
+for the full text of this license.
+
+Authors
+-------
+
+Please see the AUTHORS file for the list of contributors.
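The symbol-table idea described in the README can be illustrated with a toy sketch. It is not smoothscan's code: the function name `build_symbol_table` and the glyph representation (tuples of row strings) are assumptions made for illustration, and glyphs are matched only when they are exactly identical. A real classifier, such as the correlation-based one in leptonica that configure.ac checks for, groups shapes that are merely similar.

```python
# Toy sketch of the symbol-table idea; not smoothscan's actual code.
# Assumption: each glyph is a tuple of row strings, and two glyphs
# match only when they are identical. A real classifier groups
# near-identical shapes instead.

def build_symbol_table(glyphs):
    """Return (table, refs): unique glyphs plus one table index per occurrence."""
    table = []      # unique glyph bitmaps, in first-seen order
    index_of = {}   # glyph bitmap -> position in table
    refs = []       # one table index per glyph occurrence on the page
    for glyph in glyphs:
        if glyph not in index_of:
            index_of[glyph] = len(table)
            table.append(glyph)
        refs.append(index_of[glyph])
    return table, refs


letter_o = ("010", "101", "010")
letter_l = ("1", "1", "1")
table, refs = build_symbol_table([letter_o, letter_l, letter_o, letter_o])
print(len(table), refs)  # 2 [0, 1, 0, 0]
```

Each repeated 'o' costs one small integer instead of a full bitmap, which is where the space saving comes from.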
diff --git a/TODO b/TODO
@@ -0,0 +1,13 @@
+Major TODO
+----------
+
+* Support for ScanTailor "Mixed" Images
+* Map symbols from OCR, not just arbitrary code points
+* Multithreaded font generation for speed increase
+
+Minor TODO
+----------
+
+* Windows Support
+* OS X Support
+* Generate other formats such as EPUB and HTML, with web fonts
diff --git a/configure.ac b/configure.ac
@@ -0,0 +1,49 @@
+
+AC_PREREQ([2.69])
+AC_INIT([smoothscan], [0.1.0], [[email protected]])
+AM_INIT_AUTOMAKE([gnu])
+AC_CONFIG_SRCDIR([src/smoothscan.h])
+AC_CONFIG_HEADERS([config.h])
+
+# Checks for programs.
+AC_PROG_CC
+
+AC_CHECK_PROG([FONTFORGE], [fontforge], 1, 0)
+if test $FONTFORGE = 0
+then
+   AC_MSG_ERROR([fontforge not found])
+fi
+AC_CHECK_PROG([PYTHON], [python], 1, 0)
+if test $PYTHON = 0
+then
+   AC_MSG_ERROR([python not found])
+fi
+AC_CHECK_PROG([POTRACE], [potrace], 1, 0)
+if test $POTRACE = 0
+then
+   AC_MSG_ERROR([potrace not found])
+fi
+# fontforge
+# python
+# potrace
+# leptonica
+# libharu
+
+# Checks for libraries.
+AC_CHECK_LIB([lept], [jbCorrelationInitWithoutComponents], [], [AC_MSG_ERROR([leptonica library not found or not usable])])
+AC_CHECK_LIB([hpdf], [HPDF_New], [], [AC_MSG_ERROR([libharu library not found])])
+# Checks for header files.
+AC_CHECK_HEADERS([stdlib.h string.h unistd.h])
+
+AC_CHECK_HEADERS([leptonica/allheaders.h], [], [AC_MSG_ERROR([Leptonica headers not found or not usable])])
+AC_CHECK_HEADERS([hpdf.h], [], [AC_MSG_ERROR([libharu headers not found or not usable])])
+
+# Checks for typedefs, structures, and compiler characteristics.
+AC_TYPE_SIZE_T
+
+# Checks for library functions.
+AC_FUNC_MALLOC
+AC_CHECK_FUNCS([mkdir strerror])
+
+AC_CONFIG_FILES([Makefile])
+AC_OUTPUT
diff --git a/doc/smoothscan.1 b/doc/smoothscan.1
@@ -0,0 +1,59 @@
+.TH smoothscan 1
+.SH NAME
+smoothscan \- convert scanned document pages to a vectorized font pdf
+.SH SYNOPSIS
+.B smoothscan 
+[debug-options] [options] -o output.pdf input_files
+.SH DESCRIPTION
+.B smoothscan 
+is a document processor. It will analyze the input page images, and create a dictionary of similar images. One 'o' on the page should have similar enough shape to another 'o' of the same font, so we can save space by only storing the data for 'o' once, and just referring to that stored data for all other 'o's on the pages. Then smoothscan will convert the dictionary from a set of raster glyphs to a vectorized truetype font, and create a pdf file with all necessary fonts embedded.
+.SH OPTIONS
+.TP
+.I input_files
+The list of 1bpp TIFF input files, one file per page.
+.PP
+.B Regular Options:
+.PP
+.TP
+\fB\-o, \-\-output\fR=\fIFILE\fR
+Specify output file
+.TP
+\fB\-t, \-\-thresh\fR=\fIVALUE\fR
+Specify the threshold value (value for correlation). 
+Valid input is from [0.40 - 0.98].
+Recommended values for scanned text from [0.80 - 0.85].
+Default is 0.85.
+.TP
+\fB\-w, \-\-weight\fR=\fIVALUE\fR
+Specify the weight value (correcting threshold for thick characters).
+Valid input is from [0.0 - 1.0]. 
+Recommended values for scanned text from [0.5 - 0.6]. 
+Default is 0.5.
+.TP
+.B \-h, \-\-help
+Display basic usage information.
+.TP
+.B \-v, \-\-version
+Display version information.
+.PP
+.B Debug Options:
+.TP
+\fB \-\-debug\-tmpdir\fR=\fITMPDIR\fR
+Use the specified tmpdir instead of creating a new directory in the system temp directory.
+.TP
+.B \-\-debug\-draw\-borders
+Draw red rectangles around the calculated text positions on the output pdf pages. Useful for making sure glyphs are being positioned correctly.
+.TP
+.B \-\-debug\-render\-pages
+Use the raster dictionary to generate raster images of each page in addition to the vectorized fonts. Useful for inspecting the glyph classification results.
+.TP
+.B \-\-debug\-skip\-font\-gen
+Skip regeneration of the fonts, and just use existing generated fonts in the specified tmpdir. Generating fonts is the longest part of the procedure, so if you aren't debugging the font generation code there is no need to regenerate the font for each test.
+.TP
+.B \-\-debug\-no\-clean\-tmpdir
+Don't delete the tmpdir after processing is complete. Useful for inspecting the generated temporary files (fonts and split characters).
+.PP
+Debug options are only useful if the program is misbehaving and you are trying to diagnose what the problem is. Debug options are also not considered stable, and are very subject to change. Do NOT rely on the presence of debug options in any extension, or script. If a debug option is particularly useful in the general case, it may be upgraded to a normal option, but as long as it has the \fB\-\-debug\-\fR prefix, it could be removed at any time.
+.PP
+.SH AUTHORS
+Nate Craun <[email protected]>
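The option ranges documented in the man page can be sketched as a small argument parser. This is a hypothetical Python mirror written for illustration, not the parser in src/smoothscan.c: the names `bounded_float`, `build_parser`, and the program name `smoothscan-sketch` are assumptions, and only the ranges and defaults come from the man page.

```python
import argparse


def bounded_float(lo, hi):
    """Build an argparse type that accepts floats in the closed range [lo, hi]."""
    def parse(text):
        value = float(text)
        if not lo <= value <= hi:
            raise argparse.ArgumentTypeError(
                "%s is outside [%s, %s]" % (value, lo, hi))
        return value
    return parse


def build_parser():
    # Hypothetical mirror of the documented regular options.
    parser = argparse.ArgumentParser(prog="smoothscan-sketch")
    parser.add_argument("-o", "--output", required=True, metavar="FILE")
    parser.add_argument("-t", "--thresh", metavar="VALUE",
                        type=bounded_float(0.40, 0.98), default=0.85)
    parser.add_argument("-w", "--weight", metavar="VALUE",
                        type=bounded_float(0.0, 1.0), default=0.5)
    parser.add_argument("input_files", nargs="+")
    return parser


args = build_parser().parse_args(["-o", "out.pdf", "-t", "0.8", "page1.tif"])
print(args.thresh, args.weight, args.input_files)
```

An out-of-range value such as `-t 0.2` makes argparse print an error and exit, which keeps invalid thresholds from reaching later stages.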
diff --git a/src/smoothscan-fontgen.py b/src/smoothscan-fontgen.py
@@ -0,0 +1,110 @@
+#!/usr/bin/python
+
+#  This file is part of smoothscan.
+#
+#  smoothscan is free software: you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published
+#  by the Free Software Foundation, either version 3 of the License,
+#  or (at your option) any later version.
+#
+#  smoothscan is distributed in the hope that it will be useful, but
+#  WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+#  General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with smoothscan. If not, see <http://www.gnu.org/licenses/>.
+
+
+# Fontforge has a c library interface, but there are no docs for it,
+# and even the author recommends using the python interface instead of
+# the C library interface.
+
+import fontforge
+import psMat
+import glob
+import os
+import sys
+import subprocess
+
+def removePrefix(text, prefix):
+    return text[len(prefix):] if text.startswith(prefix) else text
+
+def removePostfix(text, postfix):
+    return text[:-len(postfix)] if postfix and text.endswith(postfix) else text
+
+ffVersion = fontforge.version()
+print ("Using Fontforge version: " + ffVersion)
+
+if (len(sys.argv) != 6):
+    print ("Usage: smoothscan-fontgen.py fontdir outname latticeh latticew fontnum")
+    sys.exit(1)
+
+fontdir = sys.argv[1]
+outname = sys.argv[2]
+latticeh = int(sys.argv[3])
+latticew = int(sys.argv[4])
+fontnum = int(sys.argv[5])
+# command line args
+
+print ("Scaling to x: " + str(latticeh) + " y: " + str(latticeh))
+print ("Converting " + fontdir  + "/*.png to " + outname)
+
+
+newFont = fontforge.font() 
+newFont.encoding = "koi8-r"
+
+scaley = latticeh/100.0
+matrix = psMat.scale(scaley, scaley)
+
+newFont.layers[0].is_quadratic = True
+vecNames = []
+
+for f in glob.glob(fontdir + "/*.png"):
+    vecNames.append(f)
+
+vecNames.sort()
+
+for f in vecNames:
+
+    cp = int(removePostfix(removePrefix(f, fontdir+"/"), ".png"))
+    newFont.createMappedChar(cp)
+    currGlyph = newFont[cp]
+    currGlyph.importOutlines(f)
+    # fontforge's autoTrace will invoke either autotrace or potrace
+    # depending on what is installed on the system. In general,
+    # potrace gives better results. This tracing step should probably
+    # be moved into the c component, using libpotrace.
+    currGlyph.autoTrace()
+    currGlyph.width = latticew
+    currGlyph.transform(matrix)
+    currGlyph.simplify()
+
+    # If fontforge sees a nearly blank character, it won't output it,
+    # which will cause errors in the resulting pdf. Setting the width
+    # manually should fix this, but this check is in here to make
+    # sure.
+    if (not currGlyph.isWorthOutputting()):
+        print (f + " not worth outputting, failed to render character")
+
+
+# Not sure about this part. Fontforge was complaining about invalid
+# cvt and prep tables, during autoInstr in the first loop, so we just
+# clear them, and autoInstr in a separate loop.
+newFont.setTableData('cvt', None)
+newFont.setTableData('prep', None)
+
+for currGlyph in newFont.glyphs():  
+    currGlyph.autoInstr()
+
+fn = "SmoothScans" + str(fontnum)
+newFont.fontname = fn
+newFont.fullname = fn
+newFont.familyname = fn
+newFont.comment = "Generated by smoothscan"
+# By default, fontforge includes the username in the copyright. We
+# want to respect our user's privacy, so we clear it for them.
+newFont.copyright = ""
+
+newFont.generate(outname)
+
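The script above derives each glyph's code point from its file name (for example `65.png` maps to code point 65). A minimal standalone sketch of that step follows, runnable without fontforge. The helper name `code_point_from_path` and the sample paths are assumptions for illustration. Unlike the script, which sorts file names as strings, this sketch sorts numerically by code point.

```python
import os


def code_point_from_path(path, suffix=".png"):
    """Return the integer code point encoded in a glyph file name."""
    name = os.path.basename(path)
    if not name.endswith(suffix):
        raise ValueError("unexpected glyph file name: " + path)
    # Strip the suffix; int() raises ValueError if the rest is not a number.
    return int(name[:-len(suffix)])


# Hypothetical glyph files, as the C component might have written them.
paths = ["glyphs/65.png", "glyphs/10.png", "glyphs/9.png"]
ordered = sorted(paths, key=code_point_from_path)
print(ordered)  # ['glyphs/9.png', 'glyphs/10.png', 'glyphs/65.png']
```

Checking the suffix before slicing avoids silently truncating a name that does not end in `.png`.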