open-source character recognition

GOCR/JOCR

ToDo/Task-List v0.2
Last update March 29, 2001

This file is in (eternal) progress.

This is the detailed ToDo or task list for the SourceForge (SF) developers. Happily, the group of developers is growing fast. I created this file to help manage development. Hope it helps.
Send your suggestions and comments if they are not listed here. If you are a developer, look for a topic that interests you. If you want to become a developer, ask Jörg for CVS write permission, etc. Choose a task and send an email (preferably to the mailing list) with your choice, including:

what you chose
how long it will take to implement, and your deadline
comments: what you plan to change or have already changed, especially if you change the interface to other parts of the program

It's wise to ask first and code later. That way you avoid a reply such as "Sorry, you wasted 5 hours of your life, it's already implemented." And be sure to get the latest CVS version.
Uff, how to start? Thinking... writing... Look at the TODO and REVIEW files too.

TOPICS:

1 discussions/road-map
- 1.0 name of the package
2 development
- 2.1 database-module
- 2.2 ocr0-engine
- ...
- 2.10 learn mode
- ...
3 man/doc-writing
- 3.1 developer-introducing
- 3.2 FAQ for users
- 3.3 this document (better HTML) ?
4 graphic design
- 4.1 how to design the frontend
5 Tester/Bug-finder (Mac/Win/... platforms)
- 5.1 speed optimizing
- 5.2 quality check
6 packager (rpm,Makefile,configure)

1 discussions/road-map

  I would like to give some points you should keep in mind:
  - write simple and easily understandable code
  - use only a few libraries, especially for the command-line based gocr
    - Users often have low-end systems. Gocr reads a file and writes a file.
      This should be possible with standard libs on any system.
  - write good comments; remember that another programmer has to understand
    and change your code
  - be careful when using a lot of memory or recursive functions.
     I can often make xli or gimp run out of memory on my 128MB PC. :(
  - ... (I remember that there are some guidelines for programmers on the web,
    maybe I should insert some links to the "best" of them here.)
   

1.0 name of the package
  I am a bit unhappy about having gocr on freshmeat and jocr on SourceForge
  for the same thing. I think gocr is the better name, more meaningful.
  On SourceForge this name was already in use.
  What now?

------------------------------------------------------------

2 development

 2.1. database-module
  [0.2 update] What is going to happen with the database module? I can't see a
  way to implement a general database in libgocr, so it will probably be
  placed in the main module.

  The database module is well separated from the ocr0-engine. It is
  a directory containing special fonts (e.g. Greek).
  We need a special routine able to compare one unknown char with
  the chars of the database in a robust way.
  I have done a first implementation, which works more badly than well;
    see load_db(), ocr_db(), distance2()
  But the distance2() function does not work well enough.
  Try to write a better function in a separate file database.c (?).

  The distance function (distance2) has to be robust against
  differences in size and small rotation angles.
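
  To make the size requirement concrete, here is a minimal sketch of a
  size-normalized distance. It is not the existing distance2(): the type
  glyph_t, the constant GRID and the function glyph_distance() are
  hypothetical names chosen for illustration. Both glyphs are sampled on
  the same fixed grid, so only the shape is compared. Robustness against
  small angles is not handled here.

```c
#include <stddef.h>

#define GRID 16  /* assumed normalization grid, not a gocr constant */

/* Hypothetical glyph type: w*h bytes, row-major, nonzero = black. */
typedef struct {
    int w, h;
    const unsigned char *pix;
} glyph_t;

/* Sample the glyph at grid cell (gx, gy) using nearest neighbour. */
static int sample(const glyph_t *g, int gx, int gy)
{
    int x = (2 * gx + 1) * g->w / (2 * GRID);
    int y = (2 * gy + 1) * g->h / (2 * GRID);
    return g->pix[(size_t)y * g->w + x] != 0;
}

/* Return the share of differing grid cells in percent (0 = same shape). */
int glyph_distance(const glyph_t *a, const glyph_t *b)
{
    int gx, gy, diff = 0;

    if (a->w <= 0 || a->h <= 0 || b->w <= 0 || b->h <= 0)
        return 100;  /* empty glyphs never match */
    for (gy = 0; gy < GRID; gy++)
        for (gx = 0; gx < GRID; gx++)
            if (sample(a, gx, gy) != sample(b, gx, gy))
                diff++;
    return diff * 100 / (GRID * GRID);
}
```

  A glyph and an enlarged copy of it give distance 0 with this sketch; a
  real implementation would probably also need to tolerate small shifts.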

  UPDATE: It will probably be part of one of the engine plugins now. The
  development of this feature is frozen now.

 2.1.1 detect difficult regions

  It could be useful to check each box/cluster for regions
  which are unstable under shrinking. I mean, if you shrink a character,
  lines get thinner or break into two pieces. Or, in the other case,
  the end of a line touches a black region in its immediate neighbourhood.
  If we are able to detect such regions and mark or list them, it could
  be easier to correct errors by applying filters and similar things only
  in these regions.

  estimated time: depends heavily on your skills, probably more than 100h

------------------------------------------------------------
 2.2. ocr0-engine
  This is the main part, I think. I have a fast-growing database of
  image files.
  Some of them show that characters are recognized wrongly.
  You have to take such an example, look at the engine to find out why
  the error occurs, and find a good way to fix the problem.
  Afterwards, all other examples should work as they did before the
  changes, or even better.
  That is what I have done for most of the development time.

------------------------------------------------------------
 2.2.1 ocr0-elementary functions
  The ocr0 engine uses a small set of functions:
  loop(), turmite(), num_cross(), get_bw(), get_line()
  get_line2() function must be improved
   - should detect if 2 points are connected by a black line
   - should work for any resolution
   - tolerance parameter should be used
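
  The three wishes above could be sketched as follows. This is not the
  existing get_line2(): the type gray_img and the function line_is_black()
  are hypothetical, and the pixel convention (0 = black, 255 = white,
  threshold cs) is an assumption. The tolerance is given in percent of the
  line length, so the result does not depend on the resolution.

```c
#include <stdlib.h>

/* Hypothetical image type: 8-bit gray, row-major, 0 = black, 255 = white. */
typedef struct {
    int w, h;
    const unsigned char *p;
} gray_img;

/* Walk the straight line from (x0,y0) to (x1,y1) with Bresenham's
 * algorithm. Return 1 if at most tol_percent of the visited pixels are
 * white (value >= cs) or outside the image, otherwise 0. */
int line_is_black(const gray_img *img, int x0, int y0, int x1, int y1,
                  int cs, int tol_percent)
{
    int dx = abs(x1 - x0), sx = x0 < x1 ? 1 : -1;
    int dy = -abs(y1 - y0), sy = y0 < y1 ? 1 : -1;
    int err = dx + dy, e2;
    int total = 0, white = 0;

    for (;;) {
        total++;
        if (x0 < 0 || y0 < 0 || x0 >= img->w || y0 >= img->h
            || img->p[(size_t)y0 * img->w + x0] >= cs)
            white++;
        if (x0 == x1 && y0 == y1)
            break;
        e2 = 2 * err;
        if (e2 >= dy) { err += dy; x0 += sx; }
        if (e2 <= dx) { err += dx; y0 += sy; }
    }
    /* relative tolerance: same answer for 100 dpi and 600 dpi scans */
    return white * 100 <= tol_percent * total;
}
```

  With tol_percent = 0 every pixel on the line must be black; a value such
  as 20 would accept a line with small gaps from a bad scan.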
 
------------------------------------------------------------
 2.2.2 ocr0 special chars
  Unicode implementation is currently being done.

------------------------------------------------------------
 2.2.3 ocr0 splitting
  Put the ocr0 args into one structure which can be used as the argument
  for ocr0a()...ocr0z(). Good idea? I am not sure.
  splitting is good for: 
   - better reading the code (reducing number of lines, compilation time)
   - better working on code
   - better reorganization
  but worse
   - for speed and ... what else?

  votes: splitting=2              no_splitting=0


------------------------------------------------------------
 2.2.4 using probabilities and alternatives if characters are bad
 ini_list(), exclude() and getresult() are a first approach (ansatz), look ...

  Used for other things like:

  list="abcde..."
  wert={100,100,100,100,100,...
 /* wert is the German word for value; sorry. 100 means 100%, the initial
    value could also be 1000 or something else */

  test_if_char_is_list[2]==c, function tells: it is never a 'c' => wert[2]=0
  test_if_char_is_list[0]==a, may be it is, but not sure => wert[0]=80

  At the end, look at the highest wert[i] and take list[i] as the result.
  If there are problems, look at the second highest wert[j].
 
  clear?

   hmm, think I have to be more precise, an example:

    list={ 'c', 'e' }
    wert={ 100, 100 } // 100 could be also 1000 or MAX_VALUE

 test_left_bow(box) // only an example function!
 result:
    wert={ 100, 100 }  // left bow detected, not changed (could be c or e)

 test_horizontal_middle_line() // only an example function!
 result:
    wert={ 100,  50 }  // no middle line detected, but could be a bad scan
                       // old_val=100 weight=0.5 => 100*0.5=50

 test_crotchet_on_right_upper_end() // only an example!
 result:
    wert={  80,  60 } // crotchet was detected, typical for e but not for
                      //  c, so the probability for c is lowered, but for e
                      //  it is raised to (MAX_VALUE-old_value*weight),
                      // weight=0..1
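
  The c/e example can be written down as a small sketch. The helper names
  wert_lower(), wert_raise() and best_index() are hypothetical, and the
  weight is passed in percent to stay with integers. The raise formula is
  taken literally from the comment above. A weight of 0.8 in the last step
  is an assumption; it reproduces the values 80 and 60.

```c
#define MAX_VALUE 100  /* 100 means 100%, as in the example above */

/* Lower a candidate: new = old * weight (weight given in percent). */
int wert_lower(int old_value, int weight_percent)
{
    return old_value * weight_percent / 100;
}

/* Raise a candidate as written above: MAX_VALUE - old * weight. */
int wert_raise(int old_value, int weight_percent)
{
    return MAX_VALUE - old_value * weight_percent / 100;
}

/* Return the index of the highest wert, or -1 for an empty list. */
int best_index(const int *wert, int n)
{
    int i, best = -1;

    for (i = 0; i < n; i++)
        if (best < 0 || wert[i] > wert[best])
            best = i;
    return best;
}
```

  Starting from wert={100,100}, wert_lower(wert[1], 50) gives the 50 of
  the second step, and wert_lower(wert[0], 80) with wert_raise(wert[1], 80)
  gives {80, 60}, so best_index() still points at 'c'.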


------------------------------------------------------------
 2.3. library
  In development, called libgocr, and already working. We still need to
  do/decide the following:
  - frontend communication architecture
     call_this_notifier_if_progress(int (*notifierfkt)(char *what_happens, int percent))
     ask_user(char *something)
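
  One possible shape for the progress notifier, as a sketch only. The
  architecture is still undecided, so everything except the name
  call_this_notifier_if_progress() and the two callback parameters is an
  assumption: the typedef notifier_fn, the helper report_progress(), the
  example callback remember_progress() and the use of const char * are
  made up for illustration and are not libgocr API.

```c
#include <stddef.h>

/* Callback type: what happens right now, and progress in percent. */
typedef int (*notifier_fn)(const char *what_happens, int percent);

static notifier_fn current_notifier = NULL;

/* The frontend registers its callback once (NULL switches it off). */
void call_this_notifier_if_progress(notifier_fn fn)
{
    current_notifier = fn;
}

/* The engine calls this from long loops. Returns the callback's result,
 * or 0 if no callback is installed. */
int report_progress(const char *what_happens, int percent)
{
    if (current_notifier == NULL)
        return 0;
    return current_notifier(what_happens, percent);
}

/* Example frontend callback: just remember the last percentage. */
int last_percent = -1;

int remember_progress(const char *what_happens, int percent)
{
    (void)what_happens;  /* a GUI would show this text */
    last_percent = percent;
    return 0;
}
```

  A frontend would call call_this_notifier_if_progress(remember_progress)
  once at startup. The return value could later be used to let the user
  cancel a long run.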

------------------------------------------------------------
 2.4 ocr2-engine (feature based)
  What do you think about completely different engines? The main program could
  switch from one engine to another if problems arise.
  The form-based and database engine is partly implemented.
  The third way should be a feature-based engine.
  I have started on ocr1.cc ocr2(). This engine should find the essentials of
  each char. A list of the longest lines and bows found should be compared with a
  database. This database should contain the essentials of letters, like
  'A' is built from 3 lines between points p1,p2,p3,p4,...
  Point p2 lies "near" the middle of p1,p3 or similar.
  UPDATE: will be a modular engine.

------------------------------------------------------------
 2.5 fax/screen-fonts
  What is the best way to detect mini fonts?
  Gocr could work together with screen grabbers, translation programs or
  speech programs.
 
------------------------------------------------------------
 2.6 detect lines and boxes
  - detect underlined text, frames around boxes

------------------------------------------------------------
 2.7 store images
  If -m 4 is used, pictures should be detected.
  Create a list of pictures and write functions:
   int get_num_pictures(), getpicture(int num,pix *dest,...)
  Use these functions to create images o_imgXXX.pnm

------------------------------------------------------------
 2.8 font-type/serifs detection
  I would like to see a function able to distinguish between
  italic, bold, slanted and tt-fonts.
  tt-font detection is important for inter-character space detection:
     the distance between middle axes is always a multiple of the tt-width
  UPDATE: bbg has some ideas.
 
 2.8.1 space between characters

  Write a function which is able to decide which of the following
  font types is used:
   - fixed-width font
   - proportional-width font
  The algorithm should use the boxlist, where information about the size
  and position of every detected character/glyph is stored.
  Extract the essential information to find the space between words.
  The distance between middle points could be such a value (fixed font).
  Also, the distance between the right side of a character
  and the left side of the following character could be the essential
  value (prop. font).
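
  A crude sketch of the fixed-width test, using only box positions. The
  type box_t and the function looks_fixed_width() are hypothetical and do
  not match the real boxlist. The pitch is guessed as the smallest
  distance between neighbouring middle points, which is an assumption
  that can fail on short or noisy lines.

```c
/* Hypothetical box: left and right edge of one glyph, in pixels. */
typedef struct { int x0, x1; } box_t;

/* Boxes of one text line, sorted left to right. Returns 1 if every
 * distance between neighbouring middle points is a multiple of the
 * smallest such distance, within +-tol pixels; otherwise 0. */
int looks_fixed_width(const box_t *b, int n, int tol)
{
    int i, d, r, pitch = 0;

    if (n < 3)
        return 0;  /* too few glyphs to decide */
    /* work with doubled middle points (x0 + x1) to avoid fractions */
    for (i = 1; i < n; i++) {
        d = (b[i].x0 + b[i].x1) - (b[i - 1].x0 + b[i - 1].x1);
        if (d <= 0)
            return 0;  /* not sorted or overlapping */
        if (pitch == 0 || d < pitch)
            pitch = d;
    }
    for (i = 1; i < n; i++) {
        d = (b[i].x0 + b[i].x1) - (b[i - 1].x0 + b[i - 1].x1);
        r = d % pitch;
        if (r > pitch / 2)
            r = pitch - r;  /* distance to the nearest multiple */
        if (r > 2 * tol)    /* doubled units, so doubled tolerance */
            return 0;
    }
    return 1;
}
```

  A distance of two or more pitches between neighbours would then mark a
  space between words in the fixed-font case.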

  An extension could be to estimate the values and their tolerance for
  grouping characters into words and sentences (related variable = env.cs).

  Another extension could be to find single fixed-width-font words
  in a proportional-width-font text or vice versa.

  estimated time: 20-40h

------------------------------------------------------------
 2.9 
  What about outputting HTML or TeX? Not in the near future, but still.
  [0.2] Solved by libgocr.

------------------------------------------------------------
 2.9.1 
    Users wish to get the positions (absolute or relative?) of the chars.
    Output in xfig or pdf format?
    This sounds to me like a compression-like function for text images:
    every character can be seen as a mini picture, and some of them are equal.

------------------------------------------------------------
 2.10 learn mode
    If we do not use a simple expandable database of master characters,
    it would be nice to have a "code morphing" algorithm for the
    hard-coded engine. It should be possible,
    because we have the sources (Open Source) and other projects
    also modify parts of their own code (MUDs, Transmeta, ...).
    The simpler variant is the database variant of the engine.

------------------------------------------------------------------------
 2.11 page orientation

  Write a function which is able to decide if the picture is rotated
  by 90, 180 or 270 degrees.
  If possible, the algorithm should use only the boxlist, where information
  about the size and position of every detected character/glyph is stored.
  For that you have to play with scanned examples.
  Analyse characteristic quantities using Fourier transforms
  or other things (look at the literature).

  After detection, rotate the pixmap back if it was rotated and modify the lists.

  estimated time: 40-80h

--------------------------------------------------------------------
 2.12 detect math formulas
  [0.2]is supported, but not implemented, by libgocr.

---------------------------------------------------------------------
 2.13 improvement of essential functions

 2.13.1 improve speed and quality of frame_nn() algorithm
   [0.2]rewritten as gocr_charSetAllNearPixels().

 2.13.2 improve remove_melted_serifs() function
   On low-resolution scans, two neighbouring characters are often
   glued together if they have serifs. This problem arises very often!
   I think serifs are easily detectable, and it should be possible to
   write an algorithm which does a better job than the old function.
   The old function does not detect all serifs.
   If you write a better function, be careful not to remove too much.
   Removing means: change black pixels between the two chars to white or
   light gray ones.

   estimated time: 40-60h or more


 2.13.3 improve remove_dust() function
   I mean scanned dusty pages. They often have speckles
   (hope it's the right word). The speckle size and shape should follow
   statistical laws, so you can estimate the largest size and
   avoid removing too much. The existing function should be a
   good starting point.
   The function should remove the boxes from the boxlist and
   lighten the pixels on the pixmap.
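
   The last two steps (remove the boxes, lighten the pixels) could look
   like the sketch below. It is not the existing remove_dust(): the type
   dbox_t, the function drop_dust() and the plain array instead of the
   real boxlist are assumptions. The size limit is passed in by the
   caller, so the statistical estimate of the largest speckle is left
   open, and "lighten" is simplified to "set to white (255)".

```c
#include <stddef.h>

/* Hypothetical box with inclusive corners, in pixel coordinates. */
typedef struct { int x0, y0, x1, y1; } dbox_t;

/* Remove all boxes not larger than max_w x max_h from list[0..n-1] and
 * whiten their pixels in the 8-bit gray pixmap pix (row-major, img_w x
 * img_h). Kept boxes are moved to the front; returns the new count. */
int drop_dust(dbox_t *list, int n, int max_w, int max_h,
              unsigned char *pix, int img_w, int img_h)
{
    int i, x, y, kept = 0;

    for (i = 0; i < n; i++) {
        int w = list[i].x1 - list[i].x0 + 1;
        int h = list[i].y1 - list[i].y0 + 1;

        if (w <= max_w && h <= max_h) {
            /* speckle: lighten its pixels, clipped to the image */
            for (y = list[i].y0; y <= list[i].y1; y++)
                for (x = list[i].x0; x <= list[i].x1; x++)
                    if (x >= 0 && y >= 0 && x < img_w && y < img_h)
                        pix[(size_t)y * img_w + x] = 255;
        } else {
            list[kept++] = list[i];  /* keep real characters */
        }
    }
    return kept;
}
```

   Note that dots of 'i' or punctuation can be as small as dust, so a
   real function has to be more careful than this size test.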

   estimated time: 60h

------------------------------------------------------------

3 man/doc-writing

 3.1. developer-introducing
  [0.2]done in libgocr.

 3.1.1 definitions/english
  [0.2]done in libgocr.

4 graphical design

 4.1 graphical user interface
  Should be a separate package (XGOCR ?) using the gocr-lib or exe.
  I would like to see an easy-to-use tcl/tk mini GUI (gocr.tcl).
  Simple problem: how to present a pgm image on a tk canvas?
  UPDATE: A gtk frontend is done.
  
 4.2 graphical debug interface
  See section 5.

5 tester/bug-finder

 5.1. speed optimizing
  I think this task is not very important yet. But ...
  someone could keep an eye on it to avoid a dramatic slowdown of the program.
  You should have experience with gprof. Make a list of speed killers and
  suggestions for possible changes. Or you can make simple changes yourself.

 5.1.1 speed analysis
  Write documentation about the speed of every function and a list of
  functions sorted by the runtime they need. Analyse where improvements
  could give a high speedup and make suggestions about what could be changed
  and how large the improvement would be. Use gprof for testing speed.
  
  Of course the next step would be to make the changes and report
  the speedup.

  A list of speed of other programs could be extracted from the literature
  and a milestone could be set for gocr-package.

  estimated time: 20-40h

 5.1.2
  The pixel() function contains a lot of if-then constructs (v0.2.7)
   and is often called by other functions.
  Is there a speedup if the filter is split into an AND and an OR part? If yes, do it
  and report the speedup.

  Another possibility is to make a kind of morphing code.
  I mean: you should read the filter table (filt3[][] of pixel.c)
   and create source code which can be compiled and ...
  in the ideal case loaded at run time. At first it would be enough to
  generate the code from the table at compilation time.

  estimated time: 40-80h for the table to c-code translator

 5.2 quality check
  Look at the REMARK.txt file. It is a list of test images, some remarks
  and the number of errors. I would like to see an HTML table.
  The user should see how good gocr is in comparison to other programs.
  I can give you access to some images with different fonts or
  some difficulties (noise etc.). All files are from users (directly from
  real life).

 5.2.1 Tests could be automated if we used the following system for
  the image library: besides [imagename].ext, there should be a
  [imagename].txt, which was manually typed and is the correct
  text. We could make tests, comparisons and statistics quite
  easily that way. It could be managed by a Makefile easily.
  The image library should be well-organized, with different
  styles of text in different directories: fax, excellent
  quality, italic, bold, accents, greek, etc.
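
  For the comparison of the gocr output with [imagename].txt, the number
  of errors could be counted as the edit (Levenshtein) distance between
  the two texts. The sketch below is an assumption about how to measure
  errors, not an existing gocr function; edit_distance() is a hypothetical
  name, and reading the two files into strings is left out.

```c
#include <stdlib.h>
#include <string.h>

/* Levenshtein distance between two strings: the minimal number of
 * inserted, deleted or replaced characters. Returns -1 if out of memory. */
int edit_distance(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b), i, j;
    int *row = malloc((lb + 1) * sizeof *row);
    int result;

    if (row == NULL)
        return -1;
    for (j = 0; j <= lb; j++)
        row[j] = (int)j;             /* distance from "" to b[0..j) */
    for (i = 1; i <= la; i++) {
        int diag = row[0];           /* D[i-1][j-1] */
        row[0] = (int)i;
        for (j = 1; j <= lb; j++) {
            int up = row[j];         /* D[i-1][j] */
            int best = diag + (a[i - 1] != b[j - 1]);
            if (up + 1 < best)
                best = up + 1;
            if (row[j - 1] + 1 < best)
                best = row[j - 1] + 1;
            row[j] = best;
            diag = up;
        }
    }
    result = row[lb];
    free(row);
    return result;
}
```

  A Makefile rule could run gocr on every image, pass the output and the
  typed text to such a function, and write the error counts to a table.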

 5.2.2 Testfiles should be sorted in directories:
> text/quality/excellent
> text/quality/good
> text/quality/bad
> text/accents
>  (with the TeX, HTML, etc codes)
> chars/
>  (ASCII characters; only one type per file. First letter of the file is the
>  character. Examples: a0.ext, aitalic.ext, etc, are images only with lower
>  case  'a's. I think no text files are needed; perhaps the number of
>  characters in file? can be used as database.)
> symbols/
>  (text file should contain information such as name of character, code in
>  TeX, html, etc. Can be used as database)
 Put them all in a directory, called "examples" or something.
 UPDATE: being done by BBG.


 5.3 bmp files
  It may save hard disk space and time if RLE-packed bmp
  output is used. See pcx.cc writebmp().
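
  As an illustration of RLE packing, here is a sketch that encodes one
  row of 8-bit pixels in the "encoded mode" of the BMP RLE8 scheme: pairs
  of (run length 1..255, pixel value), closed by the end-of-line marker
  0,0. The function rle8_encode_row() is a hypothetical helper; what
  writebmp() in pcx.cc actually does has not been checked here. The file
  headers and the end-of-bitmap marker (0,1) are left out.

```c
#include <stddef.h>

/* Encode w pixels of one row into out, which must have room for
 * 2*w + 2 bytes. Returns the number of bytes written. */
size_t rle8_encode_row(const unsigned char *row, size_t w,
                       unsigned char *out)
{
    size_t i = 0, o = 0;

    while (i < w) {
        size_t run = 1;

        /* extend the run while the value repeats, at most 255 pixels */
        while (i + run < w && run < 255 && row[i + run] == row[i])
            run++;
        out[o++] = (unsigned char)run;  /* count */
        out[o++] = row[i];              /* pixel value */
        i += run;
    }
    out[o++] = 0;  /* end-of-line marker: 0,0 */
    out[o++] = 0;
    return o;
}
```

  Scanned text pages contain long white runs, so this simple scheme
  should already shrink them a lot; noisy gray images may even grow.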

 5.4 graphical debug interface
  [0.2]Support is built into libgocr.

6 packager


 6.1 make a rpm package
  change Makefile.in to implement a "make rpm-package"
  - configure.in etc. should be updated for
    - using libpgm,libppm,libpbm (define USE_LIBPGM,...)

 6.2 make winexe
  I am able to make a winexe of gocr using DJGPP. People are interested
  in it; I can count the downloads, and from time to time I get
  emails from Windows users telling me that there are better programs. ;)
  I like that, but I do not like working with Windows. Maybe
  you are interested in pleasing Windows users by creating the WINEXE of gocr.


Developers:                deadline:
Notes:
 We need sponsors:
  - What do you think of giving prizes for the best contributions?
  - Or having money to pay students for programming?
  - Or paying for a trip to a developer conference?
  - Or paying for an online connection?
  - Or having some additional computer equipment (notebooks, scanners)?
 Do we need www.gocr.org?