ACL RD-TEC 1.0 Summarization of N03-1018
Paper Title:
A GENERATIVE PROBABILISTIC OCR MODEL FOR NLP APPLICATIONS
Authors: Okan Kolak and William Byrne and Philip Resnik
Primarily assigned technology terms:
- algorithm
- automatic extraction
- character conversion
- character recognition
- cross-language ir
- database
- decoding
- document translation
- dynamic programming
- error correcting
- error correction
- finite state
- finite-state
- fsm toolkit
- generation method
- generative probabilistic optical character recognition
- giza
- illustration
- image segmentation
- information retrieval
- language modeling
- language translation
- lexicon acquisition
- lexicon generation
- machine translation
- model definition and estimation
- modeling
- morphological parsing
- morphology
- nlp
- nlp technology
- ocr error correction
- optical character recognition
- parameter estimation
- parsing
- post-ocr correction
- post-ocr error correction
- post-processing
- preprocessing
- probabilistic optical character recognition
- probabilistic relaxation
- probabilistic segmentation
- re-training
- recognition
- recognizer
- relaxation labeling
- search
- searching
- segmentation
- smoothing
- spelling
- spelling correction
- spelling correction system
- statistical machine translation
- statistical parameter estimation
- tokenization
- transducer
- translation lexicon acquisition
- tuning
- viterbi
- word alignment
- word bigram
- word correction
- word recognition
- word segmentation
Other assigned terms:
- alphabet
- ambiguous word
- approach
- bigram
- boundary marker
- case
- character error rate
- character sequence
- characters
- chunks
- co-occurrence
- concept
- confusion model
- dictionary
- distribution
- document
- electronic form
- error rate
- estimation
- evaluation metrics
- evaluations
- experimental results
- fact
- finite state model
- foreign language
- french
- french text
- generation
- generative model
- glossary
- heuristics
- implementation
- joint probability
- knowledge
- labeling
- language information
- language model
- language resources
- latin alphabet
- lattice
- lexicon
- lexicon entries
- likelihood
- likelihood ratio
- machine translation model
- mapping
- method
- names
- ngram
- ngram language model
- nlp applications
- nlp task
- nlp tasks
- noisy channel
- ocr performance
- parallel text
- parse
- parse structure
- precision
- probabilistic models
- probabilities
- probability
- process
- punctuation
- recognition errors
- retrieval performance
- rewrite rules
- search space
- segment boundaries
- segment boundary
- segments
- style
- symbols
- technique
- technology
- test data
- test set
- text
- tokens
- toolkit
- training
- training and test data
- training corpus
- training data
- training size
- transformation
- translation lexicon
- translation model
- trigram
- trigram language model
- trigram model
- unigram
- unigram language model
- usability
- user
- vocabulary
- vocabulary size
- word
- word boundaries
- word boundary
- word co-occurrence
- word error rate
- word level
- word sequence
- word sequences
- words