ACL RD-TEC 1.0 Summarization of J00-3004
Paper Title:
A COMPRESSION-BASED ALGORITHM FOR CHINESE WORD SEGMENTATION
Authors: W. J. Teahan, Rodger McNab, Yingying Wen, and Ian H. Witten
Primarily assigned technology terms:
- adaptive text compression
- algorithm
- automatic checking
- automatic method
- automatic segmentation
- baum-welch algorithm
- chinese information retrieval
- chinese segmentation
- chinese word segmentation
- clustering
- coding
- computational linguistics
- decomposition
- dictionary-based method
- digital library
- document clustering
- error-driven learning
- forward match
- full-text indexing
- full-text retrieval
- full-text search
- hidden markov
- hidden markov modeling
- indexing
- information processing
- information retrieval
- information retrieval and storage
- information retrieval systems
- information technology
- keyphrase extraction
- language analysis
- language processing
- learning
- learning technique
- machine learning
- markov modeling
- matching
- maximum matching
- measuring
- modeling
- name recognition
- natural language analysis
- natural language processing
- phrasing
- preprocessing
- processing
- query expansion
- ranking
- recognition
- relevance ranking
- retrieval systems
- retrieving
- search
- search engines
- searching
- segmentation
- segmentation method
- segmenter
- space insertion
- speech recognition
- spell-checking
- statistical approaches
- statistical methods
- suffix tree
- summarization
- symbolic machine learning
- text compression
- text segmentation
- text summarization
- transformation-based error-driven learning
- word segmentation
- word segmenter
- word-based compression
- world wide web
Other assigned terms:
- ambiguity
- annotators
- approach
- automata
- bigram
- bigram model
- brown corpus
- case
- character sequence
- characters
- chinese characters
- chinese language
- chinese text
- chinese word
- chinese words
- coding scheme
- contextual information
- corpora
- corpus size
- dictionary
- distribution
- document
- document frequency
- english text
- error rate
- evaluations
- f-measure
- fact
- frequency distribution
- gold standard
- heuristic
- heuristics
- human judgment
- index
- input string
- input text
- interpretation
- keyphrase
- knowledge
- language model
- language thesaurus
- lexicon
- linguistic
- linguistic information
- linguistics
- mandarin chinese
- manual segmentation
- meaning
- measures
- method
- names
- natural language
- paragraphs
- ph corpus
- phrase
- precision
- probabilities
- probability
- probability estimates
- procedure
- process
- punctuation
- punctuation marks
- queries
- query
- relative frequency
- segmentation problem
- segmented corpus
- segments
- semantic
- semantic knowledge
- sentence
- sentences
- source text
- standard deviation
- statistics
- stem
- suffix
- technique
- technologies
- technology
- terms
- test data
- test material
- testing data
- text
- theory
- thesaurus
- tipster collection
- topics
- training
- training and test data
- training and testing data
- training corpus
- training data
- training text
- tree
- trigram
- typographical errors
- user
- word
- word boundaries
- word boundary
- word frequencies
- word meaning
- words