ACL RD-TEC 1.0 Summarization of J05-4005
Paper Title:
CHINESE WORD SEGMENTATION AND NAMED ENTITY RECOGNITION: A PRAGMATIC APPROACH
Authors: Jianfeng Gao and Mu Li and Andi Wu and Chang-Ning Huang
Primarily assigned technology terms:
- algorithm
- approximation
- automatic speech recognition
- automaton
- backoff smoothing
- binary classifier
- boosting
- bootstrapping
- bootstrapping approach
- boundary disambiguation
- candidate generation
- chinese language processing
- chinese morphology
- chinese morphology analysis
- chinese natural language processing
- chinese parser
- chinese sentence generation
- chinese word segmentation
- classification
- classifier
- classifiers
- computational linguistics
- computing
- crfs
- decoder
- decoding
- disambiguation
- editing
- em training
- empirical risk minimization
- entity recognition
- entropy classifier
- error analysis
- finite state
- finite state automata
- finite state machines
- finite-state automaton
- finite-state morphology
- full parsing
- good-turing method
- greedy approach
- greedy segmenter
- identification
- iterative learning
- iterative procedure
- iterative training
- language processing
- language processor
- learning
- learning algorithms
- learning approach
- learning approaches
- lexicalization
- lexicon representation
- lexicon word segmentation
- linear discriminant
- linear mixture model
- loss function
- machine translation
- machine translation system
- margin-based learning
- matching
- maximum entropy
- maximum entropy classifier
- maximum likelihood
- maximum matching
- maximum-entropy
- maximum-likelihood
- maximum-likelihood estimation
- model parameter estimation
- modeling
- monte carlo simulation
- morphological analysis
- morphological process
- morphology
- morphology analysis
- named entity recognition
- natural language processing
- ne recognizer
- nlp
- nlp systems
- normalization
- optimization
- optimization algorithm
- parameter estimation
- parser
- parsing
- pattern classification
- perceptron
- perceptron algorithm
- perceptron training
- pos tagging
- preprocessing
- processing
- processor
- recognition
- recognizer
- risk minimization
- rule-based approach
- sampling
- search
- search algorithm
- segmentation
- segmentation system
- segmenter
- sentence generation
- smoothing
- smoothing method
- speech recognition
- splitting
- statistical approaches
- stochastic approximation
- svm classifier
- svm light
- tagging
- text-to-speech
- training algorithm
- training method
- training procedure
- transformation-based learning
- translation system
- tuning
- unknown word detection
- unknown word identification
- unsupervised approach
- unsupervised learning
- vector space model
- viterbi
- voting
- word alignment
- word breaking
- word detection
- word identification
- word processing
- word segmentation
- word segmentation bakeoff
- word segmentation system
- word segmenter
Other assigned terms:
- abbreviation
- abbreviations
- adaptation paradigm
- affix
- affixation
- ambiguity
- analogy
- annotated corpus
- annotated training corpora
- annotated training corpus
- annotated training set
- annotation
- annotators
- approach
- array
- association for computational linguistics
- automata
- backoff
- bigram
- bigram model
- bilingual dictionary
- binary feature
- binary features
- case
- character bigram model
- character sequence
- characters
- chinese characters
- chinese language
- chinese nouns
- chinese sentence
- chinese text
- chinese treebank
- chinese word
- chinese words
- classification problem
- co-occurrence
- co-occurrence frequency
- concept
- concepts
- context dependency
- context information
- context model
- context models
- contextual information
- convergence
- corpora
- correlations
- data set
- data sets
- decision rule
- derivation
- dictionaries
- dictionary
- distribution
- document
- document frequency
- entropy
- entropy models
- error rate
- estimation
- evaluation measures
- evaluation methodology
- evaluations
- experimental results
- experimental setting
- f-measure
- fact
- feature
- feature value
- generation
- generation process
- generative model
- generative models
- generative probability
- gold test set
- grammar
- heuristic
- heuristic rules
- heuristics
- human annotation
- human annotators
- hypothesis
- implementation
- input string
- inverse document frequency
- keyword
- knowledge
- language model
- language models
- language processing applications
- large corpus
- large training
- lattice
- lexical word
- lexicon
- lexicon entry
- likelihood
- linguist
- linguistic
- linguistic knowledge
- linguistics
- linguists
- mapping
- maximum entropy models
- measure
- measures
- method
- methodology
- mixture models
- model parameter
- model parameters
- model probability
- morpheme
- morpheme boundary
- morphological rules
- msr gold test
- mutual information
- n-gram
- n-gram models
- named entities
- named entity
- names
- natural language
- natural language processing applications
- nlp applications
- nlp tasks
- nouns
- open test
- ordered list
- organization names
- parameter values
- person names
- plural noun
- precision
- probabilistic models
- probabilities
- probability
- probability distribution
- procedure
- process
- pronouns
- pronunciation
- punctuation
- raw text corpus
- regular expressions
- relation
- relative frequency
- schema
- search space
- seed
- segmentation bakeoff
- segmented corpus
- semantic
- sentence
- sentence boundaries
- sentences
- source language
- statistical approach
- statistical information
- statistical models
- statistical significance
- statistics
- stem
- stems
- stochastic model
- style
- substring
- suffix
- symbols
- syntactic level
- syntactic structure
- system description
- system evaluation
- tags
- taxonomy
- term
- term distribution
- term frequency
- terminals
- terms
- test corpus
- test data
- test set
- text
- text corpus
- theory
- time expressions
- tokens
- training
- training and test data
- training corpora
- training corpus
- training criterion
- training data
- training material
- training samples
- training set
- training size
- transformation
- translations
- tree
- tree structures
- treebank
- trigram
- trigram language model
- unigram
- upenn chinese treebank
- vector space
- verb
- word
- word boundaries
- word boundary
- word candidate
- word classes
- word formation
- word lattice
- word segmentation performance
- word sequence
- word type
- word types
- words
- wrapper