ACL RD-TEC 1.0 Summarization of N06-2037
Paper Title:
SELECTING RELEVANT TEXT SUBSETS FROM WEB-DATA FOR BUILDING TOPIC SPECIFIC LANGUAGE MODELS
Authors: Abhinav Sethy, Panayiotis Georgiou, and Shrikanth Narayanan
Primarily assigned technology terms:
- algorithm
- bagging
- boosting
- classification
- classifier
- comparative analysis
- computational linguistics
- crawling
- data selection
- decoder
- decoding
- entropy calculation
- greedy selection
- human language
- human language technology
- incremental algorithm
- kneser-ney smoothing
- language modeling
- language technology
- learning
- likelihood estimate
- likelihood estimation
- machine translation
- maximum likelihood
- maximum likelihood estimation
- modeling
- nlp
- normalization
- preprocessing
- randomization
- ranking
- recognition
- recognition system
- scoring
- scoring function
- search
- selection process
- semi-supervised learning
- smoothing
- speech recognition
- speech recognition system
- speech translation
- thresholding
- web crawling
- world wide web
Other assigned terms:
- association for computational linguistics
- bigram
- bleu
- bleu metric
- case
- conversational speech
- conversational speech language
- corpora
- data model
- distribution
- distributional similarity
- entropy
- error rate
- estimation
- evaluations
- experimental results
- fact
- interpolation
- language model
- language models
- learning problem
- likelihood
- linguistics
- maximum likelihood estimate
- measures
- method
- model parameters
- n-gram
- n-gram language model
- n-grams
- nlp applications
- noise
- performance comparison
- permutation
- perplexity
- perplexity reduction
- probabilities
- probability
- process
- queries
- query
- sentence
- sentence similarity
- sentences
- similarity measures
- statistical models
- style
- technology
- term
- terms
- test set
- text
- text corpus
- training
- trigram
- unigram
- unlabeled examples
- vocabulary
- vocabulary size
- word
- word error rate
- words