ACL RD-TEC 1.0 Summarization of W97-0118
Paper Title:
THE EFFECTS OF CORPUS SIZE AND HOMOGENEITY ON LANGUAGE MODEL QUALITY
Primarily assigned technology terms:
- algorithm
- classification
- classification system
- comparative evaluation
- computing
- database
- encoding
- expression searching
- generic speech recognition
- geology
- hidden markov
- hidden markov model
- language modelling
- markov model
- measuring
- modelling
- personal computer
- reasoning
- recogniser
- recognition
- recognition systems
- regular expression
- search
- searching
- self-organising adaptation
- speech recognition
- speech recognition systems
- statistical techniques
- tagging
- text dictation
- text encoding
- top-down approach
- top-down classification
- transcription
- video mail
Other assigned terms:
- approach
- background corpus
- bigram
- bottom-up approach
- british national corpus
- case
- chunks
- classification scheme
- coefficient
- contingency table
- corpora
- corpus size
- correlation
- correlation coefficient
- correlations
- data sets
- distribution
- domain corpus
- domain information
- domain-specific corpora
- electronic information
- entropy
- error rate
- evaluation method
- evaluation metric
- evaluations
- events
- fact
- frequency list
- function words
- genre
- handwriting
- human judgement
- knowledge
- language data
- language model
- language model quality
- language models
- large corpus
- linguistic
- linguistic phenomena
- manual intervention
- measure
- measures
- method
- methodology
- n-grams
- noise
- normal distribution
- part-of-speech
- perplexity
- polarity
- probabilities
- probability
- process
- rank correlation
- seed
- similarity measure
- similarity measures
- similarity metric
- sparse data
- speech data
- spoken email
- standard deviation
- statistic
- subcorpus
- sublanguage
- tags
- technique
- terms
- test data
- text
- text encoding initiative
- textual similarity
- toolkit
- training
- training corpus
- training data
- training text
- transcriptions
- trigram
- unigram
- utterance
- vocabulary
- word
- word error rate
- word frequencies
- word frequency
- word types
- words