ACL RD-TEC 1.0 Summarization of E06-1028
Paper Title:
A FIGURE OF MERIT FOR THE EVALUATION OF WEB-CORPUS RANDOMNESS
Authors: Massimiliano Ciaramita and Marco Baroni
Primarily assigned technology terms:
- approximation
- biased sampling
- biased sampling method
- boosting
- bootstrap
- classification
- computer science
- corpus building
- corpus comparison
- corpus construction
- crawling
- measuring
- nlp
- post-processing
- querying
- ranking
- retrieval method
- retrieving
- sampling
- scoring
- scoring function
- search
- search engine
- search engines
- seed selection
- smoothing
- stochastic approximation
- web crawling
Other assigned terms:
- alphabet
- american english
- approach
- bias
- british english
- british national corpus
- brown corpus
- case
- composition
- corpora
- data sparseness
- dictionary
- discipline
- distance matrix
- distribution
- document
- document collection
- empirical evaluation
- entropy
- estimation
- experimental setting
- fact
- finite alphabet
- frequency distribution
- frequency list
- function word
- function words
- genre
- heuristic
- hypothesis
- interpretation
- language model
- language models
- lexical resource
- linguistic
- linguistic corpora
- linguistic data
- linguistics
- linguists
- manual intervention
- measure
- measures
- method
- methodology
- n-grams
- navigational information
- pairs of words
- priori
- probabilities
- procedure
- qualitative analysis
- queries
- query
- random order
- random sample
- relative frequency
- russian
- search strategy
- seed
- seed words
- similarity measure
- sociology
- specialized corpora
- statistics
- sub-language
- subcorpus
- tags
- target language
- technical terms
- technique
- terms
- text
- tokens
- topics
- unigram
- vocabulary
- web corpus
- web documents
- web pages
- web-based corpora
- web-corpus randomness
- word
- word frequency
- word lists
- word model
- word types
- wordnet
- words