ACL RD-TEC 1.0 Summarization of W97-0122
Paper Title:
USING WORD FREQUENCY LISTS TO MEASURE CORPUS HOMOGENEITY AND SIMILARITY BETWEEN CORPORA
Primarily assigned technology terms:
- author identification
- categorisation
- chi-square test
- computational linguistics
- encoding
- factor analysis
- identification
- indexing
- information retrieval
- language engineering
- language modelling
- measuring
- modelling
- nlp
- nlp system
- parsers
- parsing
- processing
- sampling
- statistical language modelling
- statistical processing
- text categorisation
- text encoding
- tokenisation
- word-counting
Other assigned terms:
- american english
- approach
- break
- british english
- british national corpus
- brown corpus
- case
- chunks
- community
- computer program
- contingency table
- conversation
- corpora
- corpus similarity and homogeneity
- correlation
- document
- document frequency
- entropy
- evaluation methodology
- events
- fact
- frequency list
- genre
- hypotheses
- hypothesis
- interpretation
- inverse document frequency
- language corpora
- language model
- language type
- large corpora
- lexicography
- linguistic
- linguistic features
- linguistic structures
- linguistic theory
- linguistic variation
- linguistics
- linguists
- lob corpus
- meaning
- meanings
- measure
- measures
- message
- method
- methodology
- mutual information
- null hypothesis
- paragraph
- parsed corpus
- parts-of-speech
- perplexity
- punctuation
- rank correlation
- relation
- relative clauses
- representations
- similarity scores
- spearman rank correlation
- statistic
- statistics
- subcorpus
- sublanguage
- subtree
- subtrees
- syntactic categories
- syntactic category
- syntactic constructions
- term
- terms
- text
- text encoding initiative
- text type
- textbook
- theory
- transcript
- word
- word frequencies
- word frequency
- word senses
- words