ACL RD-TEC 1.0 Summarization of E06-2001
Paper Title:
LARGE LINGUISTICALLY-PROCESSED WEB CORPORA FOR MULTIPLE LANGUAGES
Authors: Marco Baroni and Adam Kilgarriff
Primarily assigned technology terms:
- adobe pdf
- algorithm
- caching
- corpus construction
- crawler
- crawling
- database
- indexing
- lemmatization
- lemmatizer
- near-duplicate detection
- normalization
- part-of-speech tagging
- pornography filtering
- post-processing
- processing
- ranking
- retrieving
- search
- search engine
- search engines
- tagger
- tagging
- user interface
- word filtering
Other assigned terms:
- annotated corpus
- annotation
- apa corpus
- association measure
- bias
- bigram
- bigram language model
- case
- content words
- corpora
- corpus design
- corpus size
- data sparseness
- dictionary
- dictionary definitions
- disk
- distribution
- document
- function word
- function words
- genre
- german corpus
- graph structure
- keyword
- language model
- large corpora
- lexicography
- linguistic
- linguistic corpora
- linguistic data
- linguists
- log-likelihood
- log-likelihood ratio
- log-likelihood ratio association
- machine-generated text
- markup
- measure
- method
- methodology
- n-grams
- named entities
- navigational information
- nouns
- part-of-speech
- part-of-speech tag
- part-of-speech tags
- particles
- parts-of-speech
- precision
- procedure
- processing time
- queries
- query
- regular expressions
- relation
- seed
- sentence
- sentences
- server
- statistics
- suffixes
- tags
- target language
- temporal expressions
- terms
- text
- tokens
- topics
- user
- vocabulary
- web corpus
- web documents
- web page
- web pages
- word
- word types
- words