ACL RD-TEC 1.0 Summarization of W06-1704
Paper Title:
CUCWEB: A CATALAN CORPUS BUILT FROM THE WEB
Authors: Gemma Boleda and Stefan Bott and Rodrigo Meza and Carlos Castillo and Toni Badia and Vicente López
Primarily assigned technology terms:
- automatic spelling
- automatic spelling correction
- bayes classifier
- classification
- classifier
- cluster analysis
- clustering
- commercial search engine
- computational linguistics
- computer science
- corpus construction
- corpus querying
- crawler
- crawling
- data collection
- data processing
- data selection
- disambiguation
- indexing
- information storage
- interfaces
- internet
- language classification
- language modeling
- language teaching
- learning
- lexical acquisition
- linguistic processing
- link analysis
- modeling
- naive bayes
- naive bayes classifier
- nlp
- parser
- pre-processing
- preprocessing
- processing
- processing tools
- processor
- pruning
- querying
- ranking
- search
- search engine
- search engines
- searching
- segmentation
- set disambiguation
- shallow parser
- spelling
- spelling correction
- splitting
- tagging
- text classification
- translators
- web interface
Other assigned terms:
- acquisition task
- ambiguous word
- annotated corpus
- annotation
- approach
- basque
- cache
- case
- catalan
- chunks
- cluster
- co-occurrence
- co-occurrence frequency
- community
- computational linguists
- constraint grammar
- constraint grammar formalism
- corpora
- corpus exploitation
- determiner
- determiners
- dictionary
- distribution
- document
- dutch
- encyclopedia
- f-score
- fact
- feature
- formalism
- frame
- french
- genre
- grammar
- grammar formalism
- grammars
- heuristic
- heuristics
- implementation
- index
- information content
- lemma
- lemmata
- lexical material
- linguist
- linguistic
- linguistic data
- linguistic filter
- linguistics
- linguists
- main verb
- manual tagging
- markup
- metadata
- methodology
- morphological features
- morphological information
- multilinguality
- names
- nlp community
- noise
- nouns
- pagerank
- part of speech
- parts of speech
- personal pronoun
- procedure
- process
- pronoun
- pronouns
- punctuation
- punctuation marks
- queries
- query
- relative frequency
- search results
- seed
- serbian
- size of the corpus
- statistical information
- statistics
- structural information
- suffix
- syntactic function
- syntactic functions
- syntactic information
- tagset
- teaching
- technology
- terms
- text
- text genre
- translations
- user
- verb
- verb form
- web corpus
- web documents
- web pages
- web site
- word
- word corpus
- word form
- word level
- word strings
- words