ACL RD-TEC 1.0 Summarization of C96-2110
Paper Title:
IDENTIFYING THE CODING SYSTEM AND LANGUAGE OF ON-LINE DOCUMENTS ON THE INTERNET
Primarily assigned technology terms:
- algorithm
- automatic language identification
- categorization
- character coding
- coding
- coding system
- content-based search
- decoding
- document processing
- encoding
- identification
- information extraction
- internet
- language identification
- machine translation
- matching
- pattern matching
- processing
- regular expression
- search
- text processing
- tile
- world-wide web
Other assigned terms:
- ambiguity
- approach
- case
- characters
- class name
- community
- corpora
- dictionaries
- document
- encoding scheme
- fact
- heuristic
- heuristic rules
- heuristics
- information infrastructure
- language models
- likelihood
- mapping
- maps
- message
- method
- names
- probability
- procedure
- sentence
- statistic
- text
- text corpora
- tokens
- unigram
- unigram model
- unigram probability
- word
- word boundaries
- words