lr,2-3-I05-4010,bq | in detail . The resultant <term> bilingual | corpus | </term> , 10.4 M <term> English words </term> | #8255 The resultant bilingual corpus, 10.4M English words and 18.3M Chinese characters, is an authoritative and comprehensive text collection covering the specific and special domain of HK laws. |
lr-prod,17-4-H92-1074,bq | definition and development of the <term> CSR pilot | corpus | </term> , and examines the dynamic challenge | #19620 This paper presents an overview of the CSR corpus, reviews the definition and development of the CSR pilot corpus, and examines the dynamic challenge of extending the CSR corpus to meet future needs. |
lr,21-5-P03-1051,bq | million <term> word </term><term> unsegmented | corpus | </term> , and re-estimate the <term> model | #4728 To improve the segmentation accuracy, we use an unsupervised algorithm for automatically acquiring new stems from a 155 million word unsegmented corpus, and re-estimate the model parameters with the expanded vocabulary and training corpus. |
lr,6-3-P06-1052,bq | </term> . We evaluate the algorithm on a <term> | corpus | </term> , and show that it reduces the degree | #11183 We evaluate the algorithm on a corpus, and show that it reduces the degree of ambiguity significantly while taking negligible runtime. |
lr,3-3-P05-1034,bq | component </term> . We align a <term> parallel | corpus | </term> , project the <term> source dependency | #9248 We align a parallel corpus, project the source dependency parse onto the target sentence, extract dependency treelet translation pairs, and train a tree-based ordering model. |
lr-prod,7-4-H92-1074,bq | paper presents an overview of the <term> CSR | corpus | </term> , reviews the definition and development | #19609 This paper presents an overview of the CSR corpus, reviews the definition and development of the CSR pilot corpus, and examines the dynamic challenge of extending the CSR corpus to meet future needs. |
lr,13-1-N03-2006,bq | </term> based on a small-sized <term> bilingual | corpus | </term> , we use an out-of-domain <term> bilingual | #3093 In order to boost the translation quality of EBMT based on a small-sized bilingual corpus, we use an out-of-domain bilingual corpus and, in addition, the language model of an in-domain monolingual corpus. |
lr,7-2-P05-2016,bq | required is a <term> sentence-aligned parallel | corpus | </term> . All other <term> resources </term> | #9803 The only bilingual resource required is a sentence-aligned parallel corpus. |
lr,29-2-C88-2130,bq | </term> derived through analysis of our <term> | corpus | </term> . <term> Chart parsing </term> is <term> | #15495 The model is embodied in a program, APT, that can reproduce segments of actual tape-recorded descriptions, using organizational and discourse strategies derived through analysis of our corpus. |
lr,23-2-C04-1116,bq | each author 's text as a coherent <term> | corpus | </term> . Our approach is based on the idea | #6137 This paper proposes a new methodology to improve the accuracy of a term aggregation system using each author's text as a coherent corpus. |
lr,28-2-P03-1051,bq | </term> from a large <term> unsegmented Arabic | corpus | </term> . The <term> algorithm </term> uses a | #4668 Our method is seeded by a small manually segmented Arabic corpus and uses it to bootstrap an unsupervised algorithm to build the Arabic word segmenter from a large unsegmented Arabic corpus. |
lr,50-3-C04-1147,bq | phrases </term> at any distance in the <term> | corpus | </term> . The framework is flexible , allowing | #6400 In comparison with previous models, which either use arbitrary windows to compute similarity between words or use lexical affinity to create sequential models, in this paper we focus on models intended to capture the co-occurrence patterns of any pair of words or phrases at any distance in the corpus. |
lr,19-2-N03-4010,bq | candidates </term> from the given <term> text | corpus | </term> . The operation of the <term> system | #3682 The demonstration will focus on how JAVELIN processes questions and retrieves the most likely answer candidates from the given text corpus. |
lr,34-5-P03-1051,bq | <term> vocabulary </term> and <term> training | corpus | </term> . The resulting <term> Arabic word | #4741 To improve the segmentation accuracy, we use an unsupervised algorithm for automatically acquiring new stems from a 155 million word unsegmented corpus, and re-estimate the model parameters with the expanded vocabulary and training corpus. |
lr,19-5-C90-3063,bq | that were randomly selected from the <term> | corpus | </term> . The results of the experiment show | #16689 An experiment was performed to resolve references of the pronoun it in sentences that were randomly selected from the corpus. |
lr,30-2-C04-1192,bq | for the <term> languages </term> in the <term> | corpus | </term> . The <term> wordnets </term> are aligned | #6480 The method exploits recent advances in word alignment and word clustering based on automatic extraction of translation equivalents and being supported by available aligned wordnets for the languages in the corpus. |
lr-prod,26-4-H90-1060,bq | </term> from the <term> DARPA Resource Management | corpus | </term> . This <term> performance </term> is | #17099 With only 12 training speakers for SI recognition, we achieved a 7.5% word error rate on a standard grammar and test set from the DARPA Resource Management corpus. |
lr,29-5-J05-4003,bq | and exploiting a large <term> non-parallel | corpus | </term> . Thus , our method can be applied | #9098 We also show that a good-quality MT system can be built from scratch by starting with a very small parallel corpus (100,000 words) and exploiting a large non-parallel corpus. |
lr,15-2-C90-3063,bq | co-occurrence patterns </term> in a large <term> | corpus | </term> . To a large extent , these <term> | #16631 This paper presents an automatic scheme for collecting statistics on co-occurrence patterns in a large corpus. |
lr-prod,15-3-H94-1014,bq | word </term><term> Wall Street Journal text | corpus | </term> . Using the <term> BU recognition system | #21261 The models were constructed using a 5K vocabulary and trained using a 76 million word Wall Street Journal text corpus. |