other,12-3-P03-1051,bq |
</term>
to determine the most probable
<term>
|
morpheme sequence
|
</term>
for a given
<term>
input
</term>
. The
|
#4682
The algorithm uses a trigram language model to determine the most probable morpheme sequence for a given input. |
model,1-4-P03-1051,bq |
for a given
<term>
input
</term>
. The
<term>
|
language model
|
</term>
is initially estimated from a small
|
#4690
The language model is initially estimated from a small manually segmented corpus of about 110,000 words. |
lr,34-5-P03-1051,bq |
expanded
<term>
vocabulary
</term>
and
<term>
|
training corpus
|
</term>
. The resulting
<term>
Arabic word
|
#4740
To improve the segmentation accuracy, we use an unsupervised algorithm for automatically acquiring new stems from a 155 million word unsegmented corpus, and re-estimate the model parameters with the expanded vocabulary and training corpus. |
lr,15-6-P03-1051,bq |
<term>
exact match accuracy
</term>
on a
<term>
|
test corpus
|
</term>
containing 28,449
<term>
word tokens
|
#4758
The resulting Arabic word segmentation system achieves around 97% exact match accuracy on a test corpus containing 28,449 word tokens. |