#4711To improve the segmentation accuracy, we use an unsupervised algorithm for automatically acquiring new stems from a 155 million word unsegmented corpus, and re-estimate the model parameters with the expanded vocabulary and training corpus.
lr,34-5-P03-1051,ak
expanded
<term>
vocabulary
</term>
and
<term>
training corpus
</term>
. The resulting
<term>
Arabic word
#4742To improve the segmentation accuracy, we use an unsupervised algorithm for automatically acquiring new stems from a 155 million word unsegmented corpus, and re-estimate the model parameters with the expanded vocabulary and training corpus.
lr,15-6-P03-1051,ak
<term>
exact match accuracy
</term>
on a
<term>
test corpus
</term>
containing 28,449
<term>
word tokens
#4760The resulting Arabic word segmentation system achieves around 97% exact match accuracy on a test corpus containing 28,449 word tokens.
other,12-3-P03-1051,ak
</term>
to determine the most probable
<term>
morpheme sequence
</term>
for a given
<term>
input
</term>
. The
#4684The algorithm uses a trigram language model to determine the most probable morpheme sequence for a given input.
model,1-4-P03-1051,ak
for a given
<term>
input
</term>
. The
<term>
language model
</term>
is initially estimated from a
<term>
#4692The language model is initially estimated from a small manually segmented corpus of about 110,000 words.