Corpus ACL ARC 2.0 segmented, PoS tagged and cleaned (to an extent) – statistics and info
ACL ARC 2.0 pre-processed
Counts |
Tokens | 96737944 |
Words | 78151628 |
Sentences | 4052234 |
Paragraphs | 0 |
Documents | 25504 |
General info |
Language | English |
Encoding | UTF-8 |
Compiled | 05/01/2016 18:58:56 |
Lexicon sizes |
word | 1049494 |
lemma | 975733 |
tag | 46 |
lc | 914330 |
lemma_lc | 888148 |
Structures and attributes
doc | 25504 |
  authors | 24009 |
section | 157199 |
  type | 12 |
    method | 70996 |
    abstract | 22386 |
    introduction | 19310 |
    conclusions | 17702 |
    acknowledgments | 10588 |
    evaluation | 8002 |
    opening | 3596 |
    related work | 3264 |
    discussions | 879 |
    background | 237 |
    general terms | 165 |
    references | 74 |
  position | 67 |
section1 | 118497 |
sectiontitle | 275696 |
paragraph | 347967 |
equation | 73116 |
s | 4052234 |
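
The figures above can be cross-checked and turned into a few averages. The following is a minimal sketch, not part of the corpus tooling: it copies the numbers from the tables by hand, and all variable names are illustrative assumptions. It checks that the 12 section type counts add up to the section total and derives some per-unit averages.

```python
# Figures copied by hand from the tables above; the names are illustrative.
tokens = 96_737_944
words = 78_151_628
sentences = 4_052_234
documents = 25_504
sections = 157_199

# Number of sections per value of the section "type" attribute.
section_types = {
    "method": 70_996,
    "abstract": 22_386,
    "introduction": 19_310,
    "conclusions": 17_702,
    "acknowledgments": 10_588,
    "evaluation": 8_002,
    "opening": 3_596,
    "related work": 3_264,
    "discussions": 879,
    "background": 237,
    "general terms": 165,
    "references": 74,
}

# The per-type counts should account for every section.
type_total = sum(section_types.values())
print(f"sections by type: {type_total} (reported total: {sections})")

# Simple averages derived from the counts.
tokens_per_sentence = tokens / sentences
sentences_per_document = sentences / documents
sections_per_document = sections / documents
print(f"tokens per sentence:    {tokens_per_sentence:.2f}")
print(f"sentences per document: {sentences_per_document:.2f}")
print(f"sections per document:  {sections_per_document:.2f}")

# Share of each section type, largest first.
for name, count in sorted(section_types.items(), key=lambda kv: -kv[1]):
    print(f"{name:<16} {count:>6}  {100 * count / sections:5.2f}%")
```

Tokens here include punctuation, which is why the token count is higher than the word count; the averages are only as reliable as the sentence segmentation that produced them.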