Cleansed Segmented Text Files | ACL RD-TEC

Cleansed Text Files in XML Format

As described elsewhere, the ACL ARC documents are processed using the PDFBox and ParsCit libraries. The ParsCit output is then processed further, and the resulting text is organized into sections and paragraphs. Each section is also annotated with its type (note that the output is noisy). The files are organized by publication year and can be downloaded from the list below:


Text Files in XML Format


Size: Name: Description:
90.760.177   XML.ZIP Cleansed text files in XML format.
18.084   I05-3019_CLN.XML Example of a cleansed text file which can be found in the XML.ZIP file.
<DIR>   XML_BY_SECTION/ The above XML files split into a set of files, each containing sections of one specific type, e.g. abstract, acknowledgement, etc.
91.454.499   XML_BY_SECTION.ZIP The contents of the XML_BY_SECTION/ folder in one zip file.

Directory contains 182.232.760 Bytes in 3 Files
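
Since the section annotations are described as noisy, it can help to inspect an archive before relying on it. The following is a minimal sketch, assuming XML.ZIP has already been downloaded; the function name count_tags is hypothetical, and no element names are assumed because the schema is not given on this page. It counts which XML tags occur and records files that fail to parse.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(zip_path):
    """Count element tags across every .xml member of a zip archive.

    Returns (tags, skipped): a Counter of tag names and a list of
    member names that could not be parsed as well-formed XML.
    """
    tags = Counter()
    skipped = []  # the output is described as noisy, so parsing may fail
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            # Match the extension case-insensitively (e.g. I05-3019_CLN.XML).
            if not name.lower().endswith(".xml"):
                continue
            try:
                root = ET.fromstring(archive.read(name))
            except ET.ParseError:
                skipped.append(name)
                continue
            # root.iter() walks the root element and all of its descendants.
            tags.update(element.tag for element in root.iter())
    return tags, skipped
```

Calling count_tags("XML.ZIP") and printing tags.most_common() should show which element names the files actually use, which is a reasonable first step before writing a section-specific extractor.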

Index of: XML_BY_SECTION/


The extracted text from publications organized by section type and publication date.
Size: Name: Description:
10.118.757   ABSTR.ZIP Extracted abstract sections.
1.784.803   ACKNO.ZIP Extracted acknowledgement sections.
6.633.949   CONCL.ZIP Extracted conclusion sections.
4.624.872   EVALU.ZIP Extracted evaluation sections.
12.545.505   INTRO.ZIP Extracted introduction sections.
53.735.039   METHO.ZIP Extracted method sections.
950.314   RELAT.ZIP Extracted related work sections.

Directory contains 90.393.239 Bytes in 7 Files

Total: 272.625.999 Bytes in 10 Files
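
The sizes in both listings use a dot as the thousands separator. As a small sketch (the helper name parse_size is hypothetical), the snippet below converts those strings to integers and checks them against the stated directory totals, which may be useful for confirming that a set of downloads is complete.

```python
def parse_size(text):
    """Convert a size such as '90.760.177' (dot-grouped thousands) to an int."""
    return int(text.replace(".", ""))


# Sizes copied from the two listings on this page.
top_level = ["90.760.177", "18.084", "91.454.499"]
by_section = [
    "10.118.757",  # ABSTR.ZIP
    "1.784.803",   # ACKNO.ZIP
    "6.633.949",   # CONCL.ZIP
    "4.624.872",   # EVALU.ZIP
    "12.545.505",  # INTRO.ZIP
    "53.735.039",  # METHO.ZIP
    "950.314",     # RELAT.ZIP
]

top_total = sum(parse_size(size) for size in top_level)
section_total = sum(parse_size(size) for size in by_section)

print(top_total)                   # 182232760
print(section_total)               # 90393239
print(top_total + section_total)   # 272625999
```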

This page last edited on 21 April 2017.
