Cleansed Segmented Text Files | ACL RD-TEC
Cleansed Text Files in XML Format
As described elsewhere, the ACL ARC documents are processed using the PDFBox and ParsCit libraries. The output of ParsCit is further processed, and the text is organized into sections and paragraphs. Each section is also annotated with its type; note that this output is noisy. The files are organized by publication year and can be downloaded from the list below:
Text Files in XML Format
Size (bytes) | Name | Description
90.760.177 | XML.ZIP | Cleansed text files in XML format.
18.084 | I05-3019_CLN.XML | Example of a cleansed text file, as found in XML.ZIP.
<DIR> | XML_BY_SECTION/ | The above XML files split into separate sets of files, each set containing one type of section, e.g. abstract, acknowledgement, etc.
91.454.499 | XML_BY_SECTION.ZIP | The contents of the XML_BY_SECTION/ folder in one zip file.
Directory contains 182.232.760 bytes in 3 files
Index of: XML_BY_SECTION/
The extracted text from publications, organized by section type and publication date.
Size (bytes) | Name | Description
10.118.757 | ABSTR.ZIP | Extracted abstract sections.
1.784.803 | ACKNO.ZIP | Extracted acknowledgement sections.
6.633.949 | CONCL.ZIP | Extracted conclusion sections.
4.624.872 | EVALU.ZIP | Extracted evaluation sections.
12.545.505 | INTRO.ZIP | Extracted introduction sections.
53.735.039 | METHO.ZIP | Extracted method sections.
950.314 | RELAT.ZIP | Extracted related work sections.
Directory contains 90.393.239 bytes in 7 files
Total: 272.625.999 bytes in 10 files
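
Because the page does not document the XML schema of the cleansed files, a first step after downloading an archive might be to inspect which element names actually occur. The following is a minimal Python sketch under stated assumptions: the archive has been saved locally, its members of interest end in `.xml` (case-insensitive), and the function name `summarize_archive` is hypothetical. It makes no assumption about tag names, and it tolerates files that fail to parse, since the output is described as noisy.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter


def summarize_archive(zip_path):
    """Count element tags across all XML members of a zip archive.

    Returns a tuple (tag_counts, unparsable_members).
    The function name and return shape are illustrative, not part of the dataset.
    """
    tag_counts = Counter()
    unparsable = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            # Skip directory entries and anything that is not an XML file.
            if not name.lower().endswith(".xml"):
                continue
            try:
                root = ET.fromstring(archive.read(name))
            except ET.ParseError:
                # The files are described as noisy, so record failures and move on.
                unparsable.append(name)
                continue
            # iter() visits the root element and all of its descendants.
            tag_counts.update(element.tag for element in root.iter())
    return tag_counts, unparsable
```

Calling `summarize_archive("XML.ZIP")` on a local copy of the archive would return the tag frequencies together with the names of any members that could not be parsed, which gives a rough picture of how sections and paragraphs are marked up before any schema-specific code is written.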