SEPID Corpus

SEPID Corpus: Segmented, Pre-Processed, Indexed ACL ARC 1.0


The SEPID CORPUS consists of the segmented, processed ACL ARC documents, represented in the data model shown in the figures below. In this representation, each linguistically well-defined unit, i.e. lexemes (part-of-speech-tagged, lemmatized words), sentences, paragraphs, sections, etc., is identified by a unique identifier. Moreover, units at a higher level of granularity than lexemes consist of a combination of linguistic units at a finer level of granularity. For instance, a sentence consists of a list of lexemes and their positions in the sentence; a paragraph is a list of sentences, and so on. This representation of text is then serialized using a set of tab-separated text files; each text file represents a particular linguistic unit (a data-entity in the given diagrams).


Figure: Text data-entity relationships diagram at levels finer than paragraph

In order to model nested sections and subsections, text units at a granularity level higher than paragraphs are all considered content units of a specific content type. Each of these units is then a list of content units, each at a specific position. Further information about a content unit can be found in the file relevant to that text unit.


Figure: Text data-entity relationships diagram at levels higher than paragraph
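
The nested content-unit model described above can be walked recursively. The following is a minimal Python sketch, not part of the SEPID distribution. It assumes that the (super_content, sub_content, position) triples of the _all_content_content file and the id-to-type pairs of the _all_content_type file (both described later on this page) have already been loaded, that content ids are unique across content types, and that paragraphs carry the type string "paragraph". The function names, toy ids and type strings are hypothetical.

```python
from collections import defaultdict


def build_children(rows):
    """Group (super_id, sub_id, position) triples by super_id, ordered by position."""
    grouped = defaultdict(list)
    for super_id, sub_id, position in rows:
        grouped[super_id].append((position, sub_id))
    return {sup: [sub for _, sub in sorted(subs)] for sup, subs in grouped.items()}


def flatten(content_id, children, content_type, wanted="paragraph"):
    """Depth-first walk that returns the ids of `wanted` units in reading order."""
    if content_type.get(content_id) == wanted:
        return [content_id]
    found = []
    for sub_id in children.get(content_id, []):
        found.extend(flatten(sub_id, children, content_type, wanted))
    return found


# Hypothetical toy data: one document, one section with two paragraphs
# and one nested subsection that holds a third paragraph.
content_type = {
    1: "document", 10: "section", 11: "sub_section",
    100: "paragraph", 101: "paragraph", 102: "paragraph",
}
rows = [(1, 10, 0), (10, 101, 1), (10, 100, 0), (10, 11, 2), (11, 102, 0)]

children = build_children(rows)
paragraph_ids = flatten(1, children, content_type)
print(paragraph_ids)
```

If ids turn out not to be unique across content types in the real files, the dictionaries would need to be keyed by (id, type) pairs instead.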

The data in the files listed here are derived from processing the cleansed text documents using the Stanford tokenizer and part-of-speech tagger (version release date 9 July 2012), Apache OpenNLP's sentence splitter and chunker (version 1.5), and MaltParser (version 1.6), a data-driven dependency parser.

Each of the data-entities in the above figures is represented by a tab-separated text file. The first line of each file starts with the character "#" and describes the content of the records in the file. The corpus files can be downloaded from the list given below:


SEPID CORPUS


Size: Name: Description:
554.505.209   sepid_corpus.zip All the files listed below in one zip file.
7.543.215   _all_lexicon.zip All the lexemes, i.e. part-of-speech tagged, lemmatized words, extracted from the ACL ARC. The structure of this tab-separated file is as follows:
  • LEXEME_ID;
  • LEXEME_STRING: the extracted string/word as it appears in the corpus;
  • LEMMA: the lemma assigned to the word;
  • POS: the part-of-speech tag assigned to the word (a description of the Penn-style part-of-speech tags employed can be found in the Stanford tagger documentation);
  • FREQUENCY: the frequency of this lexeme in the corpus.
2.358.739   _all_sentence.zip All the extracted sentences from the ACL ARC. This file contains only one column, i.e. the list of integers used as SENTENCE_ID.
139.071.858   _all_sentence_lexeme.zip This file defines the extracted sentences from the corpus as lists of (lexeme_id, lexeme_position) tuples. The structure of this tab-separated file is as follows:
  • SENTENCE_ID: the id used to identify individual sentences; these ids are also listed in the above _all_sentence file;
  • LEXEME_ID: the lexeme_id of a word in the sentence. These lexeme_ids come from the above _all_lexicon file;
  • POSITION: the position of the lexeme in the sentence.
24.908.979   _all_chunk.zip All the extracted chunks (phrases) from the ACL ARC. This tab-separated file has records in the form of:
  • CHUNK_ID;
  • TYPE: the type of chunk, e.g. NP, VP, etc;
  • LIST_OF_LEXEME_IDS: the list of lexeme_ids in the same order as they appear in the chunk. These lexeme_ids are separated by the space character and come from the above _all_lexicon file;
  • FREQUENCY: the frequency of the chunk in the corpus.
85.975.964   _all_sentence_chunk.zip This file maps the extracted chunks to extracted sentences. This tab-separated file has records in the form of:
  • SENTENCE_ID;
  • CHUNK_ID: from the above _all_chunk file;
  • START_POSITION: the start position for the chunk in the sentence (i.e. token offset: the number of tokens from the beginning of the sentence);
  • END_POSITION: the end position of the chunk in the sentence.
54.683.904   _all_dependency.zip All the extracted syntactic relations (dependencies) between lexemes in the corpus. The structure of the records in this tab-separated file is as follows:
  • DEPENDENCY_ID;
  • GOVERNOR_LEXEME_ID: the lexeme_id of the lexeme (i.e. part-of-speech tagged word) that appears in the governor position in the syntactic relation;
  • REGENT_LEXEME_ID: the lexeme_id of the lexeme (i.e. part-of-speech tagged word) that appears in the regent position in the syntactic relation;
  • DEPENDENCY_TYPE: the type of syntactic relation, e.g. auxpass, det, etc. (for further information on the types of syntactic relations, please see the MaltParser documentation);
  • FREQUENCY: the frequency of the specified syntactic relations between the given two lexemes in the corpus.
226.281.573   _all_sentence_dependency_parse.zip The extracted syntactic relations are mapped into sentences. The structure of the records in this tab-separated file is as follows:
  • DEPENDENCY_ID;
  • SENTENCE_ID;
  • GOVERNOR_POSITION: the position of governor in the sentence;
  • REGENT_POSITION: the position of regent in the sentence;
  • DEPENDENCY_ID: the dependency_id from the above _all_dependency file;
  • DEPENDENCY_TYPE: the type of dependency (which is redundant, as it can be obtained from the _all_dependency file).
5.330.203   _all_paragraph_sentence.zip This file identifies the extracted paragraphs from the corpus. Each paragraph is formed of a list of sentences at certain positions, i.e. a list of (sentence_id, sentence_position) tuples. The structure of the records in this tab-separated file is as follows:
  • PARAGRAPH_ID;
  • SENTENCE_ID;
  • POSITION: the position of the sentence in the paragraph.
439.649   _all_section.zip All the extracted sections from the corpus. The structure of the records in this tab-separated file is as follows:
  • SECTION_ID: the integer id assigned to the section;
  • SECTION_TYPE: the type of the section, e.g. abstract, method, sub_section, etc.;
  • SENTENCE_ID: a sentence_id which can be used to obtain the title of the section.
In order to retrieve the section text, it is necessary to use the file _all_content_content (listed below) and traverse it recursively.
1.026.572   _all_content_type.zip The list of all text units other than lexemes, sentences and paragraphs, i.e. sections, documents, figures, tables and equations. This file is used to recover the text and the structure of documents. The structure of the records in this tab-separated file is as follows:
  • CONTENT_ID: the id assigned to the content; these ids come from the files _all_document, _all_section, _all_equation, and _all_paragraph;
  • CONTENT_TYPE: the type of the content, i.e. document, section, etc. In other words, the origin of the listed id.
1.722.830   _all_content_content.zip This file is used to retrieve/recover the structure and text for sections and documents. The structure of the records in this tab-separated file is as follows:
  • CONTENT_ID (SUPER_CONTENT): the content id of the text unit of higher level of granularity; for instance, for a section with a number of subsections, the section_id is listed as CONTENT_ID (SUPER_CONTENT);
  • CONTENT_ID (SUB_CONTENT): the content id of the text unit of finer level of granularity; for instance, for a section with a number of subsections, the sub_section_ids are listed as CONTENT_ID (SUB_CONTENT);
  • POSITION_OF_SUB_CONTENT_IN_SUPER_CONTENT: this determines the position of the sub-content in the content of higher level of granularity; e.g., the position of subsections in the section.
47.260   _all_document.zip The list of documents from the ACL ARC that are processed and indexed successfully. The structure of the records in this tab-separated file is as follows:
  • DOCUMENT_ID: an integer number that identifies the document in the corpus;
  • SENTENCE_ID: the sentence_id that can be used to retrieve the document title.
106.842   _all_equation_caption.zip All the extracted equations from the ACL ARC's sections. The structure of the records in this tab-separated file is as follows:
  • EQUATION_ID: an integer number that identifies the equation;
  • EQUATION_PARAGRAPH_ID: the paragraph_id that can be used to retrieve the text that may have accompanied the equation.
60.542   _all_figure_caption.zip All the extracted figures from the ACL ARC. Please note that neither the figures themselves nor their positions are stored. The structure of the records in this tab-separated file is as follows:
  • FIGURE_ID;
  • PARAGRAPH_ID: the paragraph_id that can be used to retrieve the caption of the figure.
56.392   _all_section_figure.zip Mapping between figures and the sections in which they originate. The structure of the records in this tab-separated file is as follows:
  • SECTION_ID: the id of the origin section;
  • FIGURE_ID: from the above _all_figure_caption file.
40.852   _all_table_caption.zip The extracted table captions from the corpus. The structure of the records in this tab-separated file is as follows:
  • TABLE_ID;
  • CAPTION_PARAGRAPH_ID: the id assigned to the caption paragraph; this paragraph_id can be used to retrieve the caption text.
37.576   _all_section_table.zip Mapping between tables and the sections in which they originate. The structure of the records in this tab-separated file is as follows:
  • SECTION_ID: the id of the origin section;
  • TABLE_ID: from the above _all_table_caption file.
41.977   _id_map_to_acl_arc.zip This file gives the mapping from the integer ids used for documents in the SEPID_CORPUS to the original ACL ARC ids. The structure of the records in this tab-separated file is as follows:
  • DOC_ID: the integer id of a document in the SEPID_CORPUS;
  • ACL_ARC_ID: the id in the ACL ARC (publications' original ACL ID).
98.043   _all_affiliation.zip Extracted affiliations from the ACL ARC corpus. This tab-separated file contains the following information:
  • AFFILIATION_ID;
  • AFFILIATION: text that is used to represent the affiliation.
1.004.429   _all_author.zip The list of extracted authors from the ACL ARC. This tab-separated file contains the following information:
  • AUTHOR_ID;
  • FIRST_NAME;
  • MIDDLE_NAME;
  • LAST_NAME.
43.115   _all_author_affiliation.zip Extracted affiliations for the authors who appear in the corpus. This tab-separated file has records in the form of:
  • AUTHOR_ID: the ids from the above _all_author file.
  • AFFILIATION_ID: these ids are from the above _all_affiliation file.
62.079   _all_email.zip All the extracted email addresses from the ACL ARC. This tab-separated file has records in the form of:
  • EMAIL_ID
  • EMAIL
24.031   _all_author_email.zip Extracted email addresses from the ACL ARC are assigned to authors. This tab-separated file has records in the form of:
  • AUTHOR_ID: these ids are from the file _all_author.
  • EMAIL_ID: these ids are from the file _all_email
2.759.385   _all_citation.zip The list of all extracted citations from the ACL ARC. The structure of the records in this tab-separated file is as follows:
  • CITATION_ID: the assigned id to the citation entry.
  • TITLE: a string that shows the title of the entry.
  • DATE: publication date.
Please note that a more reliable citation network is represented in the metadata that accompanies the ACL ARC distribution.
558.776   _all_citation_author.zip The list of authors of the extracted citations. The structure of the records in this tab-separated file is as follows:
  • CITATION_ID: from the above _all_citation file.
  • AUTHOR_ID: from the above _all_author file.
220.974   _all_document_citation.zip Indicates the list of citations for each document in the corpus. The structure of the records in this tab-separated file is as follows:
  • DOCUMENT_ID: from the above _all_document file.
  • CITATION_ID: from the above _all_citation file.
<DIR>   sepid_corpus_examples/ A short, truncated version of each of the files listed above can be found in this folder.
<DIR>   redundant_index/ A set of additional redundant indexes that may come in handy, or make it easier to process the corpus, can be found in this folder.
<DIR>   sepid_corpus_by_year_type/ This folder contains the same data as the files listed above. However, each text file in this folder is broken into 34 different files, each of which represents the text units extracted from articles published in a particular year, e.g. 67, 78, 92, and so on.

Directory contains 1.109.010.968 Bytes in 27 Files
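
As an illustration of how the files listed above fit together, the sketch below reads two tab-separated files, skips the first line that starts with "#", and rebuilds the tokens of one sentence by joining _all_sentence_lexeme records with _all_lexicon records. It is a minimal Python sketch under stated assumptions: the in-memory sample records, the exact header layout and the function names are hypothetical, and the real files would be opened from the unzipped archives instead of io.StringIO.

```python
import io


def read_records(handle):
    """Yield the fields of each record, skipping a first line that starts with '#'."""
    for index, line in enumerate(handle):
        line = line.rstrip("\r\n")
        if index == 0 and line.startswith("#"):
            continue  # header line describing the record fields
        if line:
            yield line.split("\t")


def sentence_tokens(sentence_id, sentence_lexeme_rows, lexicon):
    """Return the word strings of one sentence, ordered by POSITION."""
    positioned = [
        (int(position), lexicon[lexeme_id])
        for sid, lexeme_id, position in sentence_lexeme_rows
        if sid == sentence_id
    ]
    return [word for _, word in sorted(positioned)]


# Hypothetical sample content standing in for the unzipped files.
lexicon_file = io.StringIO(
    "# LEXEME_ID\tLEXEME_STRING\tLEMMA\tPOS\tFREQUENCY\n"
    "1\tparsers\tparser\tNNS\t42\n"
    "2\tare\tbe\tVBP\t900\n"
    "3\tfast\tfast\tJJ\t17\n"
)
sentence_lexeme_file = io.StringIO(
    "# SENTENCE_ID\tLEXEME_ID\tPOSITION\n"
    "7\t3\t2\n"
    "7\t1\t0\n"
    "7\t2\t1\n"
)

# Map LEXEME_ID to LEXEME_STRING; ids are kept as strings to avoid conversions.
lexicon = {row[0]: row[1] for row in read_records(lexicon_file)}
rows = list(read_records(sentence_lexeme_file))
tokens = sentence_tokens("7", rows, lexicon)
print(" ".join(tokens))
```

For the full _all_sentence_lexeme file, grouping the records by SENTENCE_ID in a single pass would be preferable to filtering the whole list once per sentence.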


Redundant Index Files

A set of redundant index files that can help to manipulate, search and process the corpus is provided. The files available for download are listed below.


Index of: redundant_index/


Size: Name: Description:
136.001.881   _redundant_index.zip All the files listed below in one Zip file.
63.877.459   _all_sentence_text.zip All the extracted sentences from the corpus; each record is one line of the text file in which the field values in the record are separated by tab_character+<s>+tab_character. Each record has the following fields:
  • SENTENCE_ID: the assigned id to the sentence in SEPID_CORPUS.
  • SENTENCE_STRING: extracted string for the sentence.
59.163.252   _all_paragraph_text.zip All the extracted paragraphs from the corpus; each record is one line of the text file in which the field values in the record are separated by tab_character+<p>+tab_character. Each record has the following fields:
  • PARAGRAPH_ID: the assigned id to the paragraph in the SEPID_CORPUS.
  • PARAGRAPH_STRING: extracted string for the paragraph.
5.516.793   _all_document_sentence.zip Mapping between extracted sentences and documents. The structure of the records in this tab-separated file is as follows:
  • DOCUMENT_ID: the assigned id to the document in the SEPID_CORPUS.
  • SENTENCE_ID: the assigned id to the sentence in the SEPID_CORPUS.
  • POSITION: the absolute position of the sentence in the document (i.e. the order of appearance of sentences in the document, caption sentences are also included).
1.449.532   _all_document_paragraph.zip Mapping between extracted paragraphs and documents. The structure of the records in this tab-separated file is as follows:
  • DOCUMENT_ID: the assigned id to the document in the SEPID_CORPUS.
  • PARAGRAPH_ID: the assigned id to the paragraph in the SEPID_CORPUS.
  • POSITION: the absolute position of the paragraph in the document (i.e. the order of appearance of paragraphs in the document, caption paragraphs are also included).
5.921.040   _all_section_sentence.zip Mapping between extracted sentences and sections. The structure of the records in this tab-separated file is as follows:
  • SECTION_ID: the assigned id to the section in the SEPID_CORPUS.
  • SENTENCE_ID: the assigned id to the sentence in the SEPID_CORPUS.
  • POSITION: the absolute position of the sentence in the section (i.e. the order of appearance of sentences in the section, caption sentences are also included).
73.579   _all_orphan_sentence.zip This file lists all the sentence ids that are not connected to any document. This problem is due to the incomplete indexing of some of the documents, e.g. because of bugs in the code, the appearance of illegal characters in documents, etc. This problem will be addressed in a later release. The records of this file thus consist only of a SENTENCE_ID.

Directory contains 272.003.536 Bytes in 7 Files
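
The _all_sentence_text and _all_paragraph_text files described above separate their two fields with a tab character, a marker (<s> or <p>) and another tab character, rather than with a single tab. A minimal Python sketch for splitting such records follows. The function names and sample lines are hypothetical, and it is an assumption that any header line in these files starts with "#" as in the main corpus files.

```python
def parse_text_record(line, marker):
    """Split one record into (id, text) on tab + marker + tab."""
    separator = "\t" + marker + "\t"
    # partition splits on the first occurrence only, so tabs in the text survive.
    record_id, found, text = line.rstrip("\r\n").partition(separator)
    if not found:
        raise ValueError("separator not found in record: %r" % line)
    return int(record_id), text


def parse_text_file(lines, marker):
    """Yield (id, text) pairs, skipping blank lines and '#' header lines."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        yield parse_text_record(line, marker)


# Hypothetical sample lines in the described layout.
sample_sentences = [
    "12\t<s>\tParsers are fast .\n",
    "13\t<s>\tTabs\tinside text survive .\n",
]
sample_paragraphs = ["5\t<p>\tParsers are fast . They are also accurate .\n"]

sentences = dict(parse_text_file(sample_sentences, "<s>"))
paragraphs = dict(parse_text_file(sample_paragraphs, "<p>"))
print(sentences[12])
```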

SEPID CORPUS Sectioned and Grouped by the Publication Year of Documents

Here you can download all the text units of the SEPID CORPUS listed above, with the files split and organized by the publication year of their source documents. The structure of these files is exactly the same as described in the table above.

Each of the files listed in the SEPID CORPUS (the above table) is broken down into 34 different files, each of which represents the text units extracted from the documents published in a particular year. For example, the _all_lexicon file is broken down into 34 files; the name of each file starts with a two-digit number, e.g. 98, 87, 67 and so on, which shows the year of publication, followed by "_lexicon". In this way, the file "87_lexicon" contains all the lexemes extracted from the documents published in the year 87, and the file "98_lexicon" contains all the lexemes extracted from documents published in the year 98.

In the current release these 34 years are: '06', '05', '04', '03', '02', '01', '00', '99', '98', '97', '96', '95', '94', '93', '92', '91', '90', '89', '88', '87', '86', '85', '84', '83', '82', '81', '80', '79', '78', '75', '73', '69', '67', '65'.
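
The naming scheme above can be enumerated programmatically. The following minimal Python sketch builds the per-year file names for one text unit and expands the two-digit prefixes to four-digit years. The helper names are hypothetical, and the expansion rule (prefixes '00' to '06' belong to the 2000s, all others to the 1900s) is an assumption based on the listed years, not something stated by the corpus.

```python
# Two-digit year prefixes copied from the list above.
YEAR_PREFIXES = [
    "06", "05", "04", "03", "02", "01", "00",
    "99", "98", "97", "96", "95", "94", "93", "92", "91", "90",
    "89", "88", "87", "86", "85", "84", "83", "82", "81", "80",
    "79", "78", "75", "73", "69", "67", "65",
]


def year_file_name(prefix, unit):
    """Build a per-year file name such as '87_lexicon'."""
    return prefix + "_" + unit


def full_year(prefix):
    """Expand a two-digit prefix; assumes 00-06 mean 2000-2006 (an assumption)."""
    number = int(prefix)
    return 2000 + number if number <= 6 else 1900 + number


lexicon_files = [year_file_name(prefix, "lexicon") for prefix in YEAR_PREFIXES]
years = sorted(full_year(prefix) for prefix in YEAR_PREFIXES)
print(lexicon_files[:3], years[0], years[-1])
```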


Index of: sepid_corpus_by_year_type/


Size: Name: Description:
623.815.172   _sepid_corpus_by_year_type.zip All the files listed below in one Zip file.
132.336   affiliation.zip Extracted affiliations, grouped by the year of publication of their source documents. For the structure of the records see the description given above.
1.049.391   author.zip Extracted author names, grouped by the year of publication of their source documents. For the structure of the records see the description given above.
53.147   author_affiliation.zip Extracted mappings between authors and affiliations, grouped by the year of publication of their source documents. For the structure of the records see the description given above.
31.103   author_email.zip Extracted mappings between author names and email addresses, grouped by the year of publication of their source documents. For the structure of the records see the description given above.
30.850.371   chunk.zip Extracted chunks, grouped by the year of publication of their source documents. For the structure of the records see the description given above.
2.806.382   citation.zip Extracted citations, grouped by the year of publication of their source documents. For the structure of the records see the description given above.
596.417   citation_author.zip Extracted mappings between authors and citations, grouped by the year of publication of their source documents. For the structure of the records see the description given above.
1.960.407   content_content.zip Extracted content mappings, grouped by the year of publication of their source documents. For the structure of the records see the description given above.
1.043.264   content_type.zip Extracted contents marked by their type, grouped by the year of publication of their source documents. For the structure of the records see the description given above.
107.218.828   dependency.zip Extracted syntactic relations between words, grouped by the year of publication of their source documents. For the structure of the records see the description given above.
59.717   document.zip Extracted documents, grouped by the year of publication of their source documents. For the structure of the records see the description given above.
250.900   document_citation.zip Extracted document citations, grouped by the year of publication of their source documents. For the structure of the records see the description given above.
85.957   email.zip Extracted email addresses, grouped by the year of publication of their source documents. For the structure of the records see the description given above.
132.992   equation_caption.zip Extracted equations, grouped by the year of publication of their source documents. For the structure of the records see the description given above.
77.716   figure_caption.zip Extracted figure captions, grouped by the year of publication of their source documents. For the structure of the records see the description given above.
17.150.076   lexicon.zip Extracted lexemes (part-of-speech tagged, lemmatized words), grouped by the year of publication of their source documents. For the structure of the records see the description given above.
5.570.395   paragraph_sentence.zip Extracted mapping between paragraphs and sentences, grouped by the year of publication of their source documents. For the structure of the records see the description given above.
458.957   section.zip Extracted text sections, grouped by the year of publication of their source documents. For the structure of the records see the description given above.
68.211   section_figure.zip Extracted mappings between sections and figure captions, grouped by the year of publication of their source documents. For the structure of the records see the description given above.
47.434   section_table.zip Extracted mappings between sections and tables, grouped by the year of publication of their source documents. For the structure of the records see the description given above.
1.706.071   sentence.zip Extracted sentences, grouped by the year of publication of their source documents. For the structure of the records see the description given above.
86.363.133   sentence_chunk.zip Extracted mappings between sentences and chunks, grouped by the year of publication of their source documents. For the structure of the records see the description given above.
226.439.393   sentence_dependency_parse.zip Extracted mappings between syntactic dependencies and sentences, grouped by the year of publication of their source documents. For the structure of the records see the description given above.
139.609.923   sentence_lexeme.zip Extracted mappings between sentences and lexemes, grouped by the year of publication of their source documents. For the structure of the records see the description given above.
53.179   table_caption.zip Extracted table captions, grouped by the year of publication of their source documents. For the structure of the records see the description given above.

Directory contains 1.247.630.872 Bytes in 26 Files

Example of Records in the SEPID CORPUS Index Files

You can explore the index files' structure in the truncated example files listed below.


Index of: sepid_corpus_examples/


Size: Name: Description:
27.501   _affiliation Example of Affiliation Index File
153.585   _author Example of Author Index File
9.369   _author_affiliation Example of Mapping Between Author and Affiliation Index Files
4.761   _author_email Example of Mapping Between Author and Email Index Files
395.388   _chunk Example of Chunk Index File
8.856   _citation Example of Citation Index File
2.421   _citation_author Example of Mapping Between Citation and Author Index Files
20.783   _content_content Example of Mapping Between Contents and Sub-Contents Index Files
22.735   _content_type Example of Content Index File
4.806   _dependency Example of Syntactic Dependency Index File
378   _document Example of Document Index File
1.451   _document_citation Example of Mapping Between Document and Citation Index Files
4.985   _email Example of Email Index File
1.401   _equation_caption Example of Equation Index File
1.564   _figure_caption Example of Figure Caption Index File
5.017   _lexicon Example of Lexeme Index File
3.463   _paragraph_sentence Example of Mapping Between Paragraphs and Sentences
2.243   _section Example of Section Index File
1.088   _section_figure Example of Mapping between Section and Figure Caption Index Files
1.492   _section_table Example of Mapping between Section and Table Caption Index Files
22.038   _sentence Example of Sentence Index File
3.589   _sentence_chunk Example of Mapping Between Sentence and Chunk Index Files
5.010   _sentence_dependency_parse Example of Mapping between Sentences and Indexed Syntactic Dependencies
3.306   _sentence_lexeme Example of Mapping Between Lexeme Indices and Sentences
2.004   _table_caption Example of Table Caption Index File

Directory contains 709.234 Bytes in 25 Files

Total: 2.629.354.610 Bytes in 85 Files

© Behrang QasemiZadeh Some Rights Reserved.

Creative Commons License

This page last edited on 21 April 2017.
