Annotations | ACL RD-TEC

Annotation Files

This folder contains all the candidate terms that have been manually annotated as a technology term (marked by 2), a valid term (marked by 1), or an invalid term (marked by 0).


Annotation Files available for download:


Size: Name: Description:
719.306   _ALL_ANNOTATED_CANDID_TERM.ZIP
A list of more than 80,000 manually annotated terms. Each line of the file contains the following fields:
  • TERM_ID: the unique identifier assigned to the (candidate) term. As mentioned elsewhere, these identifiers are the same across all the lists of candidate terms, so they can be used to trace and map an annotated term across those lists and the provided index files.
  • TERM_STRING: the lexical form of the term, e.g. "automatic document categorization".
  • TERM_ANNOTATION: currently limited to 0 for invalid, 1 for valid, and 2 for technology term. Please note that technology terms are a kind of valid term; thus, a term annotated as a "technology term" is also a "valid term".
83.081.599   _ALL_ANNOTATION_IN_SENTENCE.ZIP
This file lists all the sentence ids from the SEPID_CORPUS that contain at least one valid or technology term. If a sentence contains more than one valid or technology term, it is listed more than once. Terms annotated as invalid are excluded from this file. The structure of the file is as follows:
  • SENTENCE_ID: the id of the sentence.
  • SENTENCE: the sentence string, in which exactly one term is annotated by placing the tag <term id="candid_term_id" ann="annotation, either 1 for valid or 2 for technology term"></term> around the term. Please note that all the sentences are not verified manually. You can see a few examples of this file's entries here before downloading the zip file.
2.530.845   _ALL_ANNOTATION_MAP_TO_ACL_ARC_ID.ZIP
This file maps publications in ACL ARC to the annotated valid terms and technology terms. The structure of the file is as follows:
  • ACL_ARC_ID: the standard ACL ARC ID of a publication.
  • TERM_ID: the ID of an annotated valid or technology term that appears in this publication.
2.858.754   _ALL_ANNOTATION_MAP_TO_ACL_ARC_ID_HUMAN_READABLE.ZIP
This file has the same content as the _ALL_ANNOTATION_MAP_TO_ACL_ARC_ID.ZIP file above, but in a human-readable format. Technology terms and valid terms are mapped onto ACL ARC identifiers, and the titles of the papers and their authors are also given.
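
As a rough illustration of the SENTENCE field described for _ALL_ANNOTATION_IN_SENTENCE.ZIP, the following Python sketch pulls the annotated term out of a single sentence string. The function name, the regular expression, and the sample sentence are assumptions made for illustration only; the actual entries may differ in whitespace, quoting, or attribute order, so it is worth checking against the published examples first.

```python
import re

# Assumed tag shape, based on the description of the SENTENCE field:
#   <term id="candid_term_id" ann="1 or 2">term text</term>
TERM_TAG = re.compile(r'<term\s+id="([^"]+)"\s+ann="([12])"\s*>(.*?)</term>')

# Annotation codes as documented: 1 = valid term, 2 = technology term.
LABELS = {"1": "valid term", "2": "technology term"}


def extract_annotated_term(sentence):
    """Return (term_id, label, term_string, plain_sentence), or None if no tag is found."""
    match = TERM_TAG.search(sentence)
    if match is None:
        return None
    term_id, ann, term_string = match.groups()
    # Strip the markup but keep the term text to recover the plain sentence.
    plain_sentence = TERM_TAG.sub(lambda m: m.group(3), sentence)
    return term_id, LABELS[ann], term_string, plain_sentence


# Hypothetical sentence and term id, not taken from the corpus.
example = ('We apply <term id="12345" ann="2">automatic document '
           'categorization</term> to abstracts.')
result = extract_annotated_term(example)
```

Returning None for untagged input keeps the helper usable on lines that do not match the assumed tag shape, rather than raising in the middle of a large file.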

Additional Notes:

  • In order to locate terms in the corpus at any level of text segment granularity (e.g. paragraph, section, etc.), or to locate them by a specific location (e.g. in the topic sentences only), please use the provided index files for the lists of candidate terms.
  • Using the additional indices provided in SEPID_CORPUS, you can map the annotated terms onto publications (as in the above _ALL_ANNOTATION_MAP_TO_ACL_ARC_ID.ZIP), people, institutes, etc. These annotations can therefore be combined with the other annotation layers provided in the ACL ARC, e.g. the citation network.
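
One possible way to combine the term list with the publication mapping is sketched below in Python. It assumes that both files are plain text with one tab-separated record per line, which this page does not state; the separator is therefore a parameter. The function names and the sample records are hypothetical.

```python
from collections import defaultdict


def parse_terms(lines, sep="\t"):
    """Parse TERM_ID, TERM_STRING, TERM_ANNOTATION records into a dict keyed by TERM_ID."""
    terms = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        term_id, term_string, annotation = line.split(sep)
        terms[term_id] = (term_string, int(annotation))
    return terms


def is_valid(annotation):
    """Technology terms (2) are a kind of valid term (1), so both count as valid."""
    return annotation >= 1


def terms_by_publication(map_lines, terms, sep="\t"):
    """Group (term_string, annotation) pairs under each ACL_ARC_ID."""
    grouped = defaultdict(list)
    for line in map_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        acl_arc_id, term_id = line.split(sep)
        if term_id in terms:
            grouped[acl_arc_id].append(terms[term_id])
    return dict(grouped)


# Hypothetical records; the separator and all values are assumptions.
term_lines = [
    "1\tautomatic document categorization\t2",
    "2\tparse tree\t1",
    "3\tthe following\t0",
]
map_lines = ["P99-1001\t1", "P99-1001\t2", "J00-2002\t2"]

terms = parse_terms(term_lines)
by_publication = terms_by_publication(map_lines, terms)
valid_ids = sorted(tid for tid, (_, ann) in terms.items() if is_valid(ann))
```

Keying everything on TERM_ID follows the note that the identifiers are shared across all the lists, so the same join should carry over to the other index files once their layout is known.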

This page last edited on 21 April 2017.
