The ACL RD-TEC (v 1.0)
The ACL Reference Dataset for Terminology Extraction and Classification (ACL RD-TEC), version 1.0, was developed for benchmarking automatic term recognition algorithms (see QasemiZadeh and Handschuh, 2014). Its main component is a manually validated terminology: more than 80,000 candidate terms, each manually annotated as a valid, invalid, or technology term. More than 25,000 candidates are valid terms, of which about 13,000 are technology terms. In short, "technology terms" are those computational linguistics terms that signal processes, methods, and algorithms, that is, terms that signify practical solutions to NLP problems.
The complete resource, including the preprocessed and segmented ACL ARC, is available from this link. It includes the complete list of annotated candidate terms, the list of annotated terms, and the list of technology terms. The annotation process for validating terms was carried out by Behrang QasemiZadeh as part of his PhD research. The annotation was done on the output of a set of extraction methods, such as the one explained in this paper.
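Because the resource is meant for benchmarking term recognition, a typical use is to score the ranked output of an extractor against the annotated candidate list. The following is a minimal sketch under stated assumptions, not the authors' evaluation procedure: the dictionary `ANNOTATIONS`, the label strings, the sample terms, and the function names are all hypothetical, and no particular file layout of the distributed lists is assumed. Candidates missing from the annotations are treated as invalid, which is also an assumption.

```python
from typing import Dict, List

# Hypothetical in-memory stand-in for the annotated candidate list.
# The real file layout of ACL RD-TEC 1.0 is NOT assumed here.
ANNOTATIONS: Dict[str, str] = {
    "machine translation": "technology",
    "parsing algorithm": "technology",
    "parse tree": "valid",
    "previous section": "invalid",
    "other hand": "invalid",
}


def is_valid(label: str) -> bool:
    """Technology terms count as valid terms (they are a subset of them)."""
    return label in {"valid", "technology"}


def precision_at_k(ranked: List[str], annotations: Dict[str, str], k: int) -> float:
    """Fraction of the top-k ranked candidates that are annotated as valid."""
    top = ranked[:k]
    if not top:
        return 0.0
    # Assumption: a candidate without an annotation is counted as invalid.
    hits = sum(1 for term in top if is_valid(annotations.get(term, "invalid")))
    return hits / len(top)


def recall(ranked: List[str], annotations: Dict[str, str]) -> float:
    """Fraction of all valid annotated terms that the extractor proposed."""
    gold = {term for term, label in annotations.items() if is_valid(label)}
    if not gold:
        return 0.0
    return len(gold & set(ranked)) / len(gold)


# Hypothetical ranked output of some term extractor.
ranked_output = ["machine translation", "previous section", "parse tree", "other hand"]

print(precision_at_k(ranked_output, ANNOTATIONS, 2))
print(recall(ranked_output, ANNOTATIONS))
```

Restricting `is_valid` to the "technology" label would give the corresponding scores for technology terms only.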
Note that, in a separate effort, QasemiZadeh and Schumann introduced the ACL RD-TEC 2.0, which complements this resource by providing annotations of terms in context.
For commercial use, the ACL RD-TEC 1.0 is now also available through ELRA.
Human-Readable Examples
- Examples of terms and their assignment to documents from the ACL Anthology corpus.
- Main index of 13 semantically related terms: both on ACL ARC (v 1.0 and 2.0).
License
Please attribute this dataset by citing Q. Zadeh and Handschuh (2014).
This dataset is based on the ACL Anthology Reference Corpus (ACL ARC) at http://acl-arc.comp.nus.edu.sg/. It is also available via ELRA under reference ELRA-T0375. Permissions beyond the scope of this license may be available; for inquiries, please contact Behrang QasemiZadeh or ELRA.