The ACL RD-TEC 1.0
The ACL Reference Dataset for Terminology Extraction and Classification (ACL RD-TEC), version 1.0, was developed for benchmarking automatic term recognition algorithms (see QasemiZadeh and Handschuh, 2014). Its main component is a manually validated terminology: more than 80,000 candidate terms, each manually annotated as a valid term, an invalid term, or a technology term. More than 25,000 candidates are valid terms, of which about 13,000 are technology terms. In short, "technology terms" are computational linguistics jargon that denotes processes, methods, and algorithms: terms that signify practical solutions to NLP problems.
The complete resource, including the preprocessed, segmented ACL ARC, is available from this link. It includes the complete list of annotated candidate terms, the list of annotated terms, and the list of technology terms. The annotation process for validating terms was carried out by Behrang QasemiZadeh as part of his PhD research. The annotation was performed on the output of a set of extraction methods, such as the one explained in Zadeh and Handschuh (2014b).
Note that, in a separate effort, QasemiZadeh and Schumann introduced the ACL RD-TEC 2.0, which complements this resource by providing annotations of terms in context.
For commercial use, ACL ARC 1.0 is now also available through ELRA.
Please attribute this dataset by citing Zadeh and Handschuh (2014).
This dataset is based on the ACL Anthology Reference Corpus (ACL ARC), available at http://acl-arc.comp.nus.edu.sg/. It is also available via ELRA under reference ELRA-T0375. Permissions beyond the scope of this license may be available; for inquiries, please contact Behrang QasemiZadeh or ELRA.