Candidate Terms | ACL RD-TEC

The Set of Extracted Candidate Terms

All the extracted candidate terms can be found here. There are currently 4 inner directories of candidate terms, each holding the terms extracted using a specific method, plus a directory of stop words.


List of Extracted Candidate Terms


Size: Name: Description:
<DIR>   POS_BASED/ This directory contains candidate terms extracted using a set of part-of-speech patterns.
<DIR>   NP_CHUNK_BASED_1/ This directory contains candidate terms extracted using an NP chunking technique. All noun phrases (NP chunks) of maximum length 5 (after removing determiners) are considered candidate terms. The frequencies of the candidate terms are then calculated independently of the NP chunk boundaries.
<DIR>   NP_CHUNK_BASED_2/ The same technique as in "np_chunk_based_1" is used to extract candidate terms (identical structure). However, for the candidate terms listed in this folder, the search for occurrences of a candidate term is limited by the boundaries of the NP chunks in which it appears. That is, instead of a simple search for candidate term strings, the NP chunks from which the candidate terms are derived are indexed.
<DIR>   N_GRAM_BASED/ This directory contains candidate terms extracted using an n-gram-based technique (n = 1 to 5).
<DIR>   STOP_WORD/ This folder contains the list of stop words that are employed to filter the candidate terms extracted using the n-gram technique as well as the NP-chunk-based methods.
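
The difference between the two NP-chunk-based counting schemes can be illustrated with a small sketch. This is not the code used to build the lists; it is one possible reading of the descriptions above, and the function names, the toy sentences, and the toy chunks are all hypothetical. In the first scheme a candidate term is counted wherever its token sequence occurs; in the second it is counted only when an indexed NP chunk is exactly that candidate term.

```python
from collections import Counter

def count_anywhere(candidates, sentences):
    """Count every occurrence of each candidate's token sequence in the text,
    ignoring NP chunk boundaries (assumed reading of NP_CHUNK_BASED_1)."""
    counts = Counter()
    for tokens in sentences:
        for cand in candidates:
            n = len(cand)
            for i in range(len(tokens) - n + 1):
                if tuple(tokens[i:i + n]) == cand:
                    counts[cand] += 1
    return counts

def count_as_chunk(candidates, chunks):
    """Count only NP chunks that are exactly a candidate term
    (assumed reading of NP_CHUNK_BASED_2)."""
    wanted = set(candidates)
    return Counter(chunk for chunk in chunks if chunk in wanted)

# Toy data (hypothetical): tokenised sentences and the NP chunks found in
# them, with determiners already removed.
sentences = [
    ["the", "machine", "translation", "system", "uses",
     "statistical", "machine", "translation"],
    ["machine", "translation", "is", "hard"],
]
chunks = [
    ("machine", "translation", "system"),
    ("statistical", "machine", "translation"),
    ("machine", "translation"),
]
candidates = sorted(set(chunks))

anywhere = count_anywhere(candidates, sentences)
as_chunk = count_as_chunk(candidates, chunks)
```

Under these assumptions, "machine translation" is counted three times by the first scheme, because it also occurs inside two longer chunks, but only once by the second.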

Additional Notes

  • The frequencies provided for the candidate terms in these lists can readily be used to score and rank the terms with classic methods such as TF-IDF, C-value, etc.
  • The identifiers given to candidate terms are unique across all the lists of extracted candidate terms. That is, if, for example, term "ABC" is given unique_id "1" in list X, it is also assigned unique_id "1" in list Y.
  • Additional sets of candidate terms may be added here.
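
As an illustration of the first note, the following is a minimal sketch of ranking candidate terms by a simplified form of the C-value score, computed from term frequencies alone. The frequencies are invented for the example, the in-memory dictionary stands in for whatever format the lists actually use, and the names is_nested and c_value are hypothetical. The sketch omits the frequency corrections of the full iterative C-value algorithm. Because the weight is log2 of the term length, single-word terms score 0 under this form.

```python
import math

def is_nested(short, long):
    """True if token tuple `short` occurs contiguously inside the longer tuple `long`."""
    n = len(short)
    return len(long) > n and any(
        long[i:i + n] == short for i in range(len(long) - n + 1)
    )

def c_value(freqs):
    """Simplified C-value: log2(length) * frequency, where the frequency of a
    nested term is reduced by the mean frequency of the longer candidate
    terms that contain it."""
    scores = {}
    for term, freq in freqs.items():
        containers = [f_b for b, f_b in freqs.items() if is_nested(term, b)]
        adjusted = freq
        if containers:
            adjusted = freq - sum(containers) / len(containers)
        scores[term] = math.log2(len(term)) * adjusted
    return scores

# Invented frequencies, for illustration only.
freqs = {
    ("machine", "translation"): 10,
    ("statistical", "machine", "translation"): 4,
    ("machine", "translation", "system"): 2,
}
scores = c_value(freqs)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Here "machine translation" is nested in both longer terms, so its frequency of 10 is reduced by their mean frequency of 3 before weighting.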

This page last edited on 21 April 2017.
