n-Gram-based Candidate Terms | ACL RD-TEC

Candidate Terms Extracted Using an n-Gram-based Technique

This directory contains candidate terms extracted using an n-gram-based technique (n = 1 to 5). The structure of the files and folders listed here is exactly the same as in candid_term/pos_based/.


Extracted Candidate Terms Using n-Gram-based Technique


Size: Name: Description:
83.753.704   _ALL_CANDID_TERM_BY_NGRAM.ZIP
This file contains all the extracted candidate terms. Each line of the file represents the following information:
  • TERM_ID: a universal integer id assigned to the candidate term. If a term also appears in other lists of extracted candidate terms (e.g. the pos-based list), it has the same integer id in all of them.
  • STRING LENGTH: length of candidate term
  • CORPUS_FREQ: the number of occurrences of the candidate term in the segmented pre-processed corpus (i.e. SEPID_CORPUS), in other words the term frequency (tf).
  • DOCUMENT_FREQ: the number of documents in which the candidate term occurs, i.e. the document frequency of the term, which can be used for calculating the inverse document frequency.
  • SECTION_FREQ: the number of sections in which the term occurs.
  • PARAGRAPH_FREQ: the number of paragraphs in which the term occurs.
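As a rough illustration of how DOCUMENT_FREQ might feed an inverse document frequency, the sketch below uses the common log(N / df) form. The function name `inverse_document_frequency` and the `n_documents` parameter are assumptions made for illustration; the total number of documents in SEPID_CORPUS is not stated here and has to be taken from the corpus itself.

```python
import math


def inverse_document_frequency(n_documents: int, document_freq: int) -> float:
    """Return log(N / df), one common IDF variant.

    This is an assumed weighting for illustration, not necessarily the
    one used by the dataset authors.
    """
    if document_freq <= 0 or document_freq > n_documents:
        raise ValueError("document_freq must be in the range 1..n_documents")
    return math.log(n_documents / document_freq)


# Hypothetical numbers: a term found in 10 of 1000 documents.
example_idf = inverse_document_frequency(1000, 10)
```

A term that occurs in every document gets an IDF of zero under this variant, so smoothed forms may be preferable depending on the application.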
53.565.570   _ALL_CANDID_TERM_BY_NGRAM_DOCUMENT_INDEX.ZIP
An inverted index file that maps terms to documents in the corpus. Each line of the file records a single occurrence of a term as a TERM_ID followed by a DOCUMENT_ID (tab separated). DOCUMENT_ID is the integer id assigned to each document in the SEPID_CORPUS.
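The tab-separated layout described above could be read into an in-memory mapping along the following lines. This is a minimal sketch: the function name `load_term_document_index` and the sample lines are made up for illustration, and both ids are assumed to parse as integers.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def load_term_document_index(lines: Iterable[str]) -> Dict[int, Set[int]]:
    """Map each TERM_ID to the set of DOCUMENT_IDs it occurs in."""
    index: Dict[int, Set[int]] = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        term_id, document_id = line.split("\t")[:2]
        index[int(term_id)].add(int(document_id))
    return index


# Made-up sample lines in the TERM_ID<TAB>DOCUMENT_ID layout.
sample = ["12\t3", "12\t7", "12\t3", "40\t7"]
term_to_docs = load_term_document_index(sample)
```

Because each line is a single occurrence, the same (term, document) pair may repeat; storing documents in a set collapses the repeats, so `len(term_to_docs[12])` gives the document frequency of term 12 in the sample.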
59.079.540   _ALL_CANDID_TERM_BY_NGRAM_SECTION_INDEX.ZIP
Similar to the document index, but for sections: an inverted index file that maps terms to sections in the corpus. Each line of the file records a single occurrence of a term as a TERM_ID followed by a SECTION_ID (tab separated).
67.404.253   _ALL_CANDID_TERM_BY_NGRAM_PARAGRAPH_INDEX.ZIP
Similar to the document index, but for paragraphs: each line is a TERM_ID followed by a PARAGRAPH_ID from SEPID_CORPUS (tab separated).
133.688.063   _ALL_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP
Similar to the document index, but for sentences. Each line is a TERM_ID followed by a SENTENCE_ID, followed by the START and END positions of the term. START and END are token numbers in the sentence.
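A line of the sentence index could be parsed into a small record as sketched below. The names `TermOccurrence` and `parse_sentence_index_line` are hypothetical, the fields are assumed to be separated by whitespace, and SENTENCE_ID is kept as a string because its type is not stated here. Whether END is inclusive is also not stated, so the sketch does not slice tokens.

```python
from typing import NamedTuple


class TermOccurrence(NamedTuple):
    term_id: int
    sentence_id: str
    start: int  # token number where the term starts
    end: int    # token number where the term ends


def parse_sentence_index_line(line: str) -> TermOccurrence:
    """Parse one line of the form TERM_ID, SENTENCE_ID, START, END."""
    # Assumption: fields are separated by whitespace (tab or space).
    term_id, sentence_id, start, end = line.split()
    return TermOccurrence(int(term_id), sentence_id, int(start), int(end))


# Made-up sample line for illustration.
occurrence = parse_sentence_index_line("12\t345\t4\t6")
```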
<DIR>   CANDID_TERM_BY_NGRAM_SENTENCE_INDEX/
The (candidate-term-id, sentence-id) indices (i.e. those in _all_candid_term_by_ngram_sentence_index.zip) are grouped by the year of publication of the source documents. The first two characters of each filename give the year of publication. For instance, the file "84_candid_term_by_ngram_sentence_index.zip" contains all sentence-to-term-id mappings for sentences from publications in 1984. These files, together with the additional index files provided in SEPID_CORPUS, can be used to organize candidate terms in chronological order. There are currently 34 files, representing publications from 65 (i.e. 1965) to 06 (i.e. 2006).
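Since the two-digit prefixes sort alphabetically rather than chronologically (00 to 06 come before 65 to 99), a small helper can expand them to four-digit years. The sketch below is an assumption-laden illustration: `publication_year` is a hypothetical name, and the cut-off relies on the prefixes listed on this page, where 00 through 06 are the only ones in the 2000s.

```python
def publication_year(filename: str) -> int:
    """Expand the two-digit filename prefix into a four-digit year."""
    two_digit = int(filename[:2])
    # Assumption: prefixes up to 06 belong to the 2000s and all other
    # prefixes to the 1900s, as in the files listed on this page.
    return 2000 + two_digit if two_digit <= 6 else 1900 + two_digit


names = [
    "00_candid_term_by_ngram_sentence_index.zip",
    "84_candid_term_by_ngram_sentence_index.zip",
    "06_candid_term_by_ngram_sentence_index.zip",
    "65_candid_term_by_ngram_sentence_index.zip",
]
# Sort the per-year files from oldest to newest.
chronological = sorted(names, key=publication_year)
```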

Directory contains 397.491.130 Bytes in 5 Files

Index of: CANDID_TERM_BY_NGRAM_SENTENCE_INDEX/

(term-sentence index files grouped by publication date)

Size: Name: Description:
8.904.206   00_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 2000.
5.077.971   01_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 2001.
7.364.957   02_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 2002.
8.531.814   03_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 2003.
15.716.976   04_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 2004.
8.932.110   05_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 2005.
17.449.374   06_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 2006.
316.706   65_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1965.
359.915   67_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1967.
718.613   69_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1969.
235.237   73_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1973.
269.580   75_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1975.
357.955   78_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1978.
1.349.004   79_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1979.
977.118   80_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1980.
411.632   81_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1981.
877.947   82_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1982.
873.192   83_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1983.
885.106   84_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1984.
896.840   85_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1985.
1.744.513   86_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1986.
1.182.829   87_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1987.
2.299.342   88_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1988.
1.785.755   89_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1989.
2.679.766   90_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1990.
2.504.422   91_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1991.
7.344.710   92_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1992.
3.266.949   93_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1993.
4.054.788   94_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1994.
1.803.095   95_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1995.
3.560.826   96_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1996.
6.815.681   97_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1997.
9.045.267   98_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1998.
4.924.792   99_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1999.

Directory contains 133.518.988 Bytes in 34 Files

Total: 531.010.118 Bytes in 39 Files
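The sizes on this page use dots as thousands separators, not decimal points. The sketch below, with the hypothetical helper `parse_size`, converts them to integer byte counts and adds the two directory totals given above.

```python
def parse_size(text: str) -> int:
    """Convert a size such as '83.753.704' into an integer byte count."""
    # The dots are thousands separators, not decimal points.
    return int(text.replace(".", ""))


top_level = parse_size("397.491.130")  # the 5 top-level files
by_year = parse_size("133.518.988")    # the 34 per-year files
total = top_level + by_year
```

The sum matches the stated total of 531.010.118 bytes.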

This page last edited on 21 April 2017.
