n-Gram-based Candidate Terms | ACL RD-TEC
Candidate Terms Extracted Using an n-Gram-based Technique
This directory contains candidate terms extracted using an n-gram-based technique (n = 1 to 5). The structure of the files and folders listed here is exactly the same as in candid_term/pos_based/.
Extracted Candidate Terms Using n-Gram-based Technique |
Size: | Name: | Description: | |
83.753.704 | _ALL_CANDID_TERM_BY_NGRAM.ZIP | This file contains all the extracted candidate terms. Each line of the file represents the following information: | |
53.565.570 | _ALL_CANDID_TERM_BY_NGRAM_DOCUMENT_INDEX.ZIP | An inverted index file that maps terms to documents in the corpus. Each line of the file records a single occurrence of a term as a TERM_ID followed by a DOCUMENT_ID (tab-separated). Note that DOCUMENT_ID is the integer id assigned to each document in the SEPID_CORPUS. | |
59.079.540 | _ALL_CANDID_TERM_BY_NGRAM_SECTION_INDEX.ZIP | As above, but for sections: an inverted index file that maps terms to sections in the corpus. Each line of the file records a single occurrence of a term as a TERM_ID followed by a SECTION_ID (tab-separated). | |
67.404.253 | _ALL_CANDID_TERM_BY_NGRAM_PARAGRAPH_INDEX.ZIP | As above, but for paragraphs: each line is a TERM_ID followed by a PARAGRAPH_ID (tab-separated), where PARAGRAPH_ID is the paragraph id from the SEPID_CORPUS. | |
133.688.063 | _ALL_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP | As above, but for sentences. Each line is a TERM_ID followed by a SENTENCE_ID, followed by the START and END positions of the term. START and END are token numbers within the sentence. | |
<DIR> | CANDID_TERM_BY_NGRAM_SENTENCE_INDEX/ | The (candidate-term-id, sentence-id) indices (i.e. those in _all_candid_term_by_ngram_sentence_index.zip) are grouped by the year of publication of the source documents. The first two characters of each filename give the year of publication. For instance, the file "84_candid_term_by_ngram_sentence_index.zip" contains all sentence-to-term-id mappings for sentences from publications in the year 84 (i.e. 1984). Together with the additional index files provided in SEPID_CORPUS, these files can be used to organize candidate terms in chronological order. There are currently 34 files, representing publications from 65 (i.e. 1965) to 06 (i.e. 2006). | |
Directory contains 397.491.130 Bytes in 5 Files |
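The index formats described in the table above can be read with a few lines of Python. The following is a minimal sketch, not part of the distribution: the names `build_postings`, `parse_sentence_line` and `SentenceOccurrence` are hypothetical, ids are kept as strings because only DOCUMENT_ID is stated to be an integer, and the sentence index is assumed to be whitespace-separated because its separator is not stated. The sample lines are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set


class SentenceOccurrence(NamedTuple):
    """One line of a sentence index: a term and its token span."""
    term_id: str
    sentence_id: str
    start: int  # token number where the term starts
    end: int    # token number where the term ends


def build_postings(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Group 'TERM_ID<TAB>UNIT_ID' occurrence lines by TERM_ID.

    UNIT_ID stands for a DOCUMENT_ID, SECTION_ID or PARAGRAPH_ID,
    depending on which index file the lines come from.
    A line with fewer than two fields raises ValueError.
    """
    postings: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        term_id, unit_id = line.split("\t")[:2]
        postings[term_id].add(unit_id)
    return dict(postings)


def parse_sentence_line(line: str) -> SentenceOccurrence:
    """Parse 'TERM_ID SENTENCE_ID START END' (separator assumed to be whitespace)."""
    term_id, sentence_id, start, end = line.split()[:4]
    return SentenceOccurrence(term_id, sentence_id, int(start), int(end))


# Made-up sample lines, not taken from the real files.
sample = ["17\t3", "17\t8", "42\t3", "17\t3"]
postings = build_postings(sample)
occurrence = parse_sentence_line("17\t120\t4\t6")
```

Because each line records a single occurrence, the same (term, unit) pair may repeat; collecting ids into sets keeps only the distinct units per term. Use a counter instead if occurrence frequencies are needed.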
Index of: CANDID_TERM_BY_NGRAM_SENTENCE_INDEX/ (term-sentence index files grouped by publication date) |
Size: | Name: | Description: | |
8.904.206 | 00_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 2000. | |
5.077.971 | 01_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 2001. | |
7.364.957 | 02_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 2002. | |
8.531.814 | 03_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 2003. | |
15.716.976 | 04_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 2004. | |
8.932.110 | 05_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 2005. | |
17.449.374 | 06_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 2006. | |
316.706 | 65_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1965. | |
359.915 | 67_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1967. | |
718.613 | 69_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1969. | |
235.237 | 73_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1973. | |
269.580 | 75_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1975. | |
357.955 | 78_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1978. | |
1.349.004 | 79_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1979. | |
977.118 | 80_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1980. | |
411.632 | 81_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1981. | |
877.947 | 82_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1982. | |
873.192 | 83_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1983. | |
885.106 | 84_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1984. | |
896.840 | 85_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1985. | |
1.744.513 | 86_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1986. | |
1.182.829 | 87_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1987. | |
2.299.342 | 88_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1988. | |
1.785.755 | 89_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1989. | |
2.679.766 | 90_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1990. | |
2.504.422 | 91_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1991. | |
7.344.710 | 92_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1992. | |
3.266.949 | 93_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1993. | |
4.054.788 | 94_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1994. | |
1.803.095 | 95_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1995. | |
3.560.826 | 96_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1996. | |
6.815.681 | 97_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1997. | |
9.045.267 | 98_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1998. | |
4.924.792 | 99_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1999. | |
Directory contains 133.518.988 Bytes in 34 Files | |||
Total: 531.010.118 Bytes in 39 Files |
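Sorting the per-year files by name puts the 2000s (00 to 06) before the 1960s, so chronological processing needs the two-digit prefix expanded to a full year. The sketch below shows one way to do this. The helper names and the pivot value of 50 are assumptions made for illustration, chosen because the listing only covers 1965 to 2006; the filename pattern follows the example quoted in the directory description.

```python
import re
from typing import Iterable, List

# Filename pattern as quoted in the description, e.g.
# "84_candid_term_by_ngram_sentence_index.zip" (case-insensitive).
_NAME = re.compile(
    r"^(\d{2})_candid_term_by_ngram_sentence_index\.zip$", re.IGNORECASE
)

# Assumed pivot: prefixes below 50 are read as 20xx, all others as 19xx.
_PIVOT = 50


def publication_year(filename: str) -> int:
    """Expand the two-digit year prefix of a per-year index file."""
    match = _NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected filename: {filename!r}")
    yy = int(match.group(1))
    return 2000 + yy if yy < _PIVOT else 1900 + yy


def chronological(filenames: Iterable[str]) -> List[str]:
    """Return the per-year index files ordered by publication year."""
    return sorted(filenames, key=publication_year)


files = [
    "00_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP",
    "84_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP",
    "65_CANDID_TERM_BY_NGRAM_SENTENCE_INDEX.ZIP",
]
ordered = chronological(files)  # 1965 file first, 2000 file last
```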