NP Chunk-based Candidate Terms (2) | ACL RD-TEC
Candidate Terms Extracted Using NP-Chunking (2)
The same technique as in "np_chunk_based_1" is used to extract candidate terms (identical structure). For the candidate terms listed in this folder, however, the search for occurrences of a candidate term is limited to the boundaries of the NP chunks in which it appears. That is, instead of a simple search for candidate term strings, the NP chunks from which the terms are derived are searched.
As stated for "np_chunk_based_1", sentences in the corpus are chunked using the Apache OpenNLP chunker (release version 1.5.2, http://opennlp.apache.org/). All noun phrases (NP chunks) of maximum length 5 (after removing determiners and stop words) are considered candidate terms.
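The description above can be illustrated with a minimal sketch. It is not the code used to build the resource and rests on several assumptions: chunks arrive as lists of (token, POS tag) pairs, determiners carry the Penn Treebank tag DT, stop words are dropped wherever they occur in the chunk, terms are lowercased, and the tiny STOP_WORDS set is hypothetical (the list actually used is not given here).

```python
from collections import Counter

# Hypothetical, deliberately tiny stop list; the real list is not given.
STOP_WORDS = {"of", "in", "for", "this", "such"}
MAX_LEN = 5  # maximum candidate length stated in the description


def candidate_from_chunk(chunk):
    """Turn one NP chunk, given as (token, POS tag) pairs, into a candidate term.

    Returns None when nothing is left or the remainder exceeds MAX_LEN tokens.
    """
    kept = [
        tok.lower()
        for tok, tag in chunk
        if tag != "DT" and tok.lower() not in STOP_WORDS
    ]
    if not kept or len(kept) > MAX_LEN:
        return None
    return " ".join(kept)


def count_by_chunk(chunks):
    """Count occurrences per candidate, one for each NP chunk that yields it."""
    counts = Counter()
    for chunk in chunks:
        term = candidate_from_chunk(chunk)
        if term is not None:
            counts[term] += 1
    return counts


chunks = [
    [("the", "DT"), ("language", "NN"), ("model", "NN")],
    [("a", "DT"), ("statistical", "JJ"), ("language", "NN"), ("model", "NN")],
]
counts = count_by_chunk(chunks)
print(counts["language model"])              # 1
print(counts["statistical language model"])  # 1
```

A plain string search would find "language model" twice in these two chunks, because it is also a substring of "statistical language model". Under the chunk-bounded reading sketched here it is counted once.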
Size: | Name: | Description: |
24.726.493 | _ALL_CANDID_TERM_BY_NP_CHUNK_2.ZIP | This file contains all the extracted candidate terms. Each line of the file represents the following information: |
8.650.790 | _ALL_CANDID_TERM_BY_NP_CHUNK_2_DOCUMENT_INDEX.ZIP | An inverted index file that maps terms to documents in the corpus. Each line of the file shows a single occurrence of a term in the form of TERM_ID followed by DOCUMENT_ID (tab separated). Please note that DOCUMENT_ID corresponds to an integer id that is assigned to each document in the SEPID_CORPUS. |
9.203.508 | _ALL_CANDID_TERM_BY_NP_CHUNK_2_SECTION_INDEX.ZIP | As above, but for sections: an inverted index file that maps terms to sections in the corpus. Each line of the file shows a single occurrence of a term in the form of TERM_ID followed by SECTION_ID (tab separated). |
10.181.828 | _ALL_CANDID_TERM_BY_NP_CHUNK_2_PARAGRAPH_INDEX.ZIP | As above, but for paragraphs, i.e. TERM_ID followed by PARAGRAPH_ID from SEPID_CORPUS (tab separated). |
15.869.711 | _ALL_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP | As above, but for sentences. The format of the file is TERM_ID followed by SENTENCE_ID followed by the START and END positions of the term. START and END are token numbers in the sentence. |
542 | README.TXT | A note on collecting frequencies. |
<DIR> | CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX/ |
The (candidate-term-id, sentence-id) indices (i.e. those in _ALL_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP) are grouped by the year of publication of the source documents. The first two characters of each filename give the year of publication. For instance, the file "84_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP" contains all term-id to sentence mappings for sentences from publications of the year 1984. These files, together with the additional index files provided in SEPID_CORPUS, can be used to organize candidate terms in chronological order. There are currently 34 files, representing publications from 65 (i.e. 1965) to 06 (i.e. 2006). |
Directory contains 68.632.872 Bytes in 6 Files |
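As a rough illustration of how the index files described above might be consumed, the sketch below counts occurrences and distinct units (documents, sections or paragraphs) per term from TERM_ID/unit-id lines. The function names are hypothetical, the sample IDs are invented, and the tab separator for the sentence index is an assumption carried over from the other index files.

```python
from collections import Counter, defaultdict


def index_stats(lines):
    """Return (occurrences per term, number of distinct units per term)."""
    occurrences = Counter()
    units = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        term_id, unit_id = fields[0], fields[1]
        occurrences[term_id] += 1  # each line is one occurrence
        units[term_id].add(unit_id)
    return occurrences, {t: len(u) for t, u in units.items()}


def parse_sentence_line(line):
    """Split TERM_ID, SENTENCE_ID, START, END; the tab separator is assumed."""
    term_id, sentence_id, start, end = line.rstrip("\n").split("\t")
    return term_id, sentence_id, int(start), int(end)


sample = ["17\t3\n", "17\t3\n", "17\t8\n", "42\t3\n"]  # invented IDs
tf, df = index_stats(sample)
print(tf["17"], df["17"])  # 3 2
print(parse_sentence_line("17\t120\t4\t6\n"))  # ('17', '120', 4, 6)
```

IDs are kept as strings because only DOCUMENT_ID is stated to be an integer. Any iterable of text lines works as input, for example an opened, unzipped index file.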
Index of: CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX/ |
Size: | Name: | Description: |
880.857 | 00_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 2000. |
466.720 | 01_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 2001. |
677.691 | 02_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 2002. |
774.579 | 03_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 2003. |
1.417.449 | 04_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 2004. |
799.038 | 05_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 2005. |
1.560.183 | 06_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 2006. |
35.371 | 65_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1965. |
42.235 | 67_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1967. |
79.274 | 69_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1969. |
62.929 | 73_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1973. |
59.103 | 75_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1975. |
74.225 | 78_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1978. |
406.722 | 79_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1979. |
207.656 | 80_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1980. |
85.977 | 81_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1981. |
182.388 | 82_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1982. |
180.000 | 83_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1983. |
187.816 | 84_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1984. |
198.365 | 85_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1985. |
355.766 | 86_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1986. |
224.923 | 87_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1987. |
471.667 | 88_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1988. |
319.428 | 89_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1989. |
497.403 | 90_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1990. |
433.317 | 91_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1991. |
732.359 | 92_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1992. |
558.362 | 93_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1993. |
835.715 | 94_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1994. |
299.990 | 95_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1995. |
677.865 | 96_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1996. |
645.948 | 97_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1997. |
848.718 | 98_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1998. |
458.312 | 99_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP | Term-Sentence indices from articles published in year 1999. |
Directory contains 15.738.351 Bytes in 34 Files |
Total: 84.371.223 Bytes in 40 Files |
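Since the per-year files in the listing above sort alphabetically (00 to 06 before 65 to 99), a small helper can restore chronological order. This is a minimal sketch: the function name and the pivot value of 50 are assumptions, chosen only because the listed prefixes run from 65 to 99 and from 00 to 06.

```python
def publication_year(filename, pivot=50):
    """Map a two-digit filename prefix to a four-digit year.

    pivot is an assumption: prefixes >= pivot are read as 19xx, others as 20xx.
    """
    prefix = filename[:2]
    if not prefix.isdigit():
        raise ValueError(f"no two-digit year prefix in {filename!r}")
    yy = int(prefix)
    return (1900 if yy >= pivot else 2000) + yy


names = [
    "00_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP",
    "84_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP",
    "06_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP",
    "65_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP",
]
chronological = sorted(names, key=publication_year)
print([publication_year(n) for n in chronological])  # [1965, 1984, 2000, 2006]
```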