NP Chunk-based Candidate Terms (1) | ACL RD-TEC

Candidate Terms Extracted Using NP-Chunking (1)

This directory contains candidate terms extracted from NP chunks. Sentences in the corpus are chunked using the Apache OpenNLP chunker (release version 1.5.2, http://opennlp.apache.org/). All noun phrases (NP chunks) of maximum length 5 (after removing determiners and stop words) are then considered candidate terms. For the candidate terms listed here, the corpus frequency is computed independently of the NP chunk boundaries.
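To illustrate what counting "independently of the NP chunk boundaries" can mean, here is a minimal sketch in Python. It is not the code used to build this resource: the function name, the whitespace-tokenized toy sentences and the exact-match comparison are all assumptions made for illustration. Every contiguous token window that equals a candidate term is counted, whether or not the chunker marked that window as a chunk.

```python
from collections import Counter

def count_term_frequencies(sentences, candidate_terms, max_len=5):
    """Count each candidate term wherever it occurs as a contiguous token window.

    `sentences` is an iterable of token lists and `candidate_terms` a set of
    token tuples (both hypothetical inputs). Chunk boundaries play no role.
    """
    counts = Counter()
    for tokens in sentences:
        for start in range(len(tokens)):
            for length in range(1, max_len + 1):
                window = tuple(tokens[start:start + length])
                if len(window) < length:
                    break  # the window ran past the end of the sentence
                if window in candidate_terms:
                    counts[window] += 1
    return counts

# Toy data, invented for this example.
sentences = [
    ["the", "language", "model", "uses", "a", "language", "model", "score"],
    ["statistical", "language", "model"],
]
terms = {("language", "model"), ("model",)}
freq = count_term_frequencies(sentences, terms)
```

In this toy example "language model" is counted inside the longer phrases "language model score" and "statistical language model" as well, which is the effect of ignoring chunk boundaries.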

The structure of files and folders is similar to candid_term/pos_based/. However, the candidate terms listed in the file "_all_candid_term_by_np_chunk_1.zip" are also marked with the CHUNK_IDs of the NP chunks from which the terms were extracted. The CHUNK_IDs are listed after PARAGRAPH_FREQ, as described below.
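A record of this file could be read roughly as follows. This is a sketch under stated assumptions, not a documented format: it assumes that fields are tab separated, that they appear in the order given in the listing below, that all of them are integers, and that every field after the sixth is a CHUNK_ID. If the actual file carries further columns (for example the term string), the positions need adjusting. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TermRecord:
    term_id: int
    string_length: int
    corpus_freq: int
    document_freq: int
    section_freq: int
    paragraph_freq: int
    chunk_ids: List[int]  # assumed: zero or more trailing CHUNK_IDs

def parse_term_record(line: str, sep: str = "\t") -> TermRecord:
    """Parse one line, assuming integer fields separated by `sep`."""
    fields = [int(value) for value in line.strip().split(sep)]
    if len(fields) < 6:
        raise ValueError(f"expected at least 6 fields, got {len(fields)}")
    return TermRecord(*fields[:6], chunk_ids=fields[6:])

# Invented sample line, for illustration only.
record = parse_term_record("12\t2\t40\t7\t9\t15\t1001\t1002")
```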


Index of: CANDIDATE_TERM/


Size: Name: Description:
25.787.427   _ALL_CANDID_TERM_BY_NP_CHUNK_1.ZIP
This file contains all the extracted candidate terms. Each line of the file gives the following information:
  • TERM_ID: a universal integer id assigned to the candidate term (note that if a term appears in other lists of extracted candidate terms, e.g. the pos-based extracted candidate terms, its assigned integer id is the same across these lists)
  • STRING LENGTH: length of the candidate term
  • CORPUS_FREQ: the number of occurrences of the candidate term in the segmented pre-processed corpus (i.e. SEPID_CORPUS), in other words the term frequency (tf). As stated above, the boundaries of NP chunks are ignored when collecting term frequencies.
  • DOCUMENT_FREQ: the number of documents in which the candidate term occurs, i.e. the document frequency of the term, which can be used for calculating the inverse document frequency.
  • SECTION_FREQ: the number of sections in which the term occurs.
  • PARAGRAPH_FREQ: the number of paragraphs in which the term occurs.
  • CHUNK_ID: the integer id of the originating NP chunk in the SEPID_CORPUS.
33.777.930   _ALL_CANDID_TERM_BY_NP_CHUNK_1_DOCUMENT_INDEX.ZIP
An inverted index file that maps terms to documents in the corpus. Each line of the file shows a single occurrence of a term in the form of TERM_ID followed by DOCUMENT_ID (tab separated). Note that DOCUMENT_ID is the integer id assigned to each document in the SEPID_CORPUS.
40.835.147   _ALL_CANDID_TERM_BY_NP_CHUNK_1_SECTION_INDEX.ZIP
As above, but for sections: an inverted index file that maps terms to sections in the corpus. Each line of the file shows a single occurrence of a term in the form of TERM_ID followed by SECTION_ID (tab separated).
51.301.685   _ALL_CANDID_TERM_BY_NP_CHUNK_1_PARAGRAPH_INDEX.ZIP
As above, but for paragraphs, i.e. TERM_ID followed by the PARAGRAPH_ID from the SEPID_CORPUS (tab separated).
125.578.997   _ALL_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
As above, but for sentences. Each line gives TERM_ID followed by SENTENCE_ID, followed by the START and END positions of the term. START and END are token numbers in the sentence.
236   README.TXT
A note on collecting frequencies.
<DIR>   CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX/
The (candidate-term-id, sentence-id) indices (i.e. those in _all_candid_term_by_np_chunk_1_sentence_index.zip) are grouped by the year of publication of the source documents. The first two characters of each filename give the year of publication. For instance, the file "84_candid_term_by_np_chunk_1_sentence_index.zip" contains all sentence-to-term-id mappings for sentences from publications of the year 1984. These files, together with the additional index files provided in SEPID_CORPUS, can be used to organize candidate terms in chronological order. There are currently 34 files, representing publications from 65 (i.e. 1965) to 06 (i.e. 2006).

Directory contains 277.281.422 Bytes in 6 Files
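
The document index described above can be used to recompute document frequencies and an inverse document frequency. The following Python sketch assumes the stated layout (TERM_ID, a tab, DOCUMENT_ID per line) and that the same pair may appear on several lines. The function names are hypothetical, and the plain log(N / df) weighting is one common choice rather than something prescribed by this resource. The total number of documents N must be supplied by the caller.

```python
import math
from collections import defaultdict

def read_term_document_index(lines):
    """Map each TERM_ID to the set of DOCUMENT_IDs it occurs in."""
    postings = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        term_id, document_id = line.split("\t")[:2]
        postings[int(term_id)].add(int(document_id))
    return postings

def inverse_document_frequency(postings, total_documents):
    """Plain idf = log(N / df) for every term in the index."""
    return {
        term_id: math.log(total_documents / len(documents))
        for term_id, documents in postings.items()
    }

# Invented sample lines; a real run would iterate over the unzipped index file.
sample = ["1\t10", "1\t10", "1\t11", "2\t12"]
postings = read_term_document_index(sample)
idf = inverse_document_frequency(postings, total_documents=3)
```

Using a set per term means repeated (term, document) lines do not inflate the document frequency.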

Index of: CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX/


Size: Name: Description:
6.869.371   00_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 2000.
3.900.547   01_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 2001.
5.641.808   02_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 2002.
6.517.638   03_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 2003.
12.071.310   04_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 2004.
6.864.500   05_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 2005.
13.309.338   06_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 2006.
271.362   65_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1965.
218.459   67_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1967.
532.014   69_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1969.
355.245   73_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1973.
450.201   75_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1975.
585.869   78_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1978.
2.260.052   79_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1979.
1.512.396   80_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1980.
643.966   81_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1981.
1.309.543   82_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1982.
1.327.300   83_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1983.
1.353.830   84_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1984.
1.390.042   85_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1985.
2.661.252   86_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1986.
1.793.222   87_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1987.
3.528.650   88_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1988.
2.610.335   89_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1989.
3.933.601   90_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1990.
3.567.428   91_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1991.
5.468.993   92_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1992.
4.606.487   93_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1993.
5.943.821   94_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1994.
2.537.697   95_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1995.
5.079.275   96_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1996.
5.400.770   97_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1997.
6.998.228   98_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1998.
3.872.754   99_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1999.

Directory contains 125.387.304 Bytes in 34 Files
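
Because the listing above sorts by the two-digit prefix, the files for 2000 to 2006 come before those for 1965 to 1999. A small Python sketch for putting the files in chronological order and reading a sentence-index line is given below. It rests on assumptions: the pivot value 50 used to split the centuries is an arbitrary choice that happens to fit the prefixes listed here, the per-year files are assumed to use the same TERM_ID, SENTENCE_ID, START, END layout as the combined sentence index, the separator is assumed to be a tab, and all function names are hypothetical.

```python
import re

def publication_year(filename: str, pivot: int = 50) -> int:
    """Turn a two-digit filename prefix such as '84_' into a four-digit year."""
    match = re.match(r"(\d{2})_", filename)
    if match is None:
        raise ValueError(f"no two-digit year prefix in {filename!r}")
    yy = int(match.group(1))
    # Assumption: prefixes below the pivot belong to the 2000s.
    return (2000 if yy < pivot else 1900) + yy

def chronological(filenames):
    """Sort filenames by publication year instead of by string order."""
    return sorted(filenames, key=publication_year)

def parse_sentence_index_line(line: str, sep: str = "\t"):
    """Return (term_id, sentence_id, start, end); SENTENCE_ID is kept as text."""
    term_id, sentence_id, start, end = line.rstrip("\n").split(sep)[:4]
    return int(term_id), sentence_id, int(start), int(end)

names = [
    "00_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP",
    "99_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP",
    "65_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP",
]
ordered = chronological(names)
```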

Total: 402.668.726 Bytes in 40 Files

This page last edited on 21 April 2017.
