NP Chunk-based Candidate Terms (2) | ACL RD-TEC

Candidate Terms Extracted Using NP-Chunking (2)

The candidate terms are extracted with the same technique as in "np_chunk_based_1", and the files have an identical structure. However, for the candidate terms listed in this folder, the search for occurrences of a candidate term is limited by the boundaries of the NP chunks in which it appears. That is, instead of a simple search for candidate term strings, the NP chunks from which the terms are derived are searched.
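The difference between the two counting strategies can be illustrated with a minimal sketch. It assumes one possible reading of the description above, namely that an occurrence counts only if it lies entirely inside a single NP chunk. The function names, the example sentence and the chunk spans are hypothetical and not part of the dataset.

```python
def find_matches(tokens, term):
    """Return the start offsets where the token list `term` occurs in `tokens`."""
    n = len(term)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == term]


def count_plain(tokens, term):
    """Simple string search: every match counts, chunk boundaries are ignored."""
    return len(find_matches(tokens, term))


def count_within_chunks(tokens, chunk_spans, term):
    """Count only matches that lie entirely inside one NP chunk.

    `chunk_spans` holds (start, end) token offsets, end exclusive.
    This is an assumed interpretation, not the dataset's documented procedure.
    """
    n = len(term)
    return sum(
        any(start <= i and i + n <= end for start, end in chunk_spans)
        for i in find_matches(tokens, term)
    )


# Hypothetical sentence, chunked as [we] [the parser] [input sentences].
tokens = "we give the parser input sentences".split()
chunk_spans = [(0, 1), (2, 4), (4, 6)]
term = ["parser", "input"]

plain = count_plain(tokens, term)
bounded = count_within_chunks(tokens, chunk_spans, term)
print(plain, bounded)
```

In this example the string "parser input" straddles two NP chunks, so the plain search finds it while the chunk-bounded count does not.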

As stated for "np_chunk_based_1", sentences in the corpus are chunked using the Apache OpenNLP chunker (release version 1.5.2, http://opennlp.apache.org/). All noun phrases (NP chunks) of maximum length 5 (after removing determiners and stop words) are considered candidate terms.
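The length filter described above might look roughly like the following sketch. It assumes the NP chunks have already been produced by a chunker and are available as token lists. The determiner and stop word sets are illustrative placeholders, since the lists actually used for the dataset are not given here, and the function name is hypothetical.

```python
# Illustrative placeholders; the real lists used for the dataset are not specified.
DETERMINERS = {"a", "an", "the", "this", "that", "these", "those"}
STOP_WORDS = {"of", "such", "other"}
MAX_LENGTH = 5  # maximum number of tokens, as stated in the text


def chunk_to_candidate(chunk_tokens):
    """Return a candidate term string for an NP chunk, or None if it is rejected.

    Determiners and stop words are dropped first; the remaining tokens must
    number between 1 and MAX_LENGTH.
    """
    removed = DETERMINERS | STOP_WORDS
    kept = [token for token in chunk_tokens if token.lower() not in removed]
    if 1 <= len(kept) <= MAX_LENGTH:
        return " ".join(kept)
    return None


print(chunk_to_candidate(["the", "statistical", "machine", "translation"]))
```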


Index of: CANDIDATE_TERM/


Size: Name: Description:
24.726.493   _ALL_CANDID_TERM_BY_NP_CHUNK_2.ZIP
This file contains all the extracted candidate terms. Each line of the file gives the following information:
  • TERM_ID: a universal integer id assigned to the candidate term (note that if a term also appears in other lists of extracted candidate terms, e.g. the pos-based extracted candidate terms, its assigned integer id is the same across these lists)
  • STRING LENGTH: length of candidate term
  • CORPUS_FREQ: the number of occurrences of the candidate term in the segmented pre-processed corpus (i.e. SEPID_CORPUS), in other words the term frequency (tf). As stated above, occurrences are counted only within the boundaries of NP chunks when collecting term frequencies.
  • DOCUMENT_FREQ: the number of documents in which the candidate term occurs, i.e. the document frequency of the term, which can be used for calculating the inverse document frequency.
  • SECTION_FREQ: the number of sections in which the term occurs.
  • PARAGRAPH_FREQ: the number of paragraphs in which the term occurs.
  • CHUNK_ID: the integer id of the originating NP chunk in the SEPID_CORPUS.
8.650.790   _ALL_CANDID_TERM_BY_NP_CHUNK_2_DOCUMENT_INDEX.ZIP
An inverted index file that maps terms to documents in the corpus. Each line of the file shows a single occurrence of a term in the form of TERM_ID followed by DOCUMENT_ID (tab separated). Please note DOCUMENT_ID corresponds to an integer id that is assigned to each document in the SEPID_CORPUS.
9.203.508   _ALL_CANDID_TERM_BY_NP_CHUNK_2_SECTION_INDEX.ZIP
Similar to the above, but for sections: an inverted index file that maps terms to sections in the corpus. Each line of the file shows a single occurrence of a term in the form of TERM_ID followed by SECTION_ID (tab separated).
10.181.828   _ALL_CANDID_TERM_BY_NP_CHUNK_2_PARAGRAPH_INDEX.ZIP
Similar to the above, but for paragraphs, i.e. TERM_ID followed by PARAGRAPH_ID (tab separated), where PARAGRAPH_ID is the paragraph id from SEPID_CORPUS.
15.869.711   _ALL_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP
Similar to the above, but for sentences. Each line of the file gives TERM_ID followed by SENTENCE_ID, followed by the START and END positions of the term. START and END are token numbers in the sentence.
542   README.TXT A note on collecting frequencies.
<DIR>   CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX/
The (candidate-term-id, sentence-id) indices (i.e. those in _ALL_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP) are grouped by the year of publication of the source documents. The first two characters of each filename show the year of publication. For instance, the file "84_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP" contains all term-id to sentence mappings for sentences from publications in the year 1984. These files, together with the additional index files provided in SEPID_CORPUS, can be used to organize candidate terms in chronological order. There are currently 34 files, representing publications from 65 (i.e. 1965) to 06 (i.e. 2006).

Directory contains 68.632.872 Bytes in 6 Files
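
As an illustration of how the document index described above could be used, the following sketch reads tab-separated TERM_ID and DOCUMENT_ID pairs, collects the set of documents per term, and derives an inverse document frequency. The function names are hypothetical, the sample lines are made up, and the total number of documents is an assumed parameter that must be taken from the corpus itself. The idf formula used here, log(N / df), is one common variant and not necessarily the one intended by the dataset authors.

```python
import io
import math
from collections import defaultdict


def read_document_index(lines):
    """Map each TERM_ID to the set of DOCUMENT_IDs in which it occurs.

    Each input line is expected to be 'TERM_ID<TAB>DOCUMENT_ID'; a term may be
    listed several times for the same document (one line per occurrence).
    """
    postings = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        term_id, document_id = line.split("\t")
        postings[int(term_id)].add(int(document_id))
    return postings


def inverse_document_frequency(postings, total_documents):
    """Compute log(N / df) per term; `total_documents` (N) is supplied by the caller."""
    return {
        term_id: math.log(total_documents / len(documents))
        for term_id, documents in postings.items()
    }


# Made-up sample standing in for an unzipped index file.
sample = io.StringIO("7\t1\n7\t1\n7\t3\n9\t2\n")
postings = read_document_index(sample)
idf = inverse_document_frequency(postings, total_documents=4)
print(sorted(postings[7]), idf[7])
```

The same reader can be pointed at the section and paragraph indices, since they are described as having the same two-column layout.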

Index of: CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX/



Size: Name: Description:
880.857   00_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP Term-Sentence indices from articles published in year 2000.
466.720   01_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP Term-Sentence indices from articles published in year 2001.
677.691   02_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP Term-Sentence indices from articles published in year 2002.
774.579   03_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP Term-Sentence indices from articles published in year 2003.
1.417.449   04_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP Term-Sentence indices from articles published in year 2004.
799.038   05_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP Term-Sentence indices from articles published in year 2005.
1.560.183   06_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP Term-Sentence indices from articles published in year 2006.
35.371   65_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP Term-Sentence indices from articles published in year 1965.
42.235   67_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP Term-Sentence indices from articles published in year 1967.
79.274   69_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP Term-Sentence indices from articles published in year 1969.
62.929   73_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP Term-Sentence indices from articles published in year 1973.
59.103   75_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP Term-Sentence indices from articles published in year 1975.
74.225   78_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP Term-Sentence indices from articles published in year 1978.
406.722   79_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP Term-Sentence indices from articles published in year 1979.
207.656   80_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP Term-Sentence indices from articles published in year 1980.
85.977   81_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP Term-Sentence indices from articles published in year 1981.
182.388   82_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP Term-Sentence indices from articles published in year 1982.
180.000   83_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP Term-Sentence indices from articles published in year 1983.
187.816   84_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP Term-Sentence indices from articles published in year 1984.
198.365   85_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP Term-Sentence indices from articles published in year 1985.
355.766   86_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP Term-Sentence indices from articles published in year 1986.
224.923   87_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP Term-Sentence indices from articles published in year 1987.
471.667   88_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP Term-Sentence indices from articles published in year 1988.
319.428   89_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP Term-Sentence indices from articles published in year 1989.
497.403   90_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP Term-Sentence indices from articles published in year 1990.
433.317   91_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP Term-Sentence indices from articles published in year 1991.
732.359   92_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP Term-Sentence indices from articles published in year 1992.
558.362   93_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP Term-Sentence indices from articles published in year 1993.
835.715   94_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP Term-Sentence indices from articles published in year 1994.
299.990   95_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP Term-Sentence indices from articles published in year 1995.
677.865   96_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP Term-Sentence indices from articles published in year 1996.
645.948   97_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP Term-Sentence indices from articles published in year 1997.
848.718   98_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP Term-Sentence indices from articles published in year 1998.
458.312   99_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP Term-Sentence indices from articles published in year 1999.

Directory contains 15.738.351 Bytes in 34 Files
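
To arrange the per-year files chronologically, the two-digit filename prefix has to be expanded to a full year. The sketch below assumes, based only on the listing above, that prefixes 65 to 99 belong to the 1900s and all smaller prefixes to the 2000s. It also parses one sentence-index line, assuming whitespace-separated fields in the order TERM_ID, SENTENCE_ID, START, END; the delimiter and the treatment of SENTENCE_ID as an opaque string are assumptions. The names `Occurrence`, `publication_year` and `parse_occurrence` are hypothetical.

```python
from typing import NamedTuple


class Occurrence(NamedTuple):
    term_id: int
    sentence_id: str  # kept as a string; its exact format is not assumed
    start: int        # token number where the term starts
    end: int          # token number where the term ends


def publication_year(filename):
    """Expand the two-digit prefix of a per-year index file to a four-digit year.

    Assumes prefixes 65..99 mean 1965..1999 and smaller prefixes mean 20xx.
    """
    prefix = int(filename[:2])
    return 1900 + prefix if prefix >= 65 else 2000 + prefix


def parse_occurrence(line):
    """Parse one line of a sentence index into an Occurrence."""
    term_id, sentence_id, start, end = line.split()
    return Occurrence(int(term_id), sentence_id, int(start), int(end))


names = [
    "00_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP",
    "84_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP",
    "65_CANDID_TERM_BY_NP_CHUNK_2_SENTENCE_INDEX.ZIP",
]
chronological = sorted(names, key=publication_year)
print([publication_year(name) for name in chronological])
```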

Total: 84.371.223 Bytes in 40 Files

This page last edited on 21 April 2017.
