Text Mining Practical
The analysis of natural language, including the major themes of natural language understanding, information retrieval, information extraction and text classification, has been one of the mainstream research areas in computational linguistics and artificial intelligence. The text mining course reviews basic concepts and major algorithms in natural language processing (NLP) and text analytics.
This course is based on the book Natural Language Processing with Python by Steven Bird, Ewan Klein, and Edward Loper. Following the style of the book, the aim is to learn both Python and natural language processing techniques in one go. The goal is to give participants the essential knowledge and tools to discover and extract useful information from unstructured text and to address a range of real-world applications, in a hands-on fashion, using the Natural Language Toolkit (NLTK) for Python.
In the first 7 weeks of the semester, participants will learn the basics of NLP and Python programming, with a strong focus on NLTK. Thereafter, students are offered project titles in which they can apply the knowledge gained in those first weeks to real-world applications and compare their work with state-of-the-art algorithms.
Slides for this course can be downloaded here:
- Session 1: Introductory Session
- Session 2: Setting Up NLTK
- Session 3: Processing Raw Text
- Session 4: Writing Structured Programs
- Session 5: Python's Dictionary Data Type
- Session 6: Introduction to Part-of-Speech Tagging
- Session 7: Methods for Automatic Part-of-Speech Tagging
- Session 8: Text Classification (1)
- Session 9: Text Classification (2)
- Session 10: More on Maximum Entropy and Classification (slides from Mark Johnson's course)
- Session 11: Information Extraction (1)
- Session 12: Information Extraction (2)
Some project ideas, and more information about the assessment procedure as well as the structure of the final report can be found in the following document(s):