Automatic generation of sets of keywords for theme characterization and detection
The paper describes a system that automatically detects themes in a textual corpus and characterizes them by sets of keywords, that is, words whose co-occurrence in a paragraph indicates that the paragraph addresses a certain theme. Pichon and Sébillot (2000) present a first version of the system, in which those sets are obtained with the CHAVL hierarchical clustering algorithm by grouping words that have a similar distribution over paragraphs. The weaknesses of that version (class quality highly dependent on manual parameter settings, and relevant classes in the classification tree difficult to identify automatically) are largely reduced here by using a combined classification of the paragraphs based on their lexical cohesion. This new classification first makes it possible to densify the processed data, thus helping CHAVL produce more satisfactory classes; it also provides a means of establishing an original statistical quality measure that can be exploited both to identify the relevant classes in the tree and to reorganize some of the merges proposed by CHAVL.
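To make the idea of grouping words with a similar distribution over paragraphs concrete, here is a minimal Python sketch. It is not the CHAVL algorithm: plain average-linkage agglomerative clustering with cosine similarity stands in for it. The function names (`word_profiles`, `cosine`, `cluster_words`), the whitespace tokenization, and the 0.8 stopping threshold are all assumptions made for illustration.

```python
import math
from itertools import combinations


def word_profiles(paragraphs):
    """Map each word to its vector of occurrence counts over the paragraphs."""
    tokenized = [p.lower().split() for p in paragraphs]  # naive tokenization (assumption)
    vocab = sorted({w for tokens in tokenized for w in tokens})
    return {w: [tokens.count(w) for tokens in tokenized] for w in vocab}


def cosine(u, v):
    """Cosine similarity between two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster_words(profiles, threshold=0.8):
    """Average-linkage agglomerative clustering of words.

    Stand-in for CHAVL: repeatedly merges the two most similar clusters
    and stops once the best similarity falls below `threshold`
    (an arbitrary, hypothetical cut-off).
    """
    clusters = [[w] for w in profiles]

    def similarity(c1, c2):
        total = sum(cosine(profiles[a], profiles[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda ij: similarity(clusters[ij[0]], clusters[ij[1]]),
        )
        if similarity(clusters[i], clusters[j]) < threshold:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [sorted(c) for c in clusters]


# Toy corpus: two paragraphs about banking, two about rivers.
paragraphs = [
    "bank loan interest bank",
    "loan bank interest",
    "river fish water",
    "water river fish fish",
]
keyword_sets = cluster_words(word_profiles(paragraphs))
```

On this toy corpus the words fall into one banking group and one river group, since words from different groups never share a paragraph. The manual threshold in the sketch mirrors the kind of parameter dependence the abstract lists as a weakness of the first version.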