Glossary extraction and utilization in the information search and delivery system for IBM Technical Support (19 Sep 2004)
The same concept may appear in text in several variant forms, such as misspellings or abbreviations. We attempt to identify all conceptually identical expressions of a candidate glossary item and aggregate them into a single glossary item, so that applications can treat them as one.
GlossEx currently identifies and aggregates inflectional variants, orthographic variants, compounding variants, misspellings, and abbreviations. We select one of the forms as the canonical form and make the other forms its variants. The aggregation step also combines the frequencies of the different forms so that glossary items with many variant occurrences may be assigned higher confidence values.
Inflectional variants: singular-plural forms and different tenses (human-performance criterion and human performance-criteria)
Orthographic variants: glossary items that differ in special characters such as hyphens, dashes, or slashes (audio/visual input and audio-visual input)
Compounding variants: compounding form and lexicalized form (passenger airbag and passenger air bag)
Misspelling variants: correct spelling and misspelling or alternative spelling (accelarator and accelerator, nitroglycerine and nitroglycerin)
Abbreviations: abbreviated form and full form (R1H and radial first harmonic)
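The aggregation idea can be illustrated with a minimal Python sketch. It is not GlossEx's implementation: the function names `variant_key` and `aggregate`, the frequency counts, and the rule that the most frequent form becomes canonical are all assumptions made for illustration. The sketch covers only orthographic and compounding variants, by collapsing hyphens, slashes, and spaces into a shared key. Inflectional variants, misspellings, and abbreviations would need additional matching logic.

```python
import re
from collections import Counter, defaultdict


def variant_key(term):
    """Map a surface form to a key shared by its orthographic and
    compounding variants (hypothetical normalization rule)."""
    # Lowercase, then drop hyphens, slashes, and whitespace.
    return re.sub(r"[-/\s]+", "", term.lower())


def aggregate(counts):
    """Group surface forms by key and combine their frequencies."""
    groups = defaultdict(Counter)
    for form, freq in counts.items():
        groups[variant_key(form)][form] += freq

    glossary = {}
    for forms in groups.values():
        # Assumption: the most frequent form is canonical;
        # ties are broken alphabetically.
        canonical = min(forms, key=lambda f: (-forms[f], f))
        glossary[canonical] = {
            "variants": sorted(f for f in forms if f != canonical),
            "frequency": sum(forms.values()),
        }
    return glossary


# Illustrative counts only; these numbers are not from the article.
observed = {
    "passenger airbag": 7,
    "passenger air bag": 3,
    "audio/visual input": 2,
    "audio-visual input": 5,
}
glossary = aggregate(observed)
```

With these made-up counts, "passenger air bag" becomes a variant of "passenger airbag" with a combined frequency of 10, which is the kind of pooled count that could feed a confidence value.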
Note that GlossEx currently does not perform deep semantic processing, and thus it can neither identify synonyms nor handle polysemous glossary items. Instead, we provide a GUI (graphical user interface) tool that lets users manually add or aggregate synonymous glossary items for their applications.
Article URL: http://www.research.ibm.com/journal/sj/433/kozakov.html