Extracting People’s Names From RSS Feeds Using WordNet
Price
Free (open access)
Volume
33
Pages
10
Published
2004
Size
240 kb
Paper DOI
10.2495/DATA040041
Copyright
WIT Press
Author(s)
K. Durant & M. Smith
Abstract
Proper names are rich in content, and identifying them successfully is essential to providing an accurate representation of the news. Identifying people’s names is essential for representing an article and the event of its story, for identifying the actors and receivers of action in the story, and for identifying trends within the news. Because names play such an important role, a name extraction algorithm must be accurate and precise. It must also be efficient, because of the size of news corpora and the timeliness of the data. Typical name extraction systems use supervised learning to identify names. We take a different approach to name identification. We use WordNet to identify the popular names within the corpora. The algorithm tracks the words that WordNet does not identify and uses standard templates to find potential names among them. We also address the problem of names that are common words found within WordNet. We have created four gazetteers of words that may be a first name, last name, title or suffix of a name. We use these lists, along with the surrounding text and simple templates, to identify names. The algorithm can run at the same time as the words are mapped to the terms within the corpora. Our corpora are RSS news feeds, in particular the item element, which is a semantic representation of a news article. We identify the names found within the title and description elements of an RSS item. We exploit the fact that the title element and the description element of a news article share the same topic. Our algorithm has achieved a recall rate of 96% and a precision rate of 91%. We believe our approach performs well on this corpus because of the simple vocabulary and succinct writing of newspaper headlines and lede statements.
K. Durant & M. Smith, Department of Engineering and Applied Sciences, Harvard University, USA
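The approach summarized in the abstract (a lexicon lookup, four gazetteers, and simple templates) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The small `KNOWN_WORDS` set is a hypothetical stand-in for a WordNet lookup, the gazetteer contents are invented for illustration, and the single template used here (optional title, optional first name, one or more last-name or unidentified capitalized words, optional suffix) is an assumption, as is the way `extract_from_item` reuses names from the description to recognize lone surnames in the title.

```python
# Hypothetical stand-in for a WordNet lookup: words the lexicon "knows".
KNOWN_WORDS = {"the", "met", "with", "in", "on", "visits", "said",
               "london", "tuesday", "bush", "president"}

# Hypothetical contents for the four gazetteers named in the abstract.
TITLES = {"mr.", "mrs.", "dr.", "sen.", "president"}
FIRST_NAMES = {"george", "tony", "john"}
LAST_NAMES = {"bush", "blair"}
SUFFIXES = {"jr.", "sr.", "iii"}


def clean(raw):
    """Strip surrounding punctuation, keeping the period of 'Dr.' or 'Jr.'."""
    tok = raw.strip(",;:!?\"'()")
    if tok.lower() not in TITLES | SUFFIXES:
        tok = tok.rstrip(".")
    return tok


def tag(token):
    """Assign a coarse tag using capitalization, gazetteers and the lexicon."""
    if not token[:1].isupper():
        return "OTHER"
    low = token.lower()
    if low in TITLES:
        return "TITLE"
    if low in SUFFIXES:
        return "SUFFIX"
    if low in FIRST_NAMES:
        return "FIRST"
    if low in LAST_NAMES:      # gazetteer wins even if the lexicon knows the word
        return "LAST"
    if low not in KNOWN_WORDS:  # capitalized word the lexicon cannot identify
        return "UNKNOWN"
    return "OTHER"


def extract_names(text):
    """Greedy scan for: [TITLE] [FIRST] (LAST|UNKNOWN)* [SUFFIX]."""
    tokens = [clean(raw) for raw in text.split()]
    tags = [tag(tok) for tok in tokens]
    names, i, n = [], 0, len(tokens)
    while i < n:
        j = i
        has_title = tags[j] == "TITLE"
        if has_title:
            j += 1
        has_first = j < n and tags[j] == "FIRST"
        if has_first:
            j += 1
        k = j
        while k < n and tags[k] in ("LAST", "UNKNOWN"):
            k += 1
        surnames = k - j
        # Require supporting evidence: a title or a first name next to the core.
        if (has_title and (has_first or surnames)) or (has_first and surnames):
            if k < n and tags[k] == "SUFFIX":
                k += 1
            names.append(" ".join(tokens[i:k]))
            i = k
        else:
            i += 1
    return names


def extract_from_item(title, description):
    """Use names found in either element to spot lone surnames in the title."""
    names = extract_names(description)
    for name in extract_names(title):
        if name not in names:
            names.append(name)
    parts = {p.lower() for name in names for p in name.split()}
    mentions = [t for t in map(clean, title.split())
                if tag(t) in ("LAST", "UNKNOWN") and t.lower() in parts]
    return names, mentions
```

In this sketch the last-name gazetteer is checked before the lexicon, which is one simple way to handle a name such as "Bush" that is also a common word. A real system would replace `KNOWN_WORDS` with an actual WordNet query and would need far larger gazetteers and more templates.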
Keywords
proper nouns, WordNet, RSS, gazetteer, template description