By Sean Wang
Google search is increasingly capable of answering natural-sounding questions, Apple’s Siri is able to understand a wide variety of questions, and more and more companies are using (reasonably) intelligent chat and phone bots to communicate with customers. But how does this seemingly smart software really work?
This article walks you through building a news relevance analyzer. Imagine you have a stock portfolio, and you would like an app to automatically crawl popular news websites and identify articles that are relevant to your portfolio. For example, if your stock portfolio includes companies like Microsoft, BlackStone, and Luxottica, you would want to see articles that mention these three companies.
Getting Started with the Stanford NLP Library
To begin, we’ll create a new Java project (you can use your favorite IDE) and add the Stanford NLP library to the list of dependencies. If you are using Maven, simply add it to your
Scraping and Cleaning Articles
The Boilerpipe library comes with built-in support for scraping web pages. It can fetch the HTML from the web, extract text from the HTML, and clean the extracted text. You can define a function, extractFromURL, that takes a URL and uses Boilerpipe to return the most relevant text as a string, using ArticleExtractor for this task:
Add the following code to your main function:
Tagging Parts of Speech
Here is a simple implementation:
Processing the Tagged Output into a Set
Below is the function that implements the splitting and storing of proper nouns. Place this function in your
Now the function should return a set with the individual proper nouns and the consecutive proper nouns (i.e., joined by spaces). If you print propNounSet, you should see something like the following:
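As a rough illustration of the splitting step described in this section, here is a minimal sketch in plain Java. It assumes the tagger output is a string of whitespace-separated tokens in word_TAG form and that proper nouns carry the Penn Treebank tags NNP or NNPS. The class name ProperNounCollector and the method name collectProperNouns are hypothetical and not taken from the original listing.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper: collects proper nouns from a tagged string.
class ProperNounCollector {

    // Assumption: tokens look like "Microsoft_NNP"; the tag follows the last underscore.
    static Set<String> collectProperNouns(String tagged) {
        Set<String> propNounSet = new LinkedHashSet<>();
        StringBuilder run = new StringBuilder(); // current run of consecutive proper nouns

        for (String token : tagged.trim().split("\\s+")) {
            int sep = token.lastIndexOf('_');
            String word = sep > 0 ? token.substring(0, sep) : "";
            String tag = sep > 0 ? token.substring(sep + 1) : "";

            if (tag.equals("NNP") || tag.equals("NNPS")) {
                propNounSet.add(word);            // store the individual proper noun
                if (run.length() > 0) {
                    run.append(' ');
                }
                run.append(word);                 // extend the consecutive run
            } else if (run.length() > 0) {
                propNounSet.add(run.toString());  // store the run, joined by spaces
                run.setLength(0);
            }
        }
        if (run.length() > 0) {
            propNounSet.add(run.toString());      // flush a run that ends the text
        }
        return propNounSet;
    }
}
```

With this sketch, a tagged phrase such as Luxottica_NNP Group_NNP contributes three entries: Luxottica, Group, and Luxottica Group. A single proper noun on its own contributes just one entry, because the set discards the duplicate.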
Comparing the Portfolio against the PropNouns Set
The implementation is very simple. Add the following code in your
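One way the comparison could look is sketched below, assuming the portfolio is held as a set of company names and propNounSet is the set built in the previous section. The class PortfolioMatcher and its methods are hypothetical names, and the case-insensitive matching is an assumption made for this sketch rather than something the article specifies.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: checks which portfolio companies an article mentions.
class PortfolioMatcher {

    // Returns the portfolio entries that also appear in the proper-noun set.
    // Assumption: names are compared case-insensitively.
    static Set<String> matches(Set<String> portfolio, Set<String> propNounSet) {
        Set<String> lowered = new HashSet<>();
        for (String noun : propNounSet) {
            lowered.add(noun.toLowerCase(Locale.ROOT));
        }
        Set<String> hits = new TreeSet<>(); // sorted for stable output
        for (String company : portfolio) {
            if (lowered.contains(company.toLowerCase(Locale.ROOT))) {
                hits.add(company);
            }
        }
        return hits;
    }

    // An article counts as relevant if it mentions at least one portfolio company.
    static boolean isRelevant(Set<String> portfolio, Set<String> propNounSet) {
        return !matches(portfolio, propNounSet).isEmpty();
    }
}
```

Exact set membership keeps the check cheap, but it only finds companies whose names appear in the article exactly as they are written in the portfolio, which is why the consecutive proper nouns joined by spaces are stored alongside the individual ones.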
Building an NLP App Doesn't Need to be Hard