Text Mining

Earlier projects (2002-2010) in Natural Language Understanding and Text Mining.

Topics: Text Clustering, Product Taxonomy, Entity-Relation Mining, Lexical Semantics, Recommendation Systems, Sentiment Analysis, Email Prioritization, Search Engine

Master’s Thesis at University of Minnesota

Implemented and tested statistical machine-learning models for word sense disambiguation, which use contextual features to identify the correct meaning of ambiguous words such as ‘amazon’, ‘jaguar’, ‘apple’, and ‘shell’. For example, when terms like ‘tropical rainforests’, ‘jungle’, ‘deforestation’, ‘trees’, ‘tribes’, ‘climate change’, or ‘vegetation’ appear around the word ‘amazon’ in a given text, it likely refers to the Amazon rainforest rather than the e-commerce website Amazon.com. Similarly, when n-grams like “gas station”, “petrol pump”, “fuel charges”, “crude oil”, or “renewable energy sources” appear near the ambiguous word “shell”, it likely refers to the Shell oil and gas company rather than seashells.
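The idea can be illustrated with a minimal sketch. It assumes a Naive Bayes classifier over bag-of-words context features with add-one smoothing, which is only one of several possible statistical models and not necessarily the one used in the thesis. The training sentences, sense labels, and function names (`train`, `disambiguate`) are made up for illustration.

```python
import math
from collections import Counter, defaultdict

# Toy labeled contexts for the ambiguous word "amazon" (illustrative data only).
TRAINING = [
    ("rainforest", "tropical jungle deforestation trees tribes vegetation"),
    ("rainforest", "climate change threatens trees in the jungle"),
    ("company", "online shopping order delivery website prime"),
    ("company", "e-commerce website sells books and electronics online"),
]

def train(examples):
    """Count context words per sense and how often each sense occurs."""
    word_counts = defaultdict(Counter)
    sense_counts = Counter()
    for sense, context in examples:
        sense_counts[sense] += 1
        word_counts[sense].update(context.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, sense_counts, vocab

def disambiguate(context, model):
    """Return the sense with the highest smoothed log-probability."""
    word_counts, sense_counts, vocab = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, float("-inf")
    for sense in sense_counts:
        score = math.log(sense_counts[sense] / total)
        denom = sum(word_counts[sense].values()) + len(vocab)
        for word in context.lower().split():
            # Add-one smoothing keeps unseen words from zeroing the score.
            score += math.log((word_counts[sense][word] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

model = train(TRAINING)
print(disambiguate("tribes living among the trees of the amazon", model))
print(disambiguate("ordered books from the amazon website", model))
```

With real data, the context window, the feature set (unigrams, n-grams, collocations), and the classifier would all be tuned on sense-tagged text.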

Internship with Amazon.com

Explored data mining and text clustering algorithms, using the word2vec utility, to automatically create semantic groupings of words related to product items, e.g.

kitchen-appliances => [oven, refrigerator, microwave, toaster, coffee maker,..],
music-instruments => [violin, guitar, drums, clarinet, piano, flute,..],
furniture => [table, chair, sofa, bed, recliner, mattress, futon,..],
apparel => [shirt, pants, t-shirt, trousers, jeans, jacket, skirt, capris,..]

Hierarchical clustering was then used to arrange the words into a taxonomy, e.g.
[Electronics -> Home Appliances -> Kitchen Appliances],
[Home Decor -> Furniture -> [Living Room, Bedroom, Study, Dining, Outdoor]]
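As a rough sketch of how hierarchical clustering yields a nested grouping, the example below runs average-linkage agglomerative clustering in plain Python. The 2-D vectors are hypothetical stand-ins for trained word vectors, and `VECTORS`, `leaves`, and `build_taxonomy` are illustrative names, not part of the original project.

```python
import math
from itertools import combinations

# Hypothetical 2-D stand-ins for word vectors; trained word2vec vectors
# typically have a few hundred dimensions.
VECTORS = {
    "oven": (0.9, 0.1),
    "toaster": (0.8, 0.2),
    "sofa": (0.1, 0.9),
    "recliner": (0.25, 0.8),
}

def leaves(tree):
    """Flatten a nested tuple tree into its list of words."""
    if isinstance(tree, str):
        return [tree]
    return [word for child in tree for word in leaves(child)]

def average_distance(a, b):
    """Average pairwise Euclidean distance between two groups of words."""
    total = sum(math.dist(VECTORS[x], VECTORS[y]) for x in a for y in b)
    return total / (len(a) * len(b))

def build_taxonomy(words):
    """Repeatedly merge the two closest clusters until one tree remains."""
    trees = list(words)
    while len(trees) > 1:
        i, j = min(
            combinations(range(len(trees)), 2),
            key=lambda ij: average_distance(leaves(trees[ij[0]]), leaves(trees[ij[1]])),
        )
        merged = (trees[i], trees[j])
        trees = [t for k, t in enumerate(trees) if k not in (i, j)] + [merged]
    return trees[0]

print(build_taxonomy(["oven", "toaster", "sofa", "recliner"]))
```

Each level of the resulting nested tuple corresponds to one level of the taxonomy; naming the internal nodes (e.g. "Kitchen Appliances") is a separate step.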

Internship with University of Southern California (USC)

Implemented lexical semantics algorithms to extract similar or related entities from natural language texts. These entities can be named entities (names of people, companies, locations) or common nouns.

Similarity-based algorithms create groups of similar entities, e.g.

(potato, tomato, carrot, onion, spinach, celery..),
(salad, soup, pizza, pasta, sandwich, burger, pastry..),
(Japan, Korea, Singapore, China, Malaysia..),
(Honda, Toyota, Mazda, Suzuki, Mitsubishi, Yamaha..),
(Larry Page, Mark Zuckerberg, Jeff Bezos, Steve Jobs)

Entity-relations are binary, and typically connect two noun-phrases with a verb-phrase, e.g.
[New Delhi] – is capital of – [India],
[Shakespeare] – is author of – [King Lear],
[Yahoo] – acquired – [Flickr],
[Pixar] – is subsidiary of – [Walt Disney],
[Jeff Bezos] – founded – [Amazon.com]
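A binary relation of this shape can be sketched with simple surface patterns. The regular expressions below are hypothetical and hand-written for illustration; a real extractor would more likely rely on parsing or patterns learned from a corpus.

```python
import re

# Hypothetical mapping from a relation label to the verb phrases that express it.
RELATIONS = {
    "is capital of": r"is the capital of",
    "is author of": r"is the author of|wrote",
    "acquired": r"acquired|bought",
    "founded": r"founded",
}

def extract_triple(sentence):
    """Return (noun phrase, relation, noun phrase) for the first matching pattern."""
    for label, verb_phrase in RELATIONS.items():
        match = re.match(rf"^(.+?)\s+(?:{verb_phrase})\s+(.+?)\.?$", sentence.strip())
        if match:
            return (match.group(1), label, match.group(2))
    return None

for sentence in [
    "New Delhi is the capital of India.",
    "Shakespeare wrote King Lear.",
    "Yahoo acquired Flickr.",
    "Jeff Bezos founded Amazon.com.",
]:
    print(extract_triple(sentence))
```

This only handles sentences that consist of exactly two noun phrases joined by a known verb phrase; anything else returns `None`.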

Internship with SONY, Japan

Built a large-scale entity-relation database for actors, singers, and musicians by mining web pages and biographies on Wikipedia. This relational database contains entity-relations between a given artist and other entities mentioned on their page (e.g. song / movie / award titles, production studios, record labels, producers, directors, other famous artists etc.).

For example, “X was born in Y”, “X was nominated for Y”, “X played lead role in movie Y”, “X married Y”, “X graduated from Y”, “X signed contract with Y”, “X is influenced by Y” etc.

Semantically similar relations like:
[“X was awarded with Y”, “X won Y”, “X is the winner of Y”], or
[“X is influenced by Y”, “X is inspired by Y”, “X is motivated by Y”]
etc are also extracted, along with implications like:
[“X graduated from Y” => “X was enrolled in Y”], or
[“X won Y” => “X was nominated for Y”]
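One way such paraphrase groups and implications can be put to use is sketched below: paraphrases are mapped to a canonical relation, and implication rules then add the facts that follow. The dictionaries are built from the examples above; the names `CANONICAL`, `IMPLIES`, and `expand`, and the placeholder entities, are assumptions for illustration.

```python
# Paraphrases mapped to one canonical relation (from the examples above).
CANONICAL = {
    "was awarded with": "won",
    "is the winner of": "won",
    "is inspired by": "is influenced by",
    "is motivated by": "is influenced by",
}

# Implication rules: if the key relation holds, the value relation also holds.
IMPLIES = {
    "graduated from": "was enrolled in",
    "won": "was nominated for",
}

def expand(facts):
    """Normalize paraphrases, then add implied facts until nothing new appears."""
    known = {(s, CANONICAL.get(r, r), o) for s, r, o in facts}
    frontier = list(known)
    while frontier:
        subject, relation, obj = frontier.pop()
        implied = IMPLIES.get(relation)
        if implied is not None:
            new_fact = (subject, implied, obj)
            if new_fact not in known:
                known.add(new_fact)
                frontier.append(new_fact)
    return known

facts = [
    ("Artist A", "is the winner of", "Award B"),
    ("Artist A", "graduated from", "School C"),
]
for fact in sorted(expand(facts)):
    print(fact)
```

Storing the expanded facts lets a query for "nominated for Award B" also find artists whose page only says they won it.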

This metadata is used in applications like search & recommendation, e.g. to compute artist-artist similarity, or to search the catalog with queries like “grammy award winners”, “actors on friends tv-show”, “artists similar to Ricky Martin”, “artists related to Michael Jackson” etc.

Research at Singapore Management University

Developed machine-learning algorithms that predict the email reply order for Automatic Email Prioritization. The model analyzes user behavior and inter-personal relationships among users, along with features extracted from the emails themselves. While most prior work on email prioritization uses manual annotations, we used the actual order in which users responded to their emails to assign the priority labels. The algorithm predicts whether an email requires a response and, if so, by when.

Features include:

User Behavior: volume of emails sent by the sender & receiver per day (or week), average response time to emails

User Influence: average number of users the sender & receiver interact with on a daily basis

User Affinity: frequency or volume of emails exchanged between the sender & receiver per day (or week), average response time to each other

Email Characteristics: length of email, time-stamp, number of recipients on TO / CC / BCC lists etc.
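The sketch below shows how priority labels could be derived from the actual reply order, and how a couple of the affinity features could be computed from a mail log. The `Email` record, its field names, and the toy inbox are hypothetical; timestamps are given in hours for simplicity.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Email:
    msg_id: str
    sender: str
    receiver: str
    received_at: float            # hours since an arbitrary start
    replied_at: Optional[float]   # None if the email was never answered

def reply_order_labels(inbox):
    """Rank answered emails by the order in which the user replied (1 = first)."""
    answered = sorted(
        (e for e in inbox if e.replied_at is not None),
        key=lambda e: e.replied_at,
    )
    return {e.msg_id: rank for rank, e in enumerate(answered, start=1)}

def affinity_features(inbox, sender, receiver):
    """Email volume and mean response time for one sender/receiver pair."""
    pair = [e for e in inbox if e.sender == sender and e.receiver == receiver]
    delays = [e.replied_at - e.received_at for e in pair if e.replied_at is not None]
    return {
        "volume": len(pair),
        "avg_response_hours": sum(delays) / len(delays) if delays else None,
    }

inbox = [
    Email("m1", "manager", "me", 1.0, 1.5),
    Email("m2", "newsletter", "me", 0.5, None),
    Email("m3", "peer", "me", 0.75, 4.0),
    Email("m4", "manager", "me", 5.0, 5.5),
]
print(reply_order_labels(inbox))
print(affinity_features(inbox, "manager", "me"))
```

Unanswered emails such as `m2` get no rank, which matches the two-part prediction: first whether a reply is needed, then how soon.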

Project with Persistent Systems, Pune

Implemented a brand-clustering algorithm to identify companies that work in the same industry sectors or offer similar products and services, by mining features from online news text using the word2vec utility. The output shows semantic groupings of popular brands, e.g.

[Nestle, Kellogg’s, Parle, Britannia, Cadbury, Hershey’s ..],
[Lakme, Revlon, Ponds, Palmolive, L’Oreal Paris ..],
[Sony, LG, Samsung, Panasonic, Toshiba, Hitachi ..],
[Disney, Pixar, HBO, ESPN, BBC, CNN, MTV ..],
[Lee Cooper, Bare Denim, Peter England, GAP, Levi’s]

Project with Make-My-Trip

Text Mining from Hotel Reviews and Online Travel Articles.

[1] Built a destination search-engine to find domestic and international cities by activity (e.g. scuba diving, horse riding, ice skating, cross-country skiing, trekking etc.) or by theme (amusement parks, world heritage sites, wild life safari, beach resorts etc.), by mining articles from the Wiki Travel and Lonely Planet websites.

[2] Built a hotel search-engine by mining reviews from Trip Advisor, to find hotels near specific points of interest (metro station, airport, popular landmarks, local attractions), or by specialty services (fine Italian dining, private beach, infinity pool etc.).

[3] Performed sentiment analysis on text snippets (phrases) extracted from hotel reviews to score and rank hotels on various dimensions like location, ambiance, cleanliness, quality of food, service, amenities etc. For example,

POS (+): “pristine pool”, “panoramic view”, “spotlessly clean rooms”, “delicious food”, “friendly staff”, “walking distance to beach”, “near shops and restaurants” ..

NEG (-): “cobwebs in every corner”, “dingy floors”, “rude staff”, “next to garbage dump” etc.
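A minimal sketch of phrase-level scoring follows. It assumes a small hand-made lexicon that maps each phrase to a dimension and a polarity; the dimension assigned to each phrase is a guess for illustration, and a production system would learn such a lexicon from the reviews rather than hard-code it.

```python
from collections import Counter

# Hypothetical lexicon built from the example phrases: phrase -> (dimension, polarity).
PHRASE_LEXICON = {
    "pristine pool": ("amenities", +1),
    "panoramic view": ("location", +1),
    "spotlessly clean rooms": ("cleanliness", +1),
    "delicious food": ("food", +1),
    "friendly staff": ("service", +1),
    "walking distance to beach": ("location", +1),
    "cobwebs in every corner": ("cleanliness", -1),
    "dingy floors": ("cleanliness", -1),
    "rude staff": ("service", -1),
    "next to garbage dump": ("location", -1),
}

def score_hotel(reviews):
    """Sum phrase polarities per dimension across all reviews of one hotel."""
    scores = Counter()
    for review in reviews:
        text = review.lower()
        for phrase, (dimension, polarity) in PHRASE_LEXICON.items():
            if phrase in text:
                scores[dimension] += polarity
    return dict(scores)

reviews = [
    "Friendly staff and delicious food, but dingy floors.",
    "Rude staff. Cobwebs in every corner.",
]
print(score_hotel(reviews))
```

Hotels can then be ranked on any one dimension by sorting on its score.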

[4] Trained a word2vec model on Trip Advisor reviews to automatically discover concepts related to the travel domain from text. Also created 3D visualizations of the word vectors using the Embedding Projector visualization tool from Google / TensorFlow.

For example, to describe an unclean hotel property, travelers use many different expressions, such as [dust bunnies, cobwebs, cockroaches, dingy floors, dumpster, litter, garbage, trash, stray dogs, dirty smell, carpet stains ..].

Such expressions are automatically extracted from the review text and mapped to vectors in an n-dimensional space, where related expressions are found by comparing vectors using Cosine Similarity or Euclidean Distance.
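The two measures can be sketched as follows. The 3-D vectors are invented for illustration (a trained model would supply them), and `most_similar` is a hypothetical helper, not a word2vec API.

```python
import math

# Invented 3-D vectors standing in for trained embeddings.
EMBEDDINGS = {
    "cobwebs": (0.9, 0.1, 0.0),
    "litter": (0.8, 0.2, 0.1),
    "infinity pool": (0.0, 0.2, 0.9),
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def euclidean_distance(u, v):
    """Straight-line distance between two vectors; 0.0 means identical."""
    return math.dist(u, v)

def most_similar(term):
    """Return the other term whose vector has the highest cosine similarity."""
    others = (t for t in EMBEDDINGS if t != term)
    return max(others, key=lambda t: cosine_similarity(EMBEDDINGS[term], EMBEDDINGS[t]))

print(most_similar("cobwebs"))
print(euclidean_distance(EMBEDDINGS["cobwebs"], EMBEDDINGS["infinity pool"]))
```

Cosine similarity ignores vector length and compares direction only, while Euclidean distance is sensitive to both, so the two can rank neighbors differently unless the vectors are normalized.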