Using Word Embeddings to Categorize Search Results

PUBLISHED ON JAN 6, 2018

A few weeks ago, I wrote a post on how I used Python and D3 to visualize my Google search queries over time. I figured it would be a good way to see how my interests have evolved and perhaps even shed light on new areas of learning I could explore.

Given my search results for a particular day, it was surprising how easy it was to transport myself back into the state of mind I was in at the time. For example, one day's spike in the number of searches brought me back to when my co-founder and I were frantically preparing for the launch of our mobile app in tandem with a blog post on the popular data science and machine learning blog at Yhat, which has since been acquired by Alteryx. My activity went back as far as 2009, when I was a junior at Cornell and worried about what I was going to do after I graduated.

While it was fun to peek into the mind of a past me, I thought it would be cool to reduce the dimensionality of the data and see if it was possible to generalize my search results and visualize how my interest in areas such as art, science, and technology changed over time. I needed a list of topics that were specific enough to be interesting, but broad enough to cover a wide range of my interests, and came across Google's content categories, which are used to classify news content for search at scale. A quick scrape of the website and a parse of the top-level categories gave me this:

  • Arts And Entertainment
  • Autos And Vehicles
  • Beauty And Fitness
  • Books And Literature
  • Business And Industrial
  • Computers And Electronics
  • Food And Drink
  • Games
  • Health
  • Home And Garden
  • Hobbies And Leisure
  • Internet And Telecom
  • Jobs And Education
  • Law And Government
  • News
  • People And Society
  • Pets And Animals
  • Real Estate
  • Science
  • Sports
  • Travel
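
Since the script we're about to write reads these categories from a file named categories.txt, here's a minimal sketch that saves the list above to that file, one category per line. (The file name matches what the later code expects; the snippet itself is just a convenience, not part of the scrape.)

cats = [
  'Arts And Entertainment', 'Autos And Vehicles', 'Beauty And Fitness',
  'Books And Literature', 'Business And Industrial', 'Computers And Electronics',
  'Food And Drink', 'Games', 'Health', 'Home And Garden', 'Hobbies And Leisure',
  'Internet And Telecom', 'Jobs And Education', 'Law And Government', 'News',
  'People And Society', 'Pets And Animals', 'Real Estate', 'Science', 'Sports',
  'Travel',
]

# write one category per line so it can be read back with a simple loop
with open('categories.txt', mode='w') as f:
  f.write('\n'.join(cats) + '\n')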

Now that I had a list of categories I could work with, it was time to find a way to map a search query to a specific category. But how can we write a function that maps a search query such as “1966 Shelby 427 Cobra” to “Autos and Vehicles” or “Tim Berners-Lee” to “Internet and Telecom”? Enter Google's word2vec, a group of related neural network models trained to reconstruct the linguistic contexts of words. Cool. Let's write some code.

Create and open a new Python file.

touch categories.py && subl categories.py

Import the libraries we’ll need.

import gensim.models.keyedvectors as word2vec
import numpy as np
import math
import csv

A model pre-trained on Google's news corpus should be a good fit, and Radim Rehurek's gensim library is great for loading it.

model = word2vec.KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)

Even though the word2vec model contains 3 million words and phrases, many of our search queries will not map directly to a single vector because they are either not in the vocabulary or contain more than one word. We'll need a fast way to check whether a word is in the model's vocabulary and a way to build our own vector when the search query is longer than one word.

Let's build a set of the model's vocabulary for fast lookups and call it index2word_set.

index2word_set = set(model.index2word)

And write a utility function that returns the average feature vector of the words in a search query.

def avg_feature_vector(search_query, model, num_features, index2word_set):
  words = search_query.split()
  feature_vec = np.zeros((num_features, ), dtype='float32')
  n_words = 0
  for word in words:
    if word in index2word_set:
      n_words += 1
      feature_vec = np.add(feature_vec, model[word])
  if n_words > 0:
    feature_vec = np.divide(feature_vec, n_words)
  return feature_vec

We'll need a list of our categories and a list of their feature vectors, so let's initialize them.

categories = []
feature_vectors = []

Then parse the categories.txt file, appending each category and its average feature vector to the corresponding list.

with open('categories.txt', mode='r') as categories_f:
  for line in categories_f:
    category = line.strip('\n')
    categories.append(category)

for category in categories:
  feature_vector = avg_feature_vector(category, model=model, num_features=300, index2word_set=index2word_set)
  feature_vectors.append(feature_vector)

Next, stack the category feature vectors into a matrix and take its transpose so that each column corresponds to one category. This is our average feature matrix, afm.

afm = np.vstack(feature_vectors).T
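
To see why the transpose matters, here's a quick shape check (a sketch assuming our 21 categories and 300-dimensional vectors): a query vector of shape (300,) dotted with afm of shape (300, 21) yields one raw score per category.

# Sketch: sanity-check the shapes (21 categories, 300-dimensional vectors).
print(afm.shape)  # (300, 21) -- one column per category
query_vec = avg_feature_vector('1966 Shelby 427 Cobra', model=model, num_features=300, index2word_set=index2word_set)
print(query_vec.shape)  # (300,)
print(np.dot(query_vec, afm).shape)  # (21,) -- one raw score per category

One subtlety worth noting: in the cos_sim function we'll write next, np.linalg.norm(afm) is the norm of the entire matrix rather than of each column, so the scores are not true per-category cosines. They still preserve the ranking, since every score is divided by the same constant, which is why the threshold we'll pick later looks so small.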

Now we need to import the data.csv file created in our previous work and append a mapping of each search query to a specific topic. In order to do so, we’ll need three utility functions.

One that calculates cosine similarity between vectors.

def cos_sim(v1, v2):
  dot_product = np.dot(v1, v2)
  norm_v1 = np.linalg.norm(v1)
  norm_v2 = np.linalg.norm(v2)
  return dot_product / (norm_v1 * norm_v2)

Another to return the best topic given the vector of similarity scores for a search query.

def return_best_topic(topics, output_vector):
  top_five = []
  index = output_vector.argmax(axis=0)
  # sort the scores in descending order and keep the five highest-scoring topics
  indices = np.argsort(output_vector)[::-1][:5]
  for i in indices:
    topic = topics[i]
    score = output_vector[i]
    top_five.append((topic, score))
  return topics[index], output_vector[index], top_five

And another to map a search query to a topic, given the categories' average feature matrix.

def map_query(query, afm, categories, model, index2word_set, threshold):
  query_afm = avg_feature_vector(query, model=model, num_features=300, index2word_set=index2word_set)
  output_vector = cos_sim(query_afm, afm)
  topic, score, top_five = return_best_topic(categories, output_vector)
  if not math.isnan(score):
    score = float(score)
    if score > threshold:
      return topic
    else:
      return ''
  else:
    return ''

Now we're ready to add the categories to our data.csv file and save the result to use later. Note that some search queries simply should not map to any of our categories, so after playing around with the data, I found that threshold=0.03 works well: it is high enough to only return a topic when there is a strong correlation, but not so high that it throws away good matches. The snippet below assumes each row of data.csv holds the date, the number of searches, and then the queries themselves.

with open('data.csv', mode='r') as csvfile:
  with open('data_with_topics.tsv', mode='w') as csv_out:
    reader = csv.reader(csvfile)
    next(reader)  # skip the header row
    for row in reader:
      # assumes the columns are: date, number of searches, then the queries
      date, searches = row[0], row[1]
      queries = row[2:]
      topics = []
      for query in queries:
        if query != '':
          query = query.strip().strip('\"').strip('\'')
          topic = map_query(query=query, afm=afm, categories=categories, model=model, index2word_set=index2word_set, threshold=0.03)
          topics.append(topic)
      csv_out.write('{}\t{}\t{}\t{}\n'.format(date, searches, queries, topics))

Finally, let's test out some of the results and see what we get.

search_query = '1966 Shelby 427 Cobra'
test_afv = avg_feature_vector(search_query, model=model, num_features=300, index2word_set=index2word_set)
output_vector = cos_sim(test_afv, afm)
return_best_topic(categories, output_vector)[:2]
>>>('Autos And Vehicles', 0.037955116)

Sure enough, “1966 Shelby 427 Cobra” returns “Autos And Vehicles”!

“Tim Berners-Lee”, however, returns nothing because the top result's score is too low. This is not exactly the desired behavior, but it is better than returning a false positive. The reason is that no word embedding exists for “Tim” or “Berners-Lee” that translates back to a vector representation of the well-known computer scientist, best known as the inventor of the World Wide Web. My friend and co-founder, Lance Legel, wrote a nice function that actually executes the search and returns a string representation of the top pages. It works wonderfully for representing strings that do not exist within the pre-trained model. Unfortunately, using it with as much data as we have here would cost money, since Google search result parsing requires an API key after 100 searches.
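
For reference, here's a minimal sketch of that idea (my own illustration, not Lance's actual code): fetch the top results for a query from Google's Custom Search JSON API, concatenate their titles and snippets into one string, and run that string through avg_feature_vector instead of the raw query. The API key, search engine ID, and helper name here are placeholders.

import requests

def expand_query_via_search(query, api_key, cx):
  # Hypothetical helper: return the titles and snippets of the top results
  # as a single string. Requires a Custom Search API key and engine ID (cx).
  resp = requests.get(
    'https://www.googleapis.com/customsearch/v1',
    params={'key': api_key, 'cx': cx, 'q': query, 'num': 5},
  )
  resp.raise_for_status()
  items = resp.json().get('items', [])
  return ' '.join('{} {}'.format(item.get('title', ''), item.get('snippet', '')) for item in items)

# e.g. embed the expanded text instead of the raw query:
# expanded = expand_query_via_search('Tim Berners-Lee', API_KEY, CX)
# vec = avg_feature_vector(expanded, model=model, num_features=300, index2word_set=index2word_set)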

So there you have it! Mapping search results to predefined categories using word2vec word embeddings!