Building an Enterprise-Grade Patent Search System with Semantic Search

Introduction:

A patent is issued by a specific government and grants the owner of an invention exclusive rights within that jurisdiction, protecting the inventor's intellectual property (IP) from being copied or stolen. Since many businesses today are built on novel technologies, protecting those innovations by filing patents is valuable, and doing so supports a multi-billion-dollar industry.

Filing a patent is a complex process and requires a "prior art" search, where a patent officer looks for similar, previously filed patents to ensure that the invention is genuinely unique. The patent search process typically begins with a multi-page document describing the invention, and patent search professionals are skilled at creating extensive queries using both keywords and metadata. A patent examiner generally formulates a set of search queries for a new patent and reviews around 100 patents retrieved from approximately 15 different queries on average. This validation process takes about 12 hours to complete.

The Challenge:

Off-the-shelf search and indexing methods face significant challenges in processing patent documents due to the frequent use of neologisms (newly coined terms) and intentionally broad or vague language. Patent applicants often employ these tactics to claim a larger scope of innovation, ensuring their invention encompasses as much intellectual territory as possible. This results in text that is purposefully difficult to parse and interpret, as it is designed to be inclusive of various potential adaptations and applications of the invention.

Such language makes traditional keyword-based search tools inadequate for patent analysis, as they may struggle to match synonyms, recognize emerging terms, or account for nuanced expressions. Consequently, these systems can miss critical information, fail to retrieve relevant patents, or generate irrelevant results, underscoring the need for more sophisticated, context-aware search techniques like semantic search. By interpreting the meaning behind words rather than relying solely on exact matches, semantic search can bridge the gap created by these unique linguistic challenges in patent documentation.

Beyond patent filing, there are additional use cases, such as tracking new inventions or monitoring the development of specific technologies by business competitors. Historically, options for this research include using Google Patent Search or the United States Patent and Trademark Office (USPTO) database. However, keyword search has limitations here as well, and we are now interested in exploring approaches that go beyond keyword search to enhance analysis.

Datasets:

Patent data can be obtained directly from the patent offices or from research collections. The US patent corpus contains roughly 10 million documents, which can be obtained from the IFI Claims patent database or Google's public patent datasets. The USPTO also makes its full-text database available online for manual search [9].

Pre-Processing:

‘A document is represented by the features extracted from it, not by its “surface” string form’.

The patenting process requires documents to follow a specific structure, typically including an abstract, description, and claims. Building a search system involves creating a data pipeline to clean, transform, and model the data. We use Gensim and Spark to preprocess our data. Following standard preprocessing steps for text data, we perform the following tasks:

  • Tokenization: The text is split into sentences, and each sentence is further split into words. Words are converted to lowercase, and punctuation is removed.
  • Words with fewer than three characters are removed.
  • All stopwords are removed.
  • Lemmatization: Words are lemmatized, meaning words in the third person are converted to the first person, and verbs in past and future tenses are changed to the present tense.
  • Stemming: Words are reduced to their root form.
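
A minimal sketch of these steps in Python. It combines Gensim's tokenizer with NLTK's stopword list, WordNet lemmatizer, and Porter stemmer; the NLTK pieces are our illustrative choice (the "stopwords" and "wordnet" resources must be downloaded first via nltk.download), not the only way to do this.

from gensim.utils import simple_preprocess
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

STOPWORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

def preprocess(text):
    # simple_preprocess lowercases, strips punctuation, and tokenizes;
    # min_len=3 drops words with fewer than three characters.
    tokens = simple_preprocess(text, min_len=3)
    tokens = [t for t in tokens if t not in STOPWORDS]           # stopword removal
    tokens = [lemmatizer.lemmatize(t, pos="v") for t in tokens]  # lemmatization
    return [stemmer.stem(t) for t in tokens]                     # stemming

print(preprocess("The claimed imaging devices were tested repeatedly"))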

The aim of this step is to clean the data: removing stop words, handling missing entries, and so on. There are several ways to preprocess text data to develop the right set of features. We begin by parsing the XML content (stored in PostgreSQL) to extract the title, description, and claims text. These three sections can be combined into a single text blob; for patent documents, the claims text can also be separated out into a distinct claims index, which is often done to improve search accuracy.
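
For illustration, the sketch below pulls the three sections out of a single XML record. The element names are assumptions on our part (they resemble the USPTO full-text layout); adjust the XPath expressions to whatever schema your stored records actually use.

import xml.etree.ElementTree as ET

def _section_text(root, path):
    # Concatenate all text under the first element matching `path`, if any.
    node = root.find(path)
    return "".join(node.itertext()).strip() if node is not None else ""

def extract_sections(xml_string):
    root = ET.fromstring(xml_string)
    return {
        "title": _section_text(root, ".//invention-title"),
        "description": _section_text(root, ".//description"),
        "claims": _section_text(root, ".//claims"),
    }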

At the end of this process, we generate two different indexes: the description_text index, which comprises description text objects, and the claim_text index, which consists of claim text documents. Next, we apply a general document representation technique called Bag-of-Words (unigram) for each document, creating an unordered list of words as document features. We tokenize the documents by splitting on whitespace, resulting in a Python list of words. This approach mirrors the Deerwester experiment [1].

Several other document representation strategies also exist and may yield better results. In this step, using the bag-of-words list as input, we create a sparse vector representation for each document: every document becomes a single vector whose elements are (word, frequency) pairs. It is advantageous to represent words by their (integer) IDs. The mapping between words and IDs is called a dictionary in Gensim terminology, and Gensim provides a Dictionary class for building this vector representation.

At the end of this step, we have a sparse vector (a list of tuples, where each tuple is a word_id, frequency pair) representing each document in the corpus.
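
Concretely, with Gensim this is two calls; the toy token lists below stand in for the output of the pre-processing step.

from gensim import corpora

tokenized_docs = [
    ["film", "unit", "lens", "exposure"],
    ["lens", "assembly", "optical", "exposure"],
]

dictionary = corpora.Dictionary(tokenized_docs)            # word <-> integer id mapping
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
print(bow_corpus[0])   # sparse vector, e.g. [(0, 1), (1, 1), (2, 1), (3, 1)]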

Transformations and modeling: Given the bag of words extracted from the document, we create a feature vector where each feature represents a word (term) and the feature's value is a term weight. In our case, this term weight is a TF-IDF value. Thus, the entire document is represented as a feature vector, with each vector corresponding to a point in a vector space.

The model for this vector space has an axis for every term in the vocabulary, making it V-dimensional, where V is the size of the vocabulary. Conceptually, the vector should also be V-dimensional, with a feature for every term in the vocabulary. However, because the vocabulary can be large (on the order of V = 100,000 terms), a document's feature vector typically includes only the terms that actually appear in that document, omitting terms that do not. This results in a feature vector that is considered sparse.
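
Continuing from the bag-of-words corpus in the previous sketch, the TF-IDF weighting is a single Gensim transformation:

from gensim import models

tfidf = models.TfidfModel(bow_corpus)       # learns IDF statistics from the corpus
tfidf_corpus = tfidf[bow_corpus]            # applies the weighting lazily
for doc in tfidf_corpus:
    print(doc)                              # sparse list of (word_id, weight) pairs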

Building the Core-Engine:

As discussed above, off-the-shelf search and indexing methods struggle with the neologisms and intentionally over-generalized language of patent documents, since applicants want to claim as large a chunk of innovation as possible. The core engine therefore needs a representation that captures meaning rather than exact wording.

Latent Semantic Indexing (LSI) is a machine learning technique that takes a collection of documents as input and produces a vector space representation of the documents and terms within that collection. The core feature of LSI is the use of Singular Value Decomposition (SVD), which performs large-scale dimensionality reduction on the term-document matrix.
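
Stated as an equation (the standard truncated SVD behind LSI, given here for reference):

$$
X \;\approx\; X_k \;=\; U_k \, \Sigma_k \, V_k^{\top}, \qquad k \ll \min(V, D)
$$

Here $X$ is the $V \times D$ term-document (TF-IDF) matrix, $U_k$ and $V_k$ hold the top $k$ left and right singular vectors, and $\Sigma_k$ contains the $k$ largest singular values; documents and queries are then compared in this $k$-dimensional "topic" space.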

LSI overcomes two major limitations of Boolean keyword queries: synonymy (multiple words with similar meanings) and polysemy (words with multiple meanings). Synonymy often leads to mismatches between the vocabulary used by document authors and that used by users of information retrieval systems. As a result, Boolean or keyword-based queries frequently return irrelevant results or miss information that is actually relevant.

LSA assumes that words that are close in meaning will occur in similar pieces of text. A matrix containing word counts per paragraph (rows represent unique words and columns represent each paragraph) is constructed from a large piece of text, and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of columns while preserving the similarity structure among rows. Words are then compared by taking the cosine of the angle between the vectors formed by any two rows [4].
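
A sketch of this step with Gensim, continuing from the TF-IDF corpus and dictionary built earlier; the two-topic setting only makes sense for a toy corpus, while a real corpus would use a few hundred topics.

from gensim import models, similarities

lsi = models.LsiModel(tfidf_corpus, id2word=dictionary, num_topics=2)  # truncated SVD
lsi_corpus = lsi[tfidf_corpus]                     # project documents into topic space
index = similarities.MatrixSimilarity(lsi_corpus)  # cosine-similarity index over the projections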

Evaluation:

We need to assess the effectiveness of our semantic search system in terms of returning relevant documents that fulfill specific information needs. Evaluating real-world systems is challenging, especially when the user’s information needs are dynamic. In practice, to compare different retrieval modeling strategies, we aim for relative performance evaluation between systems. Establishing a baseline is beneficial, allowing us to use scores to rank modeling approaches (or IR systems) relative to one another. It is also crucial to understand the information needs of professional users of the patent search service, who generally have a lower tolerance for errors compared to typical web search users.

Trippe and Ruthven, for instance, answer the fundamental question of "What is success?" with risk minimization. The argument is that what the searcher is ultimately trying to achieve is minimizing the risk of having a patent application rejected, infringing on someone else's in-force patent, being excluded from a market because of a new patent, and so on. They map the fundamental metrics of IR effectiveness, precision and recall, onto the different search tasks and onto how those tasks are ordered on this "risk scale." They observe that precision is the better match, insofar as it orders the tasks in the same way as the importance of risk minimization.

Simulating user information needs: Generally speaking, in a lab setting we follow the Cranfield approach to measure the effectiveness of a retrieval strategy. This requires test collections, which can be used to create an experimental setup in which various modeling approaches can be benchmarked against one another. A test collection consists of a document collection, a test suite of information needs (expressed as queries), and a set of relevance judgments, typically a binary assessment indicating whether each query-document pair is relevant or non-relevant [7]. Evaluation campaigns such as CLEF, TREC, and NTCIR have run patent-focused tracks to support this need.
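
Given binary judgments, the two classic effectiveness measures reduce to simple set arithmetic per query. A minimal, set-based sketch (ranking ignored):

def precision_recall(retrieved_ids, relevant_ids):
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0  # fraction of results that are relevant
    recall = hits / len(relevant) if relevant else 0.0       # fraction of relevant docs retrieved
    return precision, recall

print(precision_recall(["US123", "US456", "US789"], ["US456", "US999"]))  # (0.33..., 0.5)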

Sanity Checks (when a test collection isn't available):

About the Deerwester collection: This is a tiny corpus of nine documents, each consisting of only a single sentence.


>>> documents = ["Human machine interface for lab abc computer applications",
>>>              "A survey of user opinion of computer system response time",
>>>              "The EPS user interface management system",
>>>              "System and human system engineering testing of EPS",
>>>              "Relation of user perceived response time to error measurement",
>>>              "The generation of random binary unordered trees",
>>>              "The intersection graph of paths in trees",
>>>              "Graph minors IV Widths of trees and well quasi ordering",
>>>              "Graph minors A survey"]

We use LSI to extract two topics out of this corpus. After the model and index have been created on this dataset, we perform the following query:

SearchQuery: "Human Computer Interaction"
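
Following the usual Gensim tutorial flow, the whole toy pipeline plus this query looks roughly like the session below (it assumes the `documents` list defined above; stopword filtering is omitted for brevity):

>>> from gensim import corpora, models, similarities
>>> texts = [doc.lower().split() for doc in documents]
>>> dictionary = corpora.Dictionary(texts)
>>> corpus = [dictionary.doc2bow(text) for text in texts]
>>> lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
>>> index = similarities.MatrixSimilarity(lsi[corpus])
>>> query_bow = dictionary.doc2bow("Human Computer Interaction".lower().split())
>>> sims = sorted(enumerate(index[lsi[query_bow]]), key=lambda item: -item[1])
>>> print(sims[:3])   # the human-computer interaction documents should rank highest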

Family members test: Patents often have "family members," which share the same priority date and the same description text but have different claims text. While the claims differ from one family member to another, they are very likely to be semantically similar. For a corpus of any size, submit the full text of each patent document in the corpus as a query. For each search, identify what percentage of family members are returned in the following buckets (a sketch of this measurement follows the list):

  • the #1 search result
  • the top 20 search results
  • ranks 21-100
  • not in the top 100
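
A sketch of how this measurement could be computed. Both `search` (a callable returning a ranked list of patent ids for a seed patent) and `families` (a mapping from each patent id to its family members) are hypothetical stand-ins for your own engine and metadata.

from collections import Counter

def family_rank_buckets(search, families, seed_ids, k=100):
    # Tally which rank bucket each family member of each seed lands in.
    buckets = Counter()
    for seed in seed_ids:
        results = search(seed)[:k]
        for member in families[seed]:
            if member == seed:
                continue
            rank = results.index(member) + 1 if member in results else None
            if rank == 1:
                buckets["#1 result"] += 1
            elif rank is not None and rank <= 20:
                buckets["top 20"] += 1
            elif rank is not None:
                buckets["21-100"] += 1
            else:
                buckets["not in top 100"] += 1
    total = sum(buckets.values()) or 1
    return {bucket: 100.0 * count / total for bucket, count in buckets.items()}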


Film Unit with lens:
This test may lean more toward being a heuristic rather than a purely statistical test, but I believe it is an essential evaluation for any successful algorithm to pass. In this particular Chinese patent, when translated to English, the word "camera" appears only as an adjective or in combination with the more obscure term "camera obscura." The word "camera" is never used in the claims. Instead, the primary noun in the document is “film unit with lens,” which essentially describes a "camera" as we understand it but is never directly referred to as such in the claims set.

What question are we trying to answer? Here is the test: any successful semantic search engine should return "camera" patents when this Chinese patent is used as a search seed. For example, it should return many patents owned by Kodak, Canon, Fuji Film, and similar companies. Our current search engine does a good job of this.


Performance Evaluation:

Apart from search accuracy, we also need to evaluate performance: how long does it take to train the model and index the data, and what is the query performance (i.e., how long does it take to process one or more queries)? Some of these can be seen as non-functional requirements.
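
For the query side, a small latency harness; `search_fn` is a placeholder for whatever callable wraps the engine.

import statistics
import time

def query_latency(search_fn, queries, repeats=3):
    # Best-of-N wall-clock time per query, to smooth out transient noise.
    latencies = []
    for query in queries:
        runs = []
        for _ in range(repeats):
            start = time.perf_counter()
            search_fn(query)
            runs.append(time.perf_counter() - start)
        latencies.append(min(runs))
    return statistics.mean(latencies), max(latencies)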

Basic Experiment:

To evaluate the performance of our system, we replicated the experiment done by the Gensim authors (https://radimrehurek.com/gensim/wiki.html), in which LSA is applied to a corpus containing the full set of articles in Wikipedia.

Differences in dataset:

  • The Gensim authors used a corpus of 3.9M documents, 100K features (distinct tokens), and 0.76 billion non-zero entries in the sparse TF-IDF matrix. Their Wikipedia corpus contains about 2.24 billion tokens in total and is 8 GB compressed.
  • We used the dump at https://dumps.wikimedia.org/enwiki/20170820/, which contains about 5 million documents and about 46 GB of raw text (13 GB compressed).

Parameters (same for both experiments):

  • An LSI model was created with 400 topics and a vocabulary size of 2^17 terms.
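
In Gensim terms these two parameters plug in roughly as follows; this is a compact sketch rather than the actual training script, and `tokenized_docs` is a placeholder for the pre-processed corpus.

from gensim import corpora, models

dictionary = corpora.Dictionary(tokenized_docs)
dictionary.filter_extremes(keep_n=2**17)     # cap the vocabulary at 131,072 terms
bow = [dictionary.doc2bow(doc) for doc in tokenized_docs]
tfidf = models.TfidfModel(bow)
lsi = models.LsiModel(tfidf[bow], id2word=dictionary, num_topics=400)   # 400 topics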

Hardware Differences:

Gensim Experiment - Hardware Details (2012):

  • MacBook Pro, Intel Core i7 2.3 GHz, 16 GB DDR3 RAM, OS X with vecLib.
  • In distributed mode: four workers (Linux, 2 GHz dual-core Xeons, 4 GB RAM, with ATLAS).

Our Hardware Details:

  • Our experiment was run on five r4.4xlarge AWS instances.
  • The Gensim authors used ATLAS (designed for matrix operations on large datasets). Our system supports ARPACK, but we did not use it in this evaluation; using ARPACK can lead to performance gains of at least 2-3x.
Step | Gensim | Our Implementation
Data Cleaning + Pre-Processing (results in TF-IDF representation) | 9 hrs (can only run on the MacBook) | ~30 mins (Spark)
Modeling | 5.25 hrs (on the MacBook) or 1 hr 40 mins (on the cluster) | ~2 hrs (Spark)
Indexing | N/A | ~3 hrs (Gensim)
Total | ~14 hrs | ~5 hrs

Conclusion:

In this project we built a semantic search engine that allows us to understand a user's intent through contextual meaning. We started by discussing some of the specificities of the patent domain, its rules and customs, and learned how patents are formed as documents, pre-processed, and modeled. The core technique behind LSA is singular value decomposition, in which we decompose a term-document matrix into a set of orthogonal factors.


References:

[1] http://lsa.colorado.edu/papers/JASIS.lsi.90.pdf

[2] https://lirias.kuleuven.be/bitstream/123456789/321960/1/MSI_1114.pdf

[3] http://pages.cs.wisc.edu/~jerryzhu/cs769/text_preprocessing.pdf

[4] detailed discussion on advanced topics: Moens, M. F. (2006). Information extraction: Algorithms and prospects in a retrieval context (The Information Retrieval Series 21).

[5] Paper: 'So many topics, so little time':
http://sigir.org/files/forum/2009J/2009j-sigirforum-roda.pdf

[6] Patent Information Retrieval: http://www.ir-facility.org/c/document_library/get_file?uuid=a176a998-bfc6-41a4-b0d2-852784e3e720&groupId=10156

[7] http://www.informationr.net/ir/18-2/paper582.html#.WHSxs7Z96QM

[8] https://www.ccs.neu.edu/home/vip/teach/IRcourse/IR_surveys/fntir_2013.pdf

[9] http://patft.uspto.gov/