Today I’d like to discuss the workings of Latent Semantic Indexing a bit further. It will help you understand the importance of LSI even better, and it will also help explain why I wrote a free mini-tool that strips the “noise” from an article and leaves you with only semantically relevant words (or “semantic words” as I’ll be referring to them from here on).
To recap my introduction to LSI article and refresh your memory: old-school search engine techniques approached keyword searches with something of an accountant’s mentality. A word is either found in an article or on a web page, or it is not; there is no middle ground.
In addition to indexing the sites that DO contain a specific keyword, Latent Semantic Indexing looks at an entire collection of documents as a whole. That can mean a subsection of your website, your entire website, and sometimes even your website plus several other websites that have multiple links to yours.
While attempting to assign a value / rank to your specific page in question, LSI looks at all the other documents for the same or related words. LSI tries to simulate a human being when it comes to judging relevancy among a set of documents.
LSI does not understand what the words mean, in part due to the complex nature of the English language. The patterns it picks up on can make it seem incredibly intelligent, while in fact it’s still a ‘dumb computer’.
What does an LSI algorithm look at?
An LSI algorithm will index a set of documents (which can be a handful or thousands of documents) and calculate similarity values for every semantic word (more on this later). The formulas search engines use to determine similarity and relevancy values are extremely complex and, above all, kept secret.
When comparing one document or page to another, LSI algorithms are not looking for an exact match of a specific keyword to determine relevancy. The two documents or pages therefore do not have to contain the same keywords in order to be relevant in the eyes of an LSI algorithm. This makes an LSI-powered keyword search much better than a plain keyword search (or old-school keyword search as I called it earlier).
To use an example, let’s say a collection of diabetes-related articles is indexed by an LSI search engine. If the words diabetes, insulin, and glucose appear together in enough articles, the LSI search algorithm will figure out that the three terms are ‘semantically close’, or in plain English: the terms are related to each other.
As such, a search for ‘diabetes’ will return a set of articles containing that word BUT also articles that contain just the word insulin (and not diabetes). Think of an article that explains what insulin is and what it does to your body but does not mention the word diabetes even once. An LSI algorithm will still treat it as relevant to ‘diabetes’, even though the search engine doesn’t know anything about diabetes the way a human being does.
So by examining enough documents, an LSI algorithm teaches itself that these three terms are related. It uses this information to provide more sophisticated and natural search results.
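To make the idea a bit more concrete, here is a minimal sketch in Python. Please note the assumptions: this is not how any search engine actually does it, and it is not real LSI either (real LSI applies a matrix technique called singular value decomposition to a term-document matrix). It simply counts how many documents each pair of words shares, which is the pattern LSI builds on. The sample articles and the function names cooccurrence_counts, related_terms, and search are made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical mini-collection; each string stands in for one article.
articles = [
    "diabetes patients track glucose before insulin injections",
    "insulin lowers glucose levels in the blood",
    "type 2 diabetes raises glucose because insulin works poorly",
    "a healthy breakfast recipe with oats plus fruit",
]

def cooccurrence_counts(docs):
    """Count in how many documents each pair of words appears together."""
    counts = Counter()
    for doc in docs:
        words = sorted(set(doc.lower().split()))
        counts.update(combinations(words, 2))
    return counts

def related_terms(term, counts, min_docs=2):
    """Return the words that share at least `min_docs` documents with `term`."""
    related = {}
    for (a, b), n in counts.items():
        if n >= min_docs and term in (a, b):
            other = b if a == term else a
            related[other] = n
    return related

def search(term, docs, counts):
    """Return indexes of documents containing the term or a related term."""
    wanted = {term} | set(related_terms(term, counts))
    return [i for i, doc in enumerate(docs) if wanted & set(doc.lower().split())]

counts = cooccurrence_counts(articles)
print(related_terms("diabetes", counts))
print(search("diabetes", articles, counts))
```

In this toy collection the second article never mentions diabetes, yet a search for ‘diabetes’ still returns it because insulin and glucose keep showing up alongside diabetes elsewhere. The breakfast recipe is left out.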
So an LSI Search Engine Bot visits my site, what does it do?
Let’s assume the following scenario. You have a website that covers the topic of say… diabetes! On that site you have a mere 10 articles with related content. You have an article that explains the symptoms, another article discussing treatment, an article explaining the different types of diabetes, etc.
An LSI-powered search engine spider like Google’s Bot will visit your website, index your frontpage and your 10 subpages, and consider this a ‘set of related documents’. The algorithm will, however, determine for itself how well-related your documents are.
Let’s take a closer look at how it sees your site.
Noise-reduction: Tossing words that don’t carry Semantic Meaning
As you learned earlier, an LSI algorithm will search your set of documents for patterns of word distribution and the co-occurrence of words.
To make its job easier, an LSI algorithm will start by filtering out words that don’t carry any semantic meaning. To explain this, note that natural language is full of redundancies, and so not every word that appears in a document carries semantic meaning.
Think about the most frequently used words in English (the, of, to, and, or, etc.) and consider that they carry little meaning on their own. As you probably know, search engines even discard these types of words when you enter them in a keyword search.
An LSI algorithm has a huge set of words it filters from a document, leaving only “content words” or “semantic words”.
Though not a huge issue, a slight pitfall of this approach is that depending on the context, a word can be a semantic word or a junk word. For example, consider an article or advertisement for a car that contains “rolls royce phantom in good condition” - this is where the word good is not junk as it is a relatively important aspect of the content. On the other hand, consider an article that mentions “the good news is that…” - this is where “good” is just another junk word and can be safely tossed.
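The filtering step, and the pitfall that comes with it, can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the tiny STOP_WORDS list and the semantic_words function are hypothetical, and a real filter would work from a far longer list.

```python
import re

# A tiny, hypothetical stop list; a real one would hold hundreds of entries.
STOP_WORDS = {
    "a", "an", "the", "of", "to", "and", "or", "in", "is", "that",
    "it", "for", "with", "good",
}

def semantic_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into words, and drop stop-list entries."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in stop_words]

print(semantic_words("The good news is that insulin controls glucose"))
print(semantic_words("Rolls Royce Phantom in good condition"))
```

Because a plain word list knows nothing about context, “good” is tossed from the car advertisement just as it is from “the good news is that…”. That is exactly the slight pitfall described above.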
LSI Noise Reduction in action
As you might have noticed by now, when I explain things I prefer to explain them with a detailed example. Better yet, I take a ‘see for yourself’ approach wherever I can. With that in mind, I recently spent an entire week collecting and analyzing the type of words an LSI algorithm ignores and created a tool that shows you exactly that.
My tool, Mr. LSI’s Semantic Article Cleaner, will filter the following words, much like an LSI algorithm does:
- Common Adjectives (big, small, low)
- Common Numerals (two, tenth, millions)
- Common Verbs (do, be, see)
- (Compound) Prepositions (after, near, with)
- Conjunctions (and, for, than, where)
- Conjunctive Adverbs (also, consequently, nevertheless)
- Contractions (can’t, i’ll, wasn’t, wouldn’t)
- Pronouns (i, his, whomever, yourselves)
- Interjections (awesome! whoops! good grief! etc.)
- Articles (a, an, the)
- Frilly Words (albeit, however, moreover, moreso, therefore, thus)
You can try out Mr. LSI’s Semantic Article Cleaner here.
Try out my tool above and be amazed how ‘clean’ your article looks after it’s been filtered. Note you can also pre-populate the fields in case you don’t have an article of your own handy to test right now.
Back to the LSI algorithm as it sees your set of 10 diabetes articles. To determine the Top XX relevant words or word phrases, the algorithm will discard any words or sentences that appear in every article or page (mainly useful to ignore navigation menus, footers, etc.).
Additionally, it will discard any words that appear in only one document across your entire set of articles.
This process condenses your articles into sets of semantic words that the search engine will now use to index your collection.
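Here is a rough Python sketch of those two discard rules, assuming each page has already been reduced to plain words. The sample pages and the function name prune_by_document_frequency are hypothetical; “home” and “contact” stand in for a navigation menu that appears on every page.

```python
from collections import Counter

def prune_by_document_frequency(docs):
    """Keep words found in more than one document but not in all of them."""
    word_sets = [set(doc.lower().split()) for doc in docs]
    # Count the number of documents each word appears in.
    doc_freq = Counter(word for words in word_sets for word in words)
    keep = {w for w, n in doc_freq.items() if 1 < n < len(docs)}
    return [sorted(words & keep) for words in word_sets]

# Hypothetical pages from a small diabetes site.
pages = [
    "home contact diabetes symptoms thirst fatigue",
    "home contact diabetes treatment insulin glucose",
    "home contact insulin glucose symptoms monitoring",
]

for condensed in prune_by_document_frequency(pages):
    print(condensed)
```

The menu words disappear because they are on every page, and one-off words like “thirst” disappear because no other page shares them. What remains is the overlap between pages.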
So what does that mean for Me as a Webmaster?
LSI is something that’s here to stay and will only improve. I personally expect Google to either make internal breakthroughs or license a third-party patent/company with regard to Artificially Intelligent Search Indexing within the next 6 years.
As you read earlier, the main problem with LSI is that computers still cannot understand the context of the words and articles they index. As research is being done into Natural Language Processing and Artificial Intelligence, it is only a matter of time before Latent Semantic Indexing moves on to the next level.
The bottom line is that Google will see right through “unnatural” content more and more, making today’s article spinners completely worthless.
Moreover, it is now more important than ever to structure your website properly with LSI in mind. This means your website should be populated with the most relevant content and structured logically as a human being would expect your site to be structured.
You have the greatest tool of all to structure your website, and that is your brain. When it comes to a topic you’re not an expert on, however, it is easy to miss important sub-topics of the niche you’re building a website in. This is where my “KeyWord Excavator” tool comes in. You can try this tool for free and there’s no sign-up or download required. Try Mr. LSI’s KeyWord Excavator here.